The Stata Journal
Associate Editors

Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, WZB, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, University of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
Philip Ender, University of California–Los Angeles
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, London School of Economics and Political Science
Ulrich Kohler, University of Potsdam, Germany
Frauke Kreuter, Univ. of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, Univ. of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt Univ., Edinburgh
Jeroen Weesie, Utrecht University
Ian White, MRC Biostatistics Unit, Cambridge
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University
The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book
reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository
papers that link the use of Stata commands or programs to associated principles, such as those that will serve
as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go
“beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate
or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to
a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users
(e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers
analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could
be of interest or usefulness to researchers, especially in fields that are of practical importance but are not
often included in texts or other journals, such as the use of Stata in managing datasets, especially large
datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata
with topics such as extended examples of techniques and interpretation of results, simulations of statistical
concepts, and overviews of subject areas.
The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.
For more information on the Stata Journal, including information for authors, see the webpage
http://www.stata-journal.com
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone
979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at
http://www.stata.com/bookstore/sj.html
Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.
1-year university library subscription $125 / $165
2-year university library subscription $215 / $295
3-year university library subscription $315 / $435
http://www.stata.com/bookstore/sjj.html
Individual articles three or more years old may be accessed online without charge. More recent articles may
be ordered online.
http://www.stata-journal.com/archives.html
The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.
Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX
77845, USA, or emailed to [email protected].
Copyright © 2013 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734), is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.
Volume 13 Number 2 2013
Maximum likelihood and generalized spatial two-stage least-squares estimators for a spatial-autoregressive model with spatial-autoregressive disturbances

D. M. Drukker, I. R. Prucha, and R. Raciborski

1 Introduction
Cliff–Ord (1973, 1981) models, which build on Whittle (1954), allow for cross-unit
interactions. Many models in the social sciences, biostatistics, and geographic sciences
have included such interactions. Following Cliff and Ord (1973, 1981), much of the
original literature was developed to handle spatial interactions. However, space is not
restricted to geographic space, and many recent applications use these techniques in
other situations of cross-unit interactions, such as social-interaction models and network
models; see, for example, Kelejian and Prucha (2010) and Drukker, Egger, and Prucha
(2013) for references. Much of the nomenclature still includes the adjective “spatial”,
and we continue this tradition to avoid confusion while noting the wider applicability
of these models. For texts and reviews, see, for example, Anselin (1988, 2010), Arbia
(2006), Cressie (1993), Haining (2003), and LeSage and Pace (2009).
The simplest Cliff–Ord model only considers spatial spillovers in the dependent vari-
able, with spillovers modeled by including a right-hand-side variable known as a spatial
lag. Each observation of the spatial-lag variable is a weighted average of the values of the
dependent variable observed for the other cross-sectional units. The matrix containing
the weights is known as the spatial-weighting matrix. This model is frequently referred
to as a spatial-autoregressive (SAR) model. A generalized version of this model also
allows for the disturbances to be generated by a SAR process. The combined SAR model
with SAR disturbances is often referred to as a SARAR model; see Anselin and Florax
(1995).1
In modeling the outcome for each unit as dependent on a weighted average of the
outcomes of other units, SARAR models determine outcomes simultaneously. This si-
multaneity implies that the ordinary least-squares estimator will not be consistent; see
Anselin (1988) for an early discussion of this point.
In this article, we describe the spreg command, which implements a maximum
likelihood (ML) estimator and a generalized spatial two-stage least-squares (GS2SLS)
estimator for the parameters of a SARAR model with exogenous regressors. For discus-
sions of the ML estimator, see, for example, the above cited texts and Lee (2004) for the
asymptotic properties of the estimator. For a discussion of the estimation theory for the
implemented GS2SLS estimator, see Arraiz et al. (2010) and Drukker, Egger, and Prucha
(2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited
therein.
Section 2 describes the SARAR model. Section 3 describes the spreg command. Sec-
tion 4 provides some examples. Section 5 describes postestimation commands. Section 6
presents methods and formulas. The conclusion follows.
We use the notation that for any matrix A and vector a, the elements are denoted as aij and ai, respectively.

2 The SARAR model

The model can be written compactly as

y = λWy + Xβ + u    (1)
u = ρMu + ε    (2)
1. These models are also known as Cliff–Ord models because of the impact that Cliff and Ord (1973,
1981) had on the subsequent literature. To avoid confusion, we simply refer to these models as
SARAR models while still acknowledging the importance of the work of Cliff and Ord.
where

• y is the n × 1 vector of observations on the dependent variable;
• X is the n × k matrix of observations on the exogenous regressors, and β is the corresponding k × 1 parameter vector;
• W and M are n × n spatial-weighting matrices;
• u is the n × 1 vector of disturbances;
• λ and ρ are scalar parameters; and
• ε is an n × 1 vector of innovations.
The model in (1) and (2) is a SARAR model with exogenous regressors. Spatial interactions
are modeled through spatial lags. The model allows for spatial interactions in the
dependent variable, the exogenous variables, and the disturbances.2
The spatial-weighting matrices W and M are taken to be known and nonstochastic.
These matrices are part of the model definition, and in many applications, W = M.
Let ȳ = Wy. Then

    \bar{y}_i = \sum_{j=1}^{n} w_{ij}\, y_j

which clearly shows the dependence of yi on neighboring outcomes via the spatial lag ȳi. By construction, the spatial lag Wy is an endogenous variable. The weights wij
will typically be modeled as inversely related to some measure of proximity between
the units. The SAR parameter λ measures the extent of these interactions. For further
discussions of spatial-weighting matrices and the parameter space for the SAR parameter,
see, for example, the literature cited in the introduction, including Kelejian and Prucha
(2010); see Drukker et al. (2013) for more information about creating spatial-weighting
matrices in Stata.
The innovations are assumed to be independent and identically distributed (IID)
or independent but heteroskedastically distributed, where the heteroskedasticity is of
unknown form. The GS2SLS estimator produces consistent estimates in either case
when the heteroskedastic option is specified; see Kelejian and Prucha (1998, 1999,
2010), Arraiz et al. (2010), and Drukker, Egger, and Prucha (2013) for discussions and
formal results. The ML estimator produces consistent estimates in the IID case but
generally not in the heteroskedastic case; see Lee (2004) for some formal results for the
ML estimator, and see Arraiz et al. (2010) for evidence that the ML estimator does not
generally produce consistent estimates in the heteroskedastic case.
Because the model in (1) and (2) is a first-order SAR model with first-order SAR
disturbances, it is also referred to as a SARAR(1, 1) model, which is a special case of
the more general SARAR(p, q) model. We refer to a SARAR(1, 1) model as a SARAR
model. When ρ = 0, the model in equations (1) and (2) reduces to the SAR model
y = λWy + Xβ + ε. When λ = 0, the model in equations (1) and (2) reduces to
y = Xβ + u with u = ρMu + ε, which is sometimes referred to as the SAR error model.
Setting ρ = 0 and λ = 0 causes the model in equations (1) and (2) to reduce to a linear
regression model with exogenous variables.
3 The spreg command

spreg requires that the spatial-weighting matrices M and W be provided in the form
of an spmat object as described in Drukker et al. (2013). spreg gs2sls supports both
general and banded spatial-weighting matrices; spreg ml supports general matrices
only.
spreg gs2sls depvar [indepvars] [if] [in], id(varname) [noconstant level(#)
    dlmat(objname) elmat(objname) heteroskedastic impower(q) maximize_options]

spreg ml depvar [indepvars] [if] [in], id(varname) [noconstant level(#)
    dlmat(objname) elmat(objname) gridsearch(#) maximize_options]
gridsearch(#) specifies the fineness of the grid used in searching for the initial values
of the parameters λ and ρ in the concentrated log likelihood. The allowed range is
[.001, .1]. The default is gridsearch(.1).
maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log,
trace, gradient, showstep, hessian, showtolerance, tolerance(#),
ltolerance(#), nrtolerance(#), nonrtolerance, and from(init_specs); see
[R] maximize. These options are seldom used. from() takes precedence over
gridsearch().
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
noconstant suppresses the constant term.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
dlmat(objname) specifies an spmat object that contains the spatial-weighting matrix
W to be used in the SAR term.
elmat(objname) specifies an spmat object that contains the spatial-weighting matrix
M to be used in the spatial-error term.
heteroskedastic specifies that spreg use an estimator that allows the errors to be
heteroskedastically distributed over the observations. By default, spreg uses an
estimator that assumes homoskedasticity.
impower(q) specifies how many powers of W to include in calculating the instrument
matrix H. The default is impower(2). The allowed values of q are integers in the
set {2, 3, . . . , √n}, where n is the number of observations.
maximize_options: iterate(#), [no]log, trace, gradient, showstep,
showtolerance, tolerance(#), and ltolerance(#); see [R] maximize.
from(init_specs) is also allowed, but because ρ is the only parameter in this opti-
mization problem, only initial values for ρ may be specified.
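To fix ideas, a minimal sketch of the two estimation commands is shown below. The dataset, variable, and spmat-object names (y, x1, x2, id, and myW) are placeholders rather than names used elsewhere in this article, and the option values are illustrative only.

. spmat use myW using myW.spmat
. spreg gs2sls y x1 x2, id(id) dlmat(myW) elmat(myW) heteroskedastic
. spreg ml y x1 x2, id(id) dlmat(myW) elmat(myW) gridsearch(.01)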
Macros
  e(cmd)           spreg
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(indeps)        names of independent variables
  e(title)         title in estimation output
  e(chi2type)      type of model χ2 test
  e(vce)           oim
  e(technique)     maximization technique
  e(crittype)      type of optimization
  e(estat_cmd)     program used to implement estat
  e(predict)       program used to implement predict
  e(user)          name of likelihood-evaluator program
  e(estimator)     ml
  e(model)         lr, sar, sare, or sarar
  e(constant)      noconstant or hasconstant
  e(idvar)         name of ID variable
  e(dlmat)         name of spmat object used in dlmat()
  e(elmat)         name of spmat object used in elmat()
  e(properties)    b V

Matrices
  e(b)             coefficient vector
  e(Cns)           constraints matrix
  e(ilog)          iteration log
  e(gradient)      gradient vector
  e(V)             variance–covariance matrix of the estimators

Functions
  e(sample)        marks estimation sample
4 Example
In our examples, we use spreg.dta, which contains simulated data on the number of
arrests for driving under the influence for the continental U.S. counties.3 We use a
normalized contiguity matrix taken from Drukker et al. (2013). In Stata, we type
. use dui
. spmat use ccounty using ccounty.spmat
to read the dataset into memory and to put the spatial-weighting matrix into the
spmat object ccounty. This row-normalized spatial-weighting matrix was created in
Drukker et al. (2013, sec. 2.4) and saved to disk in Drukker et al. (2013, sec. 11.4).
Our dependent variable, dui, is defined as the alcohol-related arrest rate per 100,000
daily vehicle miles traveled (DVMT). Figure 1 shows the distribution of dui across
counties, with darker colors representing higher values of the dependent variable. Spatial
patterns of dui are clearly visible.
3. The geographical location data came from the U.S. Census Bureau and can be found at
ftp://ftp2.census.gov/geo/tiger/TIGER2008/. The variables are simulated but inspired by
Powers and Wilson (2004).
Our explanatory variables include police (number of sworn officers per 100,000
DVMT); nondui (nonalcohol-related arrests per 100,000 DVMT); vehicles (number of
registered vehicles per 1,000 residents); and dry (a dummy for counties that prohibit
alcohol sale within their borders). In other words, in this illustration,
X = [police,nondui,vehicles,dry,intercept].
We obtain the GS2SLS estimates of the SARAR model parameters with spreg gs2sls, supplying the spmat object ccounty to both the dlmat() and elmat() options.
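A call of roughly the following form produces the coefficient table below; the name of the ID variable is an assumption here.

. spreg gs2sls dui police nondui vehicles dry, id(id) dlmat(ccounty) elmat(ccounty)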
------------------------------------------------------------------------------
         dui |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
dui          |
      police |  -.5591567   .0148772   -37.58   0.000    -.5883155    -.529998
      nondui |  -.0001128   .0005645    -0.20   0.842    -.0012193    .0009936
    vehicles |    .062474   .0006198   100.79   0.000     .0612592    .0636889
         dry |    .303046   .0183119    16.55   0.000     .2671553    .3389368
       _cons |   2.482489   .1473288    16.85   0.000      2.19373    2.771249
-------------+----------------------------------------------------------------
lambda       |
       _cons |   .4672164   .0051261    91.14   0.000     .4571694    .4772633
-------------+----------------------------------------------------------------
rho          |
       _cons |   .1932962   .0726583     2.66   0.008     .0508885    .3357038
------------------------------------------------------------------------------
Given the normalization of the spatial-weighting matrix, the parameter space for
λ and ρ is taken to be the interval (−1, 1); see Kelejian and Prucha (2010) for further
discussions of the parameter space. The estimated λ is positive and significant, indicat-
ing moderate SAR dependence in dui. In other words, the dui alcohol-arrest rate for
a given county is affected by the dui alcohol-arrest rates of the neighboring counties.
This result may be because of coordination among police departments or because strong
enforcement in one county leads some people to drink in neighboring counties.
The estimated ρ coefficient is positive, moderate, and significant, indicating mod-
erate SAR dependence in the error term. In other words, an exogenous shock to one
county will cause moderate changes in the alcohol-related arrest rate in the neighboring
counties.
The estimated β vector does not have the same interpretation as in a simple linear
model, because including a spatial lag of the dependent variable implies that the out-
comes are determined simultaneously. We present one way to interpret the coefficients
in section 5.
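For comparison, the same model can be fit by maximum likelihood with spreg ml. A call of the following form, with the same caveat about the ID-variable name, produces the output below.

. spreg ml dui police nondui vehicles dry, id(id) dlmat(ccounty) elmat(ccounty)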
------------------------------------------------------------------------------
         dui |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
dui          |
      police |  -.5593526    .014864   -37.63   0.000    -.5884854   -.5302197
      nondui |  -.0001214   .0005645    -0.22   0.830    -.0012279    .0009851
    vehicles |   .0624729   .0006195   100.84   0.000     .0612586    .0636872
         dry |   .3030522    .018311    16.55   0.000     .2671633    .3389412
       _cons |   2.490301   .1471885    16.92   0.000     2.201817    2.778785
-------------+----------------------------------------------------------------
lambda       |
       _cons |   .4671198   .0051144    91.33   0.000     .4570957    .4771439
-------------+----------------------------------------------------------------
rho          |
       _cons |   .1962348   .0711659     2.76   0.006     .0567522    .3357174
-------------+----------------------------------------------------------------
sigma2       |
       _cons |   .0859662   .0021815    39.41   0.000     .0816905    .0902418
------------------------------------------------------------------------------
There are no apparent differences between the two sets of parameter estimates.
5 Postestimation commands
The postestimation commands supported by spreg include estat, test, and predict;
see help spreg postestimation for the full list. Most postestimation methods have
standard interpretations; for example, a Wald test is just a Wald test.
Predictions from SARAR models require some additional explanation. Kelejian and
Prucha (2007) consider different information sets and define predictors as conditional
means based on these information sets. They also derive the mean squared errors of
these predictors, provide some efficiency rankings based on these mean squared errors,
and provide Monte Carlo evidence that the additional efficiencies obtained by using
more information can be practically important.
One of the predictors that Kelejian and Prucha (2007) consider is based on the infor-
mation set {X, W, M, wi y}, where wi denotes the ith row of W, which will be referred
to as the limited-information predictor.4 We denote the limited-information predictor
by limited in the syntax diagram below. Another estimator that Kelejian and Prucha
(2007) consider is based on the information set {X, W, M}, which yields the reduced-
form predictor. This predictor is denoted by rform in the syntax diagram below.
4. Kelejian and Prucha (2007) also consider a full-information predictor. We have postponed imple-
menting this predictor because it is computationally more demanding; we plan to implement it in
future work.
Kelejian and Prucha (2007) show that their limited-information predictor can be much
more efficient than the reduced-form predictor.
In addition to the limited-information predictor and the reduced-form predictor,
predict can compute two other observation-level quantities, which are not recom-
mended as predictors but may be used in subsequent computations. These quantities
are denoted by naive and xb in the syntax diagram below.
While prediction is frequently of interest in applied statistical work, predictions
can also be used to compute marginal effects.5 A change to one observation in one
exogenous variable potentially changes the predicted values for all the observations of
the dependent variable because the n observations for the dependent variable form a
system of simultaneous equations in a SARAR model. Below we use predict to calculate
predictions that we in turn use to calculate marginal effects.
Various methods have been proposed to interpret the parameters of SAR models: see,
for example, Anselin (2003); Abreu, De Groot, and Florax (2004); Kelejian and Prucha
(2007); and LeSage and Pace (2009).
5.1 Syntax
Before using predict, we discuss its syntax.
predict [type] newvar [if] [in] [, {rform | limited | naive | xb} rftransform(matname)]
5.2 Options
rform, the default, calculates the reduced-form predictions.
limited calculates the Kelejian and Prucha (2007) limited-information predictor. This
predictor is more efficient than the reduced-form predictor, but we call it limited
because it is not as efficient as the Kelejian and Prucha (2007) full-information pre-
dictor, which we plan to implement in the future.
naive calculates λwi y + xi β for each observation.
xb calculates the linear prediction Xβ.
5. We refer to the effects of both infinitesimal changes in a continuous variable and discrete changes
in a discrete variable as marginal effects. While some authors refer to “partial” effects to cover the
continuous and discrete cases, we avoid the term “partial” because it means something else in a
simultaneous-equations framework.
5.3 Example
In this section, we discuss two marginal effects that measure how changes in the ex-
ogenous variables affect the endogenous variable. These measures use the reduced-form
predictor ŷ = E(y|X, W, M) = (I − λW)⁻¹Xβ, which we discuss in section 6.3, where it is denoted as ŷ(1). The expression for the predictor shows that a change in a single
observation on an exogenous variable will typically affect the values of the endogenous
variable for all n units because the SARAR model forms a system of simultaneous equa-
tions.
Without loss of generality, we explore the effects of changes in the kth exogenous
variable. Letting xk = (x1k , . . . , xnk ) denote the vector of observations on the kth
exogenous variable allows us to denote the dependence of y on xk by using the notation
\hat{y}(x_k) = \{\hat{y}_1(x_k), \ldots, \hat{y}_n(x_k)\}

The effect on these predictions of changing the ith observation of xk by an amount δ is

\frac{\partial \hat{y}(x_k + \delta i)}{\partial \delta}
  = \frac{\partial \hat{y}(x_{1k}, \ldots, x_{i-1,k}, x_{ik}+\delta, x_{i+1,k}, \ldots, x_{nk})}{\partial \delta}
  = \frac{\partial \hat{y}(x_k)}{\partial x_{ik}}
where i = [0, . . . , 0, 1, 0, . . . , 0] is the ith column of the identity matrix. In terminology
consistent with that of LeSage and Pace (2009, 36–37), we refer to the above effect as
the total direct impact of a change in the ith unit of xk . LeSage and Pace (2009, 36–37)
define the corresponding summary measure
n^{-1}\sum_{i=1}^{n} \frac{\partial \hat{y}_i(x_k + \delta i)}{\partial \delta}
  = n^{-1}\sum_{i=1}^{n} \frac{\partial \hat{y}_i(x_{1k}, \ldots, x_{i-1,k}, x_{ik}+\delta, x_{i+1,k}, \ldots, x_{nk})}{\partial \delta}
  = n^{-1}\sum_{i=1}^{n} \frac{\partial \hat{y}_i(x_k)}{\partial x_{ik}} \qquad (3)
which they call the average total direct impact (ATDI). The ATDI is the average over
i = {1, . . . , n} of the changes in the yi attributable to the changes in the corresponding
xik. The ATDI can be calculated by computing ŷ(xk), ŷ(xk + δi), and the average of
the difference of these vectors of predicted values, where δ is the magnitude by which
xik is changed. The ATDI measures the average change in yi attributable to sequentially
changing xik for a given k.
Sequentially changing xik for each i = {1, . . . , n} differs from simultaneously chang-
ing the xik for all n units. The second marginal effect we consider measures the effect
of simultaneously changing x1k , . . . , xnk on a specific yi and is defined by
\frac{\partial \hat{y}_i(x_k + \delta e)}{\partial \delta}
  = \frac{\partial \hat{y}_i(x_{1k}+\delta, \ldots, x_{ik}+\delta, \ldots, x_{nk}+\delta)}{\partial \delta}
  = \sum_{r=1}^{n} \frac{\partial \hat{y}_i(x_k)}{\partial x_{rk}}
where e = [1, . . . , 1] is a vector of 1s. LeSage and Pace (2009, 36–37) define the corre-
sponding summary measure
n^{-1}\sum_{i=1}^{n} \frac{\partial \hat{y}_i(x_k + \delta e)}{\partial \delta}
  = n^{-1}\sum_{i=1}^{n} \frac{\partial \hat{y}_i(x_{1k}+\delta, \ldots, x_{ik}+\delta, \ldots, x_{nk}+\delta)}{\partial \delta}
  = n^{-1}\sum_{i=1}^{n}\sum_{r=1}^{n} \frac{\partial \hat{y}_i(x_k)}{\partial x_{rk}} \qquad (4)
which they call the average total impact (ATI). The ATI can be calculated by computing ŷ(xk), ŷ(xk + δe), and the average difference in these vectors of predicted values, where δ is the magnitude by which x1k, . . . , xnk are changed.
We now continue our example from section 4 and use the reduced-form predictor to
compute the marginal effects of adding one officer per 100,000 DVMT in Elko County,
Nevada. We begin by using the reduced-form predictor and the observed values of the
exogenous variables to obtain predicted values for dui:
. predict y0
(option rform assumed)
Next we increase police by 1 in Elko County, Nevada, and calculate the reduced-form
predictions:
. generate police_orig = police
. quietly replace police = police_orig + 1 if st==32 & NAME00=="Elko"
. predict y1
(option rform assumed)
We generate the predicted difference, deltay, and list it together with the level of dui for Elko County, Nevada:

. generate deltay = y1 - y0
. list deltay dui if (st==32 & NAME00=="Elko")
  (output omitted)
The predicted effect of the change would be a 2.9% reduction in dui in Elko County,
Nevada.
Below we use four commands to summarize the changes and levels in the contiguous
counties:
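The four commands are sketched below; the Mata matrix name W, the indicator-variable name neighbor, and the assumption that the observations are in the same order as the ids stored in ccounty are choices made for this sketch rather than details given in the text.

. spmat getmatrix ccounty W
. generate byte neighbor = 0
. mata: st_store(., "neighbor", (W[selectindex((st_sdata(., "NAME00"):=="Elko") :& (st_data(., "st"):==32)), .]') :> 0)
. summarize deltay dui if neighbor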
In the first command, we use spmat getmatrix to store a copy of the normalized-
contiguity spatial-weighting matrix in Mata memory; see Drukker et al. (2013, sec. 14)
for a discussion of spmat getmatrix. In the second and third commands, we generate
and fill in a new variable for which the ith observation is 1 if it contains information on a
county that is contiguous with Elko County and is 0 otherwise. In the fourth command,
we summarize the predicted changes and the levels in the contiguous counties. The
mean predicted reduction is less than 0.1% of the mean level of dui in the contiguous
counties.
In the output below, we get a summary of the levels of dui and a detailed summary
of the predicted changes for all the counties in the sample.
. summarize dui
  (output omitted)

. summarize deltay, detail

                          deltay
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.0007572      -.5654716
 5%            0      -.0204239
10%            0      -.0203991       Obs                3109
25%            0      -.0203991       Sum of Wgt.        3109
50%            0                      Mean           -.0002495
                        Largest       Std. Dev.       .0101996
75%            0              0
90%            0              0       Variance         .000104
95%            0              0       Skewness       -54.78661
99%            0              0       Kurtosis        3035.363
Less than 1% of the sample had any socially significant difference, with no change at
all predicted for at least 95% of the sample.
In some of the computations below, we will use the matrix S = (In − λ̂W)⁻¹, where λ̂ is the estimate of the SAR parameter and W is the spatial-weighting matrix. In the output below, we use the W stored in Mata memory in an example above to compute S.
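The computation is sketched below, assuming the spreg estimation results are still in memory; it creates the Mata objects b and S used in later commands, and the particular way of extracting λ̂ is an assumption of the sketch.

. mata: b = st_matrix("e(b)")            // row vector of coefficient estimates
. scalar lam = _b[lambda:_cons]          // estimate of the SAR parameter
. mata: S = luinv(I(rows(W)) - st_numscalar("lam")*W)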
We next compute the ATDI defined in (3). The output below shows an instructive
(but slow) method to compute the ATDI. For each county in the data, we set police to
be the original value for all the observations except the ith, which we set to police + 1.
Then we compute the predicted value of dui for observation i and store this prediction in
the ith observation of y1. (We use the rftransform() option to use the inverse matrix
S computed above. Without this option, we would recompute the inverse matrix for
each of the 3,109 observations, which would cause the calculation to take hours.) After
computing the predicted values of y1 for each observation, we compute the differences
in the predictions and compute the sample average.
. drop y1 deltay
. generate y1 = .
(3109 missing values generated)
. local N = _N
. forvalues i = 1/`N' {
  2.     quietly capture drop tmp
  3.     quietly replace police = police_orig
  4.     quietly replace police = police_orig + 1 in `i'
  5.     quietly predict tmp in `i', rftransform(S)
  6.     quietly replace y1 = tmp in `i'
  7. }
. generate deltay = y1-y0
. summarize deltay
  (output omitted)
The absolute value of the estimated ATDI is 0.56, so the estimated effect is 2.7% of the sample mean of dui.
As mentioned, the above method for computing the estimate of the ATDI is slow.
LeSage and Pace (2009, 36–37) show that the estimate of the ATDI can also be computed
as

\frac{\hat{\beta}_k}{n}\, \operatorname{trace}(S)

where β̂k is the kth component of β̂ and S = (In − λ̂W)⁻¹, which we computed above.
Below we use this formula to compute the ATDI,
. mata: (b[1,1]/rows(W))*trace(S)
-.5633844076
We next compute the ATI defined in (4) by increasing police by one in every county and averaging the changes in the reduced-form predictions:

. drop y1 deltay
. quietly replace police = police_orig + 1
. predict y1
(option rform assumed)
. generate deltay = y1-y0
. summarize deltay
  (output omitted)
The absolute value of the estimated average total effect is about 3.4% of the sample
mean of dui.
LeSage and Pace (2009, 36–37) show that the ATI is given by
\frac{\hat{\beta}_k}{n} \sum_{i=1}^{n}\sum_{j=1}^{n} S_{i,j}

where β̂k is the kth component of β̂ and Si,j is the (i, j)th element of S = (In − λ̂W)⁻¹.
In the output below, we use the spmat getmatrix command discussed in Drukker et al.
(2013) and a few Mata computations to show that the above expression yields the same
value for the ATI as our calculations above.
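The Mata computation is not spelled out above; a one-line sketch using the objects b, W, and S defined earlier is shown below, where sum() adds up all the elements of a matrix.

. mata: (b[1,1]/rows(W))*sum(S)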
In general, it is not possible to say whether the ATDI is greater than or less than the
ATI. Using the expressions from LeSage and Pace (2009, 36–37), we see that
\mathrm{ATI} - \mathrm{ATDI}
  = \frac{\hat{\beta}_k}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{i,j} - \frac{\hat{\beta}_k}{n}\sum_{i=1}^{n} S_{i,i}
  = \frac{\hat{\beta}_k}{n}\sum_{i=1}^{n}\sum_{j \neq i} S_{i,j}
6 Methods and formulas

The model is

y = λWy + Xβ + u    (5)
u = ρMu + ε    (6)

In the following, we give the log-likelihood function under the assumption that ε ∼
N (0, σ 2 I). As usual, we refer to the maximizer of the likelihood function when the
innovations are not normally distributed as the quasi-maximum likelihood (QML) esti-
mator. Lee (2004) gives results concerning the consistency and asymptotic normality of
the QML estimator when ε is IID but not necessarily normally distributed. Violations
of the assumption that the innovations are IID can cause the QML estimator to pro-
duce inconsistent results. In particular, this may be the case if the innovations are
heteroskedastic, as discussed by Arraiz et al. (2010).
Likelihood function
For given values of λ and ρ, the maximizing values of β and σ² are

\hat{\beta}(\lambda, \rho) = \{X'(I - \rho M)'(I - \rho M)X\}^{-1} X'(I - \rho M)'(I - \rho M)(I - \lambda W)y

\hat{\sigma}^2(\lambda, \rho) = n^{-1}\bigl[(I - \rho M)\{(I - \lambda W)y - X\hat{\beta}(\lambda, \rho)\}\bigr]'\bigl[(I - \rho M)\{(I - \lambda W)y - X\hat{\beta}(\lambda, \rho)\}\bigr]

Substituting these expressions into the log-likelihood function yields the concentrated log-likelihood function

L_c(y \mid \lambda, \rho) = -\frac{n}{2}\{\ln(2\pi) + 1\} - \frac{n}{2}\ln \hat{\sigma}^2(\lambda, \rho) + \ln\|I - \lambda W\| + \ln\|I - \rho M\|
The QML estimates for the autoregressive parameters λ and ρ can now be computed by
maximizing the concentrated log-likelihood function. Once we have obtained the QML
estimates λ̂ and ρ̂, we can calculate the QML estimates for β and σ² as β̂ = β̂(λ̂, ρ̂) and σ̂² = σ̂²(λ̂, ρ̂).
Initial values
As noted in Anselin (1988, 186), poor initial starting values for ρ and λ in the concen-
trated likelihood may result in the optimization algorithm settling on a local, rather
than the global, maximum.
To prevent this problem from happening, spreg ml performs a grid search to find
suitable initial values for ρ and λ. To override the grid search, you may specify your
own initial values in the option from().
The GS2SLS estimator uses an instrument matrix H whose columns are the linearly independent columns of (X, WX, . . . , W^qX, MX, MWX, . . . , MW^qX), where q = 2 has worked well in Monte Carlo simulations over a wide range of reasonable
specifications. The choice of those instruments provides a computationally convenient
approximation of the ideal instruments; see Lee (2003) and Kelejian, Prucha, and Yuze-
fovich (2004) for further discussions and refined estimators. At a minimum, the instru-
ments should include the linearly independent columns of X and MX. When there is a
constant in the model and thus X contains a constant term, the constant term is only
included once in H.
6.3 Predictors

The reduced-form predictor is

\hat{y}^{(1)} = \mathrm{E}(y \mid X, W, M) = (I - \lambda W)^{-1} X\beta

and the limited-information predictor for observation i is

\hat{y}^{(2)}_i = \lambda w_i y + x_i\beta + \frac{\mathrm{cov}(u_i, w_i y)}{\mathrm{var}(w_i y)}\,\{w_i y - \mathrm{E}(w_i y)\}

where the moments cov(ui, wi y), var(wi y), and E(wi y) are those implied by the model in (5) and (6).
We call this unbiased predictor the limited-information predictor because Kelejian and
Prucha (2007) consider a more efficient predictor, the full-information predictor. The
former can be calculated by specifying statistic limited to predict after spreg.
The naive predictor is

\hat{y}_i = \lambda w_i y + x_i \beta
However, as pointed out in Kelejian and Prucha (2007), this estimator is generally bi-
ased. While this biased predictor should not be used for predictions, it has uses as
an intermediate computation, and it can be calculated by specifying statistic naive to
predict after spreg.
The above predictors are computed by replacing the parameters in the prediction
formula with their estimates.
7 Conclusion
After reviewing some basic concepts related to SARAR models, we presented the spreg
ml and spreg gs2sls commands, which implement ML and GS2SLS estimators for the
parameters of these models. We also discussed postestimation prediction. In future
work, we would like to investigate further methods and commands for parameter inter-
pretation.
8 Acknowledgment
We gratefully acknowledge financial support from the National Institutes of Health
through the SBIR grants R43 AG027622 and R44 AG027622.
9 References
Abreu, M., H. L. F. De Groot, and R. J. G. M. Florax. 2004. Space and growth: A
survey of empirical evidence and methods. Working Paper TI 04-129/3, Tinbergen
Institute.
Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.

———. 2003. Spatial externalities, spatial multipliers, and spatial econometrics. International Regional Science Review 26: 153–166.

———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.
Anselin, L., and R. J. G. M. Florax. 1995. Small sample properties of tests for spatial
dependence in regression models: Some further results. In New Directions in Spatial
Econometrics, ed. L. Anselin and R. J. G. M. Florax, 21–74. Berlin: Springer.
Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.
Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2013. Creating and managing
spatial-weighting matrices with the spmat command. Stata Journal 13: 242–286.
Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge
University Press.
Kelejian, H. H., and I. R. Prucha. 1998. A generalized spatial two-stage least squares
procedure for estimating a spatial autoregressive model with autoregressive distur-
bances. Journal of Real Estate Finance and Economics 17: 99–121.
———. 2007. The relative efficiencies of various predictors in spatial econometric models
containing spatial lags. Regional Science and Urban Economics 37: 363–374.
———. 2010. Specification and estimation of spatial autoregressive models with au-
toregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.
Lee, L.-F. 2003. Best spatial two-stage least squares estimators for a spatial autoregres-
sive model with autoregressive disturbances. Econometric Reviews 22: 307–335.
LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton:
Chapman & Hall/CRC.
Powers, E. L., and J. K. Wilson. 2004. Access denied: The relationship between alcohol
prohibition and driving under the influence. Sociological Inquiry 74: 318–337.
Creating and managing spatial-weighting matrices with the spmat command

D. M. Drukker, H. Peng, I. R. Prucha, and R. Raciborski

Abstract. We present the spmat command for creating, managing, and storing
spatial-weighting matrices, which are used to model interactions between spatial
or more generally cross-sectional units. spmat can store spatial-weighting matrices
in a general and banded form. We illustrate the use of the spmat command and
discuss some of the underlying issues by using United States county and postal-
code-level data.
Keywords: st0292, spmat, spatial-autoregressive models, Cliff–Ord models, spa-
tial lag, spatial-weighting matrix, spatial econometrics, spatial statistics, cross-
sectional interaction models, social-interaction models
1 Introduction
Building on Whittle (1954), Cliff and Ord (1973, 1981) developed statistical models
that not only accommodate forms of cross-unit correlation but also allow for explicit
forms of cross-unit interactions. The latter is a feature of interest in many social sci-
ence, biostatistical, and geographic science models. Following Cliff and Ord (1973,
1981), much of the original literature was developed to handle spatial interactions.
However, space is not restricted to geographic space, and many recent applications
use spatial techniques in other situations of cross-unit interactions, such as social-
interaction models and network models; see, for example, Kelejian and Prucha (2010)
and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still
includes the adjective “spatial”, and we continue this tradition to avoid confusion while
noting the wider applicability of these models. For texts and reviews, see, for example,
Anselin (1988, 2010), Arbia (2006), Cressie (1993), Haining (2003), and LeSage and Pace
(2009).
The models derived and discussed in the literature cited above model cross-unit
interactions and correlation in terms of spatial lags, which may involve the dependent
variable, the exogenous variables, and the disturbances. A spatial lag of a variable is a weighted average of the values that the variable takes on over the other cross-sectional units. Consider, for example, the simple model

y_i = \lambda \sum_{j=1}^{n} w_{ij}\, y_j + \varepsilon_i, \qquad i = 1, \ldots, n
where yi denotes the dependent variable corresponding to unit i, the wij (with wii = 0)
are nonstochastic weights, εi is a disturbance term, and λ is a parameter.
In the above model, the yi are determined simultaneously. The weighted average \sum_{j=1}^{n} w_{ij} y_j, on the
right-hand side, is called a spatial lag, and the wij are called the spatial weights. It
often proves convenient to write the model in matrix notation as
y = λWy + ε
where
y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad
W = \begin{bmatrix}
0 & w_{12} & \cdots & w_{1,n-1} & w_{1n} \\
w_{21} & 0 & \cdots & w_{2,n-1} & w_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
w_{n-1,1} & w_{n-1,2} & \cdots & 0 & w_{n-1,n} \\
w_{n1} & w_{n2} & \cdots & w_{n,n-1} & 0
\end{bmatrix}, \qquad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
Again the n × 1 vector Wy is typically referred to as the spatial lag in y and the n × n
matrix W as the spatial-weighting matrix. More generally, as indicated above, the
concept of a spatial lag can be applied to any variable, including exogenous variables
and disturbances, which—as can be seen in the literature cited above—provides for a
fairly general class of Cliff–Ord types of models.
Spatial-weighting matrices allow us to conveniently implement Tobler’s first law of
geography—“everything is related to everything else, but near things are more related
than distant things” (Tobler 1970, 236)—which applies whether the space is geographic,
biological, or social. The spmat command creates, imports, manipulates, and saves W
matrices. The matrices are stored in a spatial-weighting matrix object (spmat object).
The spmat object contains additional information about a spatial-weighting matrix,
such as the identification codes of the cross-section units, and other items discussed
below.1
The generic syntax of spmat is
spmat subcommand . . .
where each subcommand performs a specific task. Some subcommands create spmat ob-
jects from a Stata dataset (contiguity, idistance, dta), a Mata matrix (putmatrix),
or a text file (import). Other subcommands save objects to a disk (save, export) or
read them back in (use, import). Still other subcommands summarize spatial-weighting
1. We use the term “units” instead of “places” because spatial-econometric methods have been applied
to many cases in which the units of analysis are individuals or firms instead of geographical places;
for example, see Leenders (2002).
2. Refer to http://www.esri.com for details about the ESRI format and to http://www.pbinsight.com
for details about the MIF format. The ESRI format is much more common.
3. shp2dta and mif2dta save the coordinates data in the format required by spmap (Pisati 2007),
which graphs data onto maps.
4. We use the term “attribute” instead of “database” because “database” does not adequately distin-
guish between attribute data and coordinates data.
The code below illustrates the use of shp2dta and spmap (Pisati 2007) on the county
boundaries data for the continental United States; Crow and Gould (2007) provide a
broader introduction to shapefiles, shp2dta, and spmap.
shp2dta, mif2dta, and spmap use a common set of conventions for defining the
polygons in the coordinates data translated from the coordinates file. Crow and Gould
(2007) discuss these conventions.
We downloaded tl_2008_us_county00.dbf and tl_2008_us_county00.shp, which are the attribute file and the coordinates file, respectively, and which make up the shapefile for U.S. counties from the U.S. Census Bureau.5 We begin by using shp2dta to translate these files to the files county.dta and countyxy.dta.
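The translation call would look roughly as follows; the genid(id) choice is an assumption, made so that the id variable used with spmat later on exists.

. shp2dta using tl_2008_us_county00, database(county) coordinates(countyxy) genid(id) gencentroids(c)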
county.dta contains the attribute information from the attribute file in the shape-
file, and countyxy.dta contains the coordinates data from the shapefile. The attribute
dataset county.dta has one observation per county on variables such as county name
and state code. Because we specified the option gencentroids(c), county.dta also
contains the variables x_c and y_c, which contain the coordinates of the county cen-
troids, measured in degrees. (See the help file for shp2dta for details and the x–y
naming convention.) countyxy.dta contains the coordinates of the county boundaries
in the long-form panel format used by spmap.6
Below we use use to read county.dta into memory and use destring (see [D] de-
string) to create a new, numeric state-code variable st from the original string state-
identifying variable STATEFP. Next we use drop to drop the observations defining the
coordinates of county boundaries in Alaska, Hawaii, and U.S. territories. Finally, we
use rename to rename the variables containing coordinates of the county centroids and
use save to save our changes into the county.dta dataset file.
. use county
. quietly destring STATEFP, generate(st)
. *keep continental US counties
. drop if st==2 | st==15 | st>56
(123 observations deleted)
. rename x_c longitude
. rename y_c latitude
. save county, replace
file county.dta saved
Having completed the translation and selected our subsample, we use spmap to draw
the map, given in figure 1, of the boundaries in the coordinates dataset.
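The drawing command is not reproduced here; a minimal call, assuming the id variable created by shp2dta, is shown below.

. spmap using countyxy, id(id)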
A banded matrix is a matrix whose nonzero elements are confined to a diagonal band
that comprises the main diagonal, zero or more diagonals above the main diagonal, and
zero or more diagonals below the main diagonal. The number of diagonals above the
main diagonal that contain nonzero elements is the upper bandwidth, say, bU . The
number of diagonals below the main diagonal that contain nonzero elements is the
lower bandwidth, say, bL . An example of a banded matrix having an upper bandwidth
of 1 and a lower bandwidth of 2 is
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 0
0 1 1 0 1 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0
0 0 0 1 1 0 1 0 0 0
0 0 0 0 1 1 0 1 0 0
0 0 0 0 0 1 1 0 1 0
0 0 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 1 1 0
We can save a lot of space by storing only the elements in the diagonal band because
the elements outside the band are 0 by construction. Using this information, we can
efficiently store this matrix without any loss of information as
0 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 0 0
The above matrix only contains the elements of the diagonals with nonzero elements.
To store the elements in a rectangular array, we added zeros as necessary. The row
dimension of the banded matrix is the upper bandwidth plus the lower bandwidth plus
1, or b = bU + bL + 1. We will use the b × n shorthand to refer to the dimensions of
banded matrices.
Banded matrices require less storage space than general matrices. The spmat suite
provides tools for creating, storing, and manipulating banded matrices. In addition,
computing an operation on a banded matrix is much faster than on a general matrix.
Drukker et al. (2011) show that many spatial-weighting matrices have a banded
structure after an appropriate reordering. In particular, a banded structure is often
attained by sorting the data in an ascending order of the distance from a well-chosen
place. In section 5, we will illustrate this method with data on U.S. counties and U.S.
five-digit zip codes. In the case of U.S. five-digit zip codes, we show how to create a
contiguity matrix with upper and lower bandwidths of 913. This allows us to store
the data in a 1,827 × 31,713 matrix, which requires only 1,827 × 31,713 × 8/2^30 ≈ 0.43
gigabytes instead of the 7.5 gigabytes required for the general matrix.
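These storage figures are simple arithmetic and can be checked directly in Stata:

. display 1827*31713*8/2^30     // banded storage, approximately .43 gigabytes
. display 31713^2*8/2^30        // general storage, approximately 7.5 gigabytes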
We are now ready to describe the spmat subcommands.
options                        Description
--------------------------------------------------------------------------
normalize(norm)                control the normalization method
rook                           require that two units share a common border
                                 instead of just a common point to be neighbors
banded                         store the matrix in the banded format
replace                        replace existing spmat object
saving(filename[, replace])    save the neighbor list to a file
nomatrix                       suppress creation of the spmat object
tolerance(#)                   numerical tolerance used when determining
                                 whether units share a common border
--------------------------------------------------------------------------
2.2 Description
spmat contiguity computes a contiguity or normalized-contiguity matrix from a coor-
dinates dataset containing a polygon representation of geospatial data. More precisely,
spmat contiguity constructs a contiguity matrix or normalized-contiguity matrix from
the boundary information in a coordinates dataset and puts it into the new spmat object
objname.
In a contiguity matrix, contiguous units are assigned weights of 1, and noncontiguous
units are assigned weights of 0. Contiguous units are known as neighbors.
spmat contiguity uses the polygon data in coordinates file to determine the neigh-
bors of each unit. The coordinates file must be a Stata dataset containing the polygon
information in the format produced by shp2dta and mif2dta. Crow and Gould (2007)
discuss the conventions used to represent polygons in the Stata datasets created by
these commands.
2.3 Options
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. (shp2dta and mif2dta name this ID variable ID.) id() is required.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
rook specifies that only units that share a common border be considered neighbors (edge
or rook contiguity). The default is queen contiguity, which treats units that share a
common border or a single common point as neighbors. Computing rook-contiguity
matrices is more computationally intensive than the default queen-contiguity com-
putation.7
banded requests that the new matrix be stored in a banded form. The banded matrix
is constructed without creating the underlying n × n representation.
replace permits spmat contiguity to overwrite an existing spmat object.
saving(filename[, replace]) saves the neighbor list to a space-delimited text file.
The first line of the file contains the number of units and, if applicable, bands; each
remaining line lists a unit identification code followed by the identification codes of
units that share a common border, if any. You can read the file back into an spmat
object with spmat import ..., nlist. replace allows filename to be overwritten
if it already exists.
nomatrix specifies that the spmat object objname and spatial-weighting matrix W not
be created. In conjunction with saving(), this option allows for creating a text
file containing a neighbor list without allocating space for the underlying contiguity
matrix.
tolerance(#) specifies the numerical tolerance used in deciding whether two units are
edge neighbors. The default is tolerance(1e-7).
2.4 Examples
As discussed above, spatial-weighting matrices are used to compute weighted averages
in which more weight is placed on nearby observations than on distant observations.
While Haining (2003, 83) and Cliff and Ord (1981, 17) discuss formulations of weights
matrices, contiguity and inverse-distance matrices are the two most common spatial-
weighting matrices.
7. These definitions for rook neighbor and queen neighbor are commonly used; see, for example,
Lai, So, and Chan (2009). (As many readers will recognize, the “rook” and “queen” terminology
arises by analogy with chess, in which a rook may only move across sides of squares, whereas a
queen may also move diagonally.)
Example
We continue the example from section 1.1 and assume both of the Stata datasets
created in section 1.1 are in the current working directory. After loading the attribute
dataset into memory, we create the spmat object ccounty containing a normalized-
contiguity matrix for U.S. counties by typing
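The commands would be along the following lines; the normalize(row) choice follows the description of ccounty as row-normalized in the companion spreg example, and the links option is an assumption made so that the neighbor counts appear in the summary.

. use county, clear
. spmat contiguity ccounty using countyxy, id(id) normalize(row)
. spmat summarize ccounty, links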
  (output omitted)
The table shows basic information about the normalized contiguity matrix, including
the dimensions of the matrix and its storage. The number of neighbors found is reported
as 18,474, with each county having 6 neighbors on average.
In a row-normalized matrix, the (i, j)th element of W becomes w̃ij = wij/ri, where
ri is the sum of the ith row of W. After row normalization, each row of W will sum
to 1. Row normalizing a symmetric W produces an asymmetric W except in very
special cases. Kelejian and Prucha (2010) point out that normalizing by a vector of row
sums needs to be guided by theory.
In a minmax-normalized matrix, the (i, j)th element of W becomes w̃ij = wij/m,
where m = min{maxi (ri ), maxi (ci )}, with maxi (ri ) being the largest row sum of W
and maxi (ci ) being the largest column sum of W. Normalizing by a scalar preserves
symmetry and the basic model specification.
In a spectral-normalized matrix, the (i, j)th element of W becomes w̃ij = wij/v,
where v is the largest of the moduli of the eigenvalues of W. As for the minmax norm,
normalizing by a scalar preserves symmetry and the basic model specification.
3.2 Description
An inverse-distance spatial-weighting matrix is composed of weights that are inversely
related to the distances between the units. spmat idistance uses the coordinate vari-
ables from the attribute data in memory and the specified distance measure to compute
the distances between units, to create an inverse-distance spatial-weighting matrix, and
to store the result in an spmat object.
3.3 Options
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
dfunction(function[, miles]) specifies the distance function. function may be one of
euclidean (default), dhaversine, rhaversine, or the Minkowski distance of order
p, where p is an integer greater than or equal to 1.
When the default dfunction(euclidean) is specified, a Euclidean distance measure
is applied to the coordinate variable list cvarlist.
When dfunction(rhaversine) or dfunction(dhaversine) is specified, the haver-
sine distance measure is applied to the two coordinate variables cvarlist. (The
first coordinate variable must specify longitude, and the second coordinate vari-
able must specify latitude.) The coordinates must be in radians when rhaversine
is specified. The coordinates must be in degrees when dhaversine is specified.
The haversine distance measure is calculated in kilometers by default. Specify
dfunction(rhaversine, miles) or dfunction(dhaversine, miles) if you want
the distance returned in miles.
When dfunction(p) (p is an integer) is specified, a Minkowski distance measure of
order p is applied to the coordinate variable list cvarlist.
The formulas for the distance measure are discussed in section 3.5.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
truncmethod options specify one of the three truncation criteria. The values of the
spatial-weighting matrix W that meet the truncation criterion will be changed to 0.
Only apply truncation methods when supported by theory.
btruncate(b B) partitions the values of W into B equal-length bins and truncates
to 0 entries that fall into bin b or below, b < B.
dtruncate(dL dU ) truncates to 0 the values of W that fall more than dL diagonals
below and dU diagonals above the main diagonal. Neither value can be greater than
(cols(W)−1)/4.9
vtruncate(v) truncates to 0 the values of W that are less than or equal to v.
See section 3.6 for more details about the truncation options.
9. This limit ensures that a cross product of the spatial-weighting matrix is stored more efficiently in
banded form than in general form. The limit is based on the cross product instead of the matrix
itself because the generalized spatial two-stage least-squares estimators use cross products of the
spatial-weighting matrices.
banded requests that the new matrix be stored in a banded form. The banded matrix is
constructed without creating the underlying n×n representation. Note that without
banded, a matrix with truncated values will still be stored in an n × n form.
replace permits spmat idistance to overwrite an existing spmat object.
3.4 Examples
As discussed above, spatial-weighting matrices are used to compute weighted averages
in which more weight is placed on nearby observations than on distant observations.
Haining (2003, 83) and Cliff and Ord (1981, 17) discuss formulations of weights matri-
ces, contiguity matrices, inverse-distance matrices, and combinations thereof.
In inverse-distance spatial-weighting matrices, the weights are inversely related to the
distances between the units. spmat idistance provides several measures for calculating
the distances between the units.
The coordinates may or may not be geospatial. Distances between geospatial units
are commonly computed from the latitudes and longitudes of unit centroids.10 Social
distances are frequently computed from individual-person attributes.
In much of the literature, the attributes are known as coordinates because the nomen-
clature has developed around the common geospatial case in which the attributes are
map coordinates. For ease of use, spmat idistance follows this convention and refers
to coordinates, even though coordinate variables specified in cvarlist need not be spatial
coordinates.
The (i, j)th element of an inverse-distance spatial-weighting matrix is 1/dij , where
dij is the distance between unit i and j computed from the specified coordinates and dis-
tance measure. Creating spatial-weighting matrices with elements of the form 1/f (dij ),
where f (·) is some function, is described in section 17.4.
Example
county.dta from section 1.1 contains the coordinates of the centroids of each county,
measured in degrees, in the variables longitude and latitude. To get a feel for the
data, we create an unnormalized inverse-distance spatial-weighting matrix, store it in
the spmat object dcounty, and summarize it by typing
10. The word “centroid” in the literature on geographic information systems differs from the standard
term in geometry. In the geographic information systems literature, a centroid is a weighted average
of the vertices of a polygon that approximates the center of the polygon; see Waller and Gotway
(2004, 44–45) for the formula and some discussion.
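The commands are sketched below; the dhaversine distance function is an assumption, consistent with the kilometer distances quoted next and with the dcounty2 call that follows.

. spmat idistance dcounty longitude latitude, id(id) dfunction(dhaversine)
. spmat summarize dcounty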
  (output omitted)
From the summary table, we can see that the centroids of the two closest counties
lie within less than one kilometer of each other (1/1.081453), while the two most distant
counties are 4,577 kilometers apart (1/0.0002185).
Below we compute a minmax-normalized inverse-distance matrix, store it in the
spmat object dcounty2, and summarize it by typing
. spmat idistance dcounty2 longitude latitude, id(id) dfunction(dhaversine)
> normalize(minmax)
. spmat summarize dcounty2
To illustrate the remaining truncation options, we use a small artificial dataset with nine units and a single coordinate variable x, listed below.
id x
1. 1 0
2. 2 1
3. 3 2
4. 4 503
5. 5 504
6. 6 505
7. 7 1006
8. 8 1007
9. 9 1008
The units are grouped into three clusters. The units belonging to the same cluster
are close to one another, while the distance between the units belonging to different
clusters is large. For real-world data, the units may represent, for example, cities in
different states. We use spmat idistance to create the spmat object ex from the data:
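A minimal sketch of that call, assuming only the id and x variables listed above, is

. spmat idistance ex x, id(id)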
Theoretical considerations may suggest that the weights should actually be 0 below
a certain threshold. For example, choosing the threshold value of 1/500 = 0.002 for our
matrix results in the following structure:
\[
\begin{bmatrix}
0   & 1   & 0.5 & 0   & 0   & 0   & 0   & 0   & 0   \\
1   & 0   & 1   & 0   & 0   & 0   & 0   & 0   & 0   \\
0.5 & 1   & 0   & 0   & 0   & 0   & 0   & 0   & 0   \\
0   & 0   & 0   & 0   & 1   & 0.5 & 0   & 0   & 0   \\
0   & 0   & 0   & 1   & 0   & 1   & 0   & 0   & 0   \\
0   & 0   & 0   & 0.5 & 1   & 0   & 0   & 0   & 0   \\
0   & 0   & 0   & 0   & 0   & 0   & 0   & 1   & 0.5 \\
0   & 0   & 0   & 0   & 0   & 0   & 1   & 0   & 1   \\
0   & 0   & 0   & 0   & 0   & 0   & 0.5 & 1   & 0
\end{bmatrix}
\]
Now the matrix with the truncated values can be stored more efficiently in a banded form:
\[
\begin{bmatrix}
0   & 0 & 0.5 & 0   & 0 & 0.5 & 0   & 0 & 0.5 \\
0   & 1 & 1   & 0   & 1 & 1   & 0   & 1 & 1   \\
0   & 0 & 0   & 0   & 0 & 0   & 0   & 0 & 0   \\
1   & 1 & 0   & 1   & 1 & 0   & 1   & 1 & 0   \\
0.5 & 0 & 0   & 0.5 & 0 & 0   & 0.5 & 0 & 0
\end{bmatrix}
\]
spmat idistance provides tools for truncating the values of an inverse-distance ma-
trix and storing the truncated matrix in a banded form. Like spmat contiguity, spmat
idistance is capable of creating a banded matrix without creating the underlying n × n
representation of the matrix. The user must specify a theoretically justified truncation
criterion for such an application.
Here we illustrate how one could apply each of the truncation methods mentioned
in section 3.3 to our hypothetical inverse-distance matrix. The most natural way is to
use value truncation. In the code below, we create a new spmat object ex1 with the
values of W that are less than or equal to 1/500 set to 0.11 We also request that W be
stored in banded form.
. spmat idistance ex1 x, id(id) banded vtruncate(1/500)
The same outcome can be achieved with bin truncation. In bin truncation, we find
the maximum value in W denoted by m, divide the interval (0, m] into B bins of equal
length, and then truncate to 0 elements that fall into bins 1, . . . , b; see Bin truncation
details below for a more technical description. In our hypothetical matrix, the largest
element of W is 1. If we divide the values in W into three bins, the bins will be defined
by (0, 1/3], (1/3, 2/3], (2/3, 1]. The values we wish to round to 0 fall into the first bin.
11. vtruncate() accepts any expression that evaluates to a number.
In the code below, we create a new spmat object ex2 with the values of W that fall
into the first bin set to 0. We also request that W be stored in banded form.
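A sketch of that call, reusing the hypothetical data above, with btruncate(1 3) requesting three bins and truncating the first:

. spmat idistance ex2 x, id(id) banded btruncate(1 3)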
A word of warning: While truncation leads to matrices that can be stored more efficiently, truncation should be applied only if supported by theory. Ad hoc truncation may lead to model misspecification and, subsequently, to inconsistent inference.
4.2 Description
spmat summarize reports summary statistics about the elements in the spatial-weight-
ing matrix in the existing spmat object objname.
4.3 Options
links is useful when objname contains a contiguity or a normalized-contiguity matrix.
Rather than the default summary of the values in the spatial-weighting matrix,
links causes spmat summarize to summarize the number of neighbors.
detail requests a tabulation of links for a contiguity or a normalized-contiguity matrix.
The values of the identifying variable with the minimum and maximum number of
links will be displayed.
banded reports the bands for the matrix that already has a (possibly) banded structure
but is stored in an n × n form.
truncmethods are useful when you want to see summary statistics calculated on a spatial-
weighting matrix after some elements have been truncated to 0. spmat summarize
with a truncmethod will report the lower and upper band based on a matrix to
which the specified truncation criterion has been applied. (Note: No data are ac-
tually changed by selecting these options. These options only specify that spmat
summarize calculate results as if the requested truncation criterion has been ap-
plied.)
btruncate(b B) partitions the values of W into B bins and truncates to 0 entries
that fall into bin b or below.
dtruncate(dL dU ) truncates to 0 the values of W that fall more than dL diagonals
below and dU diagonals above the main diagonal. Neither value can be greater than
(cols(W)−1)/4.
vtruncate(v) truncates to 0 the values of W that are less than or equal to v.
4.5 Examples
It is generally useful to know some summary statistics for the elements of your spatial-
weighting matrices. In sections 2.4 and 3.4, we used spmat summarize to report sum-
mary statistics for spatial-weighting matrices.
Many spatial-weighting matrices contain elements that are not 0 but are very small. At times, theoretical considerations such as threshold effects suggest that these small weights should be truncated to 0. In these cases, you might want to summarize the elements of the spatial-weighting matrix subject to different truncation criteria as part of a sensitivity analysis.
Example
The Bands row reports the lower and upper bands with nonzero values. Those values
tell us whether the matrix can be stored in banded form. As mentioned in section 3.3,
neither value can be greater than (cols(W)−1)/4. In our case, the maximum value for the bands is (3,109 − 1)/4 = 777; therefore, if we truncated the values of the matrix
according to our criterion, we would not be able to store the matrix in banded form.12
In section 5.1, we show how we can use the sorting tricks of Drukker et al. (2011) to
store this matrix in banded form.
12. In practice, rather than calculating the maximum value for bands by hand, we would use the
r(canband), r(lband), and r(uband) scalars returned by spmat summarize; see section 4.4 for
details.
Figure 2 shows that zero and nonzero entries are scattered all over the matrix.
This pattern arises because the original shapefiles had the counties sorted in an order
unrelated to their distances from a common point.
(Figure 2: intensity plot; Rows and Columns axes, ticks at 1, 100, 200, and 311.)
13. For best results, pick a place located in a remote corner of the map; see Drukker et al. (2011) for
further details.
Specifying the option banded in the spmat contiguity command caused the con-
tiguity matrix to be stored as a banded matrix. The summary table shows that the
contiguity information is now stored in a 465 × 3,109 matrix, which requires much
less space than the original 3,109 × 3,109 matrix. Figure 3 clearly shows the banded
structure.
(Figure 3: intensity plot; Rows and Columns axes, ticks at 1, 100, 200, and 311.)
Similarly, we can re-create the dcounty object calculated on the sorted data and
see whether the inverse-distance matrix can be stored in banded form after applying a
truncation criterion.
We can see that the Values summary for this matrix and the matrix from section 4.5
is the same; however, the matrix in this example is stored in banded form.
Instead, we repeat the sorting trick and call spmat contiguity with option banded,
hoping that we will be able to fit the banded representation into memory.
The output from spmat summarize indicates that the normalized contiguity matrix
is stored in a 1,827 × 31,713 matrix. This fits into less than half a gigabyte of memory!
All we did to store the matrix in a banded format was change the sort order of the data
and specify the banded option. We discuss storing an existing n × n spatial-weighting
matrix in banded form in sections 18.1 and 18.2.
Having illustrated the importance of banded matrices, we return to documenting
the spmat commands.
6.2 Description
spmat note creates and manipulates a note attached to the spmat object.
6.3 Options
replace causes spmat note to overwrite the existing note with a new one.
drop causes spmat note to clear the note associated with objname.
6.4 Examples
If you plan to use a spatial-weighting matrix outside a given do-file or session, you
should attach some documentation to the spmat object.
spmat note stores the note in a string scalar; however, it is possible to store multiple
notes in the scalar by repeatedly appending notes.
Example
We attach a note to the spmat object ccounty and then display it by typing
. spmat note ccounty : "Source: Tiger 2008 county files."
. spmat note ccounty
Source: Tiger 2008 county files.
7.2 Description
spmat graph produces an intensity plot of the spatial-weighting matrix contained in
the spmat object objname. Zero elements are plotted in white; the remaining elements
are partitioned into bins of equal length and assigned gray-scale colors gs0–gs15 (see
[G-4] colorstyle), with darker colors representing higher values.
7.3 Options
blocks( (stat) p) specifies that the matrix be divided into blocks of size p and that
block maximums be plotted. This option is useful when the matrix is large. To plot a
statistic other than the default maximum, you can specify the optional stat argument.
For example, to plot block medians, type blocks((p50) p). The supported statistics
include those returned by summarize, detail; see [R] summarize for a complete
list.
twoway options are any options other than by(); they are documented in
[G-3] twoway options.
7.4 Examples
An intensity plot of a spatial-weighting matrix can reveal underlying structure. For
example, if there is a banded structure to the spatial-weighting matrix, large amounts
of memory may be saved.
See section 5.1 for an example in which we use spmat graph to reveal the banded
structure in a spatial-weighting matrix.
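As a minimal sketch, assuming the ccounty object from the earlier sections is in memory and choosing a block size of 10 arbitrarily, one could type

. spmat graph ccounty, blocks(10)

to plot block maximums instead of all 3,109 × 3,109 individual entries.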
8.2 Description
spmat lag uses a spatial-weighting matrix to compute a weighted average of a variable; this weighted average is known as the spatial lag of the variable.
More precisely, spmat lag uses the spatial-weighting matrix in the spmat object
objname to compute the spatial lag of the variable varname and stores the result in the
new variable newvar.
8.3 Examples
Spatial lags of the exogenous right-hand-side variables are frequently included in SAR
models; see, for example, LeSage and Pace (2009).
Recall that a spatial lag is a weighted average of the variable being lagged. If x_spl denotes the spatial lag of the existing variable x, using the spatial-weighting matrix W, then the algebraic definition is x_spl = Wx.
The code below generates the new variable x spl, which contains the spatial lag of x,
using the spatial-weighting matrix W, which is contained in the spmat object ccounty:
. clear all
. use county
. spmat contiguity ccounty using countyxy, id(id) normalize(minmax)
. generate x = runiform()
. spmat lag x_spl ccounty x
9.2 Description
spmat eigenvalues calculates the eigenvalues of the spatial-weighting matrix contained
in the spmat object objname and stores them in vecname. The maximum-likelihood
estimator implemented in the spreg ml command, as described in Drukker, Prucha,
and Raciborski (2013b), uses the eigenvalues of the spatial-weighting matrix during the
optimization process. If you are estimating several models by maximum likelihood with
the same spatial-weighting matrix, computing and storing the eigenvalues in an spmat
object will remove the need to recompute the eigenvalues.
9.3 Options
eigenvalues(vecname) stores the user-defined vector of eigenvalues in the spmat object
objname. vecname must be a Mata row vector of length n, where n is the dimension
of the spatial-weighting matrix in the spmat object objname.
replace permits spmat eigenvalues to overwrite existing eigenvalues in objname.
9.4 Examples
Putting the eigenvalues into the spmat object can dramatically speed up the com-
putations performed by the spreg ml command; see Drukker, Prucha, and Raciborski
(2013b) for details and references therein.
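For example, assuming the ccounty object is in memory, a call along the lines of

. spmat eigenvalues ccounty

computes the eigenvalues and stores them in the object so that subsequent spreg ml runs can reuse them.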
10.2 Description
spmat drop removes the spmat object objname from memory.
10.3 Examples
To drop the spmat object dcounty from memory, we type
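Given the description in section 10.2, the call is simply

. spmat drop dcounty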
11.2 Description
spmat save saves the spmat object objname to a file in a native Stata format.
11.3 Option
replace permits spmat save to overwrite filename.
11.4 Examples
Creating a spatial-weighting matrix, and perhaps its eigenvalues as well, can be a time-
consuming process. If you are going to repeatedly use a spatial-weighting matrix, you
probably want to save it to a disk and read it back in for subsequent uses. spmat save
will save the spmat object to disk for you. Section 12 discusses spmat use, which reads
the object from disk into memory.
If you are going to save an spmat object to disk, it is a good practice to use spmat
note to attach some documentation to the object before saving it. Section 6 discusses
spmat note.
Just like with Stata datasets, you can save your spmat objects to disk and share
them with other Stata users. The file format is platform independent. So, for example,
a Mac user could save an spmat object to disk and email it to a coauthor, and the
Windows-using coauthor could read in this spmat object by using spmat use.
We can save the information contained in the spmat object ccounty in the file
ccounty.spmat by typing
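A sketch of that call (the replace option would be needed only if the file already exists):

. spmat save ccounty using ccounty.spmat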
12.2 Description
spmat use reads into memory an spmat object from a file created by spmat save; see
section 11 for a discussion of spmat save.
12.3 Option
replace permits spmat use to overwrite an existing spmat object.
12.4 Examples
As mentioned in section 11, creating a spatial-weighting matrix can be time consuming.
When repeatedly using a spatial-weighting matrix, you might want to save it to disk
with spmat save and read it back in with spmat use for subsequent uses.
In section 11, we saved the spmat object ccounty to the file ccounty.spmat. We
now drop the existing ccounty object from memory and read it back in with spmat
use:
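A sketch of those two steps, reusing the file saved in section 11:

. spmat drop ccounty
. spmat use ccounty using ccounty.spmat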
13.2 Description
spmat export saves the spatial-weighting matrix contained in the spmat object objname
to a space-delimited text file. The matrix is written in a rectangular format with unique
place identifiers saved in the first column. spmat export can also save lists of neighbors
to a text file.
13.3 Options
noid causes spmat export not to save unique place identifiers, only matrix entries.
nlist causes spmat export to write the matrix in the neighbor-list format described
in section 2.3.
replace permits spmat export to overwrite filename.
13.4 Examples
The main use of spmat export is to export a spatial-weighting matrix to a text file that
can be read by another program. Long (2009, 336) recommends exporting all data to
text files that will be read by future software as part of archiving one’s research work.
Another use of spmat export is to review neighbor lists from a contiguity matrix.
Here we illustrate how one can export the contiguity matrix in the neighbor-list format
described in section 2.3.
. spmat export ccounty using nlist.txt, nlist
We call the Unix command head to list the first 10 lines of nlist.txt:15
. !head nlist.txt
3109
1 1054 1657 2063 2165 2189 2920 2958
2 112 2250 2277 2292 2362 2416 3156
3 2294 2471 2575 2817 2919 2984
4 8 379 1920 2024 2258 2301
5 6 73 1059 1698 2256 2886 2896
6 5 1698 2256 2795 2886 2896 3098
7 517 1924 2031 2190 2472 2575
8 4 379 1832 2178 2258 2987
9 413 436 1014 1320 2029 2166
The first line of the file indicates that there are 3,109 total spatial units. The
second line indicates that the unit with identification code 1 is a neighbor of units with
identification codes 1054, 1657, 2063, 2165, 2189, 2920, and 2958. The interpretation of
the remaining lines is analogous to that for the second line.
14.2 Description
spmat getmatrix copies the spatial-weighting matrix contained in the spmat object
objname and stores it in the Mata matrix matname; see [M-0] intro for an introduction
to using Mata. If specified, the vector of unique identifiers and the eigenvalues of the
spatial-weighting matrix will be stored in Mata vectors.
14.3 Options
id(vecname) specifies the name of a Mata vector to contain IDs.
eig(vecname) specifies the name of a Mata vector to contain eigenvalues.
15. Users of other operating systems should open the file in a text editor.
14.4 Examples
If you want to make changes to an existing spatial-weighting matrix, you need to retrieve
it from the spmat object, store it in Mata, make the desired changes, and store the new
matrix back in the spmat object by using spmat putmatrix. (See section 17 for a
discussion of spmat putmatrix.)
spmat getmatrix performs the first two tasks: it makes a copy of the spatial-
weighting matrix from the spmat object and stores it in Mata.
As we discussed in section 3, spmat idistance creates a spatial-weighting matrix
of the form 1/dij , where dij is the distance between units i and j. In section 17.4, we
use spmat getmatrix in an example in which we change a spatial-weighting matrix to
the form 1/ exp(0.1 × dij ) instead of just 1/dij .
15.2 Description
spmat import imports a spatial-weighting matrix from a space-delimited text file and
stores it in a new spmat object.
15.3 Options
noid specifies that the first column of numbers in filename does not contain unique place
identifiers and that spmat import should create and use the identifiers 1, . . . , n.
nlist specifies that the text file to be imported contain a list of neighbors in the format
described in section 2.3.
geoda specifies that filename be in the .gwt or .gal format created by the GeoDa software.
idistance specifies that the file contains raw distances and that the raw distances
should be converted to inverse distances. In other words, idistance specifies that
the (i, j)th element in the file be dij and that the (i, j)th element in the spatial-
weighting matrix be 1/dij , where dij is the distance between units i and j.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
replace permits spmat import to overwrite an existing spmat object.
15.4 Examples
One frequently needs to import a spatial-weighting matrix from a text file. spmat
import supports three of the most common formats: simple text files, GeoDa text
files, and text files that require minor changes such as converting from raw to inverse
distances.
By default, the unique place-identifying variable is assumed to be stored in the first
column of the file, but this can be overridden with the noid option.
In section 17.4, we provide an extended example that begins with using spmat
import to import a spatial-weighting matrix.
16.2 Description
spmat dta imports a spatial-weighting matrix from the variables in a Stata dataset and
stores it in an spmat object.
The number of variables in varlist must equal the number of observations because
spatial-weighting matrices are n × n.
16.3 Options
id(varname) specifies that the unique place identifiers be contained in varname. The
default is to create an identifying vector containing 1, . . . , n.
idistance specifies that the variables contain raw distances and that the raw distances
be converted to inverse distances. In other words, idistance specifies that the ith
observation on the jth variable be dij and that the (i, j)th element in the spatial-
weighting matrix be 1/dij , where dij is the distance between units i and j.
16.4 Examples
People have created Stata datasets that contain spatial-weighting matrices. Given the
power of infile and infix (see [D] infile (fixed format) and [D] infix (fixed for-
mat)), it is likely that more such datasets will be created. spmat dta imports these
spatial-weighting matrices and stores them in an spmat object.
Here we illustrate how we can create an spmat object from a Stata dataset. The
dataset schools.dta contains the distance in miles between five schools in the variables
c1-c5. The unique school identifier is recorded in the variable id. In Stata, we type
id c1 c2 c3 c4 c5
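A plausible command for building the object from these variables, with idistance converting the raw mileages to inverse distances (the exact option list is an assumption), is

. spmat dta schools c1-c5, id(id) idistance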
17.2 Description
spmat putmatrix puts Mata matrices into an existing spmat object objname or into
a new spmat object if the specified object does not exist. The optional unique place
identifiers can be provided as a Mata vector or a Stata variable. The optional eigenvalues
of the Mata matrix can be provided in a Mata vector.
17.3 Options
id(varname | vecname) specifies a Mata vector vecname or a Stata variable varname
that contains unique place identifiers.
eig(vecname) specifies a Mata vector vecname that contains the eigenvalues of the
matrix.
idistance specifies that the Mata matrix contains raw distances and that the raw
distances be converted to inverse distances. In other words, idistance specifies
that the (i, j)th element in the Mata matrix be dij and that the (i, j)th element in
the spatial-weighting matrix be 1/dij , where dij is the distance between units i and
j.
bands(l u) specifies that the Mata matrix matname be banded with l lower and u upper
diagonals.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
replace permits spmat putmatrix to overwrite an existing spmat object.
17.4 Examples
spmat contiguity and spmat idistance create spatial-weighting matrices from raw
data. This section describes situations in which we have the spatial-weighting matrix
precomputed and simply want to put it in an spmat object. The spatial-weighting matrix
can be any matrix that satisfies the conditions discussed, for example, in Kelejian and
Prucha (2010).
In this section, we show how to create an spmat object from a text file by using
spmat import and how to use spmat getmatrix and spmat putmatrix to generate an
inverse-distance matrix according to a user-specified functional form.
The file schools.txt contains the distance in miles between five schools. We call
the Unix command cat to print the contents of the file:
. !cat schools.txt
5
101 0 5.9 8.25 6.22 7.66
205 5.9 0 2.97 4.87 7.63
113 8.25 2.97 0 4.47 7
441 6.22 4.87 4.47 0 2.77
573 7.66 7.63 7 2.77 0
The school ID is recorded in the first column of the file, and each row records the distances from that school to all the schools, including itself. We can use spmat
import to create a spatial-weighting matrix from this file:
. spmat import schools using schools.txt, replace
We now illustrate how to create a spatial-weighting matrix with the distance de-
clining in an exponential fashion, exp(−0.1dij ), where dij is the original distance from
school i to school j.
. spmat getmatrix schools x
. mata: x = exp(-.1:*x)
. mata: _diag(x,0)
. spmat putmatrix schools x, normalize(minmax) replace
Thus we read in the original distances, extract the distance matrix with spmat
getmatrix, use Mata to transform the matrix entries according to our specifications,
and reset the diagonal elements to 0. Finally, we use spmat putmatrix to put the
transformed matrix into an spmat object. The resulting minmax-normalized spatial-
weighting matrix is
\[
\begin{bmatrix}
0     & 0.217 & 0.172 & 0.211 & 0.182 \\
0.217 & 0     & 0.292 & 0.241 & 0.183 \\
0.172 & 0.292 & 0     & 0.251 & 0.195 \\
0.211 & 0.241 & 0.251 & 0     & 0.297 \\
0.182 & 0.183 & 0.195 & 0.297 & 0
\end{bmatrix}
\]
Storing an existing n × n spatial-weighting matrix in banded form involves 1) permuting its rows and columns according to a new row sort order and then 2) storing the spatial-weighting matrix in banded
format. We accomplish step 1 by storing the new row sort order in a permutation
vector, as explained below, and then by using spmat permute. We use spmat tobanded
to perform step 2.
Note that most of the time, it is more convenient to sort the data as described
in section 5.1 and to call spmat contiguity or spmat idistance with a truncation
criterion. With very large datasets, spmat contiguity and spmat idistance will be
the only choices because they are capable of creating banded matrices from data without
first storing the matrices in a general form.
Description
spmat permute permutes the rows and columns of the n × n spatial-weighting matrix
stored in the spmat object objname. The permutation vector stored in pvarname con-
tains a permutation of the integers {1, . . . , n}, where n is both the sample size and the
dimension of W. That the value of the ith observation of pvarname is j specifies that
we must move row j to row i in the permuted matrix. After moving all the rows as
specified in pvarname, we move the columns in an analogous fashion. See Permutation
details: Mathematics below for a more thorough explanation.
Examples
Let p be the permutation vector created from pvarname, and let W be the spatial-
weighting matrix contained in the specified spmat object. The n × 1 permutation vector
p contains a permutation of the integers {1, . . . , n}, where n is the dimension of W.
The permutation of W is obtained by reordering the rows and columns of W as specified
by the elements of p. Each element of p specifies a row and column reordering of W.
That element i of p is j—that is, p[i]=j—specifies that we must move row j to row i in
the permuted matrix. After moving all the rows according to p, we move the columns
analogously.
. mata: W
[symmetric]
1 2 3 4 5
1 0
2 1 0
3 0 0 0
4 0 1 0 0
5 1 0 1 0 0
Suppose that we also have a permutation vector p that we could use to permute W to a
banded matrix.
. mata: p
1 2 3 4 5
1 3 5 1 2 4
See Permutation details: An example below to see how we used the sorting trick of
Drukker et al. (2011) to obtain this p. See Examples in section 18.2 for an example
with real data.
The values in the permutation vector p specify how to permute (that is, reorder) the
rows and the columns of W. Let’s start with the rows. That 3 is element 1 of p specifies
that row 3 of W be moved to row 1 in the permuted matrix. In other words, we must
move row 3 to row 1.
Applying this logic to all the elements of p yields that we must reorder the rows of
W by moving row 3 to row 1, row 5 to row 2, row 1 to row 3, row 2 to row 4, and row 4
to row 5. In the output below, we use Mata to perform this operation on W, store the
result in A, and display A. If the Mata code is confusing, just check that A contains the
described row reordering of W.
. mata: A = W[p,.]
. mata: A
1 2 3 4 5
1 0 0 0 0 1
2 1 0 1 0 0
3 0 1 0 0 1
4 1 0 0 1 0
5 0 1 0 0 0
Having reordered the rows, we reorder the columns in the analogous fashion. Oper-
ating on A, we move column 3 to column 1, column 5 to column 2, column 1 to column 3,
column 2 to column 4, and column 4 to column 5. In the output below, we use Mata
to perform this operation on A, store the result in B, and display B. If the Mata code is
confusing, just check that B contains the reordering of A described above.
. mata: B = A[.,p]
. mata: B
[symmetric]
1 2 3 4 5
1 0
2 1 0
3 0 1 0
4 0 0 1 0
5 0 0 0 1 0
Note that B is the desired banded matrix. For Mata aficionados, typing W[p,p]
would produce this permutation in one step.
For those whose intuition is grounded in linear algebra, here is the permutation-
matrix explanation. The permutation vector p defines the permutation matrix E, where
E is obtained by performing the row reordering described above on the identity matrix
of dimension 5. Then the permuted form of W is given by E*W*E', as we illustrate below:
. mata: E = I(5)
. mata: E
[symmetric]
1 2 3 4 5
1 1
2 0 1
3 0 0 1
4 0 0 0 1
5 0 0 0 0 1
. mata: E = E[p,.]
. mata: E
1 2 3 4 5
1 0 0 1 0 0
2 0 0 0 0 1
3 1 0 0 0 0
4 0 1 0 0 0
5 0 0 0 1 0
. mata: E*W*E'
[symmetric]
1 2 3 4 5
1 0
2 1 0
3 0 1 0
4 0 0 1 0
5 0 0 0 1 0
spmat permute requires that the permutation vector be stored in the Stata variable
pvarname. Assume that we now have the unpermuted matrix W stored in the spmat
object cobj. The matrix represents contiguity information for the following data:
. list
id distance
1. 79 5.23
2. 82 27.56
3. 100 0
4. 114 1.77
5. 140 20.47
The variable distance measures the distance from the centroid of the place with id=100
to the centroids of all the other places. We sort the data on distance and generate the
permutation vector p, which is just a running index 1, . . . , 5:
. sort distance
. generate p = _n
. list
id distance p
1. 100 0 1
2. 114 1.77 2
3. 79 5.23 3
4. 140 20.47 4
5. 82 27.56 5
We obtain our permutation vector by sorting the data back to the original order
based on the id variable:
. sort id
. list
id distance p
1. 79 5.23 3
2. 82 27.56 5
3. 100 0 1
4. 114 1.77 2
5. 140 20.47 4
Now coding spmat permute cobj p will reorder the rows and columns of W in
exactly the same way as the Mata code did above.
Description
spmat tobanded stores the spatial-weighting matrix contained in the spmat object objname1 in banded form, optionally after applying one of the truncation criteria below, either replacing the matrix in objname1 or saving the banded matrix in a new spmat object objname2.
Options
truncmethod specifies one of the three truncation criteria. The values of W that meet
the truncation criterion will be changed to 0.
btruncate(b B) partitions the values of W into B bins and truncates to 0 entries
that fall into bin b or below.
dtruncate(dL dU ) truncates to 0 the values of W that fall more than dL diagonals
below and dU diagonals above the main diagonal. Neither value can be greater than
(cols(W)−1)/4.
vtruncate(#) truncates to 0 the values of W that are less than or equal to #.
replace allows objname1 or objname2 to be overwritten if it already exists.
Examples
Sometimes, we have large spatial-weighting matrices that fit in memory, but they take
up so much space that there is too little room to do anything else. In these cases, we are
better off storing these spatial-weighting matrices in a banded format when possible.
We can now use this permutation vector and spmat permute to perform the per-
mutation, and we can finally call spmat tobanded to band the spatial-weighting matrix
stored inside the spmat object ccounty. Note that the reported summary is identical
to the one in section 5.1.
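A sketch of those calls, assuming the permutation variable is named p as in the worked example above:

. spmat permute ccounty p
. spmat tobanded ccounty, replace
. spmat summarize ccounty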
19 Conclusion
We discussed the spmat command for creating, managing, importing, manipulating, and
storing spatial-weighting matrix objects. In future work, we will consider additional
subcommands for creating specific types of spatial-weighting matrices.
20 Acknowledgment
We gratefully acknowledge financial support from the National Institutes of Health
through the SBIR grants R43 AG027622 and R44 AG027622.
21 References
Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer
Academic Publishers.
———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.
Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.
Crow, K. 2006. shp2dta: Stata module to convert shape boundary files to Stata datasets.
Statistical Software Components S456718, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s456718.html.
Crow, K., and W. Gould. 2007. FAQ: How do I graph data onto a map with spmap?
http://www.stata.com/support/faqs/graphics/spmap-and-maps/.
Drukker, D. M., P. Egger, and I. R. Prucha. 2013. On two-step estimation of a spatial
autoregressive model with autoregressive disturbances and endogenous regressors.
Econometric Reviews 32: 686–733.
Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2011. Sorting induces
a banded structure in spatial-weighting matrices. Working paper, Department of
Economics, University of Maryland.
Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013a. A command for estimating
spatial-autoregressive models with spatial-autoregressive disturbances and additional
endogenous variables. Stata Journal 13: 287–301.
———. 2013b. Maximum likelihood and generalized spatial two-stage least-squares
estimators for a spatial-autoregressive model with spatial-autoregressive disturbances.
Stata Journal 13: 221–241.
Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge
University Press.
Kelejian, H. H., and I. R. Prucha. 2010. Specification and estimation of spatial au-
toregressive models with autoregressive and heteroskedastic disturbances. Journal of
Econometrics 157: 53–67.
Lai, P.-C., F.-M. So, and K.-W. Chan. 2009. Spatial Epidemiological Approaches in
Disease Mapping and Analysis. Boca Raton, FL: CRC Press.
Leenders, R. T. A. J. 2002. Modeling social influence through network autocorrelation:
Constructing the weight matrix. Social Networks 24: 21–47.
LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton:
Chapman & Hall/CRC.
Long, J. S. 2009. The Workflow of Data Analysis Using Stata. College Station, TX:
Stata Press.
Pisati, M. 2005. mif2dta: Stata module to convert MapInfo Interchange Format bound-
ary files to Stata boundary files. Statistical Software Components S448403, Depart-
ment of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s448403.html.
———. 2007. spmap: Stata module to visualize spatial data. Statistical Software
Components S456812, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s456812.html.
Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region.
Economic Geography 46: 234–240.
Waller, L. A., and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health
Data. Hoboken, NJ: Wiley.
1 Introduction
Building on the work of Whittle (1954), Cliff and Ord (1973, 1981) developed statistical
models that accommodate forms of cross-unit interactions. The latter is a feature of
interest in many social science, biostatistical, and geographic science models. A simple
version of these models, typically referred to as spatial-autoregressive (SAR) models,
augments the linear regression model by including an additional right-hand-side (RHS)
variable known as a spatial lag. Each observation of the spatial-lag variable is a weighted
average of the values of the dependent variable observed for the other cross-sectional
units. Generalized versions of the SAR model also allow for the disturbances to be
generated by a SAR process and for the exogenous RHS variables to be spatial lags of
exogenous variables. The combined SAR model with SAR disturbances is often referred
to as a SARAR model; see Anselin and Florax (1995).1
1. These models are also known as Cliff–Ord models because of the impact that Cliff and Ord (1973,
1981) had on the subsequent literature. To avoid confusion, we simply refer to these models as
SARAR models while still acknowledging the importance of the work of Cliff and Ord.
In modeling the outcome for each unit as dependent on a weighted average of the
outcomes of other units, SARAR models determine outcomes simultaneously. This si-
multaneity implies that the ordinary least-squares estimator will not be consistent; see
Anselin (1988) for an early discussion of this point. Drukker, Prucha, and Raciborski
(2013) discuss the spreg command, which implements estimators for the model when
the RHS variables are a spatial lag of the dependent variable, exogenous variables, and
spatial lags of the exogenous variables.
The model we consider allows for additional endogenous RHS variables. Thus the
model of interest is a linear cross-sectional SAR model with additional endogenous vari-
ables, exogenous variables, and SAR disturbances. We discuss an estimator for the
parameters of this model and the command that implements this estimator, spivreg.
Kelejian and Prucha (1998, 1999, 2004, 2010) and the references cited therein derive
the main results used by the estimator implemented in spivreg, with Drukker, Egger,
and Prucha (2013) and Arraiz et al. (2010) producing some important extensions that
are used in the code.
While SARAR models have a wide range of possible applications, following Cliff
and Ord (1973, 1981), much of the original literature was developed to handle spatial
interactions; see, for example, Anselin (1988, 2010), Cressie (1993), and Haining (2003).
However, space is not restricted to geographic space, and many recent applications
employ these techniques in other situations of cross-unit dependence, such as social-
interaction models and network models; see, for example, Kelejian and Prucha (2010)
and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still
includes the adjective “spatial”, and we continue this tradition to avoid confusion while
noting the wider applicability of these models.
Section 2 defines the generalized SARAR model. Section 3 describes the spivreg
command. Section 4 illustrates the estimation of a SARAR model on example data
for U.S. counties. Section 5 describes postestimation commands. Section 6 presents
methods and formulas. The conclusion follows.
2 The model
The model of interest is given by
y = Yπ + Xβ + λWy + u (1)
u = ρMu + ε (2)
where
The model in equations (1) and (2) is a SARAR model with exogenous regressors and
additional endogenous regressors. Spatial interactions are modeled through spatial lags,
and the model allows for spatial interactions in the dependent variable, the exogenous
variables, and the disturbances.
Because the model in equations (1) and (2) is a first-order SAR process with first-
order SAR disturbances, it is also referred to as a SARAR(1,1) model, which is a special
case of the more general SARAR(p, q) model. We refer to a SARAR(1,1) model as a
SARAR model. Setting ρ = 0 yields the SAR model y = Yπ + Xβ + λWy + ε. Setting λ = 0 yields the model y = Yπ + Xβ + u with u = ρMu + ε, which is sometimes
referred to as the SAR error model. Setting ρ = 0 and λ = 0 causes the model to reduce
to a linear regression model with endogenous variables.
The spatial-weighting matrices W and M are taken to be known and nonstochastic.
These matrices are part of the model definition, and in many applications, W = M;
see Drukker et al. (2013) for more about creating spatial-weighting matrices in Stata.
Let ỹ = Wy, let ỹ_i and y_i denote the ith elements of ỹ and y, respectively, and let w_ij denote the (i, j)th element of W. Then
\[
\tilde{y}_i = \sum_{j=1}^{n} w_{ij}\, y_j
\]
which clearly shows the dependence of y_i on neighboring outcomes via the spatial lag ỹ_i. The weights w_ij will typically be modeled as inversely related to some measure of distance between the units. The SAR parameter λ measures the extent of these interactions.
The innovations are assumed to be independent and identically distributed or in-
dependent but heteroskedastically distributed. The option heteroskedastic, discussed
below, should be specified under the latter assumption.
The spivreg command implements the generalized method of moments (GMM)
and instrumental-variable (IV) estimation strategy discussed in Arraiz et al. (2010) and
2. The variables and parameters in this model are allowed to depend on the sample size; see
Kelejian and Prucha (2010) for further discussions. We suppress this dependence for notational
simplicity. In allowing, in particular, the elements of X to depend on the sample size, we find
that the specification is consistent with some of the variables in X being spatial lags of exogenous
variables.
Drukker, Egger, and Prucha (2013) for the above class of SARAR models. This estima-
tion strategy builds on Kelejian and Prucha (1998, 1999, 2004, 2010) and the references
cited therein. More in-depth discussions regarding issues of model specifications and
estimation approaches can be found in these articles and the literature cited therein.
spivreg requires that the spatial-weighting matrices M and W be provided in the
form of an spmat object as described in Drukker et al. (2013). Both general and banded
spatial-weighting matrices are supported.
3.2 Options
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
dlmat(objname) specifies an spmat object that contains the spatial-weighting matrix
W to be used in the SAR term.
elmat(objname) specifies an spmat object that contains the spatial-weighting matrix
M to be used in the spatial-error term.
noconstant suppresses the constant term in the model.
heteroskedastic specifies that spivreg use an estimator that allows the innovations ε to be heteroskedastically distributed over the observations. By default, spivreg uses an estimator that assumes homoskedasticity.
impower(q) specifies how many powers of the matrix W to include in calculating the instrument matrix H. The default is impower(2). The allowed values of q are integers in the set {2, 3, . . . , √n}.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
maximize_options: iterate(#), [no]log, trace, gradient, showstep, showtolerance, tolerance(#), ltolerance(#), and from(init_specs); see [R] maximize for details. These options are seldom used.
Macros
e(cmd)           spivreg
e(cmdline)       command as typed
e(depvar)        name of dependent variable
e(title)         title in estimation output
e(properties)    b V
e(estat_cmd)     program used to implement estat
e(predict)       program used to implement predict
e(model)         sarar, sar, sare, or lr
e(het)           heteroskedastic or homoskedastic
e(indeps)        names of independent variables
e(exogr)         exogenous regressors
e(insts)         instruments
e(instd)         instrumented variables
e(constant)      noconstant or hasconstant
e(H_omitted)     names of omitted instruments in H
e(idvar)         name of ID variable
e(dlmat)         name of spmat object used in dlmat()
e(elmat)         name of spmat object used in elmat()
Matrices
e(b)             coefficient vector
e(V)             variance–covariance matrix of the estimators
e(delta_2sls)    initial estimate of β and λ
Functions
e(sample)        marks estimation sample
4 Examples
To provide a simple illustration, we use the artificial dataset dui.dta for the con-
tinental U.S. counties.3 The contiguity matrix for the U.S. counties is taken from
Drukker et al. (2013). In Stata, we issue the following commands:
. use dui
. spmat use ccounty using ccounty.spmat
The spatial-weighting matrix is now contained in the spmat object ccounty. This
minmax-normalized spatial-weighting matrix was created in section 2.4 of Drukker et al.
(2013) and was saved to disk in section 11.4.
In the output above, we are just reading in the spatial-weighting-matrix object that
was created and saved in Drukker et al. (2013).
3. The geographical county location data came from the U.S. Census Bureau and can be found
at ftp://ftp2.census.gov/geo/tiger/TIGER2008/. The variables are simulated but inspired by
Powers and Wilson (2004) and Levitt (1997).
Our dependent variable, dui, is defined as the alcohol-related arrest rate per 100,000
daily vehicle miles traveled (DVMT). Figure 1 shows the distribution of dui across
counties, with darker colors representing higher values of the dependent variable. Spatial
patterns in dui are clearly visible.
Our explanatory variables include police (number of sworn officers per 100,000
DVMT); nondui (nonalcohol-related arrests per 100,000 DVMT); vehicles (number of
registered vehicles per 1,000 residents); and dry (a dummy for counties that prohibit
alcohol sale within their borders). Because the size of the police force may be a function
of dui arrest rates, we treat police as endogenous; that is, in this example, Y =
(police). All other included explanatory variables, apart from the spatial lag, are taken
to be exogenous; that is, X = (nondui, vehicles, dry, intercept). Furthermore, we
assume the variable elect is a valid instrument, where elect is 1 if a county government
faces an election and is 0 otherwise. Thus the instrument matrix H is based on Xf =
(nondui, vehicles, dry, elect, intercept) as described above.
In Stata, we can estimate the SARAR model with endogenous variables by typing
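A call consistent with the variable roles described above (the exact option list is our assumption) is

. spivreg dui nondui vehicles dry (police = elect), id(id) dlmat(ccounty) elmat(ccounty)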
                 Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
dui
police       -1.467068    .0434956  -33.73   0.000    -1.552318   -1.381818
nondui       -.0004088    .0008344   -0.49   0.624    -.0020442    .0012267
vehicles      .0989662    .0017653   56.06   0.000     .0955063    .1024261
dry           .4553992    .0278049   16.38   0.000     .4009026    .5098958
_cons         9.671655    .3682685   26.26   0.000     8.949862    10.39345
lambda
_cons         .7340818     .013378   54.87   0.000     .7078614    .7603023
rho
_cons         .2829313     .071908    3.93   0.000     .1419941    .4238685
Instrumented: police
Instruments:  elect
Given the normalization of the spatial-weighting matrix, the parameter space for
λ and ρ is taken to be the interval (−1, 1); see Kelejian and Prucha (2010) for further
discussions of the parameter space. The estimate of λ is positive, large, and significant,
indicating strong SAR dependence in dui. In other words, the alcohol-related arrest
rate for a given county is strongly affected by the alcohol-related arrest rates in the
neighboring counties. One possible explanation for this may be coordination among
police departments. Another may be that strong enforcement in one county may lead
some people to drink in neighboring counties.
The estimated ρ is positive, moderate, and significant, indicating moderate spatial
autocorrelation in the innovations.
The estimated β vector does not have the same interpretation as in a simple lin-
ear model, because including a spatial lag of the dependent variable implies that the
outcomes are determined simultaneously.
5 Postestimation commands
5.1 Syntax
The syntax for predict after spivreg is
predict [type] newvar [if] [in] [, statistic]
6 Methods and formulas
For the discussion below, write the model as
y = Zδ + u
u = ρMu + ε
where Z = (Y, X, Wy) and δ = (π′, β′, λ)′.
GMM and IV estimation approach as discussed in Drukker, Egger, and Prucha (2013) for
the homoskedastic case and in Arraiz et al. (2010) for the heteroskedastic case. Those
articles build on and specialize the estimation theory developed in Kelejian and Prucha
(1998, 1999, 2004, 2010). A full set of assumptions, formal consistency and asymptotic
normality theorems, and further details and discussions are given in that literature.
The IV estimators of δ depend on the choice of a set of instruments, say, H. Suppose
that in addition to the included exogenous variables X, we also have excluded exogenous
variables Xe , allowing us to define Xf = (X, Xe ). If we do not have excluded exogenous
variables, then Xf = X. Following the above literature, the instruments H may then
be taken as the linearly independent columns of
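(The display below is a sketch of the standard Kelejian–Prucha choice of instruments, with q as set by the impower(q) option; the exact form used by spivreg may differ in details.)
\[
\bigl(X_f,\; WX_f,\; \ldots,\; W^{q}X_f,\; MX_f,\; MWX_f,\; \ldots,\; MW^{q}X_f\bigr)
\]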
The motivation for the above instruments is that they are computationally simple
while facilitating an approximation of the ideal instruments under reasonable assump-
tions. Taking q = 2 has worked well in Monte Carlo simulations over a wide range of
specifications. At a minimum, the instruments should include the linearly independent
columns of Xf and MXf , and the rank of H should be at least the number of variables
in Z.4 For the following discussion, it proves convenient to define the instrument projection matrix P_H = H(H′H)⁻¹H′. When there is a constant in the model, it is only included once in H.
The GMM estimators for ρ are motivated by quadratic moment conditions of the form
\[
E\bigl(\varepsilon' A_s\, \varepsilon\bigr) = 0, \qquad s = 1, \ldots, S
\]
where the matrices A_s satisfy tr(A_s) = 0. Specific choices for those matrices will be
given below. We note that under heteroskedasticity, it is furthermore assumed that the
diagonal elements of the matrices As are 0. This assumption simplifies the formula for
the asymptotic variance–covariance (VC) matrix; in particular, it avoids the fact that
the VC matrix must depend on third and fourth moments of the innovations in addition
to second moments.
We next describe the steps involved in computing the GMM and IV estimators and
an estimate of their asymptotic VC matrix. The second step operates on a spatial
Cochrane–Orcutt transformation of the above model given by
y(ρ) = Z(ρ)δ + ε
In the first step, we apply two-stage least squares (2SLS) to the untransformed model by using the instruments H. The 2SLS estimator of δ is then given by
\[
\widehat{\delta} = \bigl(\widehat{Z}'Z\bigr)^{-1}\widehat{Z}'y
\]
where Ẑ = P_H Z.
4. Note that if Xf contains spatially lagged variables, H will contain collinear columns and will not
be full rank. In those cases, we drop collinear columns from H and return the names of omitted
instruments in e(H_omitted).
Writing the GMM estimator in this form shows that we can calculate it by solving a simple nonlinear least-squares problem. By default, S = 2 and homoskedastic is specified. In this case,
\[
A_1 = \Bigl\{1 + \bigl(n^{-1}\operatorname{tr}(M'M)\bigr)^{2}\Bigr\}^{-1}\Bigl\{M'M - n^{-1}\operatorname{tr}(M'M)\,I_n\Bigr\}
\qquad\text{and}\qquad
A_2 = M
\]
If heteroskedastic is specified, then by default,
\[
A_1 = M'M - \operatorname{diag}(M'M)
\qquad\text{and}\qquad
A_2 = M
\]
In the second step, we first estimate δ by 2SLS from the transformed model by using the instruments H, where the spatial Cochrane–Orcutt transformation is evaluated at the first-step estimate ρ̂. The resulting generalized spatial two-stage least-squares (GS2SLS) estimator of δ is given by
\[
\widehat{\delta}(\widehat{\rho}) = \Bigl\{\widehat{Z}(\widehat{\rho})'\,Z(\widehat{\rho})\Bigr\}^{-1}\widehat{Z}(\widehat{\rho})'\,y(\widehat{\rho})
\]
Here Ψ̂^ρρ(ρ̂) denotes an estimator for the VC matrix of the (normalized) sample moment vector based on GS2SLS residuals, say, Ψ^ρρ. The estimators Ψ̂^ρρ(ρ̂) and Ψ^ρρ differ for the cases of homoskedastic and heteroskedastic errors. When homoskedastic is specified, the (r, s)th element of Ψ̂^ρρ(ρ̂) is given by (r, s = 1, 2)
\[
\begin{aligned}
\widehat{\Psi}^{\rho\rho}_{r,s}(\widehat{\rho}) ={}& \bigl\{\widehat{\sigma}^{2}(\widehat{\rho})\bigr\}^{2}(2n)^{-1}\operatorname{tr}\bigl\{(A_r + A_r')(A_s + A_s')\bigr\}
+ \widehat{\sigma}^{2}(\widehat{\rho})\,n^{-1}\,\widehat{a}_r(\widehat{\rho})'\,\widehat{a}_s(\widehat{\rho}) \\
&+ n^{-1}\Bigl\{\widehat{\mu}^{(4)}(\widehat{\rho}) - 3\bigl(\widehat{\sigma}^{2}(\widehat{\rho})\bigr)^{2}\Bigr\}\operatorname{vec}_D(A_r)'\operatorname{vec}_D(A_s) \\
&+ n^{-1}\,\widehat{\mu}^{(3)}(\widehat{\rho})\Bigl\{\widehat{a}_r(\widehat{\rho})'\operatorname{vec}_D(A_s) + \widehat{a}_s(\widehat{\rho})'\operatorname{vec}_D(A_r)\Bigr\}
\end{aligned}
\tag{3}
\]
where
\[
\begin{aligned}
\widehat{a}_r(\widehat{\rho}) &= \widehat{T}(\widehat{\rho})\,\widehat{\alpha}_r(\widehat{\rho}) \\
\widehat{T}(\widehat{\rho}) &= H\,\widehat{P}(\widehat{\rho}) \\
\widehat{P}(\widehat{\rho}) &= \widehat{Q}_{HH}^{-1}\,\widehat{Q}_{HZ}(\widehat{\rho})\Bigl\{\widehat{Q}_{HZ}(\widehat{\rho})'\,\widehat{Q}_{HH}^{-1}\,\widehat{Q}_{HZ}(\widehat{\rho})\Bigr\}^{-1} \\
\widehat{Q}_{HH} &= n^{-1}H'H \\
\widehat{Q}_{HZ}(\widehat{\rho}) &= n^{-1}H'Z(\widehat{\rho}) \\
Z(\widehat{\rho}) &= (I - \widehat{\rho}M)Z \\
\widehat{\alpha}_r(\widehat{\rho}) &= -n^{-1}\Bigl\{Z(\widehat{\rho})'(A_r + A_r')\,\widehat{\varepsilon}(\widehat{\rho})\Bigr\} \\
\widehat{\varepsilon}(\widehat{\rho}) &= (I - \widehat{\rho}M)\,\widehat{u} \\
\widehat{\sigma}^{2}(\widehat{\rho}) &= n^{-1}\,\widehat{\varepsilon}(\widehat{\rho})'\,\widehat{\varepsilon}(\widehat{\rho}) \\
\widehat{\mu}^{(3)}(\widehat{\rho}) &= n^{-1}\sum_{i=1}^{n}\widehat{\varepsilon}_i(\widehat{\rho})^{3} \\
\widehat{\mu}^{(4)}(\widehat{\rho}) &= n^{-1}\sum_{i=1}^{n}\widehat{\varepsilon}_i(\widehat{\rho})^{4}
\end{aligned}
\]
When heteroskedastic is specified, the (r, s)th element of Ψ̂^ρρ(ρ̂) is given by (r, s = 1, 2)
\[
\widehat{\Psi}^{\rho\rho}_{r,s}(\widehat{\rho}) = (2n)^{-1}\operatorname{tr}\Bigl\{(A_r + A_r')\,\widehat{\Sigma}(\widehat{\rho})\,(A_s + A_s')\,\widehat{\Sigma}(\widehat{\rho})\Bigr\}
+ n^{-1}\,\widehat{a}_r(\widehat{\rho})'\,\widehat{\Sigma}(\widehat{\rho})\,\widehat{a}_s(\widehat{\rho})
\tag{4}
\]
where Σ̂(ρ̂) is the diagonal matrix whose ith diagonal element is the squared GS2SLS residual ε̂_i(ρ̂)².
Having computed the estimator θ̂ = (δ̂′, ρ̂)′ in steps 1a, 1b, 2a, and 2b, we next compute a consistent estimator for its asymptotic VC matrix, say, Ω. The estimator is given by nΩ̂, where
\[
\widehat{\Omega} =
\begin{bmatrix}
\widehat{\Omega}^{\delta\delta} & \widehat{\Omega}^{\delta\rho} \\
\widehat{\Omega}^{\delta\rho\,\prime} & \widehat{\Omega}^{\rho\rho}
\end{bmatrix}
\]
and
\[
\begin{aligned}
\widehat{\Omega}^{\delta\delta} &= \widehat{P}(\widehat{\rho})'\,\widehat{\Psi}^{\delta\delta}(\widehat{\rho})\,\widehat{P}(\widehat{\rho}) \\
\widehat{\Omega}^{\delta\rho} &= \widehat{P}(\widehat{\rho})'\,\widehat{\Psi}^{\delta\rho}(\widehat{\rho})\Bigl\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\Bigr\}^{-1}J\,\Bigl[J'\Bigl\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\Bigr\}^{-1}J\Bigr]^{-1} \\
\widehat{\Omega}^{\rho\rho} &= \Bigl[J'\Bigl\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\Bigr\}^{-1}J\Bigr]^{-1} \\
J &= \widehat{\Gamma}\begin{pmatrix}1 \\ 2\widehat{\rho}\end{pmatrix}
\end{aligned}
\]
In the above, Ψ̂^ρρ(ρ̂) and P̂(ρ̂) are as defined in (3) and (4), evaluated at the final estimate of ρ. The estimators Ψ̂^δδ(ρ̂) and Ψ̂^δρ(ρ̂) are defined as follows. When homoskedastic is specified,
\[
\begin{aligned}
\widehat{\Psi}^{\delta\delta}(\widehat{\rho}) &= \widehat{\sigma}^{2}(\widehat{\rho})\,\widehat{Q}_{HH} \\
\widehat{\Psi}^{\delta\rho}(\widehat{\rho}) &= \widehat{\sigma}^{2}(\widehat{\rho})\,n^{-1}H'\bigl\{\widehat{a}_1(\widehat{\rho}),\,\widehat{a}_2(\widehat{\rho})\bigr\}
+ \widehat{\mu}^{(3)}(\widehat{\rho})\,n^{-1}H'\bigl\{\operatorname{vec}_D(A_1),\,\operatorname{vec}_D(A_2)\bigr\}
\end{aligned}
\]
When heteroskedastic is specified,
\[
\begin{aligned}
\widehat{\Psi}^{\delta\delta}(\widehat{\rho}) &= n^{-1}H'\,\widehat{\Sigma}(\widehat{\rho})\,H \\
\widehat{\Psi}^{\delta\rho}(\widehat{\rho}) &= n^{-1}H'\,\widehat{\Sigma}(\widehat{\rho})\bigl\{\widehat{a}_1(\widehat{\rho}),\,\widehat{a}_2(\widehat{\rho})\bigr\}
\end{aligned}
\]
We note that the expression for Ω̂^ρρ has the simple form given above because the estimator in step 2b is the efficient GMM estimator.
When homoskedastic is specified, the asymptotic VC matrix of δ̂ can be estimated consistently by
\[
\widehat{\sigma}^{2}\bigl(\widehat{Z}'Z\bigr)^{-1}
\]
and, when heteroskedastic is specified, by
\[
\bigl(\widehat{Z}'Z\bigr)^{-1}\widehat{Z}'\,\widehat{\Sigma}\,\widehat{Z}\bigl(\widehat{Z}'Z\bigr)^{-1}
\]
where Ẑ = P_H Z.
7 Conclusion
We have described the spivreg command for estimating the parameters of a SARAR
model with additional endogenous RHS variables. In the future, we plan to add options
for optimal predictors corresponding to different information sets.
8 Acknowledgment
We gratefully acknowledge financial support from the National Institutes of Health
through the SBIR grants R43 AG027622 and R44 AG027622.
9 References
Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer
Academic Publishers.
———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.
Anselin, L., and R. J. G. M. Florax. 1995. Small sample properties of tests for spatial
dependence in regression models: Some further results. In New Directions in Spatial
Econometrics, ed. L. Anselin and R. J. G. M. Florax, 21–74. Berlin: Springer.
Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.
Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2013. Creating and managing
spatial-weighting matrices with the spmat command. Stata Journal 13: 242–286.
Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013. Maximum likelihood and gen-
eralized spatial two-stage least-squares estimators for a spatial-autoregressive model
with spatial-autoregressive disturbances. Stata Journal 13: 221–241.
Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge
University Press.
Kelejian, H. H., and I. R. Prucha. 1998. A generalized spatial two-stage least squares
procedure for estimating a spatial autoregressive model with autoregressive distur-
bances. Journal of Real Estate Finance and Economics 17: 99–121.
———. 2007. The relative efficiencies of various predictors in spatial econometric models
containing spatial lags. Regional Science and Urban Economics 37: 363–374.
———. 2010. Specification and estimation of spatial autoregressive models with au-
toregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.
Levitt, S. D. 1997. Using electoral cycles in police hiring to estimate the effect of police
on crime. American Economic Review 87: 270–290.
Powers, E. L., and J. K. Wilson. 2004. Access denied: The relationship between alcohol
prohibition and driving under the influence. Sociological Inquiry 74: 318–337.
Nicola Orsini
Unit of Biostatistics and Unit of Nutritional Epidemiology
Institute of Environmental Medicine
Karolinska Institutet
Stockholm, Sweden
[email protected]
Abstract. We present the new laplace command for estimating Laplace re-
gression, which models quantiles of a possibly censored outcome variable given
covariates. We illustrate laplace with an example from a clinical trial on survival
in patients with metastatic renal carcinoma. We also report the results of a small
simulation study.
Keywords: st0294, laplace, quantile regression, censored outcome, survival analy-
sis, Kaplan–Meier
1 Introduction
Estimating percentiles for a time-to-event variable of interest conditionally on covariates
may offer a useful complement to current approaches to survival analysis. For exam-
ple, comparing survival across treatments or exposure levels in observational studies
at various percentiles (for example, at the 50th or 10th percentiles) provides impor-
tant insights. At the univariate level, this can be accomplished with the Kaplan–Meier
estimator.
Laplace regression can be used to estimate the effect of risk factors and impor-
tant predictors on survival percentiles while adjusting for other covariates. The user-
written clad command (Jolliffe, Krushelnytskyy, and Semykina 2000) estimates condi-
tional quantiles only when censoring times are fixed and known for all observations
(Powell 1986), and its applicability is limited.
In this article, we present the laplace command for estimating Laplace regression
(Bottai and Zhang 2010). In section 2, we describe the syntax and options. In section 3,
we illustrate laplace with data from a randomized clinical trial. In section 4, we sketch
the methods and formulas. In section 5, we present the results of a small simulation
study.
by, statsby, and xi are allowed with laplace; see [U] 11.1.10 Prefix commands.
See [R] qreg postestimation for features available after estimation.
2.2 Options
quantiles(numlist) specifies the quantiles as numbers between 0 and 1; numbers larger
than 1 are interpreted as percentages. The default is quantiles(0.5), which cor-
responds to the median.
failure(varname) specifies the failure event; the value 0 indicates censored observa-
tions. If failure() is not specified, all observations are assumed to be uncensored.
sigma(varlist) specifies the variables to be included in the scale parameter model. The
default is constant only.
reps(#) specifies the number of bootstrap replications to be performed for estimating
the variance–covariance matrix and standard errors of the regression coefficients.
seed(#) sets the initial value of the random-number seed used by the bootstrap. If
seed() is specified, the bootstrapped estimates are reproducible (see [R] set seed).
tolerance(#) specifies the tolerance for the optimization algorithm. When the abso-
lute change in the log likelihood from one iteration to the next is less than or equal to
#, the tolerance() convergence criterion is met. The default is tolerance(1e-10).
maxiter(#) specifies the maximum number of iterations. When the number of itera-
tions equals maxiter(), the optimizer stops, displays an x, and presents the current
results. The default is maxiter(2000).
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
. use kidney_ca_l
(kidney cancer data)
. quietly stset months, failure(cens)
The numeric variable months represents the time to event or censoring, and the binary
variable cens indicates the failure status (0 = censored, 1 = death).
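The median regression reported below comes from a call to laplace along these lines (a
sketch; the option names follow section 2.2, and the exact invocation is an assumption):
. laplace months trt, failure(cens)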
Robust
months Coef. Std. Err. z P>|z| [95% Conf. Interval]
q50
trt 3.130258 1.195938 2.62 0.009 .7862628 5.474254
_cons 6.80548 .7188408 9.47 0.000 5.396578 8.214382
The estimated median survival in the MPA group is 6.8 months (95% confidence
interval: [5.4, 8.2]). The difference (trt) in median survival between the treatment
groups is 3.1 months (95% confidence interval: [0.8, 5.5]). Median survival among
patients on IFN can be obtained with the postestimation command lincom.
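Because trt is the difference between groups and _cons is the MPA median, the IFN
median can be recovered as a linear combination of the two coefficients. A sketch, using
the equation and coefficient names shown in the output above:
. lincom [q50]_cons + [q50]trt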
Percentiles of survival time by treatment group can also be obtained from the Kaplan–
Meier estimate of the survivor function by using the command stci.
. stci, by(trt)
failure _d: cens
analysis time _t: months
no. of
trt subjects 50% Std. Err. [95% Conf. Interval]
The estimated median in the IFN group (9.8 months) differs slightly from the laplace
estimate (9.9 months) shown above. The Kaplan–Meier curve in the IFN group is flat
at the 50th percentile between 9.83 and 9.96 months of follow-up. The command stci
shows the lower limit of this interval while laplace shows a middle value.
Bootstrap
months Coef. Std. Err. z P>|z| [95% Conf. Interval]
q25
trt 1.509151 .8289345 1.82 0.069 -.1155312 3.133832
_cons 2.49863 .399623 6.25 0.000 1.715384 3.281877
q50
trt 3.130258 1.209658 2.59 0.010 .7593719 5.501145
_cons 6.80548 .9100921 7.48 0.000 5.021732 8.589227
q75
trt 3.663238 3.482536 1.05 0.293 -3.162407 10.48888
_cons 15.87945 1.714295 9.26 0.000 12.5195 19.23941
The treatment effect is larger at higher percentiles of survival time. The difference
between the two treatment groups at the 25th, 50th, and 75th percentiles is 1.5, 3.1,
and 3.7 months, respectively. When bootstrap is requested, one can test for differences
in treatment effects across survival percentiles with the postestimation command test.
We fail to reject the hypothesis that the treatment effects at the 25th and 50th survival
percentiles are equal (p-value > 0.05).
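Using the equation names displayed in the output (q25, q50, q75), such a test could be
written as follows (a sketch; the exact coefficient references are an assumption):
. test [q25]trt = [q50]trt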
Figure 1 shows the predicted percentiles from the 1st to the 99th in each treatment
group. The difference of 3 months in median survival between groups is represented by
the horizontal distance between the points A and B. Approximately 30% and 40% of the
patients on MPA and IFN, respectively, are estimated to live longer than 12 months. The
absolute difference of about 10% in the probability of surviving 12 months is represented
by the vertical distance between the points C and D.
(Figure 1 here: predicted survival percentiles, 0-100 on the y axis, against follow-up
time in months, 0-60 on the x axis, with points A, B, C, and D marked on the two curves.)
Figure 1. Survival percentiles in the MPA (solid line) and IFN (dashed line) groups
estimated with Laplace regression. The horizontal distance between the points A and B
(3.1 months) indicates the difference in median survival between groups. The vertical
distance between C and D (about 10%) indicates the difference in the proportion of
patients estimated to survive 12 months.
Robust
months Coef. Std. Err. z P>|z| [95% Conf. Interval]
q50
_Itrt_1 8.01462 2.270786 3.53 0.000 3.563962 12.46528
_Icwcc_1 2.262442 2.068403 1.09 0.274 -1.791554 6.316438
_Icwcc_2 -2.496523 1.645959 -1.52 0.129 -5.722544 .7294982
_ItrtXcwc_1_1 -5.737988 3.241483 -1.77 0.077 -12.09118 .6152021
_ItrtXcwc_1_2 -7.751629 2.645534 -2.93 0.003 -12.93678 -2.566478
_cons 6.90203 1.658547 4.16 0.000 3.651337 10.15272
The predicted median survival can be obtained with standard postestimation commands
such as predict or adjust.
. adjust, by(trt cwcc) format(%2.0f) noheader
MPA 7 9 4
IFN 15 11 5
We reject the null hypothesis of equal treatment effect across categories of white cell
counts (p = 0.0137). The treatment effect seems to be largest in patients with low white
cell counts.
treatment xb
MPA 6.77
IFN 9.89
treatment xb
MPA 6.77
IFN 9.96
The number of observations in the MPA group is odd (175 patients), and the sample
median survival is 6.77 months. The number of observations in the IFN group is even
(172 patients), and the median is not uniquely defined. The two nearest values are 9.83
and 9.96 months. The command qreg picks the larger of the two, while laplace picks
a value in between.
where βp = {βp,1 , . . . , βp,r } and σp = {σp,1 , . . . , σp,s } indicate the unknown parameter
vectors, and εi are independent and identically distributed error terms that follow a
standard Laplace distribution, f (εi ) = p(1 − p) exp{[I(εi ≤ 0) − p]εi }. For any given
p ∈ (0, 1), the p-quantile of the conditional distribution of ti given xi and zi is xi βp
because P (ti ≤ xi βp |xi , zi ) = p.
The command laplace estimates the (r + s)-dimensional parameter vector {βp , σp }
by maximizing the Laplace likelihood function described by Bottai and Zhang (2010).
It uses an iterative maximization algorithm based on the gradient of the log likelihood
that generates a finite sequence of parameter values along which the likelihood increases.
Briefly, from a current parameter value, the algorithm searches the positive semiline in
the direction of the gradient for a new parameter value where the likelihood is larger.
The algorithm stops when the change in the likelihood is less than the specified tolerance.
Convergence is guaranteed by the continuity and concavity of the likelihood.
The asymptotic variance of the estimator of βp is derived by con-
sidering the estimating condition reported by Bottai and Zhang (2010, eq. 4), S(βp ) = 0,
where
S(\beta_p) = \sum_{i=1}^{n} \frac{x_i}{\exp(z_i \hat{\sigma}_p)} \left\{ p - I(y_i \le x_i \beta_p) - \frac{p - 1}{1 - F(y_i \mid x_i)}\, I(y_i \le x_i \beta_p)(1 - d_i) \right\}
5 Simulation
In this section, we present the setup and results of a small simulation study to as-
sess the finite sample performance of the Laplace regression estimator under different
data-generating mechanisms. We contrast the performance of Laplace with that of the
Kaplan–Meier estimator, a standard, nonparametric, uniformly consistent, and asymp-
totically normal estimator of the survival function. To generate the survival estimates,
we used the sts command.
We generated 500 samples from (1) in each of the six different simulation scenarios
that arose from the combination of two sample sizes and three data-generating mech-
anisms. In each scenario, we estimated five percentiles (p = 0.10, 0.30, 0.50, 0.70, 0.90)
with Laplace regression and the Kaplan–Meier estimator. The two sample sizes were
n = 100 and n = 1,000. The three different data-generating mechanisms were obtained
by changing the values of zi , σp , and the censoring variable ci . In all simulation sce-
narios, xi = (1, x1,i)′, with x1,i ∼ Bernoulli(0.5), βp = (5, 3)′, and εi was a standard
normal centered at the quantile being estimated.
In scenario number 1, zi = 1, σp = 1, and the censoring variable was set equal
to a constant ci = 1,000 for all individuals. In this scenario, no observations were
censored, and Laplace regression was equivalent to ordinary quantile regression. In
scenario number 2, zi = 1, σp = 1, and the censoring variable was generated from the
same distribution as the outcome variable ti . This ensured an expected censoring rate
of 50% in both covariate patterns (x1,i = 0, 1). In scenario number 3, zi = (1, x1,i)′ and
σp = (0.5, 0.5)′. The censoring variable ci was generated from the same distribution as
the outcome variable ti . In this scenario, the standard deviation of ti was equal to 0.5
when x1,i = 0 and equal to 1 when x1,i = 1.
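For illustration, one replication of scenario 2 with n = 100 and p = 0.5 might be generated
along these lines (a hedged sketch, not the authors' simulation code; the variable names,
the handling of the quantile-centered error, and the laplace call are assumptions):
. clear
. set obs 100
. set seed 12345
. generate byte x1 = runiform() < 0.5
. generate double t = 5 + 3*x1 + rnormal()    // for p = 0.5 the error is already centered at the quantile
. generate double c = 5 + 3*x1 + rnormal()    // censoring drawn from the same distribution as t
. generate double y = min(t, c)
. generate byte d = t <= c                    // failure indicator; expected censoring about 50%
. laplace y x1, failure(d) quantiles(0.5)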
The following table shows the observed relative mean squared error multiplied by
1,000 for the predicted quantile in the group x1,i = 1 in each combination of sample size
(obs), data-generating scenario (scenario), and percentile (percentile) for Laplace
(top entry) and Kaplan–Meier (bottom entry).
The relative mean squared error was smaller for Laplace than for Kaplan–Meier at lower
quantiles and with the smaller sample size.
Figure 2 shows the relative mean squared error of Laplace (x axis) and Kaplan–Meier
(y axis) estimators of the quantile in group x1,i = 1 over all simulation scenarios.
The Laplace estimator had fewer extreme values than Kaplan–Meier. The overall
concordance correlation coefficient (command concord) was 72.2%. After the 10%
largest differences were excluded, the coefficient was 99.1%.
Figure 2. Relative mean squared error of Laplace (x axis) and Kaplan–Meier (y axis)
estimators of the percentiles in group x1,i = 1 over all simulation scenarios. The solid
45-degree line indicates the equal relative mean squared error of the two estimators.
The following two tables show the performance of the estimator of the asymptotic
standard error for the regression coefficients βp,0 (first table) and βp,1 (second table).
In each cell of each table, the top entry is the average estimated asymptotic standard
error, and the bottom entry is the corresponding observed standard deviation across
the simulated samples.
The estimated standard errors were similar to the observed standard deviation across
all cells for both regression coefficients.
6 Acknowledgment
Nicola Orsini was partly supported by a Young Scholar Award from the Karolinska
Institutet’s Strategic Program in Epidemiology.
7 References
Bottai, M., and J. Zhang. 2010. Laplace regression with censored data. Biometrical
Journal 52: 487–503.
Jolliffe, D., B. Krushelnytskyy, and A. Semykina. 2000. sg153: Censored least absolute
deviations estimator: CLAD. Stata Technical Bulletin 58: 13–16. Reprinted in Stata
Technical Bulletin Reprints, vol. 10, pp. 240–244. College Station, TX: Stata Press.
Medical Research Council Renal Cancer Collaborators. 1999. Interferon-α and survival
in metastatic renal carcinoma: Early results of a randomised controlled trial. Lancet
353: 14–17.
Mehmet F. Dicle
Loyola University New Orleans
New Orleans, LA
[email protected]
1 Introduction
Economic and financial researchers must often convert between currencies to facilitate
cross-country comparisons. We provide a command, fxrates, that downloads daily
foreign exchange rates relative to the U.S. dollar from the Federal Reserve’s database.
Working with multiple cross-country datasets, such as international foreign exchange
rates, introduces a unique problem: variations in country names. They are often spelled
differently or follow different grammatical conventions across datasets. For example,
North Korea is often different among datasets; it could be “North Korea”, “Korea,
North”, “Korea, Democratic People’s Republic”, or even “Korea, DPR”. Likewise,
“United States of America” is often “United States”, “USA”, “U.S.A.”, “U.S.”, or “US”.
A dataset may have country names in all caps. Country names could also have inad-
vertent leading or trailing spaces. Thus we provide a second command, countrynames,
that renames many country names to follow a standard convention. The command is,
of course, editable, so researchers may opt to use their own naming preferences.
2.2 Options
namelist is a list of country abbreviations for the countries whose foreign exchange data
you wish to download from the Federal Reserve’s website. Exchange rates for all
available countries will be downloaded if namelist is omitted. The list of countries
includes the following:
al Australia ma Malaysia
au Austria mx Mexico
be Belgium ne Netherlands
bz Brazil nz New Zealand
ca Canada no Norway
ch China, P.R. po Portugal
dn Denmark si Singapore
eu Economic and Monetary Union member countries sf South Africa
ec European Union ko South Korea
fn Finland sp Spain
fr France sl Sri Lanka
ge Germany sd Sweden
gr Greece sz Switzerland
hk Hong Kong ta Taiwan
in India th Thailand
ir Ireland uk United Kingdom
it Italy ve Venezuela
ja Japan
period(2000 | 1999 | 1989) specifies which block of dates to download. The Federal
Reserve foreign exchange database is separated into three blocks: one ending in
1989, a second for 1990–1999, and a third for 2000 through the present. The default
(obtained by omitting period()) is to download the three separate files and merge
them automatically so that the user has all foreign exchange market data available.
You can specify one or more periods. If you know which data range you wish to
download, however, you can save time by specifying which of the three blocks to
download. Specifying all three periods is equivalent to the default of downloading
all the data.
chg(ln | per | sper) specifies the periodic return to calculate. Three different percent
changes can be calculated for the adjusted closing price: the natural log difference, the
percentage change, and the symmetrical percentage change. Whenever one of these is
specified, a new variable is created with the appropriate prefix: ln for the first-difference
of logs method, per for the percent change, and sper for the symmetric percent change
(see the sketch following these options).
save(filename) is the output filename. filename is created under the current working
directory.
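The three changes correspond to standard return definitions. For a price series p ordered
by a date variable, they could be computed by hand roughly as follows (a sketch; the
variable names are hypothetical, the changes are assumed to be expressed in percent, and
the symmetric change is assumed to use the midpoint denominator, so fxrates' internal
scaling may differ):
. tsset date
. generate ln_p   = 100*(ln(p) - ln(L.p))          // log first-difference
. generate per_p  = 100*(p - L.p)/L.p              // arithmetic percent change
. generate sper_p = 100*(p - L.p)/((p + L.p)/2)    // symmetric percent change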
Example
In this example, we use fxrates to import the entire daily exchange rate dataset
from the Federal Reserve. Because we did not specify the countries, fxrates downloads
data from all countries. Because we did not specify the period, fxrates defaults to
downloading data for all available dates.
. fxrates
au does not have 00
be does not have 00
(output omitted )
ve does not have 89
. summarize
Variable Obs Mean Std. Dev. Min Max
(output omitted )
Output such as au does not have 00 indicates that there were no observations in a
particular block of years (in this case, the 2000–present block) for the particular country.
When this appears, it is most often the case that the currency has been discontinued,
as when Austria started using the euro.
Example
In this second example, we download the exchange rates of the U.S. dollar versus the
French franc, the German deutschmark, and the Hong Kong dollar for all the available
dates.
. fxrates fr ge hk
fr does not have 00
ge does not have 00
. summarize
Variable Obs Mean Std. Dev. Min Max
Example
In this example, we download the exchange rate data for United States versus France,
Germany, and Hong Kong. Because no period was specified, fxrates downloads the
data from all available dates. We also specified that fxrates calculate the daily percent
change, calculated in two different ways: as the log first-difference and as the arithmetic
daily percent change. The log-difference percent change for each country is prefixed by
ln; the arithmetic percent change for each country is prefixed by per.
. fxrates fr ge hk, chg(ln per)
fr does not have 00
ge does not have 00
. summarize
Variable Obs Mean Std. Dev. Min Max
Example
In this final example, we download the U.S. dollar exchange rate versus the Japanese
yen and the Mexican peso. We calculate the daily percent change by calculating the
first-differences of natural logs for the data ending in 1999 (that is, for the data ending
in 1989 plus the data from 1990 through 1999).
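The call described here would look something like the following (a sketch; how multiple
period blocks are listed is an assumption based on the period() description above):
. fxrates ja mx, chg(ln) period(1989 1999)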
countrynames countryvar
3.2 Description
The command countrynames changes the name of a country in a dataset to corre-
spond to a more standard set of names. By default, countrynames creates a new
variable, changed, containing numeric codes that indicate which country names have
been changed. A code of 0 indicates no change; a code of 1 indicates that the coun-
try’s name has been changed. We recommend you run countrynames on both datasets
whenever two different cross-country datasets are being merged. This minimizes the
chance that a difference in names between datasets will prevent a proper merge from
occurring. However, if you wish to keep a variable with the original names, you need
to copy the variable to another variable. For example, before running countrynames
country, you would need to type generate origcountry = country.
Example
In this example, we use two macroeconomic datasets that have countries named
slightly differently. The first dataset is native to and shipped with Stata.
Though the dataset is very small, it suffices for our purposes. Notice the spelling of
United States in this dataset.
. list
1. Australia .7 .7
2. Britain .7 .4
3. Canada 1.5 .9
4. Denmark 1.5 .1
5. France .9 .4
6. Germany .9 .2
7. Ireland 1.1 .3
8. Netherlands 1 .4
9. Sweden 1.5 .2
10. United States 1.1 1.2
In fact, all the spellings in this dataset correspond with the preferred names listed in
countrynames, so nothing is required of us here. We could run countrynames just to
be on the safe side, but it would not have any effect. It is, however, good practice to
run countrynames whenever merging datasets to maximize the chances that the two
datasets use the same country names.
The second dataset, using World Health Organization data, is from Kohler and
Kreuter (2005). The data are available from the Stata website.
. net from http://www.stata-press.com/data/kk/
(output omitted )
. net get data
(output omitted )
. use who2001.dta, clear
Notice how the United States is called United States of America in this dataset.
. list country
country
1. Afghanistan
2. Albania
(output omitted )
180. United States of America
(output omitted )
187. Zambia
188. Zimbabwe
We now run countrynames on this dataset to standardize the names of the countries.
This will rename United States of America to United States, as it was in the first
dataset.
. countrynames country
. list country _changed
country _changed
1. Afghanistan 0
2. Albania 0
(output omitted )
180. United States 1
(output omitted )
187. Zambia 0
188. Zimbabwe 0
Notice that the generated variable, changed, is equal to 1 for the United States entry;
this indicates that its name was once something different.
Having run countrynames on both datasets, we have increased the chances that
countries in both datasets follow the same naming convention. We are now safe to
merge the datasets:
. drop _changed
. sort country
. merge 1:1 country using temp1.dta
Result # of obs.
The merge results table above is important: It is the result of merging a dataset that
used the countrynames command (master: who2001.dta) with a dataset that did not
use the countrynames command (using: temp1.dta). If a dataset includes a country
name that countrynames does not rename to the standard spelling, it will appear as
unmatched in the merge results table.
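A quick way to spot such problem rows after the merge is to list the observations that
did not match (a standard Stata idiom rather than part of the original example):
. list country _merge if _merge != 3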
. sort country
. list country
country
1. Afghanistan
2. Albania
(output omitted )
180. United Arab Emirates
(output omitted )
188. Zambia
189. Zimbabwe
4 Reference
Kohler, U., and F. Kreuter. 2005. Data Analysis Using Stata. College Station, TX:
Stata Press.
1. Please note that if we ever do an update to our program, the user edits to the ado-file will be lost
when users grab the updated ado-file.
1 Introduction
The number of published genome-wide association studies (GWAS) has seen a staggering
level of growth from 453 in 2007 to 2,137 in 2010 (Hindorff et al. 2011). These studies
aim to identify the genetic cause for a wide range of diseases, including Alzheimer’s
(Harold et al. 2009), cancer (Hunter et al. 2007), and diabetes (Hayes et al. 2007), and
to elucidate variability in traits, behavior, and other phenotypes. This is accom-
plished by looking at hundreds of thousands to millions of single nucleotide poly-
morphisms and other genetic features across upward of 10,000 individual genomes
(Corvin, Craddock, and Sullivan 2010). These studies generate enormous amounts of
data, which present challenges for researchers in handling data, conducting statistics,
and visualizing data (Buckingham 2008).
One method of visualizing GWAS data is through the use of Manhattan plots, so
called because of their resemblance to the Manhattan skyline. Manhattan plots are
scatterplots, but they are graphed in a characteristic way. To create a Manhattan plot,
you need to calculate p-values, which are generated through one of a variety of statistical
tests. However, because of the large number of hypotheses being tested in a GWAS, local
significance levels typically fall below p = 10^{-5} (Ziegler, König, and Thompson 2008).
Resulting p-values associated with each marker are −log10 transformed and plotted on
the y axis against their chromosomal position on the x axis. Chromosomes lie end to
end on the x axis and often include the 22 autosomal chromosomes and the X, Y, and
mitochondrial chromosomes.
Manhattan plots are useful for a variety of reasons. They allow investigators to
visualize hundreds of thousands to millions of p-values across an entire genome and to
quickly identify potential genetic features associated with phenotypes. They also enable
2 Data formatting
Following data cleaning and statistical tests, researchers are typically left with a dataset
consisting of, at a minimum, a list of genetic features (string), p-values (real), chromo-
somes (integer), and their base pair location on a chromosome (integer). Using the
manhattan command, a user specifies these variables. manhattan uses temporary vari-
ables to manipulate data into a format necessary for plotting. The program first iden-
tifies the number of chromosomes present and generates base pair locations relative to
their distance from the beginning of the first chromosome as if they were laid end to end
in numerical order. The format in which p-values are specified is detected and, if need
be, log transformed. manhattan then calculates the median base pair location of each
chromosome as locations to place labels. Labels are generated from the chromosome
numbers, except for the sex and mitochondrial chromosomes: chromosomes 23, 24, and
25 are labeled X, Y, and M, respectively.
Once data have been reformatted in manhattan, plots are generated. Additional
options may require additional data manipulation. These options include spacing(),
bonferroni(), and mlabel().
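The reformatting step can be pictured with a short sketch of the idea (hypothetical
variable names pvalue, chromosome, and basepair; this is an illustration, not manhattan's
actual code):
. generate double logp = -log10(pvalue)
. egen double chrlen = max(basepair), by(chromosome)
. sort chromosome basepair
. by chromosome: generate byte first = _n == 1
. generate double cum = sum(chrlen*first)      // running total of chromosome lengths
. generate double offset = cum - chrlen        // start position of each chromosome
. generate double pos = offset + basepair      // position as if chromosomes lay end to end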
3.2 Options
options Description
Plot options
title(string) display a title
caption(string) display a caption
xlabel(string) set x label; default is xlabel(Chromosome)
width(#) set width of plot; default is width(15)
height(#) set height of plot; default is height(5)
Chromosome options
x(#) specify chromosome number to be labeled as
X; default is x(23)
y(#) specify chromosome number to be labeled as
Y ; default is y(24)
mito(#) specify chromosome number to be labeled as
M ; default is mito(25)
Graph options
bonferroni(h | v | n) draw a line at Bonferroni significance level;
label line with horizontal (h), vertical (v),
or no (n) labels
mlabel(var) set a variable to use for labeling markers
mthreshold(# | b) set a −log(p-value) above which markers will
be labeled, or use b to set your threshold
at the Bonferroni significance level
yline(#) set the −log(p-value) at which to draw a line
labelyline(h | v) label line specified with yline() by using
horizontal labels (h) or vertical labels (v)
addmargin add a margin to the left and right of the plot,
leaving room for labels
Style options
color1(color) set first color of markers
color2(color) set second color of markers
linecolor(color) set the color of Bonferroni line and label
or y line and label
4 Examples
The following examples were created using manhattan gwas.dta, which is available
as an ancillary file within the manhattan package. All the p-values were generated
randomly; therefore, all genetic elements are in linkage equilibrium and are not linked.
4.1 Example 1
Below you will find a typical Manhattan plot generated with manhattan. Several options
were specified in the generation of this plot. First, bonferroni(h) is used to specify
that a line be drawn at the Bonferroni level of significance. The h indicates that the
label should be placed horizontally, on the line. Next, mlabel(snp) is used to indicate
that markers should be labeled with the variable snp, which contains the names of each
marker. Additionally, mthreshold(b) is used to set a value at which to begin labeling
markers. In this case, b is used to indicate that markers should be labeled at −log10
(p-values) greater than the Bonferroni significance level. Finally, addmargin is used to
add space on either side of the plot to prevent labels from running off the plot.
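The call that produces such a plot would look roughly like this (the variables follow the
data description in section 2; because the syntax diagram is not reproduced here, the
argument order is an assumption):
. manhattan chromosome basepair pvalue, bonferroni(h) mlabel(snp) mthreshold(b) addmargin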
(Manhattan plot for example 1: chromosomes 1-22 and X on the x axis; the labeled
markers are snp_97994, snp_69797, snp_63831, and snp_94406.)
4.2 Example 2
Here yline(6.5) is used to draw a horizontal line at 6.5 on the −log10(p-value) scale, and labelyline(v)
adds an axis label for the value of this line. Additionally, the variable used for marker
labels is identified using mlabel(snp), and a threshold at which to begin adding labels
to markers is given as the same value as the horizontal line by using mthreshold(6.5).
Spacing is added between chromosomes with spacing(1) to keep labels on the x axis
from running into one another. Finally, a margin is added on either side of the plot by
using addmargin, because some of the marker labels would otherwise fall off the plot.
The colors of the markers are changed with color1(black) and color2(gray). The
color of the reference line drawn with yline() has been changed to black by using
linecolor(black).
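Again hedging on the exact argument order, the options described above correspond to
a call along these lines:
. manhattan chromosome basepair pvalue, yline(6.5) labelyline(v) mlabel(snp) mthreshold(6.5)
>     spacing(1) addmargin color1(black) color2(gray) linecolor(black)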
(Manhattan plot for example 2: chromosomes 1-22 and X on the x axis, −log10 p-values
on the y axis with a reference line at 6.5; the labeled marker is snp_97994.)
5 Conclusions
As the number of GWAS publications continues to grow, easier tools are needed for in-
vestigators to manipulate, perform statistics on, and visualize data. manhattan aims to
provide an easier, more standard method by which to visualize GWAS data in Stata. We
welcome help in the development of manhattan by users and hope to improve manhattan
in response to user suggestions and comments.
6 Acknowledgments
This work was supported by the March of Dimes (1-FY05-126 and 6-FY08-260), the Na-
tional Institutes of Health (R01 HD-52953, R01 HD-57192), and the Eunice Kennedy Shriver
National Institute of Child Health and Human Development (K99 HD-065786). The con-
tent is solely the responsibility of the authors and does not necessarily represent the
official views of the National Institutes of Health or the Eunice Kennedy Shriver Na-
tional Institute of Child Health and Human Development.
7 References
Buckingham, S. D. 2008. Scientific software: Seeing the SNPs between us. Nature
Methods 5: 903–908.
Vincenzo Verardi
University of Namur
Centre for Research in the Economics of Development (CRED)
Namur, Belgium
and
Université Libre de Bruxelles
European Center for Advanced Research in Economics and Statistics (ECARES)
and Center for Knowledge Economics (CKE)
Brussels, Belgium
[email protected]
1 Introduction
The objective of this article is to present a Stata implementation of Baltagi and Li’s
(2002) series estimation of partially linear panel-data models.
The structure of the article is as follows. Section 2 describes Baltagi and Li’s (2002)
fixed-effects semiparametric regression estimator. Section 3 presents the implemented
Stata command (xtsemipar). Some simple simulations assessing the performance of
the estimator are shown in section 4. Section 5 provides a conclusion.
2 Estimation method
2.1 Baltagi and Li’s (2002) semiparametric fixed-effects regression
estimator
Consider a general panel-data semiparametric model with distributed intercept of the
type
which can be consistently estimated by using ordinary least squares. Having estimated
θ and γ, we propose to fit the fixed effects α_i and go back to (1) to estimate the error
component residual

\hat{u}_{it} = y_{it} - x_{it}\hat{\theta} - \hat{\alpha}_i = f(z_{it}) + \varepsilon_{it}    (4)

The curve f can be fit by regressing the estimated residual \hat{u}_{it} on z_{it} by using some
standard nonparametric regression estimator.
A typical example of a pk series is a spline, which is a piecewise polynomial whose pieces
are defined by a sequence of knots c1 < c2 < · · · < ck and join smoothly at those knots.
The simplest case is a linear spline. For a spline of degree m, the polynomials and
their first m − 1 derivatives agree at the knots, so m − 1 derivatives are continuous (see
Royston and Sauerbrei [2007] for further details).
A spline of degree m with k knots can be represented as a power series:
S(z) = \sum_{j=0}^{m} \zeta_j z^j + \sum_{j=1}^{k} \lambda_j (z - c_j)_+^m, \qquad \text{where } (z - c_j)_+^m = \begin{cases} (z - c_j)^m & \text{if } z > c_j \\ 0 & \text{otherwise} \end{cases}
The problem here is that successive terms tend to be highly correlated. A probably
better representation of splines is a linear combination of a set of basic splines called
(kth degree) B-splines, which are defined for a set of k + 2 consecutive knots c1 < c2 <
· · · < ck+2 as
B(z, c_1, \ldots, c_{k+2}) = (k + 1) \sum_{j=1}^{k+2} \left\{ \prod_{1 \le h \le k+2,\, h \ne j} (c_h - c_j) \right\}^{-1} (z - c_j)_+^k
B-splines are intrinsically a rescaling of each of the piecewise functions. The tech-
nicalities of this method are beyond the scope of this article, and we refer the reader to
Newson (2000b) for further details.
We implemented this estimator in Stata under the command xtsemipar, which we
describe below.
The first option, nonpar(), is required. It declares that the variable enters the model
nonparametrically. None of the remaining options are compulsory. The user has the
opportunity to recover the error component residual—the left-hand side of (4)—whose
name can be chosen by specifying string2. This error component can then be used to
draw any kind of nonparametric regression. Because the error component has already
been partialled out from fixed effects and from the parametrically dependent variables,
this amounts to estimating the net nonparametric relation between the dependent and
the variable that enters the model nonparametrically. By default, xtsemipar reports
one estimation of this net relationship. string1 makes it possible to reproduce the values
of the fitted dependent variable. Note that the plot of residuals is recentered around its
mean. The remaining part of this section describes options that affect this fit.
A key option in the quality of the fit is degree(). It determines the power of
the B-splines that are used to consistently estimate the function resulting from the first
difference of the f (zit ) and f (zit−1 ) functions. The default is degree(4). If the nograph
option is not specified—that is, the user wants the graph of the nonparametric fit of the
variable in nonpar() to appear—degree() will also determine the degree of the local
weighted polynomial fit used in the Epanechnikov kernel performed at the last stage
fit. If spline is specified, this last nonparametric estimation will also be estimated by
the B-spline method, and degree() is then the power of these splines. knots1() and
knots2() are both rarely used. They define a list of knots where the different pieces
of the splines agree. If left unspecified, the number and location of the knots will be
chosen optimally, which is the most common practice. knots1() refers to the B-spline
estimation in (3). knots2() can only be used if the spline option is specified and refers
to the last stage fit. More details about B-spline can be found in Newson (2000b). The
bwidth() option can only be used if spline is not specified. It gives the half-width
of the smoothing window in the Epanechnikov kernel estimation. If left unspecified,
a rule-of-thumb bandwidth estimator is calculated and used (see [R] lpoly for more
details).
The remaining options refer to the inference. The robust and cluster() options
correct the inference, respectively, for heteroskedasticity and for clustering of error
terms. In the graph, confidence intervals can be displayed by a shaded area around
the curve of fitted values by specifying the option ci. Confidence intervals are set to
95% by default; however, it is possible to modify them by setting a different confidence
level through the level() option. This affects the confidence intervals both in the
nonparametric and in the parametric part of estimations.
4 Simulation
In this section, we show, by using some simple simulations, how xtsemipar behaves
in finite samples. At the end of the section, we illustrate how this command can be
extended to tackle some endogeneity problems.
In brief, the simulation setup is a standard fixed-effects panel of 200 individuals
over five time periods (1,000 observations). For the design space, four variables, x1 ,
x2 , x3 , and d, are generated from a normal distribution with mean μ = (0, 0, 0, 0) and
variance–covariance matrix
        x1    x2    x3    d
  x1    1
  x2    0.2   1
  x3    0.8   0.4   1
  d     0     0.3   0.6   1
Variable d is categorized in such a way that five individuals are identified by each
category of d. In practice, we generate these variables in a two-step procedure where
the x’s have two components. The first one is fixed for each individual and is correlated
with d. The second one is a random realization for each time period.
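Ignoring the two-step within/between construction just described, a one-shot draw with
the stated correlation structure could be sketched as follows (a simplification for
illustration; it does not reproduce the panel structure or the categorization of d):
. matrix C = (1, .2, .8, 0 \ .2, 1, .4, .3 \ .8, .4, 1, .6 \ 0, .3, .6, 1)
. drawnorm x1 x2 x3 d, n(1000) corr(C) clear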
Five hundred replications are carried out, and for each replication, an error term
e is drawn from an N (0, 1). The dependent variable y is generated according to the
data-generating process (DGP): y = x1 + x2 − (x3 + 2 × x3² − 0.25 × x3³) + d + e. As
is obvious from this estimation setting, multivariate regressions with individual fixed
effects should be used if we want to consistently estimate the parameters. So we regress
y on the x’s by using three regression models (a sketch of the corresponding Stata calls follows the list):
1. xtsemipar, considering that x1 and x2 enter the model linearly and x3 enters
nonparametrically.
2. xtreg, considering that x1 , x2 , and x3 enter the model linearly.
3. xtreg, considering that x1 and x2 enter the model linearly, whereas x3 enters the
model parametrically with the correct polynomial form (x3² and x3³).
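Assuming the panel identifiers are id and year and that xtsemipar takes the dependent
variable and the parametric covariates before the nonpar() option (the full syntax diagram
is not reproduced here), the three fits might look like this:
. xtset id year
. xtsemipar y x1 x2, nonpar(x3) degree(4)     // model 1: x3 nonparametric
. xtreg y x1 x2 x3, fe                        // model 2: x3 linear
. generate double x3_2 = x3^2
. generate double x3_3 = x3^3
. xtreg y x1 x2 x3 x3_2 x3_3, fe              // model 3: correct polynomial in x3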
Table 1 reports the bias and mean squared error (MSE) of coefficients associated
with x1 and x2 for the three regression models. What we find is that Baltagi and
Li’s (2002) estimator performs much better than the usual fixed-effects estimator with
linear control for x3 , in terms of both bias and efficiency. As expected, the most effi-
cient and unbiased estimator remains the fixed-effects estimator with the appropriate
polynomial specification. However, this specification is generally unknown. Figure 1
displays the average nonparametric fit of x3 (plain line) obtained in the simulation with
the corresponding 95% band. The true DGP is represented by the dotted line.
(Figure 1 here: plots of f(x3) against x3 over the range −4 to 4, including the panel
"Nonparametric prediction of x3"; legend entries: DGP, avg. fit − 4th-order Taylor exp.,
confidence interval at 95%, avg. fit − 4th-order local polyn.)
5 Conclusion
In econometrics, semiparametric regression estimators are becoming standard tools for
applied researchers. In this article, we presented Baltagi and Li’s (2002) series semi-
parametric fixed-effects regression estimator. We then introduced the Stata program
we created to put it into practice. Some simple simulations to illustrate the usefulness
and the performance of the procedure were also shown.
6 Acknowledgments
We would like to thank Rodolphe Desbordes, Patrick Foissac, our colleagues at CRED
and ECARES, and especially Wouter Gelade and Peter-Louis Heudtlass, who helped im-
prove the quality of the article. The usual disclaimer applies. François Libois wishes to
thank the ERC grant SSD 230290 for financial support. Vincenzo Verardi is an associate
researcher at the FNRS and gratefully acknowledges their financial support.
7 References
Ahamada, I., and E. Flachaire. 2008. Econométrie Non Paramétrique. Paris:
Économica.
Baltagi, B. H., and D. Li. 2002. Series estimation of partially linear panel data models
with fixed effects. Annals of Economics and Finance 3: 103–116.
Newson, R. 2000b. sg151: B-splines and splines parameterized by their values at reference
points on the x-axis. Stata Technical Bulletin 57: 20–27. Reprinted in Stata Technical
Bulletin Reprints, vol. 10, pp. 221–230. College Station, TX: Stata Press.
Royston, P., and W. Sauerbrei. 2007. Multivariable modeling with cubic regression
splines: A principled approach. Stata Journal 7: 45–70.
James W. Hardin
Institute for Families in Society
Department of Epidemiology & Biostatistics
University of South Carolina
Columbia, SC
[email protected]
Abstract. We present new Stata commands for carrying out exact Wilcoxon
one-sample and two-sample comparisons of the median. Nonparametric tests are
often used in clinical trials, in which it is not uncommon to have small samples.
In such situations, researchers are accustomed to making inferences by using exact
statistics. The ranksum and signrank commands in Stata provide only asymptotic
results, which assume normality. Because large-sample results are unacceptable
in many clinical trials studies, these researchers must use other software packages.
To address this, we have developed new commands for Stata that provide exact
statistics in small samples. Additionally, when samples are large, we provide results
based on the Student’s t distribution that outperform those based on the normal
distribution.
Keywords: st0297, ranksumex, signrankex, exact distributions, nonparametric
tests, median, Wilcoxon matched-pairs signed-rank test, Wilcoxon ranksum test
1 Introduction
Many statistical analysis methods are derived after making an assumption about the
underlying distribution of the data (for example, normality). However, one may also
consider nonparametric methods from which to draw statistical inferences where no as-
sumptions are made about an underlying population or distribution. For the nonpara-
metric equivalents to the parametric one-sample and two-sample t tests, the Wilcoxon
signed-rank test (one sample) is used to test the hypothesis that the median differ-
ence between the absolute values of positive and negative paired differences is 0. The
Wilcoxon Mann–Whitney ranksum test is used to test the hypothesis of a zero-median
difference between two independently sampled populations.
where I(Di > 0) is an indicator function that the ith difference is positive. Ranks of
tied absolute differences are averaged for the relevant set of observations. The variance
of S is given by
V = \frac{1}{24} n_r(n_r + 1)(2n_r + 1) - \frac{1}{48} \sum_{j=1}^{m} t_j(t_j + 1)(t_j - 1)
where tj is the number of values tied in absolute value for the jth rank (Lehmann 1975)
out of the m unique assigned ranks; m = nr and tj = 1 ∀j if there are no ties. The
significance of S is then computed one of two ways, contingent on sample size (nr ). If
nr > 25, the significance of S can be based on the normal approximation (as is done in
Stata’s signrank command) or on Student’s t distribution,
S \sqrt{\frac{n_r - 1}{n_r V - S^2}}
with nr − 1 degrees of freedom (Iman 1974). When nr ≤ 25, the significance of S is
computed from the exact distribution.
An algorithm for calculation of associated probabilities is the network algorithm of
Mehta and Patel (1986). Many new improvements and modifications of that algorithm
have been implemented in various applications to compute the exact p-value. Some in-
clude polynomial time algorithms for permutation distributions (Pagano and Tritchler
1983), Mann–Whitney-shifted fast Fourier transform (FFT) (Nagarajan and Keich 2009),
and decreased computation time for the network algorithm described in Requena and
Martı́n Ciudad (2006). Comprehensive summaries for exact inference methods are pub-
lished in Agresti (1992) and Waller, Turnbull, and Hardin (1995).
To allow tied ranks in the commands, we multiply all ranks by L to ensure that the
ranks and sums of ranks will be integers. This can be accomplished for our two statistics
by setting L = 2. The ranges of the values of the two statistics are easily calculated
so that we may choose Q ≥ U . Defining U as the largest possible value of our statistic
(formed from the largest possible ranks), we can choose log2 Q = ceiling{log2 (U )}. We
choose Q to be a power of 2 because of the requirements of the FFT algorithm in Stata
(Fourier analysis is carried out by using the Mata fft command).
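For intuition, the exact null distribution in the untied one-sample case can also be
tabulated by direct convolution rather than by the FFT route used in the commands. The
following Mata sketch, run from a do-file, is a toy illustration and not the commands'
internal code:
mata:
// f[s+1] = number of subsets of {1,...,n} with rank sum s; dividing by 2^n
// gives the exact null probabilities of the signed-rank statistic (no ties)
real rowvector srdist(real scalar n)
{
    real rowvector f
    real scalar k, s, maxS
    maxS = n*(n + 1)/2
    f = J(1, maxS + 1, 0)
    f[1] = 1
    for (k = 1; k <= n; k++) {
        for (s = maxS; s >= k; s--) f[s + 1] = f[s + 1] + f[s - k + 1]
    }
    return(f :/ 2^n)
}
srdist(5)
end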
Using rk to denote the rank of the kth observation, the characteristic function for
the one-sample statistic S1 is given by
\phi_1(-2\pi i j/Q) = \prod_{k=1}^{N} \left\{ \exp(-2\pi i j/Q) \cos(-2\pi j L r_k/Q) \right\}
while the characteristic function for S2 is calculated by using the difference equation
3 Stata syntax
Software accompanying this article includes the command files as well as supporting
files for dialogs and help. Equivalent to the signrank command, the basic syntax for
the new Wilcoxon signed-rank test command is
signrankex varname = exp [if] [in]
Equivalent to the ranksum command, the basic syntax for the new Wilcoxon Mann–
Whitney ranksum test command is
ranksumex varname [if] [in], by(groupvar) [porder]
4 Example
In this section, we present real-world examples with the new nonparametric Wilcoxon
test commands. In clinical trials, talinolol is used as a β blocker and is controlled by
P–glycoprotein, which protects xenobiotic compounds. Eight healthy men between the
ages of 22 and 26 were evaluated based on the serum-concentration time profiles of the
two talinolol enantiomers, S(–) talinolol and R(+) talinolol, which differ in their kinetic
profiles. The trial examined single intravenous (iv) and repeated
oral talinolol profiles before and after rifampicin comedication. Area under the serum
concentration time curves (AUC) was collected for each subject (see Zschiesche et al.
[2002]). We compare AUC values of S(–) iv talinolol before and after comedication of
rifampicin by using the Wilcoxon signed-rank test. The results are given below, where
S is the Wilcoxon signed-rank test statistic.
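The output's null hypothesis line indicates the call had the form below (reconstructed
from the syntax in section 3; the variable names are taken from the output):
. signrankex iv_s_before = iv_s_after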
positive 8 36 18
negative 0 0 18
zero 0
all 8 36 36
Ho: iv_s_before = iv_s_after
S = 18.000
Prob >= |S| = 0.0078
The results show there was a statistically significant difference (p-value = 0.0078) be-
tween iv S(–) talinolol before and after comedication of rifampicin. There were greater
S(–) talinolol AUC values shown before rifampicin administration than after.
For the Wilcoxon Mann–Whitney ranksum test example, we will use performance
data (table 1) collected on rats’ rotarod endurance (in seconds) from two treatment
groups. The rats were randomly selected to be in the control group (received saline
solvent) or the treatment group (received centrally acting muscle relaxant) (Bergmann,
Ludbrook, and Spooren 2000).
0 12 180 150
1 12 120 150
where each of the new commands returns the p-value as well as the numerator and
denominator of the exact fraction (see the return values in the previous example).
5 Summary
In this article, we introduced two supporting Stata commands for the exact nonparamet-
ric Wilcoxon signed-rank test and the Wilcoxon Mann–Whitney ranksum test. These
one-sample and two-sample test statistics can be used to assess the difference in location
(median difference) for small samples (exact distribution) and larger samples (Student’s
t distribution).
6 References
Agresti, A. 1992. A survey of exact inference for contingency tables. Statistical Science
7: 131–153.
Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley.
Nagarajan, N., and U. Keich. 2009. Reliability and efficiency of algorithms for computing
the significance of the Mann–Whitney test. Computational Statistics 24: 605–622.
Requena, F., and N. Martı́n Ciudad. 2006. A major improvement to the network al-
gorithm for Fisher’s exact test in 2 × c contingency tables. Computational Statistics
and Data Analysis 51: 490–498.
Abstract. Competing risks are present when the patients within a dataset could
experience one or more of several exclusive events and the occurrence of any one of
these could impede the event of interest. One of the measures of interest for analy-
ses of this type is the cumulative incidence function. stpm2cif is a postestimation
command used to generate predictions of the cumulative incidence function after
fitting a flexible parametric survival model using stpm2. There is also the option
to generate confidence intervals, cause-specific hazards, and two other measures
that will be discussed in further detail. The new command is illustrated through
a simple example.
Keywords: st0298, stpm2cif, survival analysis, competing risks, cumulative inci-
dence, cause-specific hazard
1 Introduction
In survival analysis, if interest lies in the true probability of death from a particular
cause, then it is important to appropriately account for competing risks. Competing
risks occur when patients are at risk of more than one mutually exclusive event, such
as death from different causes (Putter, Fiocco, and Geskus 2007). The occurrence of
a competing event may prevent the event of interest from ever occurring. It therefore
seems logical to conduct an analysis that considers these competing risks. The two
main measures of interest for analyses of this type are the cause-specific hazard and
the cumulative incidence function. The cause-specific hazard is the instantaneous risk
of dying from a specific cause given that the patient is still alive at a particular time.
The cumulative incidence function is the proportion of patients who have experienced a
particular event at a certain time in the follow-up period. Several methods are already
available to estimate this; however, it is not always clear which approach should be
used.
In this article, we explain how to fit flexible parametric models using the stpm2 com-
mand by estimating the cause-specific hazard for each cause of interest in a competing-
risks situation. The stpm2cif command is a postestimation command used to estimate
the cumulative incidence function for up to 10 competing causes along with confidence
intervals, cause-specific hazards, and two other useful measures.
2 Methods
If a patient is at risk from K different causes, the cause-specific hazard, hk (t), is the
risk of failure at time t given that no failure from cause k or any of the K − 1 other
causes has occurred. In a proportional hazards model, hk (t) is
h_k(t \mid Z) = h_{k,0}(t) \exp\left(\beta_k^T Z\right) \qquad (1)
where hk,0 (t) is the baseline cause-specific hazard for cause k, and βk is the vector of
parameters for covariates Z. The cumulative incidence function, Ck (t), can be derived
from the cause-specific hazards through the equation
C_k(t) = \int_0^t h_k(u \mid Z) \prod_{k=1}^{K} S_k(u)\, du \qquad (2)

where \prod_{k=1}^{K} S_k(u) = \exp\left(-\sum_{k=1}^{K} \int_0^u h_k\right) is the overall survival function (Prentice et al.
1978).
Several programs are currently available in Stata that can compute the cumula-
tive incidence function. The command stcompet calculates the function by using the
Kaplan–Meier estimator of the overall survival function (Coviello and Boggess 2004).
It therefore does not allow for the incorporation of covariate effects. A follow-on to
stcompet is stcompadj, which fits the cumulative incidence function based on the Cox
model or the flexible parametric regression model (Coviello 2009). However, it only
allows one competing event, and because the regression models are built into the com-
mand internally, it does not allow users to specify their own options with stcox or
stpm2. Finally, Fine and Gray’s (1999) proportional subhazards model can be fit using
stcrreg.
The flexible parametric model was first proposed by Royston and Parmar in 2002.
The approach uses restricted cubic spline functions to model the baseline log cumulative
hazard. It has the advantage over other well-known models such as the Cox model be-
cause it produces smooth predictions and can be extended to incorporate complex time-
dependent effects, again through the use of restricted cubic splines. The Stata implemen-
tation of the model using stpm2 is described in detail elsewhere (Royston and Parmar
2002; Lambert and Royston 2009). Both the cause-specific hazard (1) and the overall
survival function can be obtained from the flexible parametric model to give the inte-
grand in (2). This can be done by fitting separate models for each of the k causes, but
this will not allow for shared parameters. It is possible to fit one model for all k causes
simultaneously by stacking the data so that each individual patient has k rows of data,
one for each of the k causes. Table 1 illustrates how the data should look once they
have been stacked (in the table, CVD stands for cardiovascular disease). Each patient
can fail from one of three causes. Patient 1 is at risk from all three causes for 10 years
but does not experience any of them and so is censored. Patient 2 is at risk from all
three causes for eight years but then experiences a cardiovascular event. By expanding
the dataset, one can allow for covariate effects to be shared across the causes, although
it is possible to include covariates that vary for each cause.
3 Syntax
stpm2cif newvarlist, cause1(varname # [varname # ...]) cause2(varname # [varname # ...])
    [cause3(varname # [varname # ...]) ... cause10(varname # [varname # ...]) obs(#) ci
    mint(#) maxt(#) timename(newvar) hazard contmort conthaz]
The names specified in newvarlist coincide with the order of the causes inputted in the
options.
3.1 Options
cause1(varname # varname # ... ) . . . cause10(varname # varname # ... )
request that the covariates specified by the listed varname be set to # when pre-
dicting the cumulative incidence functions for each cause. cause1() and cause2()
are required.
obs(#) specifies the number of observations (of time) to predict. The default is
obs(1000). Observations are evenly spread between the minimum and maximum
values of follow-up time.
ci calculates a 95% confidence interval for the cumulative incidence function and stores
the confidence limits in CIF_newvar_lci and CIF_newvar_uci.
mint(#) specifies the minimum value of follow-up time. The default is set as the
minimum event time from stset.
maxt(#) specifies the maximum value of follow-up time. The default is set as the
maximum event time from stset.
timename(newvar) specifies the time variable generated during predictions for the cu-
mulative incidence function. The default is timename(_newt). This is the variable
for time that needs to be used when plotting curves for the cumulative incidence
function and the cause-specific hazard function.
hazard predicts the cause-specific hazard function for each cause.
contmort predicts the relative contribution to total mortality.
conthaz predicts the relative contribution to hazard.
4 Example
Data were used on 506 patients with prostate cancer who were randomly allocated
to treatment with diethylstilbestrol. The data have been used previously to illustrate
the command stcompet (Coviello and Boggess 2004). Patients are classified as alive
or having died from one of three causes: cancer (the event of interest), cardiovascular
disease (CVD), or other causes. To use stpm2cif, the user must first expand the dataset:
. use prostatecancer
. expand 3
(1012 observations created)
. by id, sort: generate cause= _n
. generate cancer = cause==1
. generate cvd = cause==2
. generate other = cause==3
. generate treatcancer = treatment*cancer
. generate treatcvd = treatment*cvd
. generate treatother = treatment*other
. generate event = (cause==status)
The data have been expanded so that each patient has three rows of data, one for
each cause as shown in table 1. Three indicator variables have been created for each
of the three competing causes and interactions between treatment. Three causes have
also been generated. The indicator variable event defines whether a patient has died
and the cause of death. We now need to stset the data and run stpm2.
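The stset declaration producing the output below is not shown; with the stacked data it
would be something like the following, where the name of the analysis-time variable is an
assumption:
. stset time, failure(event)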
      1518  obs. remaining, representing
       356  failures in single record/single failure data
   54898.8  total analysis time at risk, at risk from t = 0
                             earliest observed entry t = 0
                                  last observed exit t = 76
. stpm2 cancer cvd other treatcancer treatcvd treatother, scale(hazard)
> rcsbaseoff dftvc(3) nocons tvc(cancer cvd other) eform nolog
Log likelihood = -1150.4866 Number of obs = 1518
xb
cancer .2363179 .0275697 -12.37 0.000 .188015 .2970303
cvd .1801868 .0238668 -12.94 0.000 .1389876 .2335983
other .1008464 .018132 -12.76 0.000 .070895 .1434515
treatcancer .6722196 .1096964 -2.43 0.015 .4882109 .9255819
treatcvd 1.188189 .2013301 1.02 0.309 .8524237 1.65621
treatother .6345498 .1672676 -1.73 0.084 .3785199 1.063758
_rcs_cancer1 3.501847 .44435 9.88 0.000 2.730788 4.49062
_rcs_cancer2 .8842712 .0742915 -1.46 0.143 .7500191 1.042554
_rcs_cancer3 1.046436 .0371625 1.28 0.201 .9760756 1.121868
_rcs_cvd1 2.841936 .2619063 11.33 0.000 2.372299 3.404545
_rcs_cvd2 .8772848 .0498866 -2.30 0.021 .7847607 .9807176
_rcs_cvd3 1.008804 .0352009 0.25 0.802 .9421175 1.08021
_rcs_other1 2.751505 .3563037 7.82 0.000 2.134738 3.546467
_rcs_other2 .7962094 .0558593 -3.25 0.001 .6939208 .913576
_rcs_other3 .9614597 .0512891 -0.74 0.461 .8660117 1.067428
By including the three cause indicators (cancer, cvd, and other) as both main
effects and time-dependent effects (using the tvc() option), we have fit a stratified
model with three separate baselines, one for each cause. For this reason, we have used
the rcsbaseoff option together with the nocons option, which excludes the baseline
hazard from the model. The interactions between treatment and the three causes have
also been included in the model. This estimates a different treatment effect for each of
the three causes. The hazard ratios (95% confidence intervals) for the treatment effect
are 0.67 [0.49, 0.93], 1.19 [0.85, 1.66], and 0.63 [0.38, 1.06] for cancer, CVD, and other
causes, respectively.
Now that we have run stpm2, we can run the new postestimation command stpm2cif
to obtain the cumulative incidence function for each cause. Because we have two groups
of patients, treated and untreated, we must run the command twice. This will give
separate cumulative incidence functions for the treated and the untreated groups and
for each of the three causes.
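Given the variable names that appear below and the options discussed, the two calls
would look roughly like this (a reconstruction; the covariate settings passed to
cause1()-cause3() for each group are assumptions):
. stpm2cif cancer0 cvd0 other0, cause1(cancer 1) cause2(cvd 1) cause3(other 1)
>     ci maxt(60) hazard contmort conthaz
. stpm2cif cancer1 cvd1 other1, cause1(cancer 1 treatcancer 1) cause2(cvd 1 treatcvd 1)
>     cause3(other 1 treatother 1) ci maxt(60) hazard contmort conthaz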
The cause1() to cause3() options give the linear predictor for each of the three
causes for which we want a prediction. The commands have generated six new vari-
ables containing the cumulative incidence functions. The untreated group members are
denoted with a 0 at the end of the variable name, and the treated group members are
denoted with a 1. These labels come from the input into newvarlist in the above com-
mand line. The six cumulative incidence functions are therefore labeled CIF_cancer0,
CIF_cvd0, CIF_other0, CIF_cancer1, CIF_cvd1, and CIF_other1. Each of these vari-
ables has a corresponding high and low confidence bound, for example, CIF_cancer0_lci
and CIF_cancer1_uci. These were created because the ci option was specified. The
maxt() option has been specified to restrict the predictions for the cumulative incidence
function to a maximum follow-up time of 60 months; this was done for illustrative pur-
poses only.
By specifying the hazard option, we have generated cause-specific hazards that
correspond with each of the cumulative incidence functions. These are labeled as
h cancer0, h cvd0, h other0, h cancer1, h cvd1, and h other1. The options contmort
and conthaz are the two additional measures mentioned previously. The contmort op-
tion produces what we have named the “relative contribution to the total mortality”.
This is essentially the cumulative incidence function for each specific cause divided by
the sum of all the cumulative incidence functions. It can be interpreted as the prob-
ability that you will die from a particular cause given that you have died by time t.
The conthaz option produces what we have named the “relative contribution to the
overall hazard”. This is similar to the last measure in that it is the cause-specific hazard
for a particular cause divided by the sum of all the cause-specific hazards. It can be
interpreted as the probability that you will die from a particular cause given that you
die at time t.
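In symbols (this restatement is ours, not the authors'), writing F_k(t) for the cumulative incidence function and h_k(t) for the cause-specific hazard of cause k, the two measures are

\[
\mathrm{contmort}_k(t) = \frac{F_k(t)}{\sum_{j=1}^{3} F_j(t)}, \qquad
\mathrm{conthaz}_k(t) = \frac{h_k(t)}{\sum_{j=1}^{3} h_j(t)}
\]

so each is a set of proportions that sums to 1 over the three causes at any given time t.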
If we plot the cumulative incidence functions for each cause against time, we obtain
the plots shown in figure 1.
Figure 1. Cumulative incidence of cancer, CVD, and other causes of death in treated
and untreated patients with prostate cancer (two panels: Untreated and Treated)
The plots in figure 1 give the actual probabilities of dying from each cause, taking
into account the competing causes. The treated group have a lower probability of dying
from cancer or other causes compared with the untreated group, but have a higher
probability of dying from CVD.
The model fit above is relatively simple because it only considers treatment as a
predictor for the three causes of death. Age is an important factor when modeling the
probability of death, so we shall now consider a model including age as a continuous
variable with a time-dependent effect. Although the effect of age will most likely differ
between the three causes of death, for demonstrative purposes, we will assume that the
effect of age can be shared across all three causes. This is one of the main advantages of
stacking the data as shown previously. The stpm2 command can be rerun to include age
in both the variable list and the tvc() option. The three cause indicators (cancer, cvd,
and other) remain as time-dependent effects with 3 degrees of freedom to maintain the
stratified model with three separate baselines. Age is now included as a time-dependent
effect with only 1 degree of freedom.
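The refitted command is not shown in this extract; a plausible form, with the degrees of freedom split as just described (3 for each cause indicator and 1 for age; the dftvc() syntax used here is our assumption), is:

. stpm2 cancer cvd other treatcancer treatcvd treatother age, scale(hazard)
>     rcsbaseoff dftvc(age:1 3) nocons tvc(cancer cvd other age) eform nolog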
                     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]

xb
cancer .0146644 .0103431 -5.99 0.000 .0036804 .0584297
cvd .0109487 .0078047 -6.33 0.000 .0027076 .0442727
other .0061321 .0044357 -7.04 0.000 .0014856 .0253121
treatcancer .6862214 .112055 -2.31 0.021 .4982751 .9450598
treatcvd 1.208279 .2048582 1.12 0.264 .8666626 1.684553
treatother .6468979 .1705538 -1.65 0.099 .3858491 1.084561
age 1.039325 .009951 4.03 0.000 1.020004 1.059013
_rcs_cancer1 15.00829 10.53069 3.86 0.000 3.793838 59.37229
_rcs_cancer2 .897379 .0757066 -1.28 0.199 .7606152 1.058734
_rcs_cancer3 1.046672 .0375211 1.27 0.203 .9756565 1.122858
_rcs_cvd1 12.71949 9.099712 3.55 0.000 3.129737 51.69301
_rcs_cvd2 .8897659 .0515221 -2.02 0.044 .7943039 .9967009
_rcs_cvd3 1.013176 .0358712 0.37 0.712 .9452535 1.085979
_rcs_other1 12.19435 8.773031 3.48 0.001 2.976976 49.95076
_rcs_other2 .7976752 .0553103 -3.26 0.001 .6963126 .9137932
_rcs_other3 .9682531 .0511771 -0.61 0.542 .8729684 1.073938
_rcs_age1 .980301 .0092192 -2.12 0.034 .9623973 .9985378
As before, we can now use the stpm2cif command to obtain the cumulative incidence
functions for cancer, CVD, and other causes. This time, we want to predict for ages 65
and 75 in both of the two treatment groups, so we will need to run the command four
times.
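The four stpm2cif calls are omitted from this extract; one of them might look as follows (the covariate-value pairs, including setting age to 65 within each cause#() option, are our assumption):

. stpm2cif age65cancer1 age65cvd1 age65other1, cause1(cancer 1 treatcancer 1 age 65)
>     cause2(cvd 1 treatcvd 1 age 65) cause3(other 1 treatother 1 age 65)
>     ci maxt(60)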
The stpm2cif commands have generated 12 new variables for the cumulative inci-
dence functions, labeled CIF age65cancer0, CIF age65cvd0, CIF age65other0,
CIF age65cancer1, CIF age65cvd1, CIF age65other1, CIF age75cancer0,
CIF age75cvd0, CIF age75other0, CIF age75cancer1, CIF age75cvd1, and
CIF age75other1. A 65 next to age represents the prediction for those 65 years old; a
75 represents a prediction for those 75 years old.
Rather than plotting the cumulative incidence function as a line for each cause
separately as we did previously, we display them by stacking them on top of each other.
This produces a graph as shown in figure 2. To do this, we need to generate new
variables that sum up the cumulative incidence functions. This is done for each of the
two treatment groups and two ages. The code shown below is for the 65-year-olds in
the treatment group only.
. generate age65treat1 = CIF_age65cancer1
(518 missing values generated)
. generate age65treat2 = age65treat1+CIF_age65cvd1
(518 missing values generated)
. generate age65treat3 = age65treat2+CIF_age65other1
(518 missing values generated)
. twoway (area age65treat3 _newt, sort fintensity(100))
> (area age65treat2 _newt, sort fintensity(100))
> (area age65treat1 _newt, sort fintensity(100)), ylabel(0(0.2)1, angle(0)
> format(%3.1f)) ytitle("") xtitle("")
> legend(order(3 "Cancer" 2 "CVD" 1 "Other") rows(1) size(small))
> title("Treated") plotregion(margin(zero)) scheme(sj)
> saving(treatedage65, replace)
Figure 2. Stacked cumulative incidence of cancer, CVD, and other causes of death for
those aged 65 and 75 in treated and untreated patients with prostate cancer (four
panels: Age 65 and Age 75, each with Treated and Untreated; y axis, cumulative
incidence from 0.0 to 1.0; x axis, follow-up time from 0 to 60 months)
The results in figure 2 allow us to visualize the total probability of dying in both
the treated and the untreated groups for those aged 65 and 75 and allow us to see how
this is broken down by the specific causes. As expected, the total probability of death
is higher at the older age in both treatment groups. The distribution of deaths across
the three causes in each treatment group is roughly the same at both ages. Again
we see that although the treatment reduces the total probability of death, it actually
increases the probability of death from CVD.
Using a similar process to the one used above to obtain the stacked cumulative
incidence plots, we can also produce stacked plots of the relative contribution to the
total mortality and the relative contribution to the hazard. These graphs are shown in
figures 3 and 4.
Figure 3. Relative contribution to the total mortality for those aged 65 and 75 in treated
and untreated patients with prostate cancer (four panels: Age 65 and Age 75, each
with Treated and Untreated; y axis, relative contribution to the total mortality from
0.0 to 1.0; x axis, follow-up time from 0 to 60 months)
Figure 3 shows the relative contribution to the total mortality for those aged 65
and 75 in the two treatment groups. If we focus on the 65-year-olds in the treated
group, the plot shows that, given that a treated 65-year-old patient dies by 40 months,
the probability of dying from cancer is 0.39, the probability of dying from CVD is 0.48,
and the probability of dying from other causes is 0.13. If the patient is untreated, the
corresponding probabilities are 0.49, 0.34, and 0.17, respectively.
Figure 4. Relative contribution to the hazard for those aged 65 and 75 in treated and
untreated patients with prostate cancer (four panels: Age 65 and Age 75, each with
Treated and Untreated; y axis, relative contribution to the overall hazard from 0.0 to
1.0; x axis, follow-up time from 0 to 60 months)
Figure 4 shows the relative contribution to the overall hazard for those aged 65 and 75
in the two treatment groups. Again focusing on the 65-year-olds in the treated group,
the plot shows that, given that a treated 65-year-old patient dies at 40 months, the
probability of dying from cancer is 0.39, the probability of dying from CVD is 0.45,
and the probability of dying from other causes is 0.16. If the patient is untreated, the
corresponding probabilities are 0.48, 0.32, and 0.20, respectively.
5 Conclusion
The new command stpm2cif provides an extension to the command stpm2 to enable
users to estimate the cumulative incidence function after fitting a flexible parametric
survival model. We hope that it will be a useful tool in medical research.
where π(x) is the probability of success given the set of covariates x = (x1 , . . . , xp ). Con-
sidering β = (β0 , . . . , βp ), the vector containing the unknown parameters in (1), under
the assumption of independent outcomes, we can obtain the corresponding maximum
likelihood estimates β by maximizing the following log-likelihood function:
\[
\sum_{i=1}^{n}\left[\,y_i \ln\{\pi(\mathbf{x}_i)\} + (1 - y_i)\ln\{1 - \pi(\mathbf{x}_i)\}\,\right] \tag{2}
\]
where n is the total number of observations and yi is the observed outcome for the
ith subject. This situation is based on having subjects as analytical units; thus the
data layout presents one record for each individual considered in the dataset (individual
format).
When one works with categorical data, it is possible (and frequently more useful)
to consider some groups of subjects as units of analysis. These groups correspond
to the covariate patterns (that is, the specific combinations of predictor values xj ).
Thus it is possible to reshape the dataset so that each record will correspond to a
particular covariate pattern or profile (events–trials format), including the total number
of individuals and total number of successes (deaths, recoveries, etc.). In this case, the
goal is to predict the proportion of successes for each group. The quantity π will be the
same for any individual in the same group (Kleinbaum and Klein 2010), and we adopt
the binomial distribution as reference to model this probability. So if we rewrite the
log-likelihood function (2) in terms of covariate patterns, we obtain
\[
\sum_{j=1}^{K}\left[\,s_j \ln\{\pi(\mathbf{x}_j)\} + (m_j - s_j)\ln\{1 - \pi(\mathbf{x}_j)\}\,\right] \tag{3}
\]
where K is the total number of possible (observed) covariate patterns, sj represents the
number of successes, mj is the number of total individuals, and π(xj ) is the proportion
of successes corresponding to the jth covariate pattern. Therefore, despite the different
structures, the information contained is exactly the same, and so the parameter
estimates obtained from (2) and (3) are exactly the same.
Having defined the log-likelihood function, we can perform the assessment of good-
ness of fit with different methods. In this article, we focus our attention on the likelihood-
ratio test (LRT) based on the deviance statistics. The deviance statistic compares, in
terms of likelihood, the model being fit with the saturated model. The deviance statistic
for a generalized linear model (see Agresti [2007]) is defined as
\[
G^2 = 2\left[\ln\{L_s(\widehat{\boldsymbol\beta})\} - \ln\{L_m(\widehat{\boldsymbol\beta})\}\right] \tag{4}
\]
where $\ln\{L_m(\widehat{\boldsymbol\beta})\}$ is the maximized log likelihood of the model of interest and $\ln\{L_s(\widehat{\boldsymbol\beta})\}$
is the maximized log likelihood of the saturated model. This quantity can also be
interpreted as a comparison between the values predicted by the fitted model and those
predicted by the most complete model. Evidence for model lack-of-fit occurs when the
value of G2 is large (see Hosmer et al. [1997]).
\[
H_0\colon \boldsymbol\beta_h = \mathbf{0}
\]
where βh is the vector containing the additional parameters of the saturated model
compared with the model considered. So H0 is rejected when
\[
G^2 \geq \chi^2_{1-\alpha}
\]
where α is the level of significance. If H0 cannot be rejected, we can safely conclude that
the fit of the model of interest is substantially similar to that of the most complete
model that can be built (see section 2). We must clarify that the LRT can always be
used to compare two nested models in terms of differences of deviances.
This approach is generally followed in the case of continuous covariates whose values
cannot be grouped into categorical values. In fact, in this situation, each covariate
pattern will most likely correspond to one subject (n = K), and obviously, the most
reasonable analytical unit is the subject. However, in this case, the distribution of the
G2 goodness-of-fit statistic cannot be approximated by a χ2 distribution (see Kuss
[2002] and Kleinbaum and Klein [2010]).
\[
G^2 = 2\left[\ln\{L_s(\widehat{\boldsymbol\beta})\} - \ln\{L_m(\widehat{\boldsymbol\beta})\}\right]
    = 2\sum_{j=1}^{K}\left[\,s_j \ln\left\{\frac{\widehat\pi_s(\mathbf{x}_j)}{\widehat\pi_m(\mathbf{x}_j)}\right\}
      + (m_j - s_j)\ln\left\{\frac{1-\widehat\pi_s(\mathbf{x}_j)}{1-\widehat\pi_m(\mathbf{x}_j)}\right\}\right]
\]
where $\widehat\pi_s(\mathbf{x}_j)$ is the proportion of successes for the jth covariate pattern predicted by
the saturated model and $\widehat\pi_m(\mathbf{x}_j)$ is the one predicted by the fitted model.
The collapsing approach has a main drawback: it uses different saturated models
corresponding to different models of interest, complicating the comparison of their re-
sults in terms of goodness of fit. On the other hand, the contingency approach may
require listing a large number of covariates. In this case, we could have many
covariate patterns with a small number of subjects, making the use of the χ2 approx-
imation in the LRT for goodness of fit difficult once again. A possible remedy could
be that the hypothetical saturated model in the contingency approach should be based
on variables identified through the corresponding directed acyclic graphs. In a causal
inference framework, we could then use only the variables suggested by the d-separation
algorithm applied to the directed acyclic graph, which requires the researcher to specify
the interrelationships among the variables (Greenland, Pearl, and Robins 1999).
class passenger, third-class passenger, or crew), and age (adult or child), which defines
16 different covariate patterns (among which 14 were observed). The outcome of interest
is either passenger’s survival (1 = survivor, 0 = deceased) or the number of survivors
and total number of passengers.
As anticipated above, two possible ways to represent these data can be considered
with respect to the goal and the unit of analysis (Kleinbaum and Klein 2010):
• Individual-record format: One record for each subject considered with the infor-
mation on survival (or death) contained in a binary variable (individ.txt).
• Events–trials format: One record for each covariate pattern with frequencies on
survivors and total number of passengers (grouped.txt) available as follows:
OIM
survival Coef. Std. Err. z P>|z| [95% Conf. Interval]
econ_status
2 .8808128 .1569718 5.61 0.000 .5731537 1.188472
3 -.0717844 .1709268 -0.42 0.675 -.4067948 .263226
4 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916
The LRT for goodness of fit can be obtained with the following code:
. scalar dev=e(deviance)
. scalar df=e(df)
. display "GOF casewise "" G^2="dev " df="df " p-value= " chiprob(df, dev)
GOF casewise G^2=2228.9128 df=2196 p-value= .30705384
Thus the deviance statistic G2 is 2228.91 with 2196 (= 2201 − 5) degrees of freedom,
and the p-value for the deviance test is 0.3071. We notice that, as expected, the G2
statistic corresponds to −2 ln{Lm(β)} evaluated at the maximum likelihood estimates
(= −2 × [−1114.46]). So in this case, the null hypothesis cannot be rejected, and the
fit of the model of interest is not different from the fit of the saturated model.
In Stata, we also need to add the variable containing the number of trials, n, in the
family() option:
. insheet using grouped.txt, tab clear
(5 vars, 16 obs)
. generate male = sex=="Male"
. encode status, generate(econ_status)
. glm survival i.male i.econ_status if n>0, family(binomial n) link(logit)
Iteration 0: log likelihood = -91.841683
Iteration 1: log likelihood = -89.026084
Iteration 2: log likelihood = -89.019672
Iteration 3: log likelihood = -89.019672
Generalized linear models No. of obs = 14
Optimization : ML Residual df = 9
Scale parameter = 1
Deviance = 131.4183066 (1/df) Deviance = 14.60203
Pearson = 127.8463371 (1/df) Pearson = 14.20515
Variance function: V(u) = u*(1-u/n) [Binomial]
Link function : g(u) = ln(u/(n-u)) [Logit]
AIC = 13.43138
Log likelihood = -89.01967223 BIC = 107.6668
OIM
survival Coef. Std. Err. z P>|z| [95% Conf. Interval]
econ_status
2 .8808128 .1569718 5.61 0.000 .5731537 1.188472
3 -.0717844 .1709268 -0.42 0.675 -.4067948 .263226
4 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916
The parameter estimates do not change from those obtained with the casewise approach.
But as expected, the deviance statistic (131.42) has decreased substantially; the degrees
of freedom have changed (9 = 14 − 5); and the p-value for the deviance test now leads
us to reject the null hypothesis, implying that the fit of the model of interest is not as
good as that of the saturated model.
Thus, concerning the grouped dataset and by using the egen command, we first generate
a variable that allows us to identify all the possible covariate patterns referring just to
the variables male and econ status.
Second, we collapse the data by using the variable obtained in the previous step
and applying it to the two variables introduced into the model of interest (male and
econ status). In this way, we obtain a dataset where each record corresponds to a
covariate pattern identified by the combination of the covariates in the model.
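The two steps are not shown as commands in this extract. A minimal sketch (the name of the pattern variable and the assumption that the counts are held in survival and n are ours) is:

. egen pattern = group(male econ_status)
. collapse (sum) survival n, by(pattern male econ_status)
. glm survival i.male i.econ_status, family(binomial n) link(logit)

This produces the eight-record dataset listed below, one record per covariate pattern defined by male and econ_status (in our assumed column order: pattern, survivors, total passengers, male, econ_status).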
1. 1 20 23 0 1
2. 2 141 145 0 2
3. 3 93 106 0 3
4. 4 90 196 0 4
5. 5 192 862 1 1
6. 6 62 180 1 2
7. 7 25 179 1 3
8. 8 88 510 1 4
OIM
survival Coef. Std. Err. z P>|z| [95% Conf. Interval]
econ_status
2 .8808128 .1569718 5.61 0.000 .5731537 1.188472
3 -.0717844 .1709268 -0.42 0.675 -.4067948 .263226
4 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916
. scalar dev=e(deviance)
. scalar df=e(df)
. display "GOF contingency "" G^2="dev " df="df " p-value= " chiprob(df, dev)
GOF contingency G^2=65.179831 df=3 p-value= 4.591e-14
By reshaping the data, we obtain the results according to the collapsing approach
definition. As in the contingency table approach, we reject H0 , but now the value of
the deviance statistic has changed to 65.18 with 3 (= 8 − 5) degrees of freedom. As
expected, estimates do not change.
4 Discussion
The casewise approach is often considered the standard for defining the saturated model.
The reason is that the analysis is focused on subjects, and the saturated model, instead
of the fully parameterized model, is seen as the model that gives the “perfect fit” (see
Kleinbaum and Klein [2010]). This fact does not affect the estimation process; however,
it fatally compromises the inferential step in a goodness-of-fit evaluation where the χ2
approximation becomes questionable. The consideration of the other approaches can
lead to different and meaningful results in terms of both descriptive and inferential
analysis, but the problem is how to implement them correctly with the statistical
package we are working with.
Considering Stata 12.1, we have noticed that in all cases, the default procedures for
goodness of fit consider the saturated model to be the one with as many covariates as
the number of records present in the dataset. Thus, using an individual data layout,
we obtain results relative to the casewise saturated model, where the analytical units
are subjects. However, when considering an events–trials data format, we assess the
goodness of fit based on the contingency table approach, where the unit of analysis
is the covariate pattern defined by the possible values of all the independent variables
in the dataset. The less intuitive implementation is the one based on the collapsing
approach, which uses the covariate patterns defined by the variables involved in the
model. One simple solution could be to build a new dataset containing only these
variables, like we did with the useful commands egen and collapse, which are very
helpful in showing how the collapsing approach works.
5 References
Agresti, A. 2007. An Introduction to Categorical Data Analysis. 2nd ed. Hoboken, NJ:
Wiley.
Greenland, S., J. Pearl, and J. M. Robins. 1999. Causal diagrams for epidemiologic
research. Epidemiology 10: 37–48.
Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd
ed. New York: Springer.
Kuss, O. 2002. Global goodness-of-fit tests in logistic regression with sparse data.
Statistics in Medicine 21: 3789–3801.
Simonoff, J. S. 1998. Logistic regression, categorical predictors, and goodness-of-fit: It
depends on who you ask. American Statistician 52: 10–14.
Tim J. Cole
MRC Centre of Epidemiology for Child Health
UCL Institute of Child Health
London, UK
[email protected]
Huiqi Pan
MRC Centre of Epidemiology for Child Health
UCL Institute of Child Health
London, UK
[email protected]
1 Introduction
Comparison of anthropometric data from children of different ages is complicated by
the fact that children are still growing. We cannot directly compare the height of a
5-year-old with that of a 10-year-old. Clinicians and researchers are often interested in
determining how a child compares with other children of the same age and sex: Is the
child taller, shorter, or about the same height as the average for his or her age and sex?
The growth references available to zanthro() tabulate values obtained by the LMS
method, developed by Cole (1990) and Cole and Green (1992). The LMS values are used
to transform raw anthropometric data, such as height, to standard deviation scores (z
scores). These are standardized to the reference population for the child’s age and sex (or
for length/height and sex). Two sets of population-based reference data that were widely
used at the time zanthro() was initially developed are the 2000 Centers for Disease Con-
trol and Prevention (CDC) Growth Reference in the United States (Kuczmarski et al.
2000) and the British 1990 Growth Reference (Cole, Freeman, and Preece 1998). Since
then, the following population-based reference data have been released and are now
available in zanthro(): the WHO Child Growth Standards, the WHO Reference 2007,
the UK-WHO Preterm Growth Reference, and the UK-WHO Term Growth Reference.
• A postnatal section from 2 weeks to 4 years copied from the WHO Child Growth
Standards.
• The 4–20 years section from the British 1990 Growth Reference.
Term infants are those born at 37 completed weeks’ gestation and beyond. The UK-
WHO Term Growth Reference can be used for these infants. For infants born before 37
completed weeks’ gestation, the UK-WHO Preterm Growth Reference can be used, with
gestationally corrected age.
Figure 1. Use of linear interpolation for charts with different age ranges (the diagram
uses the CDC weight chart, tabulated from age 0 to 20 years, equivalently 0 to 240
months, to contrast the ages covered by zanthro() and by LMSgrowth, with linear
interpolation between tabulated ages)
2 Syntax
egen [type] newvar = zanthro(varname,chart,version) [if] [in] ,
      xvar(varname) gender(varname) gencode(male=code, female=code)
      [ageunit(unit) gestage(varname) nocutoff]

egen [type] newvar = zbmicat(varname) [if] [in] ,
      xvar(varname) gender(varname) gencode(male=code, female=code)
      [ageunit(unit)]
3 Functions
zanthro(varname,chart,version) calculates z scores for anthropometric measures in
children and adolescents according to United States, UK, WHO, and composite UK-
WHO reference growth charts. The three arguments are the following:
varname is the variable name of the measure in your dataset for which z scores are
calculated (for example, height, weight, or BMI).
chart; see tables 3–7 for a list of valid chart codes.
version is US, UK, WHO, UKWHOpreterm, or UKWHOterm. US calculates z scores by using
the 2000 CDC Growth Reference; UK uses the British 1990 Growth Reference; WHO
uses the WHO Child Growth Standards and WHO Reference 2007 composite data
files as the reference data; and UKWHOpreterm and UKWHOterm use the British and
WHO Child Growth Standards composite data files for preterm and term births,
respectively.
zbmicat(varname) categorizes children and adolescents aged 2–18 years into three
thinness grades, normal weight, overweight, and obese by using BMI cutoffs (table 2).
BMI is in kg/m². This function generates a variable with the following values and
labels:
Note that since the previous version of zbmicat(), the value label for BMI category
has been changed from 1 = Normal wt, 2 = Overweight, and 3 = Obese.
4 Options
xvar(varname) specifies the variable used (along with gender) as the basis for stan-
dardizing the measure of interest. This variable is usually age but can also be length
or height when the measurement is weight; that is, weight-for-age, weight-for-length,
and weight-for-height are all available growth charts.
gender(varname) specifies the gender variable. It can be string or numeric. The codes
for male and female must be specified by the gencode() option.
gencode(male=code, female=code) specifies the codes for male and female. The gen-
der can be specified in either order, and the comma is optional. Quotes around the
codes are not allowed, even if the gender variable is a string.
ageunit(unit) gives the unit for the age variable and is only valid for measurement-for-
age charts; that is, omit this option when the chart code is wl or wh (see section 5).
The unit can be day, week, month, or year. This option may be omitted if the unit
is year, because this is the default. Time units are converted as follows:
1 year = 12 months = 365.25/7 weeks = 365.25 days
1 month = 365.25/84 weeks = 365.25/12 days
1 week = 7 days
Note: Ages cannot be expressed to full accuracy for all units. The consequence of
this will be most apparent at the extremes of age in the growth charts, where z
scores may be generated when the age variable is in one unit and missing for some
of those same ages when they have been converted to another unit.
gestage(varname) specifies the gestational age variable in weeks. This option enables
age to be adjusted for gestational age. The default is 40 weeks. If gestational age is
greater than 40 weeks, the child’s age will be corrected by the amount over 40 weeks.
A warning will be given if the gestational age variable contains a nonmissing value
over 42. As with the ageunit() option, this option is only valid for measurement-
for-age charts.
nocutoff forces calculation of all z scores, allowing for extreme values in your dataset.
By default, any z scores with absolute values greater than or equal to 5 (that is,
values that are 5 standard deviations or more away from the mean) are set to missing.
The decision to have a default cutoff at 5 standard deviations from the mean was
made as a way of attempting to capture extreme data entry errors. Apart from this
and setting to missing any z scores where the measurement is a nonpositive number,
these functions will not automatically detect data errors. As always, please check
your data!
5 Growth charts
Growth charts available in zanthro() are presented in tables 3–7. Note: Where xvar()
is outside the permitted range, zanthro() and zbmicat() return a missing value.
Length/height and BMI growth data are available from 33 weeks gestation. Weight and
head circumference growth data are available from 23 weeks gestation.
Table 5. WHO Child Growth Charts and WHO Reference 2007 Charts, version WHO
Length/height growth data are available from 25 weeks gestation. Weight and head
circumference growth data are available from 23 weeks gestation.
Length/height, weight, and head circumference growth data are available from 37 weeks
gestation.
6 Examples
Below is an illustration with data on a set of British newborns. The British 1990 Growth
Reference is used; the variable sex is coded male = 1, female = 2; and the variable
gestation is “completed weeks gestation”.
. use zwtukeg
. list, noobs abbreviate(9)
     sex   ageyrs   weight   gestation

       1      .01     3.53          38
       2     .073     5.05          40
       2     .115     4.68          42
       1     .135     4.89          36
       2     .177     2.75          28
To compare the weight of the babies in this sample, for instance, with respect to
socioeconomic grouping, we can convert weight to standardized z scores. The z scores
are created using the following command:
. egen zwtuk = zanthro(weight,wa,UK), xvar(ageyrs) gender(sex)
> gencode(male=1, female=2)
(Z values generated for 5 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)
In the command above, we have assumed all are term births. If some babies are born
prematurely, we can adjust for gestational age as follows.
. egen zwtuk_gest = zanthro(weight,wa,UK), xvar(ageyrs) gender(sex)
> gencode(male=1, female=2) gestage(gestation)
(Z values generated for 5 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)
Note that at gestation = 40 weeks, the z score is the same whether or not the
gestage() option is used. The formula for gestationally corrected age is
actual age + (gestation at birth − 40)
where “actual age” and “gestation at birth” are in weeks.
Gestational age may be recorded as weeks and days, as in the following example:
gestwks gestdays
38 3
40 6
42 0
36 2
28 1
These variables first need to be combined into a single gestation variable, which can
then be used with the gestage() option:
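The combining step itself is omitted from this extract; a minimal sketch (the name gestation for the combined variable is our assumption) is:

. generate gestation = gestwks + gestdays/7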
Here we use the UK-WHO Term Growth Reference for term babies:
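The command is not shown here; one plausible form, reusing the earlier variable names and the weight-for-age chart code wa, is:

. egen zwt_term = zanthro(weight,wa,UKWHOterm) if gestation >= 37,
>     xvar(ageyrs) gender(sex) gencode(male=1, female=2)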
Here we use the UK-WHO Preterm Growth Reference for preterm babies, adjusting
for gestational age:
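Again the command is not shown; a corresponding sketch (variable names are our assumption) is:

. egen zwt_preterm = zanthro(weight,wa,UKWHOpreterm) if gestation < 37,
>     xvar(ageyrs) gender(sex) gencode(male=1, female=2) gestage(gestation)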
Note: Where the gestationally corrected age is from 37 to 42 weeks, the UK-WHO
preterm and term growth charts generate different z scores. For example, the gestation-
ally corrected age of a 2-week-old baby girl who was born at 37 weeks gestation is 39
weeks. If her weight is 3.34 kg, the following z scores are generated using the UK-WHO
preterm and term growth charts:
To determine the proportion of children who are thin, normal weight, overweight,
and obese, we can categorize each child by using the following command:
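The categorization command itself is omitted from this extract; a hedged sketch (the BMI variable name bmi and the other variable names are our assumptions) is:

. egen bmicat = zbmicat(bmi), xvar(ageyrs) gender(sex) gencode(male=1, female=2)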
. list, noobs
7 Acknowledgment
This work was supported by the Victorian Government’s Operational Infrastructure
Support Program.
8 References
Binagwaho, A., N. Ratnayake, and M. C. Smith Fawzi. 2009. Holding multilateral
organizations accountable: The failure of WHO in regards to childhood malnutrition.
Health and Human Rights 10(2): 1–4.
Cole, T. J. 1990. The LMS method for constructing normalized growth standards.
European Journal of Clinical Nutrition 44: 45–60.
Abstract. I describe the use of the Bonferroni and Holm formulas as approxi-
mations for Šidák and Holland–Copenhaver formulas when issues of precision are
encountered, especially with q-values corresponding to very small p-values.
Keywords: st0300, parmest, qqvalue, smileplot, multproc, multiple-test procedure,
familywise error rate, Bonferroni, Šidák, Holm, Holland, Copenhaver
1 Introduction
Frequentist q-values for a range of multiple-test procedures are implemented in Stata by
using the package qqvalue (Newson 2010), downloadable from the Statistical Software
Components (SSC) archive. The Šidák q-value for a p-value p is given by $q_{\mathrm{sid}} = 1 - (1 - p)^m$,
where m is the number of multiple comparisons (Šidák 1967). It is a less
conservative alternative to the Bonferroni q-value, given by $q_{\mathrm{bon}} = \min(1, mp)$. However,
the Šidák formula may be incorrectly evaluated by a computer to 0 when the input p-
value is too small to give a result lower than 1 when subtracted from 1, which is the
case for p-values of $10^{-17}$ or less, even in double precision. q-values of 0 are logically
possible as a consequence of p-values of 0, but in this case, they may be overliberal. This
liberalism may possibly be a problem in the future, given the current technology-driven
trend of exponentially increasing multiple comparisons and the human-driven problem
of ingenious data dredging. I present a remedy for this problem and discuss its use in
computing q-values and discovery sets.
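As a quick illustration of the precision failure just described (this check is ours, not part of the original note), take m = 20 comparisons and a p-value of 1e-18: the Šidák formula underflows to 0 in double precision, whereas the Bonferroni value remains usable and is essentially what the Šidák formula should have returned at this magnitude:

. display 1 - (1 - 1e-18)^20
0

. display min(1, 20*1e-18)
2.000e-17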
A similar argument shows that the same problem exists with the q-values output
by the Holland–Copenhaver procedure (Holland and Copenhaver 1987). If the m input
p-values, sorted in ascending order, are denoted pi for i from 1 to m, then the Holland–
Copenhaver procedure is defined by the formula
\[
s_i = 1 - (1 - p_i)^{m-i+1}
\]
where si is the ith s-value. (In the terminology of Newson [2010], s-values are truncated
at 1 to give r-values, which are in turn input into a step-down procedure to give the
eventual q-values.) The remedy used by qqvalue here is to substitute the s-value
formula for the procedure of Holm (1979), which is
\[
s_i = (m - i + 1)\,p_i
\]
whenever 1 − pi is evaluated as 1. This also works because the two s-value formulas
converge in ratio as pi tends to 0. Note that the Holm procedure is derived from the
Bonferroni procedure by using the same step-down method as is used to derive the
Holland–Copenhaver procedure from the Šidák procedure.
\[
c_i = 1 - (1 - p_{\mathrm{unc}})^{1/(m-i+1)}
\]
for i from 1 to m and selects the corrected critical p-value corresponding to a given
uncorrected critical p-value from these candidates by using a step-down procedure. If
the quantity $(1 - p_{\mathrm{unc}})^{1/(m-i+1)}$ is evaluated as 1, then smileplot substitutes the
corresponding Holm critical p-value threshold
\[
c_i = p_{\mathrm{unc}}/(m - i + 1)
\]
which again is conservative as m − i + 1 becomes large (corresponding to the smallest
p-values from a large number of multiple comparisons), but is less conservative than the
value of 0, which would otherwise be computed.
Newson (2010) argues that q-values are an improvement on discovery sets because,
given the q-values, different members of the audience can apply different input critical
p-values and derive their own discovery sets. The technical issue of precision presented
here may be one more minor reason for preferring q-values to discovery sets.
4 Acknowledgment
I would like to thank Tiago V. Pereira of the University of São Paulo in Brazil for
drawing my attention to this issue of precision with the Šidák and Holland–Copenhaver
procedures.
5 References
Holland, B. S., and M. D. Copenhaver. 1987. An improved sequentially rejective Bon-
ferroni test procedure. Biometrics 43: 417–423.
Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics 6: 65–70.
Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata
Journal 3: 245–269.
Newson, R., and the ALSPAC Study Team. 2003. Multiple-test procedures and smile
plots. Stata Journal 3: 109–132.
Newson, R. B. 2010. Frequentist q-values for multiple-test procedures. Stata Journal
10: 568–584.
Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal
distributions. Journal of the American Statistical Association 62: 626–633.
Stephanie Knox
Centre for Health Economics Research and Evaluation
University of Technology, Sydney
Sydney, Australia
[email protected]
Abstract. In this article, we describe the gmnl Stata command, which can be
used to fit the generalized multinomial logit model and its special cases.
Keywords: st0301, gmnl, gmnlpred, gmnlcov, generalized multinomial logit, scale
heterogeneity multinomial logit, maximum simulated likelihood
1 Introduction
Explaining variations in the behaviors of individuals is of central importance in choice
analysis. For the last decade, the most popular explanation has been preference or taste
heterogeneity; that is, some individuals care more about particular product attributes
than do others. This assumption is most naturally represented via random parameter
models, among which the mixed logit (MIXL) model has become the standard to use
(McFadden and Train 2000).
Recently, however, a group of researchers (for example, Louviere et al. [1999], Lou-
viere et al. [2002], Louviere and Eagle [2006], and Louviere et al. [2007]) has argued that
in most choice contexts, much of the preference heterogeneity may be better described
as “scale” heterogeneity; that is, with attribute coefficients fixed, the scale of the id-
iosyncratic error term is greater for some consumers than it is for others. Because the
scale of the error term is inversely related to the error variance, this argument implies
that choice behavior is more random for some consumers than it is for others. Although
the scale of the error term in discrete choice models cannot be separately identified from
the attribute coefficients, it is possible to identify relative scale terms across consumers.
Thus the statement that all heterogeneity is in the scale of the error term “is observa-
tionally equivalent to the statement that heterogeneity takes the form of the vector of
utility weights being scaled up or down proportionately as one ‘looks’ across consumers”
(Fiebig et al. 2010). These arguments have led to the scale heterogeneity multinomial
logit (S-MNL) model, a much more parsimonious model specification than MIXL.
To accommodate both preference and scale heterogeneity, Fiebig et al. (2010) devel-
oped a generalized multinomial logit (G-MNL) model that nests MIXL and S-MNL. Their
research also shows that the two sources of heterogeneity often coexist but that their
importance varies in different choice contexts.
In this article, we will describe the gmnl Stata command, which can be used to fit the
G-MNL model and its special cases. The command is a generalization of the mixlogit
command developed by Hole (2007). We will also present an empirical example that
demonstrates how to use gmnl, and we will discuss related computational issues.
\[
\boldsymbol\beta_i = \sigma_i\boldsymbol\beta + \{\gamma + \sigma_i(1 - \gamma)\}\boldsymbol\eta_i \tag{2}
\]
1. We could also consider a different number of alternatives and choice situations for each respondent;
for example, see Greene and Hensher (2010). The gmnl command can handle both of these cases.
2. Greene and Hensher (2010) call this the “scaled mixed logit model”.
• MIXL: $\boldsymbol\beta_i = \boldsymbol\beta + \boldsymbol\eta_i$ (when $\sigma_i = 1$)
• S-MNL: $\boldsymbol\beta_i = \sigma_i\boldsymbol\beta$ (when $\mathrm{var}(\boldsymbol\eta_i) = 0$)
• Standard multinomial logit: $\boldsymbol\beta_i = \boldsymbol\beta$ (when $\sigma_i = 1$ and $\mathrm{var}(\boldsymbol\eta_i) = 0$)
The gmnl command includes an option for fitting MIXL models, but we recommend that
mixlogit be used for this purpose because it is usually faster.
To complete the model specification, we need to choose a distribution for σi . Al-
though any distribution defined on the positive real line is a theoretical possibility,
Fiebig et al. (2010) assume that σi is distributed lognormal with standard deviation τ
and mean σ + θzi , where σ is a normalizing constant and zi is a vector of characteristics
of individual i that can be used to explain why σi differs across people.
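Spelled out (our restatement, keeping the article's symbols), this assumption amounts to

\[
\sigma_i = \exp(\sigma + \boldsymbol\theta' \mathbf{z}_i + \tau\,\varepsilon_i), \qquad \varepsilon_i \sim N(0,1)
\]

so that ln σi is normal with mean σ + θ'zi and standard deviation τ.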
where yitj is the observed choice variable, Pr(choiceit = j|βi ) is given by (1), and
p(βi |β, γ, τ, θ, Σ) is implied by (2).
Maximizing the log likelihood in (3) directly is rather difficult because the integral
does not have a closed-form representation and so must be evaluated numerically. We
choose to approximate it with simulation (see Train [2009], for example). The simulated
likelihood is
\[
\mathrm{SLL}(\boldsymbol\beta,\gamma,\tau,\boldsymbol\theta,\boldsymbol\Sigma)
  = \sum_{i=1}^{N}\ln\left\{\frac{1}{R}\sum_{r=1}^{R}\prod_{t=1}^{T}\prod_{j=1}^{J}
    \Pr(\mathrm{choice}_{it} = j \mid \boldsymbol\beta_i^{[r]})^{\,y_{itj}}\right\}
\]
The command gmnlpred can be used following gmnl to obtain predicted probabili-
ties. The predictions are available both in and out of sample; type gmnlpred . . . if
e(sample) . . . if predictions are wanted for the estimation sample only.
gmnlpred newvar [if] [in] [, nrep(#) burn(#) ll]
The command gmnlcov can be used following gmnl to obtain the elements in the coeffi-
cient covariance matrix along with their standard errors. This command is only relevant
when the coefficients are specified to be correlated; see the corr option below. gmnlcov
is a wrapper for nlcom (see [R] nlcom).
gmnlcov [, sd]
The command gmnlbeta can be used following gmnl to obtain the individual-level pa-
rameters corresponding to the variables in the specified varlist by using the method
proposed by Revelt and Train (2000) (see also Train [2009, chap. 11]). The individual-
level parameters are stored in a data file specified by the user. As with gmnlpred, the
predictions are available both in and out of sample; type gmnlbeta . . . if e(sample)
. . . if predictions are wanted for the estimation sample only.
gmnlbeta varlist [if] [in] , saving(filename) [replace nrep(#) burn(#)]
id(varname) specifies a numeric identifier variable for the decision makers. This option
should be specified only when each individual performs several choices, that is, when
the dataset is a panel.
corr specifies that the random coefficients be correlated. The default is that they
are independent. When the corr option is specified, the estimated parameters are
the means of the (fixed and random) coefficients plus the elements of the lower-
triangular matrix L, where the covariance matrix for the random coefficients is given
by Σ = LL′. The estimated parameters are reported in the following order: the
means of the fixed coefficients, the means of the random coefficients, and the elements
of the L matrix. The gmnlcov command can be used postestimation to obtain the
elements in the Σ matrix along with their standard errors.
If the corr option is not specified, the estimated parameters are the means of the
fixed coefficients and the means and standard deviations of the random coefficients,
reported in that order. The sign of the estimated standard deviations is irrelevant.
Although in practice the estimates may be negative, interpret them as being positive.
The sequence of the parameters is important to bear in mind when specifying starting
values.
nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).
burn(#) specifies the number of initial sequence elements to drop when creating the
Halton sequences. The default is burn(15). Specifying this option helps reduce the
correlation between the sequences in each dimension. Train (2009, 227) recommends
that # should be at least as large as the largest prime number used to generate the
sequences. If there are K random coefficients, gmnl uses the first K primes to
generate the Halton draws.
gamma(#) constrains the gamma parameter to the specified value in the estimations.
scale(matrix) specifies a matrix whose elements indicate whether their corresponding
variable will be scaled (1 = scaled and 0 = not scaled). The matrix should have
one row, and the number of columns should be equal to the number of explanatory
variables in the model.
het(varlist) specifies the variables in the zi vector (if any).
mixl specifies that a mixed logit model should be estimated instead of a G-MNL model.
seed(#) specifies the seed. The default is seed(12345).
level(#); see [R] estimation options.
constraints(numlist); see [R] estimation options.
vce(vcetype); vcetype may be oim, robust, cluster clustvar, or opg; see [R] vce option.
maximize options: difficult, technique(algorithm spec), iterate(#), trace,
gradient, showstep, hessian, tolerance(#), ltolerance(#), gtolerance(#),
nrtolerance(#), and from(init specs); see [R] maximize.
5 Computational issues
As in any model estimated using maximum simulated likelihood, the parameter estimates
of G-MNL depend on four factors: the random-number seed, the number of draws, the
starting values, and the optimization method. If these four factors are fixed, the same
maximum simulated likelihood estimates will be obtained each time the model is fit.
6 Empirical example
We will now present some examples that demonstrate how the gmnl command can
be used to fit the different models described in section 2. We will start by fitting
some relatively simple models, and then we will build up the complexity gradually.
The data used in the examples come from a stated preference study on Australian
women who were asked to choose whether to have a pap smear test; see Fiebig and Hall
(2004). There were 79 women in the sample, and each respondent was presented with
32 scenarios. Thus in terms of the model structure described in section 2, N = 79,
T = 32, and J = 2. The dataset also contains five attributes, which are described
in table 1. Besides these five attributes, an alternative specific constant (ASC) will be
used to measure intangible aspects of the pap smear test not captured by the design
attributes (some women would choose or not choose the test just because of these
intangible aspects no matter what attributes they are presented with).
Variable    Definition

knowgp      1 if the general practitioner is known to the patient; 0 otherwise
malegp      1 if the general practitioner is male; 0 if the general practitioner is female
testdue     1 if the patient is due or overdue for a pap smear test; 0 otherwise
drrec       1 if the general practitioner recommends that the patient have a pap smear
            test; 0 otherwise
cost        cost of test (unit: 10 Australian dollars)
To give an impression of how the data are structured, we have listed the first six
observations below. Each observation corresponds to an alternative, and the dependent
variable y is 1 for the chosen alternative in each choice situation and 0 otherwise.
gid identifies the alternatives in a choice situation; rid identifies the choice situations
faced by a given individual; and the remaining variables are the alternative attributes
described in table 1 and the ASC (dummy test). In the listed data, the same individual
faces three choice situations.
. use paptest.dta
. generate cost = papcost/10
. list y dummy_test knowgp malegp testdue drrec cost gid rid in 1/6, sep(2)
> abb(10)
          y   dummy_test   knowgp   malegp   testdue   drrec   cost   gid   rid

  1.      0            1        1        0         0       0      2     1     1
  2.      1            0        0        0         0       0      0     1     1
  3.      0            1        1        0         0       1      2     2     1
  4.      1            0        0        0         0       0      0     2     1
  5.      0            1        0        1         0       1      2     3     1
  6.      1            0        0        0         0       0      0     3     1
We start by fitting a relatively simple S-MNL model with a fixed (nonrandom) ASC.
Fiebig et al. (2010) have pointed out that ASCs should not be scaled, because they are
fundamentally different from observed attributes. We can fit the model with a fixed
ASC by using the scale() option of gmnl as described below.
To avoid scaling the ASC, we create a matrix whose elements indicate whether their
corresponding variable will be scaled (1 = scaled and 0 = not scaled). Here the “scale”
matrix defined as (0, 1, 1, 1, 1, 1) corresponds to the variables in the order in which they
are specified in the model (dummy test, knowgp, etc.). Therefore, among these six
variables, only dummy test (that is, the ASC) is not scaled.
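The corresponding commands are not reproduced in this extract; a sketch consistent with the description above (the use of nrep(500), matching the later examples, is our assumption) is:

. matrix define scale = (0,1,1,1,1,1)
. gmnl y dummy_test knowgp malegp testdue drrec cost, group(gid) id(rid)
>     scale(scale) nrep(500)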
We should mention that the number of observations reported in the table, 5,056, is
N × T × J, that is, the total number of choices times the number of alternatives. For
most purposes, such as computing information criteria, it is more appropriate to use
the total number of choices (N × T ); therefore, we do not recommend that you use the
estat ic command after gmnl.
We then let dummy test be random, which leads to our second model: S-MNL with
random ASC.3
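The call for this second model is likewise omitted; one plausible form (redefining the scale matrix to match the new variable order, with dummy_test moved to rand(), and constraining gamma as noted in footnote 4; all of this is our assumption) is:

. matrix define scale = (1,1,1,1,1,0)
. gmnl y knowgp malegp testdue drrec cost, group(gid) id(rid) rand(dummy_test)
>     scale(scale) gamma(0) nrep(500)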
                      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Mean
knowgp .6263819 .1431434 4.38 0.000 .3458261 .9069378
malegp -1.350731 .2024933 -6.67 0.000 -1.74761 -.9538514
testdue 2.954924 .2950128 10.02 0.000 2.37671 3.533139
drrec .7730114 .1608242 4.81 0.000 .4578018 1.088221
cost -.1701498 .0585679 -2.91 0.004 -.2849408 -.0553588
dummy_test -.7052151 .3578936 -1.97 0.049 -1.406674 -.0037565
SD
dummy_test 2.660664 .2798579 9.51 0.000 2.112152 3.209175
3. Strictly speaking, this is not an S-MNL but a parsimonious form of G-MNL; that is, we model
ASCs using preference heterogeneity but model other attributes using scale heterogeneity. This
specification of G-MNL has been used by Fiebig et al. (2011) and Knox et al. (2013).
Comparing “S-MNL with fixed ASC” with “S-MNL with random ASC”, we can see that
the latter model improved the model fit by adding one more parameter, the standard
deviation of dummy test, which is statistically significant.4 The improvement in fit is
not surprising because the random ASC captures preference heterogeneity and allows
for correlation across choice situations because of the panel nature of the data. The
parameter estimates between the two models are somewhat different, but they cannot
be compared directly because of differences in scale across models, as indicated by the
estimate of τ . Instead, we should run the gmnlpred command to compare the predicted
probabilities. We shall demonstrate how to do predictions after fitting the full G-MNL
model.
The third example is a G-MNL model in which dummy test, testdue, and drrec
are given random coefficients. For the moment, the coefficients are specified to be
uncorrelated; that is, the off-diagonal elements of Σ are all 0. To speed up the estimation,
we constrain γ to 0 by using the gamma(0) option, which implies that the fitted model
is a G-MNL-II (or “scaled mixed logit”).
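Again the command itself is not shown in this extract; a sketch consistent with the description (the redefined scale matrix and variable order are our assumption) is:

. matrix define scale = (1,1,1,0,1,1)
. gmnl y knowgp malegp cost, group(gid) id(rid) rand(dummy_test testdue drrec)
>     scale(scale) gamma(0) nrep(500)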
                      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Mean
knowgp .9123367 .1748867 5.22 0.000 .5695651 1.255108
malegp -2.742707 .3543753 -7.74 0.000 -3.43727 -2.048144
cost -.1419785 .0637313 -2.23 0.026 -.2668895 -.0170675
dummy_test -.4904328 .2685654 -1.83 0.068 -1.016811 .0359457
testdue 5.79628 .8667601 6.69 0.000 4.097462 7.495099
drrec 1.492487 .2652157 5.63 0.000 .9726734 2.0123
SD
dummy_test 2.988542 .3213189 9.30 0.000 2.358769 3.618316
testdue 3.166774 .4859329 6.52 0.000 2.214363 4.119185
drrec 1.356382 .194595 6.97 0.000 .974983 1.737781
4. Note that we constrain γ to 0 by using the gamma(0) option. This is to prevent gmnl from attempting
to estimate the gamma parameter, because it is not identified in this model.
The square roots of the diagonal elements of Σ are estimated and shown in the block
under SD. All the standard deviations are significantly different from 0, which suggests
the presence of substantial preference heterogeneity in the data.
In the last example, we allow the random coefficients of dummy test, testdue, and
drrec to be correlated, which implies that the off-diagonal elements of Σ will not be
fixed as zeros. Instead of using the default starting values, we use the parameters from
the previous model, setting the starting values for the off-diagonal elements of Σ to 0.
. *Starting values
. matrix b = e(b)   // coefficient vector from the previous (uncorrelated) model
. matrix start = b[1,1..7],0,0,b[1,8],0,b[1,9..10]
. /*G-MNL with correlated random coefficients*/
. gmnl y knowgp malegp cost, group(gid) id(rid) rand(dummy_test testdue drrec)
> nrep(500) from(start,copy) scale(scale) corr gamma(0)
Iteration 0: log likelihood = -991.41088 (not concave)
(output omitted )
Iteration 8: log likelihood = -987.7783
Generalized multinomial logit model Number of obs = 5056
Wald chi2(6) = 57.86
Log likelihood = -987.7783 Prob > chi2 = 0.0000
(Std. Err. adjusted for clustering on rid)
The six parameters from l11 to l33 are the elements of the lower-triangular matrix
L, the Cholesky factorization of Σ (Σ = LL′). Given the estimate of L, we may recover
Σ and the standard deviations of the random coefficients by using gmnlcov:
. gmnlcov
v11: [l11]_b[_cons]*[l11]_b[_cons]
v21: [l21]_b[_cons]*[l11]_b[_cons]
v31: [l31]_b[_cons]*[l11]_b[_cons]
v22: [l21]_b[_cons]*[l21]_b[_cons] + [l22]_b[_cons]*[l22]_b[_cons]
v32: [l31]_b[_cons]*[l21]_b[_cons] + [l32]_b[_cons]*[l22]_b[_cons]
v33: [l31]_b[_cons]*[l31]_b[_cons] + [l32]_b[_cons]*[l32]_b[_cons] +
> [l33]_b[_cons]*[l33]_b[_cons]
. gmnlcov, sd
dummy_test: sqrt([l11]_b[_cons]*[l11]_b[_cons])
testdue: sqrt([l21]_b[_cons]*[l21]_b[_cons] + [l22]_b[_cons]*[l22]_b[_cons])
drrec: sqrt([l31]_b[_cons]*[l31]_b[_cons] +
> [l32]_b[_cons]*[l32]_b[_cons] + [l33]_b[_cons]*[l33]_b[_cons])
There are other useful postestimation commands besides gmnlcov. For example, to
generate predicted probabilities, we may use the gmnlpred command:
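The gmnlpred call producing the listing below is not shown in the extract; a minimal sketch (the new variable name p_hat and the exact list syntax are our assumptions) is:

. gmnlpred p_hat, nrep(500)
. list rid gid y p_hat in 1/4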
1. 1 1 0 .51986608
2. 1 1 1 .48013392
3. 1 2 0 .63789177
4. 1 2 1 .36210823
rid loglik
1. 1 -22.560377
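The gmnlbeta call that creates this file is omitted from the extract; a sketch (mirroring the random coefficients in the varlist) might be:

. gmnlbeta dummy_test testdue drrec, saving(beta) replace nrep(500)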
A file is now created and saved as the beta.dta dataset, which contains all the estimated
individual β’s.
7 Conclusion
In this article, we described the gmnl Stata command, which can be used to fit the
G-MNL model and its variants. As pointed out in Fiebig et al. (2010), G-MNL is very
flexible and nests a rich family of model specifications. In the previous sections, we
demonstrated several important models, which are summarized (along with some other
useful specifications) below in table 2. This list does not exhaust all the possible models
that the gmnl routine can estimate. One example is the type of model considered in
Fiebig et al. (2011) and Knox et al. (2013), which includes interaction terms between
sociodemographic variables and ASCs.
Finally, a word of warning: While we have found that the gmnl command can
be used successfully to implement a range of model specifications, analysts need to
bear in mind that estimation times can be substantial when fitting complex models
with large datasets. As discussed in section 5, it may also be necessary to experiment
with alternative starting values, number of draws, and estimation algorithms to achieve
convergence.
8 Acknowledgments
We are grateful to a referee and to Kristin MacDonald of StataCorp for helpful com-
ments. The research of Yuanyuan Gu and Stephanie Knox was partially supported by
a Faculty of Business Research Grant at the University of Technology in Sydney.
9 References
Drukker, D. M., and R. Gates. 2006. Generating Halton sequences using Mata. Stata
Journal 6: 214–228.
Fiebig, D. G., and J. Hall. 2004. Discrete choice experiments in the analysis of health
policy. Productivity Commission Conference: Quantitative Tools for Microeconomic
Policy Analysis 6: 119–136.
Fiebig, D. G., M. P. Keane, J. Louviere, and N. Wasi. 2010. The generalized multinomial
logit model: Accounting for scale and coefficient heterogeneity. Marketing Science 29:
393–421.
Fiebig, D. G., S. Knox, R. Viney, M. Haas, and D. J. Street. 2011. Preferences for new
and existing contraceptive products. Health Economics 20 (Suppl.): 35–52.
Greene, W. H., and D. A. Hensher. 2010. Does scale heterogeneity across individuals
matter? An empirical assessment of alternative logit models. Transportation 37:
413–428.
Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood.
Stata Journal 7: 388–401.
Louviere, J., and T. Eagle. 2006. Confound it! That pesky little scale constant messes
up our convenient assumptions. In Proceedings of the Sawtooth Software Conference,
211–228. Sequim, WA: Sawtooth Software.
Louviere, J. J., R. J. Meyer, D. S. Bunch, R. T. Carson, B. Dellaert, W. M. Hanemann,
D. Hensher, and J. Irwin. 1999. Combining sources of preference data for modeling
complex decision processes. Marketing Letters 10: 205–217.
Revelt, D., and K. Train. 2000. Customer-specific taste parameters and mixed logit:
Households’ choice of electricity supplier. Working Paper No. E00-274, Department
of Economics, University of California, Berkeley.
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
. sysuse lifeexp
. egen upq = pctile(lexp), by(region) p(75)
. egen loq = pctile(lexp), by(region) p(25)
. generate iqr = upq - loq
This code works correctly if there are no values beyond where the whiskers should end.
Otherwise, it yields upper quartile + 1.5 IQR as the position of the upper whisker,
but this position will be correct only if there are values equal to that. Commonly,
that position will be too high. A similar problem applies to the lower whisker, which
commonly will be too low.
More careful code might be
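The corrected code block is missing from this extract. A reconstruction consistent with the discussion that follows (the names upper2 and lower2 are taken from the later text; the exact form is our assumption) is

. egen upper2 = max(lexp / (lexp <= upq + 1.5*iqr)), by(region)
. egen lower2 = min(lexp / (lexp >= loq - 1.5*iqr)), by(region)

which takes, within each group, the largest value no greater than the upper quartile + 1.5 IQR and the smallest value no less than the lower quartile − 1.5 IQR.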
That division / may look odd if you have not seen it before in similar examples. But it
is very like a common kind of conditional notation often seen,
max(argument | condition)
or
min(argument | condition)
where we seek the maximum or minimum of some argument, restricting attention to
cases in which a specified condition is satisfied, or true.
The connection is this: divide an argument by a logical expression that evaluates to 1
when the condition is true and to 0 otherwise. The argument remains unchanged on
division by 1 but evaluates to missing on division by 0. In any context where Stata
ignores missings, that is exactly what is wanted: true cases are included in the
computation, and false cases are excluded.
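As a minimal further illustration, consider Stata's auto data, in which foreign is a 0/1
indicator. The highest price among foreign cars can be obtained by dividing price by
foreign before taking the maximum (maxp_foreign is an arbitrary new variable name):
. sysuse auto, clear
. egen maxp_foreign = max(price / foreign)
. summarize price if foreign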
This “divide by zero” trick appears not to be widely known. It received some publicity
in a later article (Cox 2011).
Turning back to the box plots, we will see what the difference is in our example.
Here upper2 and lower2 are from the more careful code just given, and upper and
lower are from the code in the 2009 column. The results can be the same but need not
be.
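One way to display the comparison, one line per region, might be (tag is an arbitrary
new variable name):
. egen tag = tag(region)
. list region lower lower2 upper upper2 if tag, noobs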
Checking Stata’s own box plot
1 References
Cox, N. J. 2009. Speaking Stata: Creating and varying box plots. Stata Journal 9:
478–496.
———. 2011. Speaking Stata: Compared with ... Stata Journal 11: 305–314.
Stata tip 115
M. Herrmann
Models for multinomial outcomes are frequently used to analyze individual decision
making in consumer research, labor market research, voting, and other areas. The
multinomial probit model provides a flexible approach to analyzing decisions in these
fields because it does not impose some of the restrictive assumptions inherent in the
often-used conditional logit approach. In particular, multinomial probit relaxes 1) the
assumption of independent error terms, allowing for correlation in individual choices
across alternatives, and 2) the assumption of identically distributed errors, allowing
unobserved factors to affect the choice of some alternatives more strongly than others
(that is, heteroskedasticity).
By default, asmprobit relaxes both the assumptions of independence and homoske-
dasticity. To avoid overfitting, however, the researcher may sometimes wish to relax
these assumptions one at a time.1 A seemingly straightforward solution would be to rely
on the options stddev() and correlation(), which allow the user to set the structure
for the error variances and their covariances, respectively (see [R] asmprobit).
When doing so, however, the user should be aware that specifying std(het) and
corr(ind) does not actually fit a pure heteroskedastic multinomial probit model. With
J outcome categories, if errors are independent, J − 1 error variances are identified (see
below). Instead, Stata estimates J − 2 error variances and, hence, imposes an additional
constraint, which causes the model to be overidentified. As a result, the estimated model
is not invariant to the choice of base and scale outcomes; that is, changing the base or
scale outcome leads to different values of the likelihood function.
To properly estimate a pure heteroskedastic model, the user needs to define the
structure of the error variances manually. This is easy to accomplish using the pattern
or fixed option. The following example illustrates the problem and shows how to
estimate the model correctly.
1. Another reason to relax them one at a time is that heteroskedasticity and error correlation cannot
be distinguished from each other in the default specification. That is, one cannot simply look
at the estimated covariance matrix of the errors and see whether the errors are heteroskedastic,
correlated, or both. What Stata estimates is the normalized covariance matrix of error differences
whose elements do not allow one to draw any conclusions on the covariance structure of the errors
themselves.
Consider an individual’s choice of travel mode, with the alternatives being air, train,
bus, and car, and with predictor variables including general cost of travel, terminal time,
household income, and traveling group size. One might suspect the choice of some
alternatives to be driven more by unobserved factors than the choice of others. For
example, there might be more unobserved reasons related to an individual’s decision to
travel by plane than by train, bus, or car. Allowing the error variances associated with
the alternatives to differ, we fit the following model:
. use http://www.stata-press.com/data/r12/travel
. asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(het) corr(ind) nolog
Alternative-specific multinomial probit         Number of obs      =       840
Case variable: id                               Number of cases    =       210
Alternative variable: mode                      Alts per case: min =         4
                                                               avg =       4.0
                                                               max =         4
Integration sequence:      Hammersley
Integration points:               200           Wald chi2(8)       =     71.57
Log simulated-likelihood = -181.81521           Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mode         |
  travelcost |   -.012028   .0030838    -3.90   0.000    -.0180723   -.0059838
    termtime |   -.050713   .0071117    -7.13   0.000    -.0646517   -.0367743
-------------+----------------------------------------------------------------
train        |
      income |    -.03859   .0093287    -4.14   0.000    -.0568739   -.0203062
   partysize |   .7590228    .190438     3.99   0.000     .3857711    1.132274
       _cons |  -.9960951   .4750053    -2.10   0.036    -1.927088   -.0651019
-------------+----------------------------------------------------------------
bus          |
      income |  -.0119789   .0081057    -1.48   0.139    -.0278658     .003908
   partysize |   .5876645   .1751734     3.35   0.001     .2443309     .930998
       _cons |  -1.629348   .4803384    -3.39   0.001    -2.570794   -.6879016
-------------+----------------------------------------------------------------
car          |
      income |   -.004147   .0078971    -0.53   0.599     -.019625     .011331
   partysize |   .5737318    .163719     3.50   0.000     .2528485    .8946151
       _cons |  -3.903084    .750675    -5.20   0.000     -5.37438   -2.431788
------------------------------------------------------------------------------
In this specification, two of the four error variances are set to one: those of the base
and scale alternatives. While choosing base and scale alternatives is necessary to
identify the model in general, the problem here is that with uncorrelated errors, fixing
the variance of the base alternative is not required for identification. As a result, an
additional constraint is imposed, which leads to a different model structure depending
on the choice of base and scale alternatives. For example, changing the base alternative
to car produces a different log likelihood:
. quietly asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(het) corr(ind) nolog base(4)
. display e(ll)
-181.58795
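To estimate the model correctly, only one error variance should be fixed. This can be
done by passing a pattern matrix to the std() option. The following is a sketch (which
element is set to 1, and hence which alternative serves as the scale, is an arbitrary
choice here; as noted below, the results are invariant to it):
. matrix stdpat = (., 1, ., .)
. asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(pattern stdpat) corr(ind) nolog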
------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mode         |
  travelcost |  -.0196389   .0067143    -2.92   0.003    -.0327988    -.006479
    termtime |  -.0664153   .0140353    -4.73   0.000     -.093924   -.0389065
-------------+----------------------------------------------------------------
train        |
      income |  -.0498732   .0154884    -3.22   0.001      -.08023   -.0195165
   partysize |   1.126922   .3651321     3.09   0.002     .4112761    1.842568
       _cons |  -1.072849    .680711    -1.58   0.115    -2.407018    .2613198
-------------+----------------------------------------------------------------
bus          |
      income |  -.0210642   .0139892    -1.51   0.132    -.0484826    .0063542
   partysize |   .8678651   .3179559     2.73   0.006      .244683    1.491047
       _cons |  -1.831363   .7345686    -2.49   0.013    -3.271091   -.3916349
-------------+----------------------------------------------------------------
car          |
      income |   -.010205   .0131711    -0.77   0.438    -.0360199      .01561
   partysize |   .8708577   .3202671     2.72   0.007     .2431458     1.49857
       _cons |  -4.971594   1.261002    -3.94   0.000    -7.443112   -2.500075
------------------------------------------------------------------------------
Now the model is properly normalized, and the user may verify that changing either
the scale alternative (that is, changing the location of the 1 in stdpat) or the base
alternative leaves results unchanged. Note that while, in theory, the only restriction
necessary to identify the heteroskedastic probit model is to fix one of the variance
terms, in the Stata implementation of the model, the base and scale outcomes must be
different. That is, Stata does not allow the same alternative to be the base outcome
and the scale outcome. However, this is more of an inconvenience than a restriction:
such a model would be equivalent to one in which the base and scale outcomes differed.
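This invariance is easy to check along the lines of the commands above; a sketch
(again, which element of stdpat is fixed and which alternative is the base are arbitrary
choices):
. matrix stdpat = (., ., 1, .)
. quietly asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(pattern stdpat) corr(ind) base(4)
. display e(ll)
The displayed log likelihood should be unchanged from that of the properly normalized
model fit above.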
To see why, consider the case of J = 3 outcome categories, with outcome 1 as the base.
What Stata estimates is the normalized covariance matrix of error differences, with
elements θ* relating to the actual error variances σjj and covariances σij as follows:
θ*23 = (σ23 + σ11 − σ12 − σ13) / (σ22 + σ11 − 2σ12)
θ*33 = (σ33 + σ11 − 2σ13) / (σ22 + σ11 − 2σ12)
Under independence, σij = 0 for i ≠ j. Fixing σ22 = 1 (that is, choosing j = 2 as the scale
outcome) then yields θ*23 = σ11/(1 + σ11) and θ*33 = (σ33 + σ11)/(1 + σ11). Obviously, σ11
can be calculated from θ*23, as σ11 = θ*23/(1 − θ*23), and subsequent substitution produces
σ33 from θ*33. The same is true if we choose to fix either σ11 or σ33 because in each case,
we would obtain two equations in two unknowns. Similar conclusions follow when there are
four or more outcome categories. Thus, with independent errors, J − 1 variance parameters
are estimable.
Reference
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
Software Updates