
The Stata Journal

Volume 13 Number 2 2013

A Stata Press publication


StataCorp LP
College Station, Texas
The Stata Journal
Editors
H. Joseph Newton, Department of Statistics, Texas A&M University, College Station, Texas, [email protected]
Nicholas J. Cox, Department of Geography, Durham University, Durham, UK, [email protected]

Associate Editors
Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, WZB, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, University of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
Philip Ender, University of California–Los Angeles
David Epstein, Columbia University
Allan Gregory, Queen’s University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, London School of Economics and Political Science
Ulrich Kohler, University of Potsdam, Germany
Frauke Kreuter, Univ. of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, Univ. of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt Univ., Edinburgh
Jeroen Weesie, Utrecht University
Ian White, MRC Biostatistics Unit, Cambridge
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editors: David Culwell and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book
reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository
papers that link the use of Stata commands or programs to associated principles, such as those that will serve
as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go
“beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate
or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to
a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users
(e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers
analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could
be of interest or usefulness to researchers, especially in fields that are of practical importance but are not
often included in texts or other journals, such as the use of Stata in managing datasets, especially large
datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata
with topics such as extended examples of techniques and interpretation of results, simulations of statistical
concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone
979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at
http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

                                            U.S. and Canada    Elsewhere

Printed & electronic
  1-year subscription                            $ 98            $138
  2-year subscription                            $165            $245
  3-year subscription                            $225            $345
  1-year student subscription                    $ 75            $ 99
  1-year university library subscription         $125            $165
  2-year university library subscription         $215            $295
  3-year university library subscription         $315            $435
  1-year institutional subscription              $245            $285
  2-year institutional subscription              $445            $525
  3-year institutional subscription              $645            $765

Electronic only
  1-year subscription                            $ 75            $ 75
  2-year subscription                            $125            $125
  3-year subscription                            $165            $165
  1-year student subscription                    $ 45            $ 45

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may
be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX
77845, USA, or emailed to [email protected].

Copyright © 2013 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Stata
Press, Mata, and NetCourse are registered trademarks of StataCorp LP.
Volume 13 Number 2 2013

The Stata Journal


Articles and Columns  221

Maximum likelihood and generalized spatial two-stage least-squares estimators for a spatial-autoregressive model with spatial-autoregressive disturbances . . . . . D. M. Drukker, I. R. Prucha, and R. Raciborski  221
Creating and managing spatial-weighting matrices with the spmat command . . . . . D. M. Drukker, H. Peng, I. R. Prucha, and R. Raciborski  242
A command for estimating spatial-autoregressive models with spatial-autoregressive disturbances and additional endogenous variables . . . . . D. M. Drukker, I. R. Prucha, and R. Raciborski  287
A command for Laplace regression . . . . . M. Bottai and N. Orsini  302
Importing U.S. exchange rate data from the Federal Reserve and standardizing country names across datasets . . . . . B. Dicle, J. Levendis, and M. F. Dicle  315
Generating Manhattan plots in Stata . . . . . D. E. Cook, K. R. Ryckman, and J. C. Murray  323
Semiparametric fixed-effects estimator . . . . . F. Libois and V. Verardi  329
Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests . . . . . T. Harris and J. W. Hardin  337
Extending the flexible parametric survival model for competing risks . . . . . S. R. Hinchliffe and P. C. Lambert  344
Goodness-of-fit tests for categorical data . . . . . R. Bellocco and S. Algeri  356
Standardizing anthropometric measures in children and adolescents with functions for egen: Update . . . . . S. I. Vidmar, T. J. Cole, and H. Pan  366
Bonferroni and Holm approximations for Šidák and Holland–Copenhaver q-values . . . . . R. B. Newson  379
Fitting the generalized multinomial logit model in Stata . . . . . Y. Gu, A. R. Hole, and S. Knox  382
Speaking Stata: Creating and varying box plots: Correction . . . . . N. J. Cox  398

Notes and Comments  401

Stata tip 115: How to properly estimate the multinomial probit model with heteroskedastic errors . . . . . M. Herrmann  401

Software Updates  406


The Stata Journal (2013)
13, Number 2, pp. 221–241

Maximum likelihood and generalized spatial two-stage least-squares estimators for a spatial-autoregressive model with spatial-autoregressive disturbances

David M. Drukker, StataCorp, College Station, TX, [email protected]
Ingmar R. Prucha, Department of Economics, University of Maryland, College Park, MD, [email protected]
Rafal Raciborski, StataCorp, College Station, TX, [email protected]

Abstract. We describe the spreg command, which implements a maximum likelihood estimator and a generalized spatial two-stage least-squares estimator for the parameters of a linear cross-sectional spatial-autoregressive model with spatial-autoregressive disturbances.

Keywords: st0291, spreg, spatial-autoregressive models, Cliff–Ord models, maximum likelihood estimation, generalized spatial two-stage least squares, instrumental-variable estimation, generalized method of moments estimation, prediction, spatial econometrics, spatial statistics

1 Introduction
Cliff–Ord (1973, 1981) models, which build on Whittle (1954), allow for cross-unit
interactions. Many models in the social sciences, biostatistics, and geographic sciences
have included such interactions. Following Cliff and Ord (1973, 1981), much of the
original literature was developed to handle spatial interactions. However, space is not
restricted to geographic space, and many recent applications use these techniques in
other situations of cross-unit interactions, such as social-interaction models and network
models; see, for example, Kelejian and Prucha (2010) and Drukker, Egger, and Prucha
(2013) for references. Much of the nomenclature still includes the adjective “spatial”,
and we continue this tradition to avoid confusion while noting the wider applicability
of these models. For texts and reviews, see, for example, Anselin (1988, 2010), Arbia
(2006), Cressie (1993), Haining (2003), and LeSage and Pace (2009).
The simplest Cliff–Ord model only considers spatial spillovers in the dependent vari-
able, with spillovers modeled by including a right-hand-side variable known as a spatial
lag. Each observation of the spatial-lag variable is a weighted average of the values of the
dependent variable observed for the other cross-sectional units. The matrix containing
the weights is known as the spatial-weighting matrix. This model is frequently referred
to as a spatial-autoregressive (SAR) model. A generalized version of this model also
allows for the disturbances to be generated by a SAR process. The combined SAR model



with SAR disturbances is often referred to as a SARAR model; see Anselin and Florax
(1995).1
In modeling the outcome for each unit as dependent on a weighted average of the
outcomes of other units, SARAR models determine outcomes simultaneously. This si-
multaneity implies that the ordinary least-squares estimator will not be consistent; see
Anselin (1988) for an early discussion of this point.
In this article, we describe the spreg command, which implements a maximum
likelihood (ML) estimator and a generalized spatial two-stage least-squares (GS2SLS)
estimator for the parameters of a SARAR model with exogenous regressors. For discus-
sions of the ML estimator, see, for example, the above cited texts and Lee (2004) for the
asymptotic properties of the estimator. For a discussion of the estimation theory for the
implemented GS2SLS estimator, see Arraiz et al. (2010) and Drukker, Egger, and Prucha
(2013), which build on Kelejian and Prucha (1998, 1999, 2010) and the references cited
therein.
Section 2 describes the SARAR model. Section 3 describes the spreg command. Sec-
tion 4 provides some examples. Section 5 describes postestimation commands. Section 6
presents methods and formulas. The conclusion follows.
We use the notation that for any matrix A and vector a, the elements are denoted
as a_ij and a_i, respectively.

2 The SARAR model


The spreg command estimates the parameters of the cross-sectional model (i = 1, . . . , n)

\[
y_i = \lambda \sum_{j=1}^{n} w_{ij}\, y_j + \sum_{p=1}^{k} x_{ip}\,\beta_p + u_i
\qquad
u_i = \rho \sum_{j=1}^{n} m_{ij}\, u_j + \varepsilon_i
\]

or more compactly,

y = λWy + Xβ + u    (1)
u = ρMu + ε    (2)

1. These models are also known as Cliff–Ord models because of the impact that Cliff and Ord (1973,
1981) had on the subsequent literature. To avoid confusion, we simply refer to these models as
SARAR models while still acknowledging the importance of the work of Cliff and Ord.

where

• y is an n × 1 vector of observations on the dependent variable;

• W and M are n × n spatial-weighting matrices (with 0 diagonal elements);

• Wy and Mu are n × 1 vectors typically referred to as spatial lags, and λ and ρ are the corresponding scalar parameters typically referred to as SAR parameters;

• X is an n × k matrix of observations on k right-hand-side exogenous variables (where some of the variables may be spatial lags of exogenous variables), and β is the corresponding k × 1 parameter vector;

• ε is an n × 1 vector of innovations.

The model in (1) and (2) is a SARAR with exogenous regressors. Spatial interactions
are modeled through spatial lags. The model allows for spatial interactions in the
dependent variable, the exogenous variables, and the disturbances.2
The spatial-weighting matrices W and M are taken to be known and nonstochastic.
These matrices are part of the model definition, and in many applications, W = M.
Let ȳ = Wy. Then

\[
\bar{y}_i = \sum_{j=1}^{n} w_{ij}\, y_j
\]

which clearly shows the dependence of y_i on neighboring outcomes via the spatial lag ȳ_i. By construction, the spatial lag Wy is an endogenous variable. The weights w_ij will typically be modeled as inversely related to some measure of proximity between the units. The SAR parameter λ measures the extent of these interactions. For further discussions of spatial-weighting matrices and the parameter space for the SAR parameter, see, for example, the literature cited in the introduction, including Kelejian and Prucha (2010); see Drukker et al. (2013) for more information about creating spatial-weighting matrices in Stata.
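As a small illustration (a sketch, not part of the original text; obj and y are placeholder names for an existing spmat object and a variable in the dataset), the spatial lag can be computed directly once the weighting matrix has been copied into Mata with the spmat getmatrix subcommand documented in Drukker et al. (2013):

. spmat getmatrix obj W              // copy the spatial-weighting matrix into Mata as W
. mata: wy = W*st_data(., "y")       // wy[i] = sum_j w_ij*y_j, the spatial lag of y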
The innovations ε are assumed to be independent and identically distributed (IID)
or independent but heteroskedastically distributed, where the heteroskedasticity is of
unknown form. The GS2SLS estimator produces consistent estimates in either case
when the heteroskedastic option is specified; see Kelejian and Prucha (1998, 1999,
2010), Arraiz et al. (2010), and Drukker, Egger, and Prucha (2013) for discussions and
formal results. The ML estimator produces consistent estimates in the IID case but
generally not in the heteroskedastic case; see Lee (2004) for some formal results for the
ML estimator, and see Arraiz et al. (2010) for evidence that the ML estimator does not
generally produce consistent estimates in the heteroskedastic case.

2. An extension of the model to a limited-information-systems framework with additional endogenous right-hand-side variables is considered in Drukker, Prucha, and Raciborski (2013), which discusses the spivreg command.

Because the model in (1) and (2) is a first-order SAR model with first-order SAR
disturbances, it is also referred to as a SARAR(1, 1) model, which is a special case of
the more general SARAR(p, q) model. We refer to a SARAR(1, 1) model as a SARAR
model. When ρ = 0, the model in equations (1) and (2) reduces to the SAR model
y = λWy + Xβ + ε. When λ = 0, the model in equations (1) and (2) reduces to
y = Xβ + u with u = ρMu + ε, which is sometimes referred to as the SAR error model.
Setting ρ = 0 and λ = 0 causes the model in equations (1) and (2) to reduce to a linear
regression model with exogenous variables.
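For example, the SAR error model could in principle be fit with spreg ml by using the constraints() option described in the next section to restrict λ to 0. The lines below are only a sketch: the variable and spmat object names are placeholders, and the constraint assumes that the SAR parameter is reported in an equation named lambda, as in the estimation output shown in section 4.

. constraint define 1 [lambda]_cons = 0
. spreg ml y x1 x2, id(id) dlmat(W) elmat(W) constraints(1)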
spreg requires that the spatial-weighting matrices M and W be provided in the form
of an spmat object as described in Drukker et al. (2013). spreg gs2sls supports both
general and banded spatial-weighting matrices; spreg ml supports general matrices
only.

3 The spreg command


3.1 Syntax
      
spreg ml depvar [indepvars] [if] [in] , id(varname) [noconstant level(#)
     dlmat(objname[, eig]) elmat(objname[, eig]) constraints(constraints)
     gridsearch(#) maximize_options]

spreg gs2sls depvar [indepvars] [if] [in] , id(varname) [noconstant level(#)
     dlmat(objname) elmat(objname) heteroskedastic impower(q) maximize_options]

3.2 Options for spreg ml


id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
noconstant suppresses the constant term in the model.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
 
dlmat(objname[, eig]) specifies an spmat object that contains the spatial-weighting
matrix W to be used in the SAR term. eig forces the calculation of the eigenvalues
of W, even if objname already contains them.

elmat(objname[, eig]) specifies an spmat object that contains the spatial-weighting
matrix M to be used in the spatial-error term. eig forces the calculation of the
eigenvalues of M, even if objname already contains them.
constraints(constraints); see [R] estimation options.

gridsearch(#) specifies the fineness of the grid used in searching for the initial values
of the parameters λ and ρ in the concentrated log likelihood. The allowed range is
[.001, .1]. The default is gridsearch(.1).
 
maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log,
trace, gradient, showstep, hessian, showtolerance, tolerance(#),
ltolerance(#), nrtolerance(#), nonrtolerance, and from(init_specs); see
[R] maximize. These options are seldom used. from() takes precedence over
gridsearch().

Options for spreg gs2sls

id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
noconstant suppresses the constant term.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
dlmat(objname) specifies an spmat object that contains the spatial-weighting matrix
W to be used in the SAR term.
elmat(objname) specifies an spmat object that contains the spatial-weighting matrix
M to be used in the spatial-error term.
heteroskedastic specifies that spreg use an estimator that allows the errors to be
heteroskedastically distributed over the observations. By default, spreg uses an
estimator that assumes homoskedasticity. An example appears after this list of options.
impower(q) specifies how many powers of W to include in calculating the instrument
matrix H. The default is impower(2). The allowed values of q are integers in the
set {2, 3, . . . , ⌊√n⌋}, where n is the number of observations.
 
maximize_options: iterate(#), [no]log, trace, gradient, showstep,
showtolerance, tolerance(#), and ltolerance(#); see [R] maximize.
from(init_specs) is also allowed, but because ρ is the only parameter in this
optimization problem, only initial values for ρ may be specified.
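For instance, the heteroskedastic option mentioned above could be added to the gs2sls fit shown in section 4 to allow for heteroskedasticity of unknown form (a sketch that simply reuses that example's variable and object names):

. spreg gs2sls dui police nondui vehicles dry, id(id) dlmat(ccounty) elmat(ccounty)
>      heteroskedastic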

3.3 Saved results


spreg ml saves the following in e():
Scalars
  e(N)                 number of observations
  e(k)                 number of parameters
  e(df_m)              model degrees of freedom
  e(ll)                log likelihood
  e(chi2)              χ²
  e(p)                 significance
  e(rank)              rank of e(V)
  e(converged)         1 if converged, 0 otherwise
  e(iterations)        number of ML iterations

Macros
  e(cmd)               spreg
  e(cmdline)           command as typed
  e(depvar)            name of dependent variable
  e(indeps)            names of independent variables
  e(title)             title in estimation output
  e(chi2type)          type of model χ² test
  e(vce)               oim
  e(technique)         maximization technique
  e(crittype)          type of optimization
  e(estat_cmd)         program used to implement estat
  e(predict)           program used to implement predict
  e(user)              name of likelihood-evaluator program
  e(estimator)         ml
  e(model)             lr, sar, sare, or sarar
  e(constant)          noconstant or hasconstant
  e(idvar)             name of ID variable
  e(dlmat)             name of spmat object used in dlmat()
  e(elmat)             name of spmat object used in elmat()
  e(properties)        b V

Matrices
  e(b)                 coefficient vector
  e(Cns)               constraints matrix
  e(ilog)              iteration log
  e(gradient)          gradient vector
  e(V)                 variance–covariance matrix of the estimators

Functions
  e(sample)            marks estimation sample

spreg gs2sls saves the following in e():


Scalars
  e(N)                    number of observations
  e(k)                    number of parameters
  e(rho_2sls)             initial estimate of ρ
  e(iterations)           number of generalized method of moments iterations
  e(iterations_2sls)      number of two-stage least-squares iterations
  e(converged)            1 if generalized method of moments converged, 0 otherwise
  e(converged_2sls)       1 if two-stage least squares converged, 0 otherwise

Macros
  e(cmd)                  spreg
  e(cmdline)              command as typed
  e(estimator)            gs2sls
  e(model)                lr, sar, sare, or sarar
  e(het)                  homoskedastic or heteroskedastic
  e(depvar)               name of dependent variable
  e(indeps)               names of independent variables
  e(title)                title in estimation output
  e(exogr)                exogenous regressors
  e(constant)             noconstant or hasconstant
  e(H_omitted)            names of omitted instruments in H matrix
  e(idvar)                name of ID variable
  e(dlmat)                name of spmat object used in dlmat()
  e(elmat)                name of spmat object used in elmat()
  e(estat_cmd)            program used to implement estat
  e(predict)              program used to implement predict
  e(properties)           b V

Matrices
  e(b)                    coefficient vector
  e(V)                    variance–covariance matrix of the estimators
  e(delta_2sls)           initial estimate of β and λ

Functions
  e(sample)               marks estimation sample

4 Example
In our examples, we use dui.dta, which contains simulated data on the number of
arrests for driving under the influence for the continental U.S. counties.3 We use a
normalized contiguity matrix taken from Drukker et al. (2013). In Stata, we type

. use dui
. spmat use ccounty using ccounty.spmat

to read the dataset into memory and to put the spatial-weighting matrix into the
spmat object ccounty. This row-normalized spatial-weighting matrix was created in
Drukker et al. (2013, sec. 2.4) and saved to disk in Drukker et al. (2013, sec. 11.4).
Our dependent variable, dui, is defined as the alcohol-related arrest rate per 100,000
daily vehicle miles traveled (DVMT). Figure 1 shows the distribution of dui across
counties, with darker colors representing higher values of the dependent variable. Spatial
patterns of dui are clearly visible.

Figure 1. Hypothetical alcohol-related arrests for continental U.S. counties

3. The geographical location data came from the U.S. Census Bureau and can be found at
ftp://ftp2.census.gov/geo/tiger/TIGER2008/. The variables are simulated but inspired by
Powers and Wilson (2004).

Our explanatory variables include police (number of sworn officers per 100,000
DVMT); nondui (nonalcohol-related arrests per 100,000 DVMT); vehicles (number of
registered vehicles per 1,000 residents); and dry (a dummy for counties that prohibit
alcohol sale within their borders). In other words, in this illustration,
X = [police,nondui,vehicles,dry,intercept].
We obtain the GS2SLS parameter estimates of the SARAR model parameters by typing

. spreg gs2sls dui police nondui vehicles dry, id(id)
> dlmat(ccounty) elmat(ccounty) nolog
Spatial autoregressive model Number of obs = 3109
(GS2SLS estimates)

dui Coef. Std. Err. z P>|z| [95% Conf. Interval]

dui
police -.5591567 .0148772 -37.58 0.000 -.5883155 -.529998
nondui -.0001128 .0005645 -0.20 0.842 -.0012193 .0009936
vehicles .062474 .0006198 100.79 0.000 .0612592 .0636889
dry .303046 .0183119 16.55 0.000 .2671553 .3389368
_cons 2.482489 .1473288 16.85 0.000 2.19373 2.771249

lambda
_cons .4672164 .0051261 91.14 0.000 .4571694 .4772633

rho
_cons .1932962 .0726583 2.66 0.008 .0508885 .3357038

Given the normalization of the spatial-weighting matrix, the parameter space for
λ and ρ is taken to be the interval (−1, 1); see Kelejian and Prucha (2010) for further
discussions of the parameter space. The estimated λ is positive and significant, indicat-
ing moderate SAR dependence in dui. In other words, the dui alcohol-arrest rate for
a given county is affected by the dui alcohol-arrest rates of the neighboring counties.
This result may be because of coordination among police departments or because strong
enforcement in one county leads some people to drink in neighboring counties.
The estimated ρ coefficient is positive, moderate, and significant, indicating mod-
erate SAR dependence in the error term. In other words, an exogenous shock to one
county will cause moderate changes in the alcohol-related arrest rate in the neighboring
counties.
The estimated β vector does not have the same interpretation as in a simple linear
model, because including a spatial lag of the dependent variable implies that the out-
comes are determined simultaneously. We present one way to interpret the coefficients
in section 5.

For comparison, we obtain the ML parameter estimates by typing

. spreg ml dui police nondui vehicles dry, id(id)
> dlmat(ccounty) elmat(ccounty) nolog
Spatial autoregressive model Number of obs = 3109
(Maximum likelihood estimates) Wald chi2(4) = 62376.4
Prob > chi2 = 0.0000

dui Coef. Std. Err. z P>|z| [95% Conf. Interval]

dui
police -.5593526 .014864 -37.63 0.000 -.5884854 -.5302197
nondui -.0001214 .0005645 -0.22 0.830 -.0012279 .0009851
vehicles .0624729 .0006195 100.84 0.000 .0612586 .0636872
dry .3030522 .018311 16.55 0.000 .2671633 .3389412
_cons 2.490301 .1471885 16.92 0.000 2.201817 2.778785

lambda
_cons .4671198 .0051144 91.33 0.000 .4570957 .4771439

rho
_cons .1962348 .0711659 2.76 0.006 .0567522 .3357174

sigma2
_cons .0859662 .0021815 39.41 0.000 .0816905 .0902418

There are no apparent differences between the two sets of parameter estimates.

5 Postestimation commands
The postestimation commands supported by spreg include estat, test, and predict;
see help spreg postestimation for the full list. Most postestimation methods have
standard interpretations; for example, a Wald test is just a Wald test.
Predictions from SARAR models require some additional explanation. Kelejian and
Prucha (2007) consider different information sets and define predictors as conditional
means based on these information sets. They also derive the mean squared errors of
these predictors, provide some efficiency rankings based on these mean squared errors,
and provide Monte Carlo evidence that the additional efficiencies obtained by using
more information can be practically important.
One of the predictors that Kelejian and Prucha (2007) consider is based on the infor-
mation set {X, W, M, wi y}, where wi denotes the ith row of W, which will be referred
to as the limited-information predictor.4 We denote the limited-information predictor
by limited in the syntax diagram below. Another estimator that Kelejian and Prucha
(2007) consider is based on the information set {X, W, M}, which yields the reduced-
form predictor. This predictor is denoted by rform in the syntax diagram below.

4. Kelejian and Prucha (2007) also consider a full-information predictor. We have postponed imple-
menting this predictor because it is computationally more demanding; we plan to implement it in
future work.

Kelejian and Prucha (2007) show that their limited-information predictor can be much
more efficient than the reduced-form predictor.
In addition to the limited-information predictor and the reduced-form predictor,
predict can compute two other observation-level quantities, which are not recom-
mended as predictors but may be used in subsequent computations. These quantities
are denoted by naive and xb in the syntax diagram below.
While prediction is frequently of interest in applied statistical work, predictions
can also be used to compute marginal effects.5 A change to one observation in one
exogenous variable potentially changes the predicted values for all the observations of
the dependent variable because the n observations for the dependent variable form a
system of simultaneous equations in a SARAR model. Below we use predict to calculate
predictions that we in turn use to calculate marginal effects.
Various methods have been proposed to interpret the parameters of SAR models: see,
for example, Anselin (2003); Abreu, De Groot, and Florax (2004); Kelejian and Prucha
(2007); and LeSage and Pace (2009).

5.1 Syntax
Before using predict, we discuss its syntax.
    
predict [type] newvar [if] [in] [, {rform | limited | naive | xb} rftransform(matname)]

5.2 Options
rform, the default, calculates the reduced-form predictions.
limited calculates the Kelejian and Prucha (2007) limited-information predictor. This
predictor is more efficient than the reduced-form predictor, but we call it limited
because it is not as efficient as the Kelejian and Prucha (2007) full-information pre-
dictor, which we plan to implement in the future.
naive calculates λ̂w_i y + x_i β̂ for each observation.

xb calculates the linear prediction Xβ̂.

5. We refer to the effects of both infinitesimal changes in a continuous variable and discrete changes
in a discrete variable as marginal effects. While some authors refer to “partial” effects to cover the
continuous and discrete cases, we avoid the term “partial” because it means something else in a
simultaneous-equations framework.

rftransform(matname) is a seldom-used option that specifies a matrix to use in computing the reduced-form predictions. This option is only useful when computing reduced-form predictions in a loop, when the option removes the need to recompute the inverse of a large matrix. See section 5.3 for an example that uses this option, and see section 6.3 for the details. rftransform() may only be specified with statistic rform.

5.3 Example
In this section, we discuss two marginal effects that measure how changes in the exogenous variables affect the endogenous variable. These measures use the reduced-form predictor ŷ = E(y|X, W, M) = (I − λW)^{-1}Xβ, which we discuss in section 6.3, where it is denoted as ŷ^{(1)}. The expression for the predictor shows that a change in a single observation on an exogenous variable will typically affect the values of the endogenous variable for all n units because the SARAR model forms a system of simultaneous equations.

Without loss of generality, we explore the effects of changes in the kth exogenous variable. Letting x_k = (x_{1k}, . . . , x_{nk}) denote the vector of observations on the kth exogenous variable allows us to denote the dependence of ŷ on x_k by using the notation

\[
\hat y(x_k) = \{\hat y_1(x_k), \ldots, \hat y_n(x_k)\}
\]

The first marginal effect we consider is

\[
\frac{\partial \hat y(x_k + \delta i)}{\partial\delta}
= \frac{\partial \hat y(x_{1k},\ldots,x_{i-1,k},\,x_{ik}+\delta,\,x_{i+1,k},\ldots,x_{nk})}{\partial\delta}
= \frac{\partial \hat y(x_k)}{\partial x_{ik}}
\]

where i = [0, . . . , 0, 1, 0, . . . , 0] is the ith column of the identity matrix. In terminology consistent with that of LeSage and Pace (2009, 36–37), we refer to the above effect as the total direct impact of a change in the ith unit of x_k. LeSage and Pace (2009, 36–37) define the corresponding summary measure

\[
n^{-1}\sum_{i=1}^{n}\frac{\partial \hat y_i(x_k + \delta i)}{\partial\delta}
= n^{-1}\sum_{i=1}^{n}\frac{\partial \hat y_i(x_{1k},\ldots,x_{i-1,k},\,x_{ik}+\delta,\,x_{i+1,k},\ldots,x_{nk})}{\partial\delta}
= n^{-1}\sum_{i=1}^{n}\frac{\partial \hat y_i(x_k)}{\partial x_{ik}}
\tag{3}
\]

which they call the average total direct impact (ATDI). The ATDI is the average over i = {1, . . . , n} of the changes in the ŷ_i attributable to the changes in the corresponding x_ik. The ATDI can be calculated by computing ŷ(x_k), ŷ(x_k + δi), and the average of the difference of these vectors of predicted values, where δ is the magnitude by which x_ik is changed. The ATDI measures the average change in ŷ_i attributable to sequentially changing x_ik for a given k.

Sequentially changing x_ik for each i = {1, . . . , n} differs from simultaneously changing the x_ik for all n units. The second marginal effect we consider measures the effect of simultaneously changing x_{1k}, . . . , x_{nk} on a specific ŷ_i and is defined by

\[
\frac{\partial \hat y_i(x_k + \delta e)}{\partial\delta}
= \frac{\partial \hat y_i(x_{1k}+\delta,\ldots,x_{ik}+\delta,\ldots,x_{nk}+\delta)}{\partial\delta}
= \sum_{r=1}^{n}\frac{\partial \hat y_i(x_k)}{\partial x_{rk}}
\]

where e = [1, . . . , 1] is a vector of 1s. LeSage and Pace (2009, 36–37) define the corresponding summary measure

\[
n^{-1}\sum_{i=1}^{n}\frac{\partial \hat y_i(x_k + \delta e)}{\partial\delta}
= n^{-1}\sum_{i=1}^{n}\frac{\partial \hat y_i(x_{1k}+\delta,\ldots,x_{ik}+\delta,\ldots,x_{nk}+\delta)}{\partial\delta}
= n^{-1}\sum_{i=1}^{n}\sum_{r=1}^{n}\frac{\partial \hat y_i(x_k)}{\partial x_{rk}}
\tag{4}
\]

which they call the average total impact (ATI). The ATI can be calculated by computing ŷ(x_k), ŷ(x_k + δe), and the average difference in these vectors of predicted values, where δ is the magnitude by which x_{1k}, . . . , x_{nk} is changed.
We now continue our example from section 4 and use the reduced-form predictor to
compute the marginal effects of adding one officer per 100,000 DVMT in Elko County,
Nevada. We begin by using the reduced-form predictor and the observed values of the
exogenous variables to obtain predicted values for dui:
. predict y0
(option rform assumed)

Next we increase police by 1 in Elko County, Nevada, and calculate the reduced-form
predictions:
. generate police_orig = police
. quietly replace police = police_orig + 1 if st==32 & NAME00=="Elko"
. predict y1
(option rform assumed)

Now we compute the difference between these two predictions:


. generate deltay = y1-y0

The output below lists the predicted difference and the level of dui for Elko County,
Nevada:
. list deltay dui if (st==32 & NAME00=="Elko")

deltay dui

1891. -.5654716 19.777429

The predicted effect of the change would be a 2.9% reduction in dui in Elko County,
Nevada.

Below we use four commands to summarize the changes and levels in the contiguous
counties:

. spmat getmatrix ccounty W


. generate double elko_neighbor = .
(3109 missing values generated)
. mata: st_store(.,"elko_neighbor",W[1891,.]')
. summarize deltay dui if elko_neighbor>0
Variable Obs Mean Std. Dev. Min Max

deltay 9 -.0203756 .0000364 -.0204239 -.020298


dui 9 21.29122 1.6468 19.2773 23.49109

In the first command, we use spmat getmatrix to store a copy of the normalized-
contiguity spatial-weighting matrix in Mata memory; see Drukker et al. (2013, sec. 14)
for a discussion of spmat getmatrix. In the second and third commands, we generate
and fill in a new variable for which the ith observation is positive if it contains information on a
county that is contiguous with Elko County and is 0 otherwise. In the fourth command,
we summarize the predicted changes and the levels in the contiguous counties. The
mean predicted reduction is less than 0.1% of the mean level of dui in the contiguous
counties.
In the output below, we get a summary of the levels of dui and a detailed summary
of the predicted changes for all the counties in the sample.

. summarize dui
Variable Obs Mean Std. Dev. Min Max

dui 3109 20.84307 1.457163 15.01375 26.61978


. summarize deltay, detail
deltay

Percentiles Smallest
1% -.0007572 -.5654716
5% 0 -.0204239
10% 0 -.0203991 Obs 3109
25% 0 -.0203991 Sum of Wgt. 3109
50% 0 Mean -.0002495
Largest Std. Dev. .0101996
75% 0 0
90% 0 0 Variance .000104
95% 0 0 Skewness -54.78661
99% 0 0 Kurtosis 3035.363

Less than 1% of the sample had any socially significant difference, with no change at
all predicted for at least 95% of the sample.


In some of the computations below, we will use the matrix S = (I_n − λ̂W)^{-1}, where λ̂ is the estimate of the SAR parameter and W is the spatial-weighting matrix. In the output below, we use the W stored in Mata memory in an example above to compute S.

. spmat getmatrix ccounty W


. mata:
mata (type end to exit)
: b = st_matrix("e(b)")
: lam = b[1,6]
: S = luinv(I(rows(W))-lam*W)
: (b[1,1]/rows(W))*sum(S)
-.6993674779
: end

We next compute the ATDI defined in (3). The output below shows an instructive
(but slow) method to compute the ATDI. For each county in the data, we set police to
be the original value for all the observations except the ith, which we set to police + 1.
Then we compute the predicted value of dui for observation i and store this prediction in
the ith observation of y1. (We use the rftransform() option to use the inverse matrix
S computed above. Without this option, we would recompute the inverse matrix for
each of the 3,109 observations, which would cause the calculation to take hours.) After
computing the predicted values of y1 for each observation, we compute the differences
in the predictions and compute the sample average.
. drop y1 deltay
. generate y1 = .
(3109 missing values generated)
. local N = _N
. forvalues i = 1/`N' {
  2.   quietly capture drop tmp
  3.   quietly replace police = police_orig
  4.   quietly replace police = police_orig + 1 in `i'
  5.   quietly predict tmp in `i', rftransform(S)
  6.   quietly replace y1 = tmp in `i'
  7. }
. generate deltay = y1-y0
. summarize deltay
Variable Obs Mean Std. Dev. Min Max

deltay 3109 -.5633844 .0009144 -.5690784 -.5599785


. summarize dui
Variable Obs Mean Std. Dev. Min Max

dui 3109 20.84307 1.457163 15.01375 26.61978

The estimated ATDI is −0.56; in absolute value, the estimated effect is about 2.7%
of the sample mean of dui.

As mentioned, the above method for computing the estimate of the ATDI is slow.
LeSage and Pace (2009, 36–37) show that the estimate of the ATDI can also be computed as

\[
\frac{\hat\beta_k}{n}\,\operatorname{trace}(S)
\]

where β̂_k is the kth component of β̂ and S = (I_n − λ̂W)^{-1}, which we computed above.
Below we use this formula to compute the ATDI,

. mata: (b[1,1]/rows(W))*trace(S)
-.5633844076

and note that the result is the same as above.


Now we estimate the ATI, which simultaneously adds one more police officer per
100,000 DVMT to each county. In the output below, we add 1 to police in each
observation and then calculate the differences in the predictions. We then calculate the
ATI defined in (4) by computing the sample average.

. drop y1 deltay
. quietly replace police = police_orig + 1
. predict y1
(option rform assumed)
. generate deltay = y1-y0
. summarize deltay
Variable Obs Mean Std. Dev. Min Max

deltay 3109 -.6993675 .0309541 -.8945923 -.5801525


. summarize dui
Variable Obs Mean Std. Dev. Min Max

dui 3109 20.84307 1.457163 15.01375 26.61978

The absolute value of the estimated average total effect is about 3.4% of the sample
mean of dui.
LeSage and Pace (2009, 36–37) show that the ATI is given by

\[
\frac{\hat\beta_k}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij}
\]

where β̂_k is the kth component of β̂ and S_ij is the (i, j)th element of S = (I_n − λ̂W)^{-1}. In the output below, we use the spmat getmatrix command discussed in Drukker et al. (2013) and a few Mata computations to show that the above expression yields the same value for the ATI as our calculations above.

. spmat getmatrix ccounty W


. mata:
mata (type end to exit)
: b = st_matrix("e(b)")
: lam = b[1,6]
: S = luinv(I(rows(W))-lam*W)
: (b[1,1]/rows(W))*sum(S)
-.6993674779
: end

In general, it is not possible to say whether the ATDI is greater than or less than the ATI. Using the expressions from LeSage and Pace (2009, 36–37), we see that

\[
\mathrm{ATI} - \mathrm{ATDI}
= \frac{\hat\beta_k}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij}
- \frac{\hat\beta_k}{n}\sum_{i=1}^{n} S_{ii}
= \frac{\hat\beta_k}{n}\sum_{i=1}^{n}\sum_{j\neq i} S_{ij}
\]

which depends on the sum of the off-diagonal elements of S as well as on β̂_k.


In the case at hand, one would expect the ATDI to be smaller than the ATI because the
ATDI, unlike the ATI, does not incorporate the reinforcing effects of having all counties
implement the change simultaneously.
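Continuing the Mata session above, the difference can be checked directly from b and S (a quick verification that is not part of the original output):

. mata: (b[1,1]/rows(W))*(sum(S) - trace(S))

The result is the gap between the two sample averages reported above, roughly −0.136.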

6 Methods and formulas


6.1 ML estimator
Recall that the SARAR model under consideration is given by

y = λWy + Xβ + u (5)
u = ρMu + ε (6)

In the following, we give the log-likelihood function under the assumption that ε ∼ N(0, σ²I). As usual, we refer to the maximizer of the likelihood function when the innovations are not normally distributed as the quasi-maximum likelihood (QML) estimator. Lee (2004) gives results concerning the consistency and asymptotic normality of the QML estimator when ε is IID but not necessarily normally distributed. Violations of the assumption that the innovations ε are IID can cause the QML estimator to produce inconsistent results. In particular, this may be the case if the innovations ε are heteroskedastic, as discussed by Arraiz et al. (2010).

Likelihood function

The reduced form of the model in (5) and (6) is given by

y = (I − λW)^{-1}Xβ + (I − λW)^{-1}(I − ρM)^{-1}ε



The unconcentrated log-likelihood function is


n n
ln L(y|β, σ 2 , λ, ρ) = − ln(2π) − ln(σ 2 ) + ln ||I − λW|| + ln ||I − ρM||
2 2
1 T
− {(I − λW)y − Xβ} (I − ρM)T (I − ρM) {(I − λW)y − Xβ} (7)
2σ 2
We can concentrate the log-likelihood function by first maximizing (7) with respect to
β and σ 2 , yielding the maximizers
 
 ρ) = XT (I − ρM)T (I − ρM)X −1 XT (I − ρM)T (I − ρM)(I − λW)y
β(λ,

T
2 (λ, ρ) = (1/n) (I − λW)y − Xβ(λ,
σ  ρ) (I − ρM)T (I − ρM)

 ρ)
(I − λW)y − Xβ(λ,

Substitution of the above expressions into (7) yields the concentrated log-likelihood
function
n n
Lc (y|λ, ρ) = − {ln(2π) + 1} − ln( σ 2 (λ, ρ)) + ln ||I − λW|| + ln ||I − ρM||
2 2
The QML estimates for the autoregressive parameters λ and ρ can now be computed by
maximizing the concentrated log-likelihood function. Once we have obtained the QML
 and ρ, we can calculate the QML estimates for β and σ 2 as β
estimates λ  = β(
 λ,
 ρ) and
2 
 =σ
σ 2
 (λ, ρ).
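To make the concentrated log-likelihood concrete, the Mata sketch below evaluates L_c(y|λ, ρ) at a single pair of trial values. It is not part of the spreg implementation: it assumes that y, X, W, and M have already been placed in Mata (for example, via st_data() and spmat getmatrix), and the values 0.45 and 0.20 are purely illustrative.

. mata:
: lam = 0.45                                 // trial value for lambda (illustrative)
: rho = 0.20                                 // trial value for rho (illustrative)
: n   = rows(y)
: A   = I(n) - lam*W
: B   = I(n) - rho*M
: yt  = B*(A*y)                              // spatially transformed dependent variable
: Xt  = B*X                                  // spatially transformed regressors
: b   = invsym(Xt'Xt)*(Xt'yt)                // beta-hat(lambda, rho)
: e   = yt - Xt*b
: s2  = (e'e)/n                              // sigma2-hat(lambda, rho)
: Lc  = -n/2*(ln(2*pi())+1) - n/2*ln(s2) + ln(abs(det(A))) + ln(abs(det(B)))
: end

spreg ml maximizes this function over (λ, ρ), using the grid search described below only to obtain starting values.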

Initial values

As noted in Anselin (1988, 186), poor initial starting values for ρ and λ in the concen-
trated likelihood may result in the optimization algorithm settling on a local, rather
than the global, maximum.
To prevent this problem from happening, spreg ml performs a grid search to find
suitable initial values for ρ and λ. To override the grid search, you may specify your
own initial values in the option from().

6.2 GS2SLS estimator


For discussions of the generalized method of moments and instrumental-variable estima-
tion approach underlying the GS2SLS estimator, see Arraiz et al. (2010) and Drukker,
Egger, and Prucha (2013). The articles build on Kelejian and Prucha (1998, 1999,
2010) and the references cited therein. For a detailed description of the formulas, see
also Drukker, Prucha, and Raciborski (2013).
The GS2SLS estimator requires instruments. Kelejian and Prucha (1998, 1999) suggest using as instruments H the linearly independent columns of

X, WX, . . . , W^q X, MX, MWX, . . . , MW^q X

where q = 2 has worked well in Monte Carlo simulations over a wide range of reasonable
specifications. The choice of those instruments provides a computationally convenient
approximation of the ideal instruments; see Lee (2003) and Kelejian, Prucha, and Yuze-
fovich (2004) for further discussions and refined estimators. At a minimum, the instru-
ments should include the linearly independent columns of X and MX. When there is a
constant in the model and thus X contains a constant term, the constant term is only
included once in H.
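As a concrete sketch (not part of the original text), with Mata matrices X (the exogenous regressors without the constant), W, and M already in memory, the q = 2 instrument set can be formed as follows; only its linearly independent columns, plus a single constant term, would actually be used.

. mata:
: WX  = W*X
: W2X = W*WX                                 // W^2 * X
: H   = (X, WX, W2X, M*X, M*WX, M*W2X)       // candidate instruments for q = 2
: end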

6.3 Spatial predictors


The spreg command provides for several unbiased predictors corresponding to different information sets, namely, {X, W, M} and {X, W, M, w_i y}, where w_i denotes the ith row of W; for a more detailed discussion and derivations, see Kelejian and Prucha (2007). Also in the following, x_i denotes the ith row of X and u_i denotes the ith element of u.

The unbiased predictor corresponding to information set {X, W, M} is given by

ŷ^{(1)} = (I − λW)^{-1}Xβ

and is called the reduced-form predictor. If λ = 0, then ŷ^{(1)} = Xβ. This predictor can be calculated by specifying statistic rform to predict after spreg.
When specified, the rftransform() option gives the name of a matrix in Mata memory that contains (I − λ̂W)^{-1}, that is, the matrix that transforms the model to its reduced form. This option is useful when computing many sets of reduced-form predictions from the same (I − λ̂W)^{-1} because it alleviates the need to recompute the inverse matrix.
Assuming that the innovations ε are distributed N(0, σ²I), the unbiased predictor corresponding to information set {X, W, M, w_i y} is given by

\[
\hat y_i^{(2)} = \lambda w_i y + x_i\beta
+ \frac{\operatorname{cov}(u_i,\, w_i y)}{\operatorname{var}(w_i y)}\bigl\{w_i y - E(w_i y)\bigr\}
\]

where

\[
\begin{aligned}
\Sigma_u &= (I - \rho M)^{-1}(I - \rho M^{T})^{-1}\\
\Sigma_y &= (I - \lambda W)^{-1}\,\Sigma_u\,(I - \lambda W^{T})^{-1}\\
E(w_i y) &= w_i (I - \lambda W)^{-1} X\beta\\
\operatorname{var}(w_i y) &= \sigma^2\, w_i \Sigma_y w_i^{T}\\
\operatorname{cov}(u_i,\, w_i y) &= \sigma^2\, \sigma_{iu} (I - \lambda W^{T})^{-1} w_i^{T}
\end{aligned}
\]

and σ_{iu} is the ith row of Σ_u.

We call this unbiased predictor the limited-information predictor because Kelejian and
Prucha (2007) consider a more efficient predictor, the full-information predictor. The
former can be calculated by specifying statistic limited to predict after spreg.

A further predictor considered in the literature is

ŷ_i = λw_i y + x_i β

However, as pointed out in Kelejian and Prucha (2007), this estimator is generally bi-
ased. While this biased predictor should not be used for predictions, it has uses as
an intermediate computation, and it can be calculated by specifying statistic naive to
predict after spreg.
The above predictors are computed by replacing the parameters in the prediction
formula with their estimates.
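For reference, the predictors discussed above map onto predict after spreg as follows (a minimal sketch; the new variable names are illustrative):

. predict y_rf                    // reduced-form predictor (the default, rform)
. predict y_li, limited           // limited-information predictor
. predict y_nv, naive             // biased quantity, useful only as an intermediate result
. predict y_xb, xb                // linear prediction X*beta-hat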

7 Conclusion
After reviewing some basic concepts related to SARAR models, we presented the spreg
ml and spreg gs2sls commands, which implement ML and GS2SLS estimators for the
parameters of these models. We also discussed postestimation prediction. In future
work, we would like to investigate further methods and commands for parameter inter-
pretation.

8 Acknowledgment
We gratefully acknowledge financial support from the National Institutes of Health
through the SBIR grants R43 AG027622 and R44 AG027622.

9 References
Abreu, M., H. L. F. De Groot, and R. J. G. M. Florax. 2004. Space and growth: A
survey of empirical evidence and methods. Working Paper TI 04-129/3, Tinbergen
Institute.

Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer
Academic Publishers.

———. 2003. Spatial externalities, spatial multipliers, and spatial econometrics. Inter-
national Regional Science Review 26: 153–166.

———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.

Anselin, L., and R. J. G. M. Florax. 1995. Small sample properties of tests for spatial
dependence in regression models: Some further results. In New Directions in Spatial
Econometrics, ed. L. Anselin and R. J. G. M. Florax, 21–74. Berlin: Springer.

Arbia, G. 2006. Spatial Econometrics: Statistical Foundations and Applications to
Regional Convergence. Berlin: Springer.

Arraiz, I., D. M. Drukker, H. H. Kelejian, and I. R. Prucha. 2010. A spatial Cliff-Ord-
type model with heteroskedastic innovations: Small and large sample results. Journal
of Regional Science 50: 592–614.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes: Models and Applications. London: Pion.

Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.

Drukker, D. M., P. Egger, and I. R. Prucha. 2013. On two-step estimation of a spatial
autoregressive model with autoregressive disturbances and endogenous regressors.
Econometric Reviews 32: 686–733.

Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2013. Creating and managing
spatial-weighting matrices with the spmat command. Stata Journal 13: 242–286.

Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013. A command for estimating
spatial-autoregressive models with spatial-autoregressive disturbances and additional
endogenous variables. Stata Journal 13: 287–301.

Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge
University Press.

Kelejian, H. H., and I. R. Prucha. 1998. A generalized spatial two-stage least squares
procedure for estimating a spatial autoregressive model with autoregressive distur-
bances. Journal of Real Estate Finance and Economics 17: 99–121.

———. 1999. A generalized moments estimator for the autoregressive parameter in a
spatial model. International Economic Review 40: 509–533.

———. 2007. The relative efficiencies of various predictors in spatial econometric models
containing spatial lags. Regional Science and Urban Economics 37: 363–374.

———. 2010. Specification and estimation of spatial autoregressive models with au-
toregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.

Kelejian, H. H., I. R. Prucha, and Y. Yuzefovich. 2004. Instrumental variable estimation
of a spatial autoregressive model with autoregressive disturbances: Large and small
sample results. In Spatial and Spatiotemporal Econometrics, ed. J. P. LeSage and
R. K. Pace, 163–198. New York: Elsevier.

Lee, L.-F. 2003. Best spatial two-stage least squares estimators for a spatial autoregres-
sive model with autoregressive disturbances. Econometric Reviews 22: 307–335.

———. 2004. Asymptotic distributions of quasi-maximum likelihood estimators for
spatial autoregressive models. Econometrica 72: 1899–1925.

LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton:
Chapman & Hall/CRC.

Powers, E. L., and J. K. Wilson. 2004. Access denied: The relationship between alcohol
prohibition and driving under the influence. Sociological Inquiry 74: 318–337.

Whittle, P. 1954. On stationary processes in the plane. Biometrika 41: 434–449.

About the authors


David Drukker is the director of econometrics at StataCorp.
Ingmar Prucha is a professor of economics at the University of Maryland.
Rafal Raciborski is an econometrician at StataCorp.
The Stata Journal (2013)
13, Number 2, pp. 242–286

Creating and managing spatial-weighting matrices with the spmat command

David M. Drukker, StataCorp, College Station, TX, [email protected]
Hua Peng, StataCorp, College Station, TX, [email protected]
Ingmar R. Prucha, Department of Economics, University of Maryland, College Park, MD, [email protected]
Rafal Raciborski, StataCorp, College Station, TX, [email protected]

Abstract. We present the spmat command for creating, managing, and storing spatial-weighting matrices, which are used to model interactions between spatial or more generally cross-sectional units. spmat can store spatial-weighting matrices in a general and banded form. We illustrate the use of the spmat command and discuss some of the underlying issues by using United States county and postal-code-level data.

Keywords: st0292, spmat, spatial-autoregressive models, Cliff–Ord models, spatial lag, spatial-weighting matrix, spatial econometrics, spatial statistics, cross-sectional interaction models, social-interaction models

1 Introduction
Building on Whittle (1954), Cliff and Ord (1973, 1981) developed statistical models
that not only accommodate forms of cross-unit correlation but also allow for explicit
forms of cross-unit interactions. The latter is a feature of interest in many social sci-
ence, biostatistical, and geographic science models. Following Cliff and Ord (1973,
1981), much of the original literature was developed to handle spatial interactions.
However, space is not restricted to geographic space, and many recent applications
use spatial techniques in other situations of cross-unit interactions, such as social-
interaction models and network models; see, for example, Kelejian and Prucha (2010)
and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still
includes the adjective “spatial”, and we continue this tradition to avoid confusion while
noting the wider applicability of these models. For texts and reviews, see, for example,
Anselin (1988, 2010), Arbia (2006), Cressie (1993), Haining (2003), and LeSage and Pace
(2009).
The models derived and discussed in the literature cited above model cross-unit
interactions and correlation in terms of spatial lags, which may involve the dependent
variable, the exogenous variables, and the disturbances. A spatial lag of a variable is



defined as a weighted average of observations on the variable over neighboring units. To illustrate, consider the rudimentary spatial-autoregressive (SAR) model

\[
y_i = \lambda \sum_{j=1}^{n} w_{ij}\, y_j + \varepsilon_i, \qquad i = 1, \ldots, n
\]

where y_i denotes the dependent variable corresponding to unit i, the w_ij (with w_ii = 0) are nonstochastic weights, ε_i is a disturbance term, and λ is a parameter. In the above model, the y_i are determined simultaneously. The weighted average Σ_{j=1}^n w_ij y_j, on the right-hand side, is called a spatial lag, and the w_ij are called the spatial weights. It often proves convenient to write the model in matrix notation as

y = λWy + ε

where

\[
y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix},\qquad
W = \begin{bmatrix}
0 & w_{12} & \cdots & w_{1,n-1} & w_{1n}\\
w_{21} & 0 & \cdots & w_{2,n-1} & w_{2n}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
w_{n-1,1} & w_{n-1,2} & \cdots & 0 & w_{n-1,n}\\
w_{n1} & w_{n2} & \cdots & w_{n,n-1} & 0
\end{bmatrix},\qquad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

Again the n × 1 vector Wy is typically referred to as the spatial lag in y and the n × n
matrix W as the spatial-weighting matrix. More generally, as indicated above, the
concept of a spatial lag can be applied to any variable, including exogenous variables
and disturbances, which—as can be seen in the literature cited above—provides for a
fairly general class of Cliff–Ord types of models.
Spatial-weighting matrices allow us to conveniently implement Tobler’s first law of
geography—“everything is related to everything else, but near things are more related
than distant things” (Tobler 1970, 236)—which applies whether the space is geographic,
biological, or social. The spmat command creates, imports, manipulates, and saves W
matrices. The matrices are stored in a spatial-weighting matrix object (spmat object).
The spmat object contains additional information about a spatial-weighting matrix,
such as the identification codes of the cross-section units, and other items discussed
below.1
The generic syntax of spmat is
spmat subcommand . . .
where each subcommand performs a specific task. Some subcommands create spmat ob-
jects from a Stata dataset (contiguity, idistance, dta), a Mata matrix (putmatrix),
or a text file (import). Other subcommands save objects to a disk (save, export) or
read them back in (use, import). Still other subcommands summarize spatial-weighting
1. We use the term “units” instead of “places” because spatial-econometric methods have been applied
to many cases in which the units of analysis are individuals or firms instead of geographical places;
for example, see Leenders (2002).
matrices (summarize); graph spatial-weighting matrices (graph); manage them (note,
drop, getmatrix); and perform computations on them (lag, eigenvalues). The re-
maining subcommands are used to change the storage format of the spatial-weighting
matrices inside the spmat objects. As discussed below, matrices stored inside spmat
objects can be general or banded, with general matrices occupying much more space
than banded ones. The subcommand permute rearranges the matrix elements, and the
subcommand tobanded is used to store a matrix in banded form.
spmat contiguity and spmat idistance create the frequently used inverse-distance
and contiguity spatial-weighting matrices; Haining (2003, 83) and Cliff and Ord (1981,
17) discuss typical formulations of weights matrices. The import and management ca-
pabilities allow users to create spatial-weighting matrices beyond contiguity and inverse-
distance matrices. Section 17.4 provides some discussion and examples.
Drukker, Prucha, and Raciborski (2013a, 2013b) discuss Stata commands that im-
plement estimators for SAR models. These commands use spatial-weighting matrices
previously created by the spmat command discussed in this article.
Before we describe individual subcommands in detail, we illustrate how to obtain
and transform geospatial data into the format required by spmat, and we address com-
putational problems pertinent to spatial-weighting matrices.

1.1 From shapefiles into Stata format


Many applications use geospatial data frequently made available in the form of shape-
files. Each shapefile is a pair of files: the database file and the coordinates file. The
database file contains data on the attributes of the spatial units, while the coordinates
file contains the geographical coordinates describing the boundaries of the spatial units.
In the common case where the units correspond to nonzero areas instead of points, the
boundary data in the coordinates file are stored as a series of irregular polygons.
The vast majority of geospatial data comes in the form of ESRI or MIF shapefiles.2
There are user-written tools for translating shapefiles to Stata’s .dta format and for
mapping spatial data. shp2dta (Crow 2006) and mif2dta (Pisati 2005) translate ESRI
and MIF shapefiles to Stata datasets. shp2dta and mif2dta translate the two files
that make up a shapefile to two Stata .dta files. The database file is translated to
the “attribute” .dta file, and the coordinates file is translated to the coordinates .dta
file.3,4

2. Refer to http://www.esri.com for details about the ESRI format and to http://www.pbinsight.com for details about the MIF format. The ESRI format is much more common.
3. shp2dta and mif2dta save the coordinates data in the format required by spmap (Pisati 2007),
which graphs data onto maps.
4. We use the term “attribute” instead of “database” because “database” does not adequately distin-
guish between attribute data and coordinates data.
The code below illustrates the use of shp2dta and spmap (Pisati 2007) on the county
boundaries data for the continental United States; Crow and Gould (2007) provide a
broader introduction to shapefiles, shp2dta, and spmap.
shp2dta, mif2dta, and spmap use a common set of conventions for defining the
polygons in the coordinates data translated from the coordinates file. Crow and Gould
(2007) discuss these conventions.
We downloaded tl_2008_us_county00.dbf and tl_2008_us_county00.shp, which are
the attribute file and the coordinates file, respectively, and which make up the shapefile
for U.S. counties from the U.S. Census Bureau.5 We begin by using shp2dta to translate
these files to the files county.dta and countyxy.dta.

. shp2dta using tl_2008_us_county00, database(county)
> coordinates(countyxy) genid(id) gencentroids(c)

county.dta contains the attribute information from the attribute file in the shape-
file, and countyxy.dta contains the coordinates data from the shapefile. The attribute
dataset county.dta has one observation per county on variables such as county name
and state code. Because we specified the option gencentroids(c), county.dta also
contains the variables x_c and y_c, which contain the coordinates of the county cen-
troids, measured in degrees. (See the help file for shp2dta for details and the x–y
naming convention.) countyxy.dta contains the coordinates of the county boundaries
in the long-form panel format used by spmap.6
Below we use use to read county.dta into memory and use destring (see [D] de-
string) to create a new, numeric state-code variable st from the original string state-
identifying variable STATEFP. Next we use drop to drop the observations defining the
coordinates of county boundaries in Alaska, Hawaii, and U.S. territories. Finally, we
use rename to rename the variables containing coordinates of the county centroids and
use save to save our changes into the county.dta dataset file.

. use county
. quietly destring STATEFP, generate(st)
. *keep continental US counties
. drop if st==2 | st==15 | st>56
(123 observations deleted)
. rename x_c longitude
. rename y_c latitude
. save county, replace
file county.dta saved

Having completed the translation and selected our subsample, we use spmap to draw
the map, given in figure 1, of the boundaries in the coordinates dataset.

5. Actually, we downloaded tl_2008_us_county00.zip from ftp://ftp2.census.gov/geo/tiger/TIGER2008/, and this .zip file contained the two files named in the text.
6. Crow and Gould (2007), the shp2dta help file, and the spmap help file provide more information
about the input and output datasets.
. spmap using countyxy, id(id)

Figure 1. County boundaries for the continental United States, 2000

1.2 Memory considerations


The spatial-weighting matrix for the n units is an n × n matrix, which implies that
memory requirements increase quadratically with data size. For example, a contiguity
matrix for the 31,713 U.S. postal codes (five-digit zip codes) is a 31,713 × 31,713 matrix,
which requires 31,713 × 31,713 × 8/2^30 ≈ 7.5 gigabytes of storage space.
Many users do not have this much memory on their machines. However, it is usually
possible to store spatial-weighting matrices more efficiently. Drukker et al. (2011) dis-
cuss how to judiciously reorder the observations so that many spatial-weighting matrices
can be stored as banded matrices, thereby using less space than general matrices.
This subsection describes banded matrices and the potential benefits of using banded
matrices for storing spatial-weighting matrices. If you do not have large datasets, you
A banded matrix is a matrix whose nonzero elements are confined to a diagonal band
that comprises the main diagonal, zero or more diagonals above the main diagonal, and
zero or more diagonals below the main diagonal. The number of diagonals above the
main diagonal that contain nonzero elements is the upper bandwidth, say, bU . The
number of diagonals below the main diagonal that contain nonzero elements is the
lower bandwidth, say, bL . An example of a banded matrix having an upper bandwidth
of 1 and a lower bandwidth of 2 is
$$
\begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix}
$$
We can save a lot of space by storing only the elements in the diagonal band because
the elements outside the band are 0 by construction. Using this information, we can
efficiently store this matrix without any loss of information as
$$
\begin{bmatrix}
0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{bmatrix}
$$
The above matrix only contains the elements of the diagonals with nonzero elements.
To store the elements in a rectangular array, we added zeros as necessary. The row
dimension of the banded matrix is the upper bandwidth plus the lower bandwidth plus
1, or b = bU + bL + 1. We will use the b × n shorthand to refer to the dimensions of
banded matrices.
Banded matrices require less storage space than general matrices. The spmat suite
provides tools for creating, storing, and manipulating banded matrices. In addition,
computing an operation on a banded matrix is much faster than on a general matrix.
Drukker et al. (2011) show that many spatial-weighting matrices have a banded
structure after an appropriate reordering. In particular, a banded structure is often
attained by sorting the data in an ascending order of the distance from a well-chosen
place. In section 5, we will illustrate this method with data on U.S. counties and U.S.
five-digit zip codes. In the case of U.S. five-digit zip codes, we show how to create a
contiguity matrix with upper and lower bandwidths of 913. This allows us to store
the data in a 1,827 × 31,713 matrix, which requires only 1,827 × 31,713 × 8/2^30 ≈ 0.43
gigabytes instead of the 7.5 gigabytes required for the general matrix.
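These storage figures are easy to reproduce; typing

. display 31713 * 31713 * 8 / 2^30
. display  1827 * 31713 * 8 / 2^30

in Stata returns approximately 7.5 and 0.43, respectively.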
We are now ready to describe the spmat subcommands.
2 Creating a contiguity matrix from geospatial data


2.1 Syntax
spmat contiguity objname [if] [in] using coordinates_file, id(varname) [options]

options                        Description
normalize(norm)                control the normalization method
rook                           require that two units share a common border instead of just a common point to be neighbors
banded                         store the matrix in the banded format
replace                        replace existing spmat object
saving(filename [, replace])   save the neighbor list to a file
nomatrix                       suppress creation of spmat object
tolerance(#)                   use # when determining whether units share a common border

2.2 Description
spmat contiguity computes a contiguity or normalized-contiguity matrix from a coor-
dinates dataset containing a polygon representation of geospatial data. More precisely,
spmat contiguity constructs a contiguity matrix or normalized-contiguity matrix from
the boundary information in a coordinates dataset and puts it into the new spmat object
objname.
In a contiguity matrix, contiguous units are assigned weights of 1, and noncontiguous
units are assigned weights of 0. Contiguous units are known as neighbors.
spmat contiguity uses the polygon data in coordinates file to determine the neigh-
bors of each unit. The coordinates file must be a Stata dataset containing the polygon
information in the format produced by shp2dta and mif2dta. Crow and Gould (2007)
discuss the conventions used to represent polygons in the Stata datasets created by
these commands.

2.3 Options
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. (shp2dta and mif2dta name this ID variable ID.) id() is required.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
rook specifies that only units that share a common border be considered neighbors (edge
or rook contiguity). The default is queen contiguity, which treats units that share a
common border or a single common point as neighbors. Computing rook-contiguity
matrices is more computationally intensive than the default queen-contiguity com-
putation.7
banded requests that the new matrix be stored in a banded form. The banded matrix
is constructed without creating the underlying n × n representation.
replace permits spmat contiguity to overwrite an existing spmat object.
saving(filename [, replace]) saves the neighbor list to a space-delimited text file.
The first line of the file contains the number of units and, if applicable, bands; each
remaining line lists a unit identification code followed by the identification codes of
units that share a common border, if any. You can read the file back into an spmat
object with spmat import ..., nlist. replace allows filename to be overwritten
if it already exists.
nomatrix specifies that the spmat object objname and spatial-weighting matrix W not
be created. In conjunction with saving(), this option allows for creating a text
file containing a neighbor list without allocating space for the underlying contiguity
matrix.
tolerance(#) specifies the numerical tolerance used in deciding whether two units are
edge neighbors. The default is tolerance(1e-7).

2.4 Examples
As discussed above, spatial-weighting matrices are used to compute weighted averages
in which more weight is placed on nearby observations than on distant observations.
While Haining (2003, 83) and Cliff and Ord (1981, 17) discuss formulations of weights
matrices, contiguity and inverse-distance matrices are the two most common spatial-
weighting matrices.

7. These definitions for rook neighbor and queen neighbor are commonly used; see, for example,
Lai, So, and Chan (2009). (As many readers will recognize, the “rook” and “queen” terminology
arises by analogy with chess, in which a rook may only move across sides of squares, whereas a
queen may also move diagonally.)
In geospatial-type applications, researchers who want a contiguity matrix need to
perform a series of complicated calculations on the boundary information in a coordi-
nates dataset to identify the neighbors of each unit. spmat contiguity performs these
calculations and stores the resulting weights matrix in an spmat object.
In contrast, some social-network datasets begin with a list of neighbors instead of
the boundary information found in geospatial data. Section 15 discusses how to create
a social-network matrix from a list of neighbors.

Example

We continue the example from section 1.1 and assume both of the Stata datasets
created in section 1.1 are in the current working directory. After loading the attribute
dataset into memory, we create the spmat object ccounty containing a normalized-
contiguity matrix for U.S. counties by typing

. use county, clear
. spmat contiguity ccounty using countyxy, id(id) normalize(minmax)

We use spmat summarize, discussed in section 4, to summarize the contents of the
spatial-weighting matrix in the ccounty object we created above:

. spmat summarize ccounty, links


Summary of spatial-weighting object ccounty

Matrix Description

Dimensions 3109 x 3109


Stored as 3109 x 3109
Links
total 18474
min 1
mean 5.942104
max 14

The table shows basic information about the normalized contiguity matrix, including
the dimensions of the matrix and its storage. The number of neighbors found is reported
as 18,474, with each county having 6 neighbors on average.
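The other documented options can be combined in the same call. For instance, the following command (with an object name and a file name of our own choosing) would build an edge-contiguity (rook) version of the same matrix and also write the neighbor list to a text file:

. spmat contiguity ccounty_rook using countyxy, id(id) normalize(minmax) rook
>      saving(county_nbrs.txt, replace)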

2.5 Normalization details


In this section, we present details about the normalization methods.8 In each case,
the normalized matrix $\widetilde{\mathbf{W}} = (\widetilde{w}_{ij})$ is computed from the underlying matrix $\mathbf{W} = (w_{ij})$,
where the elements are assumed to be nonnegative; see, for example, Kelejian and
Prucha (2010) for an introduction to the use and interpretation of these normalization
methods.
8. The normalization methods are not restricted to contiguity matrices.
In a row-normalized matrix, the $(i,j)$th element of $\widetilde{\mathbf{W}}$ becomes $\widetilde{w}_{ij} = w_{ij}/r_i$, where $r_i$ is the sum of the $i$th row of $\mathbf{W}$. After row normalization, each row of $\widetilde{\mathbf{W}}$ will sum to 1. Row normalizing a symmetric $\mathbf{W}$ produces an asymmetric $\widetilde{\mathbf{W}}$ except in very special cases. Kelejian and Prucha (2010) point out that normalizing by a vector of row sums needs to be guided by theory.
In a minmax-normalized matrix, the $(i,j)$th element of $\widetilde{\mathbf{W}}$ becomes $\widetilde{w}_{ij} = w_{ij}/m$, where $m = \min\{\max_i(r_i), \max_i(c_i)\}$, with $\max_i(r_i)$ being the largest row sum of $\mathbf{W}$ and $\max_i(c_i)$ being the largest column sum of $\mathbf{W}$. Normalizing by a scalar preserves symmetry and the basic model specification.
In a spectral-normalized matrix, the $(i,j)$th element of $\widetilde{\mathbf{W}}$ becomes $\widetilde{w}_{ij} = w_{ij}/v$, where $v$ is the largest of the moduli of the eigenvalues of $\mathbf{W}$. As for the minmax norm, normalizing by a scalar preserves symmetry and the basic model specification.
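As a small illustration of these formulas (this is not the code spmat itself uses), the three normalizations can be computed directly in Mata for any nonnegative matrix W with no zero rows:

mata:
W = (0, 1, 1 \ 1, 0, 0 \ 1, 0, 0)              // toy nonnegative weighting matrix
Wrow  = W :/ rowsum(W)                          // row normalization: rows sum to 1
m     = min((max(rowsum(W)), max(colsum(W))))   // minmax scalar
Wmm   = W / m
v     = max(abs(eigenvalues(W)))                // modulus of the largest eigenvalue
Wspec = W / v
end

Note that a row of zeros (an island) would make the row normalization above divide by zero; the toy matrix has none.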

3 Creating an inverse-distance matrix from data


3.1 Syntax
spmat idistance objname cvarlist [if] [in], id(varname) [options]
where cvarlist is the list of coordinate variables.


options                         Description
dfunction(function [, miles])   specify the distance function
normalize(norm)                 specify the normalization method
truncmethod                     specify the truncation method
banded                          store the matrix in the banded format
replace                         replace an existing spmat object

where function is one of euclidean, rhaversine, dhaversine, or p;
miles may only be specified with rhaversine or dhaversine; and
truncmethod is one of btruncate(b B), dtruncate(dL dU), or vtruncate(v).

3.2 Description
An inverse-distance spatial-weighting matrix is composed of weights that are inversely
related to the distances between the units. spmat idistance uses the coordinate vari-
ables from the attribute data in memory and the specified distance measure to compute
the distances between units, to create an inverse-distance spatial-weighting matrix, and
to store the result in an spmat object.
3.3 Options
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
dfunction(function [, miles]) specifies the distance function. function may be one of
euclidean (default), dhaversine, rhaversine, or the Minkowski distance of order
p, where p is an integer greater than or equal to 1.
When the default dfunction(euclidean) is specified, a Euclidean distance measure
is applied to the coordinate variable list cvarlist.
When dfunction(rhaversine) or dfunction(dhaversine) is specified, the haver-
sine distance measure is applied to the two coordinate variables cvarlist. (The
first coordinate variable must specify longitude, and the second coordinate vari-
able must specify latitude.) The coordinates must be in radians when rhaversine
is specified. The coordinates must be in degrees when dhaversine is specified.
The haversine distance measure is calculated in kilometers by default. Specify
dfunction(rhaversine, miles) or dfunction(dhaversine, miles) if you want
the distance returned in miles.
When dfunction(p) (p is an integer) is specified, a Minkowski distance measure of
order p is applied to the coordinate variable list cvarlist.
The formulas for the distance measure are discussed in section 3.5.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
truncmethod options specify one of the three truncation criteria. The values of the
spatial-weighting matrix W that meet the truncation criterion will be changed to 0.
Only apply truncation methods when supported by theory.
btruncate(b B) partitions the values of W into B equal-length bins and truncates
to 0 entries that fall into bin b or below, b < B.
dtruncate(dL dU ) truncates to 0 the values of W that fall more than dL diagonals
below and dU diagonals above the main diagonal. Neither value can be greater than
(cols(W)−1)/4.9
vtruncate(v) truncates to 0 the values of W that are less than or equal to v.
See section 3.6 for more details about the truncation options.

9. This limit ensures that a cross product of the spatial-weighting matrix is stored more efficiently in
banded form than in general form. The limit is based on the cross product instead of the matrix
itself because the generalized spatial two-stage least-squares estimators use cross products of the
spatial-weighting matrices.
banded requests that the new matrix be stored in a banded form. The banded matrix is
constructed without creating the underlying n×n representation. Note that without
banded, a matrix with truncated values will still be stored in an n × n form.
replace permits spmat idistance to overwrite an existing spmat object.

3.4 Examples
As discussed above, spatial-weighting matrices are used to compute weighted averages
in which more weight is placed on nearby observations than on distant observations.
Haining (2003, 83) and Cliff and Ord (1981, 17) discuss formulations of weights matri-
ces, contiguity matrices, inverse-distance matrices, and combinations thereof.
In inverse-distance spatial-weighting matrices, the weights are inversely related to the
distances between the units. spmat idistance provides several measures for calculating
the distances between the units.
The coordinates may or may not be geospatial. Distances between geospatial units
are commonly computed from the latitudes and longitudes of unit centroids.10 Social
distances are frequently computed from individual-person attributes.
In much of the literature, the attributes are known as coordinates because the nomen-
clature has developed around the common geospatial case in which the attributes are
map coordinates. For ease of use, spmat idistance follows this convention and refers
to coordinates, even though coordinate variables specified in cvarlist need not be spatial
coordinates.
The (i, j)th element of an inverse-distance spatial-weighting matrix is 1/dij , where
dij is the distance between unit i and j computed from the specified coordinates and dis-
tance measure. Creating spatial-weighting matrices with elements of the form 1/f (dij ),
where f (·) is some function, is described in section 17.4.

Example

county.dta from section 1.1 contains the coordinates of the centroids of each county,
measured in degrees, in the variables longitude and latitude. To get a feel for the
data, we create an unnormalized inverse-distance spatial-weighting matrix, store it in
the spmat object dcounty, and summarize it by typing

10. The word “centroid” in the literature on geographic information systems differs from the standard
term in geometry. In the geographic information systems literature, a centroid is a weighted average
of the vertices of a polygon that approximates the center of the polygon; see Waller and Gotway
(2004, 44–45) for the formula and some discussion.
. spmat idistance dcounty longitude latitude, id(id) dfunction(dhaversine)
. spmat summarize dcounty
Summary of spatial-weighting object dcounty

Matrix Description

Dimensions 3109 x 3109


Stored as 3109 x 3109
Values
min 0
min>0 .0002185
mean .0012296
max 1.081453

From the summary table, we can see that the centroids of the two closest counties
lie within less than one kilometer of each other (1/1.081453), while the two most distant
counties are 4,577 kilometers apart (1/0.0002185).
Below we compute a minmax-normalized inverse-distance matrix, store it in the
spmat object dcounty2, and summarize it by typing
. spmat idistance dcounty2 longitude latitude, id(id) dfunction(dhaversine)
> normalize(minmax)
. spmat summarize dcounty2
Summary of spatial-weighting object dcounty2

Matrix Description

Dimensions 3109 x 3109


Stored as 3109 x 3109
Values
min 0
min>0 .0000382
mean .0002151
max .1892189

3.5 Distance calculation details


Specifying q variables in the list of coordinate variables cvarlist implies that the units
are located in a q-dimensional space. This space may or may not be geospatial. Let the
q variables in the list of coordinate variables cvarlist be x1 , x2 , . . . , xq , and denote the
coordinates of observation i by (x1 [i], x2 [i], . . . , xq [i]).
The default behavior of spmat idistance is to calculate the Euclidean distance
between units s and t, which is given by
$$
d_{st} = \sqrt{\sum_{j=1}^{q} \left( x_j[s] - x_j[t] \right)^2}
$$
for observations s and t.
The Minkowski distance of order p is given by
$$
d_{st} = \sqrt[p]{\sum_{j=1}^{q} \left| x_j[s] - x_j[t] \right|^{p}}
$$
for observations s and t. When p = 2, the Minkowski distance is equivalent to the
Euclidean distance.
The haversine distance measure is useful when the units are located on the surface
of the earth and the coordinate variables represent the geographical coordinates of
the spatial units. In such cases, we usually wish to calculate a spherical (great-circle)
distance between the spatial units. This is accomplished by the haversine formula given
by
$$
d_{st} = r \times c
$$
where
$$
\begin{aligned}
r &= \text{the mean radius of the Earth (6,371.009 km or 3,958.761 miles)} \\
c &= 2 \arcsin\bigl\{\min\bigl(1, \sqrt{a}\bigr)\bigr\} \\
a &= \sin^2 \phi + \cos(\phi_1)\cos(\phi_2)\sin^2 \lambda \\
\phi &= \tfrac{1}{2}(\phi_2 - \phi_1) = \tfrac{1}{2}\bigl(x_2[t] - x_2[s]\bigr) \\
\lambda &= \tfrac{1}{2}(\lambda_2 - \lambda_1) = \tfrac{1}{2}\bigl(x_1[t] - x_1[s]\bigr)
\end{aligned}
$$
$x_1[s]$ and $x_1[t]$ are the longitudes of point $s$ and point $t$, respectively, and $x_2[s]$ and $x_2[t]$ are the latitudes of point $s$ and point $t$, respectively.

Specify dfunction(dhaversine) to compute haversine distances from coordinates
in degrees, and specify dfunction(rhaversine) to compute haversine distances from
coordinates in radians. Both dfunction(dhaversine) and dfunction(rhaversine)
by default use r = 6,371.009 to compute results in kilometers. To compute haversine
distances in miles, with r = 3,958.761, instead specify dfunction(dhaversine, miles)
or dfunction(rhaversine, miles).
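For readers who want to check a single distance by hand, the following Mata function is a minimal sketch of the formula above for coordinates in degrees; it mirrors the computation but is not the code used by spmat idistance.

mata:
// Haversine distance in kilometers between (lon1,lat1) and (lon2,lat2),
// with coordinates in degrees; use r = 3958.761 for miles.
real scalar dhav(real scalar lon1, real scalar lat1, real scalar lon2, real scalar lat2)
{
    real scalar r, phi, lambda, a, c

    r      = 6371.009
    phi    = (lat2 - lat1)*pi()/360             // (1/2)(phi_2 - phi_1), in radians
    lambda = (lon2 - lon1)*pi()/360             // (1/2)(lambda_2 - lambda_1), in radians
    a      = sin(phi)^2 + cos(lat1*pi()/180)*cos(lat2*pi()/180)*sin(lambda)^2
    c      = 2*asin(min((1, sqrt(a))))
    return(r*c)
}
end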
3.6 Truncation details


Unlike contiguity matrices, inverse-distance matrices cannot naturally yield a banded
structure because the off-diagonal elements are never exactly 0. Consider an example in
which we have nine units arranged on the real line with x denoting the unit locations.

. use truncex, clear
. list

id x

1. 1 0
2. 2 1
3. 3 2
4. 4 503
5. 5 504

6. 6 505
7. 7 1006
8. 8 1007
9. 9 1008

The units are grouped into three clusters. The units belonging to the same cluster
are close to one another, while the distance between the units belonging to different
clusters is large. For real-world data, the units may represent, for example, cities in
different states. We use spmat idistance to create the spmat object ex from the data:

. spmat idistance ex x, id(id)

The resulting spatial-weighting matrix of inverse distances is
$$
\begin{bmatrix}
0 & 1 & 0.5 & 0.00199 & 0.00198 & 0.00198 & 0.00099 & 0.00099 & 0.00099 \\
1 & 0 & 1 & 0.00199 & 0.00199 & 0.00198 & 0.001 & 0.00099 & 0.00099 \\
0.5 & 1 & 0 & 0.002 & 0.00199 & 0.00199 & 0.001 & 0.001 & 0.00099 \\
0.00199 & 0.00199 & 0.002 & 0 & 1 & 0.5 & 0.00199 & 0.00198 & 0.00198 \\
0.00198 & 0.00199 & 0.00199 & 1 & 0 & 1 & 0.00199 & 0.00199 & 0.00198 \\
0.00198 & 0.00198 & 0.00199 & 0.5 & 1 & 0 & 0.002 & 0.00199 & 0.00199 \\
0.00099 & 0.001 & 0.001 & 0.00199 & 0.00199 & 0.002 & 0 & 1 & 0.5 \\
0.00099 & 0.00099 & 0.001 & 0.00198 & 0.00199 & 0.00199 & 1 & 0 & 1 \\
0.00099 & 0.00099 & 0.00099 & 0.00198 & 0.00198 & 0.00199 & 0.5 & 1 & 0
\end{bmatrix}
$$
Theoretical considerations may suggest that the weights should actually be 0 below
a certain threshold. For example, choosing the threshold value of 1/500 = 0.002 for our
matrix results in the following structure:
$$
\begin{bmatrix}
0 & 1 & 0.5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0.5 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0.5 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0.5 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0.5 & 1 & 0
\end{bmatrix}
$$
Now the matrix with the truncated values can be stored more efficiently in a banded
form:
$$
\begin{bmatrix}
0 & 0 & 0.5 & 0 & 0 & 0.5 & 0 & 0 & 0.5 \\
0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0 \\
0.5 & 0 & 0 & 0.5 & 0 & 0 & 0.5 & 0 & 0
\end{bmatrix}
$$
spmat idistance provides tools for truncating the values of an inverse-distance ma-
trix and storing the truncated matrix in a banded form. Like spmat contiguity, spmat
idistance is capable of creating a banded matrix without creating the underlying n × n
representation of the matrix. The user must specify a theoretically justified truncation
criterion for such an application.
Here we illustrate how one could apply each of the truncation methods mentioned
in section 3.3 to our hypothetical inverse-distance matrix. The most natural way is to
use value truncation. In the code below, we create a new spmat object ex1 with the
values of W that are less than or equal to 1/500 set to 0.11 We also request that W be
stored in banded form.
. spmat idistance ex1 x, id(id) banded vtruncate(1/500)

The same outcome can be achieved with bin truncation. In bin truncation, we find
the maximum value in W denoted by m, divide the interval (0, m] into B bins of equal
length, and then truncate to 0 elements that fall into bins 1, . . . , b; see Bin truncation
details below for a more technical description. In our hypothetical matrix, the largest
element of W is 1. If we divide the values in W into three bins, the bins will be defined
by (0, 1/3], (1/3, 2/3], (2/3, 1]. The values we wish to round to 0 fall into the first bin.
11. vtruncate() accepts any expression that evaluates to a number.
In the code below, we create a new spmat object ex2 with the values of W that fall
into the first bin set to 0. We also request that W be stored in banded form.

. spmat idistance ex2 x, id(id) banded btruncate(1 3)

Diagonal truncation is not based on value comparison; therefore, in general, we will
not be able to replicate exactly the results obtained with bin or value truncation. In
the code below, we create a new spmat object ex3 with the values of W that fall more
than two diagonals below and above the main diagonal set to 0. We also request that
W be stored in banded form.

. spmat idistance ex3 x, id(id) banded dtruncate(2 2)

The resulting matrix based on diagonal truncation is shown below. No values in W
have been changed; instead, we copied the requested elements from W and stored them
in banded form, padding the banded format with 0s when necessary (see section 1.2).
Diagonal truncation can be hard to justify on a theoretical basis. It can retain
irrelevant neighbors, as in this example, or it can wipe out relevant ones. Its use
should be limited to situations in which one has a good knowledge of the underlying
structure of the spatial-weighting matrix. Bin or value truncation will generally be
easier to apply.
$$
\begin{bmatrix}
0 & 0 & 0.5 & 0.00199 & 0.00199 & 0.5 & 0.00199 & 0.00199 & 0.5 \\
0 & 1 & 1 & 0.002 & 1 & 1 & 0.002 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0.002 & 1 & 1 & 0.002 & 1 & 1 & 0 \\
0.5 & 0.00199 & 0.00199 & 0.5 & 0.00199 & 0.00199 & 0.5 & 0 & 0
\end{bmatrix}
$$
A word of warning: While truncation leads to matrices that can be stored more
efficiently, truncation should only be applied if supported by theory. Ad hoc truncation
may lead to misspecification of the model and, subsequently, to inconsistent inference.

Bin truncation details

Formally, letting m be the largest element in W, btruncate(b B ) divides the interval
(0, m] into B equal-length subintervals and sets elements in W whose value falls in the b smallest subintervals to 0. We partition the interval (0, m] into B intervals $(a_{kL}, a_{kU}]$, where $k \in \{1, \ldots, B\}$, $a_{kL} = (k-1)m/B$, and $a_{kU} = km/B$. We set $w_{ij} = 0$ if $w_{ij} \in (a_{kL}, a_{kU}]$ for $k \leq b$.
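Expressed in Mata, the rule amounts to zeroing every element less than or equal to bm/B; a minimal sketch (again, not spmat's internal code) is

mata:
// Bin truncation: with B equal-length bins on (0, m], set to 0 every
// element of W that falls in bins 1, ..., b.
real matrix bintrunc(real matrix W, real scalar b, real scalar B)
{
    real scalar m
    m = max(W)
    return(W :* (W :> b*m/B))
}
end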
4 Summarizing an existing spatial-weighting matrix


4.1 Syntax
spmat summarize objname [, links detail {banded | truncmethod}]
where truncmethod is one of btruncate(b B), dtruncate(dL dU), or vtruncate(v).

4.2 Description
spmat summarize reports summary statistics about the elements in the spatial-weight-
ing matrix in the existing spmat object objname.

4.3 Options
links is useful when objname contains a contiguity or a normalized-contiguity matrix.
Rather than the default summary of the values in the spatial-weighting matrix,
links causes spmat summarize to summarize the number of neighbors.
detail requests a tabulation of links for a contiguity or a normalized-contiguity matrix.
The values of the identifying variable with the minimum and maximum number of
links will be displayed.
banded reports the bands for the matrix that already has a (possibly) banded structure
but is stored in an n × n form.
truncmethods are useful when you want to see summary statistics calculated on a spatial-
weighting matrix after some elements have been truncated to 0. spmat summarize
with a truncmethod will report the lower and upper band based on a matrix to
which the specified truncation criterion has been applied. (Note: No data are ac-
tually changed by selecting these options. These options only specify that spmat
summarize calculate results as if the requested truncation criterion has been ap-
plied.)
btruncate(b B) partitions the values of W into B bins and truncates to 0 entries
that fall into bin b or below.
dtruncate(dL dU ) truncates to 0 the values of W that fall more than dL diagonals
below and dU diagonals above the main diagonal. Neither value can be greater than
(cols(W)−1)/4.
vtruncate(v) truncates to 0 the values of W that are less than or equal to v.
4.4 Saved results


Only spmat summarize returns saved results.
Let Wc be a contiguity or normalized-contiguity matrix and Wd be an inverse-
distance matrix.
spmat summarize saves the following results in r():
Scalars
r(b) number of rows in W r(lmean) mean number of
r(n) number of columns in W neighbors in Wc
r(lband) lower band, if W is banded r(lmax) maximum number of
r(uband) upper band, if W is banded neighbors in Wc
r(min) minimum of Wd r(ltotal) total number of neighbors in Wc
r(min0) minimum element > 0 in Wd r(eig) 1 if object contains
r(mean) mean of Wd eigenvalues, 0 otherwise
r(max) maximum of Wd r(canband) 1 if object can be banded
r(lmin) minimum number of based on r(lband) and
neighbors in Wc r(uband), 0 otherwise

4.5 Examples
It is generally useful to know some summary statistics for the elements of your spatial-
weighting matrices. In sections 2.4 and 3.4, we used spmat summarize to report sum-
mary statistics for spatial-weighting matrices.
Many spatial-weighting matrices contain many elements that are not 0 but are very
small. At times, theoretical considerations such as threshold effects suggest that these
small weights should be truncated to 0. In these cases, you might want to summarize
the elements of the spatial-weighting subject to different truncation criteria as part of
some sensitivity analysis.

Example

In section 3.4, we stored an unnormalized inverse-distance spatial-weighting matrix
in the spmat object dcounty. In this example, we find the summary statistics of the
elements of a truncated version of this matrix.
For each county, we set to 0 the weights for counties whose centroids are farther
than 250 km from the centroid of that county. Because we are operating on inverse
distances, we specify 1/250 as the truncation criterion. The summary of the matrix
calculated after applying the truncation criterion is reported in the Truncated matrix
column. We can see that the minimum nonzero element is now reported as 0.004.
. spmat summarize dcounty, vtruncate(1/250)


Summary of spatial-weighting object dcounty

Current matrix Truncated matrix

Dimensions 3109 x 3109 3109 x 3109


Bands (3098, 3098)
Values
min 0 0
min>0 .0002185 .004
mean .0012296 .0002583
max 1.081453 1.081453

The Bands row reports the lower and upper bands with nonzero values. Those values
tell us whether the matrix can be stored in banded form. As mentioned in section 3.3,
neither value can be greater than (cols(W)−1)/4. In our case, the maximum value for the bands is (3,109 − 1)/4 = 777; therefore, if we truncated the values of the matrix
according to our criterion, we would not be able to store the matrix in banded form.12
In section 5.1, we show how we can use the sorting tricks of Drukker et al. (2011) to
store this matrix in banded form.
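In practice, the check can be automated by using the saved results listed in section 4.4; for example (output omitted),

. spmat summarize dcounty, vtruncate(1/250)
. display "can be banded: " r(canband) ";  bands: (" r(lband) ", " r(uband) ")"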

5 Examples of banded matrices


Thus far we have claimed that banded matrices are useful when handling spatial-
weighting matrices, but we have not yet substantiated this point. To illustrate the
usefulness of storing spatial-weighting matrices in a banded-matrix form, we revisit the
U.S. counties data and introduce U.S. five-digit zip code data.

12. In practice, rather than calculating the maximum value for bands by hand, we would use the
r(canband), r(lband), and r(uband) scalars returned by spmat summarize; see section 4.4 for
details.
5.1 U.S. county data revisited


Recall that at the moment, we have the neighbor information stored in the spmat object
ccounty. We use spmat graph, discussed in section 7, to produce an intensity plot of
the n × n normalized contiguity matrix contained in the object ccounty by typing

. spmat graph ccounty, blocks(10)

Figure 2 shows that zero and nonzero entries are scattered all over the matrix.
This pattern arises because the original shapefiles had the counties sorted in an order
unrelated to their distances from a common point.
Figure 2. Normalized contiguity matrix for unsorted U.S. counties (intensity plot; axes show matrix rows and columns)

We can store the normalized contiguity matrix more efficiently if we generate a
variable containing the distance from a particular place to all the other places and then
sort the data in an ascending order according to this variable.13 We implement this
method in four steps: 1) we sort the county data by the longitude and latitude of the
county centroids contained in longitude and latitude, respectively, so that the first
observation will be the corner observation of Curry County, OR; 2) we calculate the
distance of each county from the corner county in the first observation; 3) we sort on
the variable containing the distances calculated in step 2; and 4) we recompute and
summarize the normalized contiguity matrix.

13. For best results, pick a place located in a remote corner of the map; see Drukker et al. (2011) for
further details.
. sort longitude latitude
. generate double dist =
> sqrt( (longitude-longitude[1])^2 + (latitude-latitude[1])^2 )
. sort dist
. spmat contiguity ccounty2 using countyxy, id(id) normalize(minmax) banded
> replace
. spmat summ ccounty2, links
Summary of spatial-weighting object ccounty2

Matrix Description

Dimensions 3109 x 3109


Stored as 465 x 3109
Links
total 18474
min 1
mean 5.942104
max 14

. spmat graph ccounty2, blocks(10)

Specifying the option banded in the spmat contiguity command caused the con-
tiguity matrix to be stored as a banded matrix. The summary table shows that the
contiguity information is now stored in a 465 × 3,109 matrix, which requires much
less space than the original 3,109 × 3,109 matrix. Figure 3 clearly shows the banded
structure.
Figure 3. Normalized contiguity matrix for sorted U.S. counties (intensity plot; axes show matrix rows and columns)
Similarly, we can re-create the dcounty object calculated on the sorted data and
see whether the inverse-distance matrix can be stored in banded form after applying a
truncation criterion.

. spmat idistance dcounty longitude latitude, id(id) dfunction(dhaversine)
> vtruncate(1/250) banded replace
. spmat summ dcounty
Summary of spatial-weighting object dcounty

Matrix Description

Dimensions 3109 x 3109


Stored as 769 x 3109
Values
min 0
min>0 .004
mean .0002583
max 1.081453

We can see that the Values summary for this matrix and the matrix from section 4.5
is the same; however, the matrix in this example is stored in banded form.

5.2 U.S. zip code data


The real power of banded storage is unveiled when we lack the memory to store spatial
data in an n × n matrix. We use the five-digit zip code level data for the continental
United States.14 We have information on 31,713 five-digit zip codes, and as was men-
tioned in section 1.2, we need 7.5 gigabytes of memory to store the normalized contiguity
matrix as a general matrix.

14. Data are from the U.S. Census Bureau at ftp://ftp2.census.gov/geo/tiger/TIGER2008/.
Instead, we repeat the sorting trick and call spmat contiguity with option banded,
hoping that we will be able to fit the banded representation into memory.

. use zip5, clear
. *keep continental US zip codes
. drop if latitude > 49.5 | latitude < 24.5 | longitude < -124
(524 observations deleted)
. sort longitude latitude
. generate double dist =
> sqrt( (longitude-longitude[1])^2 + (latitude-latitude[1])^2 )
. sort dist
. spmat contiguity zip5 using zip5xy, id(id) normalize(minmax) banded
warning: spatial-weighting matrix contains 131 islands
. spmat summarize zip5, links
Summary of spatial-weighting object zip5

Matrix Description

Dimensions 31713 x 31713


Stored as 1827 x 31713
Links
total 166906
min 0
mean 5.263015
max 26

warning: spatial-weighting matrix contains 131 islands

The output from spmat summarize indicates that the normalized contiguity matrix
is stored in a 1,827 × 31,713 matrix. This fits into less than half a gigabyte of memory!
All we did to store the matrix in a banded format was change the sort order of the data
and specify the banded option. We discuss storing an existing n × n spatial-weighting
matrix in banded form in sections 18.1 and 18.2.
Having illustrated the importance of banded matrices, we return to documenting
the spmat commands.

6 Inserting documentation into your spmat objects


6.1 Syntax
spmat note objname [{ : "text" [, replace] | drop }]

6.2 Description
spmat note creates and manipulates a note attached to the spmat object.
6.3 Options
replace causes spmat note to overwrite the existing note with a new one.
drop causes spmat note to clear the note associated with objname.

6.4 Examples
If you plan to use a spatial-weighting matrix outside a given do-file or session, you
should attach some documentation to the spmat object.
spmat note stores the note in a string scalar; however, it is possible to store multiple
notes in the scalar by repeatedly appending notes.

Example

We attach a note to the spmat object ccounty and then display it by typing
. spmat note ccounty : "Source: Tiger 2008 county files."
. spmat note ccounty
Source: Tiger 2008 county files.

As mentioned, we can have multiple notes:


. spmat note ccounty : "Created on 18jan2011."
. spmat note ccounty
Source: Tiger 2008 county files. Created on 18jan2011.

7 Plotting the elements of a spatial-weighting matrix


7.1 Syntax
spmat graph objname [, blocks([(stat)] p) twoway_options]

7.2 Description
spmat graph produces an intensity plot of the spatial-weighting matrix contained in
the spmat object objname. Zero elements are plotted in white; the remaining elements
are partitioned into bins of equal length and assigned gray-scale colors gs0–gs15 (see
[G-4] colorstyle), with darker colors representing higher values.

7.3 Options
blocks([(stat)] p) specifies that the matrix be divided into blocks of size p and that
block maximums be plotted. This option is useful when the matrix is large. To plot a
statistic other than the default maximum, you can specify the optional stat argument.
For example, to plot block medians, type blocks((p50) p). The supported statistics
include those returned by summarize, detail; see [R] summarize for a complete
list.
twoway options are any options other than by(); they are documented in
[G-3] twoway options.

7.4 Examples
An intensity plot of a spatial-weighting matrix can reveal underlying structure. For
example, if there is a banded structure to the spatial-weighting matrix, large amounts
of memory may be saved.
See section 5.1 for an example in which we use spmat graph to reveal the banded
structure in a spatial-weighting matrix.

8 Computing spatial lags


8.1 Syntax
spmat lag [type] newvar objname varname

8.2 Description
spmat lag uses a spatial-weighting matrix to compute the weighted averages of a vari-
able known as the spatial lag of a variable.
More precisely, spmat lag uses the spatial-weighting matrix in the spmat object
objname to compute the spatial lag of the variable varname and stores the result in the
new variable newvar.

8.3 Examples
Spatial lags of the exogenous right-hand-side variables are frequently included in SAR
models; see, for example, LeSage and Pace (2009).
Recall that a spatial lag is a weighted average of the variable being lagged. If x_spl denotes the spatial lag of the existing variable x, using the spatial-weighting matrix W, then the algebraic definition is x_spl = Wx.
The code below generates the new variable x_spl, which contains the spatial lag of x,
using the spatial-weighting matrix W, which is contained in the spmat object ccounty:

. clear all
. use county
. spmat contiguity ccounty using countyxy, id(id) normalize(minmax)
. generate x = runiform()
. spmat lag x_spl ccounty x

We could now include both x and x_spl in our model.
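As a cross-check, the same weighted average can be computed directly in Mata after extracting the matrix with spmat getmatrix (section 14). The sketch below assumes the data are still sorted in the order used when ccounty was created; x_spl2 is just a throwaway variable name.

. spmat getmatrix ccounty W
. mata: st_store(., st_addvar("double", "x_spl2"), W * st_data(., "x"))
. assert reldif(x_spl, x_spl2) < 1e-8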

9 Computing the eigenvalues of a spatial-weighting matrix
9.1 Syntax
spmat eigenvalues objname [, eigenvalues(vecname) replace]

9.2 Description
spmat eigenvalues calculates the eigenvalues of the spatial-weighting matrix contained
in the spmat object objname and stores them in vecname. The maximum-likelihood
estimator implemented in the spreg ml command, as described in Drukker, Prucha,
and Raciborski (2013b), uses the eigenvalues of the spatial-weighting matrix during the
optimization process. If you are estimating several models by maximum likelihood with
the same spatial-weighting matrix, computing and storing the eigenvalues in an spmat
object will remove the need to recompute the eigenvalues.

9.3 Options
eigenvalues(vecname) stores the user-defined vector of eigenvalues in the spmat object
objname. vecname must be a Mata row vector of length n, where n is the dimension
of the spatial-weighting matrix in the spmat object objname.
replace permits spmat eigenvalues to overwrite existing eigenvalues in objname.

9.4 Examples
Putting the eigenvalues into the spmat object can dramatically speed up the com-
putations performed by the spreg ml command; see Drukker, Prucha, and Raciborski
(2013b) for details and references therein.
We can calculate the eigenvalues of the spatial-weighting matrix contained in the
spmat object ccounty and store them in the same object by typing

. spmat eigenvalues ccounty
Calculating eigenvalues.... finished.

10 Removing an spmat object from memory


10.1 Syntax

spmat drop objname

10.2 Description
spmat drop removes the spmat object objname from memory.

10.3 Examples
To drop the spmat object dcounty from memory, we type

. spmat drop dcounty


(note: spmat object dcounty not found)

11 Saving an spmat object to disk


11.1 Syntax
spmat save objname using filename [, replace]

11.2 Description
spmat save saves the spmat object objname to a file in a native Stata format.

11.3 Option
replace permits spmat save to overwrite filename.
11.4 Examples
Creating a spatial-weighting matrix, and perhaps its eigenvalues as well, can be a time-
consuming process. If you are going to repeatedly use a spatial-weighting matrix, you
probably want to save it to a disk and read it back in for subsequent uses. spmat save
will save the spmat object to disk for you. Section 12 discusses spmat use, which reads
the object from disk into memory.
If you are going to save an spmat object to disk, it is a good practice to use spmat
note to attach some documentation to the object before saving it. Section 6 discusses
spmat note.
Just like with Stata datasets, you can save your spmat objects to disk and share
them with other Stata users. The file format is platform independent. So, for example,
a Mac user could save an spmat object to disk and email it to a coauthor, and the
Windows-using coauthor could read in this spmat object by using spmat use.
We can save the information contained in the spmat object ccounty in the file
ccounty.spmat by typing

. spmat save ccounty using ccounty.spmat

12 Reading spmat objects from disk


12.1 Syntax
spmat use objname using filename [, replace]

12.2 Description
spmat use reads into memory an spmat object from a file created by spmat save; see
section 11 for a discussion of spmat save.

12.3 Option
replace permits spmat use to overwrite an existing spmat object.

12.4 Examples
As mentioned in section 11, creating a spatial-weighting matrix can be time consuming.
When repeatedly using a spatial-weighting matrix, you might want to save it to disk
with spmat save and read it back in with spmat use for subsequent uses.
In section 11, we saved the spmat object ccounty to the file ccounty.spmat. We
now drop the existing ccounty object from memory and read it back in with spmat
use:

. spmat drop ccounty
. spmat use ccounty using ccounty.spmat
. spmat note ccounty
Source: Tiger 2008 county files. Created on 18jan2011.

13 Writing a spatial-weighting matrix to a text file


13.1 Syntax
spmat export objname using filename [, noid nlist replace]

13.2 Description
spmat export saves the spatial-weighting matrix contained in the spmat object objname
to a space-delimited text file. The matrix is written in a rectangular format with unique
place identifiers saved in the first column. spmat export can also save lists of neighbors
to a text file.

13.3 Options
noid causes spmat export not to save unique place identifiers, only matrix entries.
nlist causes spmat export to write the matrix in the neighbor-list format described
in section 2.3.
replace permits spmat export to overwrite filename.

13.4 Examples
The main use of spmat export is to export a spatial-weighting matrix to a text file that
can be read by another program. Long (2009, 336) recommends exporting all data to
text files that will be read by future software as part of archiving one’s research work.
Another use of spmat export is to review neighbor lists from a contiguity matrix.
Here we illustrate how one can export the contiguity matrix in the neighbor-list format
described in section 2.3.
. spmat export ccounty using nlist.txt, nlist
We call the Unix command head to list the first 10 lines of nlist.txt:15

. !head nlist.txt
3109
1 1054 1657 2063 2165 2189 2920 2958
2 112 2250 2277 2292 2362 2416 3156
3 2294 2471 2575 2817 2919 2984
4 8 379 1920 2024 2258 2301
5 6 73 1059 1698 2256 2886 2896
6 5 1698 2256 2795 2886 2896 3098
7 517 1924 2031 2190 2472 2575
8 4 379 1832 2178 2258 2987
9 413 436 1014 1320 2029 2166

The first line of the file indicates that there are 3,109 total spatial units. The
second line indicates that the unit with identification code 1 is a neighbor of units with
identification codes 1054, 1657, 2063, 2165, 2189, 2920, and 2958. The interpretation of
the remaining lines is analogous to that for the second line.

14 Getting a spatial-weighting matrix from an spmat object
14.1 Syntax
spmat getmatrix objname matname [, id(vecname) eig(vecname)]

14.2 Description
spmat getmatrix copies the spatial-weighting matrix contained in the spmat object
objname and stores it in the Mata matrix matname; see [M-0] intro for an introduction
to using Mata. If specified, the vector of unique identifiers and the eigenvalues of the
spatial-weighting matrix will be stored in Mata vectors.

14.3 Options
id(vecname) specifies the name of a Mata vector to contain IDs.
eig(vecname) specifies the name of a Mata vector to contain eigenvalues.

15. Users of other operating systems should open the file in a text editor.
14.4 Examples
If you want to make changes to an existing spatial-weighting matrix, you need to retrieve
it from the spmat object, store it in Mata, make the desired changes, and store the new
matrix back in the spmat object by using spmat putmatrix. (See section 17 for a
discussion of spmat putmatrix.)
spmat getmatrix performs the first two tasks: it makes a copy of the spatial-
weighting matrix from the spmat object and stores it in Mata.
As we discussed in section 3, spmat idistance creates a spatial-weighting matrix
of the form 1/dij , where dij is the distance between units i and j. In section 17.4, we
use spmat getmatrix in an example in which we change a spatial-weighting matrix to
the form 1/ exp(0.1 × dij ) instead of just 1/dij .
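Section 17.4 works through that example in detail; the basic pattern looks roughly like the sketch below, in which the object name dexp and the decay parameter 0.1 are only illustrative and dcounty is assumed to still hold the plain inverse-distance matrix from section 3.4.

. spmat getmatrix dcounty D, id(ids)
. mata: d = editmissing(1 :/ D, 0)
. mata: W = (d :> 0) :* (1 :/ exp(0.1 :* d))
. spmat putmatrix dexp W, id(ids) normalize(minmax)

The first Mata line backs out the raw distances (the diagonal stays 0), and the second builds the new weights of the form 1/exp(0.1 × dij) with zeros on the diagonal.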

15 Importing spatial-weighting matrices


15.1 Syntax
spmat import objname using filename [, noid nlist geoda idistance normalize(norm) replace]

15.2 Description
spmat import imports a spatial-weighting matrix from a space-delimited text file and
stores it in a new spmat object.

15.3 Options
noid specifies that the first column of numbers in filename does not contain unique place
identifiers and that spmat import should create and use the identifiers 1, . . . , n.
nlist specifies that the text file to be imported contain a list of neighbors in the format
described in section 2.3.
geoda specifies that filename be in the .gwt or .gal format created by the GeoDa software.
idistance specifies that the file contains raw distances and that the raw distances
should be converted to inverse distances. In other words, idistance specifies that
the (i, j)th element in the file be dij and that the (i, j)th element in the spatial-
weighting matrix be 1/dij , where dij is the distance between units i and j.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
replace permits spmat import to overwrite an existing spmat object.

15.4 Examples
One frequently needs to import a spatial-weighting matrix from a text file. spmat
import supports three of the most common formats: simple text files, GeoDa text
files, and text files that require minor changes such as converting from raw to inverse
distances.
By default, the unique place-identifying variable is assumed to be stored in the first
column of the file, but this can be overridden with the noid option.
In section 17.4, we provide an extended example that begins with using spmat
import to import a spatial-weighting matrix.
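As a quick illustration, a neighbor list written by spmat export (see section 13) can be read straight back in; the object name ccounty2 below is arbitrary:

. spmat export ccounty using nlist.txt, nlist replace
. spmat import ccounty2 using nlist.txt, nlist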

16 Obtaining a spatial-weighting matrix from a Stata dataset
16.1 Syntax
spmat dta objname varlist [if] [in] [, id(varname) idistance normalize(norm) replace]

16.2 Description
spmat dta imports a spatial-weighting matrix from the variables in a Stata dataset and
stores it in an spmat object.
The number of variables in varlist must equal the number of observations because
spatial-weighting matrices are n × n.

16.3 Options
id(varname) specifies that the unique place identifiers be contained in varname. The
default is to create an identifying vector containing 1, . . . , n.
idistance specifies that the variables contain raw distances and that the raw distances
be converted to inverse distances. In other words, idistance specifies that the ith
observation on the jth variable be dij and that the (i, j)th element in the spatial-
weighting matrix be 1/dij , where dij is the distance between units i and j.

normalize(norm) specifies one of the three available normalization techniques: row,


minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
replace permits spmat dta to overwrite an existing spmat object.

16.4 Examples
People have created Stata datasets that contain spatial-weighting matrices. Given the
power of infile and infix (see [D] infile (fixed format) and [D] infix (fixed for-
mat)), it is likely that more such datasets will be created. spmat dta imports these
spatial-weighting matrices and stores them in an spmat object.
Here we illustrate how we can create an spmat object from a Stata dataset. The
dataset schools.dta contains the distance in miles between five schools in the variables
c1-c5. The unique school identifier is recorded in the variable id. In Stata, we type

. use schools, clear


. list

id c1 c2 c3 c4 c5

1. 101 0 5.9 8.25 6.22 7.66


2. 205 5.9 0 2.97 4.87 7.63
3. 113 8.25 2.97 0 4.47 7
4. 441 6.22 4.87 4.47 0 2.77
5. 573 7.66 7.63 7 2.77 0

. spmat dta schools c*, id(id) idistance normalize(minmax)
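A quick way to verify what was stored is to summarize the new object; for example, one could type

. spmat summarize schools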

17 Storing a Mata matrix in an spmat object


17.1 Syntax
  
spmat putmatrix objname matname [, id(varname | vecname) eig(vecname) idistance bands(l u) normalize(norm) replace]

17.2 Description
spmat putmatrix puts Mata matrices into an existing spmat object objname or into
a new spmat object if the specified object does not exist. The optional unique place
identifiers can be provided as a Mata vector or a Stata variable. The optional eigenvalues
of the Mata matrix can be provided in a Mata vector.

17.3 Options
id(varname | vecname) specifies a Mata vector vecname or a Stata variable varname
that contains unique place identifiers.
eig(vecname) specifies a Mata vector vecname that contains the eigenvalues of the
matrix.
idistance specifies that the Mata matrix contains raw distances and that the raw
distances be converted to inverse distances. In other words, idistance specifies
that the (i, j)th element in the Mata matrix be dij and that the (i, j)th element in
the spatial-weighting matrix be 1/dij , where dij is the distance between units i and
j.
bands(l u) specifies that the Mata matrix matname be banded with l lower and u upper
diagonals.
normalize(norm) specifies one of the three available normalization techniques: row,
minmax, and spectral. In a row-normalized matrix, each element in row i is divided
by the sum of row i’s elements. In a minmax-normalized matrix, each element is
divided by the minimum of the largest row sum and column sum of the matrix. In
a spectral-normalized matrix, each element is divided by the modulus of the largest
eigenvalue of the matrix. See section 2.5 for details.
replace permits spmat putmatrix to overwrite an existing spmat object.

17.4 Examples
spmat contiguity and spmat idistance create spatial-weighting matrices from raw
data. This section describes situations in which we have the spatial-weighting matrix
precomputed and simply want to put it in an spmat object. The spatial-weighting matrix
can be any matrix that satisfies the conditions discussed, for example, in Kelejian and
Prucha (2010).
In this section, we show how to create an spmat object from a text file by using
spmat import and how to use spmat getmatrix and spmat putmatrix to generate an
inverse-distance matrix according to a user-specified functional form.
The file schools.txt contains the distance in miles between five schools. We call
the Unix command cat to print the contents of the file:

. !cat schools.txt
5
101 0 5.9 8.25 6.22 7.66
205 5.9 0 2.97 4.87 7.63
113 8.25 2.97 0 4.47 7
441 6.22 4.87 4.47 0 2.77
573 7.66 7.63 7 2.77 0

The school ID is recorded in the first column of the file, and column i records the
distance from school i to all the other schools, including itself. We can use spmat
import to create a spatial-weighting matrix from this file:
. spmat import schools using schools.txt, replace

The resulting spatial-weighting matrix is


⎡ 0     5.9   8.25  6.22  7.66 ⎤
⎢ 5.9   0     2.97  4.87  7.63 ⎥
⎢ 8.25  2.97  0     4.47  7.0  ⎥
⎢ 6.22  4.87  4.47  0     2.77 ⎥
⎣ 7.66  7.63  7.0   2.77  0    ⎦

We now illustrate how to create a spatial-weighting matrix with the distance de-
clining in an exponential fashion, exp(−0.1dij ), where dij is the original distance from
school i to school j.
. spmat getmatrix schools x
. mata: x = exp(-.1:*x)
. mata: _diag(x,0)
. spmat putmatrix schools x, normalize(minmax) replace

Thus we read in the original distances, extract the distance matrix with spmat
getmatrix, use Mata to transform the matrix entries according to our specifications,
and reset the diagonal elements to 0. Finally, we use spmat putmatrix to put the
transformed matrix into an spmat object. The resulting minmax-normalized spatial-
weighting matrix is
⎡ 0      0.217  0.172  0.211  0.182 ⎤
⎢ 0.217  0      0.292  0.241  0.183 ⎥
⎢ 0.172  0.292  0      0.251  0.195 ⎥
⎢ 0.211  0.241  0.251  0      0.297 ⎥
⎣ 0.182  0.183  0.195  0.297  0     ⎦

18 Converting general matrices into banded matrices


This section shows how to transform a spatial-weighting matrix stored as a general
matrix in an spmat object into a banded format. If this topic is not of interest, you can
skip this section.
The easy case is when the matrix already has a banded structure so that we can
simply use spmat tobanded.
Now consider the more difficult case in which we have a spatial-weighting matrix
stored in an spmat object and we would like to use the sorting method described in
Drukker et al. (2011) to store this matrix in a banded format. This transformation
requires 1) permuting the elements of the existing spatial-weighting matrix to correspond

to a new row sort order and then 2) storing the spatial-weighting matrix in banded
format. We accomplish step 1 by storing the new row sort order in a permutation
vector, as explained below, and then by using spmat permute. We use spmat tobanded
to perform step 2.
Note that most of the time, it is more convenient to sort the data as described
in section 5.1 and to call spmat contiguity or spmat idistance with a truncation
criterion. With very large datasets, spmat contiguity and spmat idistance will be
the only choices because they are capable of creating banded matrices from data without
first storing the matrices in a general form.

18.1 Permuting a spatial-weighting matrix stored in an spmat object


Syntax

spmat permute objname pvarname

Description

spmat permute permutes the rows and columns of the n × n spatial-weighting matrix
stored in the spmat object objname. The permutation vector stored in pvarname con-
tains a permutation of the integers {1, . . . , n}, where n is both the sample size and the
dimension of W. That the value of the ith observation of pvarname is j specifies that
we must move row j to row i in the permuted matrix. After moving all the rows as
specified in pvarname, we move the columns in an analogous fashion. See Permutation
details: Mathematics below for a more thorough explanation.

Examples

spmat permute is illustrated in the Examples section of section 18.2.

Permutation details: Mathematics

Let p be the permutation vector created from pvarname, and let W be the spatial-
weighting matrix contained in the specified spmat object. The n × 1 permutation vector
p contains a permutation of the integers {1, . . . , n}, where n is the dimension of W.
The permutation of W is obtained by reordering the rows and columns of W as specified
by the elements of p. Each element of p specifies a row and column reordering of W.
That element i of p is j—that is, p[i]=j—specifies that we must move row j to row i in
the permuted matrix. After moving all the rows according to p, we move the columns
analogously.

Here is an illustrative example. We have a matrix W, which is not banded:

. mata: W
[symmetric]
1 2 3 4 5

1 0
2 1 0
3 0 0 0
4 0 1 0 0
5 1 0 1 0 0

Suppose that we also have a permutation vector p that we could use to permute W to a
banded matrix.

. mata: p
1 2 3 4 5

1 3 5 1 2 4

See Permutation details: An example below to see how we used the sorting trick of
Drukker et al. (2011) to obtain this p. See Examples in section 18.2 for an example
with real data.
The values in the permutation vector p specify how to permute (that is, reorder) the
rows and the columns of W. Let’s start with the rows. That 3 is element 1 of p specifies
that row 3 of W be moved to row 1 in the permuted matrix. In other words, we must
move row 3 to row 1.
Applying this logic to all the elements of p yields that we must reorder the rows of
W by moving row 3 to row 1, row 5 to row 2, row 1 to row 3, row 2 to row 4, and row 4
to row 5. In the output below, we use Mata to perform this operation on W, store the
result in A, and display A. If the Mata code is confusing, just check that A contains the
described row reordering of W.

. mata: A = W[p,.]
. mata: A
1 2 3 4 5

1 0 0 0 0 1
2 1 0 1 0 0
3 0 1 0 0 1
4 1 0 0 1 0
5 0 1 0 0 0

Having reordered the rows, we reorder the columns in the analogous fashion. Oper-
ating on A, we move column 3 to column 1, column 5 to column 2, column 1 to column 3,
column 2 to column 4, and column 4 to column 5. In the output below, we use Mata
to perform this operation on A, store the result in B, and display B. If the Mata code is
confusing, just check that B contains the reordering of A described above.

. mata: B = A[.,p]
. mata: B
[symmetric]
1 2 3 4 5

1 0
2 1 0
3 0 1 0
4 0 0 1 0
5 0 0 0 1 0

Note that B is the desired banded matrix. For Mata aficionados, typing W[p,p]
would produce this permutation in one step.
For those whose intuition is grounded in linear algebra, here is the permutation-
matrix explanation. The permutation vector p defines the permutation matrix E, where
E is obtained by performing the row reordering described above on the identity matrix
of dimension 5. Then the permuted form of W is given by E*W*E’, as we illustrate below:

. mata: E = I(5)
. mata: E
[symmetric]
1 2 3 4 5

1 1
2 0 1
3 0 0 1
4 0 0 0 1
5 0 0 0 0 1

. mata: E = E[p,.]
. mata: E
1 2 3 4 5

1 0 0 1 0 0
2 0 0 0 0 1
3 1 0 0 0 0
4 0 1 0 0 0
5 0 0 0 1 0

. mata: E*W*E'
[symmetric]
1 2 3 4 5

1 0
2 1 0
3 0 1 0
4 0 0 1 0
5 0 0 0 1 0

permutation (see [M-1] permutation) provides further details on permutation vectors and permutation matrices.

Permutation details: An example

spmat permute requires that the permutation vector be stored in the Stata variable
pvarname. Assume that we now have the unpermuted matrix W stored in the spmat
object cobj. The matrix represents contiguity information for the following data:

. list

id distance

1. 79 5.23
2. 82 27.56
3. 100 0
4. 114 1.77
5. 140 20.47

The variable distance measures the distance from the centroid of the place with id=100
to the centroids of all the other places. We sort the data on distance and generate the
permutation vector p, which is just a running index 1, . . . , 5:

. sort distance
. generate p = _n
. list

id distance p

1. 100 0 1
2. 114 1.77 2
3. 79 5.23 3
4. 140 20.47 4
5. 82 27.56 5

We obtain our permutation vector by sorting the data back to the original order
based on the id variable:

. sort id
. list

id distance p

1. 79 5.23 3
2. 82 27.56 5
3. 100 0 1
4. 114 1.77 2
5. 140 20.47 4

Now coding spmat permute cobj p will reorder the rows and columns of W in
exactly the same way as the Mata code did above.
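With the permutation variable in hand, the remaining steps are short; a minimal sketch for this small example (using the object name cobj from above) is

. spmat permute cobj p
. spmat tobanded cobj, replace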

18.2 Banding a spatial-weighting matrix


Syntax
   
spmat tobanded objname1 [objname2] [, truncmethod replace]

where truncmethod is one of btruncate(b B), dtruncate(dL dU), or vtruncate(#).

Description

spmat tobanded stores an existing, general-format spatial-weighting matrix in a banded


format. spmat tobanded has truncation options for inducing a banded structure in
spatial-weighting matrices that are not already in banded form.
More precisely, spmat tobanded stores the spatial-weighting matrix in an spmat
object in banded format.

Options

truncmethod specifies one of the three truncation criteria. The values of W that meet
the truncation criterion will be changed to 0.
btruncate(b B) partitions the values of W into B bins and truncates to 0 entries
that fall into bin b or below.
dtruncate(dL dU ) truncates to 0 the values of W that fall more than dL diagonals
below and dU diagonals above the main diagonal. Neither value can be greater than
(cols(W)−1)/4.
vtruncate(#) truncates to 0 the values of W that are less than or equal to #.
replace allows objname1 or objname2 to be overwritten if it already exists.

Examples

Sometimes, we have large spatial-weighting matrices that fit in memory, but they take
up so much space that there is too little room to do anything else. In these cases, we are
better off storing these spatial-weighting matrices in a banded format when possible.

spmat tobanded stores existing spatial-weighting matrices in a banded format. The


two allowed syntaxes are
spmat tobanded objname1 , replace
and
spmat tobanded objname1 objname2 [, replace]
The first syntax replaces the general-form spatial-weighting matrix in the spmat object
objname1 with its banded form.
The second syntax stores the general-form spatial-weighting matrix in the spmat object
objname1 in banded form in the spmat object objname2. You must specify replace if
objname2 already exists.
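For instance, with hypothetical object names wobj and wobjb, the second syntax combined with value truncation sets entries less than or equal to 0.001 to 0 and stores the banded result in a new object:

. spmat tobanded wobj wobjb, vtruncate(.001)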
We continue with the example from section 2.4, where we have the 3,109 × 3,109
normalized-contiguity matrix stored in the spmat object ccounty. In section 5.1, we
showed that if we sort the data on a distance variable, we can call spmat contiguity
again and get a banded matrix. Here we show that we can achieve the same result
by 1) creating a permutation vector, 2) calling spmat permute, and 3) running spmat
tobanded on the existing spmat object.
We begin by generating a permutation vector and storing it in the Stata variable p.
Recall that we want the ith element of p to contain the observation number that it
will have under the new sort order. This process is given in the code below and is
analogous to the one discussed in the subsections Permutation details: Mathematics
and Permutation details: An example in section 18.1. Because the data are already
sorted by ID, we begin by sorting them by longitudes and latitudes of the centroids so
that the first observation will contain a corner place. Next we generate the distance
from the corner place. After sorting the data in ascending order from the distance to
the corner observation, we generate our permutation vector p and finally put the data
back in the original sort order.

. use county, clear


. generate p = _n
. sort longitude latitude
. generate double dist =
> sqrt( (longitude-longitude[1])^2 + (latitude-latitude[1])^2 )
. sort dist

We can now use this permutation vector and spmat permute to perform the per-
mutation, and we can finally call spmat tobanded to band the spatial-weighting matrix
stored inside the spmat object ccounty. Note that the reported summary is identical
to the one in section 5.1.

. spmat permute ccounty p


. spmat tobanded ccounty, replace
. spmat summarize ccounty, links
Summary of spatial-weighting object ccounty

Matrix Description

Dimensions 3109 x 3109


Stored as 465 x 3109
Links
total 18474
min 1
mean 5.942104
max 14

(object contains eigenvalues)

19 Conclusion
We discussed the spmat command for creating, managing, importing, manipulating, and
storing spatial-weighting matrix objects. In future work, we will consider additional
subcommands for creating specific types of spatial-weighting matrices.

20 Acknowledgment
We gratefully acknowledge financial support from the National Institutes of Health
through the SBIR grants R43 AG027622 and R44 AG027622.

21 References
Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer
Academic Publishers.

———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.

Arbia, G. 2006. Spatial Econometrics: Statistical Foundations and Applications to


Regional Convergence. Berlin: Springer.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes: Models and Applications. London: Pion.

Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.

Crow, K. 2006. shp2dta: Stata module to convert shape boundary files to Stata datasets.
Statistical Software Components S456718, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s456718.html.
Crow, K., and W. Gould. 2007. FAQ: How do I graph data onto a map with spmap?
http://www.stata.com/support/faqs/graphics/spmap-and-maps/.
Drukker, D. M., P. Egger, and I. R. Prucha. 2013. On two-step estimation of a spatial
autoregressive model with autoregressive disturbances and endogenous regressors.
Econometric Reviews 32: 686–733.
Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2011. Sorting induces
a banded structure in spatial-weighting matrices. Working paper, Department of
Economics, University of Maryland.
Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013a. A command for estimating
spatial-autoregressive models with spatial-autoregressive disturbances and additional
endogenous variables. Stata Journal 13: 287–301.
———. 2013b. Maximum likelihood and generalized spatial two-stage least-squares
estimators for a spatial-autoregressive model with spatial-autoregressive disturbances.
Stata Journal 13: 221–241.
Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge
University Press.
Kelejian, H. H., and I. R. Prucha. 2010. Specification and estimation of spatial au-
toregressive models with autoregressive and heteroskedastic disturbances. Journal of
Econometrics 157: 53–67.
Lai, P.-C., F.-M. So, and K.-W. Chan. 2009. Spatial Epidemiological Approaches in
Disease Mapping and Analysis. Boca Raton, FL: CRC Press.
Leenders, R. T. A. J. 2002. Modeling social influence through network autocorrelation:
Constructing the weight matrix. Social Networks 24: 21–47.
LeSage, J., and R. K. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton:
Chapman & Hall/CRC.
Long, J. S. 2009. The Workflow of Data Analysis Using Stata. College Station, TX:
Stata Press.
Pisati, M. 2005. mif2dta: Stata module to convert MapInfo Interchange Format bound-
ary files to Stata boundary files. Statistical Software Components S448403, Depart-
ment of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s448403.html.
———. 2007. spmap: Stata module to visualize spatial data. Statistical Software
Components S456812, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s456812.html.

Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region.
Economic Geography 46: 234–240.

Waller, L. A., and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health
Data. Hoboken, NJ: Wiley.

Whittle, P. 1954. On stationary processes in the plane. Biometrika 41: 434–449.

About the authors


David Drukker is the director of econometrics at StataCorp.
Hua Peng is a senior software engineer at StataCorp.
Ingmar Prucha is a professor of economics at the University of Maryland.
Rafal Raciborski is an econometrician at StataCorp.
The Stata Journal (2013)
13, Number 2, pp. 287–301

A command for estimating


spatial-autoregressive models with
spatial-autoregressive disturbances and
additional endogenous variables
David M. Drukker, StataCorp, College Station, TX, [email protected]
Ingmar R. Prucha, Department of Economics, University of Maryland, College Park, MD, [email protected]
Rafal Raciborski, StataCorp, College Station, TX, [email protected]

Abstract. We describe the spivreg command, which estimates the parameters


of linear cross-sectional spatial-autoregressive models with spatial-autoregressive
disturbances, where the model may also contain additional endogenous variables
as well as exogenous variables. spivreg uses results and the literature cited in
Kelejian and Prucha (1998, Journal of Real Estate Finance and Economics 17:
99–121; 1999, International Economic Review 40: 509–533; 2004, Journal of Econo-
metrics 118: 27–50; 2010, Journal of Econometrics 157: 53–67); Arraiz et al. (2010,
Journal of Regional Science 50: 592–614); and Drukker, Egger, and Prucha (2013,
Econometric Reviews 32: 686–733).
Keywords: st0293, spivreg, spatial-autoregressive models, Cliff–Ord models, gener-
alized spatial two-stage least squares, instrumental-variable estimation, generalized
method of moments estimation, spatial econometrics, spatial statistics

1 Introduction
Building on the work of Whittle (1954), Cliff and Ord (1973, 1981) developed statistical
models that accommodate forms of cross-unit interactions. The latter is a feature of
interest in many social science, biostatistical, and geographic science models. A simple
version of these models, typically referred to as spatial-autoregressive (SAR) models,
augments the linear regression model by including an additional right-hand-side (RHS)
variable known as a spatial lag. Each observation of the spatial-lag variable is a weighted
average of the values of the dependent variable observed for the other cross-sectional
units. Generalized versions of the SAR model also allow for the disturbances to be
generated by a SAR process and for the exogenous RHS variables to be spatial lags of
exogenous variables. The combined SAR model with SAR disturbances is often referred
to as a SARAR model; see Anselin and Florax (1995).1

1. These models are also known as Cliff–Ord models because of the impact that Cliff and Ord (1973,
1981) had on the subsequent literature. To avoid confusion, we simply refer to these models as
SARAR models while still acknowledging the importance of the work of Cliff and Ord.


© 2013 StataCorp LP st0293

In modeling the outcome for each unit as dependent on a weighted average of the
outcomes of other units, SARAR models determine outcomes simultaneously. This si-
multaneity implies that the ordinary least-squares estimator will not be consistent; see
Anselin (1988) for an early discussion of this point. Drukker, Prucha, and Raciborski
(2013) discuss the spreg command, which implements estimators for the model when
the RHS variables are a spatial lag of the dependent variable, exogenous variables, and
spatial lags of the exogenous variables.
The model we consider allows for additional endogenous RHS variables. Thus the
model of interest is a linear cross-sectional SAR model with additional endogenous vari-
ables, exogenous variables, and SAR disturbances. We discuss an estimator for the
parameters of this model and the command that implements this estimator, spivreg.
Kelejian and Prucha (1998, 1999, 2004, 2010) and the references cited therein derive
the main results used by the estimator implemented in spivreg, with Drukker, Egger,
and Prucha (2013) and Arraiz et al. (2010) producing some important extensions that
are used in the code.
While SARAR models have a wide range of possible applications, following Cliff
and Ord (1973, 1981), much of the original literature was developed to handle spatial
interactions; see, for example, Anselin (1988, 2010), Cressie (1993), and Haining (2003).
However, space is not restricted to geographic space, and many recent applications
employ these techniques in other situations of cross-unit dependence, such as social-
interaction models and network models; see, for example, Kelejian and Prucha (2010)
and Drukker, Egger, and Prucha (2013) for references. Much of the nomenclature still
includes the adjective “spatial”, and we continue this tradition to avoid confusion while
noting the wider applicability of these models.
Section 2 defines the generalized SARAR model. Section 3 describes the spivreg
command. Section 4 illustrates the estimation of a SARAR model on example data
for U.S. counties. Section 5 describes postestimation commands. Section 6 presents
methods and formulas. The conclusion follows.

2 The model
The model of interest is given by

y = Yπ + Xβ + λWy + u (1)
u = ρMu + ε (2)

where

• y is an n × 1 vector of observations on the dependent variable;

• Y is an n × p matrix of observations on p RHS endogenous variables, and π is the


corresponding p × 1 parameter vector;

• X is an n × k matrix of observations on k RHS exogenous variables (where some


of the variables may be spatial lags of exogenous variables), and β is the corre-
sponding k × 1 parameter vector;
• W and M are n × n spatial-weighting matrices (with 0 diagonal elements);
• Wy and Mu are n × 1 vectors typically referred to as spatial lags, and λ and ρ
are the corresponding scalar parameters typically referred to as SAR parameters;
• ε is an n × 1 vector of innovations.2

The model in equations (1) and (2) is a SARAR model with exogenous regressors and
additional endogenous regressors. Spatial interactions are modeled through spatial lags,
and the model allows for spatial interactions in the dependent variable, the exogenous
variables, and the disturbances.
Because the model in equations (1) and (2) is a first-order SAR process with first-
order SAR disturbances, it is also referred to as a SARAR(1,1) model, which is a special
case of the more general SARAR(p, q) model. We refer to a SARAR(1,1) model as a
SARAR model. Setting ρ = 0 yields the SAR model y = Yπ + Xβ + λWy + ε. Setting
λ = 0 yields the model y = Yπ + Xβ + u with u = ρMu + ε, which is sometimes
referred to as the SAR error model. Setting ρ = 0 and λ = 0 causes the model to reduce
to a linear regression model with endogenous variables.
The spatial-weighting matrices W and M are taken to be known and nonstochastic.
These matrices are part of the model definition, and in many applications, W = M;
see Drukker et al. (2013) for more about creating spatial-weighting matrices in Stata.
Let ȳ = Wy, let ȳi and yi denote the ith elements of ȳ and y, respectively, and let wij
denote the (i, j)th element of W. Then

$$\bar{y}_i = \sum_{j=1}^{n} w_{ij}\, y_j$$

which clearly shows the dependence of yi on neighboring outcomes via the spatial lag
ȳi. The weights wij will typically be modeled as inversely related to some measure
of distance between the units. The SAR parameter λ measures the extent of these
interactions.
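As a small numerical illustration (our own toy example, not from the article's data): with three units and row-normalized weights, each element of Wy is the average of the other units' outcomes. In Mata,

mata:
    // toy 3-unit example with row-normalized weights and a 0 diagonal
    W = (0, .5, .5 \ .5, 0, .5 \ .5, .5, 0)
    y = (2 \ 4 \ 6)
    W*y     // equals (5 \ 4 \ 3): each element averages the other two outcomes
end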
The innovations ε are assumed to be independent and identically distributed or in-
dependent but heteroskedastically distributed. The option heteroskedastic, discussed
below, should be specified under the latter assumption.
The spivreg command implements the generalized method of moments (GMM)
and instrumental-variable (IV) estimation strategy discussed in Arraiz et al. (2010) and
2. The variables and parameters in this model are allowed to depend on the sample size; see
Kelejian and Prucha (2010) for further discussions. We suppress this dependence for notational
simplicity. In allowing, in particular, the elements of X to depend on the sample size, we find
that the specification is consistent with some of the variables in X being spatial lags of exogenous
variables.

Drukker, Egger, and Prucha (2013) for the above class of SARAR models. This estima-
tion strategy builds on Kelejian and Prucha (1998, 1999, 2004, 2010) and the references
cited therein. More in-depth discussions regarding issues of model specifications and
estimation approaches can be found in these articles and the literature cited therein.
spivreg requires that the spatial-weighting matrices M and W be provided in the
form of an spmat object as described in Drukker et al. (2013). Both general and banded
spatial-weighting matrices are supported.

3 The spivreg command


3.1 Syntax
       
spivreg depvar [varlist1] (varlist2 = varlist_iv) [if] [in], id(varname)
      [dlmat(objname) elmat(objname) noconstant heteroskedastic impower(q)
      level(#) maximize_options]

3.2 Options
id(varname) specifies a numeric variable that contains a unique identifier for each
observation. id() is required.
dlmat(objname) specifies an spmat object that contains the spatial-weighting matrix
W to be used in the SAR term.
elmat(objname) specifies an spmat object that contains the spatial-weighting matrix
M to be used in the spatial-error term.
noconstant suppresses the constant term in the model.
heteroskedastic specifies that spivreg use an estimator that allows ε to be het-
eroskedastically distributed over the observations. By default, spivreg uses an
estimator that assumes homoskedasticity.
impower(q) specifies how many powers of the matrix W to include in calculating the
instrument matrix H. The default is impower(2). The allowed values of q are
integers in the set {2, 3, . . . , ⌊√n⌋}.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
 
maximize_options: iterate(#), [no]log, trace, gradient, showstep,
showtolerance, tolerance(#), ltolerance(#), and from(init_specs);
see [R] maximize for details. These options are seldom used.

3.3 Saved results


spivreg saves the following information in e():
Scalars
    e(N)                  number of observations
    e(k)                  number of parameters
    e(rho_2sls)           initial estimate of ρ
    e(iterations)         number of GMM iterations
    e(iterations_2sls)    number of 2SLS iterations
    e(converged)          1 if GMM stage converged, 0 otherwise
    e(converged_2sls)     1 if 2SLS stage converged, 0 otherwise

Macros
    e(cmd)                spivreg
    e(cmdline)            command as typed
    e(depvar)             name of dependent variable
    e(title)              title in estimation output
    e(properties)         b V
    e(estat_cmd)          program used to implement estat
    e(predict)            program used to implement predict
    e(model)              sarar, sar, sare, or lr
    e(het)                heteroskedastic or homoskedastic
    e(indeps)             names of independent variables
    e(exogr)              exogenous regressors
    e(insts)              instruments
    e(instd)              instrumented variables
    e(constant)           noconstant or hasconstant
    e(H_omitted)          names of omitted instruments in H matrix
    e(idvar)              name of ID variable
    e(dlmat)              name of spmat object used in dlmat()
    e(elmat)              name of spmat object used in elmat()

Matrices
    e(b)                  coefficient vector
    e(V)                  variance–covariance matrix of the estimators
    e(delta_2sls)         initial estimate of β and λ

Functions
    e(sample)             marks estimation sample

4 Examples
To provide a simple illustration, we use the artificial dataset spivreg.dta for the con-
tinental U.S. counties.3 The contiguity matrix for the U.S. counties is taken from
Drukker et al. (2013). In Stata, we issue the following commands:

. use dui
. spmat use ccounty using ccounty.spmat

The spatial-weighting matrix is now contained in the spmat object ccounty. This
minmax-normalized spatial-weighting matrix was created in section 2.4 of Drukker et al.
(2013) and was saved to disk in section 11.4.
In the output above, we are just reading in the spatial-weighting-matrix object that
was created and saved in Drukker et al. (2013).

3. The geographical county location data came from the U.S. Census Bureau and can be found
at ftp://ftp2.census.gov/geo/tiger/TIGER2008/. The variables are simulated but inspired by
Powers and Wilson (2004) and Levitt (1997).

Our dependent variable, dui, is defined as the alcohol-related arrest rate per 100,000
daily vehicle miles traveled (DVMT). Figure 1 shows the distribution of dui across
counties, with darker colors representing higher values of the dependent variable. Spatial
patterns in dui are clearly visible.

Figure 1. Hypothetical alcohol-related arrests for continental U.S. counties

Our explanatory variables include police (number of sworn officers per 100,000
DVMT); nondui (nonalcohol-related arrests per 100,000 DVMT); vehicles (number of
registered vehicles per 1,000 residents); and dry (a dummy for counties that prohibit
alcohol sale within their borders). Because the size of the police force may be a function
of dui arrest rates, we treat police as endogenous; that is, in this example, Y =
(police). All other included explanatory variables, apart from the spatial lag, are taken
to be exogenous; that is, X = (nondui, vehicles, dry, intercept). Furthermore, we
assume the variable elect is a valid instrument, where elect is 1 if a county government
faces an election and is 0 otherwise. Thus the instrument matrix H is based on Xf =
(nondui, vehicles, dry, elect, intercept) as described above.

In Stata, we can estimate the SARAR model with endogenous variables by typing

. spivreg dui nondui vehicles dry (police = elect), id(id)


> dlmat(ccounty) elmat(ccounty) nolog
Spatial autoregressive model Number of obs = 3109
(GS2SLS estimates)

dui Coef. Std. Err. z P>|z| [95% Conf. Interval]

dui
police -1.467068 .0434956 -33.73 0.000 -1.552318 -1.381818
nondui -.0004088 .0008344 -0.49 0.624 -.0020442 .0012267
vehicles .0989662 .0017653 56.06 0.000 .0955063 .1024261
dry .4553992 .0278049 16.38 0.000 .4009026 .5098958
_cons 9.671655 .3682685 26.26 0.000 8.949862 10.39345

lambda
_cons .7340818 .013378 54.87 0.000 .7078614 .7603023

rho
_cons .2829313 .071908 3.93 0.000 .1419941 .4238685

Instrumented: police
Instruments: elect

Given the normalization of the spatial-weighting matrix, the parameter space for
λ and ρ is taken to be the interval (−1, 1); see Kelejian and Prucha (2010) for further
discussions of the parameter space. The estimate of λ is positive, large, and significant,
indicating strong SAR dependence in dui. In other words, the alcohol-related arrest
rate for a given county is strongly affected by the alcohol-related arrest rates in the
neighboring counties. One possible explanation for this may be coordination among
police departments. Another may be that strong enforcement in one county may lead
some people to drink in neighboring counties.
The estimated ρ is positive, moderate, and significant, indicating moderate spatial
autocorrelation in the innovations.
The estimated β vector does not have the same interpretation as in a simple lin-
ear model, because including a spatial lag of the dependent variable implies that the
outcomes are determined simultaneously.

5 Postestimation commands
5.1 Syntax
The syntax for predict after spivreg is
     
predict [type] newvar [if] [in] [, statistic]

where statistic is one of the following:


naive, the default, computes Yπ̂ + Xβ̂ + λ̂Wy, which should not be viewed as a
predictor for yi but simply as an intermediate calculation.

xb calculates Yπ̂ + Xβ̂.
The predictor computed by the option naive will generally be biased; see Kelejian
and Prucha (2007) for an explanation. Optimal predictors for the SARAR model with
additional endogenous RHS variables corresponding to different information sets will
be made available in the future. Optimal predictors for the SARAR model without
additional endogenous RHS variables are discussed in Kelejian and Prucha (2007).
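For example, after the spivreg fit in section 4, the xb statistic could be stored in a new variable (the variable name below is ours, chosen for illustration):

. predict dui_xb, xb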

6 Methods and formulas


In this section, we give a detailed description of the calculations performed by spivreg.
We first discuss the estimation of the general model as specified in (1) and (2), both
under the assumption that the innovations ε are homoskedastic and under the assump-
tion that the innovations ε are heteroskedastic of unknown form. We then discuss the
two special cases ρ = 0 and λ = 0, respectively.

6.1 SARAR model


It is helpful to rewrite the model in (1) and (2) as

y = Zδ + u
u = ρMu + ε

where Z = (Y, X, Wy) and δ = (π′, β′, λ)′. In the following, we review the two-step
GMM and IV estimation approach as discussed in Drukker, Egger, and Prucha (2013) for
the homoskedastic case and in Arraiz et al. (2010) for the heteroskedastic case. Those
articles build on and specialize the estimation theory developed in Kelejian and Prucha
(1998, 1999, 2004, 2010). A full set of assumptions, formal consistency and asymptotic
normality theorems, and further details and discussions are given in that literature.
The IV estimators of δ depend on the choice of a set of instruments, say, H. Suppose
that in addition to the included exogenous variables X, we also have excluded exogenous
variables Xe , allowing us to define Xf = (X, Xe ). If we do not have excluded exogenous

variables, then Xf = X. Following the above literature, the instruments H may then
be taken as the linearly independent columns of

(Xf, WXf, . . . , W^qXf, MXf, MWXf, . . . , MW^qXf)

The motivation for the above instruments is that they are computationally simple
while facilitating an approximation of the ideal instruments under reasonable assump-
tions. Taking q = 2 has worked well in Monte Carlo simulations over a wide range of
specifications. At a minimum, the instruments should include the linearly independent
columns of Xf and MXf , and the rank of H should be at least the number of variables
in Z.4 For the following discussion, it proves convenient to define the instrument pro-
jection matrix PH = H(H′H)−1H′. When there is a constant in the model, it is only
included once in H.
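As a rough sketch of what this instrument set looks like for the default q = 2, the following Mata fragment builds the candidate columns of H from matrices Xf, W, and M that are assumed to already exist in Mata. This is our illustration only, not spivreg's internal code; collinear columns would still need to be dropped, which spivreg does automatically.

mata:
    // candidate instrument columns for q = 2:
    // (Xf, W*Xf, W^2*Xf, M*Xf, M*W*Xf, M*W^2*Xf)
    WXf  = W*Xf
    W2Xf = W*WXf
    H    = (Xf, WXf, W2Xf, M*Xf, M*WXf, M*W2Xf)
end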
The GMM estimators for ρ are motivated by quadratic moment conditions of the
form
E(ε′Asε) = 0,    s = 1, . . . , S

where the matrices As satisfy tr(As ) = 0. Specific choices for those matrices will be
given below. We note that under heteroskedasticity, it is furthermore assumed that the
diagonal elements of the matrices As are 0. This assumption simplifies the formula for
the asymptotic variance–covariance (VC) matrix; in particular, it avoids the fact that
the VC matrix must depend on third and fourth moments of the innovations in addition
to second moments.
We next describe the steps involved in computing the GMM and IV estimators and
an estimate of their asymptotic VC matrix. The second step operates on a spatial
Cochrane–Orcutt transformation of the above model given by

y(ρ) = Z(ρ)δ + ε

with y(ρ) = (In − ρM)y and Z(ρ) = (In − ρM)Z.

Step 1a: Two-stage least-squares estimator

In the first step, we apply two-stage least squares (2SLS) to the untransformed model
by using the instruments H. The 2SLS estimator of δ is then given by

$$\tilde{\delta} = \left(\widehat{Z}'Z\right)^{-1}\widehat{Z}'y$$

where $\widehat{Z} = P_{H}Z$.
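A minimal Mata sketch of this step, assuming y, Z, and H already exist as Mata matrices (illustrative only, not spivreg's internal code), is the following; note that $\widehat{Z}'Z = \widehat{Z}'\widehat{Z}$ because $P_H$ is symmetric and idempotent.

mata:
    // 2SLS: delta = (Zhat'Zhat)^(-1) Zhat'y, with Zhat = P_H Z
    PH    = H*invsym(H'H)*H'       // projection onto the column space of H
    Zhat  = PH*Z
    delta = invsym(Zhat'Zhat)*(Zhat'y)
end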

4. Note that if Xf contains spatially lagged variables, H will contain collinear columns and will not
be full rank. In those cases, we drop collinear columns from H and return the names of omitted
instruments in e(H omitted).

Step 1b: Initial GMM estimator of ρ

The initial GMM estimator of ρ is given by

$$\tilde{\rho} = \arg\min_{\rho}\, \left\{\tilde{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix} - \tilde{\gamma}\right\}'\left\{\tilde{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix} - \tilde{\gamma}\right\}$$

where $\tilde{u} = y - Z\tilde{\delta}$ are the 2SLS residuals, $\bar{u} = M\tilde{u}$,

$$\tilde{\Gamma} = n^{-1}\begin{bmatrix} \tilde{u}'(A_{1} + A_{1}')\bar{u} & -\bar{u}'A_{1}\bar{u}\\ \vdots & \vdots\\ \tilde{u}'(A_{S} + A_{S}')\bar{u} & -\bar{u}'A_{S}\bar{u} \end{bmatrix} \qquad\text{and}\qquad \tilde{\gamma} = n^{-1}\begin{bmatrix} \tilde{u}'A_{1}\tilde{u}\\ \vdots\\ \tilde{u}'A_{S}\tilde{u} \end{bmatrix}$$

Writing the GMM estimator in this form shows that we can calculate it by solving
a simple nonlinear least-squares problem. By default, S = 2 and homoskedastic is
specified. In this case,

$$A_{1} = \left[1 + \left\{n^{-1}\,\mathrm{tr}(M'M)\right\}^{2}\right]^{-1}\left\{M'M - n^{-1}\,\mathrm{tr}(M'M)\,I_{n}\right\}$$

and

$$A_{2} = M$$

If heteroskedastic is specified, then by default,

$$A_{1} = M'M - \mathrm{diag}(M'M) \qquad\text{and}\qquad A_{2} = M$$
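A small Mata sketch of these two default choices, assuming the weighting matrix M and the sample size n are already defined in Mata (our illustration only, not spivreg's internal code), is

mata:
    // default quadratic-moment matrices
    MM  = M'M
    c   = 1/(1 + (trace(MM)/n)^2)
    A1  = c*(MM - (trace(MM)/n)*I(n))      // homoskedastic case
    A2  = M
    A1h = MM - diag(diagonal(MM))          // heteroskedastic case: zero diagonal
end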

Step 2a: Generalized spatial two-stage least-squares estimator of δ

In the second step, we first estimate δ by 2SLS from the transformed model by using
the instruments H, where the spatial Cochrane–Orcutt transformation uses $\tilde{\rho}$.
The resulting generalized spatial two-stage least-squares (GS2SLS) estimator of δ is now
given by

$$\widehat{\delta}(\tilde{\rho}) = \left\{\widehat{Z}(\tilde{\rho})'Z(\tilde{\rho})\right\}^{-1}\widehat{Z}(\tilde{\rho})'y(\tilde{\rho})$$

where $y(\tilde{\rho}) = (I_{n} - \tilde{\rho}M)y$, $Z(\tilde{\rho}) = (I_{n} - \tilde{\rho}M)Z$, and $\widehat{Z}(\tilde{\rho}) = P_{H}Z(\tilde{\rho})$.
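The spatial Cochrane–Orcutt transformation itself is simple to express; a minimal Mata sketch, assuming y, Z, M, and the step-1b estimate rho_t already exist in Mata (our illustration only, not spivreg's internal code), is

mata:
    // spatial Cochrane-Orcutt transformation using the initial estimate rho_t
    yt = y - rho_t*(M*y)      // y(rho) = (I_n - rho*M)*y
    Zt = Z - rho_t*(M*Z)      // Z(rho) = (I_n - rho*M)*Z
end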

Step 2b: Efficient GMM estimator of ρ

The efficient GMM estimator of ρ corresponding to GS2SLS residuals is given by

$$\widehat{\rho} = \arg\min_{\rho}\, \left\{\widehat{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix} - \widehat{\gamma}\right\}'\left\{\widehat{\Psi}^{\rho\rho}(\tilde{\rho})\right\}^{-1}\left\{\widehat{\Gamma}\begin{pmatrix}\rho\\ \rho^{2}\end{pmatrix} - \widehat{\gamma}\right\}$$

where $\widehat{u} = y - Z\widehat{\delta}$ denotes the GS2SLS residuals, $\overline{\widehat{u}} = M\widehat{u}$,

$$\widehat{\Gamma} = n^{-1}\begin{bmatrix} \widehat{u}'(A_{1} + A_{1}')\overline{\widehat{u}} & -\overline{\widehat{u}}{}'A_{1}\overline{\widehat{u}}\\ \vdots & \vdots\\ \widehat{u}'(A_{S} + A_{S}')\overline{\widehat{u}} & -\overline{\widehat{u}}{}'A_{S}\overline{\widehat{u}} \end{bmatrix} \qquad\text{and}\qquad \widehat{\gamma} = n^{-1}\begin{bmatrix} \widehat{u}'A_{1}\widehat{u}\\ \vdots\\ \widehat{u}'A_{S}\widehat{u} \end{bmatrix}$$


and where $\widehat{\Psi}^{\rho\rho}(\tilde{\rho})$ is an estimator for the VC matrix of the (normalized) sample-moment
vector based on GS2SLS residuals, say, $\Psi^{\rho\rho}$. The estimators $\widehat{\Psi}^{\rho\rho}(\tilde{\rho})$ and $\Psi^{\rho\rho}$ differ for the
cases of homoskedastic and heteroskedastic errors. When homoskedastic is specified,
the (r, s) element of $\widehat{\Psi}^{\rho\rho}(\tilde{\rho})$ is given by (r, s = 1, 2),

$$\begin{aligned}
\widehat{\Psi}^{\rho\rho}_{r,s}(\tilde{\rho}) ={}& \left\{\widehat{\sigma}^{2}(\tilde{\rho})\right\}^{2}(2n)^{-1}\,\mathrm{tr}\left\{(A_{r} + A_{r}')(A_{s} + A_{s}')\right\} + \widehat{\sigma}^{2}(\tilde{\rho})\,n^{-1}\,\widehat{a}_{r}(\tilde{\rho})'\,\widehat{a}_{s}(\tilde{\rho})\\
&+ n^{-1}\left[\widehat{\mu}^{(4)}(\tilde{\rho}) - 3\left\{\widehat{\sigma}^{2}(\tilde{\rho})\right\}^{2}\right]\mathrm{vec}_{D}(A_{r})'\,\mathrm{vec}_{D}(A_{s})\\
&+ n^{-1}\,\widehat{\mu}^{(3)}(\tilde{\rho})\left\{\widehat{a}_{r}(\tilde{\rho})'\,\mathrm{vec}_{D}(A_{s}) + \widehat{a}_{s}(\tilde{\rho})'\,\mathrm{vec}_{D}(A_{r})\right\}
\end{aligned} \tag{3}$$
where

$$\begin{aligned}
\widehat{a}_{r}(\tilde{\rho}) &= \widehat{T}(\tilde{\rho})\,\widehat{\alpha}_{r}(\tilde{\rho})\\
\widehat{T}(\tilde{\rho}) &= H\,\widehat{P}(\tilde{\rho})\\
\widehat{P}(\tilde{\rho}) &= \widehat{Q}_{HH}^{-1}\,\widehat{Q}_{HZ}(\tilde{\rho})\left\{\widehat{Q}_{HZ}(\tilde{\rho})'\,\widehat{Q}_{HH}^{-1}\,\widehat{Q}_{HZ}(\tilde{\rho})\right\}^{-1}\\
\widehat{Q}_{HH} &= n^{-1}H'H\\
\widehat{Q}_{HZ}(\tilde{\rho}) &= n^{-1}H'Z(\tilde{\rho})\\
Z(\tilde{\rho}) &= (I - \tilde{\rho}M)Z\\
\widehat{\alpha}_{r}(\tilde{\rho}) &= -n^{-1}\left\{Z(\tilde{\rho})'(A_{r} + A_{r}')\,\widehat{\varepsilon}(\tilde{\rho})\right\}\\
\widehat{\varepsilon}(\tilde{\rho}) &= (I - \tilde{\rho}M)\,\widehat{u}\\
\widehat{\sigma}^{2}(\tilde{\rho}) &= n^{-1}\,\widehat{\varepsilon}(\tilde{\rho})'\,\widehat{\varepsilon}(\tilde{\rho})\\
\widehat{\mu}^{(3)}(\tilde{\rho}) &= n^{-1}\sum_{i=1}^{n}\widehat{\varepsilon}_{i}(\tilde{\rho})^{3}\\
\widehat{\mu}^{(4)}(\tilde{\rho}) &= n^{-1}\sum_{i=1}^{n}\widehat{\varepsilon}_{i}(\tilde{\rho})^{4}
\end{aligned}$$

When heteroskedastic is specified, the (r, s) element of $\Psi^{\rho\rho}$ is estimated by

$$\widehat{\Psi}^{\rho\rho}_{r,s}(\tilde{\rho}) = (2n)^{-1}\,\mathrm{tr}\left\{(A_{r} + A_{r}')\,\widehat{\Sigma}(\tilde{\rho})\,(A_{s} + A_{s}')\,\widehat{\Sigma}(\tilde{\rho})\right\} + n^{-1}\,\widehat{a}_{r}(\tilde{\rho})'\,\widehat{\Sigma}(\tilde{\rho})\,\widehat{a}_{s}(\tilde{\rho}) \tag{4}$$

where $\widehat{\Sigma}(\tilde{\rho})$ is a diagonal matrix whose ith diagonal element is $\widehat{\varepsilon}_{i}(\tilde{\rho})^{2}$, and $\widehat{\varepsilon}(\tilde{\rho})$ and $\widehat{a}_{r}(\tilde{\rho})$
are as defined above. The last two terms in (3) do not appear in (4) because the $A_{s}$
matrices used in the heteroskedastic case have diagonal elements equal to 0.

Having computed the estimator $\widehat{\theta} = (\widehat{\delta}{}', \widehat{\rho})'$ in steps 1a, 1b, 2a, and 2b, we next
compute a consistent estimator for its asymptotic VC matrix, say, $\Omega$. The estimator is
given by $n\widehat{\Omega}$, where

$$\widehat{\Omega} = \begin{bmatrix}\widehat{\Omega}_{\delta\delta} & \widehat{\Omega}_{\delta\rho}\\ \widehat{\Omega}_{\delta\rho}' & \widehat{\Omega}_{\rho\rho}\end{bmatrix}$$

$$\begin{aligned}
\widehat{\Omega}_{\delta\delta} &= \widehat{P}(\widehat{\rho})'\,\widehat{\Psi}^{\delta\delta}(\widehat{\rho})\,\widehat{P}(\widehat{\rho})\\
\widehat{\Omega}_{\delta\rho} &= \widehat{P}(\widehat{\rho})'\,\widehat{\Psi}^{\delta\rho}(\widehat{\rho})\left\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\right\}^{-1}\widehat{J}\left[\widehat{J}'\left\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\right\}^{-1}\widehat{J}\right]^{-1}\\
\widehat{\Omega}_{\rho\rho} &= \left[\widehat{J}'\left\{\widehat{\Psi}^{\rho\rho}(\widehat{\rho})\right\}^{-1}\widehat{J}\right]^{-1}\\
\widehat{J} &= \widehat{\Gamma}\begin{pmatrix}1\\ 2\widehat{\rho}\end{pmatrix}
\end{aligned}$$

In the above, $\widehat{\Psi}^{\rho\rho}(\widehat{\rho})$ and $\widehat{P}(\widehat{\rho})$ are as defined in (3) and (4) with $\tilde{\rho}$ replaced by $\widehat{\rho}$. The
estimators $\widehat{\Psi}^{\delta\delta}(\widehat{\rho})$ and $\widehat{\Psi}^{\delta\rho}(\widehat{\rho})$ are defined as follows:

When homoskedastic is specified,

$$\begin{aligned}
\widehat{\Psi}^{\delta\delta}(\widehat{\rho}) &= \widehat{\sigma}^{2}(\widehat{\rho})\,\widehat{Q}_{HH}\\
\widehat{\Psi}^{\delta\rho}(\widehat{\rho}) &= \widehat{\sigma}^{2}(\widehat{\rho})\,n^{-1}H'\left\{\widehat{a}_{1}(\widehat{\rho}),\ \widehat{a}_{2}(\widehat{\rho})\right\} + \widehat{\mu}^{(3)}(\widehat{\rho})\,n^{-1}H'\left\{\mathrm{vec}_{D}(A_{1}),\ \mathrm{vec}_{D}(A_{2})\right\}
\end{aligned}$$

When heteroskedastic is specified,

$$\begin{aligned}
\widehat{\Psi}^{\delta\delta}(\widehat{\rho}) &= n^{-1}H'\widehat{\Sigma}(\widehat{\rho})H\\
\widehat{\Psi}^{\delta\rho}(\widehat{\rho}) &= n^{-1}H'\widehat{\Sigma}(\widehat{\rho})\left\{\widehat{a}_{1}(\widehat{\rho}),\ \widehat{a}_{2}(\widehat{\rho})\right\}
\end{aligned}$$

We note that the expression for $\widehat{\Omega}_{\rho\rho}$ has the simple form given above because the
estimator in step 2b is the efficient GMM estimator.

6.2 SAR model without spatially correlated errors


Consider the case ρ = 0, that is, the case where the disturbances are not spatially
correlated. In this case, only step 1a is necessary, and spivreg estimates δ by 2SLS using
as instruments H the linearly independent columns of {Xf, WXf, . . . , W^qXf}. The
2SLS estimator is given by

$$\widehat{\delta} = \left(\widehat{Z}'Z\right)^{-1}\widehat{Z}'y$$

where $\widehat{Z} = P_{H}Z$.

When homoskedastic is specified, the asymptotic VC matrix of $\widehat{\delta}$ can be estimated
consistently by

$$\widehat{\sigma}^{2}\left(\widehat{Z}'\widehat{Z}\right)^{-1}$$

where $\widehat{\sigma}^{2} = n^{-1}\sum_{i=1}^{n}\widehat{u}_{i}^{2}$ and $\widehat{u} = y - Z\widehat{\delta}$ denotes the 2SLS residuals.

When heteroskedastic is specified, the asymptotic VC matrix of $\widehat{\delta}$ can be estimated
consistently by the sandwich form

$$\left(\widehat{Z}'\widehat{Z}\right)^{-1}\widehat{Z}'\widehat{\Sigma}\widehat{Z}\left(\widehat{Z}'\widehat{Z}\right)^{-1}$$

where $\widehat{\Sigma}$ is the diagonal matrix whose ith element is $\widehat{u}_{i}^{2}$.

6.3 Spatially correlated errors without a SAR term


Consider the case λ = 0, that is, the case where there is no spatially lagged dependent
variable in the model. In this case, we use the same formulas as in section 6.1 after re-
defining Z = (Y, X) and δ = (π′, β′)′, and we take H to be composed of the linearly
independent columns of (Xf, MXf).

6.4 No SAR term or spatially correlated errors


When the model does not contain a SAR term or spatially correlated errors, the 2SLS
estimator provides consistent estimates, and we obtain our results by using ivregress
(see [R] ivregress). When homoskedastic is specified, the conventional estimator of
the asymptotic VC is used. When heteroskedastic is specified, the vce(robust)
estimator of the asymptotic VC is used. When no endogenous variables are specified,
we obtain our results by using regress (see [R] regress).

7 Conclusion
We have described the spivreg command for estimating the parameters of a SARAR
model with additional endogenous RHS variables. In the future, we plan to add options
for optimal predictors corresponding to different information sets.

8 Acknowledgment
We gratefully acknowledge financial support from the National Institutes of Health
through the SBIR grants R43 AG027622 and R44 AG027622.

9 References
Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer
Academic Publishers.

———. 2010. Thirty years of spatial econometrics. Papers in Regional Science 89: 3–25.

Anselin, L., and R. J. G. M. Florax. 1995. Small sample properties of tests for spatial
dependence in regression models: Some further results. In New Directions in Spatial
Econometrics, ed. L. Anselin and R. J. G. M. Florax, 21–74. Berlin: Springer.

Arraiz, I., D. M. Drukker, H. H. Kelejian, and I. R. Prucha. 2010. A spatial Cliff-Ord-


type model with heteroskedastic innovations: Small and large sample results. Journal
of Regional Science 50: 592–614.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes: Models and Applications. London: Pion.

Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. New York: Wiley.

Drukker, D. M., P. Egger, and I. R. Prucha. 2013. On two-step estimation of a spatial


autoregressive model with autoregressive disturbances and endogenous regressors.
Econometric Reviews 32: 686–733.

Drukker, D. M., H. Peng, I. R. Prucha, and R. Raciborski. 2013. Creating and managing
spatial-weighting matrices with the spmat command. Stata Journal 13: 242–286.

Drukker, D. M., I. R. Prucha, and R. Raciborski. 2013. Maximum likelihood and gen-
eralized spatial two-stage least-squares estimators for a spatial-autoregressive model
with spatial-autoregressive disturbances. Stata Journal 13: 221–241.

Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge
University Press.

Kelejian, H. H., and I. R. Prucha. 1998. A generalized spatial two-stage least squares
procedure for estimating a spatial autoregressive model with autoregressive distur-
bances. Journal of Real Estate Finance and Economics 17: 99–121.

———. 1999. A generalized moments estimator for the autoregressive parameter in a


spatial model. International Economic Review 40: 509–533.

———. 2004. Estimation of simultaneous systems of spatially interrelated cross sectional


equations. Journal of Econometrics 118: 27–50.

———. 2007. The relative efficiencies of various predictors in spatial econometric models
containing spatial lags. Regional Science and Urban Economics 37: 363–374.

———. 2010. Specification and estimation of spatial autoregressive models with au-
toregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.

Levitt, S. D. 1997. Using electoral cycles in police hiring to estimate the effect of police
on crime. American Economic Review 87: 270–290.

Powers, E. L., and J. K. Wilson. 2004. Access denied: The relationship between alcohol
prohibition and driving under the influence. Sociological Inquiry 74: 318–337.

Whittle, P. 1954. On stationary processes in the plane. Biometrika 41: 434–449.

About the authors


David Drukker is the director of econometrics at StataCorp.
Ingmar Prucha is a professor of economics at the University of Maryland.
Rafal Raciborski is an econometrician at StataCorp.
The Stata Journal (2013)
13, Number 2, pp. 302–314

A command for Laplace regression


Matteo Bottai
Unit of Biostatistics
Institute of Environmental Medicine
Karolinska Institutet
Stockholm, Sweden
[email protected]

Nicola Orsini
Unit of Biostatistics and Unit of Nutritional Epidemiology
Institute of Environmental Medicine
Karolinska Institutet
Stockholm, Sweden
[email protected]

Abstract. We present the new laplace command for estimating Laplace re-
gression, which models quantiles of a possibly censored outcome variable given
covariates. We illustrate laplace with an example from a clinical trial on survival
in patients with metastatic renal carcinoma. We also report the results of a small
simulation study.
Keywords: st0294, laplace, quantile regression, censored outcome, survival analy-
sis, Kaplan–Meier

1 Introduction
Estimating percentiles for a time-to-event variable of interest conditionally on covariates
may offer a useful complement to current approaches to survival analysis. For exam-
ple, comparing survival across treatments or exposure levels in observational studies
at various percentiles (for example, at the 50th or 10th percentiles) provides impor-
tant insights. At the univariate level, this can be accomplished with the Kaplan–Meier
estimator.
Laplace regression can be used to estimate the effect of risk factors and impor-
tant predictors on survival percentiles while adjusting for other covariates. The user-
written clad command (Jolliffe, Krushelnytskyy, and Semykina 2000) estimates condi-
tional quantiles only when censoring times are fixed and known for all observations
(Powell 1986), and its applicability is limited.
In this article, we present the laplace command for estimating Laplace regression
(Bottai and Zhang 2010). In section 2, we describe the syntax and options. In section 3,
we illustrate laplace with data from a randomized clinical trial. In section 4, we sketch
the methods and formulas. In section 5, we present the results of a small simulation
study.


© 2013 StataCorp LP st0294

2 The laplace command


2.1 Syntax
   
laplace depvar [indepvars] [if] [in] [, quantiles(numlist) failure(varname)
      sigma(varlist) reps(#) seed(#) tolerance(#) maxiter(#) level(#)]

by, statsby, and xi are allowed with laplace; see [U] 11.1.10 Prefix commands.
See [R] qreg postestimation for features available after estimation.

2.2 Options
quantiles(numlist) specifies the quantiles as numbers between 0 and 1; numbers larger
than 1 are interpreted as percentages. The default is quantiles(0.5), which cor-
responds to the median.
failure(varname) specifies the failure event; the value 0 indicates censored observa-
tions. If failure() is not specified, all observations are assumed to be uncensored.
sigma(varlist) specifies the variables to be included in the scale parameter model. The
default is constant only.
reps(#) specifies the number of bootstrap replications to be performed for estimating
the variance–covariance matrix and standard errors of the regression coefficients.
seed(#) sets the initial value of the random-number seed used by the bootstrap. If
seed() is specified, the bootstrapped estimates are reproducible (see [R] set seed).
tolerance(#) specifies the tolerance for the optimization algorithm. When the abso-
lute change in the log likelihood from one iteration to the next is less than or equal to
#, the tolerance() convergence criterion is met. The default is tolerance(1e-10).
maxiter(#) specifies the maximum number of iterations. When the number of itera-
tions equals maxiter(), the optimizer stops, displays an x, and presents the current
results. The default is maxiter(2000).
level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.

2.3 Saved results


laplace saves the following in e():
Scalars
    e(N)           number of observations
    e(N_fail)      number of failures
    e(n_q)         number of estimated quantiles
    e(reps)        number of bootstrap replications

Macros
    e(cmd)         laplace
    e(cmdline)     command as typed
    e(depvar)      name of dependent variable
    e(eqnames)     names of equations
    e(qlist)       requested quantiles
    e(vcetype)     title used to label Std. Err.
    e(properties)  b V
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance–covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample

3 Example: Survival in metastatic renal carcinoma


We illustrate the use of laplace with data from a clinical trial on 347 patients with
metastatic renal carcinoma. The patients were randomly assigned to either interferon-
α (IFN) or oral medroxyprogesterone (MPA) (Medical Research Council Renal Cancer
Collaborators 1999). A total of 322 patients died during follow-up. The outcome of
primary research interest is overall survival.

. use kidney_ca_l
(kidney cancer data)
. quietly stset months, failure(cens)

The numeric variable months represents the time to event or censoring, and the binary
variable cens indicates the failure status (0 = censored, 1 = death).

3.1 Median survival


We estimate a Laplace regression model where the response variable is time to death or
censoring (months) and the binary indicator for treatment (trt) is the only covariate.
We specify the event status with the option failure(). The default percentile is the
median (q50).

. laplace months trt, failure(cens)


Laplace regression No. of subjects = 347
No. of failures = 322

Robust
months Coef. Std. Err. z P>|z| [95% Conf. Interval]

q50
trt 3.130258 1.195938 2.62 0.009 .7862628 5.474254
_cons 6.80548 .7188408 9.47 0.000 5.396578 8.214382

The estimated median survival in the MPA group is 6.8 months (95% confidence
interval: [5.4, 8.2]). The difference (trt) in median survival between the treatment
groups is 3.1 months (95% confidence interval: [0.8, 5.5]). Median survival among
patients on IFN can be obtained with the postestimation command lincom.

. lincom _cons + trt


( 1) [q50]trt + [q50]_cons = 0

months Coef. Std. Err. z P>|z| [95% Conf. Interval]

(1) 9.935738 .9557906 10.40 0.000 8.062423 11.80905

Percentiles of survival time by treatment group can also be obtained from the Kaplan–
Meier estimate of the survivor function by using the command stci.

. stci, by(trt)
failure _d: cens
analysis time _t: months
no. of
trt subjects 50% Std. Err. [95% Conf. Interval]

MPA 175 6.80548 .8902896 4.86575 8.15342


IFN 172 9.830137 .8982793 7.7589 11.7041

total 347 7.956164 .5699226 6.90411 9.1726

The estimated median in the IFN group (9.8 months) differs slightly from the laplace
estimate (9.9 months) shown above. The Kaplan–Meier curve in the IFN group is flat
at the 50th percentile between 9.83 and 9.96 months of follow-up. The command stci
shows the lower limit of this interval while laplace shows a middle value.

3.2 Multiple survival percentiles


When it is relevant to estimate multiple percentiles of the distribution of survival time,
these can be specified with the option quantiles().

. laplace months trt, failure(cens) quantiles(25 50 75) rep(100) seed(123)


Laplace regression No. of subjects = 347
No. of failures = 322

Bootstrap
months Coef. Std. Err. z P>|z| [95% Conf. Interval]

q25
trt 1.509151 .8289345 1.82 0.069 -.1155312 3.133832
_cons 2.49863 .399623 6.25 0.000 1.715384 3.281877

q50
trt 3.130258 1.209658 2.59 0.010 .7593719 5.501145
_cons 6.80548 .9100921 7.48 0.000 5.021732 8.589227

q75
trt 3.663238 3.482536 1.05 0.293 -3.162407 10.48888
_cons 15.87945 1.714295 9.26 0.000 12.5195 19.23941

The treatment effect is larger at higher percentiles of survival time. The difference
between the two treatment groups at the 25th, 50th, and 75th percentiles is 1.5, 3.1,
and 3.7 months, respectively. When bootstrap is requested, one can test for differences
in treatment effects across survival percentiles with the postestimation command test.

. test [q25]trt = [q50]trt


( 1) [q25]trt - [q50]trt = 0
chi2( 1) = 2.59
Prob > chi2 = 0.1076

We fail to reject the hypothesis that the treatment effects at the 25th and 50th survival
percentiles are equal (p-value > 0.05).
Figure 1 shows the predicted percentiles from the 1st to the 99th in each treatment
group. The difference of 3 months in median survival between groups is represented by
the horizontal distance between the points A and B. Approximately 30% and 40% of the
patients on MPA and IFN, respectively, are estimated to live longer than 12 months. The
absolute difference of about 10% in the probability of surviving 12 months is represented
by the vertical distance between the points C and D.

[Figure 1 here: predicted survival percentiles (0–100, y axis) against follow-up time in months (0–60, x axis) for the two treatment groups, with points A, B, C, and D marked; see the caption below.]

Figure 1. Survival percentiles in the MPA (solid line) and IFN (dashed line) groups
estimated with Laplace regression. The horizontal distance between the points A and B
(3.1 months) indicates the difference in median survival between groups. The vertical
distance between C and D (about 10%) indicates the difference in the proportion of
patients estimated to survive 12 months.

3.3 Interactions between covariates


Royston, Sauerbrei, and Ritchie (2004) analyzed the same data and described how a
continuous prognostic factor, white cell count (wcc), affects the treatment effect as
measured by a relative hazard. We now perform a similar analysis by using Laplace
regression for the median survival. We include as covariates the treatment indicator
(trt), three equally sized classes of white cell counts (cwcc) by means of two indicator
variables, and their interactions.

. xi: laplace months i.trt*i.cwcc, failure(cens)


i.trt _Itrt_0-1 (naturally coded; _Itrt_0 omitted)
i.cwcc _Icwcc_0-2 (naturally coded; _Icwcc_0 omitted)
i.trt*i.cwcc _ItrtXcwc_#_# (coded as above)
Laplace regression No. of subjects = 347
No. of failures = 322

Robust
months Coef. Std. Err. z P>|z| [95% Conf. Interval]

q50
_Itrt_1 8.01462 2.270786 3.53 0.000 3.563962 12.46528
_Icwcc_1 2.262442 2.068403 1.09 0.274 -1.791554 6.316438
_Icwcc_2 -2.496523 1.645959 -1.52 0.129 -5.722544 .7294982
_ItrtXcwc_1_1 -5.737988 3.241483 -1.77 0.077 -12.09118 .6152021
_ItrtXcwc_1_2 -7.751629 2.645534 -2.93 0.003 -12.93678 -2.566478
_cons 6.90203 1.658547 4.16 0.000 3.651337 10.15272

The predicted median survival can be obtained with standard postestimation commands
such as predict or adjust.
. adjust, by(trt cwcc) format(%2.0f) noheader

White Cell Counts


treatment Low Medium High

MPA 7 9 4
IFN 15 11 5

Key: Linear Prediction

The between-treatment-group difference in median survival varies from 8 months in the low white cell count category to 1 month in the high white cell count category. We
test for interaction between treatment and white cell counts with the postestimation
command testparm.
. testparm _ItrtX*
( 1) [q50]_ItrtXcwc_1_1 = 0
( 2) [q50]_ItrtXcwc_1_2 = 0
chi2( 2) = 8.59
Prob > chi2 = 0.0137

We reject the null hypothesis of equal treatment effect across categories of white cell
counts (p = 0.0137). The treatment effect seems to be largest in patients with low white
cell counts.

3.4 Laplace regression with uncensored data


Suppose all the values for the variable months were uncensored times at death. The
laplace command can be used with uncensored observations by omitting the failure()
option. In this case, laplace is simply an alternative to the standard quantile regression
commands qreg and sqreg.

. qui laplace months trt


. adjust, by(trt) format(%3.2f) noheader

treatment xb

MPA 6.77
IFN 9.89

Key: xb = Linear Prediction


. qui qreg months trt
. adjust, by(trt) format(%3.2f) noheader

treatment xb

MPA 6.77
IFN 9.96

Key: xb = Linear Prediction

The number of observations in the MPA group is odd (175 patients), and the sample
median survival is 6.77 months. The number of observations in the IFN group is even
(172 patients), and the median is not uniquely defined. The two nearest values are 9.83
and 9.96 months. The command qreg picks the larger of the two, while laplace picks
a value in between.
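
The even-group case can be checked directly; a minimal sketch (assuming trt == 1 codes the IFN group):

* Sketch: with 172 IFN observations, the 86th and 87th order statistics are
* the two middle values that straddle the median.
sort trt months
by trt: generate long obs_in_group = _n
list months if trt == 1 & inlist(obs_in_group, 86, 87)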

4 Methods and formulas


In this section, we follow the description provided by Bottai and Zhang (2010). Suppose
we have a sample of size n. Let ti , i = 1, . . . , n, be a continuous outcome variable, ci be
a continuous censoring variable, and xi = {x1,i , . . . , xr,i } and zi = {z1,i , . . . , zs,i } be
two vectors of covariates. The sets of covariates contained in xi and zi may partially or
entirely overlap. We assume that ci is independent of ti conditionally on the covariates.
Suppose we observe (yi , di , xi , zi ), with yi = min(ti , ci ) and di = I(ti ≤ ci ), where I(A)
denotes the indicator function of the event A. We assume that

ti = xi βp + exp(zi σp )εi (1)

where βp = {βp,1 , . . . , βp,r } and σp = {σp,1 , . . . , σp,s } indicate the unknown parameter
vectors, and εi are independent and identically distributed error terms that follow a
standard Laplace distribution, f (εi ) = p(1 − p) exp{[I(εi ≤ 0) − p]εi }. For any given
p ∈ (0, 1), the p-quantile of the conditional distribution of ti given xi and zi is xi βp
because P (ti ≤ xi βp |xi , zi ) = p.
The command laplace estimates the (r + s)-dimensional parameter vector {βp , σp }
by maximizing the Laplace likelihood function described by Bottai and Zhang (2010).
It uses an iterative maximization algorithm based on the gradient of the log likelihood
that generates a finite sequence of parameter values along which the likelihood increases.
Briefly, from a current parameter value, the algorithm searches the positive semiline in
the direction of the gradient for a new parameter value where the likelihood is larger.

The algorithm stops when the change in the likelihood is less than the specified tolerance.
Convergence is guaranteed by the continuity and concavity of the likelihood.
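
As an illustration of the objective function (not the command's internal code), the following sketch evaluates the log likelihood that uncensored observations contribute under the standard Laplace error model, at hypothetical candidate values for the model of section 3.1 (b0 for the intercept, b1 for the trt coefficient, s for the constant log-scale term); censored observations contribute through the survivor function instead, as described by Bottai and Zhang (2010).

* Minimal sketch of the uncensored Laplace log likelihood at p = 0.5,
* using model (1) with covariate trt and a constant scale term.
local p  = 0.5
local b0 = 6.8           // hypothetical candidate intercept
local b1 = 3.1           // hypothetical candidate coefficient on trt
local s  = 2             // hypothetical candidate log-scale (z_i = 1)
generate double eps  = (months - `b0' - `b1'*trt)/exp(`s')
generate double llik = log(`p'*(1 - `p')) - `s' + ((eps <= 0) - `p')*eps
quietly summarize llik
display "log likelihood = " r(sum)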
The asymptotic variance of the estimator β̂p for the parameter βp is derived by considering the estimating condition reported by Bottai and Zhang (2010, eq. 4), S(β̂p) = 0, where

S(\beta_p) = \sum_{i=1}^{n} \frac{x_i}{\exp(z_i \hat{\sigma}_p)}
\left[\, p - I(y_i \le x_i \beta_p) - \frac{p-1}{1 - \hat{F}(y_i \mid x_i)}\, I(y_i \le x_i \beta_p)\,(1 - d_i) \right]

with \hat{F}(y_i \mid x_i) = p \exp\{(1-p)(y_i - x_i \beta_p)/\exp(z_i \hat{\sigma}_p)\}. Following the standard asymptotic theory for method-of-moments estimators, β̂p approximately follows a normal distribution with mean βp∗ and variance V, where βp∗ indicates the expected value of β̂p, V = H(β̂p)−1 S(β̂p) S(β̂p)′ H(β̂p)−1, and H(β̂p) = ∂S(βp)/∂βp evaluated at βp = β̂p. The derivative in H(β̂p) is evaluated numerically. Alternatively, the standard errors can be obtained with
bootstrap by specifying the reps() option.

5 Simulation
In this section, we present the setup and results of a small simulation study to as-
sess the finite sample performance of the Laplace regression estimator under different
data-generating mechanisms. We contrast the performance of Laplace with that of the
Kaplan–Meier estimator, a standard, nonparametric, uniformly consistent, and asymp-
totically normal estimator of the survival function. To generate the survival estimates,
we used the sts command.
We generated 500 samples from (1) in each of the six different simulation scenarios
that arose from the combination of two sample sizes and three data-generating mech-
anisms. In each scenario, we estimated five percentiles (p = 0.10, 0.30, 0.50, 0.70, 0.90)
with Laplace regression and the Kaplan–Meier estimator. The two sample sizes were
n = 100 and n = 1,000. The three different data-generating mechanisms were obtained
by changing the values of zi, σp, and the censoring variable ci. In all simulation scenarios, xi = (1, x1,i)′, with x1,i ∼ Bernoulli(0.5), βp = (5, 3)′, and εi was a standard normal centered at the quantile being estimated.
In scenario number 1, zi = 1, σp = 1, and the censoring variable was set equal to a constant ci = 1,000 for all individuals. In this scenario, no observations were censored, and Laplace regression was equivalent to ordinary quantile regression. In scenario number 2, zi = 1, σp = 1, and the censoring variable was generated from the same distribution as the outcome variable ti. This ensured an expected censoring rate of 50% in both covariate patterns (x1,i = 0, 1). In scenario number 3, zi = (1, x1,i)′ and σp = (0.5, 0.5)′. The censoring variable ci was generated from the same distribution as the outcome variable ti. In this scenario, the standard deviation of ti was equal to 0.5 when x1,i = 0 and equal to 1 when x1,i = 1.
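
For concreteness, one draw from scenario 2 at the median could be generated along the following lines (a sketch of the setup as described; the seed is arbitrary, and at p = 0.5 a standard normal error is already centered at the quantile being estimated):

* One simulated sample from scenario 2 (n = 1,000), p = 0.5.
clear
set obs 1000
set seed 12345
generate byte x1 = runiform() < 0.5
generate double t = 5 + 3*x1 + exp(1)*rnormal()   // outcome: beta_p = (5, 3)', scale exp(sigma_p) with sigma_p = 1
generate double c = 5 + 3*x1 + exp(1)*rnormal()   // censoring drawn from the same distribution
generate double y = min(t, c)
generate byte d = t <= c                          // failure indicator
laplace y x1, failure(d)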

The following table shows the observed relative mean squared error multiplied by
1,000 for the predicted quantile in the group x1,i = 1 in each combination of sample size
(obs), data-generating scenario (scenario), and percentile (percentile) for Laplace
(top entry) and Kaplan–Meier (bottom entry).

. table percentile scenario obs, contents(mean msel mean msekm) format(%4.3f)
> stubwidth(12)

------------------------------------------------------------------
                             obs and scenario
              ---------- 100 ----------   --------- 1000 ---------
  percentile      1       2       3           1       2       3
------------------------------------------------------------------
          10    1.187   1.395   1.268       0.129   0.136   0.126
                1.233   1.496   1.320       0.132   0.140   0.132

          30    0.597   0.685   0.680       0.064   0.073   0.067
                0.606   0.792   0.831       0.064   0.078   0.075

          50    0.496   0.570   0.653       0.053   0.065   0.073
                0.505   0.860   0.941       0.053   0.074   0.075

          70    0.513   0.639   0.711       0.050   0.131   0.144
                0.518   1.329   1.050       0.050   0.113   0.094

          90    0.728   1.661   1.930       0.063   0.876   0.955
                0.731   1.835   1.701       0.063   0.478   0.450
------------------------------------------------------------------

The relative mean squared error was smaller for Laplace than for Kaplan–Meier at lower
quantiles and with the smaller sample size.
Figure 2 shows the relative mean squared error of Laplace (x axis) and Kaplan–Meier
(y axis) estimators of the quantile in group x1,i = 1 over all simulation scenarios.
The Laplace estimator had fewer extreme values than Kaplan–Meier. The overall
concordance correlation coefficient (command concord) was 72.2%. After the 10%
largest differences were excluded, the coefficient was 99.1%.

[Figure 2 here: scatterplot of the relative MSE of the Kaplan–Meier estimator (y axis, 0–30) against the relative MSE of the Laplace estimator (x axis, 0–20); see the caption below.]

Figure 2. Relative mean squared error of Laplace (x axis) and Kaplan–Meier (y axis)
estimators of the percentiles in group x1,i = 1 over all simulation scenarios. The solid
45-degree line indicates the equal relative mean squared error of the two estimators.

The following two tables show the performance of the estimator of the asymptotic
standard error for the regression coefficients βp,0 (first table) and βp,1 (second table).
In each cell of each table, the top entry is the average estimated asymptotic standard
error, and the bottom entry is the corresponding observed standard deviation across
the simulated samples.

. table percentile scenario obs, contents(mean s0 mean ms0) format(%4.3f)
> stubwidth(12)

------------------------------------------------------------------
                             obs and scenario
              ---------- 100 ----------   --------- 1000 ---------
  percentile      1       2       3           1       2       3
------------------------------------------------------------------
          10    0.237   0.228   0.131       0.076   0.077   0.039
                0.235   0.251   0.123       0.073   0.082   0.039

          30    0.185   0.200   0.098       0.059   0.062   0.031
                0.182   0.193   0.097       0.058   0.067   0.032

          50    0.176   0.194   0.097       0.056   0.060   0.030
                0.169   0.185   0.093       0.053   0.064   0.032

          70    0.188   0.198   0.098       0.059   0.064   0.032
                0.185   0.207   0.103       0.057   0.071   0.035

          90    0.225   0.227   0.114       0.077   0.076   0.038
                0.231   0.255   0.141       0.072   0.087   0.046
------------------------------------------------------------------

. table percentile scenario obs, contents(mean s1 mean ms1) format(%4.3f)
> stubwidth(12)

------------------------------------------------------------------
                             obs and scenario
              ---------- 100 ----------   --------- 1000 ---------
  percentile      1       2       3           1       2       3
------------------------------------------------------------------
          10    0.349   0.353   0.276       0.109   0.110   0.087
                0.330   0.351   0.263       0.104   0.113   0.088

          30    0.277   0.292   0.232       0.084   0.089   0.070
                0.265   0.269   0.216       0.079   0.092   0.066

          50    0.255   0.279   0.219       0.080   0.086   0.068
                0.250   0.257   0.226       0.077   0.086   0.073

          70    0.272   0.293   0.227       0.084   0.090   0.070
                0.265   0.277   0.236       0.081   0.094   0.076

          90    0.337   0.339   0.246       0.109   0.108   0.085
                0.325   0.320   0.284       0.104   0.109   0.098
------------------------------------------------------------------

The estimated standard errors were similar to the observed standard deviation across
all cells for both regression coefficients.

6 Acknowledgment
Nicola Orsini was partly supported by a Young Scholar Award from the Karolinska
Institutet’s Strategic Program in Epidemiology.

7 References
Bottai, M., and J. Zhang. 2010. Laplace regression with censored data. Biometrical
Journal 52: 487–503.

Jolliffe, D., B. Krushelnytskyy, and A. Semykina. 2000. sg153: Censored least absolute
deviations estimator: CLAD. Stata Technical Bulletin 58: 13–16. Reprinted in Stata
Technical Bulletin Reprints, vol. 10, pp. 240–244. College Station, TX: Stata Press.

Medical Research Council Renal Cancer Collaborators. 1999. Interferon-α and survival
in metastatic renal carcinoma: Early results of a randomised controlled trial. Lancet
353: 14–17.

Powell, J. L. 1986. Censored regression quantiles. Journal of Econometrics 32: 143–155.

Royston, P., W. Sauerbrei, and A. Ritchie. 2004. Is treatment with interferon-alpha effective in all patients with metastatic renal carcinoma? A new approach to the investigation of interactions. British Journal of Cancer 90: 794–799.

About the authors


Matteo Bottai is a professor of biostatistics in the Unit of Biostatistics at the Institute of
Environmental Medicine at Karolinska Institutet in Stockholm, Sweden.
Nicola Orsini is an associate professor of medical statistics and an assistant professor of epi-
demiology in the Unit of Biostatistics and the Unit of Nutritional Epidemiology at the Institute
of Environmental Medicine at Karolinska Institutet in Stockholm, Sweden.
The Stata Journal (2013)
13, Number 2, pp. 315–322

Importing U.S. exchange rate data from the Federal Reserve and standardizing country names across datasets
Betul Dicle
New Orleans, LA
[email protected]

John Levendis
Loyola University New Orleans
New Orleans, LA
[email protected]

Mehmet F. Dicle
Loyola University New Orleans
New Orleans, LA
[email protected]

Abstract. fxrates is a command to import historical U.S. exchange rate data


from the Federal Reserve and to calculate the daily change of the exchange rates.
Because many cross-country datasets use different spellings and conventions for
country names, we also introduce a second command, countrynames, to convert
country names to a common naming standard.
Keywords: dm0069, fxrates, countrynames, exchange rates, country names, stan-
dardization, data management, historical data

1 Introduction
Economic and financial researchers must often convert between currencies to facilitate
cross-country comparisons. We provide a command, fxrates, that downloads daily
foreign exchange rates relative to the U.S. dollar from the Federal Reserve’s database.
Working with multiple cross-country datasets, such as international foreign exchange
rates, introduces a unique problem: variations in country names. They are often spelled
differently or follow different grammatical conventions across datasets. For example,
the name used for North Korea often differs among datasets; it could be “North Korea”, “Korea, North”, “Korea, Democratic People’s Republic”, or even “Korea, DPR”. Likewise, “United States of America” is often “United States”, “USA”, “U.S.A.”, “U.S.”, or “US”.
A dataset may have country names in all caps. Country names could also have inad-
vertent leading or trailing spaces. Thus we provide a second command, countrynames,
that renames many country names to follow a standard convention. The command is,
of course, editable, so researchers may opt to use their own naming preferences.



2 The fxrates command


2.1 Syntax
  
fxrates [namelist] [, period(2000 | 1999 | 1989) chg(ln | per | sper) save(filename)]

2.2 Options
namelist is a list of country abbreviations for the countries whose foreign exchange data
you wish to download from the Federal Reserve’s website. Exchange rates for all
available countries will be downloaded if namelist is omitted. The list of countries
includes the following:
al Australia ma Malaysia
au Austria mx Mexico
be Belgium ne Netherlands
bz Brazil nz New Zealand
ca Canada no Norway
ch China, P.R. po Portugal
dn Denmark si Singapore
eu Economic and Monetary Union member countries sf South Africa
ec European Union ko South Korea
fn Finland sp Spain
fr France sl Sri Lanka
ge Germany sd Sweden
gr Greece sz Switzerland
hk Hong Kong ta Taiwan
in India th Thailand
ir Ireland uk United Kingdom
it Italy ve Venezuela
ja Japan
period(2000 | 1999 | 1989) specifies which block of dates to download. The Federal
Reserve foreign exchange database is separated into three blocks: one ending in
1989, a second for 1990–1999, and a third for 2000 through the present. The default
(obtained by omitting period()) is to download the three separate files and merge
them automatically so that the user has all foreign exchange market data available.
You can specify one or more periods. If you know which data range you wish to
download, however, you can save time by specifying which of the three blocks to
download. Specifying all three periods is equivalent to the default of downloading
all the data.

chg(ln | per | sper) specifies which periodic changes to compute. Three different percent changes can be calculated for each exchange rate series: the natural log difference, the percentage change, and the symmetrical percentage change. Whenever one of these is specified, a new variable is created with the appropriate prefix: ln for the first-difference of logs method, per for the percent change, and sper for the symmetric percent change.
save(filename) is the output filename. filename is created under the current working
directory.

2.3 Using fxrates to import historical exchange rate data

Example

In this example, we use fxrates to import the entire daily exchange rate dataset
from the Federal Reserve. Because we did not specify the countries, fxrates downloads
data from all countries. Because we did not specify the period, fxrates defaults to
downloading data for all available dates.
. fxrates
au does not have 00
be does not have 00
(output omitted )
ve does not have 89
. summarize
Variable Obs Mean Std. Dev. Min Max

date 10551 11403.8 4264.338 4018 18788


_al 10145 .8764938 .2391958 .4828 1.4885
_au 7013 15.21975 3.999031 9.5381 26.0752
_be 7021 38.61327 8.036983 27.12 69.6
_bz 4134 1.944 .6851041 .832 3.945

_ca 10158 1.227691 .1689277 .9168 1.6128


_ch 7592 6.110467 2.380834 1.5264 8.7409
_dn 10151 6.702978 1.32041 4.6605 12.3725
_eu 3131 1.197019 .1959648 .827 1.601
_ec 4902 1.137739 .1776148 .6476 1.4557

(output omitted )

_sd 10151 6.633961 1.631 3.867 11.027


_sz 10152 1.796176 .7191334 .8352 4.318
_ta 6665 31.36318 4.002814 24.507 40.6
_th 7571 30.99084 7.293934 20.36 56.1
_uk 10152 1.779829 .3176284 1.052 2.644

_ve 4127 1.528076 1.111678 .1697 4.3



Output such as au does not have 00 indicates that there were no observations in a
particular block of years (in this case, the 2000–present block) for the particular country.
When this appears, it is most often the case that the currency has been discontinued,
as when Austria started using the euro.

Example

In this second example, we download the exchange rates of the U.S. dollar versus the
French franc, the German deutschmark, and the Hong Kong dollar for all the available
dates.

. fxrates fr ge hk
fr does not have 00
ge does not have 00
. summarize
Variable Obs Mean Std. Dev. Min Max

date 10550 11404.5 4263.934 4021 18788


_fr 7021 5.673864 1.227564 3.8462 10.56
_ge 7021 2.143872 .5509681 1.3565 3.645
_hk 7652 7.63056 .5069678 5.127 8.7

Example

In this example, we download the exchange rate data for United States versus France,
Germany, and Hong Kong. Because no period was specified, fxrates downloads the
data from all available dates. We also specified that fxrates calculate the daily percent
change, calculated in two different ways: as the log first-difference and as the arithmetic
daily percent change. The log-difference percent change for each country is prefixed by
ln; the arithmetic percent change for each country is prefixed by per.
. fxrates fr ge hk, chg(ln per)
fr does not have 00
ge does not have 00
. summarize
Variable Obs Mean Std. Dev. Min Max

date 10550 11404.5 4263.934 4021 18788


_fr 7021 5.673864 1.227564 3.8462 10.56
_ge 7021 2.143872 .5509681 1.3565 3.645
_hk 7652 7.63056 .5069678 5.127 8.7
ln_fr 6743 -.0000122 .0061762 -.0416059 .0587457

per_fr 6743 6.85e-06 .0061803 -.0407522 .0605055


ln_ge 6743 -.0001283 .0064045 -.0414075 .0586776
per_ge 6743 -.0001078 .0064049 -.0405619 .0604333
ln_hk 7363 .000054 .0023756 -.0410614 .0653051
per_hk 7363 .0000568 .0023914 -.0402298 .0674847
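
The prefixed change variables can also be reproduced by hand from the level series; a minimal sketch for the French franc series _fr (an illustration, not the fxrates internals), treating successive observations as adjacent regardless of calendar gaps and using one common definition of the symmetric percent change:

sort date
generate long obsno = _n
tsset obsno
generate double ln2_fr   = ln(_fr) - ln(L._fr)               // log first-difference
generate double per2_fr  = (_fr - L._fr)/L._fr               // percent change
generate double sper2_fr = (_fr - L._fr)/((_fr + L._fr)/2)   // symmetric percent change (one common definition)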

Example

In this final example, we download the U.S. dollar exchange rate versus the Japanese
yen and the Mexican peso. We calculate the daily percent change by calculating the
first-differences of natural logs for the data ending in 1999 (that is, for the data ending
in 1989 plus the data from 1990 through 1999).

. fxrates ja mx, period(1999 1989) chg(ln)


mx does not have 89
. summarize
Variable Obs Mean Std. Dev. Min Max

date 7565 9315 3057.56 4021 14609


_ja 7267 195.5763 74.42725 81.12 358.44
_mx 1541 7.258319 2.154108 3.1 10.63
ln_ja 6980 -.0001587 .0063255 -.056302 .0625558
ln_mx 1477 .0005535 .0132652 -.1796934 .1926843

3 The countrynames command


3.1 Syntax

countrynames countryvar

3.2 Description
The command countrynames changes the name of a country in a dataset to corre-
spond to a more standard set of names. By default, countrynames creates a new
variable, _changed, containing numeric codes that indicate which country names have
been changed. A code of 0 indicates no change; a code of 1 indicates that the coun-
try’s name has been changed. We recommend you run countrynames on both datasets
whenever two different cross-country datasets are being merged. This minimizes the
chance that a difference in names between datasets will prevent a proper merge from
occurring. However, if you wish to keep a variable with the original names, you need
to copy the variable to another variable. For example, before running countrynames
country, you would need to type generate origcountry = country.

3.3 Using the countrynames command to convert country names to a common naming standard

Example

In this example, we use two macroeconomic datasets that have countries named
slightly differently. The first dataset is native to and shipped with Stata.

. sysuse educ99gdp, clear


(Education and GDP)

Though the dataset is very small, it suffices for our purposes. Notice the spelling of
United States in this dataset.

. list

country public private

1. Australia .7 .7
2. Britain .7 .4
3. Canada 1.5 .9
4. Denmark 1.5 .1
5. France .9 .4

6. Germany .9 .2
7. Ireland 1.1 .3
8. Netherlands 1 .4
9. Sweden 1.5 .2
10. United States 1.1 1.2

. save temp1.dta, replace


(note: file temp1.dta not found)
file temp1.dta saved

In fact, all the spellings in this dataset correspond with the preferred names listed in
countrynames, so nothing is required of us here. We could run countrynames just to
be on the safe side, but it would not have any effect. It is, however, good practice to
run countrynames whenever merging datasets to maximize the chances that the two
datasets use the same country names.
The second dataset, using World Health Organization data, is from Kohler and
Kreuter (2005). The data are available from the Stata website.
. net from http://www.stata-press.com/data/kk/
(output omitted )
. net get data
(output omitted )
. use who2001.dta, clear

Notice how the United States is called United States of America in this dataset.

. list country

country

1. Afghanistan
2. Albania
(output omitted )
180. United States of America
(output omitted )
187. Zambia
188. Zimbabwe

We now run countrynames on this dataset to standardize the names of the countries.
This will rename United States of America to United States, as it was in the first
dataset.

. countrynames country
. list country _changed

country _changed

1. Afghanistan 0
2. Albania 0
(output omitted )
180. United States 1
(output omitted )
187. Zambia 0
188. Zimbabwe 0

Notice that the generated variable, _changed, is equal to 1 for the United States entry;
this indicates that its name was once something different.
Having run countrynames on both datasets, we have increased the chances that
countries in both datasets follow the same naming convention. We are now safe to
merge the datasets:

. drop _changed
. sort country
. merge 1:1 country using temp1.dta
Result # of obs.

not matched 180


from master 179 (_merge==1)
from using 1 (_merge==2)
matched 9 (_merge==3)

The merge results table above is important: it is the result of merging a dataset on which the countrynames command was run (master: who2001.dta) with a dataset on which it was not (using: temp1.dta). If the using dataset includes a country name that countrynames does not rename to the common standard, it will appear as unmatched in the merge results table.
. sort country
. list country

country

1. Afghanistan
2. Albania
(output omitted )
180. United Arab Emirates
(output omitted )
188. Zambia
189. Zimbabwe

3.4 How to edit preferred country names within the countrynames command
It is possible to add, remove, or change country name entries within the countrynames
command. After opening the countrynames.ado file with a do-file editor (any text
editor), you can delete country name entries, add new entries, or change spellings ac-
cording to your preferences. Any changes made to the countrynames.ado file should
be saved. The discard command then forces Stata to drop and reload ado-files, so the next call to countrynames uses your updated version. We recommend that you confirm updates to the countrynames command with a merge table.1
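
A hypothetical workflow for verifying a local edit:

* After saving changes to countrynames.ado:
discard                               // force Stata to reload the edited ado-file
use who2001.dta, clear
countrynames country
sort country
merge 1:1 country using temp1.dta     // inspect the merge results table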

4 Reference
Kohler, U., and F. Kreuter. 2005. Data Analysis Using Stata. College Station, TX:
Stata Press.

About the authors


Betul Dicle recently earned her PhD from the political science department of Louisiana State
University.
John Levendis is an assistant professor of economics at Loyola University New Orleans.
Mehmet F. Dicle is an assistant professor of finance at Loyola University New Orleans.

1. Please note that if we ever do an update to our program, the user edits to the ado-file will be lost
when users grab the updated ado-file.
The Stata Journal (2013)
13, Number 2, pp. 323–328

Generating Manhattan plots in Stata


Daniel E. Cook
University of Iowa
Iowa City, IA
[email protected]

Kelli R. Ryckman
University of Iowa
Iowa City, IA
[email protected]

Jeffrey C. Murray
University of Iowa
Iowa City, IA
jeff[email protected]

Abstract. Genome-wide association studies hold the potential for discovering


genetic causes for a wide range of diseases, traits, and behaviors. However, the
incredible amount of data handling, advanced statistics, and visualization have
made conducting these studies difficult for researchers. Here we provide a tool,
manhattan, for helping investigators easily visualize genome-wide association stud-
ies data in Stata.
Keywords: st0295, manhattan, Manhattan plots, genome-wide association studies,
single nucleotide polymorphisms

1 Introduction
The number of published genome-wide association studies (GWAS) has seen a staggering
level of growth from 453 in 2007 to 2,137 in 2010 (Hindorff et al. 2011). These studies
aim to identify the genetic cause for a wide range of diseases, including Alzheimer’s
(Harold et al. 2009), cancer (Hunter et al. 2007), and diabetes (Hayes et al. 2007), and
to elucidate variability in traits, behavior, and other phenotypes. This is accom-
plished by looking at hundreds of thousands to millions of single nucleotide poly-
morphisms and other genetic features across upward of 10,000 individual genomes
(Corvin, Craddock, and Sullivan 2010). These studies generate enormous amounts of
data, which present challenges for researchers in handling data, conducting statistics,
and visualizing data (Buckingham 2008).
One method of visualizing GWAS data is through the use of Manhattan plots, so
called because of their resemblance to the Manhattan skyline. Manhattan plots are
scatterplots, but they are graphed in a characteristic way. To create a Manhattan plot,
you need to calculate p-values, which are generated through one of a variety of statistical
tests. However, because of the large number of hypotheses being tested in a GWAS, local
significance levels typically fall below p = 10^−5 (Ziegler, König, and Thompson 2008).
Resulting p-values associated with each marker are −log10 transformed and plotted on
the y axis against their chromosomal position on the x axis. Chromosomes lie end to
end on the x axis and often include the 22 autosomal chromosomes and the X, Y, and
mitochondrial chromosomes.
Manhattan plots are useful for a variety of reasons. They allow investigators to
visualize hundreds of thousands to millions of p-values across an entire genome and to
quickly identify potential genetic features associated with phenotypes. They also enable



investigators to identify clusters of genetic features, which associate because of linkage


disequilibrium. They can be used diagnostically—to ensure GWAS data are coded and
formatted appropriately. Finally, they offer an easily interpretable graphical format to
present signals with formal levels of significance. For these reasons, Manhattan plots
are a common feature of GWAS publications.
While Manhattan plots are in essence scatterplots, formatting GWAS datasets for
their generation can be difficult and time consuming. To help researchers in this process,
we have developed a program executed through a new command, manhattan, that
formats data appropriately for plotting and allows for annotation and customization
options of Manhattan plots.

2 Data formatting
Following data cleaning and statistical tests, researchers are typically left with a dataset
consisting of, at a minimum, a list of genetic features (string), p-values (real), chromo-
somes (integer), and their base pair location on a chromosome (integer). Using the
manhattan command, a user specifies these variables. manhattan uses temporary vari-
ables to manipulate data into a format necessary for plotting. The program first iden-
tifies the number of chromosomes present and generates base pair locations relative to
their distance from the beginning of the first chromosome as if they were laid end to end
in numerical order. The format in which p-values are specified is detected and, if need
be, log transformed. manhattan then calculates the median base pair location of each
chromosome as locations to place labels. Labels are generated by using chromosome
numbers except for the sex chromosomes and mitochondrial chromosomes, which define
chromosomes 23, 24, and 25 with the X, Y , and M labels, respectively.
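
The core of this reformatting can be sketched in a few lines of Stata (an illustration of the idea rather than the program's actual code; chr, bp, and pvalue are the variable names used in the examples in section 4, and the helper variables are illustrative):

sort chr bp
generate double logp = -log10(pvalue)                  // -log10 transform of the p-values
by chr: generate double chrlen = bp[_N]                // length of each chromosome
by chr: generate byte first = (_n == 1)
generate double offset = sum(chrlen*first) - chrlen    // total length of preceding chromosomes
generate double xpos = bp + offset                     // genome-wide x position, chromosomes laid end to end
by chr: egen double labpos = median(xpos)              // label position for each chromosome
display "Bonferroni threshold: " -log10(0.05/_N)       // assuming a 0.05 family-wise level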
Once data have been reformatted in manhattan, plots are generated. Additional
options may require additional data manipulation. These options include spacing(),
bonferroni(), and mlabel().

3 The manhattan command


3.1 Syntax
  
manhattan chromosome base-pair pvalue [if] [, options]

options are listed in section 3.2.



3.2 Options
options Description

Plot options
title(string) display a title
caption(string) display a caption
xlabel(string) set x label; default is xlabel(Chromosome)
width(#) set width of plot; default is width(15)
height(#) set height of plot; default is height(5)

Chromosome options
x(#) specify chromosome number to be labeled as
X; default is x(23)
y(#) specify chromosome number to be labeled as
Y ; default is y(24)
mito(#) specify chromosome number to be labeled as
M ; default is mito(25)

Graph options
bonferroni(h | v | n) draw a line at Bonferroni significance level;
label line with horizontal (h), vertical (v),
or no (n) labels
mlabel(var) set a variable to use for labeling markers
mthreshold(# | b) set a −log(p-value) above which markers will
be labeled, or use b to set your threshold
at the Bonferroni significance level
yline(#) set log(p-value) at which to draw a line
labelyline(h | v) label line specified with yline() by using
horizontal labels (h) or vertical labels (v)
addmargin add a margin to the left and right of the plot,
leaving room for labels

Style options
color1(color) set first color of markers
color2(color) set second color of markers
linecolor(color) set the color of Bonferroni line and label
or y line and label

4 Examples
The following examples were created using manhattan gwas.dta, which is available
as an ancillary file within the manhattan package. All the p-values were generated
randomly; therefore, all genetic elements are in linkage equilibrium and are not linked.

4.1 Example 1
Below you will find a typical Manhattan plot generated with manhattan. Several options
were specified in the generation of this plot. First, bonferroni(h) is used to specify
that a line be drawn at the Bonferroni level of significance. The h indicates that the
label should be placed horizontally, on the line. Next, mlabel(snp) is used to indicate
that markers should be labeled with the variable snp, which contains the names of each
marker. Additionally, mthreshold(b) is used to set a value at which to begin labeling
markers. In this case, b is used to indicate that markers should be labeled at −log10
(p-values) greater than the Bonferroni significance level. Finally, addmargin is used to
add space on either side of the plot to prevent labels from running off the plot.

. manhattan chr bp pvalue, bonferroni(h) mlabel(snp) mthreshold(b) addmargin


p-values log transformed.
Bonferroni Correction -log10(p) = 5.2891339
Label threshold set to Bonferroni value.
[Manhattan plot produced by the command: −log10(p-values) on the y axis (0–8) by chromosome (1–22 and X) on the x axis, with a horizontal Bonferroni line at 5.29 and the markers above it labeled with their SNP names.]

4.2 Example 2
Here yline(6.5) is used to draw a horizontal line at a −log10(p-value) of 6.5, and labelyline(v) adds an axis label for the value of this line. Additionally, the variable used for marker labels is identified using mlabel(snp), and the threshold at which to begin adding labels to markers is set to the same value as the horizontal line by using mthreshold(6.5). Spacing is added between chromosomes with spacing(1) to keep labels on the x axis from running into one another. A margin is added on either side of the plot by using addmargin, because some of the marker labels would otherwise fall off the plot. The colors of the markers are changed with color1(black) and color2(gray). Finally, the color of the line drawn with yline() and of its label has been changed to black by using linecolor(black).

. manhattan chr bp pvalue, yline(6.5) labelyline(v) mlabel(snp) mthreshold(6.5)


> spacing(1) addmargin color1(black) color2(gray) linecolor(black)
p-values log transformed.
[Manhattan plot produced by the command: −log10(p-values) by chromosome, with markers drawn in black and gray, a horizontal reference line at 6.5, and the markers above it labeled with their SNP names.]

5 Conclusions
As the number of GWAS publications continues to grow, easier tools are needed for in-
vestigators to manipulate, perform statistics on, and visualize data. manhattan aims to
provide an easier, more standard method by which to visualize GWAS data in Stata. We
welcome help in the development of manhattan by users and hope to improve manhattan
in response to user suggestions and comments.

6 Acknowledgments
This work was supported by the March of Dimes (1-FY05-126 and 6-FY08-260), the Na-
tional Institutes of Health (R01 HD-52953, R01 HD-57192), and the Eunice Kennedy Shriver
National Institute of Child Health and Human Development (K99 HD-065786). The con-
tent is solely the responsibility of the authors and does not necessarily represent the
official views of the National Institutes of Health or the Eunice Kennedy Shriver Na-
tional Institute of Child Health and Human Development.

7 References
Buckingham, S. D. 2008. Scientific software: Seeing the SNPs between us. Nature
Methods 5: 903–908.

Corvin, A., N. Craddock, and P. F. Sullivan. 2010. Genome-wide association studies:


A primer. Psychological medicine 40: 1063–1077.

Harold, D., R. Abraham, P. Hollingworth, R. Sims, A. Gerrish, M. L. Hamshere, J. Singh


Pahwa, V. Moskvina, K. Dowzell, A. Williams, N. Jones, C. Thomas, A. Stretton,

A. R. Morgan, S. Lovestone, J. Powell, P. Proitsi, M. K. Lupton, C. Brayne, D. C.


Rubinsztein, M. Gill, B. Lawlor, A. Lynch, K. Morgan, K. S. Brown, P. A. Passmore,
D. Craig, B. McGuinness, S. Todd, C. Holmes, D. Mann, A. D. Smith, S. Love, P. G.
Kehoe, J. Hardy, S. Mead, N. Fox, M. Rossor, J. Collinge, W. Maier, F. Jessen,
B. Schürmann, H. van den Bussche, I. Heuser, J. Kornhuber, J. Wiltfang, M. Dich-
gans, L. Frölich, H. Hampel, M. Hüll, D. Rujescu, A. M. Goate, J. S. K. Kauwe,
C. Cruchaga, P. Nowotny, J. C. Morris, K. Mayo, K. Sleegers, K. Bettens, S. Engel-
borghs, P. P. De Deyn, C. Van Broeckhoven, G. Livingston, N. J. Bass, H. Gurling,
A. McQuillin, R. Gwilliam, P. Deloukas, A. Al-Chalabi, C. E. Shaw, M. Tsolaki, A. B.
Singleton, R. Guerreiro, T. W. Mühleisen, M. M. Nöthen, S. Moebus, K.-H. Jöckel,
N. Klopp, H.-E. Wichmann, M. M. Carrasquillo, V. S. Pankratz, S. G. Younkin, P. A.
Holmans, M. O’Donovan, M. J. Owen, and J. Williams. 2009. Genome-wide associa-
tion study identifies variants at CLU and PICALM associated with Alzheimer’s disease.
Nature Genetics 41: 1088–1093.

Hayes, M. G., A. Pluzhnikov, K. Miyake, Y. Sun, M. C. Y. Ng, C. A. Roe, J. E.


Below, R. I. Nicolae, A. Konkashbaev, G. I. Bell, N. J. Cox, and C. L. Hanis. 2007.
Identification of type 2 diabetes genes in Mexican Americans through genome-wide
association studies. Diabetes 56: 3033–3044.

Hindorff, L. A., J. MacArthur, A. Wise, H. A. Junkins, P. N. Hall, A. K. Klemm, and T. A. Manolio. 2011. A catalog of published genome-wide association studies. http://www.genome.gov/gwastudies/.
Hunter, D. J., P. Kraft, K. B. Jacobs, D. G. Cox, M. Yeager, S. E. Hankinson, S. Wa-
cholder, Z. Wang, R. Welch, A. Hutchinson, J. Wang, K. Yu, N. Chatterjee, N. Orr,
W. C. Willett, G. A. Colditz, R. G. Ziegler, C. D. Berg, S. S. Buys, C. A. McCarty,
H. S. Feigelson, E. E. Calle, M. J. Thun, R. B. Hayes, M. Tucker, D. S. Gerhard,
J. F. Fraumeni, Jr., R. N. Hoover, G. Thomas, and S. J. Chanock. 2007. A genome-
wide association study identifies alleles in FGFR2 associated with risk of sporadic
postmenopausal breast cancer. Nature Genetics 39: 870–874.
Ziegler, A., I. R. König, and J. R. Thompson. 2008. Biostatistical aspects of genome-
wide association studies. Biometrical Journal 50: 8–28.

About the authors


Daniel E. Cook is a research assistant in the Department of Pediatrics at the University of
Iowa. His research focuses on genetics and bioinformatics approaches.
Kelli R. Ryckman is an Associate Research Scientist in the Department of Pediatrics at the
University of Iowa. Her research focuses on the genetics and metabolomics of maternal and
fetal complications in pregnancy.
Jeffrey C. Murray received his medical degree from Tufts Medical School in Boston in 1978.
He has been conducting research at the University of Iowa since 1984. The Murray laboratory
is focused on identifying genetic and environmental causes of complex diseases, specifically
premature birth and birth defects such as a cleft lip and palate.
The Stata Journal (2013)
13, Number 2, pp. 329–336

Semiparametric fixed-effects estimator


François Libois
University of Namur
Centre for Research in the Economics of Development (CRED)
Namur, Belgium
[email protected]

Vincenzo Verardi
University of Namur
Centre for Research in the Economics of Development (CRED)
Namur, Belgium
and
Université Libre de Bruxelles
European Center for Advanced Research in Economics and Statistics (ECARES)
and Center for Knowledge Economics (CKE)
Brussels, Belgium
[email protected]

Abstract. In this article, we describe the Stata implementation of Baltagi and


Li’s (2002, Annals of Economics and Finance 3: 103–116) series estimator of par-
tially linear panel-data models with fixed effects. After a brief description of the
estimator itself, we describe the new command xtsemipar. We then simulate data
to show that this estimator performs better than a fixed-effects estimator if the
relationship between two variables is unknown or quite complex.
Keywords: st0296, xtsemipar, semiparametric estimations, panel data, fixed effects

1 Introduction
The objective of this article is to present a Stata implementation of Baltagi and Li’s
(2002) series estimation of partially linear panel-data models.
The structure of the article is as follows. Section 2 describes Baltagi and Li’s (2002)
fixed-effects semiparametric regression estimator. Section 3 presents the implemented
Stata command (xtsemipar). Some simple simulations assessing the performance of
the estimator are shown in section 4. Section 5 provides a conclusion.



2 Estimation method
2.1 Baltagi and Li’s (2002) semiparametric fixed-effects regression
estimator
Consider a general panel-data semiparametric model with distributed intercept of the
type

yit = xit θ + f (zit ) + αi + εit , i = 1, . . . , N ; t = 1, . . . , T where T << N (1)

To eliminate the fixed effects αi, a common procedure, inter alia, is to first-difference (1) over time, which leads to

yit − yit−1 = (xit − xit−1 )θ + {f (zit ) − f (zit−1 )} + εit − εit−1 (2)

An evident problem here is to consistently estimate the unknown function of z ≡ G(zit, zit−1) = {f(zit) − f(zit−1)}. What Baltagi and Li (2002) propose is to approximate f(z) by series pk(z) [and therefore approximate G(zit, zit−1) = {f(zit) − f(zit−1)} by pk(zit, zit−1) = {pk(zit) − pk(zit−1)}], where pk(z) are the first k terms of a sequence of functions [p1(z), p2(z), . . .]. They then demonstrate the √N normality of the estimator of the parametric component (that is, θ̂) and the consistency at the standard nonparametric rate of the estimated unknown function [that is, f̂(·)]. Equation (2) therefore boils down to

yit − yit−1 = (xit − xit−1)θ + {pk(zit) − pk(zit−1)}′γ + εit − εit−1     (3)

which can be consistently estimated by using ordinary least squares. Having estimated θ̂ and γ̂, we propose to fit the fixed effects α̂i and go back to (1) to estimate the error component residual

ûit = yit − xit θ̂ − α̂i = f(zit) + εit     (4)

The curve f can be fit by regressing ûit on zit by using some standard nonparametric regression estimator.
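
For instance, with the error-component residual from (4) stored in a variable uhat and the nonparametric covariate in z (names illustrative), one such fit is a kernel-weighted local polynomial regression:

* Sketch: last-stage nonparametric fit of f, given uhat from (4).
lpoly uhat z, degree(4) kernel(epanechnikov) noscatter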
A typical example of a pk series is the spline, a piecewise polynomial whose pieces are defined by a sequence of knots c1 < c2 < · · · < ck, where they join smoothly.
The simplest case is a linear spline. For a spline of degree m, the polynomials and
their first m − 1 derivatives agree at the knots, so m − 1 derivatives are continuous (see
Royston and Sauerbrei [2007] for further details).
A spline of degree m with k knots can be represented as a power series:

S(z) = \sum_{j=0}^{m} \zeta_j z^j + \sum_{j=1}^{k} \lambda_j (z - c_j)_+^m ,
\qquad
(z - c_j)_+^m =
\begin{cases}
(z - c_j)^m & \text{if } z > c_j \\
0 & \text{otherwise}
\end{cases}

The problem here is that successive terms tend to be highly correlated. A probably
better representation of splines is a linear combination of a set of basic splines called
(kth degree) B-splines, which are defined for a set of k + 2 consecutive knots c1 < c2 <
· · · < ck+2 as
B(z; c_1, \ldots, c_{k+2}) = (k+1) \sum_{j=1}^{k+2}
\left\{ \prod_{\substack{1 \le h \le k+2 \\ h \ne j}} (c_h - c_j) \right\}^{-1} (z - c_j)_+^k

B-splines are intrinsically a rescaling of each of the piecewise functions. The tech-
nicalities of this method are beyond the scope of this article, and we refer the reader to
Newson (2000b) for further details.
We implemented this estimator in Stata under the command xtsemipar, which we
describe below.

3 The xtsemipar command


The xtsemipar command fits Baltagi and Li’s double series fixed-effects estimator in the
case of a single variable entering the model nonparametrically. Running the xtsemipar
command requires the prior installation of the bspline package developed by Newson
(2000a).
The general syntax for the command is
        
xtsemipar varlist [if] [in] [weight], nonpar(varname) [generate([string1] string2) degree(#) knots1(numlist) nograph spline knots2(numlist) bwidth(#) robust cluster(varname) ci level(#)]

The first option, nonpar(), is required. It declares that the variable enters the model
nonparametrically. None of the remaining options are compulsory. The user has the
opportunity to recover the error component residual—the left-hand side of (4)—whose
name can be chosen by specifying string2. This error component can then be used to
draw any kind of nonparametric regression. Because the error component has already
been partialled out from fixed effects and from the parametrically dependent variables,
this amounts to estimating the net nonparametric relation between the dependent and
the variable that enters the model nonparametrically. By default, xtsemipar reports
one estimation of this net relationship. string1 makes it possible to reproduce the values
of the fitted dependent variable. Note that the plot of residuals is recentered around its
mean. The remaining part of this section describes options that affect this fit.
A key option in the quality of the fit is degree(). It determines the power of
the B-splines that are used to consistently estimate the function resulting from the first
difference of the f (zit ) and f (zit−1 ) functions. The default is degree(4). If the nograph
option is not specified—that is, the user wants the graph of the nonparametric fit of the
variable in nonpar() to appear—degree() will also determine the degree of the local

weighted polynomial fit used in the Epanechnikov kernel performed at the last stage
fit. If spline is specified, this last nonparametric estimation will also be estimated by
the B-spline method, and degree() is then the power of these splines. knots1() and
knots2() are both rarely used. They define a list of knots where the different pieces
of the splines agree. If left unspecified, the number and location of the knots will be
chosen optimally, which is the most common practice. knots1() refers to the B-spline
estimation in (3). knots2() can only be used if the spline option is specified and refers
to the last stage fit. More details about B-spline can be found in Newson (2000b). The
bwidth() option can only be used if spline is not specified. It gives the half-width
of the smoothing window in the Epanechnikov kernel estimation. If left unspecified,
a rule-of-thumb bandwidth estimator is calculated and used (see [R] lpoly for more
details).
The remaining options refer to the inference. The robust and cluster() options
correct the inference, respectively, for heteroskedasticity and for clustering of error
terms. In the graph, confidence intervals can be displayed by a shaded area around
the curve of fitted values by specifying the option ci. Confidence intervals are set to
95% by default; however, it is possible to modify them by setting a different confidence
level through the level() option. This affects the confidence intervals both in the
nonparametric and in the parametric part of estimations.

4 Simulation
In this section, we show, by using some simple simulations, how xtsemipar behaves
in finite samples. At the end of the section, we illustrate how this command can be
extended to tackle some endogeneity problems.
In brief, the simulation setup is a standard fixed-effects panel of 200 individuals
over five time periods (1,000 observations). For the design space, four variables, x1 ,
x2 , x3 , and d, are generated from a normal distribution with mean μ = (0, 0, 0, 0) and
variance–covariance matrix
            x1     x2     x3     d
      x1    1
      x2    0.2    1
      x3    0.8    0.4    1
      d     0      0.3    0.6    1

Variable d is categorized in such a way that five individuals are identified by each
category of d. In practice, we generate these variables in a two-step procedure where
the x’s have two components. The first one is fixed for each individual and is correlated
with d. The second one is a random realization for each time period.

Five hundred replications are carried out, and for each replication, an error term
e is drawn from an N (0, 1). The dependent variable y is generated according to the
data-generating process (DGP): y = x1 + x2 − (x3 + 2 × x3^2 − 0.25 × x3^3) + d + e. As
is obvious from this estimation setting, multivariate regressions with individual fixed
effects should be used if we want to consistently estimate the parameters. So we regress
y on the x’s by using three regression models:

1. xtsemipar, considering that x1 and x2 enter the model linearly and x3 enters
nonparametrically.
2. xtreg, considering that x1 , x2 , and x3 enter the model linearly.

3. xtreg, considering that x1 and x2 enter the model linearly, whereas x3 enters the
model parametrically with the correct polynomial form (x3^2 and x3^3).
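
In Stata terms, the three fits correspond to something like the following (a sketch; the panel identifiers id and time are assumed):

xtset id time
xtsemipar y x1 x2, nonpar(x3)            // model 1: x3 enters nonparametrically
xtreg y x1 x2 x3, fe                     // model 2: x3 enters linearly
generate double x3sq  = x3^2
generate double x3cub = x3^3
xtreg y x1 x2 x3 x3sq x3cub, fe          // model 3: correct polynomial form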

Table 1 reports the bias and mean squared error (MSE) of coefficients associated
with x1 and x2 for the three regression models. What we find is that Baltagi and
Li’s (2002) estimator performs much better than the usual fixed-effects estimator with
linear control for x3 , in terms of both bias and efficiency. As expected, the most effi-
cient and unbiased estimator remains the fixed-effects estimator with the appropriate
polynomial specification. However, this specification is generally unknown. Figure 1
displays the average nonparametric fit of x3 (plain line) obtained in the simulation with
the corresponding 95% band. The true DGP is represented by the dotted line.

Table 1. Comparison between xtsemipar and xtreg

                                                            Bias x1    Bias x2     MSE x1     MSE x2
xtsemipar with nonparametric control for x3                 −0.0006    −0.0007    0.00536    0.00399
xtreg with linear control for x3                            −0.2641    0.03752    0.07383    0.00462
xtreg with 2nd- and 3rd-order polynomial control for x3     −0.0023    −0.0009    0.00410    0.00321

[Figure 1 here: nonparametric prediction of x3 based on splines; f(x3) (y axis) against x3 (x axis), showing the DGP, the average fit, and the 95% confidence interval; see the caption below.]

Figure 1. Average semiparametric fit of x3

If we want efficient and consistent estimates of parameters, estimations relying on


the correct parametric specification are always better. Nevertheless, this correct form
has to be known. It could be argued that a sufficiently flexible polynomial fit would
be preferable to a semiparametric model. However, this is not the case. Indeed, let us
consider the same simulation setting described above, but with the dependent variable y
created according to the new DGP y = x1 +x2 +3 sin(2.5x3 )+d+e. Figure 2 reports the
average nonparametric fit of x3 in a black solid line, with a 95% confidence band around
it. The dotted gray line represents the true DGP, which is quite close to the average
fit estimated by xtsemipar using a fourth-order kernel regression with a bandwidth
set to 0.2. The dashed gray line is the average fourth-order polynomial fixed-effects
parametric fit. As is clear from this figure, xtsemipar provides a much better fit for
this quite complex DGP. xtsemipar can also help identify the relevant parametric form
and help applied researchers avoid some trial and error.

[Figure 2 here: nonparametric prediction of x3; f(x3) (y axis) against x3 (x axis), showing the DGP, the average fourth-order local-polynomial fit with its 95% confidence interval, and the average fourth-order Taylor-expansion (polynomial) fit; see the caption below.]
Figure 2. Average semiparametric fit of x3

In much of the empirical research in applied economics, measurement errors, omit-


ted variable bias, and simultaneity are common issues that can be solved through
instrumental-variables estimation. Baltagi and Li (2002) extend their results to ad-
dress these kinds of problems and establish the asymptotic properties for a partially
linear panel-data model with fixed effects and possible endogeneity of the regressors. In
practice, our estimator can be used within a two-step procedure to obtain consistent
estimates of the βs. In the first stage, the right-hand side endogenous variable has to
be regressed (and fit) by using (at least) one valid instrument. At this stage, the non-
parametric variable linearly enters into the estimation procedure. In the second stage,
the semiparametric fixed-effects panel-data model can be used to estimate the relation
between the dependent variable and the set of regressors. The nonparametric variable
now enters the model nonparametrically, exactly as explained before. If the instrument
is valid, this procedure leads to consistent estimations.
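
A sketch of this two-step procedure (variable names illustrative: x2 is the endogenous regressor, w an excluded instrument, z the variable entering nonparametrically; second-stage standard errors would need to account for the generated regressor):

* First stage: fixed-effects regression of the endogenous regressor on the
* instrument, the exogenous covariates, and z entering linearly.
xtreg x2 w x1 z, fe
predict double x2hat, xb        // fitted values (fixed effects could be added with xbu)
* Second stage: semiparametric fixed-effects model with the fitted values.
xtsemipar y x1 x2hat, nonpar(z)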
Another problem can arise if the nonparametric variable is subject to endogeneity
problems. In this case, we suggest, as the first step of the estimation procedure, using a
control functional approach as explained by Ahamada and Flachaire (2008). However,
we believe that the technicalities associated with this method go well beyond the scope
of this article.

5 Conclusion
In econometrics, semiparametric regression estimators are becoming standard tools for
applied researchers. In this article, we presented Baltagi and Li’s (2002) series semi-
parametric fixed-effects regression estimator. We then introduced the Stata program
we created to put it into practice. Some simple simulations to illustrate the usefulness
and the performance of the procedure were also shown.

6 Acknowledgments
We would like to thank Rodolphe Desbordes, Patrick Foissac, our colleagues at CRED
and ECARES, and especially Wouter Gelade and Peter-Louis Heudtlass, who helped im-
prove the quality of the article. The usual disclaimer applies. François Libois wishes to
thank the ERC grant SSD 230290 for financial support. Vincenzo Verardi is an associate
researcher at the FNRS and gratefully acknowledges their financial support.

7 References
Ahamada, I., and E. Flachaire. 2008. Econométrie Non Paramétrique. Paris:
Économica.

Baltagi, B. H., and D. Li. 2002. Series estimation of partially linear panel data models
with fixed effects. Annals of Economics and Finance 3: 103–116.

Newson, R. 2000a. bspline: Stata modules to compute B-splines parameterized by their values at reference points. Statistical Software Components S411701, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s411701.html.

———. 2000b. sg151: B-splines and splines parameterized by their values at reference
points on the x-axis. Stata Technical Bulletin 57: 20–27. Reprinted in Stata Technical
Bulletin Reprints, vol. 10, pp. 221–230. College Station, TX: Stata Press.

Royston, P., and W. Sauerbrei. 2007. Multivariable modeling with cubic regression
splines: A principled approach. Stata Journal 7: 45–70.

About the authors


François Libois is a researcher and teaching assistant in economics at the University of Namur in
the Centre for Research in the Economics of Development (CRED). His main research interests
are new institutional economics with a special focus on development and environmental issues.
Vincenzo Verardi is a research fellow of the Belgian National Science Foundation (FNRS). He is
a professor at the University of Namur and at the Université Libre de Bruxelles. His research
interests include applied econometrics and development economics.
The Stata Journal (2013)
13, Number 2, pp. 337–343

Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests
Tammy Harris
Institute for Families in Society
Department of Epidemiology & Biostatistics
University of South Carolina
Columbia, SC
[email protected]

James W. Hardin
Institute for Families in Society
Department of Epidemiology & Biostatistics
University of South Carolina
Columbia, SC
[email protected]

Abstract. We present new Stata commands for carrying out exact Wilcoxon
one-sample and two-sample comparisons of the median. Nonparametric tests are
often used in clinical trials, in which it is not uncommon to have small samples.
In such situations, researchers are accustomed to making inferences by using exact
statistics. The ranksum and signrank commands in Stata provide only asymptotic
results, which assume normality. Because large-sample results are unacceptable
in many clinical trials studies, these researchers must use other software packages.
To address this, we have developed new commands for Stata that provide exact
statistics in small samples. Additionally, when samples are large, we provide results
based on the Student’s t distribution that outperform those based on the normal
distribution.
Keywords: st0297, ranksumex, signrankex, exact distributions, nonparametric
tests, median, Wilcoxon matched-pairs signed-rank test, Wilcoxon ranksum test

1 Introduction
Many statistical analysis methods are derived after making an assumption about the
underlying distribution of the data (for example, normality). However, one may also
draw statistical inferences by using nonparametric methods, which make no assumptions
about the underlying population or distribution. For the nonpara-
metric equivalents to the parametric one-sample and two-sample t tests, the Wilcoxon
signed-rank test (one sample) is used to test the hypothesis that the median differ-
ence between the absolute values of positive and negative paired differences is 0. The
Wilcoxon Mann–Whitney ranksum test is used to test the hypothesis of a zero-median
difference between two independently sampled populations.



We present Stata commands to evaluate both of these nonparametric statistical


tests. This article is organized as follows. In section 2, we review the test statistics.
In section 3, Stata syntax is presented for the new commands, followed by examples in
section 4. A final summary is presented in section 5.

2 Nonparametric Wilcoxon tests


2.1 Wilcoxon signed-rank test
Let Xi and Yi be continuous paired random variables from data consisting of n obser-
vations, where observations are denoted as X = (X1 , . . . , Xn )T and Y = (Y1 , . . . , Yn )T .
For these paired bivariate data, (x1 , y1 ), . . . , (xn , yn ), the differences are calculated as
Di = Yi − Xi . We omit consideration of the subset of observations for which the abso-
lute difference is 0. From this one sample of nr ≤ n nonzero differences, ranks (ri ) are
applied to the absolute differences |Di |, where rank 1 is the smallest absolute difference
and rank nr is the largest absolute difference. Before assigning ranks, we omit absolute
differences of 0, Di = 0.
We then test the hypothesis that Xi and Yi are distributed interchangeably by using
the signed-rank test statistic,

$$S = \sum_{i=1}^{n_r} r_i\, I(D_i > 0) - \frac{n_r(n_r + 1)}{4}$$

where I(Di > 0) is an indicator function that the ith difference is positive. Ranks of
tied absolute differences are averaged for the relevant set of observations. The variance
of S is given by

$$V = \frac{1}{24}\, n_r(n_r + 1)(2n_r + 1) - \frac{1}{48}\sum_{j=1}^{m} t_j(t_j + 1)(t_j - 1)$$

where tj is the number of values tied in absolute value for the jth rank (Lehmann 1975)
out of the m unique assigned ranks; m = nr and tj = 1 ∀j if there are no ties. The
significance of S is then computed one of two ways, contingent on sample size (nr ). If
nr > 25, the significance of S can be based on the normal approximation (as is done in
Stata’s signrank command) or on Student’s t distribution,
$$S\,\sqrt{\frac{n_r - 1}{n_r V - S^2}}$$
with nr − 1 degrees of freedom (Iman 1974). When nr ≤ 25, the significance of S is
computed from the exact distribution.
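To make the idea of the exact distribution concrete, the following Mata fragment enumerates the null distribution of the positive-rank sum by repeated convolution and reproduces the two-sided p-value of the first example in section 4. It is only a brute-force sketch for untied integer ranks; it is not the characteristic-function algorithm that signrankex implements, and the function name signrank_pmf() is ours.

mata:
// Brute-force sketch (not the algorithm used by signrankex): exact null pmf of
// the positive-rank sum for untied integer ranks 1, ..., nr, built by convolution.
real rowvector signrank_pmf(real scalar nr)
{
    real rowvector f
    real scalar    k

    f = 1                                     // P(sum = 0) before any rank is added
    for (k = 1; k <= nr; k++) {
        // under H0, rank k enters the sum with probability 1/2
        f = ((f, J(1, k, 0)) + (J(1, k, 0), f)) / 2
    }
    return(f)                                 // f[w + 1] = P(sum = w)
}

f = signrank_pmf(8)                           // nr = 8, as in the example of section 4
p = sum(f[37..cols(f)])                       // P(positive-rank sum >= 36)
printf("two-sided p-value = %6.4f\n", 2*p)    // 0.0078, matching signrankex
end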
An algorithm for calculation of associated probabilities is the network algorithm of
Mehta and Patel (1986). Many new improvements and modifications of that algorithm
have been implemented in various applications to compute the exact p-value. Some in-
clude polynomial time algorithms for permutation distributions (Pagano and Tritchler
1983), Mann–Whitney-shifted fast Fourier transform (FFT) (Nagarajan and Keich 2009),
and decreased computation time for the network algorithm described in Requena and
Martín Ciudad (2006). Comprehensive summaries for exact inference methods are pub-
lished in Agresti (1992) and Waller, Turnbull, and Hardin (1995).

2.2 Wilcoxon Mann–Whitney ranksum test


Let X be a binary variable (group 1 and group 2) and Y be a continuous random
variable from data consisting of n observations where Y = (Y1 , . . . , Yn )T . Ranks are
assigned to the data, 1 to n, smallest to largest, where tied ranks are given the average
of the ranks. If n > 25, the (asymptotically normal) test statistic Z is given by
$$Z = \frac{R_1 - n_1(n + 1)/2}{\sqrt{n_1 n_2 V_R / n}}$$
where R1 is the sum of the ranks from group 1, n1 is the sample size of group 1, n2 is
the sample size of group 2, and VR is the variance of the ranks. In Stata, group 1 is
lesser in numeric value than group 2. However, if n ≤ 25, the normal approximation
is not appropriate. In this situation, we calculate the exact test by using the approach
outlined in the following section.
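For reference, the permutation formulation underlying the exact calculation (our summary; it is consistent with the worked combinatorial example in section 4) treats every choice of $n_1$ of the $n$ ranks for group 1 as equally likely, so that

$$P(R_1 \le r) = \frac{\#\{\text{choices of } n_1 \text{ ranks whose sum is} \le r\}}{\binom{n}{n_1}}$$

with the upper tail handled analogously.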

2.3 An exact method based on the characteristic function


Pagano and Tritchler (1983) present the basic methodology for computing distribution
functions through Fourier analysis of the characteristic function. Superficially, this
approach appears as complicated as complete enumeration of the distributions of the
Wilcoxon test statistics, but the characteristic-function approach, evaluated via the
FFT, is much faster to compute.
Basically, if X is a discrete random variable with a distribution function given by
P (X = j) = pj for j = 0, . . . , U , then the complex-valued characteristic function is
given by
$$\phi(\theta) = \sum_{j=0}^{U} p_j \exp(ij\theta)$$

where $i = \sqrt{-1}$ and $\theta \in [0, 2\pi)$. Because X is defined on a finite integer lattice, the
basic theorem in Fourier series is used to obtain the probabilities pj . For any integer
Q > U and j = 0, . . . , U ,
$$p_j = \frac{1}{Q}\sum_{k=0}^{Q-1} \phi\left(\frac{2\pi k}{Q}\right)\exp\left(-\frac{2\pi i j k}{Q}\right) \qquad (1)$$

Thus knowing the characteristic function at Q equidispersed points on the interval


[0, 2π) is equivalent to knowing it everywhere. Furthermore, the probabilities of the
distribution are easily obtained from the characteristic function. We emphasize that
the imaginary part of (1) is 0.
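As a toy check of (1) (our own illustration; it uses direct summation rather than the fft() call the commands rely on), the following Mata fragment builds φ at Q points from a small arbitrary pmf and then inverts it, recovering the probabilities with a numerically zero imaginary part.

mata:
// Toy check of (1): p holds an arbitrary pmf on {0, ..., U}; any integer Q > U works.
U = 3
p = (0.1, 0.2, 0.3, 0.4)
Q = 8

phi = J(1, Q, C(0))                           // characteristic function at 2*pi*k/Q
for (k = 0; k < Q; k++) {
    for (j = 0; j <= U; j++) {
        phi[k + 1] = phi[k + 1] + p[j + 1] * exp(C(0, 2 * pi() * j * k / Q))
    }
}

pback = J(1, U + 1, C(0))                     // invert with (1)
for (j = 0; j <= U; j++) {
    for (k = 0; k < Q; k++) {
        pback[j + 1] = pback[j + 1] + phi[k + 1] * exp(C(0, -2 * pi() * j * k / Q)) / Q
    }
}

Re(pback)                                     // recovers 0.1, 0.2, 0.3, 0.4
max(abs(Im(pback)))                           // 0 up to rounding error
end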

To allow tied ranks in the commands, we multiply all ranks by L to ensure that the
ranks and sums of ranks will be integers. This can be accomplished for our two statistics
by setting L = 2. The ranges of the values of the two statistics are easily calculated
so that we may choose Q ≥ U . Defining U as the largest possible value of our statistic
(formed from the largest possible ranks), we can choose log2 Q = ceiling{log2 (U )}. We
choose Q to be a power of 2 because of the requirements of the FFT algorithm in Stata
(Fourier analysis is carried out by using the Mata fft command).
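For instance (an illustrative number, not taken from the examples below), if the largest attainable value of the scaled statistic were $U = 300$, then $\log_2 Q = \lceil \log_2(300) \rceil = 9$, so the characteristic function would be evaluated at $Q = 2^9 = 512$ points.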
Using rk to denote the rank of the kth observation, the characteristic function for
the one-sample statistic S1 is given by
$$\phi_1(-2\pi i j/Q) = \prod_{k=1}^{N}\left\{\exp(-2\pi i j/Q)\,\cos(-2\pi j L r_k/Q)\right\}$$

while the characteristic function for S2 is calculated by using the difference equation

φ2 (j, k) = exp(−2πijLrk /Q)φ2 (j − 1, k − 1) + φ2 (j, k − 1)

3 Stata syntax
Software accompanying this article includes the command files as well as supporting
files for dialogs and help. Equivalent to the signrank command, the basic syntax for
the new Wilcoxon signed-rank test command is
   
signrankex varname = exp [if] [in]

Equivalent to the ranksum command, the basic syntax for the new Wilcoxon Mann–
Whitney ranksum test command is
     
ranksumex varname [if] [in], by(groupvar) [porder]

4 Example
In this section, we present real-world examples with the new nonparametric Wilcoxon
test commands. In clinical trials, talinolol is used as a β blocker; its disposition is
controlled by P-glycoprotein, which protects against xenobiotic compounds. Eight healthy
men between the ages of 22 and 26 were evaluated based on the serum-concentration
time profiles of the two talinolol enantiomers, S(–) talinolol and R(+) talinolol, which
differ in their kinetic profiles. The trial examined single intravenous (iv) and repeated
oral talinolol profiles before and after rifampicin comedication. Area under the serum
concentration time curves (AUC) was collected for each subject (see Zschiesche et al.
[2002]). We compare AUC values of S(–) iv talinolol before and after comedication of
rifampicin by using the Wilcoxon signed-rank test. The results are given below, where
S is the Wilcoxon signed-rank test statistic.

. use signrank, clear


. signrankex iv_s_before = iv_s_after
Wilcoxon signed-rank test
sign obs sum ranks expected

positive 8 36 18
negative 0 0 18
zero 0

all 8 36 36
Ho: iv_s_before = iv_s_after
S = 18.000
Prob >= |S| = 0.0078

The results show a statistically significant difference (p-value = 0.0078) between iv S(–)
talinolol AUC before and after comedication of rifampicin, with greater S(–) talinolol
AUC values before rifampicin administration than after.
For the Wilcoxon Mann–Whitney ranksum test example, we will use performance
data (table 1) collected on rats’ rotarod endurance (in seconds) from two treatment
groups. The rats were randomly selected to be in the control group (received saline
solvent) or the treatment group (received centrally acting muscle relaxant) (Bergmann,
Ludbrook, and Spooren 2000).

Table 1. Rotarod endurance

        Treatment group                     Control group
 Endurance time (sec)   Rank       Endurance time (sec)   Rank
          22              2                 300             15
         300             15                 300             15
          75              3                 300             15
         271              5                 300             15
         300             15                 300             15
          18              1                 300             15
         300             15                 300             15
         300             15                 300             15
         163              4                 300             15
         300             15                 300             15
         300             15                 300             15
         300             15                 300             15

The results are given below.

. use ranksum, clear


. ranksumex edrce, by(trt)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test
trt obs rank sum expected

0 12 180 150
1 12 120 150

combined 24 300 300


Exact statistics
Ho: edrce(trt==0) = edrce(trt==1)
Prob <= 120 = 0.0186
Prob >= 180 = 0.0186
Two-sided p-value = 0.0373

The two-sided exact p-value of 0.0373 indicates a statistically significant difference in
average rotarod endurance between the groups of rats. We can also illustrate how to
calculate this exact p-value manually by using the rat rotarod endurance data (table 1).
In Conover (1999), the Wilcoxon Mann–Whitney ranksum test exact p-value is illus-
trated in terms of combinations (arrangements) of ranks. In this example, the number
of arrangements of 12 of the ranks in the table having a sum less than or equal to 120 is
the number of arrangements of choosing all 5 of the ranks less than 15 and 7 of the 19
tied ranks of 15; this is given by $\binom{5}{5}\binom{19}{7}$. The total number of ways to choose 12 of 24
ranks is given by $\binom{24}{12}$. Thus the p-value is

$$\text{p-value} = \frac{\binom{5}{5}\binom{19}{7}}{\binom{24}{12}} = \frac{50{,}388}{2{,}704{,}156} = 0.0186$$

where each of the new commands returns the p-value as well as the numerator and
denominator of the exact fraction (see the return values in the previous example).
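As a quick check (not part of the original example), the same fraction can be evaluated directly with Stata's comb() function:

. display %6.4f comb(5,5)*comb(19,7)/comb(24,12)
0.0186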

5 Summary
In this article, we introduced two supporting Stata commands for the exact nonparamet-
ric Wilcoxon signed-rank test and the Wilcoxon Mann–Whitney ranksum test. These
one-sample and two-sample test statistics can be used to assess the difference in location
(median difference) for small samples (exact distribution) and larger samples (Student’s
t distribution).

6 References
Agresti, A. 1992. A survey of exact inference for contingency tables. Statistical Science
7: 131–153.

Bergmann, R., J. Ludbrook, and W. P. J. M. Spooren. 2000. Different outcomes of the


Wilcoxon–Mann–Whitney test from different statistics packages. American Statisti-
cian 54: 72–77.

Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley.

Iman, R. L. 1974. Use of a t-statistic as an approximation to the exact distribution of


the Wilcoxon signed ranks test statistic. Communications in Statistics 3: 795–806.

Lehmann, E. L. 1975. Nonparametrics: Statistical Methods Based on Ranks. Upper


Saddle River, NJ: Springer.
Mehta, C. R., and N. R. Patel. 1986. Algorithm 643: FEXACT: A FORTRAN subroutine
for Fisher’s exact test on unordered r × c contingency tables. ACM Transactions on
Mathematical Software 12: 154–161.

Nagarajan, N., and U. Keich. 2009. Reliability and efficiency of algorithms for computing
the significance of the Mann–Whitney test. Computational Statistics 24: 605–622.

Pagano, M., and D. Tritchler. 1983. On obtaining permutation distributions in polyno-


mial time. Journal of the American Statistical Association 78: 435–440.

Requena, F., and N. Martín Ciudad. 2006. A major improvement to the network al-
gorithm for Fisher’s exact test in 2 × c contingency tables. Computational Statistics
and Data Analysis 51: 490–498.

Waller, L. A., B. W. Turnbull, and J. M. Hardin. 1995. Obtaining distribution func-


tions by numerical inversion of characteristic functions with applications. American
Statistician 49: 346–350.

Zschiesche, M., G. L. Lemma, K.-J. Klebingat, G. Franke, B. Terhaag, A. Hoffmann,


T. Gramatté, H. K. Kroemer, and W. Siegmund. 2002. Stereoselective disposition of
talinolol in man. Journal of Pharmaceutical Sciences 91: 303–311.

About the authors


Tammy Harris is a PhD candidate in the Department of Epidemiology and Biostatistics and an
affiliated researcher in the Institute for Families in Society at the University of South Carolina
in Columbia, SC.
James W. Hardin is an associate professor in the Department of Epidemiology and Biostatistics
and an affiliated faculty member in the Institute for Families in Society at the University of
South Carolina in Columbia, SC.
The Stata Journal (2013)
13, Number 2, pp. 344–355

Extending the flexible parametric survival model for competing risks
Sally R. Hinchliffe
Department of Health Sciences
University of Leicester
Leicester, UK
[email protected]

Paul C. Lambert
Department of Health Sciences
University of Leicester
Leicester, UK
[email protected]

Abstract. Competing risks are present when the patients within a dataset could
experience one or more of several exclusive events and the occurrence of any one of
these could impede the event of interest. One of the measures of interest for analy-
ses of this type is the cumulative incidence function. stpm2cif is a postestimation
command used to generate predictions of the cumulative incidence function after
fitting a flexible parametric survival model using stpm2. There is also the option
to generate confidence intervals, cause-specific hazards, and two other measures
that will be discussed in further detail. The new command is illustrated through
a simple example.
Keywords: st0298, stpm2cif, survival analysis, competing risks, cumulative inci-
dence, cause-specific hazard

1 Introduction
In survival analysis, if interest lies in the true probability of death from a particular
cause, then it is important to appropriately account for competing risks. Competing
risks occur when patients are at risk of more than one mutually exclusive event, such
as death from different causes (Putter, Fiocco, and Geskus 2007). The occurrence of
a competing event may prevent the event of interest from ever occurring. It therefore
seems logical to conduct an analysis that considers these competing risks. The two
main measures of interest for analyses of this type are the cause-specific hazard and
the cumulative incidence function. The cause-specific hazard is the instantaneous risk
of dying from a specific cause given that the patient is still alive at a particular time.
The cumulative incidence function is the proportion of patients who have experienced a
particular event at a certain time in the follow-up period. Several methods are already
available to estimate this; however, it is not always clear which approach should be
used.
In this article, we explain how to fit flexible parametric models using the stpm2 com-
mand by estimating the cause-specific hazard for each cause of interest in a competing-
risks situation. The stpm2cif command is a postestimation command used to estimate
the cumulative incidence function for up to 10 competing causes along with confidence
intervals, cause-specific hazards, and two other useful measures.



2 Methods
If a patient is at risk from K different causes, the cause-specific hazard, hk (t), is the
risk of failure at time t given that no failure from cause k or any of the K − 1 other
causes has occurred. In a proportional hazards model, hk (t) is
$$h_k(t \mid \mathbf{Z}) = h_{k,0}(t)\exp\left(\boldsymbol\beta_k^{T}\mathbf{Z}\right) \qquad (1)$$

where hk,0 (t) is the baseline cause-specific hazard for cause k, and βk is the vector of
parameters for covariates Z. The cumulative incidence function, Ck (t), can be derived
from the cause-specific hazards through the equation

$$C_k(t) = \int_0^t h_k(u \mid \mathbf{Z}) \prod_{k=1}^{K} S_k(u)\, du \qquad (2)$$

where $\prod_{k=1}^{K} S_k(u) = \exp\left\{-\sum_{k=1}^{K}\int_0^{u} h_k(v)\,dv\right\}$ is the overall survival function (Prentice et al.
1978).
Several programs are currently available in Stata that can compute the cumula-
tive incidence function. The command stcompet calculates the function by using the
Kaplan–Meier estimator of the overall survival function (Coviello and Boggess 2004).
It therefore does not allow for the incorporation of covariate effects. A follow-on to
stcompet is stcompadj, which fits the cumulative incidence function based on the Cox
model or the flexible parametric regression model (Coviello 2009). However, it only
allows one competing event, and because the regression models are built into the com-
mand internally, it does not allow users to specify their own options with stcox or
stpm2. Finally, Fine and Gray’s (1999) proportional subhazards model can be fit using
stcrreg.
The flexible parametric model was first proposed by Royston and Parmar in 2002.
The approach uses restricted cubic spline functions to model the baseline log cumulative
hazard. It has the advantage over other well-known models such as the Cox model be-
cause it produces smooth predictions and can be extended to incorporate complex time-
dependent effects, again through the use of restricted cubic splines. The Stata implemen-
tation of the model using stpm2 is described in detail elsewhere (Royston and Parmar
2002; Lambert and Royston 2009). Both the cause-specific hazard (1) and the overall
survival function can be obtained from the flexible parametric model to give the inte-
grand in (2). This can be done by fitting separate models for each of the k causes, but
this will not allow for shared parameters. It is possible to fit one model for all k causes
simultaneously by stacking the data so that each individual patient has k rows of data,
one for each of the k causes. Table 1 illustrates how the data should look once they
have been stacked (in the table, CVD stands for cardiovascular disease). Each patient
can fail from one of three causes. Patient 1 is at risk from all three causes for 10 years
but does not experience any of them and so is censored. Patient 2 is at risk from all
three causes for eight years but then experiences a cardiovascular event. By expanding
the dataset, one can allow for covariate effects to be shared across the causes, although
it is possible to include covariates that vary for each cause.

Table 1. Expanding the dataset

ID Age Time Cause Status


1 50 10 Cancer 0
1 50 10 CVD 0
1 50 10 Other 0
2 70 8 Cancer 0
2 70 8 CVD 1
2 70 8 Other 0

3 Syntax
 
stpm2cif newvarlist, cause1(varname # [varname # ...]) cause2(varname # [varname # ...])
    [cause3(varname # [varname # ...]) ... cause10(varname # [varname # ...])
    obs(#) ci mint(#) maxt(#) timename(newvar) hazard contmort conthaz]

The names specified in newvarlist coincide with the order of the causes inputted in the
options.

3.1 Options
   
cause1(varname # [varname # ...]) . . . cause10(varname # [varname # ...])
request that the covariates specified by the listed varname be set to # when pre-
dicting the cumulative incidence functions for each cause. cause1() and cause2()
are required.
obs(#) specifies the number of observations (of time) to predict. The default is
obs(1000). Observations are evenly spread between the minimum and maximum
values of follow-up time.
ci calculates a 95% confidence interval for the cumulative incidence function and stores
the confidence limits in CIF newvar lci and CIF newvar uci.
mint(#) specifies the minimum value of follow-up time. The default is set as the
minimum event time from stset.
maxt(#) specifies the maximum value of follow-up time. The default is set as the
maximum event time from stset.

timename(newvar) specifies the time variable generated during predictions for the cu-
mulative incidence function. The default is timename( newt). This is the variable
for time that needs to be used when plotting curves for the cumulative incidence
function and the cause-specific hazard function.
hazard predicts the cause-specific hazard function for each cause.
contmort predicts the relative contribution to total mortality.
conthaz predicts the relative contribution to hazard.

4 Example
Data were used on 506 patients with prostate cancer who were randomly allocated
to treatment with diethylstilbestrol. The data have been used previously to illustrate
the command stcompet (Coviello and Boggess 2004). Patients are classified as alive
or having died from one of three causes: cancer (the event of interest), cardiovascular
disease (CVD), or other causes. To use stpm2cif, the user must first expand the dataset:

. use prostatecancer
. expand 3
(1012 observations created)
. by id, sort: generate cause= _n
. generate cancer = cause==1
. generate cvd = cause==2
. generate other = cause==3
. generate treatcancer = treatment*cancer
. generate treatcvd = treatment*cvd
. generate treatother = treatment*other
. generate event = (cause==status)

The data have been expanded so that each patient has three rows of data, one for
each cause as shown in table 1. Indicator variables have been created for each of the
three competing causes, and interactions between treatment and each cause have also
been generated. The indicator variable event defines whether a patient has died
and the cause of death. We now need to stset the data and run stpm2.

. stset time, failure(event)


failure event: event != 0 & event < .
obs. time interval: (0, time]
exit on or before: failure

1518 total obs.


0 exclusions

1518
obs. remaining, representing
356
failures in single record/single failure data
54898.8
total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 76
. stpm2 cancer cvd other treatcancer treatcvd treatother, scale(hazard)
> rcsbaseoff dftvc(3) nocons tvc(cancer cvd other) eform nolog
Log likelihood = -1150.4866 Number of obs = 1518

exp(b) Std. Err. z P>|z| [95% Conf. Interval]

xb
cancer .2363179 .0275697 -12.37 0.000 .188015 .2970303
cvd .1801868 .0238668 -12.94 0.000 .1389876 .2335983
other .1008464 .018132 -12.76 0.000 .070895 .1434515
treatcancer .6722196 .1096964 -2.43 0.015 .4882109 .9255819
treatcvd 1.188189 .2013301 1.02 0.309 .8524237 1.65621
treatother .6345498 .1672676 -1.73 0.084 .3785199 1.063758
_rcs_cancer1 3.501847 .44435 9.88 0.000 2.730788 4.49062
_rcs_cancer2 .8842712 .0742915 -1.46 0.143 .7500191 1.042554
_rcs_cancer3 1.046436 .0371625 1.28 0.201 .9760756 1.121868
_rcs_cvd1 2.841936 .2619063 11.33 0.000 2.372299 3.404545
_rcs_cvd2 .8772848 .0498866 -2.30 0.021 .7847607 .9807176
_rcs_cvd3 1.008804 .0352009 0.25 0.802 .9421175 1.08021
_rcs_other1 2.751505 .3563037 7.82 0.000 2.134738 3.546467
_rcs_other2 .7962094 .0558593 -3.25 0.001 .6939208 .913576
_rcs_other3 .9614597 .0512891 -0.74 0.461 .8660117 1.067428

By including the three cause indicators (cancer, cvd, and other) as both main
effects and time-dependent effects (using the tvc() option), we have fit a stratified
model with three separate baselines, one for each cause. For this reason, we have used
the rcsbaseoff option together with the nocons option, which excludes the baseline
hazard from the model. The interactions between treatment and the three causes have
also been included in the model. This estimates a different treatment effect for each of
the three causes. The hazard ratios (95% confidence intervals) for the treatment effect
are 0.67 [0.49, 0.93], 1.19 [0.85, 1.66], and 0.63 [0.38, 1.06] for cancer, CVD, and other
causes, respectively.
Now that we have run stpm2, we can run the new postestimation command stpm2cif
to obtain the cumulative incidence function for each cause. Because we have two groups
of patients, treated and untreated, we must run the command twice. This will give
separate cumulative incidence functions for the treated and the untreated groups and
for each of the three causes.

. stpm2cif cancer0 cvd0 other0, cause1(cancer 1) cause2(cvd 1) cause3(other 1)


> ci hazard contmort conthaz maxt(60)
. stpm2cif cancer1 cvd1 other1, cause1(cancer 1 treatcancer 1)
> cause2(cvd 1 treatcvd 1) cause3(other 1 treatother 1)
> ci hazard contmort conthaz maxt(60)

The cause1() to cause3() options give the linear predictor for each of the three
causes for which we want a prediction. The commands have generated six new vari-
ables containing the cumulative incidence functions. The untreated group members are
denoted with a 0 at the end of the variable name, and the treated group members are
denoted with a 1. These labels come from the input into newvarlist in the above com-
mand line. The six cumulative incidence functions are therefore labeled CIF cancer0,
CIF cvd0, CIF other0, CIF cancer1, CIF cvd1, and CIF other1. Each of these vari-
ables has a corresponding high and low confidence bound, for example, CIF cancer0 lci
and CIF cancer1 uci. These were created because the ci option was specified. The
maxt() option has been specified to restrict the predictions for the cumulative incidence
function to a maximum follow-up time of 60 months; this was done for illustrative pur-
poses only.
By specifying the hazard option, we have generated cause-specific hazards that
correspond with each of the cumulative incidence functions. These are labeled as
h cancer0, h cvd0, h other0, h cancer1, h cvd1, and h other1. The options contmort
and conthaz are the two additional measures mentioned previously. The contmort op-
tion produces what we have named the “relative contribution to the total mortality”.
This is essentially the cumulative incidence function for each specific cause divided by
the sum of all the cumulative incidence functions. It can be interpreted as the prob-
ability that you will die from a particular cause given that you have died by time t.
The conthaz option produces what we have named the “relative contribution to the
overall hazard”. This is similar to the last measure in that it is the cause-specific hazard
for a particular cause divided by the sum of all the cause-specific hazards. It can be
interpreted as the probability that you will die from a particular cause given that you
die at time t.
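In our notation (a restatement of the two definitions above, not additional output of the command), for cause k at time t these measures are

$$\frac{C_k(t)}{\sum_{j=1}^{K} C_j(t)} \qquad\text{and}\qquad \frac{h_k(t)}{\sum_{j=1}^{K} h_j(t)}$$

respectively.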

If we plot the cumulative incidence functions for each cause against time, we can
achieve plots as shown in figure 1.
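One panel of such a plot can be drawn along the following lines (a sketch only; the graph options are illustrative and not the authors' exact code):

. twoway (line CIF_cancer0 _newt, sort) (line CIF_cancer1 _newt, sort),
>     ytitle("Cumulative incidence") xtitle("Months since randomization")
>     legend(order(1 "Untreated" 2 "Treated"))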

[Figure: three panels (Cancer, CVD, Other); y axis, cumulative incidence (0.0–0.4); x axis, months since randomization (0–60); curves for untreated and treated patients.]

Figure 1. Cumulative incidence of cancer, CVD, and other causes of death in treated
and untreated patients with prostate cancer

The plots in figure 1 give the actual probabilities of dying from each cause, taking
into account the competing causes. The treated group have a lower probability of dying
from cancer or other causes compared with the untreated group, but have a higher
probability of dying from CVD.
The model fit above is relatively simple because it only considers treatment as a
predictor for the three causes of death. Age is an important factor when fitting the
probability of death, so we shall now consider a model including age as a continuous
variable with a time-dependent effect. Although the effect of age will most likely differ
between the three causes of death, for demonstrative purposes, we will assume that the
effect of age can be shared across all three causes. This is one of the main advantages of
stacking the data as shown previously. The stpm2 command can be rerun to include age
in both the variable list and the tvc() option. The three cause indicators (cancer, cvd,
and other) remain as time-dependent effects with 3 degrees of freedom to maintain the
stratified model with three separate baselines. Age is now included as a time-dependent
effect with only 1 degree of freedom.

. stpm2 cancer cvd other treatcancer treatcvd treatother age, scale(hazard)


> rcsbaseoff dftvc(cancer:3 cvd:3 other:3 age:1) nocons
> tvc(cancer cvd other age) eform nolog
Log likelihood = -1140.8413 Number of obs = 1515

exp(b) Std. Err. z P>|z| [95% Conf. Interval]

xb
cancer .0146644 .0103431 -5.99 0.000 .0036804 .0584297
cvd .0109487 .0078047 -6.33 0.000 .0027076 .0442727
other .0061321 .0044357 -7.04 0.000 .0014856 .0253121
treatcancer .6862214 .112055 -2.31 0.021 .4982751 .9450598
treatcvd 1.208279 .2048582 1.12 0.264 .8666626 1.684553
treatother .6468979 .1705538 -1.65 0.099 .3858491 1.084561
age 1.039325 .009951 4.03 0.000 1.020004 1.059013
_rcs_cancer1 15.00829 10.53069 3.86 0.000 3.793838 59.37229
_rcs_cancer2 .897379 .0757066 -1.28 0.199 .7606152 1.058734
_rcs_cancer3 1.046672 .0375211 1.27 0.203 .9756565 1.122858
_rcs_cvd1 12.71949 9.099712 3.55 0.000 3.129737 51.69301
_rcs_cvd2 .8897659 .0515221 -2.02 0.044 .7943039 .9967009
_rcs_cvd3 1.013176 .0358712 0.37 0.712 .9452535 1.085979
_rcs_other1 12.19435 8.773031 3.48 0.001 2.976976 49.95076
_rcs_other2 .7976752 .0553103 -3.26 0.001 .6963126 .9137932
_rcs_other3 .9682531 .0511771 -0.61 0.542 .8729684 1.073938
_rcs_age1 .980301 .0092192 -2.12 0.034 .9623973 .9985378

As before, we can now use the stpm2cif command to obtain the cumulative incidence
functions for cancer, CVD, and other causes. This time, we want to predict for ages 65
and 75 in both of the two treatment groups, so we will need to run the command four
times.

. stpm2cif age65cancer0 age65cvd0 age65other0, cause1(cancer 1 age 65)


> cause2(cvd 1 age 65) cause3(other 1 age 65) ci hazard contmort conthaz
> maxt(60)
. stpm2cif age65cancer1 age65cvd1 age65other1,
> cause1(cancer 1 treatcancer 1 age 65)
> cause2(cvd 1 treatcvd 1 age 65) cause3(other 1 treatother 1 age 65)
> ci hazard contmort conthaz maxt(60)
. stpm2cif age75cancer0 age75cvd0 age75other0, cause1(cancer 1 age 75)
> cause2(cvd 1 age 75) cause3(other 1 age 75) ci hazard contmort conthaz
> maxt(60)
. stpm2cif age75cancer1 age75cvd1 age75other1,
> cause1(cancer 1 treatcancer 1 age 75)
> cause2(cvd 1 treatcvd 1 age 75) cause3(other 1 treatother 1 age 75)
> ci hazard contmort conthaz maxt(60)

The stpm2cif commands have generated 12 new variables for the cumulative inci-
dence functions, labeled CIF age65cancer0, CIF age65cvd0, CIF age65other0,
CIF age65cancer1, CIF age65cvd1, CIF age65other1, CIF age75cancer0,
CIF age75cvd0, CIF age75other0, CIF age75cancer1, CIF age75cvd1, and
CIF age75other1. A 65 next to age represents the prediction for those 65 years old; a
75 represents a prediction for those 75 years old.

Rather than plotting the cumulative incidence function as a line for each cause
separately as we did previously, we display them by stacking them on top of each other.
This produces a graph as shown in figure 2. To do this, we need to generate new
variables that sum up the cumulative incidence functions. This is done for each of the
two treatment groups and two ages. The code shown below is for the 65-year-olds in
the treatment group only.
. generate age65treat1 = CIF_age65cancer1
(518 missing values generated)
. generate age65treat2 = age65treat1+CIF_age65cvd1
(518 missing values generated)
. generate age65treat3 = age65treat2+CIF_age65other1
(518 missing values generated)
. twoway (area age65treat3 _newt, sort fintensity(100))
> (area age65treat2 _newt, sort fintensity(100))
> (area age65treat1 _newt, sort fintensity(100)), ylabel(0(0.2)1, angle(0)
> format(%3.1f)) ytitle("") xtitle("")
> legend(order(3 "Cancer" 2 "CVD" 1 "Other") rows(1) size(small))
> title("Treated") plotregion(margin(zero)) scheme(sj)
> saving(treatedage65, replace)
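The remaining three panels (untreated at age 65, and both groups at age 75) are generated with the same twoway syntax; the saved graphs can then be combined, for example (the additional graph names are assumptions):

. graph combine treatedage65.gph untreatedage65.gph treatedage75.gph
>     untreatedage75.gph, rows(2) ycommon xcommon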

[Figure: four stacked-area panels (Treated and Untreated, at ages 65 and 75); y axis, cumulative incidence (0.0–1.0); x axis, months since randomization (0–60); areas stacked for cancer, CVD, and other causes.]

Figure 2. Stacked cumulative incidence of cancer, CVD, and other causes of death for
those aged 65 and 75 in treated and untreated patients with prostate cancer

The results in figure 2 allow us to visualize the total probability of dying in both
the treated and the untreated groups for those aged 65 and 75 and allow us to see how
this is broken down by the specific causes. As expected, the total probability of death
is higher for the oldest age in both treatment groups. The distribution of deaths across
the three causes in each treatment group is roughly the same for both ages. Again
we see that although the treatment reduces the total probability of death, it actually
increases the probability of death from CVD.
Using a similar process to the one used above to obtain the stacked cumulative
incidence plots, we can also produce stacked plots of the relative contribution to the
total mortality and the relative contribution to the hazard. These graphs are shown in
figures 3 and 4.

[Figure: four stacked-area panels (Treated and Untreated, at ages 65 and 75); y axis, relative contribution to the total mortality (0.0–1.0); x axis, months since randomization (0–60); areas stacked for cancer, CVD, and other causes.]

Figure 3. Relative contribution to the total mortality for those aged 65 and 75 in treated
and untreated patients with prostate cancer

Figure 3 shows the relative contribution to the total mortality for those aged 65
and 75 in the two treatment groups. If we focus on the 65-year-olds in the treated
group, the plot shows us that given a patient 65 years old is going to die by 40 months
if treated, then the probability of dying from cancer is 0.39, the probability of dying
from CVD is 0.48, and the probability of dying from other causes is 0.13. However, if the
patient is untreated, then the same probabilities are 0.49, 0.34, and 0.17, respectively.

[Figure: four stacked-area panels (Treated and Untreated, at ages 65 and 75); y axis, relative contribution to the overall hazard (0.0–1.0); x axis, months since randomization (0–60); areas stacked for cancer, CVD, and other causes.]

Figure 4. Relative contribution to the hazard for those aged 65 and 75 in treated and
untreated patients with prostate cancer

Figure 4 shows the relative contribution to the overall hazard for those aged 65 and 75
in the two treatment groups. Again, if we focus on the 65-year-olds in the treated group,
the plot shows us that given a patient 65 years old is going to die at 40 months if treated,
then the probability of dying from cancer is 0.39, the probability of dying from CVD
is 0.45, and the probability of dying from other causes is 0.16. However, if the patient
is untreated, then the same probabilities are 0.48, 0.32, and 0.20, respectively.

5 Conclusion
The new command stpm2cif provides an extension to the command stpm2 to enable
users to estimate the cumulative incidence function from a flexible parametric survival
model. We hope that it will be a useful tool in medical research.

6 References
Coviello, E. 2009. stcompadj: Stata module to estimate the covariate-adjusted cu-
mulative incidence function in the presence of competing risks. Statistical Software
Components S457063, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s457063.html.

Coviello, V., and M. Boggess. 2004. Cumulative incidence estimation in the presence of
competing risks. Stata Journal 4: 103–112.

Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistribution
of a competing risk. Journal of the American Statistical Association 94: 496–509.

Lambert, P. C., and P. Royston. 2009. Further development of flexible parametric


models for survival analysis. Stata Journal 9: 265–290.

Prentice, R. L., J. D. Kalbfleisch, A. V. Peterson, Jr., N. Flournoy, V. T. Farewell, and


N. E. Breslow. 1978. The analysis of failure times in the presence of competing risks.
Biometrics 34: 541–554.

Putter, H., M. Fiocco, and R. B. Geskus. 2007. Tutorial in biostatistics: Competing


risks and multi-state models. Statistics in Medicine 26: 2389–2430.
Royston, P., and M. K. B. Parmar. 2002. Flexible parametric proportional-hazards and
proportional-odds models for censored survival data, with application to prognostic
modelling and estimation of treatment effects. Statistics in Medicine 21: 2175–2197.

About the authors


Sally Hinchliffe is a PhD student at the University of Leicester, UK. She is currently working
on developing a methodology for application in competing risks.
Paul Lambert is a reader in medical statistics at the University of Leicester, UK. His main
interest is in the development and application of methods in population-based cancer research.
The Stata Journal (2013)
13, Number 2, pp. 356–365

Goodness-of-fit tests for categorical data


Rino Bellocco
University of Milano–Bicocca
Milan, Italy
[email protected]
and
Karolinska Institutet
Stockholm, Sweden
[email protected]

Sara Algeri
Texas A&M University
College Station, TX
[email protected]

Abstract. A significant aspect of data modeling with categorical predictors is


the definition of a saturated model. In fact, there are different ways of specifying
it—the casewise, the contingency table, and the collapsing approaches—and they
strictly depend on the unit of analysis considered.
The analytical units of reference could be the subjects or, alternatively, groups
of subjects that have the same covariate pattern. In the first case, the goal is to
predict the probability of success (failure) for each individual; in the second case,
the goal is to predict the proportion of successes (failures) in each group. The
analytical unit adopted does not affect the estimation process; however, it does
affect the definition of a saturated model. Consequently, measures and tests of
goodness of fit can lead to different results and interpretations. Thus one must
carefully consider which approach to choose.
In this article, we focus on the deviance test for logistic regression models.
However, the results and the conclusions are easily applicable to other linear models
involving categorical regressors.
We show how Stata 12.1 performs when implementing goodness of fit. In this
situation, it is important to clarify which one of the three approaches is imple-
mented as default. Furthermore, a prominent role is played by the shape of the
dataset considered (individual format or events–trials format) in accordance with
the analytical unit choice. In fact, the same procedure applied to different data
structures leads to different approaches to a saturated model. Thus one must at-
tend to practical and theoretical statistical issues to avoid inappropriate analyses.
Keywords: st0299, saturated models, categorical data, deviance, goodness-of-fit
tests

1 Deviance test for goodness of fit


It is common to find applications of logistic regression models in categorical data anal-
ysis. In particular, considering the simplest case of a binary outcome Y , the logistic
regression model for the probability of success π {P (Y = 1)} is defined as
 
$$\ln\left\{\frac{\pi(\mathbf{x})}{1-\pi(\mathbf{x})}\right\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \qquad (1)$$



where π(x) is the probability of success given the set of covariates x = (x1 , . . . , xp ). Con-
sidering β = (β0 , . . . , βp ), the vector containing the unknown parameters in (1), under
the assumption of independent outcomes, we can obtain the corresponding maximum
likelihood estimates $\widehat{\boldsymbol\beta}$ by maximizing the following log-likelihood function:

$$\sum_{i=1}^{n}\left[y_i\ln\{\pi(\mathbf{x}_i)\} + (1 - y_i)\ln\{1 - \pi(\mathbf{x}_i)\}\right] \qquad (2)$$

where n is the total number of observations and yi is the observed outcome for the
ith subject. This situation is based on having subjects as analytical units; thus the
data layout presents one record for each individual considered in the dataset (individual
format).
When one works with categorical data, it is possible (and frequently more useful)
to consider some groups of subjects as units of analysis. These groups correspond
to the covariate patterns (that is, the specific combinations of predictor values xj ).
Thus it is possible to reshape the dataset so that each record will correspond to a
particular covariate pattern or profile (events–trials format), including the total number
of individuals and total number of successes (deaths, recoveries, etc.). In this case, the
goal is to predict the proportion of successes for each group. The quantity π will be the
same for any individual in the same group (Kleinbaum and Klein 2010), and we adopt
the binomial distribution as reference to model this probability. So if we rewrite the
log-likelihood function (2) in terms of covariate patterns, we obtain


$$\sum_{j=1}^{K}\left[s_j\ln\{\pi(\mathbf{x}_j)\} + (m_j - s_j)\ln\{1 - \pi(\mathbf{x}_j)\}\right] \qquad (3)$$

where K is the total number of possible (observed) covariate patterns, sj represents the
number of successes, mj is the number of total individuals, and π(xj ) is the proportion
of successes corresponding to the jth covariate pattern. Therefore, in spite of differ-
ent structures, because the information contained is exactly the same, the parameter
estimates from (2) and (3) are exactly the same.
Having defined the log-likelihood function, we can perform the assessment of good-
ness of fit with different methods. In this article, we focus our attention on the likelihood-
ratio test (LRT) based on the deviance statistics. The deviance statistic compares, in
terms of likelihood, the model being fit with the saturated model. The deviance statistic
for a generalized linear model (see Agresti [2007]) is defined as
  
 

$$G^2 = 2\left[\ln\left\{L_s\left(\widehat{\boldsymbol\beta}\right)\right\} - \ln\left\{L_m\left(\widehat{\boldsymbol\beta}\right)\right\}\right] \qquad (4)$$

where $\ln\{L_m(\widehat{\boldsymbol\beta})\}$ is the maximized log likelihood of the model of interest and $\ln\{L_s(\widehat{\boldsymbol\beta})\}$
is the maximized log likelihood of the saturated model. This quantity can also be
interpreted as a comparison between the values predicted by the fitted model and those
predicted by the most complete model. Evidence for model lack-of-fit occurs when the
value of G2 is large (see Hosmer et al. [1997]).

It is generally accepted that this statistic, under specific conditions of regularity,


converges asymptotically to a χ2 distribution with h degrees of freedom, where h is the
difference between the parameters in the saturated model and the parameters in the
model being fit:
G2 ∼ χ2 (h)
Therefore, we use the deviance test to assess the following hypothesis

H0 : βh = 0

where βh is the vector containing the additional parameters of the saturated model
compared with the model considered. So H0 is rejected when

$$G^2 \geq \chi^2_{1-\alpha}$$

where α is the level of significance. If H0 cannot be rejected, we can safely conclude that
the fitting of the model of interest is substantially similar to that of the most complete
model that can be built (see section 2). We must clarify that the LRT can always be
used to compare two nested models in terms of differences of deviances.
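For instance (a sketch with hypothetical variables y, x1, and x2; it is not taken from the example analyzed below), a difference-of-deviances comparison of two nested logistic models can be run as

. quietly glm y i.x1 i.x2, family(binomial) link(logit)
. estimates store full
. quietly glm y i.x1, family(binomial) link(logit)
. estimates store reduced
. lrtest full reduced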

2 Definition of saturated model


A particular issue that is not carefully considered in categorical data analysis is the
definition of a saturated model. In fact, according to Simonoff (1998), three different
specifications are available and depend on the unit of analysis. In general, we can think
of the saturated model as the model that leads to the perfect prediction of the outcome
of interest and represents the largest model we can fit. Thus it is used as a reference
for the assessment of the fitting of any other model of interest.
The traditional approach is the one that considers the saturated model as the model
that gives a perfect fit of the data. So it assumes the subjects to be the analytical
unit and is identified with the casewise approach. This model contains the intercept
and n − 1 covariates (where n is the total number of available observations as specified
above). Consequently, the maximized log-likelihood function in (2) is always equal to 0 (see
Kleinbaum and Klein [2010]). So the deviance statistic shown in (4) results in


$$G^2 = -2\ln\left\{L_m\left(\widehat{\boldsymbol\beta}\right)\right\} = -2\sum_{i=1}^{n}\left[y_i\ln\{\widehat{\pi}(\mathbf{x}_i)\} + (1 - y_i)\ln\{1 - \widehat{\pi}(\mathbf{x}_i)\}\right]$$
$$= -2\sum_{i=1}^{n}\left[y_i\ln\left\{\frac{\widehat{\pi}(\mathbf{x}_i)}{1 - \widehat{\pi}(\mathbf{x}_i)}\right\} + \ln\{1 - \widehat{\pi}(\mathbf{x}_i)\}\right]$$

This approach is generally followed in the case of continuous covariates whose values can-
not be grouped into categorical values. In fact, in this situation, each covariate pattern
will most likely correspond to one subject (n = K), and obviously, the most reasonable
analytical unit is the subject. However, in this case, the G2 goodness-of-fit statistic
cannot be approximated by a χ2 distribution (see Kuss [2002] and Kleinbaum and Klein
[2010]). Thus even if statistical packages provide a p-value from a χ2 distribution, we


recommend using the second or third approach—the contingency table or collapsing
approach—if all the regressors are or can be reduced to categorical variables.
These two approaches are based on groups of subjects as analytical units (and the
data layout will be in events–trials format). These groups correspond to the covariate
patterns, and individuals sharing the same covariate pattern are members of the same
group. The saturated model is identified as the one with the intercept and K − 1
regressors, where K is the number of all the possible covariate patterns; in other words,
the model includes all possible main effects and all possible interaction effects (two-
way, three-way, etc., until the maximum possible interaction order). The difference of
the two situations is based on the covariate pattern specification. In the first one, the
covariate patterns are built by considering all the covariates available in the dataset. In
the second one, the covariate patterns are based only on the variables specified in the
model of interest.
Clearly, if the model of interest includes all the variables available in the dataset,
the two approaches coincide. Under these situations (if n ≠ K), the log likelihood of
the saturated model is not equal to 0, and the G2 statistic is
  
 

$$G^2 = 2\left[\ln\left\{L_s\left(\widehat{\boldsymbol\beta}\right)\right\} - \ln\left\{L_m\left(\widehat{\boldsymbol\beta}\right)\right\}\right]
= 2\left(\sum_{j=1}^{K}\left[s_j\ln\left\{\frac{\widehat{\pi}_s(\mathbf{x}_j)}{\widehat{\pi}_m(\mathbf{x}_j)}\right\} + (m_j - s_j)\ln\left\{\frac{1 - \widehat{\pi}_s(\mathbf{x}_j)}{1 - \widehat{\pi}_m(\mathbf{x}_j)}\right\}\right]\right)$$

where $\widehat{\pi}_s(\mathbf{x}_j)$ is the proportion of successes for the jth covariate pattern predicted by
the saturated model and $\widehat{\pi}_m(\mathbf{x}_j)$ is the one predicted by the fitted model.
The collapsing approach has a main drawback: it uses different saturated models
corresponding to different models of interest, complicating the comparison of their re-
sults in terms of goodness of fit. On the other hand, the contingency approach may
require the listing of a high number of covariates. In this case, we could have many
covariate patterns with a small number of subjects, making the use of the χ2 approx-
imation in the LRT for goodness of fit difficult once again. A possible remedy could
be that the hypothetical saturated model in the contingency approach should be based
on variables identified through the corresponding directed acyclic graphs. In a causal
inference framework, we could then use only the variables suggested by the d-separation
algorithm applied to the directed acyclic graph, which requires the researcher to specify
the interrelationship among the variables (Greenland, Pearl, and Robins 1999).

3 Implementation of the LRT


In this section, we implement the LRT, and we show the results by considering the three
different saturated model specifications, using Stata 12.1. The data used in the analyses
refer to the Titanic disaster on 15 April 1912. Information on 2,201 persons is available
on three covariates: sex (male or female), economic status (first-class passenger, second-
class passenger, third-class passenger, or crew), and age (adult or child), which defines
16 different covariate patterns (among which 14 were observed). The outcome of interest
is either passenger’s survival (1 = survivor, 0 = deceased) or the number of survivors
and total number of passengers.
As anticipated above, two possible ways to represent these data can be considered
with respect to the goal and the unit of analysis (Kleinbaum and Klein 2010):

• Individual-record format: One record for each subject considered with the infor-
mation on survival (or death) contained in a binary variable (individ.txt).

• Events–trials format: One record for each covariate pattern with frequencies on
survivors and total number of passengers (grouped.txt) available as follows:

age sex status survival n

1. Adult Male First 57 175


2. Adult Male Second 14 168
3. Adult Male Third 75 462
4. Adult Male Crew 192 862
5. Child Male First 5 5

6. Child Male Second 11 11


7. Child Male Third 13 48
8. Child Male Crew 0 0
9. Adult Female First 140 144
10. Adult Female Second 80 93

11. Adult Female Third 76 165


12. Adult Female Crew 20 23
13. Child Female First 1 1
14. Child Female Second 13 13
15. Child Female Third 14 31

16. Child Female Crew 0 0

Clearly, with simple data-reshaping procedures available in Stata, it is possible to
swap from one format to another. This will allow us to implement each of the three
approaches summarized in the previous section.
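For example (a sketch; it assumes the individual-format file contains the variables age, sex, status, and the 0/1 outcome survival), the events–trials layout can be obtained from the individual layout as follows:

. insheet using individ.txt, tab clear
. collapse (sum) survival (count) n=survival, by(age sex status)

Here collapse sums the 0/1 outcome to give the number of survivors per covariate pattern and counts the records to give the group size n.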

3.1 The casewise approach


The glm procedures, available in Stata, use as the default saturated model definition
the model with as many covariates as the number of records in the data file. Thus using
individ.txt with subjects as analytical units, we can easily implement the deviance
test by considering the casewise definition of the saturated model, shown below:

. insheet using individ.txt, tab clear


(5 vars, 2201 obs)
. generate male = sex=="Male"
. encode status, generate(econ_status)
. glm survival i.male i.econ_status, family(binomial) link(logit)
Iteration 0: log likelihood = -1116.4813
Iteration 1: log likelihood = -1114.4582
Iteration 2: log likelihood = -1114.4564
Iteration 3: log likelihood = -1114.4564
Generalized linear models No. of obs = 2201
Optimization : ML Residual df = 2196
Scale parameter = 1
Deviance = 2228.91282 (1/df) Deviance = 1.014988
Pearson = 2228.798854 (1/df) Pearson = 1.014936
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = ln(u/(1-u)) [Logit]
AIC = 1.017225
Log likelihood = -1114.45641 BIC = -14672.97

OIM
survival Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.male -2.421328 .1390931 -17.41 0.000 -2.693946 -2.148711

econ_status
2 .8808128 .1569718 5.61 0.000 .5731537 1.188472
3 -.0717844 .1709268 -0.42 0.675 -.4067948 .263226
4 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916

_cons 1.187396 .1574664 7.54 0.000 .878767 1.496024

The LRT for goodness of fit can be obtained with the following code:

. scalar dev=e(deviance)
. scalar df=e(df)
. display "GOF casewise "" G^2="dev " df="df " p-value= " chiprob(df, dev)
GOF casewise G^2=2228.9128 df=2196 p-value= .30705384

Thus the deviance statistic G2 is 2228.91 with 2196 (= 2201 − 5) degrees of freedom,
and the p-value referred to the deviance test is 0.3071. We notice that as expected, the
G2 corresponds to −2{lnm (β)} (= −2[−1114.46]). So in this case, the null hypothesis
cannot be rejected, and the fit of the model of interest is not different from the fit of
the saturated model.

3.2 The contingency table approach


The intuitive way of implementing the contingency table approach is to apply this same
procedure (glm) on grouped.txt. In this situation, we want to estimate the proportion
of successes; thus we need to redefine the outcome by specifying two new variables:
the first is the total number of subjects in each category n, and the second is the total
number of events, survival.

In Stata, we also need to add the variable containing the number of trials, n, in the
family() option:
. insheet using grouped.txt, tab clear
(5 vars, 16 obs)
. generate male = sex=="Male"
. encode status, generate(econ_status)
. glm survival i.male i.econ_status if n>0, family(binomial n) link(logit)
Iteration 0: log likelihood = -91.841683
Iteration 1: log likelihood = -89.026084
Iteration 2: log likelihood = -89.019672
Iteration 3: log likelihood = -89.019672
Generalized linear models No. of obs = 14
Optimization : ML Residual df = 9
Scale parameter = 1
Deviance = 131.4183066 (1/df) Deviance = 14.60203
Pearson = 127.8463371 (1/df) Pearson = 14.20515
Variance function: V(u) = u*(1-u/n) [Binomial]
Link function : g(u) = ln(u/(n-u)) [Logit]
AIC = 13.43138
Log likelihood = -89.01967223 BIC = 107.6668

OIM
survival Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.male -2.421328 .1390931 -17.41 0.000 -2.693946 -2.148711

econ_status
2 .8808128 .1569718 5.61 0.000 .5731537 1.188472
3 -.0717844 .1709268 -0.42 0.675 -.4067948 .263226
4 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916

_cons 1.187396 .1574664 7.54 0.000 .878767 1.496024

Now we can obtain the deviance test statistic:


. scalar dev=e(deviance)
. scalar df=e(df)
. display "GOF contingency "" G^2="dev " df="df " p-value= " chiprob(df, dev)
GOF contingency G^2=131.41831 df=9 p-value= 6.058e-24

The parameter estimates do not change from those of the casewise approach. But
as expected, the deviance statistic (131.42) has significantly decreased; the degrees of
freedom have changed (9 = 14 − 5); and the p-value for the deviance test will now let
us reject the null hypothesis, implying that the model of interest is not as good as the
saturated model.

3.3 The collapsing approach


Both the casewise and contingency table approaches can be applied very easily by
using the procedures shown above, whereas the collapsing approach requires more effort.

Thus, concerning the grouped dataset and by using the egen command, we first generate
a variable that allows us to identify all the possible covariate patterns referring just to
the variables male and econ status.

. insheet using grouped.txt, tab clear


(5 vars, 16 obs)
. generate male = sex=="Male"
. encode status, generate(econ_status)
. egen trtp=group(male econ_status)
. list

age sex status survival n male econ_s~s trtp

1. Adult Male First 57 175 1 First 6


2. Adult Male Second 14 168 1 Second 7
3. Adult Male Third 75 462 1 Third 8
4. Adult Male Crew 192 862 1 Crew 5
5. Child Male First 5 5 1 First 6

6. Child Male Second 11 11 1 Second 7


7. Child Male Third 13 48 1 Third 8
8. Child Male Crew 0 0 1 Crew 5
9. Adult Female First 140 144 0 First 2
10. Adult Female Second 80 93 0 Second 3

11. Adult Female Third 76 165 0 Third 4


12. Adult Female Crew 20 23 0 Crew 1
13. Child Female First 1 1 0 First 2
14. Child Female Second 13 13 0 Second 3
15. Child Female Third 14 31 0 Third 4

16. Child Female Crew 0 0 0 Crew 1

Second, we collapse the data by using the variable obtained in the previous step
and applying it to the two variables introduced into the model of interest (male and
econ status). In this way, we obtain a dataset where each record corresponds to a
covariate pattern identified by the combination of the covariates in the model.

. collapse (sum) survival n (first) male econ_status, by(trtp)


. list

trtp survival n male econ_s~s

1. 1 20 23 0 1
2. 2 141 145 0 2
3. 3 93 106 0 3
4. 4 90 196 0 4
5. 5 192 862 1 1

6. 6 62 180 1 2
7. 7 25 179 1 3
8. 8 88 510 1 4

We continue as we did in the contingency table approach:

. glm survival i.male i.econ_status if n>0, family(binomial n) link(logit)


Iteration 0: log likelihood = -54.70349
Iteration 1: log likelihood = -52.362699
Iteration 2: log likelihood = -52.356281
Iteration 3: log likelihood = -52.356281
Generalized linear models No. of obs = 8
Optimization : ML Residual df = 3
Scale parameter = 1
Deviance = 65.17983096 (1/df) Deviance = 21.72661
Pearson = 60.87983277 (1/df) Pearson = 20.29328
Variance function: V(u) = u*(1-u/n) [Binomial]
Link function : g(u) = ln(u/(n-u)) [Logit]
AIC = 14.33907
Log likelihood = -52.35628099 BIC = 58.94151

OIM
survival Coef. Std. Err. z P>|z| [95% Conf. Interval]

1.male -2.421328 .1390931 -17.41 0.000 -2.693946 -2.148711

econ_status
2 .8808128 .1569718 5.61 0.000 .5731537 1.188472
3 -.0717844 .1709268 -0.42 0.675 -.4067948 .263226
4 -.7774228 .1423145 -5.46 0.000 -1.056354 -.4984916

_cons 1.187396 .1574664 7.54 0.000 .878767 1.496024

. scalar dev=e(deviance)
. scalar df=e(df)
. display "GOF collapsing "" G^2="dev " df="df " p-value= " chiprob(df, dev)
GOF collapsing G^2=65.179831 df=3 p-value= 4.591e-14

By collapsing the data in this way, we obtain the results implied by the collapsing
approach definition of the saturated model. As in the contingency table approach, we
reject H0 , but now the value of the deviance statistic has changed to 65.18 with 3
(= 8 − 5) degrees of freedom. As expected, the coefficient estimates do not change.

4 Discussion
The casewise approach is often considered the standard for defining the saturated model.
The reason is that the analysis is focused on subjects, and the saturated model, instead
of the fully parameterized model, is seen as the model that gives the “perfect fit” (see
Kleinbaum and Klein [2010]). This fact does not affect the estimation process; however,
it fatally compromises the inferential step in a goodness-of-fit evaluation where the χ2
approximation becomes questionable. Considering the other approaches can lead to
different and meaningful results in terms of both descriptive and inferential analysis,
but the problem is how to implement them correctly with the statistical package we
are working with.

Considering Stata 12.1, we have noticed that in all cases, the default procedures for
goodness of fit consider the saturated model to be the one with as many covariates as
the number of records present in the dataset. Thus, using an individual data layout,
we obtain results relative to the casewise saturated model, where the analytical units
are subjects. However, when considering an events–trials data format, we assess the
goodness of fit based on the contingency table approach, where the unit of analysis
is the covariate pattern defined by the possible values of all the independent variables
in the dataset. The less intuitive implementation is the one based on the collapsing
approach, which uses the covariate patterns defined by the variables involved in the
model. One simple solution is to build a new dataset containing only these variables,
as we did above with the egen and collapse commands, which show clearly how the
collapsing approach works.
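In short, the recipe used above can be restated in three steps; the new variable name
pattern is purely illustrative, survival and n are the event and trial counts, and male
and econ_status are the covariates in the model of interest:

. egen pattern = group(male econ_status)
. collapse (sum) survival n (first) male econ_status, by(pattern)
. glm survival i.male i.econ_status if n>0, family(binomial n) link(logit)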

5 References
Agresti, A. 2007. An Introduction to Categorical Data Analysis. 2nd ed. Hoboken, NJ:
Wiley.

Greenland, S., J. Pearl, and J. M. Robins. 1999. Causal diagrams for epidemiologic
research. Epidemiology 10: 37–48.

Hosmer, D. W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. A comparison of
goodness-of-fit tests for the logistic regression model. Statistics in Medicine 15: 965–980.

Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd
ed. New York: Springer.

Kuss, O. 2002. Global goodness-of-fit tests in logistic regression with sparse data.
Statistics in Medicine 21: 3789–3801.
Simonoff, J. S. 1998. Logistic regression, categorical predictors, and goodness-of-fit: It
depends on who you ask. American Statistician 52: 10–14.

About the authors


Rino Bellocco is an associate professor of biostatistics in the Department of Statistics and
Quantitative Methods at the University of Milano–Bicocca, Italy, and in the Department of
Medical Epidemiology and Biostatistics at the Karolinska Institutet, Sweden.
Sara Algeri is a statistician and currently a PhD student at Texas A&M University in College
Station, TX. She obtained both her bachelor’s and her master’s degrees from the University
of Milano–Bicocca, Italy. In the last year of her master’s studies, she worked at Mount Sinai
School of Medicine in New York as a visiting master’s student. This experience has been crucial
in developing her interest in biostatistics and clinical trials. Her current research mainly focuses
on longitudinal data analysis, Bayesian statistics, and statistical applications in genetics.
The Stata Journal (2013)
13, Number 2, pp. 366–378

Standardizing anthropometric measures in children and adolescents with functions for
egen: Update
Suzanna I. Vidmar
Clinical Epidemiology and Biostatistics Unit
Murdoch Childrens Research Institute and
University of Melbourne Department of Paediatrics
Royal Children’s Hospital
Melbourne, Australia
[email protected]

Tim J. Cole
MRC Centre of Epidemiology for Child Health
UCL Institute of Child Health
London, UK
[email protected]

Huiqi Pan
MRC Centre of Epidemiology for Child Health
UCL Institute of Child Health
London, UK
[email protected]

Abstract. In this article, we describe an extension to the egen functions zanthro()
and zbmicat() (Vidmar et al., 2004, Stata Journal 4: 50–55). All functionality of
the original version remains unchanged. In the 2004 version of zanthro(), z scores
could be generated using the 2000 U.S. Centers for Disease Control and Prevention
Growth Reference and the British 1990 Growth Reference. More recent growth
references are now available. For measurement-for-age charts, age can now be ad-
justed for gestational age. The zbmicat() function previously categorized children
according to body mass index (weight/height2 ) as normal weight, overweight, or
obese. “Normal weight” is now split into normal weight and three grades of thin-
ness. Finally, this updated version uses cubic rather than linear interpolation to
calculate the values of L, M, and S for the child’s decimal age between successive
ages (or length/height for weight-for-length/height charts).
Keywords: dm0004_1, zanthro(), zbmicat(), z scores, LMS, egen, anthropometric
standards


© 2013 StataCorp LP dm0004_1

1 Introduction
Comparison of anthropometric data from children of different ages is complicated by
the fact that children are still growing. We cannot directly compare the height of a
5-year-old with that of a 10-year-old. Clinicians and researchers are often interested in
determining how a child compares with other children of the same age and sex: Is the
child taller, shorter, or about the same height as the average for his or her age and sex?
The growth references available to zanthro() tabulate values obtained by the LMS
method, developed by Cole (1990) and Cole and Green (1992). The LMS values are used
to transform raw anthropometric data, such as height, to standard deviation scores (z
scores). These are standardized to the reference population for the child’s age and sex (or
for length/height and sex). Two sets of population-based reference data that were widely
used at the time zanthro() was initially developed are the 2000 Centers for Disease Con-
trol and Prevention (CDC) Growth Reference in the United States (Kuczmarski et al.
2000) and the British 1990 Growth Reference (Cole, Freeman, and Preece 1998). Since
then, the following population-based reference data have been released and are now
available in zanthro(): the WHO Child Growth Standards, the WHO Reference 2007,
the UK-WHO Preterm Growth Reference, and the UK-WHO Term Growth Reference.
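For reference, the LMS transformation itself, as given in the LMS literature cited above
(for example, Cole [1990]) rather than restated in this update, converts a measurement
X to a z score as

z = {(X/M)^L − 1} / (L × S)   if L ≠ 0,   and   z = ln(X/M) / S   if L = 0

where L, M, and S are the tabulated reference values for the child's age (or
length/height) and sex.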

1.1 WHO Child Growth Standards


The WHO Child Growth Standards (World Health Organization 2006, 2007) are the
result of the Multicentre Growth Reference Study (MGRS) undertaken by the World
Health Organization (WHO) between 1997 and 2003. They replace the 1977 National
Center for Health Statistics (NCHS)/WHO Growth Reference created by the U.S. NCHS
and WHO. The 1977 reference underestimated levels of low weight-for-age (underweight)
for breast-fed infants. A number of specific limitations were noted by the WHO working
group in 1995: “1) the sample was limited to Caucasian infants from mostly middle-
income families; 2) data were collected every three months rather than monthly, which
limited the accuracy of developing the growth curve, particularly from 0–6 months
of age; and 3) the majority of the infants in the sample were bottle-fed, and if they
were breast-fed it was only for a short duration (typically less than three months)”
(Binagwaho, Ratnayake, and Smith Fawzi 2009).
WHO concluded that new growth curves were necessary, a recommendation endorsed
by the World Health Assembly. The MGRS collected primary growth data and related
information on 8,440 healthy breast-fed infants and young children in Brazil, Ghana,
India, Norway, Oman, and the United States. It combined a longitudinal follow-up from
birth to 24 months and a cross-sectional survey of children aged 18–71 months. “The
MGRS is unique in that it was purposely designed to produce a standard by selecting
healthy children living under conditions likely to favor the achievement of their full
genetic growth potential” (World Health Organization 2006). The WHO Child Growth
Standards can be used to assess the growth and development of children from 0–5 years.

1.2 WHO Reference 2007


The WHO Reference 2007 (de Onis et al. 2007) is a modification of the 1977 NCHS/WHO
Growth Reference for children and adolescents aged 5–19 years. It was merged with data
from the cross-sectional sample of children aged 18–71 months to smooth the transition
at 5 years. The WHO Reference 2007 can be used for children and adolescents aged 5–19
years. It complements the WHO Child Growth Standards.

1.3 UK-WHO Growth References


In 2007, the Scientific Advisory Committee on Nutrition recommended that a modified
WHO chart be adopted in the UK. Two composite UK-WHO data files (Cole et al. 2011),
one for preterm and the other for term births, were launched in May 2009. Both
comprise three sections:

• A birth section based on the British 1990 Growth Reference. Acknowledgment
statements for these data should specify the data source as “British 1990 Growth
Reference, reanalyzed 2009”.

• A postnatal section from 2 weeks to 4 years copied from the WHO Child Growth
Standards.

• The 4–20 years section from the British 1990 Growth Reference.

Term infants are those born at 37 completed weeks’ gestation and beyond. The UK-
WHO Term Growth Reference can be used for these infants. For infants born before 37
completed weeks’ gestation, the UK-WHO Preterm Growth Reference can be used, with
gestationally corrected age.

1.4 Additional growth charts available in zanthro()


The WHO and composite UK-WHO growth data are now available in zanthro(). In
addition, two new measurements have been added to the British 1990 Growth Refer-
ence: waist-for-age and percentage body fat–for-age (based on Tanita body composition
analyzer/scales).

1.5 Categorizing children into grades of thinness and overweight


Body mass index (BMI) cutoffs are used to define categories of thinness (Cole et al.
2007) and overweight (Cole et al. 2000) in children and adolescents aged 2–18 years.
BMI data were obtained from nationally representative surveys of children in Brazil,
Great Britain, Hong Kong, the Netherlands, Singapore, and the United States. The
thinness cutoffs correspond to equivalent adult BMI cutoff points endorsed by WHO of
16, 17, and 18.5 kg/m2 . zbmicat() now categorizes children into these three thinness
grades as well as normal weight, overweight, or obese according to international cutoff
points.

1.6 Comparison with LMSgrowth


LMSgrowth is a Microsoft Excel add-in to convert measurements to and from United
States, UK, WHO, and composite UK-WHO reference z scores. It can be downloaded
via this link: http://www.healthforallchildren.com/index.php/shop/product/Software/
Gr5yCsMCONpF39hF/0. z scores can only be calculated for the range of ages within
a growth chart. Where the age ranges for the two programs do not overlap (table 1),
only one of zanthro() and LMSgrowth will generate z scores.

Table 1. Differences in age ranges for zanthro() and LMSgrowth

Chart Age range for zanthro() Age range for LMSgrowth


CDC: Weight 0–20 years 0–19.96 years
CDC: Height 2–20 years 1.96–19.96 years
CDC: BMI 2–20 years 1.96–19.96 years

Each growth reference is summarized by three numbers, called L, M, and S, which
represent the skewness, median, and coefficient of variation of the measurement as
it changes with age (or length and height). For age, L, M, and S values are generally
tabulated at monthly intervals. For length and height, these parameters are tabulated at
0.5 or 1 cm intervals. Where a child’s age, length, and height occur within these intervals,
values of L, M, and S are obtained via cubic interpolation except at the endpoints of
the charts, where linear interpolation is used. (The BMI cutoff points are tabulated at 6
monthly intervals. The cutoff point where a child’s age occurs within these intervals is
also obtained via cubic interpolation—or linear interpolation from 2–2.5 years and 17.5–
18 years.) Minor discrepancies in the z scores calculated by zanthro() and LMSgrowth
will be caused by different segment lengths and methods of interpolation for a segment
(figure 1). Even so, the z scores generated by the two programs should agree within one
decimal place.

[Figure 1 about here. The figure has two panels, one for the CDC weight chart (ages
0–20 years, tabulated at 0–240 months) and one for the CDC height and BMI charts
(ages 1.96–20 years, tabulated at 23.5–240 months). For each panel, it marks the
tabulated ages covered by zanthro() and by LMSgrowth and indicates the segments at
the ends of each chart in which linear interpolation is used.]

Figure 1. Use of linear interpolation for charts with different age ranges

2 Syntax
     
egen [type] newvar = zanthro(varname,chart,version) [if] [in],
xvar(varname) gender(varname) gencode(male=code, female=code)
[ageunit(unit) gestage(varname) nocutoff]

egen [type] newvar = zbmicat(varname) [if] [in], xvar(varname)
gender(varname) gencode(male=code, female=code) [ageunit(unit)]

by cannot be used with either of these functions.



3 Functions
zanthro(varname,chart,version) calculates z scores for anthropometric measures in
children and adolescents according to United States, UK, WHO, and composite UK-
WHO reference growth charts. The three arguments are the following:

varname is the variable name of the measure in your dataset for which z scores are
calculated (for example, height, weight, or BMI).
chart; see tables 3–7 for a list of valid chart codes.
version is US, UK, WHO, UKWHOpreterm, or UKWHOterm. US calculates z scores by using
the 2000 CDC Growth Reference; UK uses the British 1990 Growth Reference; WHO
uses the WHO Child Growth Standards and WHO Reference 2007 composite data
files as the reference data; and UKWHOpreterm and UKWHOterm use the British and
WHO Child Growth Standards composite data files for preterm and term births,
respectively.
zbmicat(varname) categorizes children and adolescents aged 2–18 years into three
grades of thinness, normal weight, overweight, or obese by using BMI cutoffs (table 2).
BMI is in kg/m2 . This function generates a variable with the following values and
labels:

Table 2. Values and labels for grades of thinness and overweight

Value Grade/Label BMI range at 18 years


-3 Grade 3 thinness <16
-2 Grade 2 thinness 16 to <17
-1 Grade 1 thinness 17 to <18.5
0 Normal wt 18.5 to <25
1 Overweight 25 to <30
2 Obese 30+

Note that the values and labels have changed since the previous version of zbmicat(),
which coded 1 = Normal wt, 2 = Overweight, and 3 = Obese; the current coding is
shown in table 2.

4 Options
xvar(varname) specifies the variable used (along with gender) as the basis for stan-
dardizing the measure of interest. This variable is usually age but can also be length
or height when the measurement is weight; that is, weight-for-age, weight-for-length,
and weight-for-height are all available growth charts.
gender(varname) specifies the gender variable. It can be string or numeric. The codes
for male and female must be specified by the gencode() option.

gencode(male=code, female=code) specifies the codes for male and female. The gen-
der can be specified in either order, and the comma is optional. Quotes around the
codes are not allowed, even if the gender variable is a string.
ageunit(unit) gives the unit for the age variable and is only valid for measurement-for-
age charts; that is, omit this option when the chart code is wl or wh (see section 5).
The unit can be day, week, month, or year. This option may be omitted if the unit
is year, because this is the default. Time units are converted as follows:
1 year = 12 months = 365.25/7 weeks = 365.25 days
1 month = 365.25/84 weeks = 365.25/12 days
1 week = 7 days
Note: Ages cannot be expressed to full accuracy for all units. The consequence of
this will be most apparent at the extremes of age in the growth charts, where z
scores may be generated when the age variable is in one unit and missing for some
of those same ages when they have been converted to another unit.
gestage(varname) specifies the gestational age variable in weeks. This option enables
age to be adjusted for gestational age. The default is 40 weeks. If gestational age is
greater than 40 weeks, the child’s age will be corrected by the amount over 40 weeks.
A warning will be given if the gestational age variable contains a nonmissing value
over 42. As with the ageunit() option, this option is only valid for measurement-
for-age charts.
nocutoff forces calculation of all z scores, allowing for extreme values in your dataset.
By default, any z scores with absolute values greater than or equal to 5 (that is,
values that are 5 standard deviations or more away from the mean) are set to missing.
The decision to have a default cutoff at 5 standard deviations from the mean was
made as a way of attempting to capture extreme data entry errors. Apart from this
and setting to missing any z scores where the measurement is a nonpositive number,
these functions will not automatically detect data errors. As always, please check
your data!

5 Growth charts
Growth charts available in zanthro() are presented in tables 3–7. Note: Where xvar()
is outside the permitted range, zanthro() and zbmicat() return a missing value.

Table 3. 2000 CDC Growth Charts, version US

chart Description Measurement unit xvar() range


la length-for-age cm 0–35.5 months
ha height-for-age cm 2–20 years
wa weight-for-age kg 0–20 years
ba BMI-for-age kg/m2 2–20 years
hca head circumference–for-age cm 0–36 months
wl weight-for-length kg 45–103.5 cm
wh weight-for-height kg 77–121.5 cm

Table 4. British 1990 Growth Charts, version UK

chart Description Measurement unit xvar() range


ha length/height-for-age cm 0–23 years
wa weight-for-age kg 0–23 years
ba BMI-for-age kg/m2 0–23 years
hca head circumference–for-age cm Males: 0–18 years
Females: 0–17 years
sha sitting height–for-age cm 0–23 years
lla leg length–for-age cm 0–23 years
wsa waist-for-age cm 3–17 years
bfa body fat–for-age % 4.75–19.83 years

Length/height and BMI growth data are available from 33 weeks gestation. Weight and
head circumference growth data are available from 23 weeks gestation.

Table 5. WHO Child Growth Charts and WHO Reference 2007 Charts, version WHO

chart Description Measurement unit xvar() range


ha length/height-for-age cm 0–19 years
wa weight-for-age kg 0–10 years
ba BMI-for-age kg/m2 0–19 years
hca head circumference–for-age cm 0–5 years
aca arm circumference–for-age cm 0.25–5 years
ssa subscapular skinfold–for-age mm 0.25–5 years
tsa triceps skinfold–for-age mm 0.25–5 years
wl weight-for-length kg 45–110 cm
wh weight-for-height kg 65–120 cm

Table 6. UK-WHO Preterm Growth Charts, version UKWHOpreterm

chart Description Measurement unit xvar() range


ha length/height-for-age cm 0–20 years
wa weight-for-age kg 0–20 years
ba BMI-for-age kg/m2 0.038–20 years
hca head circumference–for-age cm Males: 0–18 years
Females: 0–17 years

Length/height growth data are available from 25 weeks gestation. Weight and head
circumference growth data are available from 23 weeks gestation.

Table 7. UK-WHO Term Growth Charts, version UKWHOterm

chart Description Measurement unit xvar() range


ha length/height-for-age cm 0–20 years
wa weight-for-age kg 0–20 years
ba BMI-for-age kg/m2 0.038–20 years
hca head circumference–for-age cm Males: 0–18 years
Females: 0–17 years

Length/height, weight, and head circumference growth data are available from 37 weeks
gestation.

6 Examples
Below is an illustration with data on a set of British newborns. The British 1990 Growth
Reference is used; the variable sex is coded male = 1, female = 2; and the variable
gestation is “completed weeks gestation”.
. use zwtukeg
. list, noobs abbreviate(9)

sex ageyrs weight gestation

1 .01 3.53 38
2 .073 5.05 40
2 .115 4.68 42
1 .135 4.89 36
2 .177 2.75 28

To compare the weights of the babies in this sample (for instance, across socioeconomic
groups), we can convert weight to standardized z scores. The z scores are created using
the following command:
. egen zwtuk = zanthro(weight,wa,UK), xvar(ageyrs) gender(sex)
> gencode(male=1, female=2)
(Z values generated for 5 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)

In the command above, we have assumed all are term births. If some babies are born
prematurely, we can adjust for gestational age as follows.
. egen zwtuk_gest = zanthro(weight,wa,UK), xvar(ageyrs) gender(sex)
> gencode(male=1, female=2) gestage(gestation)
(Z values generated for 5 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)

Here are the results for both of the above commands:


. list, noobs abbreviate(10)

sex ageyrs weight gestation zwtuk zwtuk_gest

1 .01 3.53 38 -.2731358 .6329474


2 .073 5.05 40 1.696552 1.696552
2 .115 4.68 42 .2438011 -.4162439
1 .135 4.89 36 -.3017253 1.204823
2 .177 2.75 28 -4.707812 -.2137314

Note that at gestation = 40 weeks, the z score is the same whether or not the
gestage() option is used. The formula for gestationally corrected age is
actual age + (gestation at birth − 40)
where “actual age” and “gestation at birth” are in weeks.
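For example, the fourth baby listed above is a boy aged 0.135 years (about 7.0 weeks)
who was born at 36 weeks' gestation, so his gestationally corrected age is roughly
7.0 + (36 − 40) = 3.0 weeks; his weight of 4.89 kg is therefore compared with that of
much younger babies, which is why zwtuk_gest (1.20) is considerably larger than zwtuk
(−0.30).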

Gestational age may be recorded as weeks and days, as in the following example:

gestwks gestdays

38 3
40 6
42 0
36 2
28 1

These variables first need to be combined into a single gestation variable, which can
then be used with the gestage() option:

. generate gestation = gestwks + gestdays/7

Here we use the UK-WHO Term Growth Reference for term babies:

. egen zwtukwho = zanthro(weight,wa,UKWHOterm), xvar(ageyrs) gender(sex)


> gencode(male=1, female=2)
(Z values generated for 5 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)

Here we use the UK-WHO Preterm Growth Reference for preterm babies, adjusting
for gestational age:

. egen zwtukwho_pre = zanthro(weight,wa,UKWHOpreterm), xvar(ageyrs) gender(sex)


> gencode(male=1, female=2) gestage(gestation)
(Z values generated for 5 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)

Note: Where the gestationally corrected age is from 37 to 42 weeks, the UK-WHO
preterm and term growth charts generate different z scores. For example, the gestation-
ally corrected age of a 2-week-old baby girl who was born at 37 weeks gestation is 39
weeks. If her weight is 3.34 kg, the following z scores are generated using the UK-WHO
preterm and term growth charts:

. use zwtukwhoeg, clear


. egen zpreterm = zanthro(weight,wa,UKWHOpreterm), xvar(agewks) gender(sex)
> gencode(male=1, female=2) ageunit(week) gestage(gestation)
(Z value generated for 1 case)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in weeks)
. egen zterm = zanthro(weight,wa,UKWHOterm), xvar(agewks) gender(sex)
> gencode(male=1, female=2) ageunit(week) gestage(gestation)
(Z value generated for 1 case)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in weeks)

. list, noobs abbreviate(9)

sex weight agewks gestation zpreterm zterm

2 3.34 2 37 .2403702 -.0422754

To determine the proportion of children who are thin, normal weight, overweight,
and obese, we can categorize each child by using the following command:

. use zbmicateg, clear


. egen bmicat = zbmicat(bmi), xvar(ageyrs) gender(sex)
> gencode(male=1, female=2)
./zbmicat.dta
(BMI categories generated for 10 cases)
(gender was assumed to be coded male=1, female=2)
(age was assumed to be in years)

Here are the results:

. list, noobs

sex ageyrs bmi bmicat

1 5.95 13.01 Grade 2 thinness


1 9.46 16.43 Normal wt
2 6.71 20.62 Obese
2 6.89 13.45 Grade 1 thinness
2 8.63 18.96 Overweight

1 8.48 17.45 Normal wt


1 7.08 15.65 Normal wt
2 7.56 11.54 Grade 3 thinness
2 9.78 19.56 Normal wt
2 8.25 20.58 Overweight

7 Acknowledgment
This work was supported by the Victorian Government’s Operational Infrastructure
Support Program.

8 References
Binagwaho, A., N. Ratnayake, and M. C. Smith Fawzi. 2009. Holding multilateral
organizations accountable: The failure of WHO in regards to childhood malnutrition.
Health and Human Rights 10(2): 1–4.

Cole, T. J. 1990. The LMS method for constructing normalized growth standards.
European Journal of Clinical Nutrition 44: 45–60.

Cole, T. J., M. C. Bellizzi, K. M. Flegal, and W. H. Dietz. 2000. Establishing a standard
definition for child overweight and obesity worldwide: International survey. British
Medical Journal 320: 1240–1243.
Cole, T. J., K. M. Flegal, D. Nicholls, and A. A. Jackson. 2007. Body mass index
cut offs to define thinness in children and adolescents: International survey. British
Medical Journal 335: 194–201.
Cole, T. J., J. V. Freeman, and M. A. Preece. 1998. British 1990 growth reference cen-
tiles for weight, height, body mass index and head circumference fitted by maximum
penalized likelihood. Statistics in Medicine 17: 407–429.
Cole, T. J., and P. J. Green. 1992. Smoothing reference centile curves: The LMS method
and penalized likelihood. Statistics in Medicine 11: 1305–1319.
Cole, T. J., A. F. Williams, and C. M. Wright. 2011. Revised birth centiles for weight,
length and head circumference in the UK-WHO growth charts. Annals of Human
Biology 38: 7–11.
Kuczmarski, R. J., C. L. Ogden, L. M. Grummer-Strawn, K. M. Flegal, S. S. Guo,
R. Wei, Z. Mei, L. R. Curtin, A. F. Roche, and C. L. Johnson. 2000. CDC growth
charts: United States. Advance Data 314: 1–27.
de Onis, M., A. W. Onyango, E. Borghi, A. Siyam, C. Nishida, and J. Siekmann. 2007.
Development of a WHO growth reference for school-aged children and adolescents.
Bulletin of the World Health Organization 85: 660–667.
Vidmar, S., J. Carlin, K. Hesketh, and T. Cole. 2004. Standardizing anthropometric
measures in children and adolescents with new functions for egen. Stata Journal 4:
50–55.
World Health Organization. 2006. WHO child growth standards: Length/height-for-
age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age:
Methods and development. Geneva: World Health Organization.
———. 2007. WHO child growth standards: Head circumference-for-age, arm
circumference-for-age, triceps skinfold-for-age and subscapular skinfold-for-age:
Methods and development. Geneva: World Health Organization.

About the authors


Suzanna I. Vidmar is a senior research officer in the Clinical Epidemiology and Biostatistics
Unit at the Murdoch Childrens Research Institute and University of Melbourne Department
of Paediatrics at the Royal Children’s Hospital in Melbourne, Australia.
Tim J. Cole is a professor of medical statistics in the MRC Centre of Epidemiology for Child
Health at the UCL Institute of Child Health in London, UK, and has published widely on the
analysis of human growth data.
Huiqi Pan is a statistical programmer in the MRC Centre of Epidemiology for Child Health at
the UCL Institute of Child Health in London, UK.
The Stata Journal (2013)
13, Number 2, pp. 379–381

Bonferroni and Holm approximations for Šidák and Holland–Copenhaver q-values
Roger B. Newson
National Heart and Lung Institute
Imperial College London
London, UK
[email protected]

Abstract. I describe the use of the Bonferroni and Holm formulas as approxi-
mations for Šidák and Holland–Copenhaver formulas when issues of precision are
encountered, especially with q-values corresponding to very small p-values.
Keywords: st0300, parmest, qqvalue, smileplot, multproc, multiple-test procedure,
familywise error rate, Bonferroni, Šidák, Holm, Holland, Copenhaver

1 Introduction
Frequentist q-values for a range of multiple-test procedures are implemented in Stata by
using the package qqvalue (Newson 2010), downloadable from the Statistical Software
Components (SSC) archive. The Šidák q-value for a p-value p is given by
q_sid = 1 − (1 − p)^m, where m is the number of multiple comparisons (Šidák 1967).
It is a less conservative alternative to the Bonferroni q-value, given by
q_bon = min(1, mp). However, the Šidák formula may be incorrectly evaluated by a
computer to 0 when the input p-value is so small that 1 − p is evaluated as exactly 1,
which is the case for p-values of 10^(−17) or less, even in double precision. q-values of 0
are logically possible as a consequence of p-values of 0, but in this case, they may be
overliberal. This
liberalism may possibly be a problem in the future, given the current technology-driven
trend of exponentially increasing multiple comparisons and the human-driven problem
of ingenious data dredging. I present a remedy for this problem and discuss its use in
computing q-values and discovery sets.

2 Methods for q-values


The remedy used by the SSC packages qqvalue and parmest (Newson 2003) is to
substitute the Bonferroni formula for the Šidák formula for such small p-values. This
works because the Bonferroni and Šidák q-values converge in ratio as p tends to 0. To
prove this, I show that for 0 ≤ p < 1,

dq_bon/dp = m   and   dq_sid/dp = m(1 − p)^(m−1)

and that the Šidák/Bonferroni ratio of these derivatives is (1 − p)^(m−1), which is 1 if
p = 0. By L’Hôpital’s rule, it follows that the ratio q_sid/q_bon also tends to 1 as p
tends to 0.


© 2013 StataCorp LP st0300

A similar argument shows that the same problem exists with the q-values output
by the Holland–Copenhaver procedure (Holland and Copenhaver 1987). If the m input
p-values, sorted in ascending order, are denoted p_i for i from 1 to m, then the Holland–
Copenhaver procedure is defined by the formula

s_i = 1 − (1 − p_i)^(m−i+1)

where s_i is the ith s-value. (In the terminology of Newson [2010], s-values are truncated
at 1 to give r-values, which are in turn input into a step-down procedure to give the
eventual q-values.) The remedy used by qqvalue here is to substitute the s-value
formula for the procedure of Holm (1979), which is

s_i = (m − i + 1) p_i

whenever 1 − p_i is evaluated as 1. This also works because the two s-value formulas
converge in ratio as p_i tends to 0. Note that the Holm procedure is derived from the
Bonferroni procedure by using the same step-down method as is used to derive the
Holland–Copenhaver procedure from the Šidák procedure.
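As a minimal sketch of the simpler Šidák-to-Bonferroni substitution described at the
start of this section (an illustration only, not the qqvalue source code), suppose that a
variable p holds one p-value per observation and that all m = _N p-values are
nonmissing; the single-step Šidák q-value with the Bonferroni fallback could then be
computed as

. generate double q = cond(1 - p == 1, min(1, _N*p), 1 - (1 - p)^_N)

where the Bonferroni formula, truncated at 1, replaces the Šidák formula whenever
1 − p is evaluated as exactly 1 in double precision.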

3 Methods for discovery sets


The SSC package smileplot (Newson and the ALSPAC Study Team 2003) also imple-
ments a range of multiple-test procedures by using two commands, multproc and
smileplot. However, instead of outputting q-values, smileplot outputs a corrected
critical p-value threshold and a corresponding discovery set, defined as the subset of
input p-values at or below the corrected critical p-value. The Šidák-corrected critical
p-value corresponding to an uncorrected critical p-value p_unc is given by
c_sid = 1 − (1 − p_unc)^(1/m) and may be overconservative if wrongly evaluated to 0.
In this case, the quantity that might be wrongly computed as 1 is (1 − p_unc)^(1/m).
When this happens, smileplot substitutes the Bonferroni-corrected critical p-value
c_bon = p_unc/m. However, this is a slightly less elegant remedy in this case because
the quantity (1 − p_unc)^(1/m) is usually evaluated to 1 because m is large and not
because p_unc is small.
To study the behavior of the Bonferroni approximation for large m, we define λ = 1/m
and note that

dc_bon/dλ = p_unc   and   dc_sid/dλ = −ln(1 − p_unc)(1 − p_unc)^λ

implying (by L’Hôpital’s rule) that in the limit, as λ tends to 0, the Šidák/Bonferroni
ratio of the two derivatives (and therefore of the two corrected thresholds) tends to
−ln(1 − p_unc)/p_unc. This quantity is not as low as 1 but is 1.150728, 1.053605,
1.025866, and 1.005034 if p_unc is 0.25, 0.10, 0.05, and 0.01, respectively. Therefore,
the Bonferroni
approximation in this case is still slightly conservative for a very large number of multiple
comparisons over a range of commonly used uncorrected critical p-values, but is less
conservative than the value of 0, which would otherwise be computed.
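As a quick check of these constants, the limiting ratio for p_unc = 0.05, for example,
can be computed directly in Stata with

. display -ln(1 - 0.05)/0.05

which reproduces the value 1.025866 quoted above.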
This argument is easily generalized to the Holland–Copenhaver procedure. In this
case, smileplot initially calculates a vector of m candidate critical p-value thresholds
by using the formula

c_i = 1 − (1 − p_unc)^(1/(m−i+1))
for i from 1 to m and selects the corrected critical p-value corresponding to a given
uncorrected critical p-value from these candidates by using a step-down procedure. If
the quantity (1 − p_unc)^(1/(m−i+1)) is evaluated as 1, then smileplot substitutes the
corresponding Holm critical p-value threshold

c_i = p_unc/(m − i + 1)
which again is conservative as m − i + 1 becomes large (corresponding to the smallest
p-values from a large number of multiple comparisons), but is less conservative than the
value of 0, which would otherwise be computed.
Newson (2010) argues that q-values are an improvement on discovery sets because,
given the q-values, different members of the audience can apply different input critical
p-values and derive their own discovery sets. The technical issue of precision presented
here may be one more minor reason for preferring q-values to discovery sets.

4 Acknowledgment
I would like to thank Tiago V. Pereira of the University of São Paulo in Brazil for
drawing my attention to this issue of precision with the Šidák and Holland–Copenhaver
procedures.

5 References
Holland, B. S., and M. D. Copenhaver. 1987. An improved sequentially rejective Bon-
ferroni test procedure. Biometrics 43: 417–423.
Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics 6: 65–70.
Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata
Journal 3: 245–269.
Newson, R., and the ALSPAC Study Team. 2003. Multiple-test procedures and smile
plots. Stata Journal 3: 109–132.
Newson, R. B. 2010. Frequentist q-values for multiple-test procedures. Stata Journal
10: 568–584.
Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal
distributions. Journal of the American Statistical Association 62: 626–633.

About the author


Roger B. Newson is a lecturer in medical statistics at Imperial College London, UK, working
principally in asthma research. He wrote the parmest, qqvalue, and smileplot Stata packages.
The Stata Journal (2013)
13, Number 2, pp. 382–397

Fitting the generalized multinomial logit model in Stata

Yuanyuan Gu
Centre for Health Economics Research and Evaluation
University of Technology, Sydney
Sydney, Australia
[email protected]

Arne Risa Hole
Department of Economics
University of Sheffield
Sheffield, UK
a.r.hole@sheffield.ac.uk

Stephanie Knox
Centre for Health Economics Research and Evaluation
University of Technology, Sydney
Sydney, Australia
[email protected]

Abstract. In this article, we describe the gmnl Stata command, which can be
used to fit the generalized multinomial logit model and its special cases.
Keywords: st0301, gmnl, gmnlpred, gmnlcov, generalized multinomial logit, scale
heterogeneity multinomial logit, maximum simulated likelihood

1 Introduction
Explaining variations in the behaviors of individuals is of central importance in choice
analysis. For the last decade, the most popular explanation has been preference or taste
heterogeneity; that is, some individuals care more about particular product attributes
than do others. This assumption is most naturally represented via random parameter
models, among which the mixed logit (MIXL) model has become the standard to use
(McFadden and Train 2000).
Recently, however, a group of researchers (for example, Louviere et al. [1999], Lou-
viere et al. [2002], Louviere and Eagle [2006], and Louviere et al. [2007]) has argued that
in most choice contexts, much of the preference heterogeneity may be better described
as “scale” heterogeneity; that is, with attribute coefficients fixed, the scale of the id-
iosyncratic error term is greater for some consumers than it is for others. Because the
scale of the error term is inversely related to the error variance, this argument implies
that choice behavior is more random for some consumers than it is for others. Although
the scale of the error term in discrete choice models cannot be separately identified from
the attribute coefficients, it is possible to identify relative scale terms across consumers.
Thus the statement that all heterogeneity is in the scale of the error term “is observa-
tionally equivalent to the statement that heterogeneity takes the form of the vector of
utility weights being scaled up or down proportionately as one ‘looks’ across consumers”
(Fiebig et al. 2010). These arguments have led to the scale heterogeneity multinomial
logit (S-MNL) model, a much more parsimonious model specification than MIXL.


© 2013 StataCorp LP st0301

To accommodate both preference and scale heterogeneity, Fiebig et al. (2010) devel-
oped a generalized multinomial logit (G-MNL) model that nests MIXL and S-MNL. Their
research also shows that the two sources of heterogeneity often coexist but that their
importance varies in different choice contexts.
In this article, we will describe the gmnl Stata command, which can be used to fit the
G-MNL model and its special cases. The command is a generalization of the mixlogit
command developed by Hole (2007). We will also present an empirical example that
demonstrates how to use gmnl, and we will discuss related computational issues.

2 The G-MNL model and its special cases


We assume a sample of N respondents with the choice of J alternatives in T choice
situations.1 Following Fiebig et al. (2010), the G-MNL model gives the probability of
respondent i choosing alternative j in choice situation t as

Pr(choice_it = j | β_i) = exp(β_i'x_itj) / sum_{k=1}^{J} exp(β_i'x_itk)        (1)

    i = 1, ..., N;   t = 1, ..., T;   j = 1, ..., J

where x_itj is a vector of observed attributes of alternative j and β_i is a vector of
individual-specific parameters defined as

β_i = σ_i β + {γ + σ_i(1 − γ)} η_i        (2)

The specification of β_i in (2) is central to G-MNL and differentiates it from previous
heterogeneity models. It depends on a constant vector β, a scalar parameter γ, a
random vector η_i distributed MVN(0, Σ), and σ_i, the individual-specific scale of the
idiosyncratic error.

In Fiebig et al. (2010), γ is constrained to be between 0 and 1. In extreme cases,
γ = 1 leads to G-MNL-I: β_i = σ_i β + η_i, and γ = 0 leads to G-MNL-II:2
β_i = σ_i(β + η_i). To understand the difference between these two models,
Fiebig et al. (2010) describe them with a single equation: β_i = σ_i β + η_i*, where σ_i
captures scale heterogeneity and η_i* captures residual preference heterogeneity.
Through this, we can see that in G-MNL-I, the standard deviation of η_i* is independent
of the scaling of β, whereas in G-MNL-II, it is proportional to σ_i.
However, an article by Keane and Wasi (forthcoming) points out that γ < 0 or γ > 1
still permits sensible behavioral interpretations, and thus there is no reason to impose
the constraint. We follow their advice and allow γ to take any value.

1. We could also consider a different number of alternatives and choice situations for each respondent;
for example, see Greene and Hensher (2010). The gmnl command can handle both of these cases.
2. Greene and Hensher (2010) call this the “scaled mixed logit model”.

Three useful special cases of G-MNL are the following:

• MIXL: β_i = β + η_i (when σ_i = 1)
• S-MNL: β_i = σ_i β (when var(η_i) = 0)
• Standard multinomial logit: β_i = β (when σ_i = 1 and var(η_i) = 0)

The gmnl command includes an option for fitting MIXL models, but we recommend that
mixlogit be used for this purpose because it is usually faster.
To complete the model specification, we need to choose a distribution for σ_i. Although
any distribution defined on the positive real line is a theoretical possibility,
Fiebig et al. (2010) assume that σ_i is distributed lognormally, with the underlying
normal distribution having standard deviation τ and mean σ + θz_i, where σ is a
normalizing constant and z_i is a vector of characteristics of individual i that can be
used to explain why σ_i differs across people.
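Written out explicitly (and anticipating the notation of the next section), this
assumption amounts to σ_i = exp(σ + θz_i + τν_i) with ν_i a standard normal draw.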

3 Maximum simulated likelihood


The log likelihood for G-MNL is given by

LL(β, γ, τ, θ, Σ) = sum_{i=1}^{N} ln [ ∫ { prod_{t=1}^{T} prod_{j=1}^{J}
    Pr(choice_it = j | β_i)^(y_itj) } p(β_i | β, γ, τ, θ, Σ) dβ_i ]        (3)

where y_itj is the observed choice variable, Pr(choice_it = j | β_i) is given by (1), and
p(β_i | β, γ, τ, θ, Σ) is implied by (2).
Maximizing the log likelihood in (3) directly is rather difficult because the integral
does not have a closed-form representation and so must be evaluated numerically. We
choose to approximate it with simulation (see Train [2009], for example). The simulated
likelihood is

SLL(β, γ, τ, θ, Σ) = sum_{i=1}^{N} ln [ (1/R) sum_{r=1}^{R} prod_{t=1}^{T} prod_{j=1}^{J}
    Pr(choice_it = j | β_i^[r])^(y_itj) ]

with

β_i^[r] = σ_i^[r] β + {γ + σ_i^[r](1 − γ)} η_i^[r]
σ_i^[r] = exp(σ + θz_i + τ ν_i^[r])

where η_i^[r] is a vector generated from MVN(0, Σ) and ν_i^[r] is an N(0, 1) scalar.
η_i^[r] and ν_i^[r] are generated using Halton draws (Halton 1964) and pseudorandom
draws, respectively.
When testing the code, we found that this combination works better than using Halton
draws to generate all the random terms.
Following Fiebig et al. (2010), we set the normalizing constant σ as
−ln{ (1/N) sum_{i=1}^{N} exp(τ ν_i^[r]) }, where ν_i^[r] is the rth draw for the ith
person. We also draw ν from a truncated normal distribution with truncation at ±2.

4 The gmnl command


4.1 Syntax
gmnl is implemented as a gf0 ml evaluator. The Halton draws used in the estimation
process are generated using the Mata function halton() (Drukker and Gates 2006).
The generic syntax for the command is as follows:
      
gmnl depvar [varlist] [if] [in], group(varname) [rand(varlist) id(varname)
corr nrep(#) burn(#) gamma(#) scale(matrix) het(varlist) mixl seed(#)
level(#) constraints(numlist) vce(vcetype) maximize_options]

The command gmnlpred can be used following gmnl to obtain predicted probabili-
ties. The predictions are available both in and out of sample; type gmnlpred . . . if
e(sample) . . . if predictions are wanted for the estimation sample only.
     
gmnlpred newvar [if] [in] [, nrep(#) burn(#) ll]

The command gmnlcov can be used following gmnl to obtain the elements in the coeffi-
cient covariance matrix along with their standard errors. This command is only relevant
when the coefficients are specified to be correlated; see the corr option below. gmnlcov
is a wrapper for nlcom (see [R] nlcom).
 
gmnlcov [, sd]

The command gmnlbeta can be used following gmnl to obtain the individual-level pa-
rameters corresponding to the variables in the specified varlist by using the method
proposed by Revelt and Train (2000) (see also Train [2009, chap. 11]). The individual-
level parameters are stored in a data file specified by the user. As with gmnlpred, the
predictions are available both in and out of sample; type gmnlbeta . . . if e(sample)
. . . if predictions are wanted for the estimation sample only.
     
gmnlbeta varlist [if] [in], saving(filename) [replace nrep(#) burn(#)]

4.2 gmnl options


group(varname) specifies a numeric identifier variable for the choice occasions. group()
is required.
rand(varlist) specifies the independent variables whose coefficients are random (nor-
mally distributed). The variables immediately following the dependent variable in
the syntax are specified to have fixed coefficients.

id(varname) specifies a numeric identifier variable for the decision makers. This option
should be specified only when each individual performs several choices, that is, when
the dataset is a panel.
corr specifies that the random coefficients be correlated. The default is that they
are independent. When the corr option is specified, the estimated parameters are
the means of the (fixed and random) coefficients plus the elements of the lower-
triangular matrix L, where the covariance matrix for the random coefficients is given
by Σ = LL'. The estimated parameters are reported in the following order: the
means of the fixed coefficients, the means of the random coefficients, and the elements
of the L matrix. The gmnlcov command can be used postestimation to obtain the
elements in the Σ matrix along with their standard errors.
If the corr option is not specified, the estimated parameters are the means of the
fixed coefficients and the means and standard deviations of the random coefficients,
reported in that order. The sign of the estimated standard deviations is irrelevant.
Although in practice the estimates may be negative, interpret them as being positive.
The sequence of the parameters is important to bear in mind when specifying starting
values.
nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).
burn(#) specifies the number of initial sequence elements to drop when creating the
Halton sequences. The default is burn(15). Specifying this option helps reduce the
correlation between the sequences in each dimension. Train (2009, 227) recommends
that # should be at least as large as the largest prime number used to generate the
sequences. If there are K random coefficients, gmnl uses the first K primes to
generate the Halton draws.
gamma(#) constrains the gamma parameter to the specified value in the estimations.
scale(matrix) specifies a matrix whose elements indicate whether their corresponding
variable will be scaled (1 = scaled and 0 = not scaled). The matrix should have
one row, and the number of columns should be equal to the number of explanatory
variables in the model.
het(varlist) specifies the variables in the zi vector (if any).
mixl specifies that a mixed logit model should be estimated instead of a G-MNL model.
seed(#) specifies the seed. The default is seed(12345).
level(#); see [R] estimation options.
constraints(numlist); see [R] estimation options.
vce(vcetype); vcetype may be oim, robust, cluster clustvar, or opg; see [R] vce option.
maximize options: difficult, technique(algorithm spec), iterate(#), trace,
gradient, showstep, hessian, tolerance(#), ltolerance(#), gtolerance(#),
nrtolerance(#), and from(init specs); see [R] maximize.

4.3 gmnlpred options


nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).
burn(#) specifies the number of initial sequence elements to drop when creating the
Halton sequences. The default is burn(15). Specifying this option helps reduce the
correlation between the sequences in each dimension. Train (2009, 227) recommends
that # should be at least as large as the largest prime number used to generate the
sequences. If there are K random coefficients, gmnl uses the first K primes to
generate the Halton draws.
ll estimates individual log likelihoods.

4.4 gmnlcov option


sd reports the standard deviations of the correlated coefficients instead of the covariance
matrix.

4.5 gmnlbeta options


saving(filename) saves individual-level parameters to filename. saving() is required.
replace overwrites filename.
nrep(#) specifies the number of draws used for the simulation. The default is nrep(50).
burn(#) specifies the number of initial sequence elements to drop when creating the
Halton sequences. The default is burn(15). Specifying this option helps reduce the
correlation between the sequences in each dimension. Train (2009, 227) recommends
that # should be at least as large as the largest prime number used to generate the
sequences. If there are K random coefficients, gmnl uses the first K primes to
generate the Halton draws.

5 Computational issues
As in any model estimated by maximum simulated likelihood, the parameter estimates
of G-MNL depend on four factors: the random-number seed, the number of draws, the
starting values, and the optimization method. If these four factors are held fixed, the
same maximum simulated likelihood estimates are obtained each time the model is fit.

To have a good approximation of the likelihood, we must use a reasonable number


of draws: the more draws used, the better the accuracy. However, a larger number of
draws almost surely leads to a longer computation time. To determine the minimum
number of draws for a desirable level of accuracy remains a theoretical challenge, but
empirically, we may run the gmnl command several times with an increasing number of
draws in each run (with the other three factors fixed) until the estimates stabilize. We
should mention that too few draws may lead to serious convergence problems: a good
starting point is 500 draws.
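For instance, the check could be carried out along the following lines (a sketch only;
the variable names y, x1, x2, gid, and rid are hypothetical):

. gmnl y x1, group(gid) id(rid) rand(x2) nrep(500) seed(12345)
. estimates store draws500
. gmnl y x1, group(gid) id(rid) rand(x2) nrep(1000) seed(12345)
. estimates store draws1000
. estimates table draws500 draws1000, b(%9.4f) se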
Starting values are crucial to achieve convergence, especially for the full model, that
is, G-MNL with correlated random parameters (G-MNL correlated). If we see optimization
as climbing a hill, then where we start climbing is one of the major factors that
determine how long it will take to reach the top, or whether we reach it at all. If we
start from the bottom, where the territory is often flat, the direction guidance (that is,
the first-order derivatives of the likelihood) may not function well and may lead us
farther away
from the top. For this reason, the default starting values based on the multinomial logit
estimates may not be the best, and we often need to choose our own set of starting
values.
To estimate “G-MNL correlated”, we have tried four different sets of starting values:
G-MNL uncorrelated, MIXL uncorrelated, MIXL correlated, and G-MNL correlated with γ
fixed as 0. In most cases, these four sets of starting values would all lead to convergence,
but the speed might be very different. In any case, we should not be content with only
one set of starting values because even if the model converges, it is not guaranteed
that we have reached the global maximum. We suggest running the routine multiple
times, each with different starting values, and reporting the estimates from the run that
obtains the largest likelihood.
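One such strategy, sketched here with hypothetical variable names, is to fit the
uncorrelated G-MNL first and pass its coefficient vector to the correlated model through
the from() option (parameters whose names do not match are then simply initialized
at their defaults):

. gmnl y x1, group(gid) id(rid) rand(x2 x3) nrep(500)
. matrix b0 = e(b)
. gmnl y x1, group(gid) id(rid) rand(x2 x3) corr nrep(500) from(b0, skip)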
The choice of optimization method is another important factor that affects model
convergence. Stata allows four options: Newton–Raphson (default), Berndt–Hall–Hall–
Hausman, Davidon–Fletcher–Powell, and Broyden–Fletcher–Goldfarb–Shanno. Chap-
ter 8 in Train (2009) describes these methods in detail and concludes that Broyden–
Fletcher–Goldfarb–Shanno usually performs better than the others. With mixlogit
and gmnl, however, we have found that Newton–Raphson often works best in the sense
that it is more likely to converge than the alternative algorithms. The only problem
with Newton–Raphson is that it can be very slow when there are a lot of parameters to
estimate.
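The algorithm is selected through the standard technique() option; for example, again
with hypothetical variable names, Broyden–Fletcher–Goldfarb–Shanno can be requested
with

. gmnl y x1, group(gid) id(rid) rand(x2) nrep(500) technique(bfgs)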
Finally, we have found that in some cases, different computers can give different
results if there are several random parameters in the model and γ is freely estimated.
This can happen when the model is numerically unstable and different numbers of
processors are used during estimation.

6 Empirical example
We will now present some examples that demonstrate how the gmnl command can
be used to fit the different models described in section 2. We will start by fitting
some relatively simple models, and then we will build up the complexity gradually.
The data used in the examples come from a stated preference study on Australian
women who were asked to choose whether to have a pap smear test; see Fiebig and Hall
(2004). There were 79 women in the sample, and each respondent was presented with
32 scenarios. Thus in terms of the model structure described in section 2, N = 79,
T = 32, and J = 2. The dataset also contains five attributes, which are described
in table 1. Besides these five attributes, an alternative specific constant (ASC) will be
used to measure intangible aspects of the pap smear test not captured by the design
attributes (some women would choose or not choose the test just because of these
intangible aspects no matter what attributes they are presented with).

Table 1. Pap smear test data. Definition of variables.

Variable Definition
knowgp 1 if the general practitioner is known to the patient; 0 otherwise
malegp 1 if the general practitioner is male; 0 if the general practitioner
is female
testdue 1 if the patient is due or overdue for a pap smear test; 0 otherwise
drrec 1 if the general practitioner recommends that the patient have a pap
smear test; 0 otherwise
cost cost of test (unit: 10 Australian dollars)

To give an impression of how the data are structured, we have listed the first six
observations below. Each observation corresponds to an alternative, and the dependent
variable y is 1 for the chosen alternative in each choice situation and 0 otherwise.
gid identifies the alternatives in a choice situation; rid identifies the choice situations
faced by a given individual; and the remaining variables are the alternative attributes
described in table 1 and the ASC (dummy_test). In the listed data, the same individual
faces three choice situations.

. use paptest.dta
. generate cost = papcost/10
. list y dummy_test knowgp malegp testdue drrec cost gid rid in 1/6, sep(2)
> abb(10)

y dummy_test knowgp malegp testdue drrec cost gid rid

1. 0 1 1 0 0 0 2 1 1
2. 1 0 0 0 0 0 0 1 1

3. 0 1 1 0 0 1 2 2 1
4. 1 0 0 0 0 0 0 2 1

5. 0 1 0 1 0 1 2 3 1
6. 1 0 0 0 0 0 0 3 1

We start by fitting a relatively simple S-MNL model with a fixed (nonrandom) ASC.
Fiebig et al. (2010) have pointed out that ASCs should not be scaled, because they are
fundamentally different from observed attributes. We can fit the model with a fixed
ASC by using the scale() option of gmnl as described below.

. /*S-MNL with fixed ASC*/


. matrix scale = (0,1,1,1,1,1)
. gmnl y dummy_test knowgp malegp testdue drrec cost, group(gid) id(rid)
> nrep(500) scale(scale)
Iteration 0: log likelihood = -1452.649 (not concave)
(output omitted )
Iteration 7: log likelihood = -1123.7542
Generalized multinomial logit model Number of obs = 5056
Wald chi2(6) = 238.93
Log likelihood = -1123.7542 Prob > chi2 = 0.0000
(Std. Err. adjusted for clustering on rid)

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

dummy_test -1.938211 .13976 -13.87 0.000 -2.212136 -1.664286


knowgp 1.811842 .4322376 4.19 0.000 .9646719 2.659012
malegp -.9527219 .319577 -2.98 0.003 -1.579081 -.3263625
testdue 5.305355 1.229367 4.32 0.000 2.895841 7.714869
drrec 2.656325 .6021513 4.41 0.000 1.47613 3.83652
cost .0043925 .0932526 0.05 0.962 -.1783792 .1871643

/tau 1.458027 .1726576 8.44 0.000 1.119624 1.79643

The sign of the estimated standard deviations is irrelevant: interpret them as
being positive

To avoid scaling the ASC, we create a matrix whose elements indicate whether their
corresponding variable will be scaled (1 = scaled and 0 = not scaled). Here the “scale”
matrix defined as (0, 1, 1, 1, 1, 1) corresponds to the variables in the order in which they
are specified in the model (dummy_test, knowgp, etc.). Therefore, among these six
variables, only dummy_test (that is, the ASC) is not scaled.

We should mention that the number of observations reported in the table, 5,056, is
N × T × J, that is, the total number of choices times the number of alternatives. For
most purposes, such as computing information criteria, it is more appropriate to use
the total number of choices (N × T ); therefore, we do not recommend that you use the
estat ic command after gmnl.
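For example, a Bayesian information criterion based on the number of choices can be
computed by hand after the model above (a sketch: 7 is the number of estimated
parameters, the six coefficients plus τ, and 2,528 = 79 × 32 is the number of choices):

. display "BIC = " -2*e(ll) + 7*ln(2528)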
We then let dummy_test be random, which leads to our second model: S-MNL with
random ASC.3

. /*S-MNL with random ASC*/


. matrix scale = (1,1,1,1,1,0)
. gmnl y knowgp malegp testdue drrec cost, group(gid) id(rid) rand(dummy_test)
> nrep(500) scale(scale) gamma(0)
Iteration 0: log likelihood = -1431.8448 (not concave)
(output omitted )
Iteration 8: log likelihood = -1061.7787
Generalized multinomial logit model Number of obs = 5056
Wald chi2(6) = 111.46
Log likelihood = -1061.7787 Prob > chi2 = 0.0000
(Std. Err. adjusted for clustering on rid)

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

Mean
knowgp .6263819 .1431434 4.38 0.000 .3458261 .9069378
malegp -1.350731 .2024933 -6.67 0.000 -1.74761 -.9538514
testdue 2.954924 .2950128 10.02 0.000 2.37671 3.533139
drrec .7730114 .1608242 4.81 0.000 .4578018 1.088221
cost -.1701498 .0585679 -2.91 0.004 -.2849408 -.0553588
dummy_test -.7052151 .3578936 -1.97 0.049 -1.406674 -.0037565

SD
dummy_test 2.660664 .2798579 9.51 0.000 2.112152 3.209175

/tau .9255689 .1032721 8.96 0.000 .7231593 1.127978

The sign of the estimated standard deviations is irrelevant: interpret them as being positive

3. Strictly speaking, this is not an S-MNL but a parsimonious form of G-MNL; that is, we model
ASCs using preference heterogeneity but model other attributes using scale heterogeneity. This
specification of G-MNL has been used by Fiebig et al. (2011) and Knox et al. (2013).

Comparing “S-MNL with fixed ASC” with “S-MNL with random ASC”, we can see that
the latter model improved the model fit by adding one more parameter, the standard
deviation of dummy test, which is statistically significant.4 The improvement in fit is
not surprising because the random ASC captures preference heterogeneity and allows
for correlation across choice situations because of the panel nature of the data. The
parameter estimates between the two models are somewhat different, but they cannot
be compared directly because of differences in scale across models, as indicated by the
estimate of τ . Instead, we should run the gmnlpred command to compare the predicted
probabilities. We shall demonstrate how to do predictions after fitting the full G-MNL
model.
The third example is a G-MNL model in which dummy test, testdue, and drrec
are given random coefficients. For the moment, the coefficients are specified to be
uncorrelated; that is, the off-diagonal elements of Σ are all 0. To speed up the estimation,
we constrain γ to 0 by using the gamma(0) option, which implies that the fitted model
is a G-MNL-II (or “scaled mixed logit”).

. /*G-MNL with uncorrelated random coefficients*/
. matrix scale = 1,1,1,0,1,1
. gmnl y knowgp malegp cost, group(gid) id(rid) rand(dummy_test testdue drrec)
> nrep(500) scale(scale) gamma(0)
Iteration 0: log likelihood = -1414.4896 (not concave)
(output omitted )
Iteration 19: log likelihood = -991.41088
Generalized multinomial logit model Number of obs = 5056
Wald chi2(6) = 67.25
Log likelihood = -991.41088 Prob > chi2 = 0.0000
(Std. Err. adjusted for clustering on rid)

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

Mean
knowgp .9123367 .1748867 5.22 0.000 .5695651 1.255108
malegp -2.742707 .3543753 -7.74 0.000 -3.43727 -2.048144
cost -.1419785 .0637313 -2.23 0.026 -.2668895 -.0170675
dummy_test -.4904328 .2685654 -1.83 0.068 -1.016811 .0359457
testdue 5.79628 .8667601 6.69 0.000 4.097462 7.495099
drrec 1.492487 .2652157 5.63 0.000 .9726734 2.0123

SD
dummy_test 2.988542 .3213189 9.30 0.000 2.358769 3.618316
testdue 3.166774 .4859329 6.52 0.000 2.214363 4.119185
drrec 1.356382 .194595 6.97 0.000 .974983 1.737781

/tau 1.177626 .115934 10.16 0.000 .9503993 1.404852

The sign of the estimated standard deviations is irrelevant: interpret them as being positive
. *Save coefficients for later use
. matrix b = e(b)

4. Note that we constrain γ to 0 by using the gamma(0) option. This is to prevent gmnl from attempting
to estimate the gamma parameter, because it is not identified in this model.

The square roots of the diagonal elements of Σ are estimated and shown in the block
under SD. All the standard deviations are significantly different from 0, which suggests
the presence of substantial preference heterogeneity in the data.
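One way to read these estimates substantively is to translate a mean and standard
deviation into the implied share of respondents with a negative coefficient, assuming
the usual normal mixing distribution for the random coefficients. A minimal sketch
using the point estimates for testdue reported above:

. * Illustrative: implied share of respondents with a negative testdue
. * coefficient under a normal mixing distribution
. display "P(beta_testdue < 0) = " normal(-5.79628/3.166774)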
In the last example, we allow the random coefficients of dummy test, testdue, and
drrec to be correlated, which implies that the off-diagonal elements of Σ will not be
fixed as zeros. Instead of using the default starting values, we use the parameters from
the previous model, setting the starting values for the off-diagonal elements of Σ to 0.

. *Starting values
. matrix start = b[1,1..7],0,0,b[1,8],0,b[1,9..10]
. /*G-MNL with correlated random coefficients*/
. gmnl y knowgp malegp cost, group(gid) id(rid) rand(dummy_test testdue drrec)
> nrep(500) from(start,copy) scale(scale) corr gamma(0)
Iteration 0: log likelihood = -991.41088 (not concave)
(output omitted )
Iteration 8: log likelihood = -987.7783
Generalized multinomial logit model Number of obs = 5056
Wald chi2(6) = 57.86
Log likelihood = -987.7783 Prob > chi2 = 0.0000
(Std. Err. adjusted for clustering on rid)

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

knowgp 1.016481 .1871831 5.43 0.000 .6496085 1.383353


malegp -3.082839 .4511159 -6.83 0.000 -3.96701 -2.198668
cost -.1506127 .0695989 -2.16 0.030 -.2870241 -.0142012
dummy_test -.5499832 .2755279 -2.00 0.046 -1.090008 -.0099584
testdue 6.372514 .9780339 6.52 0.000 4.455603 8.289425
drrec 2.488203 .5429958 4.58 0.000 1.423951 3.552455

/l11 2.749374 .3800353 7.23 0.000 2.004518 3.494229


/l21 -.155092 .2641671 -0.59 0.557 -.67285 .3626659
/l31 1.116604 .3198175 3.49 0.000 .4897736 1.743435
/l22 3.423451 .5413485 6.32 0.000 2.362427 4.484474
/l32 .719799 .3048254 2.36 0.018 .1223522 1.317246
/l33 1.448777 .2470165 5.87 0.000 .9646339 1.932921
/tau 1.250552 .1531374 8.17 0.000 .9504077 1.550695

The six parameters from l11 to l33 are the elements of the lower-triangular matrix
L, the Cholesky factorization of Σ (Σ = LL′). Given the estimate of L, we may recover
Σ and the standard deviations of the random coefficients by using gmnlcov:

. gmnlcov
v11: [l11]_b[_cons]*[l11]_b[_cons]
v21: [l21]_b[_cons]*[l11]_b[_cons]
v31: [l31]_b[_cons]*[l11]_b[_cons]
v22: [l21]_b[_cons]*[l21]_b[_cons] + [l22]_b[_cons]*[l22]_b[_cons]
v32: [l31]_b[_cons]*[l21]_b[_cons] + [l32]_b[_cons]*[l22]_b[_cons]
v33: [l31]_b[_cons]*[l31]_b[_cons] + [l32]_b[_cons]*[l32]_b[_cons] +
> [l33]_b[_cons]*[l33]_b[_cons]

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

v11 7.559055 2.089718 3.62 0.000 3.463284 11.65483


v21 -.4264059 .7303392 -0.58 0.559 -1.857844 1.005033
v31 3.069963 1.026024 2.99 0.003 1.058993 5.080933
v22 11.74407 3.674788 3.20 0.001 4.541615 18.94652
v32 2.29102 1.166262 1.96 0.049 .0051882 4.576851
v33 3.863872 1.571455 2.46 0.014 .7838763 6.943867

. gmnlcov, sd
dummy_test: sqrt([l11]_b[_cons]*[l11]_b[_cons])
testdue: sqrt([l21]_b[_cons]*[l21]_b[_cons] + [l22]_b[_cons]*[l22]_b[_cons])
drrec: sqrt([l31]_b[_cons]*[l31]_b[_cons] +
> [l32]_b[_cons]*[l32]_b[_cons] + [l33]_b[_cons]*[l33]_b[_cons])

y Coef. Std. Err. z P>|z| [95% Conf. Interval]

dummy_test 2.749374 .3800353 7.23 0.000 2.004518 3.494229


testdue 3.426962 .5361583 6.39 0.000 2.376111 4.477813
drrec 1.965673 .3997244 4.92 0.000 1.182228 2.749119
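As a manual cross-check on gmnlcov, Σ can be rebuilt directly from the reported
Cholesky elements; the sketch below simply types in the point estimates of l11 through
l33 from the output above and forms LL′ (standard errors are, of course, not reproduced
this way).

. * Cross-check: rebuild Sigma = L*L' from the reported Cholesky elements
. matrix L = (2.749374, 0, 0 \ -.155092, 3.423451, 0 \ 1.116604, .719799, 1.448777)
. matrix Sigma = L*L'
. matrix list Sigma

The diagonal of Sigma should reproduce v11, v22, and v33 from the gmnlcov output.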

There are other useful postestimation commands besides gmnlcov. For example, to
generate predicted probabilities, we may use the gmnlpred command:

. gmnlpred p_hat, nrep(500)


. list rid gid y p_hat in 1/4

rid gid y p_hat

1. 1 1 0 .51986608
2. 1 1 1 .48013392
3. 1 2 0 .63789177
4. 1 2 1 .36210823
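A quick sanity check, consistent with the listing above, is that the predicted
probabilities sum to one within each choice situation:

. * Check: predicted probabilities sum to (approximately) 1 within
. * each choice situation
. bysort rid gid: egen double p_sum = total(p_hat)
. summarize p_sum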

Moreover, if we are also interested in estimating individual log likelihoods, we may


use the ll option of gmnlpred:

. gmnlpred loglik, nrep(500) ll


. list rid loglik in 1

rid loglik

1. 1 -22.560377
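Because the listing suggests that loglik holds one value per respondent (repeated
across that respondent’s rows), a simple consistency check, up to simulation noise, is
that these values sum to the overall simulated log likelihood reported by gmnl; a
hypothetical sketch:

. * Hypothetical check: individual log likelihoods should sum to
. * (approximately) the overall simulated log likelihood
. egen byte first = tag(rid)
. quietly summarize loglik if first
. display "Sum of individual log likelihoods = " r(sum)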

Finally, we may use the gmnlbeta command to calculate individual-level parameters


(Revelt and Train 2000):

. gmnlbeta dummy_test testdue drrec, nrep(500) saving(beta) replace
file beta.dta saved
. use beta.dta, clear
. list rid dummy_test testdue drrec in 1/4

rid dummy_test testdue drrec

1. 1 -1.5343953 1.1355702 .36138015


2. 2 -3.1375251 .93454325 .06251802
3. 3 -3.7519709 6.2841202 1.3722745
4. 4 .83317825 1.5324372 .88101046

A file is now created and saved as the beta.dta dataset, which contains all the estimated
individual β’s.
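With beta.dta in memory, the individual-level coefficients can be explored like any
other variables, for example, by summarizing them or plotting their distribution:

. * Summarize and plot the individual-level coefficients saved by gmnlbeta
. summarize dummy_test testdue drrec
. histogram testdue, name(beta_testdue, replace)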

7 Conclusion
In this article, we described the gmnl Stata command, which can be used to fit the
G-MNL model and its variants. As pointed out in Fiebig et al. (2010), G-MNL is very
flexible and nests a rich family of model specifications. In the previous sections, we
demonstrated several important models, which are summarized (along with some other
useful specifications) below in table 2. This list does not exhaust all the possible models
that the gmnl routine can estimate. One example is the type of model considered in
Fiebig et al. (2011) and Knox et al. (2013), which includes interaction terms between
sociodemographic variables and ASCs.
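As a purely hypothetical sketch of such a specification (it assumes a respondent-level
variable age that is not part of the data used above), the interaction would enter as an
additional fixed variable, with the scale vector extended so that neither the ASC nor its
interaction is scaled:

. * Hypothetical sketch only: age is an assumed respondent-level variable;
. * the two zeros in scale2 correspond to asc_age and dummy_test
. generate asc_age = dummy_test*age
. matrix scale2 = 1,1,1,0,0,1,1
. gmnl y knowgp malegp cost asc_age, group(gid) id(rid) nrep(500)
> rand(dummy_test testdue drrec) scale(scale2) gamma(0)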
Finally, a word of warning: While we have found that the gmnl command can
be used successfully to implement a range of model specifications, analysts need to
bear in mind that estimation times can be substantial when fitting complex models
with large datasets. As discussed in section 5, it may also be necessary to experiment
with alternative starting values, number of draws, and estimation algorithms to achieve
convergence.

Table 2. Special cases of G-MNL and their Stata commands


Model Command
MIXL gmnl y, group(csid) id(id) rand(x) mixl
S-MNL gmnl y x, group(csid) id(id)
S-MNL+fixed ASC gmnl y asc x, group(csid) id(id) scale(scale)
S-MNL+random ASC gmnl y x, group(csid) id(id) rand(asc) scale(scale) gamma(0)
G-MNL(uncorrelated) gmnl y, group(csid) id(id) rand(x)
G-MNL(correlated) gmnl y, group(csid) id(id) rand(x) corr
G-MNL(uncorrelated)+fixed ASC gmnl y asc, group(csid) id(id) rand(x) scale(scale)
G-MNL(correlated)+fixed ASC gmnl y asc, group(csid) id(id) rand(x) scale(scale) corr
G-MNL(uncorrelated)+random ASC gmnl y, group(csid) id(id) rand(asc x) scale(scale)
G-MNL(correlated)+random ASC gmnl y, group(csid) id(id) rand(asc x) scale(scale) corr

8 Acknowledgments
We are grateful to a referee and to Kristin MacDonald of StataCorp for helpful com-
ments. The research of Yuanyuan Gu and Stephanie Knox was partially supported by
a Faculty of Business Research Grant at the University of Technology in Sydney.

9 References
Drukker, D. M., and R. Gates. 2006. Generating Halton sequences using Mata. Stata
Journal 6: 214–228.

Fiebig, D. G., and J. Hall. 2004. Discrete choice experiments in the analysis of health
policy. Productivity Commission Conference: Quantitative Tools for Microeconomic
Policy Analysis 6: 119–136.

Fiebig, D. G., M. P. Keane, J. Louviere, and N. Wasi. 2010. The generalized multinomial
logit model: Accounting for scale and coefficient heterogeneity. Marketing Science 29:
393–421.

Fiebig, D. G., S. Knox, R. Viney, M. Haas, and D. J. Street. 2011. Preferences for new
and existing contraceptive products. Health Economics 20 (Suppl.): 35–52.

Greene, W. H., and D. A. Hensher. 2010. Does scale heterogeneity across individuals
matter? An empirical assessment of alternative logit models. Transportation 37:
413–428.

Halton, J. H. 1964. Algorithm 247: Radical-inverse quasi-random point sequence.
Communications of the ACM 7: 701–702.

Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood.
Stata Journal 7: 388–401.

Keane, M., and N. Wasi. Forthcoming. Comparing alternative models of heterogeneity
in consumer choice behavior. Journal of Applied Econometrics.

Knox, S. A., R. C. Viney, Y. Gu, A. R. Hole, D. G. Fiebig, D. J. Street, M. R. Haas,
E. Weisberg, and D. Bateson. 2013. The effect of adverse information and positive
promotion on women’s preferences for prescribed contraceptive products. Social
Science and Medicine 83: 70–80.

Louviere, J., and T. Eagle. 2006. Confound it! That pesky little scale constant messes
up our convenient assumptions. In Proceedings of the Sawtooth Software Conference,
211–228. Sequim, WA: Sawtooth Software.
Louviere, J. J., R. J. Meyer, D. S. Bunch, R. T. Carson, B. Dellaert, W. M. Hanemann,
D. Hensher, and J. Irwin. 1999. Combining sources of preference data for modeling
complex decision processes. Marketing Letters 10: 205–217.

Louviere, J. J., D. Street, L. Burgess, N. Wasi, T. Islam, and A. A. J. Marley. 2007.
Modeling the choices of individual decision-makers by combining efficient choice
experiment designs with extra preference information. Journal of Choice Modelling 1:
128–163.

Louviere, J. J., D. Street, R. Carson, A. Ainslie, J. R. Deshazo, T. Cameron, D. Hensher,
R. Kohn, and T. Marley. 2002. Dissecting the random component of utility. Marketing
Letters 13: 177–193.
McFadden, D., and K. Train. 2000. Mixed MNL models for discrete response. Journal
of Applied Econometrics 15: 447–470.

Revelt, D., and K. Train. 2000. Customer-specific taste parameters and mixed logit:
Households’ choice of electricity supplier. Working Paper No. E00-274, Department
of Economics, University of California, Berkeley.

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.

About the authors


Yuanyuan Gu is a research fellow at the Centre for Health Economics Research and Evaluation,
University of Technology in Sydney. His recent research focuses on discrete choice modeling
with applications in health economics.
Arne Risa Hole is a senior lecturer in economics at the University of Sheffield in the UK. His
research interests lie in the area of applied microeconometrics, with a focus on health and labor
economics. Since obtaining his PhD, he has been particularly interested in stated preference
methods and the econometric analysis of discrete choice data.
Stephanie Knox is a research fellow at the Centre for Health Economics Research and Eval-
uation, University of Technology in Sydney. Her research interests include the design and
application of stated preference methods in the area of health service research.
The Stata Journal (2013)
13, Number 2, pp. 398–400

Speaking Stata: Creating and varying box plots: Correction
Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
[email protected]
A previous article (Cox 2009) discussed the creation of box plots from first principles,
particularly when a box plot is desired that graph box or graph hbox cannot provide.
This update reports and corrects an error in my code given in that article. The
problems are centered on page 484. The question is how to calculate the positions of
the ends of the so-called whiskers.
To make this more concrete, the article’s example starts with

. sysuse lifeexp
. egen upq = pctile(lexp), by(region) p(75)
. egen loq = pctile(lexp), by(region) p(25)
. generate iqr = upq - loq

and that holds good.


Given interquartile range (IQR), the position of the end of the upper whisker is
that of the largest value not greater than the upper quartile + 1.5 IQR. Similarly, the
position of the end of the lower whisker is that of the smallest value not less than the
lower quartile − 1.5 IQR.
The problem lines are on page 484:

. egen upper = max(min(lexp, upq + 1.5 * iqr)), by(region)
. egen lower = min(max(lexp, loq - 1.5 * iqr)), by(region)

This code works correctly if there are no values beyond where the whiskers should end.
Otherwise, it yields upper quartile + 1.5 IQR as the position of the upper whisker,
but this position will be correct only if there are values equal to that. Commonly,
that position will be too high. A similar problem applies to the lower whisker, which
commonly will be too low.
More careful code might be

. egen upper2 = max(lexp / (lexp < upq + 1.5 * iqr)), by(region)
. egen lower2 = min(lexp / (lexp > loq - 1.5 * iqr)), by(region)

That division / may look odd if you have not seen it before in similar examples. But it
is very like a common kind of conditional notation often seen,


© 2013 StataCorp LP gr0039 1

max(argument | condition)
or
min(argument | condition)
where we seek the maximum or minimum of some argument, restricting attention to
cases in which a specified condition is satisfied, or true.
The connection is given in this way. Divide an argument by a logical expression that
evaluates to 1 when the expression is true and 0 otherwise. The result is that the argument
remains unchanged on division by 1 but evaluates as missing on division by 0. In any
context where Stata ignores missings, that is what is wanted. True cases are included
in the computation, and false cases are excluded.
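For readers meeting the device for the first time, a minimal illustration with the
lifeexp data already in memory is to find, within each region, the largest life
expectancy below 70: dividing lexp by the logical expression (lexp < 70) turns the
remaining values into missing, which max() then ignores.

. * Largest lexp below 70 within each region, using the divide-by-zero device
. egen maxlexp_lt70 = max(lexp / (lexp < 70)), by(region)
. tabdisp region, c(maxlexp_lt70)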
This “divide by zero” trick appears not to be widely known. There was some pub-
licity within a later article (Cox 2011).
Turning back to the box plots, we will see what the difference is in our example.

. tabdisp region, c(upper upper2 lower lower2)

Region          upper   upper2   lower   lower2

Eur & C.Asia       79       79      65       65
N.A.               79       79    58.5       64
S.A.               75       75      63       67

Here upper2 and lower2 are from the more careful code just given, and upper and
lower are from the code in the 2009 column. The results can be the same but need not
be.
Checking Stata’s own box plot

. graph box lexp, over(region) yli(75 79 64 65 67)

shows consistency with the corrected code.


Thanks to Sheena G. Sullivan, UCLA, who identified the problem on Statalist
(http://www.stata.com/statalist/archive/2013-03/msg00906.html).

1 References
Cox, N. J. 2009. Speaking Stata: Creating and varying box plots. Stata Journal 9:
478–496.

———. 2011. Speaking Stata: Compared with ... Stata Journal 11: 305–314.

About the author


Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks,
postings, FAQs, and programs to the Stata user community. He has also coauthored 15 com-
mands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor
of the Stata Journal.
The Stata Journal (2013)
13, Number 2, pp. 401–405

Stata tip 115: How to properly estimate the multinomial probit model with
heteroskedastic errors
Michael Herrmann
Department of Politics and Public Administration
University of Konstanz
Konstanz, Germany
[email protected]

Models for multinomial outcomes are frequently used to analyze individual decision
making in consumer research, labor market research, voting, and other areas. The
multinomial probit model provides a flexible approach to analyzing decisions in these
fields because it does not impose some of the restrictive assumptions inherent in the
often-used conditional logit approach. In particular, multinomial probit 1) relaxes the
assumption of independent error terms, allowing for correlation in individual choices
across alternatives, and 2) does not impose the assumption of identically distributed
errors, allowing unobserved factors to affect the choice of some alternatives more strongly
than others (that is, heteroskedasticity).
By default, asmprobit relaxes both the assumptions of independence and
homoskedasticity. To avoid overfitting, however, the researcher may sometimes wish to relax
these assumptions one at a time.1 A seemingly straightforward solution would be to rely
on the options stddev() and correlation(), which allow the user to set the structure
for the error variances and their covariances, respectively (see [R] asmprobit).
When doing so, however, the user should be aware that specifying std(het) and
corr(ind) does not actually fit a pure heteroskedastic multinomial probit model. With
J outcome categories, if errors are independent, J − 1 error variances are identified (see
below). Instead, Stata estimates J − 2 error variances and, hence, imposes an additional
constraint, which causes the model to be overidentified. As a result, the estimated model
is not invariant to the choice of base and scale outcomes; that is, changing the base or
scale outcome leads to different values of the likelihood function.
To properly estimate a pure heteroskedastic model, the user needs to define the
structure of the error variances manually. This is easy to accomplish using the pattern
or fixed option. The following example illustrates the problem and shows how to
estimate the model correctly.

1. Another reason to relax them one at a time is that heteroskedasticity and error correlation cannot
be distinguished from each other in the default specification. That is, one cannot simply look
at the estimated covariance matrix of the errors and see whether the errors are heteroskedastic,
correlated, or both. What Stata estimates is the normalized covariance matrix of error differences
whose elements do not allow one to draw any conclusions on the covariance structure of the errors
themselves.


© 2013 StataCorp LP st0302

Consider an individual’s choice of travel mode with the alternatives being air, train,
bus, and car and predictor variables, including general cost of travel, terminal time,
household income, and traveling group size. One might suspect the choice of some
alternatives to be driven more by unobserved factors than the choice of others. For
example, there might be more unobserved reasons related to an individual’s decision to
travel by plane than by train, bus, or car. Allowing the error variances associated with
the alternatives to differ, we fit the following model:
. use http://www.stata-press.com/data/r12/travel
. asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(het) corr(ind) nolog
Alternative-specific multinomial probit Number of obs = 840
Case variable: id Number of cases = 210
Alternative variable: mode Alts per case: min = 4
avg = 4.0
max = 4
Integration sequence: Hammersley
Integration points: 200 Wald chi2(8) = 71.57
Log simulated-likelihood = -181.81521 Prob > chi2 = 0.0000

choice Coef. Std. Err. z P>|z| [95% Conf. Interval]

mode
travelcost -.012028 .0030838 -3.90 0.000 -.0180723 -.0059838
termtime -.050713 .0071117 -7.13 0.000 -.0646517 -.0367743

air (base alternative)

train
income -.03859 .0093287 -4.14 0.000 -.0568739 -.0203062
partysize .7590228 .190438 3.99 0.000 .3857711 1.132274
_cons -.9960951 .4750053 -2.10 0.036 -1.927088 -.0651019

bus
income -.0119789 .0081057 -1.48 0.139 -.0278658 .003908
partysize .5876645 .1751734 3.35 0.001 .2443309 .930998
_cons -1.629348 .4803384 -3.39 0.001 -2.570794 -.6879016

car
income -.004147 .0078971 -0.53 0.599 -.019625 .011331
partysize .5737318 .163719 3.50 0.000 .2528485 .8946151
_cons -3.903084 .750675 -5.20 0.000 -5.37438 -2.431788

/lnsigmaP1 -1.097572 .7967201 -1.38 0.168 -2.659115 .4639704


/lnsigmaP2 -.3906271 .3468426 -1.13 0.260 -1.070426 .2891719

sigma1 1 (base alternative)


sigma2 1 (scale alternative)
sigma3 .3336802 .2658497 .0700102 1.590376
sigma4 .6766324 .2346849 .3428624 1.335321

(mode=air is the alternative normalizing location)


(mode=train is the alternative normalizing scale)

As can be seen, two of the four error variances are set to one. These are the base
and scale alternatives. While choosing a base and scale alternative is necessary to
identify the model, the problem here is that because errors are uncorrelated, fixing the
variance of the base alternative is not necessary to identify the model. As a result, an
additional constraint is imposed, which leads to a different model structure depending
on the choice of base and scale alternatives. For example, changing the base alternative
to car produces a different log likelihood:
. quietly asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(het) corr(ind) nolog base(4)
. display e(ll)
-181.58795

To properly estimate an unconstrained heteroskedastic model, one needs to define


a vector of variance terms in which one element (the scale alternative) is fixed and
pass this vector on to the estimation command. For example, to set the error variance
of the second alternative to unity, define a vector of missing values, stdpat, whose
second element is 1, and then call this vector from inside asmprobit using the option
std(fixed) (see [R] asmprobit for details):

. matrix define stdpat = (.,1,.,.)
. asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(fixed stdpat) corr(ind) nolog base(1)
Alternative-specific multinomial probit Number of obs = 840
Case variable: id Number of cases = 210
Alternative variable: mode Alts per case: min = 4
avg = 4.0
max = 4
Integration sequence: Hammersley
Integration points: 200 Wald chi2(8) = 26.84
Log simulated-likelihood = -180.01839 Prob > chi2 = 0.0008

choice Coef. Std. Err. z P>|z| [95% Conf. Interval]

mode
travelcost -.0196389 .0067143 -2.92 0.003 -.0327988 -.006479
termtime -.0664153 .0140353 -4.73 0.000 -.093924 -.0389065

air (base alternative)

train
income -.0498732 .0154884 -3.22 0.001 -.08023 -.0195165
partysize 1.126922 .3651321 3.09 0.002 .4112761 1.842568
_cons -1.072849 .680711 -1.58 0.115 -2.407018 .2613198

bus
income -.0210642 .0139892 -1.51 0.132 -.0484826 .0063542
partysize .8678651 .3179559 2.73 0.006 .244683 1.491047
_cons -1.831363 .7345686 -2.49 0.013 -3.271091 -.3916349

car
income -.010205 .0131711 -0.77 0.438 -.0360199 .01561
partysize .8708577 .3202671 2.72 0.007 .2431458 1.49857
_cons -4.971594 1.261002 -3.94 0.000 -7.443112 -2.500075

/lnsigmaP1 .558377 .3076004 1.82 0.069 -.0445087 1.161263


/lnsigmaP2 -1.0078 1.116358 -0.90 0.367 -3.195822 1.180223
/lnsigmaP3 -.0158072 .3593511 -0.04 0.965 -.7201225 .6885081

sigma1 1.747833 .5376342 .9564673 3.193964


sigma2 1 (scale alternative)
sigma3 .3650213 .4074946 .0409329 3.255099
sigma4 .9843171 .3537155 .4866926 1.990743

(mode=air is the alternative normalizing location)


(mode=train is the alternative normalizing scale)

Now the model is properly normalized, and the user may verify that changing either
the scale alternative (that is, changing the location of the 1 in stdpat) or the base
alternative leaves results unchanged. Note that while, in theory, the only restriction
necessary to identify the heteroskedastic probit model is to fix one of the variance
terms, in the Stata implementation of the model, the base and scale outcomes must be
different. That is, Stata does not allow the same alternative to be the base outcome
and the scale outcome. However, this is more of an inconvenience than a restriction:
such a model would be equivalent to one in which the base and scale outcomes differed.
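That verification can be scripted directly. The following sketch moves the fixed
variance to the third alternative and sets car as the base outcome; up to simulation
and optimization tolerance, the reported log simulated-likelihood should match the
value above (−180.01839).

. * Check: a different scale alternative and base outcome should leave
. * the maximized log likelihood unchanged
. matrix define stdpat2 = (.,.,1,.)
. quietly asmprobit choice travelcost termtime, casevars(income partysize)
> case(id) alternatives(mode) std(fixed stdpat2) corr(ind) base(4)
. display e(ll)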

Finally, to show that independence of errors indeed implies J − 1 estimable error
variances, we must verify that the error variances can be calculated directly from the
variance and covariance parameters of the normalized error differences. Only the latter
are identified and, hence, estimable (Train 2009). Suppose, without loss of generality,
J = 3, and let j = 1 be the base outcome.
Following the normalization approach advocated by Train (2009, 100f.), the normalized
covariance matrix of error differences is given by

$$\widetilde{\Omega}_1^* = \begin{pmatrix} 1 & \theta_{23}^* \\ \theta_{23}^* & \theta_{33}^* \end{pmatrix}$$

with elements θ∗ relating to the actual error variances σjj and covariances σij as follows:

$$\theta_{23}^* = \frac{\sigma_{23} + \sigma_{11} - \sigma_{12} - \sigma_{13}}{\sigma_{22} + \sigma_{11} - 2\sigma_{12}}, \qquad
\theta_{33}^* = \frac{\sigma_{33} + \sigma_{11} - 2\sigma_{13}}{\sigma_{22} + \sigma_{11} - 2\sigma_{12}}$$

Under independence, σij = 0. Fixing σ22 = 1 (that is, choosing j = 2 as the scale
outcome) yields θ∗23 = σ11/(1 + σ11) and θ∗33 = (σ33 + σ11)/(1 + σ11). Obviously, σ11
can be calculated from θ∗23, and subsequent substitution produces σ33 from θ∗33. The
same is true if we choose to fix either σ11 or σ33 because in each case, we would obtain
two equations in two unknowns. Similar conclusions follow when there are four or
more outcome categories. Thus, with independent errors, J − 1 variance parameters are
estimable.
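A small numerical illustration of this recovery, using made-up values for the
normalized parameters (they are assumptions for the example, not estimates from the
models above):

. * Illustration: recover sigma11 and sigma33 from theta23* and theta33*
. * under independence, with sigma22 fixed at 1
. scalar theta23 = 0.6
. scalar theta33 = 0.9
. scalar sigma11 = theta23/(1 - theta23)
. scalar sigma33 = theta33*(1 + sigma11) - sigma11
. display "sigma11 = " sigma11 "   sigma33 = " sigma33

This returns sigma11 = 1.5 and sigma33 = .75, which can be checked against the two
equations above.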

Reference
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
The Stata Journal (2013)
13, Number 2, p. 406

Software Updates

st0210 1: Making spatial analysis operational: Commands for generating spatial-effect
variables in monadic and dyadic data. E. Neumayer and T. Plümper. Stata Journal
10: 585–605.
Changes affecting all ado-files: Users are no longer required to have mmerge.ado
installed.
Changes affecting spdir.ado: A bug was fixed that affected row-standardized
spatial-effect variables and spatial-effect variables with additive link functions. Also
users can now choose from a larger variety of link functions.
Changes affecting spundir.ado: A bug was fixed that affected row-standardized
spatial-effect variables and spatial-effect variables with additive link functions.
Changes affecting spagg.ado and spspc.ado: A bug was fixed that affected row-
standardized spatial-effect variables. Also users can now choose from a larger variety
of link functions; this is achieved by introducing a compulsory link-function choice,
which replaces the previously existing default choice of the link function and the
reverse W option.
st0220 1: eq5d: A command to calculate index values for the EQ-5D quality-of-life
instrument. J. M. Ramos-Goñi and O. Rivero-Arias. Stata Journal 11: 120–125.
Five additional country-specific value sets recently published for France, Italy, South
Korea, Thailand, and Canada have been included using the N3 model methodology.
These additional value sets are also acknowledged by the EuroQol group on its
website as available or “in development” value sets.
A fix has been made to correct a software bug that occurred when an if condition was
used while estimating the predicted EQ-5D with the saving() option.


© 2013 StataCorp LP up0040
