
Lecture 4 - How Not To Lie With Spatial Statistics

This document discusses challenges and issues related to spatial data analysis and statistics. It makes three key points: 1) Spatial data can represent different types of spatial processes (e.g. points, discrete objects, continuous surfaces) and the appropriate analysis depends on the type of process, but software may not distinguish between them. 2) Data problems like geocoding errors and unreliable population estimates for small areas can impact rates and underlying risk estimates. Systematic errors can create spurious patterns. 3) Estimates of rates and risks for small areas are unstable due to small populations. Smoothing or aggregation methods are needed but may obscure real small-scale patterns. Choice of method impacts results and policy implications.


Commentary

How (Not) to Lie with Spatial Statistics


Luc Anselin, MA, PhD

From the Spatial Analysis Laboratory, University of Illinois, Urbana, Illinois
Address correspondence and reprint requests to: Luc Anselin, MA, PhD, Spatial Analysis Laboratory, University of Illinois, 333 Davenport Hall, MC 150, 607 South Mathews Avenue, Urbana, IL 61801-3671. E-mail: [email protected]
Am J Prev Med 2006;30(2S):S3–S6. 0749-3797/06/$–see front matter
© 2006 American Journal of Preventive Medicine • Published by Elsevier Inc. doi:10.1016/j.amepre.2005.09.015

Taking a cue from Mark Monmonier’s classic,¹ I formulate some cautionary remarks related to the use of methods and software for spatial data analysis, with particular reference to empirical work dealing with cancer prevention and research. Due to length limitations, the discussion will have to be brief, incomplete, and largely nontechnical. For more comprehensive and technical reviews of some of the issues raised, see the articles by Anselin,² Greenland,³ and Wakefield.⁴ In this context, I define spatial data analysis broadly as consisting of three important components: exploratory spatial data analysis (ESDA), visualization, and spatial modeling. Although the dividing lines between these areas of interest are not precise, I consider ESDA as concerned with the search for interesting “patterns,” visualization as consisting of methods to show these interesting patterns, and spatial modeling as the collection of techniques (also referred to as spatial regression analysis, spatial econometrics) to explain and predict these patterns. Recent overviews of the methodology of spatial statistics and spatial econometrics can be found in Lawson,⁵ Anselin et al.,⁶ Banerjee et al.,⁷ Waller and Gotway,⁸ and Schabenberger and Gotway.⁹

The focus on patterns highlights the importance of location and distance, two central concepts in spatial data analysis. Recent methodologic advances in spatial statistics, combined with the ready availability of cheap and powerful desktop geographic information systems (GIS), have brought spatial analysis within reach of many nonspecialists. The array of techniques available can be bewildering, especially because many of them are easily applied through the use of commercial off-the-shelf point-and-click software, without much guidance as to what is appropriate for the situation at hand. This is further complicated by the fact that spatial data can be represented in many different ways (e.g., as discrete spatial objects, such as counties, or as continuous surfaces, such as a risk surface). In addition, a given representation does not necessarily provide insight into the type of spatial process at hand. For example, a point could represent an event, such as the address of a person undergoing cancer screening, or a discrete object, such as the location of a noxious facility, or even a sample point designed to measure a continuous phenomenon, such as an air quality monitoring station. Although these are all points on a GIS map, they each require a distinct statistical approach, respectively referred to as point pattern analysis (events), lattice data analysis (discrete objects), and geostatistics (continuous surfaces). Methods and properties that are appropriate for one type of analysis do not readily transfer to other types of spatial processes. Unfortunately, the GIS (and spatial analysis software) remains largely ignorant about the nature of the underlying process, and simply deals with the data as “points,” thereby not preventing meaningless analyses (such as the application of geostatistical analysis to discrete lattice data).

The upshot of this situation is that care is needed in the range of activities involved in spatial data analysis, from the collection of data and the use of software to the interpretation of results and their application in policy analysis. Along the way, choices that yield different results must be made, offering the temptation to tailor the method to the desired result (to lie with statistics). I will briefly comment on a few salient points and important tradeoffs, starting with data problems, methodologic challenges, and software issues, and closing with some remarks on interpretation and policy.

Data Problems

Spatial data include the location of the observation as an essential attribute. This is either recorded in a coordinate system as an absolute location (such as latitude–longitude, or some projected x,y coordinates), or referred to as an administrative entity, such as a census tract or ZIP code zone. In practice, the geographic information about patients or hospitals, for example, is not necessarily available in such form, but is more likely recorded as a street address. The process of translating street addresses to the formal spatial location information is referred to as geocoding. Although straightforward to carry out in most commercial GIS, it is also fraught with problems such as inaccurate address information or flaws in the spatial database on street locations. This will result in errors that need to be accounted for in any spatial statistical analysis. Unfortunately, in practice, such errors do not tend to be random, but show systematic spatial variation. For example, more inaccuracies will tend to be found in recently developed suburban neighborhoods than in long-established urban blocks, resulting in systematic spatial patterns of failure to match, or missing observations.

Systematic errors in the allocation of street addresses to the proper location have repercussions for the computation of rates (of incidence or mortality) for geographic areas. The rates serve as estimates of the underlying risk by dividing the number of events of interest (cancer cases, number of screenings) by the population at risk (i.e., the population to which the events pertain). The inaccuracy in rates is especially important for small geographic areas where changes of only a few counts in the numerator may yield significant changes in the rate, and thus in the estimate of the underlying risk. Not only is the numerator important in this respect, but also the denominator. Apart from the decennial census years, information on the population residing in a given small areal unit (census tract, ZIP code zone, county) is not regularly available and must be estimated. For larger units, such as counties, this can be fairly reliably done with some degree of detail on gender, ethnicity, and age category. This uses established demographic techniques based on birth and death records and models to estimate net migration. However, the smaller the geographic unit, the less reliable these estimates become. In practice, it is nearly impossible to obtain estimates with any degree of detail about age, gender, and ethnicity in a cost-effective manner at a smaller spatial scale than the county level. By necessity, these estimates have to rely on models (and their assumptions), and the prediction error will have to be accounted for in statistical analyses. This has obvious implications for the accuracy of computed rates, especially in areas undergoing rapid demographic transition.

The use of explicit information on the location of individuals, such as their addresses, raises concerns about protecting privacy. Techniques exist to avoid the identification of individuals while retaining important geographic information, but there is no consensus on how this can be used in the reporting of results, such as summary maps. As a result, due to legal and institutional constraints, spatial analysis often has to be carried out at aggregate spatial scales that may not be meaningful for the research question at hand (e.g., the effect of a noxious facility on elevated cancer incidence). Such analyses may suffer from the ecologic fallacy (or modifiable areal unit problem) that conclusions obtained at aggregate levels do not translate to meaningful behavioral interpretations at the micro scale.

Methodologic Challenges

The methodologic focus in the spatial analysis of cancer risk and prevention is on detecting, visualizing, and explaining instances where the distribution of risk is heterogeneous in space. Particular interest lies in identifying locations of significantly elevated risk, trying to relate these patterns to salient explanatory variables (risk factors) and incorporating these insights into policies of prevention and care provision. Several challenges are faced in this endeavor.

The risk estimate, as a rate or proportion, is inherently unstable in the sense that the precision of the estimates is not uniform. This precision is directly related to the size of the population at risk, yielding a high variance in estimates for small areas (in the sense of having a small population). In practice, this means that a high rate does not necessarily imply a similarly high risk, but could be due to the larger variance of the estimate, especially in small areas. This could indicate spurious outliers or suggest heterogeneity in risk when in fact it is uniform. A large statistical literature is devoted to addressing variance instability in rates. Approaches can be divided roughly between spatial aggregation and smoothing. In the former, the instability is “corrected” by grouping small areal units until they reach a threshold population at risk. This is often implicit in agency policies that preclude the disclosure of rates or risk estimates for areas that do not reach a critical population size (e.g., 25,000 or 50,000). Clearly, this approach is impractical if the focus is on processes that operate at small spatial scales. The alternative is to adjust the original “raw” rate estimate, by “borrowing” information or smoothing. This is often based on notions from Bayesian statistics, in which a (posterior) estimate is obtained by combining the data with “prior” information. In so-called Empirical Bayes methods, the prior information is extracted from the data itself, for example, using a national or regional risk estimate as a prior in small-area estimation. A large number of methods have been proposed, which can yield sometimes drastically different results. This potentially leads to confusion among uninformed practitioners. An important aspect to keep in mind is that smoothing is essentially a form of modeling, in which a delicate balance is obtained between assumptions imposed by the model (such as a prior distribution, the inclusion of explanatory variables, or functional form) and what the data support. Whereas one method often is selected at the expense of others, sensitivity analysis is important to gain insight into the respective tradeoffs involved. Also, when data are scarce (as in the case of rare events in small areas), the information extracted from them will by necessity be limited and unreliable.

A related issue pertaining to the estimation of rates is the practice of controlling for the effect of known risk factors, such as gender and age distribution. In epidemiologic practice, the standardization of rates to a common age/gender distribution is the rule. This so-called standard population is typically estimated at the national level and based on a specific census year. Its use corrects for apparent heterogeneity in rates due solely to differences in age distribution (e.g., a county dominated by elderly males would, ceteris paribus, have a much higher prostate cancer incidence rate than the national average age profile would suggest). Standardization either applies a reference age distribution to the age-specific rates obtained at the location of interest (direct standardization), or computes the local rate by multiplying its age distribution with a reference risk (indirect standardization). Both methods result in a loss of information (of age-specific risk estimates). More importantly, they assume homogeneity of the relation between risk and age distribution across space (and time), which may be unrealistic. Extending the standardization to other risk factors becomes more tenuous, because the assumption of a constant or proportional relationship across space between the risk and the risk factor may not be warranted. A similar issue is encountered in the use of model-based smoothing, where more complex models inevitably imply stronger assumptions. In the exploratory stage of a spatial analysis, it is better to avoid imposing too many assumptions and instead let the data speak for themselves. Again, different standardization methods may yield different suggestions of outliers or clusters, and sensitivity analysis is in order.

Once interesting patterns are identified, it remains a challenge to relate the spatial heterogeneity in (relative) risk to meaningful explanatory variables (risk factors). This is typically carried out by means of regression analysis. Due to the presence of spatial heterogeneity as well as spatial correlation, standard methods do not apply, and one has to make use of specialized techniques of spatial regression analysis (or spatial econometrics). These methods are complex and still constitute a very active area of research in statistics and econometrics. This is particularly the case when it comes to modeling phenomena in both space and time.

An additional complication encountered in spatial data analysis is the so-called change of support problem (COSP). This is present when the spatial scale of measurement for the variables of interest is different, such as point observations on air quality and health statistics collected by census tract. The solution to this spatial mismatch requires the application of spatial interpolation to bring all variables to a common spatial unit of observation (e.g., the census tract). Such spatial interpolation induces measurement errors with complex spatial structure. The resulting additional uncertainty must be properly accounted for in the regression model and other analyses.

Software

The days are long gone when the dearth of spatial analysis software was seen as a major impediment for the application of these techniques in practice. Significant progress has been made, especially during the last decade. Although mainstream commercial statistical software is still limited in its spatial functionality, a large number of freestanding niche packages, applets, macros, and scripts developed for statistical toolboxes and GIS software fill the need. Most of these implementations are noncommercial, developed in the academic world, with considerable research support from agencies such as the National Science Foundation (NSF), the National Institutes of Health (NIH), the Centers for Disease Control and Prevention (CDC), the National Institute of Justice (NIJ), and the Environmental Protection Agency (EPA). A notable private-sector exception is the recently released spatial statistics toolbox in the leading commercial GIS software, ArcGIS 9.0, which consists of a collection of functions written in the open-source Python language.

The multitude of available software tools may confuse the nonspecialist. All too often the inclusion of a technique in a software package suggests that it is the state of the art, which, in a rapidly changing field like spatial statistics, is not always the case. Furthermore, software tends to be limited in the scope of techniques included, which may misrepresent the range of methodologic options (and pitfalls) available to the analyst. There is a tension between user-friendliness of the software and the technical sophistication needed to properly appreciate the assumptions and limitations of the various techniques. Moreover, there are few standards, with little interoperability between the different packages (and GIS software), and considerable duplication.

Software that implements advanced or specialized methods, although it most likely exists, is often hard to find and not always fully documented. The usefulness of software clearinghouses, such as that maintained by the NSF-funded Center for Spatially Integrated Social Science (www.csiss.org), should not be underestimated. However, much needs to be done to provide further standardization and quality control. In this respect, the growing presence of spatial analytical software tools developed in an open-source environment is encouraging. Efforts such as RGeo, organized around the R statistical programming environment (sal.uiuc.edu/csiss/Rgeo), and PySAL, a library of spatial analytical routines written in Python (sal.uiuc.edu/projects_pysal.php), involve a growing community of developers and allow the latest methods to be included in a transparent manner (the source code serving as the ultimate documentation). Still, much remains to be done, especially to provide effective tools to carry out the exploration and modeling of space–time dynamics in a GIS environment.

From Statistics to Policy

In the context of cancer prevention and control, spatial statistical analysis and GIS are only a means and not the end. The use of these methods varies depending on the policy goals. For example, whereas the spatial analysis of cancer incidence and mortality tends to focus on the etiology of the various diseases, this may be only a tangential goal for a prevention policy. In the latter context, the insights gained from a statistical analysis of spatial distributions can be very fruitful in the implementation of a spatial decision support system. For example, applications of so-called “geo-marketing” techniques can be useful in identifying underserved populations (markets), assessing the spatial distribution of future demand for care, locating medical facilities (e.g., for screening), and targeting messages effectively to change behavior (e.g., to promote screening). The importance of the statistical insights lies in the quantification of the uncertainty associated with various estimates and in exploiting the spatial characteristics of this uncertainty in the decision process.

The current state of the art in spatial statistics is impressive, and substantial progress has been made. Although it may be tempting to translate this into “best practice methods” and institutionalized guidelines for applied research, such as the identification of a cluster, there is little hope for a satisfactory solution along these lines. New techniques are constantly being suggested as well as insights gained into the tradeoffs among different approaches. Any best practice, be it a method or software tool, is likely out of date by the time its approval has passed all the institutional hurdles. Fortunately, the new communication tools facilitated by the Internet allow the development of a community of scholars and practitioners where insights can be shared, clearinghouses provided for methods and software tools, and ongoing training provided. Such a dialogue between practice and research is likely to push the field forward, to result in effective public health policy, and to lessen the scope for “lying with spatial statistics.”⁹

This commentary is based in part on a presentation made at a meeting on “GIS Research Priorities for Comprehensive Cancer Control,” organized by the Centers for Disease Control and Prevention (CDC)’s Division of Cancer Prevention and Control, Santa Barbara, CA, November 17–18, 2004. This research was supported in part through NSF Grant BCS-9978057 to the Center for Spatially Integrated Social Science (CSISS), and by a Cooperative Agreement between the Centers for Disease Control and Prevention and the Association of Teachers of Preventive Medicine (ATPM), award number TS-1125. The contents of the note are the responsibility of the author and do not necessarily reflect the official views of NSF, CDC, or ATPM.

No financial conflict of interest was reported by the authors of this paper.

References

1. Monmonier M. How to lie with maps. Chicago: University of Chicago Press, 1996.
2. Anselin L. Under the hood. Issues in the specification and interpretation of spatial regression models. Agric Econ 2002;27:247–67.
3. Greenland S. A review of multilevel theory for ecologic analyses. Stat Med 2002;21:389–95.
4. Wakefield J. A critique of statistical aspects of ecological studies in spatial epidemiology. Environ Ecol Stat 2004;11:31–54.
5. Lawson A. Statistical methods in spatial epidemiology. Chichester: John Wiley and Sons, 2001.
6. Anselin L, Florax R, Rey S. Advances in spatial econometrics, methodology, tools and applications. Berlin: Springer-Verlag, 2004.
7. Banerjee S, Carlin B, Gelfand A. Hierarchical modeling and analysis for spatial data. Boca Raton, FL: Chapman & Hall/CRC, 2004.
8. Waller L, Gotway C. Applied spatial statistics for public health data. Chichester: John Wiley and Sons, 2004.
9. Schabenberger O, Gotway C. Statistical methods for spatial data analysis. Boca Raton, FL: Chapman & Hall/CRC, 2005.
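
As a numerical illustration of two points made under Data Problems and Methodologic Challenges (a few miscounted cases move a small-area rate far more than a large-area rate, and the precision of a rate depends on the population at risk), the following Python sketch may help. It assumes a simple binomial model for the standard error, which the commentary does not prescribe. The function names and all numbers are hypothetical.

```python
import math


def crude_rate(events, population):
    """Raw rate: events divided by the population at risk."""
    if population <= 0:
        raise ValueError("population at risk must be positive")
    return events / population


def rate_standard_error(events, population):
    """Approximate standard error of a raw rate.

    Assumption (not from the commentary): events are binomial with
    `population` trials, so SE = sqrt(p * (1 - p) / n).
    """
    p = crude_rate(events, population)
    return math.sqrt(p * (1.0 - p) / population)


def relative_change_if_missed(events, population, missed):
    """Relative drop in the raw rate if `missed` cases fail to geocode."""
    if events <= 0:
        raise ValueError("need at least one recorded event")
    before = crude_rate(events, population)
    after = crude_rate(events - missed, population)
    return (before - after) / before


# Two hypothetical areas with the same raw rate (5 per 1,000).
areas = {"small": (5, 1_000), "large": (500, 100_000)}

for name, (events, pop) in areas.items():
    print(name,
          crude_rate(events, pop),
          rate_standard_error(events, pop),
          relative_change_if_missed(events, pop, missed=2))
```

Under these assumptions both areas have the same raw rate, yet the small area has a standard error ten times larger, and losing two cases to failed address matches removes 40% of its rate against 0.4% for the large area.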

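
The Empirical Bayes idea described under Methodologic Challenges (adjusting a raw rate by borrowing information from an overall risk estimate) can be sketched as follows. This is only one possible formulation, a global method-of-moments shrinkage estimator chosen here for brevity. The commentary stresses that many smoothers exist and can give quite different results. The function name and the data are hypothetical.

```python
def empirical_bayes_rates(events, populations):
    """Shrink raw rates toward the overall rate (global Empirical Bayes).

    Assumption: the prior mean and variance are estimated from the data
    by the method of moments. This is one smoother among many and is
    not taken from the commentary.
    """
    if len(events) != len(populations) or not events:
        raise ValueError("need equally long, non-empty inputs")
    total_pop = sum(populations)
    prior_mean = sum(events) / total_pop          # overall rate as prior
    mean_pop = total_pop / len(populations)
    raw = [e / p for e, p in zip(events, populations)]

    # Population-weighted variance of the raw rates, minus the part
    # expected from sampling noise alone, truncated at zero.
    weighted_var = sum(p * (r - prior_mean) ** 2
                       for r, p in zip(raw, populations)) / total_pop
    prior_var = max(weighted_var - prior_mean / mean_pop, 0.0)

    smoothed = []
    for r, p in zip(raw, populations):
        denom = prior_var + prior_mean / p
        # Small populations get a small weight on their own raw rate.
        weight = prior_var / denom if denom > 0 else 0.0
        smoothed.append(weight * r + (1.0 - weight) * prior_mean)
    return smoothed


# Hypothetical areas: (events, population at risk).
events = [2, 30, 600]
populations = [100, 10_000, 100_000]
for e, p, s in zip(events, populations,
                   empirical_bayes_rates(events, populations)):
    print(p, e / p, s)
```

In this sketch the area with 100 residents is pulled almost entirely to the overall rate, while the area with 100,000 residents keeps most of its raw rate. This is the tradeoff the commentary warns about: smoothing stabilizes small-area estimates but can also hide real small-scale variation, so comparing several smoothers is advisable.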

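
The two standardization approaches described under Methodologic Challenges can be made concrete with a small sketch. Direct standardization applies a reference age distribution to local age-specific rates. Indirect standardization applies reference age-specific risks to the local age distribution to obtain an expected count, which is then compared with the observed count. The function names, the two age groups, and every number below are hypothetical.

```python
def direct_standardized_rate(local_events, local_pops, standard_pops):
    """Apply a reference age distribution to local age-specific rates."""
    total_standard = sum(standard_pops)
    return sum((e / p) * (s / total_standard)
               for e, p, s in zip(local_events, local_pops, standard_pops))


def expected_events(local_pops, reference_rates):
    """Events expected if the reference age-specific risks held locally."""
    return sum(p * r for p, r in zip(local_pops, reference_rates))


def standardized_ratio(observed_events, local_pops, reference_rates):
    """Indirect standardization: observed over expected events."""
    return observed_events / expected_events(local_pops, reference_rates)


# Hypothetical county with two age groups (younger, older); it is
# older than the standard population (50% older versus 20%).
local_events = [1, 9]
local_pops = [1_000, 1_000]
standard_pops = [8_000, 2_000]
reference_rates = [0.001, 0.010]   # assumed reference age-specific risks

crude_local = sum(local_events) / sum(local_pops)
crude_reference = (expected_events(standard_pops, reference_rates)
                   / sum(standard_pops))
print("crude local rate:", crude_local)
print("crude reference rate:", crude_reference)
print("directly standardized rate:",
      direct_standardized_rate(local_events, local_pops, standard_pops))
print("observed/expected ratio:",
      standardized_ratio(sum(local_events), local_pops, reference_rates))
```

With these made-up figures the crude local rate (0.005) is well above the crude reference rate (0.0028), yet the directly standardized rate is about 0.0026 and the observed-to-expected ratio is below 1. The apparent excess is therefore due to the older age profile alone. Both summaries discard the age-specific rates and assume the same relation between age and risk everywhere, which is the limitation the commentary notes.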