An Introduction To Spatial Data Analysis
May 5, 2004
Abstract
This paper presents an overview of GeoDa™, a free software program
intended to serve as a user-friendly and graphical introduction to spatial
analysis for non-GIS specialists. It includes functionality ranging from
simple mapping to exploratory data analysis, the visualization of global
and local spatial autocorrelation, and spatial regression. A key feature of
GeoDa is an interactive environment that combines maps with statistical
graphics, using the technology of dynamically linked windows. A brief re-
view of the software design is given, as well as some illustrative examples
that highlight distinctive features of the program in applications dealing
with public health, economic development, real estate analysis and crim-
inology.
Key Words: geovisualization, exploratory spatial data analysis, spatial
outliers, smoothing, spatial autocorrelation, spatial regression.
1 Introduction
The development of specialized software for spatial data analysis has seen rapid
growth since the lack of such tools was lamented in the late 1980s by Haining
∗ This research was supported in part by US National Science Foundation Grant BCS-9978058
to the Center for Spatially Integrated Social Science (CSISS) and by grant RO1 CA
95949-01 from the National Cancer Institute. In addition, this research was made possible in
part through a Cooperative Agreement between the Centers for Disease Control and Prevention
(CDC) and the Association of Teachers of Preventive Medicine (ATPM), award number TS-1125.
The contents of the paper are the responsibility of the authors and do not necessarily
reflect the official views of NSF, NCI, the CDC or ATPM. Special thanks go to Oleg Smirnov
for his assistance with the implementation of the spatial regression routines, and to Julie
Le Gallo and Julia Koschinsky for preparing, respectively, the data sets for the European
convergence study and the Seattle house prices. GeoDa™ is a trademark of Luc Anselin.
GeoDa 2
(1989) and cited as a major impediment to the adoption and use of spatial
statistics by GIS researchers. Initially, attention tended to focus on conceptual
issues, such as how to integrate spatial statistical methods and a GIS environ-
ment (loosely vs. tightly coupled, embedded vs. modular, etc.), and which
techniques would be most fruitfully included in such a framework. Familiar re-
views of these issues are represented in, among others, Anselin and Getis (1992),
Goodchild et al. (1992), Fischer and Nijkamp (1993), Fotheringham and Roger-
son (1993, 1994), Fischer et al. (1996), and Fischer and Getis (1997). Today,
the situation is quite different, and a fairly substantial collection of spatial data
analysis software is readily available, ranging from niche programs, customized
scripts and extensions for commercial statistical and GIS packages, to a bur-
geoning open source effort using software environments such as R, Java and
Python. This is exemplified by the growing contents of the software tools clear-
ing house maintained by the U.S.-based Center for Spatially Integrated Social
Science (CSISS).1
CSISS was established in 1999 as a research infrastructure project funded by
the U.S. National Science Foundation in order to promote a spatial analytical
perspective in the social sciences (Goodchild et al. 2000). It was readily rec-
ognized that a major instrument in disseminating and facilitating spatial data
analysis would be an easy-to-use, visual and interactive software package, aimed
at the non-GIS user and requiring as little as possible in terms of other software
(such as GIS or statistical packages). GeoDa is the outcome of this effort. It is
envisaged as an “introduction to spatial data analysis” where the latter is taken
to consist of visualization, exploration and explanation of interesting patterns
in geographic data.
The main objective of the software is to provide the user with a natural
path through an empirical spatial data analysis exercise, starting with simple
mapping and geovisualization, moving on to exploration, spatial autocorrelation
analysis, and ending up with spatial regression. In many respects, GeoDa is a
reinvention of the original SpaceStat package (Anselin 1992), which by now
has become quite dated, with only a rudimentary user interface, an antiquated
architecture and performance constraints for medium and large data sets. The
software was redesigned and rewritten from scratch, around the central concept
of dynamically linked graphics. This means that different “views” of the data
are represented as graphs, maps or tables with selected observations in one
highlighted in all. In that respect, GeoDa is similar to a number of other
modern spatial data analysis software tools, although it is quite distinct in
its combination of user friendliness with an extensive range of incorporated
methods. A few illustrative comparisons will help clarify its position in the
current spatial analysis software landscape.
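The idea of dynamically linked views can be made concrete with a minimal sketch. It is not GeoDa's implementation (which is written in C++ on top of MFC); the class names `SelectionModel` and `View` are hypothetical and serve only to illustrate how a selection made in one window can be highlighted in all others.

```python
class SelectionModel:
    """Shared set of selected observation ids that every view observes."""

    def __init__(self):
        self._selected = set()
        self._views = []

    def attach(self, view):
        # Register a view so that it is notified of selection changes.
        self._views.append(view)

    def select(self, ids):
        # Replace the current selection and notify all attached views.
        self._selected = set(ids)
        for view in self._views:
            view.refresh(self._selected)


class View:
    """Stand-in for a map, graph or table window (hypothetical)."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = set()
        model.attach(self)

    def refresh(self, selected):
        # A real window would redraw here; the sketch only records the ids.
        self.highlighted = set(selected)


model = SelectionModel()
views = [View(name, model) for name in ("map", "scatter plot", "table")]

# Brushing in any one window updates the shared model, and hence all windows.
model.select([3, 7, 11])
```

Because every view reads from the same selection model, adding a further linked map or graph only requires attaching one more view.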
In terms of the range of spatial statistical techniques included, GeoDa is most
similar to the collection of functions developed in the open source R environment.
For example, descriptive spatial autocorrelation measures, rate smoothing and
spatial regression are included in the spdep package, as described by Bivand and
1 See http://www.csiss.org/clearinghouse/.
Gebhardt (2000), Bivand (2002a,b), and Bivand and Portnov (2004). In contrast
to R, GeoDa is completely driven by a point and click interface and does not
require any programming. It also has more extensive mapping capability (still
somewhat experimental in R) and full linking and brushing in dynamic graphics,
which is currently not possible in R due to limitations in its architecture. On the
other hand, GeoDa is not (yet) customizable or extensible by the user, which
is one of the strengths of the R environment. In that sense, the two are seen as
highly complementary, ideally with more sophisticated users “graduating” to R
after being introduced to the techniques in GeoDa.2
The use of dynamic linking and brushing as a central organizing technique
for data visualization has a strong tradition in exploratory data analysis (EDA),
going back to the notion of linked scatterplot brushing (Stuetzle 1987), and var-
ious methods for dynamic graphics outlined in Cleveland and McGill (1988).
In geographical analysis, the concept of “geographic brushing” was introduced
by Monmonier (1989) and made operational in the Spider/Regard toolboxes
of Haslett, Unwin and associates (Haslett et al. 1990, Unwin 1994). Several
modern toolkits for exploratory spatial data analysis (ESDA) also incorporate
dynamic linking, and, to a lesser extent, brushing. Some of these rely on in-
teraction with a GIS for the map component, such as the linked frameworks
combining XGobi or XploRe with ArcView (Cook et al. 1996, 1997, Symanzik
et al. 2000), the SAGE toolbox, which uses ArcInfo (Wise et al. 2001), and the
DynESDA extension for ArcView (Anselin 2000), GeoDa’s immediate predeces-
sor. Linking in these implementations is constrained by the architecture of the
GIS, which limits the linking process to a single map (in GeoDa, there is no
limit on the number of linked maps). In this respect, GeoDa is similar to other
freestanding modern implementations of ESDA, such as the cartographic data
visualizer, or cdv (Dykes 1997), GeoVISTA Studio (Takatsuka and Gahegan
2002) and STARS (Rey and Janikas 2004). These all include functionality for
dynamic linking and, to a lesser extent, brushing. They are built in open source
programming environments, such as Tcl/Tk (cdv), Java (GeoVISTA Studio)
or Python (STARS), and are thus easily extensible and customizable. In contrast,
GeoDa is (still) a closed box, but of these packages it provides the most ex-
tensive and flexible form of dynamic linking and brushing for both graphs and
maps.
Common spatial autocorrelation statistics, such as Moran’s I and even the
Local Moran are increasingly part of spatial analysis software, ranging from
CrimeStat (Levine 2004), to the spdep and DCluster packages available on the
open source Comprehensive R Archive Network (CRAN),3 as well as commercial
packages, such as the spatial statistics toolbox of the forthcoming release of
ArcGIS 9.0 (ESRI 2004). However, at this point in time, none of these include
the range and ease of construction of spatial weights, or the capacity to carry
out sensitivity analysis and visualization of these statistics contained in GeoDa.
Apart from the R spdep package, GeoDa is the only one to contain functionality
2 Note that the CSISS spatial tools project is an active participant in the development of
The full set of functions is listed in Table 1 and is documented in detail in the
GeoDa User’s Guides (Anselin 2003, 2004).4
The software implementation consists of two important components: the
user interface and graphics windows on the one hand, and the computational
engine on the other hand. In the current version, all graphic windows are based
on Microsoft Foundation Classes (MFC) and thus are limited to MS Windows
platforms.5 In contrast, the computational engine (including statistical oper-
ations, randomization, and spatial regression) is pure C++ code and largely
cross platform.
The bulk of the graphical interface implements five basic classes of windows:
histogram, box plot, scatter plot (including the Moran scatter plot), map and
grid (for the table selection and calculations). The choropleth maps, including
the significance and cluster maps for the local indicators of spatial autocor-
relation (LISA) are derived from MapObjects classes. Three additional types
of maps were developed from scratch and do not use MapObjects: the map
movie (map animation), the cartogram, and the conditional maps. The three
dimensional scatter plot is implemented with the OpenGL library.
The functionality of GeoDa is invoked either through menu items or directly
by clicking toolbar buttons, as illustrated in Figure 1. A number of specific ap-
plications are highlighted in the following sections, focusing on some distinctive
features of the software.
Figure 1: The opening screen with menu items and toolbar buttons
4 A Quicktime movie with a demonstration of the main features can be found at
http://sal.agecon.uiuc.edu/movies/GeoDaDemo.mov.
5 Ongoing development concerns the porting of all MFC based classes to a cross-platform
Bailey and Gatrell (1995), pp. 303-308. For an alternative recent software implementation,
see Anselin et al. (2004). Spatial smoothing is discussed at length in Kafadar (1996).
7 The cartogram is constructed using the non-linear cellular automata algorithm due to
Dorling (1996).
8 The conditional maps are part of a larger set of conditional plots, which includes his-
of Figure 2, where the area of the circles is proportional to the value of the EB
smoothed rate. The upper outlier is shown as a red circle, the lower outlier
as a blue circle. The yellow circles are the counties that were outliers in the
crude rate map, highlighted here as a result of linking with the other maps and
graphs.12
Figure 2: Linked box maps, box plot and cartogram, raw and smoothed prostate
cancer mortality rates.
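The distinction between upper and lower outliers in a box map can be sketched as follows. The hinge of 1.5 times the interquartile range, the quartile interpolation rule and the rate values are assumptions made for illustration; they are not taken from the prostate cancer data.

```python
from statistics import quantiles


def box_map_classes(values, hinge=1.5):
    """Assign each value to one of the six box map categories."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr

    labels = []
    for v in values:
        if v < lower_fence:
            labels.append("lower outlier")
        elif v > upper_fence:
            labels.append("upper outlier")
        elif v < q1:
            labels.append("below 25%")
        elif v < median:
            labels.append("25% to 50%")
        elif v < q3:
            labels.append("50% to 75%")
        else:
            labels.append("above 75%")
    return labels


# Made-up rates for nine areas; the last one is far above the rest.
rates = [2.0, 2.5, 3.0, 3.1, 3.3, 3.6, 4.0, 4.2, 9.5]
labels = box_map_classes(rates)
```

Applying the same classification to the crude and to the smoothed rates shows which areas stop being flagged as outliers once the rates are smoothed.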
4 Multivariate EDA
Multivariate exploratory data analysis is implemented in GeoDa through link-
ing and brushing between a collection of statistical graphs. These include the
usual histogram, box plot and scatter plot, but also a parallel coordinate plot
(PCP) and three-dimensional scatter plot, as well as conditional plots (condi-
tional histogram, box plot and scatter plot).
We illustrate some of this functionality with an exploration of the relation-
ships between economic growth and initial development, typical of the recent
“spatial” regional convergence literature (for an overview, see Rey 2004). We
use economic data over the period 1980-1999 for 145 European regions, most of
12 Note that the outliers identified may be misleading since the rate analyzed is not adjusted
for differences in age distribution. In other words, the outliers shown may simply be counties
with a larger proportion of older males. A much more detailed analysis is necessary before
any policy conclusions may be drawn.
them at the NUTS II level of spatial aggregation, except for a few at the NUTS
I level (for Luxembourg and the United Kingdom).13
Figure 3 illustrates the various linked plots and map. The left-hand panel
contains a simple percentile map (GDP per capita in 1989), and a
three-dimensional scatter plot (for the percent agricultural and manufacturing em-
ployment in 1989 as well as the GDP growth rate over the period 1980-99). In
the top right-hand panel is a PCP for the growth rates in the two periods of
interest (1980-89 and 1989-99) and the GDP per capita in the base year, the
typical components of a convergence regression. In the bottom of the right-hand
panel is a simple scatter plot of the growth rate in the full period (1980-99) on
the base year GDP.
Both plots on the right hand side illustrate the typical empirical phenomenon
that higher GDP at the start of the period is associated with a lower growth
rate. However, as demonstrated in the PCP (some of the lines suggest a positive
relation between GDP and growth rate), the pattern is not uniform and there
13 The data are from the most recent version of the NewCronos Regio database by Eurostat.
NUTS stands for “Nomenclature of Territorial Units for Statistics” and contains the definition
of administrative regions in the EU member states. NUTS II level regions are roughly compa-
rable to counties in the U.S. context and are available for all but two countries. Luxembourg
constitutes only a single region. For the United Kingdom, data are not available at the NUTS
II level, since these regions do not correspond to local governmental units.
The original house sales data are for point locations, which, for the purposes
of this analysis are converted to Thiessen polygons. This allows a definition of
“neighbor” based on common boundaries between the Thiessen polygons. On
the left hand panel of Figure 4, two LISA cluster maps are shown, depicting the
locations of significant Local Moran’s I statistics, classified by type of spatial
association. The dark red and dark blue locations are indications of spatial
clusters (respectively, high surrounded by high, and low surrounded by low).15
In contrast, the light red and light blue are indications of spatial outliers (re-
spectively, high surrounded by low, and low surrounded by high). The bottom
14 The data are from the King County (Washington State) Department of Assessments.
15 More precisely, the locations highlighted show the “core” of a cluster. The cluster itself
can be thought of as consisting of the core as well as the neighbors. Clearly some of these
clusters are overlapping.
map uses the default significance of p = 0.05, whereas the top map is based
on p = 0.01 (after carrying out 9999 permutations). The matching significance
map is in the top right hand panel of Figure 4. Significance is indicated by
darker shades of green, with the darkest corresponding to p = 0.0001. Note how
the tighter significance criterion eliminates some (but not that many) locations
from the map. In the bottom right hand panel of the Figure, the correspond-
ing Moran scatterplot is shown, with the most extreme “high-high” locations
selected. These are shown as cross-hatched polygons in the maps, and almost
all obtain highly significant (at p = 0.0001) local Moran’s I statistics.
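The construction of a LISA cluster map can be sketched in a few lines. This is a hedged illustration, not GeoDa's code: it assumes row-standardized weights, a conditional permutation scheme (the value at location i is held fixed while the remaining values are reshuffled over its neighbors) and the commonly used pseudo p-value (extreme + 1) / (permutations + 1). Under that formula, 9999 permutations give a smallest attainable value of 1/10000 = 0.0001. The grid values and all function names are made up for the example.

```python
import random


def rook_neighbors(rows, cols):
    """Neighbor lists for a regular grid; cells sharing an edge are neighbors."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            neighbors[r * cols + c] = [
                rr * cols + cc
                for rr, cc in candidates
                if 0 <= rr < rows and 0 <= cc < cols
            ]
    return neighbors


def local_moran(values, neighbors, permutations=999, seed=12345):
    """Return (I_i, quadrant, pseudo p-value) for every location.

    Assumes every location has at least one neighbor.
    """
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # second moment of the deviations

    results = []
    for i in range(n):
        k = len(neighbors[i])
        lag = sum(z[j] for j in neighbors[i]) / k  # row-standardized spatial lag
        observed = z[i] * lag / m2

        # Quadrant of the Moran scatter plot: own value first, neighbors second.
        own = "high" if z[i] > 0 else "low"
        nbr = "high" if lag > 0 else "low"
        quadrant = own + "-" + nbr

        # Conditional permutation: keep z[i], draw k of the other values.
        others = z[:i] + z[i + 1:]
        larger = 0
        for _ in range(permutations):
            sample = rng.sample(others, k)
            if z[i] * (sum(sample) / k) / m2 >= observed:
                larger += 1
        extreme = min(larger, permutations - larger)
        p_value = (extreme + 1) / (permutations + 1)
        results.append((observed, quadrant, p_value))
    return results


def cluster_label(result, alpha=0.05):
    """Keep the quadrant only when the pseudo p-value passes the cut-off."""
    _, quadrant, p_value = result
    return quadrant if p_value <= alpha else "not significant"


# Made-up 4 x 4 grid with high values concentrated in the top-left corner.
values = [9, 8, 2, 1,
          8, 9, 1, 2,
          2, 1, 1, 2,
          1, 2, 2, 1]
results = local_moran(values, rook_neighbors(4, 4))
labels = [cluster_label(r, alpha=0.05) for r in results]
```

Re-running `cluster_label` with `alpha=0.01` reproduces the kind of sensitivity analysis described above: locations are dropped from the map as the criterion is tightened, while the quadrant of each location stays the same.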
The overall pattern depicts a cluster of high priced houses on the East side,
with a cluster of low priced houses following an axis through the center. Put in
context, this is not surprising, since the East side represents houses with a lake
view, while the center cluster follows a highway axis and generally corresponds
with a lower income neighborhood. Interestingly, the pattern is not uniform,
and several spatial outliers can be distinguished. Further investigation of these
patterns would require a full hedonic regression analysis.
6 Spatial Regression
As of version 0.9.5-i, GeoDa also includes a limited degree of spatial regression
functionality. The basic diagnostics for spatial autocorrelation, heteroskedastic-
ity and non-normality are implemented for the standard ordinary least squares
regression. Estimation of the spatial lag and spatial error models is supported
by means of the Maximum Likelihood (ML) method (see Anselin and Bera
1998, for a review of the technical issues). In addition to the estimation itself,
predicted values and residuals are calculated and made available for mapping.
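For reference, the two specifications can be written in the notation that is customary in the spatial econometrics literature. The symbols are assumptions introduced here, not taken from the text above: y is the vector of observations on the dependent variable, X the matrix of explanatory variables, W the spatial weights matrix, ρ and λ the spatial autoregressive coefficients, and ε an error term that is assumed to be independent and normally distributed for the purposes of ML estimation.

```latex
% Spatial lag model: the spatially lagged dependent variable Wy enters as a regressor.
y = \rho W y + X \beta + \varepsilon

% Spatial error model: spatial autoregressive dependence in the error term u.
y = X \beta + u, \qquad u = \lambda W u + \varepsilon
```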
The ML estimation in GeoDa distinguishes itself by the use of extremely
efficient algorithms that allow the estimation of models for very large data sets.
The standard eigenvalue simplification is used (Ord 1975) for data sets up to
1,000 observations. Beyond that, the sparse algorithm of Smirnov and Anselin
(2001) is used, which exploits the characteristic polynomial associated with the
spatial weights matrix. This algorithm allows estimation for very large data sets
in reasonable time. In addition, GeoDa implements the recent algorithm of
Smirnov (2003) to compute the asymptotic variance matrix for all the model
coefficients (i.e., including both the spatial and non-spatial coefficients). This
involves the inversion of a matrix with the dimensions of the data set. To date,
GeoDa is the only software that provides such estimates for large data sets.
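The eigenvalue simplification of Ord (1975) rests on the identity log|I − ρW| = Σ log(1 − ρω_i), where the ω_i are the eigenvalues of W. The eigenvalues are computed once and then reused for every trial value of ρ during the optimization. The following sketch checks the identity numerically with NumPy on a small, made-up rook contiguity matrix; it illustrates the identity only and is not the routine used in GeoDa.

```python
import numpy as np


def rook_weights(rows, cols):
    """Binary, symmetric rook contiguity matrix for a regular grid."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # neighbor to the right
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < rows:  # neighbor below
                W[i, i + cols] = W[i + cols, i] = 1.0
    return W


W = rook_weights(5, 5)
rho = 0.2  # trial value, chosen so that 1 - rho * omega_i > 0 for all i

# Eigenvalues are computed once; W is symmetric, so they are real.
omega = np.linalg.eigvalsh(W)
log_jacobian_eig = np.sum(np.log(1.0 - rho * omega))

# Direct evaluation of the log-determinant for comparison.
sign, log_jacobian_direct = np.linalg.slogdet(np.eye(W.shape[0]) - rho * W)
```

The eigenvalue computation itself becomes the bottleneck as the number of observations grows, which is consistent with switching to a different algorithm beyond roughly 1,000 observations.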
All estimation methods employ sparse spatial weights, but they are currently
constrained to weights that are intrinsically symmetric (e.g., excluding k-nearest
neighbor weights). The regression routines have been successfully applied to
real data sets of more than 300,000 observations (with estimation and inference
completed in a few minutes). By comparison, a spatial regression for the 3000+
US counties takes a few seconds.
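Why k-nearest neighbor weights fall outside the symmetric case can be seen from a three-point example. The coordinates and function names below are made up for illustration: with k = 1, the third point has the second as its nearest neighbor, but the second point's nearest neighbor is the first, so the neighbor relation is not mutual.

```python
def knn_neighbors(coords, k):
    """Neighbor sets based on the k nearest points (squared Euclidean distance)."""
    neighbors = {}
    for i, (xi, yi) in enumerate(coords):
        dist = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbors[i] = {j for _, j in dist[:k]}
    return neighbors


def is_symmetric(neighbors):
    """True when j being a neighbor of i always implies i is a neighbor of j."""
    return all(i in neighbors[j] for i in neighbors for j in neighbors[i])


# Three points on a line at x = 0, 1 and 3.
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
knn = knn_neighbors(coords, k=1)

# Contiguity along the same line: each point neighbors the adjacent one(s).
contiguity = {0: {1}, 1: {0, 2}, 2: {1}}

knn_is_symmetric = is_symmetric(knn)
contiguity_is_symmetric = is_symmetric(contiguity)
```

Contiguity defined by common boundaries is mutual by construction, whereas the k-nearest neighbor relation generally is not.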
We illustrate the spatial regression capabilities with a partial replication and
extension of the homicide model used in Baller et al. (2001) and Messner and
Anselin (2004). These studies assessed the extent to which a classic regression
specification, well-known in the criminology literature, is robust to the explicit
consideration of spatial effects. The model relates county homicide rates to a
number of socio-economic explanatory variables. In the original study, a full ML
analysis of all US continental counties was precluded by the constraints on the
eigenvalue-based SpaceStat routines. Instead, attention focused on two subsets
of the data containing 1412 counties in the US South and 1673 counties in the
non-South.
In Figure 5, we show the result of the ML estimation of a spatial error
model of county homicide rates for the complete set of 3085 continental US
counties in 1980. The explanatory variables are the same as before: a Southern
dummy variable, a resource deprivation index, a population structure indicator,
7 Future Directions
GeoDa is a work in progress and still under active development. This devel-
opment proceeds along three fronts. First and foremost is an effort to make
the code cross-platform and open source. This requires considerable change in
the graphical interface, moving from the Microsoft Foundation Classes (MFC)
that are standard in the various MS Windows flavors, to a cross-platform alter-
native. The current efforts use wxWindows, which operates on the same code
base with a native GUI flavor in Windows, MacOS X and Linux/Unix. Mak-
ing the code open source is currently precluded by the reliance on proprietary
code in ESRI’s MapObjects. Moreover, opening the code involves more than simply making
the source code available: it entails considerable reorganization and streamlining
of code (refactoring), to make it possible for the community to effectively
participate in the development process.
A second strand of development concerns the spatial regression functionality.
While currently still fairly rudimentary, the inclusion of estimators other than
ML and the extension to models for spatial panel data are in progress. Finally,
the functionality for ESDA itself is being extended to data models other than
the discrete locations in the “lattice” case. Specifically, exploratory variography
is being added, as well as the exploration of patterns in flow data.
Given its initial rate of adoption, there is a strong indication that GeoDa
is indeed providing the “introduction to spatial data analysis” that makes it
possible for growing numbers of social scientists to be exposed to an explicit
spatial perspective. Future development of the software should enhance this
capability and it is hoped that the move to an open source environment will
involve an international community of like-minded developers in this venture.
References
Anselin, L. (1992). SpaceStat, a Software Program for Analysis of Spatial Data.
National Center for Geographic Information and Analysis (NCGIA), Univer-
16 See the original papers for technical details and data sources. In Baller et al. (2001),
a different set of spatial weights was used than in this example, but the conclusions of the
specification tests are the same. Specifically, using the county contiguity, the robust Lagrange
Multiplier tests are 1.24 for the Lag alternative, and 24.88 for the Error alternative, strongly
suggesting the latter as the proper alternative.
Anselin, L., Syabri, I., Smirnov, O., and Ren, Y. (2002b). Visualizing spatial
autocorrelation with dynamically linked windows. Computing Science and
Statistics, 33. CD-ROM.
Assunção, R. and Reis, E. A. (1999). A new proposal to adjust Moran’s I for
population density. Statistics in Medicine, 18:2147–2161.
Becker, R. A., Cleveland, W., and Shyu, M.-J. (1996). The visual design and
control of Trellis displays. Journal of Computational and Graphical Statistics,
5:123–155.
Bivand, R. (2002a). Implementing spatial data analysis software tools in R.
In Anselin, L. and Rey, S., editors, New Tools for Spatial Data Analysis:
Proceedings of the Specialist Meeting. Center for Spatially Integrated Social
Science (CSISS), University of California, Santa Barbara. CD-ROM.
Bivand, R. (2002b). Spatial econometrics functions in R: Classes and methods.
Journal of Geographical Systems, 4(4):405–421.
Bivand, R. and Gebhardt, A. (2000). Implementing functions for spatial statisti-
cal analysis using the R language. Journal of Geographical Systems, 2(3):307–
317.
Bivand, R. S. and Portnov, B. A. (2004). Exploring spatial data analysis tech-
niques using R: The case of observations with no neighbors. In Anselin, L.,
Florax, R. J., and Rey, S. J., editors, Advances in Spatial Econometrics:
Methodology, Tools and Applications, pages 121–142. Springer-Verlag, Berlin.
Carr, D. B., Chen, J., Bell, S., Pickle, L., and Zhang, Y. (2002). Interac-
tive linked micromap plots and dynamically conditioned choropleth maps.
In Anselin, L. and Rey, S., editors, New Tools for Spatial Data Analysis:
Proceedings of the Specialist Meeting. Center for Spatially Integrated Social
Science (CSISS), University of California, Santa Barbara. CD-ROM.
Clayton, D. and Kaldor, J. (1987). Empirical Bayes estimates of age-
standardized relative risks for use in disease mapping. Biometrics, 43:671–681.
Cleveland, W. S. and McGill, M. (1988). Dynamic Graphics for Statistics.
Wadsworth, Pacific Grove, CA.
Cook, D., Majure, J., Symanzik, J., and Cressie, N. (1996). Dynamic graphics
in a GIS: A platform for analyzing and exploring multivariate spatial data.
Computational Statistics, 11:467–480.
Cook, D., Symanzik, J., Majure, J. J., and Cressie, N. (1997). Dynamic graphics
in a GIS: More examples using linked software. Computers and Geosciences,
23:371–385.
Dorling, D. (1996). Area Cartograms: Their Use and Creation. CATMOG 59,
Institute of British Geographers.
Fischer, M. M., Scholten, H. J., and Unwin, D. (1996). Spatial Analytical Per-
spectives on GIS. Taylor and Francis, London.
Fotheringham, A. S. and Rogerson, P. (1993). GIS and spatial analytical prob-
lems. International Journal of Geographical Information Systems, 7:3–19.
Fotheringham, A. S. and Rogerson, P. (1994). Spatial Analysis and GIS. Taylor
and Francis, London.
Goodchild, M. F., Anselin, L., Appelbaum, R., and Harthorn, B. (2000). Toward
spatially integrated social science. International Regional Science Review,
23(2):139–159.
Goodchild, M. F., Haining, R. P., Wise, S., and others (1992). Integrating GIS
and spatial analysis — problems and possibilities. International Journal of
Geographical Information Systems, 6:407–423.
Haining, R. (1989). Geography and spatial statistics: Current positions, future
developments. In Macmillan, B., editor, Remodelling Geography, pages 191–
203. Basil Blackwell, Oxford.
Haslett, J., Wills, G., and Unwin, A. (1990). SPIDER — an interactive statis-
tical tool for the analysis of spatially distributed data. International Journal
of Geographical Information Systems, 4:285–296.