Implementing Spatial Data Analysis Software Tools in R
Roger Bivand, Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Bergen, Norway (Roger.Bivand@nhh.no). 16th April 2002.
Introduction
This contribution has two equal threads: doing spatial data analysis in the R project and environment, and learning from the R project about how an analytic and infrastructural open source community has achieved critical mass to enable mutually beneficial sharing of knowledge and tools. The challenge is to see whether, and if so how far, we can contribute to the next meeting of the community nurturing R (and other projects) at the Distributed Statistical Computing workshop in 2003. It is fair to say that the statistical and data analytic interests of the community are catholic, rigorous, and enthusiastic, and challenge the perceived barriers between commercial and open source software in the interests of better, more timely, and more professional analysis in the proper sense of the word. R is an implementation of the S language, as is S-PLUS, and is often able to execute the same interpreted code; it was initially written by Ross Ihaka and Robert Gentleman (1996). R follows most of the Brown and Blue Books (Becker, Chambers and Wilks, 1988; Chambers and Hastie, 1992), and also implements parts of the Green Book (Chambers, 1998). R is associated with the Omegahat project: it is here that much progress on inter-operation is being made, for instance embedding R in Perl, Python, Java, PostgreSQL or Gnumeric. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs out of the box on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux). It also compiles and runs on Windows systems and MacOS, and as such provides a functional cross-platform distribution standard for software (and, from release 1.5.0, for data).
Paper for CSISS specialist meeting on spatial data analysis software tools, Santa Barbara CA, 10-11 May 2002.
At the time of writing, searching the R site for "spatial" yielded 447 hits. As Ripley (2001) comments, some of the hesitancy previously observable in contributions of new packages has been due to the existence of the S-PLUS SpatialStats module: duplicating existing work (including GIS integration) has not seemed fruitful. Over recent months, however, a number of packages have been released on CRAN in all three areas of spatial data analysis (point patterns, continuous surfaces, and lattice data). Ripley is not only very familiar with spatial statistics as an academic statistician (Ripley, 1981, among other publications), but also contributed an early package to R for point pattern analysis and continuous surface analysis, included in Venables and Ripley (1999, third edition). Descriptions of some of the packages available are given in notes in R News (Ripley, 2001; Bivand, 2001b), while a more dated survey was made by Bivand and Gebhardt (2000), reflecting the situation about three years ago. Rather than duplicate these surveys, this section will be concerned with highlighting features of the R implementation of S that are of potential value for spatial data analysis. First, though, some basic remarks may help to provide a context.
In the terminology used in the R project, the programming environment is provided as a program interpreting the language, managing memory, providing services, and running the user interface (command line, history mechanisms and graphics windows). In passing, it is worth noting that the language supports the use of expressions as function arguments, allowing considerable fluency in command line interaction and function writing. There is a clearly defined interface to this program, permitting additional compiled functions to be dynamically loaded, and interpreted functions to be introduced into the list of known objects. By default on program startup, only the base package is loaded, and other packages are loaded at will. R distributions are accompanied by a small set of packages, available by default, and a larger recommended collection.
Source packages provide a powerful vehicle for distributing additional code, but their structure encourages a much richer formulation. A minimal source package contains a directory of files of interpreted functions, and a directory of files of documentation of this code. All functions should be fully documented, and must be if the package is to be distributed through the Comprehensive R Archive Network. It is customary for the documentation to include an example section that can be executed to demonstrate what the function does; typically, use is made of data sets distributed with the base package if possible. If one chooses to use domain-specific data sets, then the package will contain a further directory with the necessary data files, which in turn are documented in the help file directory. Interpreted R code can of course be read within the context of the user interface, and functions may be edited, saved to user files, and sourced back into the program. In some circumstances, it is desirable to move the internal computations of a function to a compiled language, although, as we will see, this is not an absolute requirement, because the built-in internal functions themselves interface highly optimized and heavily debugged compiled code. In this case, a source directory will also be present,
with C, C++, or Fortran 77 source files. In the next section below we will see how these are converted into an installed package that is ready for use in the program. Here it is sufficient to mention the possibility of dynamically loading user-compiled shared object code, and the usefulness of the C header files in particular, giving direct access to internal R data objects and memory allocation mechanisms from such user-compiled functions. For instance, R provides a factor object definition for categorical variables, with a character vector of level labels and an integer vector of observation values pointing to the vector of levels. In the compiled GRASS/R interface using GRASS library functions, moving categorical raster data between R and GRASS is accomplished fast, and with labels fully preserved, by operating on R factor objects in C functions.
Within R, functions are written to use object classes, for example the factor class, to test for object suitability, or in many modelling situations to convert factors into appropriate dummy variables. Finally, users can at will create new classes for which the class method dispatch mechanism can be invoked. The summary() function appears to be a single function, but in fact calls appropriate summary methods based on the class of the first argument; the same applies to the plot() and print() functions; a minimal sketch of such a user-level class is given below. The extent to which the extant spatial data analysis packages use class and method based functions varies, mostly depending on the age of the code and on the potential value of such revisions. The main packages at present published on CRAN specifically for spatial data analysis are spatial for point pattern and continuous surface data (in the VR bundle), fields, geoR, geoRglm, RandomFields and sgeostat for continuous surface data, spatstat and splancs for point pattern data, and spdep for spatial lattice data.
On the graphics side, R does not provide dynamic linked visualization, since the graphics model is based on drawing on one of a number of graphics devices. R does provide the more important tools for graphical data analysis, although no general mapping capability is present as yet. Work is progressing on the provision of panelled graphics in the grid and lattice packages, and R can be loosely linked with GGobi. Graphics have in part been kept fairly simple because of cross-platform difficulties; they are extensible at the user level in many ways, but are more for viewing than for interaction.
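To make the dispatch mechanism concrete, the following minimal sketch (the class name myclass and its contents are invented here for illustration) creates a user-level class and a print method for it; when an object carrying the class attribute is printed, dispatch finds print.myclass automatically:

> obj <- list(x = 1:10, label = "example")
> class(obj) <- "myclass"
> print.myclass <- function(x, ...) {
+     cat("object", x$label, "with", length(x$x), "values\n")
+     invisible(x)
+ }
> obj
object example with 10 values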
2.1 Implementation examples
In this section, we will use some of the implementation details of the spdep package to exemplify the internal workings of an R package; this package is a workbench for exploring alternative implementation issues, and bug reports, contributions and criticisms are very welcome. The illustrations will draw on the canonical data sets for areal or lattice data, many of which are included in the package to provide clear comparisons with the underlying literature. Most of the world as seen by data analysis software still looks like a flat table, and the most characteristic object class in R at first seems to be the data frame. But a data frame, a rectangular flat table with row and column names, and made up of columns of types including numeric, integer, logical and character, and with other
attributes, is in fact a list. While point pattern data can exist happily within flat tables, as indeed can point locations with attributes as used in the analysis of continuous surfaces, as well as time series, the specific structuring data object of lattice data analysis, describing the neighbourhood relations between observations, cannot. When the weights matrix is represented as such, analysis may be impeded for moderate to larger data sets. This provides one reason for supplementing the existing S-PLUS SpatialStats module for lattice data. In that module, weights are represented sparsely in a data frame with four columns: weights matrix element row index, column index, value, and (optionally) the order of the weights matrix when multiple matrices are stored in the same data frame. While this provides a direct route to sparse matrix functions for finding the Jacobian, and for matrix multiplication, it makes the retrieval of neighbourhood relations awkward for other needs. Here, it was found simpler to create a hierarchy of class objects leading to the same target, but also open to many operations at earlier stages. The basic building block is a simple list, with each list element an integer vector containing the indices of the neighbours of the corresponding region under the present definition. The list is of class nb, and has a character region ID attribute to provide a mapping between region names and indices. These lists may be read in from legacy GAL-format files, or generated from lists of polygon perimeter coordinates, or from matrices of coordinates representing the regions under analysis. Nicholas Lewin-Koh has contributed a number of useful graph-derived functions, so that there is now quite a choice with regard to creating lists of neighbours. Class nb has summary() and plot() methods:
> data(columbus)
> summary(col.gal.nb, coords)
Connectivity of col.gal.nb with the following attributes:
List of 5
 $ class    : chr "nb"
 $ region.id: num [1:49] 1005 1001 1006 1002 1007 ...
 $ gal      : logi TRUE
 $ call     : logi TRUE
 $ sym      : logi TRUE
NULL
Number of regions: 49
Number of nonzero links: 230
Percentage nonzero weights: 9.579342
Average number of links: 4.693878
Link number distribution:
 2  3  4  5  6  7  8  9 10
 7  7 13  4  9  6  1  1  1
7 least connected regions:
1005 1008 1045 1047 1049 1048 1015 with 2 links
1 most connected region:
1017 with 10 links
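To make the nb structure concrete, a minimal object of this class can be built by hand; the three-region example below is invented for illustration, and real objects are of course created by the reading and generating functions just mentioned:

> nb <- list(as.integer(2), as.integer(c(1, 3)), as.integer(2))
> attr(nb, "region.id") <- c("A", "B", "C")
> class(nb) <- "nb"
> nb[[2]]   # integer indices of the neighbours of region "B"
[1] 1 3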
Figure 1: Plot of the Columbus OH neighbours list.
> plotpolys(polys, bbs, border="grey")
> plot(col.gal.nb, coords, add=TRUE)
> to.be.dropped <- which(coords[,1] <= coords[21,1])
> text(coords[to.be.dropped,1], coords[to.be.dropped,2],
+   labels=to.be.dropped, pos=2, offset=0.3)
> sub.col.gal.nb <- subset(col.gal.nb,
+   !(1:length(col.gal.nb) %in% to.be.dropped))
> plot(sub.col.gal.nb, coords[-to.be.dropped,], col = "red",
+   add = TRUE)
> which(!(attr(col.gal.nb, "region.id") %in%
+   attr(sub.col.gal.nb, "region.id")))
[1] 21 31 34 36 39 42 46
The code snippet above and figure 2 show how the neighbours list for Columbus may be subsetted to retain spatial units east of the river:
Figure 2: Subsetting Columbus OH on the Scioto River.
Since this is written in interpreted code, it may be instructive to profile 1000 repetitions of the subsetting operation using the convenient profiler compiled by default into R under Unix and Linux. Total time was 12.06 seconds, of which 43% was spent in the calling function. Standard functions such as sort() and unique.default() do spend extra time checking argument characteristics, but turn out to be very effective, themselves calling internal compiled functions. A more demanding task is to find lists of higher order neighbours, in this case the second, third and fourth order neighbours for Columbus, once again profiling for 1000 calls of the nblag() function. Of the total time of 136.28 seconds for 1000 calls, one third is within the calling function, but as much as 17% is in the double left bracket function for accessing list elements, and a good deal in which() and match(). The nblag() function returns a list of neighbours lists, and is now just interpreted code. This also permits more confident debugging, especially in relation to possible special cases, allowing extra conditions to be imposed, for example with regard to units with no neighbours. In other cases, a call to a compiled function, say for counting numbers of neighbours per unit, may be a helpful simplification. Similar considerations apply to the function for creating weights lists from neighbours lists, nb2listw(), which is now interpreted, having been partly compiled in the past.
Figure 3: Seconds elapsed by function for 1000 subset function calls on the Columbus neighbours list (the profiled functions include subset.nb, sort, unique.default, unique, names, inherits, is.na, !, is.factor, names.default and which).
Figure 4: Seconds elapsed by function for 1000 nblag() function calls on the Columbus neighbours list, returning four orders of neighbours.
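The timings reported in figures 3 and 4 were collected with the sampling profiler mentioned above; a minimal sketch of the approach for the subsetting case, reusing the col.gal.nb and to.be.dropped objects defined earlier, might be:

> Rprof("subset.out")   # start writing profiler samples to a file
> for (i in 1:1000) res <- subset(col.gal.nb,
+     !(1:length(col.gal.nb) %in% to.be.dropped))
> Rprof(NULL)           # turn the profiler off

The samples in subset.out can then be tabulated by function, for example with R CMD Rprof subset.out from the shell.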
In particular, the conversion of the standard lapply() function from interpreted to compiled code, for applying a function to list elements, has meant that less code needs to be compiled for acceptable response times. The listw object is a list with a neighbours list member and a corresponding weights member, as well as attributes providing some metadata. The function can use general as well as binary weights, and can use the weighting schemes described in Tiefelsdorf, Griffith and Boots (1999). A final consideration is that the availability of classes does encourage conformity, for example in using the htest class to report the results of hypothesis tests. The following example shows the output of three Moran's I tests on the Freeman-Tukey square root transformation of SIDS occurrences in counties in North Carolina, 1974-8: the first flagging the presence of two counties without neighbouring county seats within 30 miles of their own, the second subsetting to remove the zero-neighbour counties, and the third using the Saddlepoint approximation (Tiefelsdorf, forthcoming):
> moran.test(ft.SID74*sqrt(nc.sids$BIR74), nb2listw(sidsorig.nb,
+   zero.policy=TRUE), zero.policy=TRUE, alternative="two.sided")

	Moran's I test under randomisation

data:  ft.SID74 * sqrt(nc.sids$BIR74)
weights: nb2listw(sidsorig.nb, zero.policy = TRUE)
Moran I statistic standard deviate = 3.305, p-value = 0.0009498
alternative hypothesis: two.sided
sample estimates:
Moran I statistic       Expectation          Variance
      0.235886191      -0.010101010       0.005539641

> drop.no.neighs <- !(1:length(sidsorig.nb) %in%
+   which(card(sidsorig.nb) == 0))
> sub.sidsorig.nb <- subset(sidsorig.nb, drop.no.neighs)
> sub.x <- subset(ft.SID74*sqrt(nc.sids$BIR74), drop.no.neighs)
> moran.test(sub.x, nb2listw(sub.sidsorig.nb),
+   alternative="two.sided")

	Moran's I test under randomisation

data:  sub.x
weights: nb2listw(sub.sidsorig.nb)
Moran I statistic standard deviate = 3.3674, p-value = 0.0007587
alternative hypothesis: two.sided
sample estimates:
Moran I statistic       Expectation          Variance
      0.240178814      -0.010309278       0.005533231

> lm.morantest.sad(lm(sub.x ~ 1), nb2listw(sub.sidsorig.nb),
+   alternative="two.sided")

	Saddlepoint approximation for global Moran's I (Barndorff-Nielsen
	formula)

data:  model: lm(formula = sub.x ~ 1)
weights: nb2listw(sub.sidsorig.nb)
Saddlepoint approximation = 3.1917, p-value = 0.001414
alternative hypothesis: two.sided
sample estimates:
Observed Moran's I
         0.2401788
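For contributed code, returning results in htest form requires no more than assembling a suitably named list; the sketch below (echoing the values of the first test above) shows the components that the standard print method expects:

> res <- structure(list(
+     statistic = c("Moran I statistic standard deviate" = 3.305),
+     p.value = 0.0009498,
+     estimate = c("Moran I statistic" = 0.235886191),
+     alternative = "two.sided",
+     method = "Moran's I test under randomisation",
+     data.name = "x"), class = "htest")
> res   # print.htest lays the result out as in the output above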
Concluding this section, it is worth recapping that the structures of the R language are evolving, and that many issues that even 18 months ago seemed potentially forbidding have been resolved. For example, reading images into R was handled by using compiled code from an external library not available on all platforms. Now the same task is accomplished using standard connections functions, encapsulated for PNM images in the pixmap package. The same connections functions are used for reading and writing legacy GAL files to and from neighbours lists. These advances are brought about by continuing interaction between the core developers and users contributing packages to CRAN, and because R provides researchers working on the S language with a flexible environment in which to prototype functionality.
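A stripped-down GAL reader built on connections might look like the following sketch; it assumes the simple legacy layout of a header line giving the number of regions, followed for each region by a line giving the region index and its neighbour count, and a line of neighbour indices (the read.gal() function in spdep handles the real cases):

read.gal.sketch <- function(file) {
    con <- file(file, open = "r")         # text-mode connection
    n <- scan(con, n = 1, quiet = TRUE)   # header: number of regions
    nb <- vector(mode = "list", length = n)
    for (i in 1:n) {
        hdr <- scan(con, n = 2, quiet = TRUE)  # region index, link count
        nb[[hdr[1]]] <- as.integer(scan(con, n = hdr[2], quiet = TRUE))
    }
    close(con)
    class(nb) <- "nb"
    nb
}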
Above, CRAN (the Comprehensive R Archive Network) and packages were mentioned. While R provides a rich language and environment for data analysis and visualization, it is also extendible, not just because the user can write new or customised interpreted functions, and dynamically load compiled C, Fortran or C++ code, but because the project provides tools for checking, building, archiving and distributing user-contributed packages. Each such package is required to document its functions and to provide examples that should run without error if the package is correctly installed. This effectively reduces barriers between users (with a certain insight into the language and their own problem areas) and core developers, and seems to be a good example of the beneficial consequences of an open source development model. It has been important to maintain a certain conservatism, meaning that hard-won experience (and legacy C and Fortran code) is central, while experimentation continues in parallel, and in part in the Omegahat project. It is also worth stressing that the R project is an open community, with multiple commitments to varying data analysis communities, and a clear willingness to adapt within the possibilities offered by open source development, in particular through inter-operation with other visualization software, databases, languages, and so on (even including R as an Excel add-in).
As has already been indicated indirectly, much of the added value of the R project extends beyond the standard functionality of the language and programming environment. The archive network is such an extension, as are the package checking mechanisms (in the tools package). Together with the test suites, they have been developed to facilitate quality control of the core system rather than of user-contributed packages, but because the same standards are used, the contributed packages also benefit in terms of organisation and coherence. The use of profiling was demonstrated above, and is a typical side effect of the spillovers from the core team to users. The dynamics of the r-help mailing list further provide feedback about areas which might be given higher priority; currently threading is such an area, as are name space mechanisms for user-contributed code. Of course, such mechanisms are relatively common in open source communities, but they do need care and a willingness to contribute and participate. Maybe for analysts of spatial data, some of the more detailed statistical themes may seem marginal, usually until those in dispute have clarified their positions. Since R is in general well documented, and introductions now exist in a number of languages, at least some questions may be superfluous, or a product more of misunderstanding than of real difficulty.
Consequently, the R project provides a number of helpful ideas for the organisation of similar kinds of actions, particularly about the internal dynamics of encouraging many people with little time and no funding to collaborate fruitfully and enjoyably. What seems to happen (judging from observant reading of list traffic and occasional contact with other user-space contributors and core members) is that adaptation to helpful signals is stronger than responses to (sometimes justified) negative signals, I feel mostly because a majority of the people most of the time find that their own work, be it teaching, research, consulting or production, benefits from their participation. This is also related to disciplinary culture, where statisticians and scientists in different knowledge domains have differing traditions for working collaboratively. Naturally, a sense of humour helps, as does a willingness to sense when positive feedback in words is needed, and when one actually needs to devote the hours it takes to attack a problem.
It is also worth mentioning that R functions on effectively all Unix and Linux systems, Windows systems, and MacOS. There is a framework GUI under Windows and MacOS, which does not permit user interaction with data objects in iconic form, but permits the system to be managed. For current Windows releases, this includes a menu item for online download and installation of binary packages from CRAN. This effectively puts spatial data analysis in R just a few clicks away, at least for the basic functions. Beyond this, the lack of a GUI does constitute an important hindrance, but as has been said on the list many times, developing and maintaining GUIs on many platforms is not a priority for anyone in the core group, not least because they see R as an engine somewhat removed from direct contact with users not motivated to take the system as it stands. In projects and production, use has been made of Tcl/Tk to build custom interfaces, though not all platforms can be relied on to have the necessary libraries.
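From a running session, the same installation can be scripted; a minimal sketch, assuming a reachable CRAN mirror (and, under Unix, the tools needed to install source packages):

> install.packages("spdep")   # fetch and install from CRAN
> library(spdep)              # attach the installed package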
While a good deal is already going on, there are some clear gaps that need to be filled, over and above making more modern spatial data analysis tools and knowledge available. One is the wish that Ross Ihaka expressed after the last DSC meeting (at which I had talked about GIS integration, Bivand, 2001a) for mapping capability in R. There is some code around, including topology code; other libraries are available (particularly from Frank Warmerdam's work); and all the current spatial data analysis packages try to solve visualization problems in their own ways. GRASS, also GPL and written in C, is likewise moving towards a position from which the use of its vector libraries should become possible. Underlying the following example is the use of county boundary polygons, downloaded from the SpaceStat data archive, and projected to UTM zone 18, measured in km, using proj in a system() call:
> sids.phat <- sum(nc.sids$SID74) / sum(nc.sids$BIR74)
> pm <- ppois(nc.sids$SID74, sids.phat*nc.sids$BIR74)
> pm.ch <- choynowski(nc.sids$SID74, sids.phat*nc.sids$BIR74)
> pm.f <- as.ordered(cut(pm, breaks=c(0.0, 0.01, 0.05, 0.1,
+   0.9, 0.95, 0.99, 1), include.lowest=TRUE))
> pm.ch.f <- as.ordered(cut(pm.ch, breaks=c(0.0, 0.01, 0.05, 0.1,
+   0.9, 0.95, 0.99, 1), include.lowest=TRUE))
> cols <- cm.colors(length(levels(pm.f)))
> par(mfrow=c(2,1))
> plotpolys(nc.utm.polys, nc.utmbbs, col=cols[codes(pm.f)])
> legend(c(-280, -70), c(3700, 3900), legend=paste("prob.",
+   levels(pm.f)), fill=cols, bty="n")
> plotpolys(nc.utm.polys, nc.utmbbs, col=cols[codes(pm.ch.f)])
> legend(c(-280, -70), c(3700, 3900), legend=paste("prob.",
+   levels(pm.ch.f)), fill=cols, bty="n")
Note that using ppois() does not fold together spatial units where observed counts greatly exceed and greatly fall below expectations, as in the standard definition of probability maps. The small choynowski() function gives the same values where the observed count is less than the expected value, but folds back the others, as can be seen in figure 5.
choynowski <- function(b, e) {
    n <- length(b)
    res <- numeric(n)
    for (i in 1:n) {
        if (b[i] < e[i]) {
            # observed below expected: cumulate Poisson terms
            # up to and including the observed count
            for (j in 0:b[i]) {
                xx <- (e[i]^j * exp(-e[i])) / gamma(j + 1)
                res[i] <- res[i] + xx
            }
        } else {
            # observed at or above expected: fold the probability
            # back into the upper tail (branch reconstructed from
            # the description in the text)
            for (j in 0:(b[i] - 1)) {
                xx <- (e[i]^j * exp(-e[i])) / gamma(j + 1)
                res[i] <- res[i] + xx
            }
            res[i] <- 1 - res[i]
        }
    }
    res
}
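The inner loop is simply the Poisson probability mass function accumulated into a cumulative probability, so the two branches could equally be written with the built-in distribution function, vectorised over all units:

> pmap.lower <- ppois(b, e)           # matches the first branch
> pmap.upper <- 1 - ppois(b - 1, e)   # matches the folded-back branch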
Showing this without a map function is cumbersome, and finding display class intervals is equally so. Here there is something that can be contributed very practically! A further area is that of inter-operation, using XML and/or Green Book connections methods, or simple programs writing programs. This could also involve plugging R data computation services into other front ends, say R in PostGIS, given that R can already be embedded (experimentally) in PostgreSQL. This is more speculative, but Omegahat seems to be progressing vigorously, and highlights inter-system interfaces. It would however build on any pre-existing spatial data analysis functions in R, which would become available in the environment within which R is embedded if so selected. In fact, it seems that embedding R in Python is now quite practical, but thought would need to be given to the transfer of the data structures needed for spatial analysis between systems.
References
R. A. Becker, J. M. Chambers, and A. R. Wilks. 1988. The New S Language. Chapman & Hall, London.
R. S. Bivand. 2001a. R and geographical information systems, especially GRASS. Proceedings of the 2nd International Workshop on Distributed Statistical Computing, Technische Universität Wien, Vienna, Austria.
R. S. Bivand. 2001b. More on spatial data analysis. R News, 1 (3), 13-17.
R. S. Bivand and A. Gebhardt. 2000. Implementing functions for spatial statistical analysis using the R language. Journal of Geographical Systems, 2 (3), 307-317.
J. M. Chambers. 1998. Programming with Data. Springer, New York.
J. M. Chambers and T. J. Hastie. 1992. Statistical Models in S. Chapman & Hall, London.
R. Ihaka and R. Gentleman. 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5, 299-314.
B. D. Ripley. 1981. Spatial Statistics. Wiley, New York.
Figure 5: Probability map of North Carolina SIDS counts, 1974-8; upper map: cumulative probabilities, lower map: Choynowski probabilities.
B. D. Ripley. 2001. Spatial statistics in R. R News, 1 (2), 14-15.
M. Tiefelsdorf. forthcoming. The Saddlepoint approximation of Moran's I and local Moran's I_i reference distributions and their numerical evaluation. Geographical Analysis.
M. Tiefelsdorf, D. A. Griffith, and B. Boots. 1999. A variance-stabilizing coding scheme for spatial link matrices. Environment and Planning A, 31, 165-180.
W. N. Venables and B. D. Ripley. 1999. Modern Applied Statistics with S-PLUS (third edition). Springer, New York.