Chapter 2
Spatial data
Spatial data is data that contains information about geographical locations and/or about the shapes of
the geographical features (geometry)
Spatial data can be stored as precise (point) coordinates (lat-lon/XY ) or topology (i.e., the geometric
boundaries of municipalities, countries, etc.)
A (postal) address or ZIP code is not in itself spatial data. It needs to be geocoded in order to obtain
point coordinates
In order to visually represent (i.e. map) an administrative unit, we need information about its geometry
(lines/polygons)
Shapefiles
Polygon boundaries are a collection of lines that form shapes. These are usually contained in shapefiles
The shapefile format is a geospatial vector data format for geographic information system (GIS) software.
Shapefile folders include at least three files, with extensions .shp (the feature geometry itself), .shx (a
positional index of the feature geometry) and .dbf (columnar attributes). They can also include a .prj file
(with the coordinate system and projection information)
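As a minimal sketch (base R only), the helper below checks that these companion files are present before attempting an import. The function name check_shapefile and its arguments are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: report which shapefile components exist in a folder.
# `dir` is the shapefile folder and `layer` the file name without extension.
check_shapefile <- function(dir, layer) {
  required <- c("shp", "shx", "dbf")   # mandatory components
  optional <- "prj"                    # projection information
  exts <- c(required, optional)
  present <- file.exists(file.path(dir, paste0(layer, ".", exts)))
  names(present) <- exts
  if (!all(present[required])) {
    warning("Missing mandatory files: ",
            paste(required[!present[required]], collapse = ", "))
  }
  present
}

# Example call, assuming the folder used later in this chapter:
# check_shapefile("Super_Neighborhoods", "Super_Neighborhoods")
```

A missing .prj file does not prevent the import, but the resulting object would have no coordinate system information attached.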
R and GIS
Spatial data is usually managed in Geographic Information System (GIS) software (e.g. ArcGIS,
QGIS)
Some packages in R make it possible to handle and analyze spatial data. These include sp, rgdal,
maptools, rgeos and spatstat
In R it is possible to geolocate addresses, manipulate and visualize spatial data and perform spatial
statistical analyses
Super Neighborhoods division polygon data is freely available at the City of Houston GIS open data
website
We can use the function readOGR from the package rgdal to import this data
library(rgdal)
neigh <- readOGR(dsn = "Super_Neighborhoods", layer = "Super_Neighborhoods")
## OGR data source with driver: ESRI Shapefile
## Source: "Super_Neighborhoods", layer: "Super_Neighborhoods"
## with 88 features
## It has 11 fields
Other options for importing shapefiles include the readShapePoly function of the maptools package
(without projection data)
Inspection
neigh is a SpatialPolygonsDataFrame with several slots, two of which are @data (containing the variables
associated with each polygon) and @polygons
We can work with @data the same way we work with a regular data frame
str(neigh@data)
NA values
##
## FALSE TRUE
## 60 28
The complete.cases function reports TRUE for cases without any NA values (it can be used to subset
as well)
table(complete.cases(neigh@data))
##
## FALSE TRUE
## 80 8
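A small base R sketch of both checks, using a toy data frame that stands in for neigh@data (the column names and values are made up):

```r
# Toy data frame standing in for neigh@data (values are made up)
df <- data.frame(
  id   = 1:4,
  pop  = c(100, NA, 250, 80),
  area = c(2.5, 1.0, NA, 3.2)
)

table(is.na(df$pop))          # count NA values in a single column
ok <- complete.cases(df)      # TRUE for rows with no NA in any column
table(ok)
df_complete <- df[ok, ]       # keep only the complete rows
df_complete
```

The same logical vector could be used to subset the spatial object itself, for example neigh[complete.cases(neigh@data), ].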
Map projections
Every spatial dataset has its own coordinate reference system (CRS), which can be either a lat-lon system
(Geographic Coordinate System, GCS) or a projected coordinate system (map projection)
A map projection is always necessary to create a map. Projections are a way to represent the 3-D
surface of the earth on a 2-D plane
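To make the idea concrete, here is a minimal sketch of one simple projection, the spherical Mercator, written in base R. It is only an illustration of how lon/lat degrees become planar coordinates in meters; it is not the projection used in this chapter, and the function name and the sphere radius are assumptions.

```r
# Spherical Mercator: map lon/lat in degrees to planar x/y in meters.
# R is the sphere radius; 6378137 m (the WGS84 semi-major axis) is assumed.
mercator <- function(lon, lat, R = 6378137) {
  lambda <- lon * pi / 180
  phi    <- lat * pi / 180
  cbind(x = R * lambda,
        y = R * log(tan(pi / 4 + phi / 2)))
}

# Approximate lon/lat of downtown Houston (illustrative values)
mercator(lon = -95.37, lat = 29.76)
```

In practice the transformation is delegated to a library such as sp, which handles the datum and projection parameters.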
Different spatial operations require either lat-lon or projected coordinate systems. In any case, the
systems should be the same when working with super-imposed layers
Please take a moment to read these Projection Basics
To find out the CRS of a spatial object, open the .prj file in the shapefile folder, or in R type
proj4string(neigh)
Projecting coordinates
To find out the equivalent projection to a GCS, see this Spatial Reference website
Transformations from unprojected to projected systems (and vice-versa) can be made in R using the
spTransform function of the sp package
library(sp)
neigh <- spTransform(neigh, CRS("+init=epsg:3673"))
Suppose we are only interested in the area DOWNTOWN. We can construct a TRUE/FALSE variable to
subset the spatial object
sel <- neigh@data$SNBNAME=="DOWNTOWN"
downtown<-neigh[sel, ]
plot(downtown, axes=TRUE)
[Figure: plot of the Downtown polygon with projected coordinate axes]
More information by Super Neighborhood from the 2010 US Census is available at the City of Houston
GIS open data website in .csv format
We can import this csv into R using the read.csv function (see ?read.csv)
SN_data<-read.csv("Census_2010_By_SuperNeighborhood.csv", header=TRUE)
Alternatively, we can download and import the data directly by passing the URL to read.csv
SN_data<-read.csv("http://arcg.is/2dZ7naE", header=TRUE)
The variable POLYID in SN_data is also in neigh@data. It is a unique identifier of the spatial units
(Super Neighborhoods)
We can merge SN_data into the neigh SpatialPolygonsDataFrame using this unique identifier
neigh<-merge(neigh, SN_data, by="POLYID")
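Before merging, it can be worth checking that the identifier really is unique and present in both tables. A minimal base R sketch with two toy data frames standing in for neigh@data and SN_data (all values are made up):

```r
# Toy stand-ins for neigh@data and SN_data (values are made up)
polys  <- data.frame(POLYID = c(1, 2, 3), SNBNAME = c("A", "B", "C"))
census <- data.frame(POLYID = c(3, 1, 2), SUM_TotPop = c(300, 100, 200))

# The key should be unique in both tables and cover every polygon
stopifnot(!anyDuplicated(polys$POLYID), !anyDuplicated(census$POLYID))
setdiff(polys$POLYID, census$POLYID)   # polygons without census data

merged <- merge(polys, census, by = "POLYID")
merged
```

Merging the spatial object itself, as in the line above, rather than overwriting neigh@data with a merged data frame, avoids misaligning attributes and polygons when the row order changes.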
We can make basic queries on the merged data the same way we would with a regular data frame. For
instance, we could find out how many people live in all neighborhoods combined, or how many people live Downtown
sum(neigh@data$SUM_TotPop)
## [1] 2066749
neigh@data$SUM_TotPop[neigh@data$SNBNAME=="DOWNTOWN"]
## [1] 16716
To add more criteria (e.g., more neighborhoods) use & for AND and | for OR
neigh@data$SUM_TotPop[neigh@data$SNBNAME=="DOWNTOWN" | neigh@data$SNBNAME=="MEMORIAL"]
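The logical operators can be tried out on a toy attribute table (base R only; names and values are made up). The %in% operator is shown as a compact alternative when matching several names.

```r
# Toy attribute table (values are made up)
d <- data.frame(
  SNBNAME    = c("DOWNTOWN", "MEMORIAL", "MIDTOWN"),
  SUM_TotPop = c(100, 200, 300)
)

# OR: rows matching either name
pop_or <- d$SUM_TotPop[d$SNBNAME == "DOWNTOWN" | d$SNBNAME == "MEMORIAL"]
# %in% is a compact equivalent when matching several names
pop_in <- d$SUM_TotPop[d$SNBNAME %in% c("DOWNTOWN", "MEMORIAL")]
# AND: both conditions must hold
pop_and <- d$SUM_TotPop[d$SNBNAME != "DOWNTOWN" & d$SUM_TotPop > 150]

pop_or
pop_in
pop_and
```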
We can use the spplot function to get a simple choropleth map of any continuous variable
spplot(neigh, "SUM_TotPop", main = "Population distribution", col = "transparent")
[Figure: choropleth map titled "Population distribution", with a color key ranging from 0 to 100,000]
We can manipulate all the attributes of the choropleth map by modifying different options. Unfortunately,
there is no friendly map editor (as in QGIS or ArcMap) for these tasks
To change the colors, we first choose a palette from the RColorBrewer package (type
display.brewer.all() to see the options) and then define the number of colors. In this case the palette is
OrRd (orange-red) with 7 colors, which corresponds to cuts = 6 in the spplot function, since 6 cuts divide the range into 7 classes
library(RColorBrewer)
my.palette <- brewer.pal(n = 7, name = "OrRd")
spplot(neigh, "SUM_TotPop", col.regions = my.palette, main = "Population distribution", cuts = 6, col =
[Figure: choropleth map titled "Population distribution" in the OrRd palette, with a color key ranging from 0 to 100,000]
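The relation between the number of cuts and the number of colors can be checked in base R. This sketch assumes equally spaced break points over the data range, which is taken here to mirror the default behaviour of lattice-based plots such as spplot; the population values are made up.

```r
pop  <- c(500, 12000, 33000, 47000, 61000, 88000, 99000)  # made-up values
cuts <- 6

# cuts + 2 equally spaced break points delimit cuts + 1 intervals
breaks    <- seq(min(pop), max(pop), length.out = cuts + 2)
n_classes <- length(breaks) - 1
n_classes

# Assign each value to its class (1 = lowest, n_classes = highest)
cls <- findInterval(pop, breaks, rightmost.closed = TRUE)
cls
```

With cuts = 6 there are 7 classes, so the palette needs 7 colors.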
We can subset polygons in the same way we subset data based on a condition
plot(neigh, axes=TRUE)
sel <- neigh@data$SUM_TotPop>mean(neigh@data$SUM_TotPop)
plot(neigh[sel, ], axes=TRUE, col="dark red", add=TRUE)
[Figure: map of all Super Neighborhoods, with those above the mean population shaded dark red]
The function gCentroid of the rgeos package allows us to find the geometric center (centroid) of a
map or a single polygon
library(rgeos)
plot(neigh, axes = TRUE)
center <- gCentroid(neigh[neigh$SNBNAME == "DOWNTOWN",])
plot(center, cex = 0.3, col="red", add=TRUE)
[Figure: map of all Super Neighborhoods with the Downtown centroid marked in red]
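For intuition, the centroid of a single simple polygon can be computed from its vertices with the standard area-weighted (shoelace) formula. The base R sketch below is an illustration of that formula, not a description of how gCentroid is implemented; the function name is hypothetical.

```r
# Area-weighted centroid of a simple polygon given as vertex coordinates.
# Vertices must be listed in order; the ring does not need to be closed.
polygon_centroid <- function(x, y) {
  xn <- c(x[-1], x[1])          # next vertex (wraps around)
  yn <- c(y[-1], y[1])
  cross <- x * yn - xn * y
  A  <- sum(cross) / 2          # signed area
  cx <- sum((x + xn) * cross) / (6 * A)
  cy <- sum((y + yn) * cross) / (6 * A)
  c(x = cx, y = cy)
}

# Unit square: the centroid is its middle point
polygon_centroid(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))
```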
Buffers around a point are often used to define areas of interest. The function gBuffer of the rgeos package
allows us to calculate buffers of different width (in meters) around a given spatial geometry
plot(neigh, axes = TRUE)
buffer_2k <- gBuffer(center, width = 2000)
plot(neigh[buffer_2k,], col = "lightblue", add = TRUE)
plot(buffer_2k, add = TRUE, border = "red", lwd = 2)
[Figure: map with the 2 km buffer outlined in red and the neighborhoods it intersects shaded light blue]
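The idea behind a circular buffer around a point can be sketched in base R: with projected coordinates in meters, a location falls inside the buffer if its Euclidean distance to the center does not exceed the buffer width. The coordinates below are made up. Note that this only covers points; neigh[buffer_2k, ] above selects polygons that intersect the buffer, which relies on the geometry operations of the spatial packages.

```r
# Made-up projected coordinates in meters (same units as the CRS)
center <- c(x = 1000, y = 1000)
pts <- data.frame(x = c(1500, 4000, 1000), y = c(1500, 1000, 2999))

# A point lies inside the circular buffer if its Euclidean distance
# to the center does not exceed the buffer width
width  <- 2000
d      <- sqrt((pts$x - center[["x"]])^2 + (pts$y - center[["y"]])^2)
inside <- d <= width
inside
```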
We can calculate the geodesic distance (in meters) between every pair of centroids (as an approximation of the distance
between areas) using the distm function of the geosphere package. Because it requires points in degrees (lon/lat),
we first have to transform our spatial object back to the original CRS
library(geosphere)
neigh <- spTransform(neigh, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
center_all <- gCentroid(neigh, byid=TRUE)
dist_matrix <- distm(center_all)
dist_matrix[1:5,1:5]
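To see what such a distance matrix contains, here is a base R sketch of the haversine (great-circle) formula applied to a few made-up centroids. It assumes a spherical earth with radius 6378137 m, so its results may differ slightly from those of distm, which can use an ellipsoidal model.

```r
# Great-circle distance (meters) between lon/lat points in degrees.
# r is the assumed sphere radius.
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Pairwise distance matrix for a few made-up centroids
pts <- data.frame(lon = c(-95.37, -95.46, -95.20),
                  lat = c(29.76, 29.78, 29.70))
idx <- seq_len(nrow(pts))
dist_toy <- outer(idx, idx,
                  function(i, j) haversine(pts$lon[i], pts$lat[i],
                                           pts$lon[j], pts$lat[j]))
round(dist_toy)
```

The matrix is symmetric with zeros on the diagonal, so element [i, j] is the distance in meters between centroids i and j.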
Variables may have a spatial pattern, meaning that they seem to be more clustered or dispersed than would
be expected by random location. To check for this, we first plot the spatial distribution of the variable of
interest (in this case Black population)
spplot(neigh, "SUM_NH_Black", col.regions = my.palette, main = "Number of Blacks", cuts = 6, col = "tran
[Figure: choropleth map titled "Number of Blacks", with a color key ranging from 5,000 to 25,000]
To find out the extent of spatial autocorrelation in our variable, we need to define some criteria for
what we consider being a neighbor. The first step in this direction is to construct a neighbors list
using the poly2nb function of the spdep package.
library(spdep)
neigh_nb <- poly2nb(neigh)
head(neigh_nb, n=3)
## [[1]]
## [1] 3 10 78 85
##
## [[2]]
## [1] 3 11 12 73 77 82
##
## [[3]]
## [1] 1 2 10 11 19 75 77 78
Here the default contiguity condition is Queen, meaning that areas sharing any boundary point are
considered neighbors
The object neigh_nb contains a list with the ids of neighbors of each polygon
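The structure of a neighbors list can be explored with a toy example in base R. This is a plain list built by hand for illustration, not an object created by poly2nb.

```r
# Toy neighbors list for 4 areas; area 4 has no neighbors.
# (An empty vector marks the isolated area here to keep the sketch simple.)
nb <- list(c(2L, 3L), c(1L, 3L), c(1L, 2L), integer(0))

n_neighbors <- lengths(nb)      # number of neighbors per area
n_neighbors
which(n_neighbors == 0)         # areas without neighbors

# Contiguity is symmetric: if j is a neighbor of i, i is a neighbor of j
symmetric <- all(vapply(seq_along(nb), function(i) {
  all(vapply(nb[[i]], function(j) i %in% nb[[j]], logical(1)))
}, logical(1)))
symmetric
```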
Plot neighbors
We can visualize the nb object in a plot, together with the coordinates of the centroids of each polygon
coords<-coordinates(neigh)
plot(neigh)
plot(neigh_nb, coords, add = TRUE)
The next step is to construct weights between neighbors. There are many coding schemes that can be chosen
to represent the spatial relationship between areas. In this simple example, we use row-standardized
weights (style W, the default), so that the weights of each area's neighbors sum to one. Each neighbor of an area
with many neighbors therefore receives a smaller weight than each neighbor of an area with few. We do this
using the nb2listw function
neigh_lw_W <- nb2listw(neigh_nb, zero.policy = TRUE)
print(neigh_lw_W, zero.policy = TRUE)
## Number of regions: 88
## Number of nonzero links: 430
## Percentage nonzero weights: 5.552686
## Average number of links: 4.886364
## 2 regions with no links:
## 36 38
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 86 7396 86 40.11216 353.6972
We have to introduce the option zero.policy = TRUE because we have 2 areas (green areas/parks) with no
neighbors in the original shapefile
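Row standardization can be sketched in base R with a toy neighbors list. The list is built by hand for illustration and the isolated area is simply given a row of zeros, which is an assumption about how such areas are treated in this sketch.

```r
# Toy neighbors list: area 2 has two neighbors, area 4 has none
nb <- list(2L, c(1L, 3L), 2L, integer(0))
n  <- length(nb)

W <- matrix(0, n, n)
for (i in seq_len(n)) {
  k <- length(nb[[i]])
  if (k > 0) W[i, nb[[i]]] <- 1 / k   # each neighbor gets weight 1/k
}
W
rowSums(W)   # 1 for areas with neighbors, 0 for the isolated area
```

Area 2 has two neighbors, so each receives a weight of 0.5, while the single neighbor of area 1 receives a weight of 1.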
Finally we are ready to calculate measures of spatial autocorrelation. In this example we calculate
Moran's I statistic under normality
moran_u <- moran.test(neigh@data$SUM_NH_Black, listw = neigh_lw_W, zero.policy = TRUE, randomisation = FALSE)
moran_u
##
## Moran I test under normality
##
## data: neigh@data$SUM_NH_Black
## weights: neigh_lw_W
##
## Moran I statistic standard deviate = 5.1102, p-value = 1.609e-07
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.354440899 -0.011764706 0.005135344
According to these results, we can reject the null hypothesis of no spatial autocorrelation in the distribution of
the Black population in Houston
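For intuition about what the statistic measures, Moran's I can be computed by hand on a toy example in base R. The weights matrix and values below are made up; the formula is the usual I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, where z are deviations from the mean and S0 is the sum of all weights.

```r
# Four areas on a line (1-2-3-4) with row-standardized weights
W <- rbind(c(0,   1,   0,   0),
           c(0.5, 0,   0.5, 0),
           c(0,   0.5, 0,   0.5),
           c(0,   0,   1,   0))
x <- c(1, 2, 3, 4)          # made-up values with a clear spatial trend

n   <- length(x)
z   <- x - mean(x)          # deviations from the mean
S0  <- sum(W)               # sum of all weights
I   <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)
E_I <- -1 / (n - 1)         # expected value under no autocorrelation
c(I = I, expectation = E_I)
```

Similar values sit next to each other in this toy example, so I is positive and well above its expectation. In the output above, the expectation of -0.011764706 equals -1/(86 - 1), which is consistent with the 86 regions that have at least one link.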