Spatial Analysis

Contents

1 Introduction
3 Spatial autocorrelation
  3.1 Introduction
    3.1.1 Temporal autocorrelation
    3.1.2 Spatial autocorrelation
  3.2 Example data
  3.3 Adjacent polygons
  3.4 Compute Moran's I
4 Interpolation
  4.1 Introduction
  4.2 Temperature in California
    4.2.1 NULL model
    4.2.2 Proximity polygons
    4.2.3 Nearest neighbour interpolation
    4.2.4 Inverse distance weighted
  4.3 California Air Pollution data
    4.3.1 Data preparation
    4.3.2 Fit a variogram
    4.3.3 Ordinary kriging
    4.3.4 Compare with other methods
    4.3.5 Cross-validate
5 Spatial distribution models
  5.1 Data
    5.1.1 Observations
    5.1.2 Predictors
    5.1.3 Background data
    5.1.4 Combine presence and background
  5.2 Fit a model
    5.2.1 CART
    5.2.2 Random Forest
  5.3 Predict
    5.3.1 Regression
    5.3.2 Classification
  5.4 Extrapolation
  5.5 Further reading
6 Local regression
  6.1 California precipitation
  6.2 California House Price Data
  6.3 Summarize
  6.4 Regression
  6.5 Geographically Weighted Regression
    6.5.1 By county
    6.5.2 By grid cell
  6.6 spgwr package
CHAPTER ONE: INTRODUCTION
In this section we introduce a number of approaches and techniques that are commonly used in spatial data analysis
and modelling.
Spatial data are mostly like other data. The same general principles apply. But there are a few things that are rather
important to consider when using spatial data that are not common with other data types. These are discussed in
Chapters 2 and 3 and include issues of scale and zonation (the modifiable areal unit problem), distance and spatial
autocorrelation.
The other chapters introduce methods in different areas of spatial data analysis. These include the three classical areas
of spatial statistics: point pattern analysis, regression and inference with spatial data, and geostatistics (interpolation
using kriging), as well as some other methods (local and global regression and classification with spatial data).
Some of the material presented here is based on examples in the book “Geographic Information Analysis” by David
O’Sullivan and David J. Unwin. This book provides an excellent and very accessible introduction to spatial data
analysis. It has much more depth than what we present here. But the book does not show how to practically implement
the approaches that are discussed — which is the main purpose of this website.
The spatial statistical methods are treated in much more detail in “Applied Spatial Data Analysis with R” by Bivand,
Pebesma and Gómez-Rubio.
This section builds on our Introduction to Spatial Data Manipulation with R, which you should read first.
CHAPTER TWO: SCALE AND DISTANCE

2.1 Introduction
Scale, aggregation, and distance are key concepts in spatial data analysis that can be tricky to come to grips with.
This chapter first discusses scale and the related concepts of resolution, aggregation, and zonation. The second part of the
chapter discusses distance and adjacency.
From a practical perspective, resolution affects our estimates of length and size. For example, if you wanted to know the length
of the coastline of Britain, you could use the length of a spatial dataset representing that coastline. You would get rather
different numbers depending on the data set used: the higher the resolution of the spatial data, the longer the coastline
would appear to be. This is not just a problem of the representation (the data); at a theoretical level one can
argue that the length of the coastline is not well defined, as it approaches infinity as the resolution approaches zero. This is
illustrated here.
Resolution also affects our understanding of relationships between variables of interest. In terms of data collection this
means that we want data at the highest spatial (and temporal) resolution that is possible (and affordable). We can aggregate
our data to lower resolutions, but it is much harder, and often impossible, to correctly disaggregate ("downscale")
data to a higher resolution.
2.3 Zonation
Geographic data are often aggregated by zones. While we would like to have data at the most granular level that is
possible or meaningful (individuals, households, plots, sites), in reality we often can only get aggregated data.
Rather than having data for individuals, we may have mean values for all inhabitants of a census district. Data on
population, disease, income, or crop yield are typically available for entire countries, for a number of sub-national units
(e.g. provinces), or for a set of raster cells.
The areas used to aggregate data are arbitrary (at least relative to the data of interest). The way the borders of the areas
are drawn (how large, what shape, where) can strongly affect the patterns we see and the outcome of data analysis.
This is sometimes referred to as the “Modifiable Areal Unit Problem” (MAUP). The problem of analyzing aggregated
data is referred to as “Ecological Inference”.
To illustrate the effect of zonation and aggregation, I create a region with 1000 households. For each household we
know where they live and what their annual income is. I then aggregate the data to a set of zones.
First, generate the income distribution data.
set.seed(0)
xy <- cbind(x=runif(1000, 0, 100), y=runif(1000, 0, 100))
income <- (runif(1000) * abs((xy[,1] - 50) * (xy[,2] - 50))) / 500
Inspect the data, both spatially and non-spatially. The first two plots show that there are many poor people and a few
rich people. The third shows that there is a clear spatial pattern in where the rich and the poor live.
par(mfrow=c(1,3), las=1)
plot(sort(income), col=rev(terrain.colors(1000)), pch=20, cex=.75, ylab='income')
hist(income, main='', col=rev(terrain.colors(10)), xlim=c(0,5), breaks=seq(0,5,0.5))
plot(xy, xlim=c(0,100), ylim=c(0,100), cex=income, col=rev(terrain.colors(50))[10*(income+1)])
# compute the Gini coefficient of income inequality
n <- length(income)
G <- (2 * sum(sort(income) * 1:n)/sum(income) - (n + 1)) / n
G
## [1] 0.5814548
library(raster)
# rasterize the household data into a first zonation (one column, four rows), taking the mean income per zone
r1 <- raster(ncol=1, nrow=4, xmn=0, xmx=100, ymn=0, ymx=100, crs=NA)
r1 <- rasterize(xy, r1, income, mean)
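The other zonations (r2 to r6) used in the plots below are created in the same way, only with different numbers of zones. Their exact definitions are not included in this excerpt; a minimal sketch, assuming progressively finer regular grids, would be:

r2 <- rasterize(xy, raster(ncol=4, nrow=1, xmn=0, xmx=100, ymn=0, ymx=100, crs=NA), income, mean)
r3 <- rasterize(xy, raster(ncol=2, nrow=2, xmn=0, xmx=100, ymn=0, ymx=100, crs=NA), income, mean)
r4 <- rasterize(xy, raster(ncol=3, nrow=3, xmn=0, xmx=100, ymn=0, ymx=100, crs=NA), income, mean)
r5 <- rasterize(xy, raster(ncol=5, nrow=5, xmn=0, xmx=100, ymn=0, ymx=100, crs=NA), income, mean)
r6 <- rasterize(xy, raster(ncol=10, nrow=10, xmn=0, xmx=100, ymn=0, ymx=100, crs=NA), income, mean)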
Have a look at the plots of the income distribution and the sub-regional averages.
par(mfrow=c(2,3), las=1)
plot(r1); plot(r2); plot(r3); plot(r4); plot(r5); plot(r6)
It is not surprising to see that the smaller the regions get, the better the real pattern is captured. But in all cases,
the histograms show that we do not capture the full income distribution (compare to the histogram with the data for
individuals).
par(mfrow=c(1,3), las=1)
hist(r4, main='', col=rev(terrain.colors(10)), xlim=c(0,5), breaks=seq(0, 5, 0.5))
hist(r5, main='', col=rev(terrain.colors(10)), xlim=c(0,5), breaks=seq(0, 5, 0.5))
hist(r6, main='', col=rev(terrain.colors(10)), xlim=c(0,5), breaks=seq(0, 5, 0.5))
2.4 Distance
Distance is a numerical description of how far apart things are. It is the most fundamental concept in geography. After
all, Waldo Tobler’s First Law of Geography states that “everything is related to everything else, but near things are
more related than distant things”. But how far away are things? That is not always as easy a question as it seems. Of
course we can compute distance “as the crow flies” but that is often not relevant. Perhaps you need to also consider
national borders, mountains, or other barriers. The distance between A and B may even be asymmetric: the distance from A to B
may not be the same as from B to A, for example because the President of the United States can call me, but I cannot call him
(or her), or because you go faster when walking downhill than when walking uphill.
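The code that creates the six example points (A to F) and plots them is not included in this excerpt. Coordinates consistent with the distance matrix shown below (reconstructed here, so treat them as an assumption) would be:

A <- c(40, 43)
B <- c(101, 1)
C <- c(111, 54)
D <- c(104, 65)
E <- c(60, 22)
F <- c(20, 2)
pts <- rbind(A, B, C, D, E, F)
# plot the points so that the labels added below have something to attach to
plot(pts, xlim=c(0,120), ylim=c(0,120), pch=20, cex=2, col='red', xlab='x', ylab='y')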
text(pts+5, LETTERS[1:6])
You can use the dist function to make a distance matrix with a data set of any dimension.
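Assuming the points defined above, the distance object used below would be computed as:

dis <- dist(pts)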
We can check the distance between the first two points using Pythagoras' theorem.
sqrt((40-101)^2 + (43-1)^2)
## [1] 74.06079
D <- as.matrix(dis)
round(D)
## A B C D E F
## A 0 74 72 68 29 46
## B 74 0 54 64 46 81
## C 72 54 0 13 60 105
## D 68 64 13 0 62 105
## E 29 46 60 62 0 45
## F 46 81 105 105 45 0
Distance matrices are used in all kinds of non-geographical applications. For example, they are often used to create
cluster diagrams (dendrograms).
Question 4: Show R code to make a cluster dendrogram summarizing the distances between these six sites, and plot it.
See ?hclust.
2.5.1 Adjacency
Adjacency is an important concept in some spatial analyses. In some cases objects are considered adjacent when they
"touch", e.g. neighboring countries. Adjacency can also be based on distance; this is the most common approach when
analyzing point data.
We create an adjacency matrix for the point data analysed above. We define points as "adjacent" if they are within a
distance of 50 from each other. Given that we have the distance matrix D this is easy to do.
a <- D < 50
a
## A B C D E F
## A TRUE FALSE FALSE FALSE TRUE TRUE
## B FALSE TRUE FALSE FALSE TRUE FALSE
## C FALSE FALSE TRUE TRUE FALSE FALSE
## D FALSE FALSE TRUE TRUE FALSE FALSE
## E TRUE TRUE FALSE FALSE TRUE TRUE
## F TRUE FALSE FALSE FALSE TRUE TRUE
In adjacency matrices the diagonal values are often set to NA (we do not consider a point to be adjacent to itself). And
TRUE/FALSE values are commonly stored as 1/0 (this is equivalent, and we can make this change with a simple
trick: multiplication by 1).
diag(a) <- NA
Adj50 <- a * 1
As we now have the column numbers, we can make the row-column pairs that we want (rowcols).
# inverse distance weights
W <- 1 / D
round(W, 4)
## A B C D E F
## A Inf 0.0135 0.0139 0.0148 0.0345 0.0219
## B 0.0135 Inf 0.0185 0.0156 0.0217 0.0123
## C 0.0139 0.0185 Inf 0.0767 0.0166 0.0095
## D 0.0148 0.0156 0.0767 Inf 0.0163 0.0095
## E 0.0345 0.0217 0.0166 0.0163 Inf 0.0224
## F 0.0219 0.0123 0.0095 0.0095 0.0224 Inf
Such a "spatial weights" matrix is often "row-normalized", such that the sum of weights for each row in the matrix is
the same. First we get rid of the Inf values by changing them to NA. (Where did the Inf values come from?)
W[!is.finite(W)] <- NA
Then compute the row totals, divide the rows by their totals, and check that the row sums add up to 1.
rtot <- rowSums(W, na.rm=TRUE)
W <- W / rtot
rowSums(W, na.rm=TRUE)
## A B C D E F
## 1 1 1 1 1 1
colSums(W, na.rm=TRUE)
## A B C D E F
## 0.9784548 0.7493803 1.2204900 1.1794393 1.1559273 0.7163082
library(raster)
p <- shapefile(system.file("external/lux.shp", package="raster"))
library(spdep)
We use poly2nb to create a “rook’s case” neighbors-list. And from that a neighbors matrix.
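The calls themselves are not shown in this excerpt; with spdep they would be along these lines (the argument choices are assumptions):

wr <- poly2nb(p, row.names=p$ID_2, queen=FALSE)   # queen=FALSE gives rook's case neighbors
wm <- nb2mat(wr, style='B', zero.policy=TRUE)     # binary neighbors matrix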
wr[1:6]
## [[1]]
## [1] 2 4 5
##
## [[2]]
## [1] 1 3 4 5 6 12
##
## [[3]]
## [1] 2 5 9 12
##
## [[4]]
## [1] 1 2
##
## [[5]]
## [1] 1 2 3
##
## [[6]]
## [1] 2 8 12
wm[1:6,1:11]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## 1 0 1 0 1 1 0 0 0 0 0 0
## 2 1 0 1 1 1 1 0 0 0 0 0
## 3 0 1 0 0 1 0 0 0 1 0 0
## 4 1 1 0 0 0 0 0 0 0 0 0
## 5 1 1 1 0 0 0 0 0 0 0 0
## 6 0 1 0 0 0 0 0 1 0 0 0
i <- rowSums(wm)
i
## 1 2 3 4 5 6 7 12 8 9 10 11
## 3 6 4 2 3 3 3 4 4 3 5 6
Expressed as a percentage:
par(mai=c(0,0,0,0))
plot(p, col='gray', border='blue')
We can also compute higher-order neighbors, for example "lag-two" Rook neighbors: the neighbors of the neighbors of each polygon, excluding the polygon itself and its first-order neighbors.
wr2 <- wr
for (i in 1:length(wr)) {
    # first-order (lag-1) neighbors of polygon i
    lag1 <- wr[[i]]
    # the neighbors of those neighbors
    lag2 <- wr[lag1]
    lag2 <- sort(unique(unlist(lag2)))
    # drop polygon i itself and its first-order neighbors
    lag2 <- lag2[!(lag2 %in% c(wr[[i]], i))]
    wr2[[i]] <- lag2
}
CHAPTER THREE: SPATIAL AUTOCORRELATION
3.1 Introduction
Spatial autocorrelation is an important concept in spatial statistics. It is both a nuisance, as it complicates statistical
tests, and a feature, as it allows for spatial interpolation. Its computation and properties are often misunderstood. This
chapter discusses what it is, and how statistics describing it can be computed.
Autocorrelation (whether spatial or not) is a measure of similarity (correlation) between nearby observations. To
understand spatial autocorrelation, it helps to first consider temporal autocorrelation.
set.seed(0)
d <- sample(100, 10)
d
## [1] 90 27 37 56 88 20 85 96 61 58
Compute auto-correlation.
a <- d[-length(d)]
b <- d[-1]
plot(a, b, xlab='t', ylab='t-1')
cor(a, b)
## [1] -0.2222057
The autocorrelation computed above is very small. Even though this is a random sample, you (almost) never get a
value of zero. We computed the “one-lag” autocorrelation, that is, we compare each value to its immediate neighbour,
and not to other nearby values.
After sorting the numbers in d, the autocorrelation becomes very strong (unsurprisingly).
d <- sort(d)
d
## [1] 20 27 37 56 58 61 85 88 90 96
a <- d[-length(d)]
b <- d[-1]
plot(a, b, xlab='t', ylab='t-1')
cor(a, b)
## [1] 0.9530641
The acf function shows the autocorrelation, computed in a slightly different way, for several lags (it is 1 when each point is
compared with itself, very high when comparing with the nearest neighbour, and then tapers off).
acf(d)
library(raster)
p <- shapefile(system.file("external/lux.shp", package="raster"))
p <- p[p$NAME_1=="Diekirch", ]
Let’s say we are interested in spatial autocorrelation in variable “AREA”. If there were spatial autocorrelation, regions
of a similar size would be spatially clustered.
Here is a plot of the polygons. I use the coordinates function to get the centroids of the polygons to place the
labels.
par(mai=c(0,0,0,0))
plot(p, col=2:7)
xy <- coordinates(p)
points(xy, cex=6, pch=20, col='white')
text(p, 'ID_2', cex=1.5)
library(spdep)
w <- poly2nb(p, row.names=p$Id)
class(w)
## [1] "nb"
summary(w)
## Neighbour list object:
## Number of regions: 5
## Number of nonzero links: 14
## Percentage nonzero weights: 56
## Average number of links: 2.8
## Link number distribution:
##
## 2 3 4
## 2 2 1
## 2 least connected regions:
## 2 3 with 2 links
## 1 most connected region:
## 1 with 4 links
summary(w) tells us something about the neighborhood. The average number of neighbors (adjacent polygons) is
2.8; 2 polygons have 2 neighbors, 2 have 3, and 1 has 4 neighbors (which one is that?).
For more details we can look at the structure of w.
str(w)
## List of 5
## $ : int [1:3] 2 4 5
## $ : int [1:4] 1 3 4 5
## $ : int [1:2] 2 5
## $ : int [1:2] 1 2
## $ : int [1:3] 1 2 3
## - attr(*, "class")= chr "nb"
## - attr(*, "region.id")= chr [1:5] "0" "1" "2" "3" ...
## - attr(*, "call")= language poly2nb(pl = p, row.names = p$Id)
## - attr(*, "type")= chr "queen"
## - attr(*, "sym")= logi TRUE
We can transform w into a spatial weights matrix. A spatial weights matrix reflects the intensity of the geographic
relationship between observations (see previous chapter).
wm <- nb2mat(w, style='B')
wm
## [,1] [,2] [,3] [,4] [,5]
## 0 0 1 0 1 1
## 1 1 0 1 1 1
## 2 0 1 0 0 1
## 3 1 1 0 0 0
## 4 1 1 1 0 0
## attr(,"call")
## nb2mat(neighbours = w, style = "B")
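The next chunks compute Moran's I "by hand". For reference, the statistic being assembled is

I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i} (y_i - \bar{y})^2}

where y is the variable of interest, \bar{y} its mean, and w_{ij} the spatial weights.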
n <- length(p)
y <- p$value
ybar <- mean(y)
dy <- y - ybar
g <- expand.grid(dy, dy)
yiyj <- g[,1] * g[,2]
Method 2:
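The code for the second method, and for arranging the products into the matrix pm used below, is not included in this excerpt; a sketch that produces the same pairwise products would be:

# Method 2: repeat the deviations to form all pairs
yi <- rep(dy, each=n)
yj <- rep(dy, n)
yiyj <- yi * yj
# arrange the products into an n by n matrix
pm <- matrix(yiyj, ncol=n)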
And multiply this matrix with the weights to set to zero the value for the pairs that are not adjacent.
pmw <- pm * wm
pmw
## [,1] [,2] [,3] [,4] [,5]
## 0 0.00 -3.64 0.00 9.36 -3.64
## 1 -3.64 0.00 4.76 -5.04 1.96
## 2 0.00 4.76 0.00 0.00 4.76
## 3 9.36 -5.04 0.00 0.00 0.00
## 4 -3.64 1.96 4.76 0.00 0.00
## attr(,"call")
## nb2mat(neighbours = w, style = "B")
The next step is to divide this value by the sum of weights. That is easy.
# sum of the weighted products divided by the sum of the weights
sw <- sum(pmw) / sum(wm)
vr <- n / sum(dy^2)
MI <- vr * sw
MI
## [1] 0.1728896
Below is a simple (but crude) way to estimate the expected value of Moran's I, that is, the value you would get in the
absence of spatial autocorrelation (if the data were spatially random). Of course you never really expect that, but that
is how we do it in statistics. Note that the expected value approaches zero as n becomes large, but that it is not quite
zero for small values of n.
EI <- -1/(n-1)
EI
## [1] -0.25
After doing this ‘by hand’, now let’s use the spdep package to compute Moran’s I and do a significance test. To do this
we need to create a ‘listw’ type spatial weights object (instead of the matrix we used above). To get the same value as
above we use “style=’B’” to use binary (TRUE/FALSE) distance weights.
Now we can use the moran function. Have a look at ?moran. The function is defined as ‘moran(y, ww, n, Szero(ww))’.
Note the odd arguments n and S0. I think they are odd, because “ww” has that information. Anyway, we supply them
and it works. There probably are cases where it makes sense to use other values.
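The creation of the listw object and the call to moran are not reproduced in this excerpt; following the description above, they would be along these lines:

ww <- nb2listw(w, style='B')
moran(p$value, ww, n=length(ww$neighbours), S0=Szero(ww))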
#Note that
Szero(ww)
## [1] 14
# is the same as the sum of the weights
sum(wm)
## [1] 14
Now we can test for significance. First analytically, using linear regression based logic and assumptions.
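The test itself is not shown in this excerpt; with spdep it would be something like:

moran.test(p$value, ww, randomisation=FALSE)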
Instead of the approach above you should use Monte Carlo simulation. That is the preferred method (in fact, the only
good method). The way it works is that the values are randomly assigned to the polygons, and Moran's I is computed.
This is repeated several times to establish a distribution of expected values. The observed value of Moran's I is then
compared with the simulated distribution to see how likely it is that the observed value could be considered a random
draw.
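The Monte Carlo test is also not shown here; a sketch:

moran.mc(p$value, ww, nsim=99)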
n <- length(p)
ms <- cbind(id=rep(1:n, each=n), y=rep(y, each=n), value=as.vector(wm * y))
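The steps that remove the zero (non-neighbor) entries and average the neighboring values into ams, used below, are not shown in this excerpt; a sketch would be:

# drop the pairs that are not neighbors
ms <- ms[ms[,3] > 0, ]
# average the values of the neighbors for each polygon
ams <- aggregate(ms[,2:3], list(ms[,1]), FUN=mean)
ams <- ams[,-1]
colnames(ams) <- c('y', 'spatially lagged y')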
plot(ams)
reg <- lm(ams[,2] ~ ams[,1])
abline(reg, lwd=2)
abline(h=mean(ams[,2]), lty=2)
abline(v=ybar, lty=2)
coefficients(reg)[2]
## ams[, 1]
## 0.2315341
# rwm is assumed to be a row-standardized spatial weights object, e.g. rwm <- nb2listw(w, style='W')
moran.plot(y, rwm)
CHAPTER FOUR: INTERPOLATION
4.1 Introduction
Almost any variable of interest has spatial autocorrelation. That can be a problem in statistical tests, but it is a
very useful feature when we want to predict values at locations where no measurements have been made; as we
can generally safely assume that values at nearby locations will be similar. There are several spatial interpolation
techniques. We show some of them in this chapter.
if (!require("rspatial")) devtools::install_github('rspatial/rspatial')
## Loading required package: rspatial
library(rspatial)
d <- sp_data('precipitation')
head(d)
## ID NAME LAT LONG ALT JAN FEB MAR APR MAY JUN
## 1 ID741 DEATH VALLEY 36.47 -116.87 -59 7.4 9.5 7.5 3.4 1.7 1.0
## 2 ID743 THERMAL/FAA AIRPORT 33.63 -116.17 -34 9.2 6.9 7.9 1.8 1.6 0.4
## 3 ID744 BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1
## 4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0
## 5 ID754 NILAND 33.28 -115.51 -18 9.0 8.0 9.0 3.0 0.0 1.0
## 6 ID758 EL CENTRO/NAF 32.82 -115.67 -13 9.8 1.6 3.7 3.0 0.4 0.0
## JUL AUG SEP OCT NOV DEC
## 1 3.7 2.8 4.3 2.2 4.7 3.9
## 2 1.9 3.4 5.3 2.0 6.3 5.5
## 3 1.9 9.2 6.5 5.0 4.8 9.7
## 4 2.4 2.6 8.3 5.4 7.7 7.3
## 5 8.0 9.0 7.0 8.0 7.0 9.0
## 6 3.0 10.8 0.2 0.0 3.3 1.4
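Annual precipitation (prec), used below for plotting and interpolation, is presumably the sum of the monthly values (columns JAN through DEC, i.e. columns 6 to 17 of d):

d$prec <- rowSums(d[, 6:17])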
library(sp)
dsp <- SpatialPoints(d[,4:3], proj4string=CRS("+proj=longlat +datum=NAD83"))
dsp <- SpatialPointsDataFrame(dsp, d)
CA <- sp_data("counties")
Transform longitude/latitude to planar coordinates, using the commonly used coordinate reference system for California ("Teale Albers") to assure that our interpolation results will align with other data sets we have.
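The CRS object TA used in the next chunk is not defined in this excerpt. A definition like the one used later in this chapter for the air quality data would be assumed here (the units, meters below, are an assumption; the air quality section uses km):

TA <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +ellps=GRS80")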
library(rgdal)
dta <- spTransform(dsp, TA)
cata <- spTransform(CA, TA)
library(dismo)
v <- voronoi(dta)
plot(v)
ca <- aggregate(cata)
vca <- intersect(v, ca)
spplot(vca, 'prec', col.regions=rev(get_col_regions()))
Much better. These are polygons. We can ‘rasterize’ the results like this.
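The rasterization code itself is not shown in this excerpt; a sketch (the cell size is an assumption, consistent with meter units):

r <- raster(cata)
res(r) <- 10000   # 10 km cells
vr <- rasterize(vca, r, 'prec')
plot(vr)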
set.seed(5132015)
kf <- kfold(nrow(dta))
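The cross-validation loop itself is not reproduced here; a sketch of 5-fold cross-validation for proximity polygons (using the RMSE function defined later in this chapter, and sp::over rather than any particular extraction helper) might look like:

rmse <- rep(NA, 5)
for (fold in 1:5) {
  test <- dta[kf == fold, ]
  train <- dta[kf != fold, ]
  # proximity polygons from the training points
  pols <- voronoi(train)
  # attributes of the polygon each test point falls in
  pred <- sp::over(test, pols)
  rmse[fold] <- RMSE(test$prec, pred$prec)
}
mean(rmse)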
Question 1: Describe what each step in the code chunk above does
Question 2: How does the proximity-polygon approach compare to the NULL model?
Question 3: You would not typically use proximity polygons for rainfall data. For what kind of data would you use
them?
library(gstat)
gs <- gstat(formula=prec~1, locations=dta, nmax=5, set=list(idp = 0))
nn <- interpolate(r, gs)
## [inverse distance weighted interpolation]
nnmsk <- mask(nn, vr)
plot(nnmsk)
Cross validate the result. Note that we can use the predict method to get predictions for the locations of the test
points.
library(gstat)
gs <- gstat(formula=prec~1, locations=dta)
idw <- interpolate(r, gs)
## [inverse distance weighted interpolation]
idwr <- mask(idw, vr)
plot(idwr)
Question 4: IDW generated rasters tend to have a noticeable artefact. What is that?
Cross validate. We can predict to the locations of the test points
Question 5: Inspect the arguments used for and make a map of the IDW model below. What other name could you
x <- sp_data("airqual.csv")
x$OZDLYAV <- x$OZDLYAV * 1000
Create a SpatialPointsDataFrame and transform to Teale Albers. Note the units=km which was needed to fit the
variogram.
library(sp)
coordinates(x) <- ~LONGITUDE + LATITUDE
proj4string(x) <- CRS('+proj=longlat +datum=NAD83')
TA <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=km +ellps=GRS80")
library(rgdal)
aq <- spTransform(x, TA)
Create a template raster to interpolate to, for example based on a SpatialPolygonsDataFrame of California ('ca'), and coerce that to
a 'SpatialGrid' object (a different representation of the same idea).
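The code that builds this template is not reproduced here; a sketch, reusing the counties object CA loaded earlier and assuming 10 km cells:

ca <- spTransform(aggregate(CA), TA)   # dissolve the counties and project to the km-based TA
r <- raster(ca)
res(r) <- 10   # 10 km, since the units of this CRS are km
g <- as(r, 'SpatialGrid')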
library(gstat)
gs <- gstat(formula=OZDLYAV~1, locations=aq)
v <- variogram(gs, width=20)
head(v)
## np dist gamma dir.hor dir.ver id
## 1 1010 11.35040 34.80579 0 0 var1
## 2 1806 30.63737 47.52591 0 0 var1
## 3 2355 50.58656 67.26548 0 0 var1
## 4 2619 70.10411 80.92707 0 0 var1
## 5 2967 90.13917 88.93653 0 0 var1
## 6 3437 110.42302 84.13589 0 0 var1
plot(v)
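The fitted variogram model (fve) plotted below is created with fit.variogram; the model type and starting values here are assumptions:

fve <- fit.variogram(v, vgm(85, "Exp", 75, 20))
fve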
plot(v, fve)
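The kriging step that produces kp is not included in this excerpt; with gstat it would be along these lines:

k <- gstat(formula=OZDLYAV~1, locations=aq, model=fve)
# predict on the SpatialGrid g; the result holds a prediction and a variance layer
kp <- predict(k, g)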
# variance
ok <- brick(kp)
ok <- mask(ok, ca)
names(ok) <- c('prediction', 'variance')
plot(ok)
library(gstat)
idm <- gstat(formula=OZDLYAV~1, locations=aq)
idp <- interpolate(r, idm)
We can find good values for the IDW parameters (distance decay and number of neighbours) through optimization. For
simplicity's sake I do not do that k times here. The optim function may be a bit hard to grasp at first, but the essence
is simple: you provide a function that returns a value that you want to minimize (or maximize) given a number of
unknown parameters, and initial values for these parameters; optim then searches for the optimal values
(for which the function returns the lowest number).
RMSE <- function(observed, predicted) {
sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}
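The objective function and the optim call are not reproduced in this excerpt; a sketch (the parameterization, starting values, and the train/test split are assumptions) could be:

# hypothetical objective function: RMSE of IDW predictions for a given nmax and idp
f1 <- function(x, test, train) {
  nmx <- x[1]
  idp <- x[2]
  if (nmx < 1) return(Inf)
  if (idp < .001) return(Inf)
  m <- gstat(formula=OZDLYAV~1, locations=train, nmax=nmx, set=list(idp=idp))
  p <- predict(m, newdata=test, debug.level=0)$var1.pred
  RMSE(test$OZDLYAV, p)
}
set.seed(1)
i <- sample(nrow(aq), round(0.2 * nrow(aq)))
tst <- aq[i,]
trn <- aq[-i,]
opt <- optim(c(8, .5), f1, test=tst, train=trn)
opt$par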
library(fields)
m <- Tps(coordinates(aq), aq$OZDLYAV)
tps <- interpolate(r, m)
tps <- mask(tps, idw)
plot(tps)
4.3.5 Cross-validate
Cross-validate the three methods (IDW, ordinary kriging, TPS) and add an RMSE-weighted ensemble model.
library(dismo)
nfolds <- 5
k <- kfold(aq, nfolds)
for (i in 1:nfolds) {
test <- aq[k!=i,]
train <- aq[k==i,]
m <- gstat(formula=OZDLYAV~1, locations=train, nmax=opt$par[1], set=list(idp=opt
˓→$par[2]))
}
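Only part of the loop is shown above. The remaining steps (initializing the RMSE vectors and computing predictions and RMSE for each method, plus a simple ensemble) are sketched below; the inverse-RMSE weighting is one plausible choice, not necessarily the one used to produce the output that follows:

idwrmse <- krigrmse <- tpsrmse <- ensrmse <- rep(NA, nfolds)
for (i in 1:nfolds) {
  test <- aq[k!=i,]
  train <- aq[k==i,]
  # inverse distance weighting with the optimized parameters
  m <- gstat(formula=OZDLYAV~1, locations=train, nmax=opt$par[1], set=list(idp=opt$par[2]))
  p1 <- predict(m, newdata=test, debug.level=0)$var1.pred
  idwrmse[i] <- RMSE(test$OZDLYAV, p1)
  # ordinary kriging with the fitted variogram
  m <- gstat(formula=OZDLYAV~1, locations=train, model=fve)
  p2 <- predict(m, newdata=test, debug.level=0)$var1.pred
  krigrmse[i] <- RMSE(test$OZDLYAV, p2)
  # thin plate spline
  m <- Tps(coordinates(train), train$OZDLYAV)
  p3 <- predict(m, coordinates(test))
  tpsrmse[i] <- RMSE(test$OZDLYAV, p3)
  # ensemble: weight the three predictions by 1/RMSE
  w <- 1 / c(idwrmse[i], krigrmse[i], tpsrmse[i])
  w <- w / sum(w)
  ens <- p1 * w[1] + p2 * w[2] + p3 * w[3]
  ensrmse[i] <- RMSE(test$OZDLYAV, ens)
}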
rmi <- mean(idwrmse)
rmk <- mean(krigrmse)
rmt <- mean(tpsrmse)
rms <- c(rmi, rmt, rmk)
rms
## [1] 7.925989 8.816963 7.588549
rme <- mean(ensrmse)
rme
## [1] 7.718896
Question 7: Show where the largest difference exist between IDW and OK.
Question 8: Show where the difference between IDW and OK is within the 95% confidence limit of the OK prediction.
Question 9: Can you describe the pattern we are seeing, and speculate about what is causing it?
CHAPTER FIVE: SPATIAL DISTRIBUTION MODELS
This chapter shows how you can use the Random Forest algorithm to make spatial predictions. This approach is widely
used, for example to classify remote sensing data into different land cover classes. But here our objective is to predict
the entire range of a species based on a set of locations where it has been observed. As an example, we use the hominid
species Imaginus magnapedum (also known under the vernacular names of “bigfoot” and “sasquatch”). This species
is so hard to find (at least by scientists) that its very existence is commonly denied by the mainstream media! For
more information about this controversy, see the article by Lozier, Aniello and Hickerson: Predicting the distribution
of Sasquatch in western North America: anything goes with ecological niche modelling.
We want to find out
a) What the complete range of the species might be.
b) How good (general) our model is by predicting the range of the Eastern sub-species, with data from the Western
sub-species.
c) Predict where in Mexico the creature is likely to occur.
d) How climate change might affect its distribution.
In this context, this type of analysis is often referred to as 'species distribution modeling' or 'ecological niche modeling'. Here is a more in-depth discussion of this technique.
5.1 Data
5.1.1 Observations
if (!require("rspatial")) devtools::install_github('rspatial/rspatial')
library(rspatial)
bf <- sp_data('bigfoot')
dim(bf)
## [1] 3092 3
head(bf)
## lon lat Class
## 1 -142.9000 61.50000 A
## 2 -132.7982 55.18720 A
## 3 -132.8202 55.20350 A
## 4 -141.5667 62.93750 A
## 5 -149.7853 61.05950 A
## 6 -141.3165 62.77335 A
5.1.2 Predictors
Supervised classification often uses predictor data obtained from satellite remote sensing. But here, as is common in species distribution modeling, we use climate data. Specifically, we use 'bioclimatic variables', see:
http://www.worldclim.org/bioclim
library(raster)
wc <- raster::getData('worldclim', res=10, var='bio')
plot(wc[[c(1, 12)]], nr=2)
Now extract climate data for the locations of our observations. That is, get data about the climate that the species likes,
apparently.
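The extraction itself is not shown; presumably something like (the object name bfc is an assumption):

bfc <- extract(wc, bf[, 1:2])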
Here is a plot that illustrates a component of the ecological niche of our species of interest.
library(dismo)
# extent of all points
e <- extent(SpatialPoints(bf[, 1:2]))
e
## class : Extent
## xmin : -156.75
## xmax : -64.4627
## ymin : 25.141
## ymax : 69.5
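The background sampling and the combination of presence and background data (sections 5.1.3 and 5.1.4 in the contents) are not reproduced in this excerpt. The data frame dw used to fit the models below is assumed to hold climate values for western presence points (pa=1) and for random background points (pa=0). A rough sketch (the longitude split at -102 and the sample sizes are assumptions, not the values that produced the output shown below):

# western sub-species observations
bfw <- bf[bf$lon <= -102, ]
dwp <- data.frame(extract(wc, bfw[, 1:2]))
# random background points from the same extent, with their climate values
set.seed(0)
bg <- sampleRandom(wc, nrow(bfw), ext=extent(SpatialPoints(bfw[, 1:2])))
# combine into one data frame with a presence/background indicator
dw <- rbind(cbind(pa=1, dwp), cbind(pa=0, data.frame(bg)))
dw <- na.omit(dw)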
5.2.1 CART
Let’s first look at a Classification and Regression Trees (CART) model.
library(rpart)
cart <- rpart(pa~., data=dw)
printcp(cart)
##
## Regression tree:
## rpart(formula = pa ~ ., data = dw)
##
## Variables actually used in tree construction:
## [1] bio10 bio12 bio18 bio19 bio4 bio5
##
## Root node error: 762.45/3246 = 0.23489
Question 1: Describe the conditions under which you have the highest probability of finding our beloved species?
The variable importance plot shows which variables are most important in fitting the model. This is computed by
randomizing each variable, one by one, and then computing the decline in model prediction performance.
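The Random Forest objects used below (crf, a classification forest, and trf, the tuning result) are not created in code shown in this excerpt. Assuming the randomForest package and the dw data frame, the fits would look roughly like:

library(randomForest)
# classification forest: presence/background as a factor response
fpa <- as.factor(dw[, 'pa'])
crf <- randomForest(dw[, 2:ncol(dw)], fpa)
# tune mtry for a regression forest on the 0/1 response
trf <- tuneRF(dw[, 2:ncol(dw)], dw[, 'pa'])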
varImpPlot(crf)
trf
## mtry OOBError
## 3 3 0.06661352
## 6 6 0.06789353
## 12 12 0.06947837
mt <- trf[which.min(trf[,2]), 1]
mt
## [1] 3
Question 2: What did tuneRF help us find? What does the value of mt represent?
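The regression Random Forest rrf used below is not shown either; presumably something like:

rrf <- randomForest(dw[, 2:ncol(dw)], dw[, 'pa'], mtry=mt)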
varImpPlot(rrf)
5.3 Predict
We can use the model to make predictions to any other place for which we have values for the predictor variables. Our
climate data is global so we could find suitable places for bigfoot in Australia. At first I only want to predict to our
study region, which I define as follows.
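The extent object ew used below is not defined in the excerpt; assuming it is the extent of the western observations (bfw, as sketched earlier):

ew <- extent(SpatialPoints(bfw[, 1:2]))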
5.3.1 Regression
rp <- predict(wc, rrf, ext=ew)
plot(rp)
Note that the regression predictions are well-behaved, in the sense that they are between 0 and 1. However, they are
continuous within that range, and if you wanted presence/absence, you would need a threshold. To get the optimal
threshold, you would normally have a hold out data set, but here I used the training data for simplicity.
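The evaluation object eva is also not created in code shown here; following the pattern used later for the eastern data, it would be:

eva <- evaluate(dw[dw$pa==1, ], dw[dw$pa==0, ], rrf)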
plot(eva, 'ROC')
tr <- threshold(eva)
tr
## kappa spec_sens no_omission prevalence equal_sens_spec
## thresholds 0.4869443 0.4869443 0.4754381 0.3756083 0.5191059
## sensitivity
## thresholds 0.7497767
plot(rp > tr[1, 'spec_sens'])
5.3.2 Classification
We can also use the classification Random Forest model to make a prediction.
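The prediction itself is not shown; presumably (the object name rc is an assumption):

rc <- predict(wc, crf, ext=ew)
plot(rc)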
5.4 Extrapolation
Now, let’s see if our model is general enough to predict the distribution of the Eastern species.
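The preparation of the eastern evaluation data (de) is not reproduced here; it presumably mirrors the western data preparation (the sample sizes below are placeholders, not the values that produced the output shown):

bfe <- bf[bf$lon > -102, ]
dep <- data.frame(extract(wc, bfe[, 1:2]))
set.seed(10)
bge <- sampleRandom(wc, nrow(bfe), ext=extent(SpatialPoints(bfe[, 1:2])))
de <- rbind(cbind(pa=1, dep), cbind(pa=0, data.frame(bge)))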
de <- na.omit(de)
eva2 <- evaluate(de[de$pa==1, ], de[de$pa==0, ], rrf)
eva2
## class : ModelEvaluation
## n presences : 1866
## n absences : 2978
## AUC : 0.4251416
## cor : -0.2611455
## max TPR+TNR at : 8e-04
plot(eva2, 'ROC')
Question 4: Why would it be that the model does not extrapolate well?
An important question in the biogeography of the western species is why it does not occur in Mexico. Or if it does,
where would that be?
Let’s see.
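The Mexico prediction is not included in this excerpt; a sketch (the boundary source and object names are assumptions):

mex <- getData('GADM', country='MEX', level=1)
pmex <- predict(wc, rrf, ext=mex)
pmex <- mask(pmex, mex)
plot(pmex)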
Question 5: Where in Mexico are you most likely to encounter western bigfoot?
We can also estimate range shifts due to climate change
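The future-climate prediction code is also not shown; a sketch using one CMIP5 scenario (the climate model, RCP, and year are placeholders):

fut <- getData('CMIP5', res=10, var='bio', rcp=85, model='AC', year=70)
names(fut) <- names(wc)   # predictor names must match those used to fit the model
futp <- predict(fut, rrf, ext=ew)
plot(futp)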
Question 6: Make a map to show where conditions are improving for western bigfoot, and where they are not. Is the
species headed toward extinction?
CHAPTER SIX: LOCAL REGRESSION
Regression models are typically "global". That is, all data are used simultaneously to fit a single model. In some
cases it can make sense to fit more flexible "local" models. Such models exist in a general regression framework (e.g.
generalized additive models), where "local" refers to the values of the predictor variables. In a spatial context, local refers
to location: rather than fitting a single regression model, it is possible to fit several models, one for each of (possibly
very many) locations. This technique is sometimes called "geographically weighted regression" (GWR).
GWR is a data exploration technique that can help us understand changes in the importance of different variables over space
(which may indicate that the model used is misspecified and can be improved).
There are two examples here: a short example with California precipitation data, and then a more elaborate example
with house price data.
library(rspatial)
counties <- sp_data('counties')
p <- sp_data('precipitation')
head(p)
## ID NAME LAT LONG ALT JAN FEB MAR APR MAY JUN
## 1 ID741 DEATH VALLEY 36.47 -116.87 -59 7.4 9.5 7.5 3.4 1.7 1.0
## 2 ID743 THERMAL/FAA AIRPORT 33.63 -116.17 -34 9.2 6.9 7.9 1.8 1.6 0.4
## 3 ID744 BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1
## 4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0
## 5 ID754 NILAND 33.28 -115.51 -18 9.0 8.0 9.0 3.0 0.0 1.0
## 6 ID758 EL CENTRO/NAF 32.82 -115.67 -13 9.8 1.6 3.7 3.0 0.4 0.0
## JUL AUG SEP OCT NOV DEC
## 1 3.7 2.8 4.3 2.2 4.7 3.9
## 2 1.9 3.4 5.3 2.0 6.3 5.5
## 3 1.9 9.2 6.5 5.0 4.8 9.7
## 4 2.4 2.6 8.3 5.4 7.7 7.3
## 5 8.0 9.0 7.0 8.0 7.0 9.0
## 6 3.0 10.8 0.2 0.0 3.3 1.4
plot(counties)
points(p[,c('LONG', 'LAT')], col='red', pch=20)
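The projection object alb (Teale Albers) used below is not defined in this excerpt; presumably something like:

alb <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +ellps=GRS80")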
sp <- p
coordinates(sp) = ~ LONG + LAT
crs(sp) <- "+proj=longlat +datum=NAD83"
spt <- spTransform(sp, alb)
ctst <- spTransform(counties, alb)
Each record represents a census “blockgroup”. The longitude and latitude of the centroids of each block group are
available. We can use that to make a map and we can also use these to link the data to other spatial data. For example
to get county-membership of each block group. To do that, let’s first turn this into a SpatialPointsDataFrame to find
out to which county each point belongs.
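The houses data themselves are loaded in code not shown here, presumably via sp_data (the exact dataset name is an assumption):

houses <- sp_data("houses1990.csv")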
library(sp)
coordinates(houses) <- ~longitude+latitude
Now get the county boundaries and assign the counties' CRS to the houses data, so that they match (we can do this because
both are in longitude/latitude).
library(raster)
crs(houses) <- crs(counties)
6.3 Summarize
We can summarize the data by county. First combine the extracted county data with the original data.
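The county membership (cnty) is obtained in code not shown here, presumably with a point-in-polygon overlay:

cnty <- over(houses, counties)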
hd <- cbind(data.frame(houses), cnty)
Income is harder because we have the median household income by blockgroup. But it can be approximated by first
computing total income by blockgroup, summing that, and dividing that by the total number of households.
# total income
hd$suminc <- hd$income * hd$households
# now use aggregate (similar to tapply)
csum <- aggregate(hd[, c('suminc', 'households')], list(hd$NAME), sum)
# divide total income by number of households
csum$income <- 10000 * csum$suminc / csum$households
# sort
csum <- csum[order(csum$income), ]
head(csum)
## Group.1 suminc households income
## 53 Trinity 11198.985 5156 21720.30
## 58 Yuba 43739.708 19882 21999.65
## 25 Modoc 8260.597 3711 22259.76
## 47 Siskiyou 38769.952 17302 22407.79
6.4 Regression
Before we make a regression model, let's first add some new variables that we might use, and then see if we can build
a regression model with house price as the dependent variable. The authors of the paper used a lot of log transforms, so
you can also try that.
hd$roomhead <- hd$rooms / hd$population
hd$bedroomhead <- hd$bedrooms / hd$population
hd$hhsize <- hd$population / hd$households
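The model fit itself does not appear in this excerpt; from the Call in the summary below it is:

m <- glm(houseValue ~ income + houseAge + roomhead + bedroomhead + population, data=hd)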
summary(m)
##
## Call:
## glm(formula = houseValue ~ income + houseAge + roomhead + bedroomhead +
## population, data = hd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1226134 -48590 -12944 34425 461948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.508e+04 2.533e+03 -25.686 < 2e-16 ***
## income 5.179e+04 3.833e+02 135.092 < 2e-16 ***
## houseAge 1.832e+03 4.575e+01 40.039 < 2e-16 ***
## roomhead -4.720e+04 1.489e+03 -31.688 < 2e-16 ***
## bedroomhead 2.648e+05 6.820e+03 38.823 < 2e-16 ***
## population 3.947e+00 5.081e-01 7.769 8.27e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6022427437)
##
## Null deviance: 2.7483e+14 on 20639 degrees of freedom
## Residual deviance: 1.2427e+14 on 20634 degrees of freedom
## AIC: 523369
##
## Number of Fisher Scoring iterations: 2
coefficients(m)
Then I write a function to get what I want from the regression (the coefficients in this case)
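A sketch of such a function, run for each county (the formula repeats the global model above; the original may have used a reduced one):

regfun <- function(x) {
  dat <- hd[hd$NAME == x, ]
  m <- glm(houseValue ~ income + houseAge + roomhead + bedroomhead + population, data=dat)
  coefficients(m)
}
countynames <- unique(hd$NAME)
res <- sapply(countynames, regfun)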
There clearly is variation in the coefficient (beta) for income. How does this look on a map?
First make a data.frame of the results.
Fix the counties object. There are too many counties because of the presence of islands. I first aggregate ("dissolve" in
GIS-speak) the counties such that a single county becomes a single (multi-)polygon.
dim(counties)
## [1] 68 5
dcounties <- aggregate(counties, vars='NAME')
## Warning in .local(x, ...): Use argument "by" instead of deprecated argument
## "vars"
dim(dcounties)
## [1] 58 1
Now we can merge this SpatialPolygonsDataFrame with the data.frame with the regression results.
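A sketch of that merge (resdf is a hypothetical data.frame built from the per-county coefficients computed above):

resdf <- data.frame(NAME=countynames, t(res))
cnres <- merge(dcounties, resdf, by='NAME')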
To show all parameters in a ‘conditioning plot’, we need to first scale the values to get similar ranges.
library(spdep)
nb <- poly2nb(cnres)
plot(cnres)
plot(nb, coordinates(cnres), add=T, col='red')
lw <- nb2listw(nb)
moran.test(cnres$income, lw)
##
## Moran I test under randomisation
##
## data: cnres$income
## weights: lw
##
## Moran I statistic standard deviate = 2.2473, p-value = 0.01231
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.173419996 -0.017543860 0.007220867
moran.test(cnres$roomhead, lw, na.action=na.omit)
##
## Moran I test under randomisation
##
## data: cnres$roomhead
## weights: lw
## omitted: 2
##
## Moran I statistic standard deviate = 1.3929, p-value = 0.08183
## alternative hypothesis: greater
## sample estimates:
Create a RasterLayer using the extent of the counties, and set an arbitrary resolution of 50 by 50 km cells.
library(raster)
# countiesTA: the counties transformed to planar Teale Albers coordinates, e.g. countiesTA <- spTransform(counties, alb)
r <- raster(countiesTA)
res(r) <- 50000
For each cell, we need to select a number of observations, let’s say within 50 km of the center of each cell (thus the
data that are used in different cells overlap). And let’s require at least 50 observations to do a regression.
First transform the houses data to Teale-Albers
Run the model for all cells if there are at least 50 observations within a radius of 50 km.
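That code is not reproduced in this excerpt; a sketch of the idea (the formula and object names repeat earlier assumptions):

housesTA <- spTransform(houses, alb)
hxy <- coordinates(housesTA)
# regression for one cell: use the observations within 50 km of the cell center
rfun <- function(cell) {
  cxy <- xyFromCell(r, cell)
  d <- sqrt((hxy[,1] - cxy[1])^2 + (hxy[,2] - cxy[2])^2)
  j <- which(d < 50000)
  if (length(j) < 50) return(NA)
  m <- glm(houseValue ~ income + houseAge + roomhead + bedroomhead + population, data=hd[j, ])
  coefficients(m)["income"]
}
inc <- sapply(1:ncell(r), rfun)
rinc <- setValues(r, inc)
plot(rinc)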
Moran(rinc)
## [1] 0.3271564
r <- raster(countiesTA)
ca <- rasterize(countiesTA, r)
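The spgwr calls themselves are not included in this excerpt; they would be along these lines (the formula, data object, and bandwidth selection are assumptions):

library(spgwr)
bw <- gwr.sel(houseValue ~ income + houseAge, data=housesTA)
gwr.model <- gwr(houseValue ~ income + houseAge, data=housesTA, bandwidth=bw)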
gwr returns a list-like object that includes (as first element) a SpatialPointsDataFrame that has the model
coefficients. Plot these using spplot, and after that, transfer them to a RasterBrick object.
To extract the SpatialPointsDataFrame:
sp <- gwr.model$SDF
spplot(sp)
Question 3: Briefly comment on the results and the differences (if any) with the two home-brew examples.
CHAPTER SEVEN: SPATIAL REGRESSION MODELS
7.1 Introduction
This chapter deals with the problem of inference in (regression) models with spatial data. Inference from regression
models with spatial data can be suspect. In essence this is because nearby things are similar, and it may not be
fair to consider individual cases as independent (they may be pseudo-replicates). Therefore, such models need to be
diagnosed before reporting them. Specifically, it is important to test for spatial autocorrelation in the residuals
(as these are supposed to be independent, not correlated). If the residuals are spatially autocorrelated, this indicates
that the model is misspecified. In that case you should try to improve the model by adding (and perhaps removing)
important variables. If that is not possible (either because there is no data available, or because you have no clue as
to what variable to look for), you can try formulating a regression model that controls for spatial autocorrelation. We
show some examples of that approach here.
library(rspatial)
h <- sp_data('houses2000')
I have selected some variables on housing and population. You can get more data from the American Fact Finder
http://factfinder2.census.gov (among other web sites).
library(raster)
dim(h)
## [1] 7049 29
names(h)
## [1] "TRACT" "GEOID" "label" "houseValue" "nhousingUn"
## [6] "recHouses" "nMobileHom" "yearBuilt" "nBadPlumbi" "nBadKitche"
## [11] "nRooms" "nBedrooms" "medHHinc" "Population" "Males"
## [16] "Females" "Under5" "MedianAge" "White" "Black"
## [21] "AmericanIn" "Asian" "Hispanic" "PopInHouse" "nHousehold"
## [26] "Families" "householdS" "familySize" "County"
Now we have the county outlines, but we also need to get the values of interest at the county level. Although it is
possible to do everything in one step in the aggregate function, I prefer to do this step by step. The simplest case is
where we can sum the numbers. For example for the number of houses.
In other cases we need to use a weighted mean. For example for houseValue
And merge the aggregated (from census tract to county level) attribute data with the aggregated polygons
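These aggregation steps are not reproduced in this excerpt; a condensed sketch of the idea (not the full set of variables aggregated in the original; other attributes such as age and nBedrooms would be handled similarly):

# dissolve the census tracts into counties
hh <- aggregate(h, by='County')
# summing: total number of housing units per county
d1 <- aggregate(data.frame(h)[, 'nhousingUn', drop=FALSE], list(County=h$County), sum, na.rm=TRUE)
# weighted mean: house value weighted by the number of households
d2 <- aggregate(list(hv=h$houseValue * h$nHousehold, hhld=h$nHousehold), list(County=h$County), sum, na.rm=TRUE)
d2$houseValue <- d2$hv / d2$hhld
# merge the aggregated attributes onto the county polygons
hh <- merge(hh, merge(d1, d2[, c('County', 'houseValue')], by='County'), by='County')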
Let's make some maps, at the original Census tract level. We are using a bit more advanced (and slower) plotting
methods here. First the house value, using a legend with 10 intervals.
library(latticeExtra)
## Loading required package: lattice
## Loading required package: RColorBrewer
grps <- 10
brks <- quantile(h$houseValue, 0:(grps-1)/(grps-1), na.rm=TRUE)
p + layer(sp.polygons(hh))
This takes very long; spplot (levelplot) is a bit slow when using a large dataset...
A map of the median household income.
p + layer(sp.polygons(hh))
Just for illustration, here is how you can do OLS with matrix algebra. First set up the data. I add a constant variable
‘1’ to X, to get an intercept.
y <- matrix(hh$houseValue)
X <- cbind(1, hh$age, hh$nBedrooms)
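The estimation step is not shown; with the matrices above, the OLS coefficients follow from the normal equations:

# b = (X'X)^-1 X'y
b <- solve(t(X) %*% X) %*% t(X) %*% y
b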
So, according to this simple model, "age" is highly significant: the older a house, the more expensive. You pay
1,269,475 dollars more for a house that is 100 years old than for a new house! The p-value for the number of
bedrooms is not impressive, but every bedroom adds about 200,000 dollars to the value of a house.
Question 1: What would the price be of a house built in 1999 with three bedrooms?
Let’s see if the errors (model residuals) appear to be randomly distributed in space.
What do you think? Is this random? Let's see what Mr. Moran would say. First make a neighborhoods list. I add two
links: between San Francisco and Marin County and vice versa (to account for the Golden Gate bridge).
library(spdep)
nb <- poly2nb(hh)
nb[[21]] <- sort(as.integer(c(nb[[21]], 38)))
nb[[38]] <- sort(as.integer(c(21, nb[[38]])))
nb
## Neighbour list object:
## Number of regions: 58
## Number of nonzero links: 278
## Percentage nonzero weights: 8.263971
## Average number of links: 4.793103
par(mai=c(0,0,0,0))
plot(hh)
plot(nb, coordinates(hh), col='red', lwd=2, add=TRUE)
We can use the neighbour list object to get the average value for the neighbors of each polygon.
lw <- nb2listw(nb)
Clearly, there is spatial autocorrelation. Our p-values and regression model coefficients cannot be trusted, so let's try
SAR models.
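The model fit is not reproduced here; from the Call shown in the summary below, and assuming the lagsarlm implementation in spatialreg (formerly in spdep), it would be:

library(spatialreg)
f1 <- houseValue ~ age + nBedrooms
m1s <- lagsarlm(f1, data=hh, listw=lw, tol.solve=1e-30)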
summary(m1s)
##
## Call:lagsarlm(formula = f1, data = hh, listw = lw, tol.solve = 1e-30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108145.2 -49816.3 -1316.3 44604.9 171536.0
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -418674.1 153693.6 -2.7241 0.006448
## age 5533.6 1698.2 3.2584 0.001120
## nBedrooms 127912.8 50859.7 2.5150 0.011903
##
## Rho: 0.77413, LR test value: 34.761, p-value: 3.7282e-09
## Asymptotic standard error: 0.08125
## z-value: 9.5277, p-value: < 2.22e-16
## Wald statistic: 90.778, p-value: < 2.22e-16
##
## Log likelihood: -727.9964 for lag model
## ML residual variance (sigma squared): 3871700000, (sigma: 62223)
## Number of observations: 58
## Number of parameters estimated: 5
## AIC: 1466, (AIC for lm: 1498.8)
## LM test for residual autocorrelation
## test value: 0.12431, p-value: 0.72441
print( p + layer(sp.polygons(hh)) )
Are the residuals spatially autocorrelated for either of these models? Let's plot them for the spatial error model.
print( p + layer(sp.polygons(hh)) )
7.6 Questions
Question 2: The last two maps still seem to show a lot of spatial autocorrelation. But according to the tests there is
none. Now why might that be?
Question 3: One of the most important, or perhaps THE most important aspect of modeling is variable selection. A
misspecified model is never going to be any good, no matter how much you do to, e.g., correct for spatial autocorrelation.
a) Which variables would you choose from the list?
b) Which new variables could you propose to create from the variables in the list.
c) Which other variables could you add, created from the geometries/location (perhaps other geographic data).
d) add a lot of variables and use stepAIC to select an ‘optimal’ OLS model
CHAPTER EIGHT: POINT PATTERN ANALYSIS
8.1 Introduction
We are using a dataset of crimes in a city. Start by reading in the data.
if (!require("rspatial")) devtools::install_github('rspatial/rspatial')
library(rspatial)
city <- sp_data('city')
crime <- sp_data('crime')
tb <- sort(table(crime$CATEGORY))[-1]
tb
##
## Arson Weapons Robbery
## 9 15 49
Let’s get the coordinates of the crime data, and for this exercise, remove duplicate crime locations. These are the
‘events’ we will use below (later we’ll go back to the full data set).
xy <- coordinates(crime)
dim(xy)
## [1] 2661 2
xy <- unique(xy)
dim(xy)
## [1] 1208 2
head(xy)
## coords.x1 coords.x2
## [1,] 6628868 1963718
## [2,] 6632796 1964362
## [3,] 6636855 1964873
## [4,] 6626493 1964343
## [5,] 6639506 1966094
## [6,] 6640478 1961983
# mean center
mc <- apply(xy, 2, mean)
# standard distance
sd <- sqrt(sum((xy[,1] - mc[1])^2 + (xy[,2] - mc[2])^2) / nrow(xy))
Plot the data to see what we've got. I add a summary circle (as in Fig 5.2) by dividing the circle in 360 points and
computing the bearing in radians. I do not think this is particularly helpful, but it might be in other cases. And it is always
fun to figure out how to do this.
# make a circle
bearing <- 1:360 * pi/180
cx <- mc[1] + sd * cos(bearing)
cy <- mc[2] + sd * sin(bearing)
circle <- cbind(cx, cy)
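# the base plot of the city and the crime locations is not included in this
# excerpt; something like this is assumed before the circle is added:
plot(city, col='light blue')
points(crime, cex=.5, col='red')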
lines(circle, col='red', lwd=2)
8.3 Density
Here is a basic approach to computing point density.
r <- raster(city)
res(r) <- 1000
r
## class : RasterLayer
## dimensions : 15, 34, 510 (nrow, ncol, ncell)
## resolution : 1000, 1000 (x, y)
## extent : 6620591, 6654591, 1956519, 1971519 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=lcc +lat_1=38.33333333333334 +lat_2=39.83333333333334 +lat_0=37.66666666666666 +lon_0=-122 +x_0=2000000 +y_0=500000.0000000001 +datum=NAD83
To find the cells that are in the city, and for easy display, I create polygons from the RasterLayer.
r <- rasterize(city, r)
plot(r)
quads <- as(r, 'SpatialPolygons')
plot(quads, add=TRUE)
points(crime, col='red', cex=.5)
The number of events in each quadrat can be counted using the 'rasterize' function. That function can be used to
summarize the number of points within each cell, but also to compute statistics based on the 'marks' (attributes). For
example we could compute the number of different crime types by changing the 'fun' argument to another function
(see ?rasterize).
nc <- rasterize(coordinates(crime), r, fun='count', background=0)
plot(nc)
plot(city, add=TRUE)
nc has crime counts. As we only have data for the city, the areas outside of the city need to be excluded. We can do
that with the mask function (see ?mask).
ncrimes <- mask(nc, r)
plot(ncrimes)
Does this look like a pattern you would have expected? Now compute average number of cases per quadrat.
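The frequency table f used below is assumed to come from the freq function, restricted to the cells within the city:

f <- freq(ncrimes, useNA='no')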
# number of quadrats
quadrats <- sum(f[,2])
# number of cases
cases <- sum(f[,1] * f[,2])
mu <- cases / quadrats
mu
## [1] 9.261484
ff <- data.frame(f)
colnames(ff) <- c('K', 'X')
ff$Kmu <- ff$K - mu
ff$Kmu2 <- ff$Kmu^2
ff$XKmu2 <- ff$Kmu2 * ff$X
head(ff)
## K X Kmu Kmu2 XKmu2
## 1 0 48 -9.261484 85.77509 4117.2042
## 2 1 29 -8.261484 68.25212 1979.3115
## 3 2 24 -7.261484 52.72915 1265.4996
## 4 3 22 -6.261484 39.20618 862.5360
## 5 4 19 -5.261484 27.68321 525.9811
## 6 5 16 -4.261484 18.16025 290.5639
# sample variance of the number of cases per quadrat (s2), assumed to be computed from the table above
s2 <- sum(ff$XKmu2) / (sum(ff$X) - 1)
VMR <- s2 / mu
VMR
## [1] 29.86082
Question 2: What does this VMR score tell us about the point pattern?
d <- dist(xy)
class(d)
## [1] "dist"
I want to coerce the dist object to a matrix, and ignore distances from each point to itself (the zeros on the diagonal).
dm <- as.matrix(d)
dm[1:5, 1:5]
## 1 2 3 4 5
## 1 0.000 3980.843 8070.429 2455.809 10900.016
## 2 3980.843 0.000 4090.992 6303.450 6929.439
## 3 8070.429 4090.992 0.000 10375.958 2918.349
## 4 2455.809 6303.450 10375.958 0.000 13130.236
## 5 10900.016 6929.439 2918.349 13130.236 0.000
diag(dm) <- NA
dm[1:5, 1:5]
## 1 2 3 4 5
## 1 NA 3980.843 8070.429 2455.809 10900.016
## 2 3980.843 NA 4090.992 6303.450 6929.439
## 3 8070.429 4090.992 NA 10375.958 2918.349
## 4 2455.809 6303.450 10375.958 NA 13130.236
## 5 10900.016 6929.439 2918.349 13130.236 NA
To get, for each point, the minimum distance to another event, we can use the 'apply' function. Think of the rows as
each point, and the columns as all other points (vice versa could also work).
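The call itself is not shown; given the object name used below, it is presumably:

dmin <- apply(dm, 1, min, na.rm=TRUE)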
Now it is trivial to get the mean nearest neighbour distance according to formula 5.5, page 131.
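For example (the object name is an assumption):

mdmin <- mean(dmin)
mdmin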
Do you want to know, for each point, Which point is its nearest neighbour? Use the ‘which.min’ function (but note
that this ignores the possibility of multiple points at the same minimum distance).
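For example (again, the name wdmin is an assumption):

wdmin <- apply(dm, 1, which.min)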
And what are the most isolated cases? That is, those furthest away from their nearest neighbor. I plot the top 25. A bit
complicated.
plot(city)
points(crime, cex=.1)
ord <- rev(order(dmin))
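The rest of the plotting code is not reproduced here; a sketch that highlights the 25 most isolated events and their nearest neighbours:

far25 <- ord[1:25]
points(xy[far25, ], col='blue', pch=20)
points(xy[wdmin[far25], ], col='red')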
Note that some points, but actually not that many, are both among the isolated points and the neighbor of another isolated point.
Now on to the G function
max(dmin)
## [1] 1829.738
# get the unique distances (for the x-axis)
distance <- sort(unique(round(dmin)))
# compute how many cases there are with distances smaller than each x
Gd <- sapply(distance, function(x) sum(dmin < x))
The steps are so small in our data that you can hardly see the difference.
I use the centers of previously defined raster cells to compute the F function.
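A sketch of that computation (names are assumptions; Fd is the proportion of cell centers with an event within distance x):

ep <- rasterToPoints(r)[, 1:2]
# distance from each cell center to the nearest event
emptyd <- apply(ep, 1, function(z) min(sqrt((xy[,1] - z[1])^2 + (xy[,2] - z[2])^2)))
Fdistance <- sort(unique(round(emptyd)))
Fd <- sapply(Fdistance, function(x) sum(emptyd < x)) / length(emptyd)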
legend(1200, .3,
c(expression(italic("G")["d"]), expression(italic("F")["d"]), 'expected'),
lty=1, col=c('red', 'blue', 'black'), lwd=2, bty="n")
Question 3: What does this plot suggest about the point pattern?
Finally, let’s compute K. Note that I use the original distance matrix ‘d’ here.
Question 4: Create a single random pattern of events for the city, with the same number of events as the crime data
(object xy). Use function ‘spsample’
Question 5: Compute the G function, and plot it on a single plot, together with the G function for the observed crime
data, and the theoretical expectation (formula 5.12).
Question 6: (Difficult!) Do a Monte Carlo simulation (page 149) to see if the ‘mean nearest distance’ of the observed
crime data is significantly different from a random pattern. Use a ‘for loop’. First write ‘pseudo-code’. That is, say in
natural language what should happen. Then try to write R code that implements this.
library(spatstat)
We start with making a Kernel Density raster. I first create a 'ppp' (point pattern) object, as defined in the spatstat
package.
A ppp object has the coordinates of the points and the analysis 'window' (study region). To assign the point locations
we need to extract the coordinates from our SpatialPoints object. To set the window, we first need to coerce our
SpatialPolygons into an 'owin' object. We need a function from the maptools package for this coercion.
Coerce from SpatialPolygons to an object of class “owin” (observation window)
library(maptools)
cityOwin <- as.owin(city)
class(cityOwin)
## [1] "owin"
cityOwin
## window: polygonal boundary
## enclosing rectangle: [6620591, 6654380] x [1956729.8, 1971518.9] units
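The creation of the point pattern object is not shown in this excerpt; given the objects used below, it is presumably:

pts <- coordinates(crime)
p <- ppp(pts[,1], pts[,2], window=cityOwin)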
Note the warning message about ‘illegal’ points. Do you see them and do you understand why they are illegal?
Having all the data well organized, it is now easy to compute Kernel Density
ds <- density(p)
class(ds)
## [1] "im"
plot(ds, main='crime density')
Density is the number of points per unit area. Let's check if the numbers make sense, by adding them up and multiplying
with the area of the raster cells. I use raster package functions for that.
nrow(pts)
## [1] 2661
r <- raster(ds)
s <- sum(values(r), na.rm=TRUE)
s * prod(res(r))
## [1] 2640.556
Looks about right. We can also get the information directly from the “im” (image) object
str(ds)
## List of 10
## $ v : num [1:128, 1:128] NA NA NA NA NA NA NA NA NA NA ...
## $ dim : int [1:2] 128 128
## $ xrange: num [1:2] 6620591 6654380
## $ yrange: num [1:2] 1956730 1971519
## $ xstep : num 264
## $ ystep : num 116
## $ xcol : num [1:128] 6620723 6620987 6621251 6621515 6621779 ...
## $ yrow : num [1:128] 1956788 1956903 1957019 1957134 1957250 ...
## $ type : chr "real"
## $ units :List of 3
Here's another, lengthy, example of generalization. We can interpolate population density from (2000) census data,
assigning the values to the centroid of a polygon (as explained in the book, but not a great technique). We use a
shapefile with census data.
To compute population density for each census block, we first need to get the area of each polygon. I transform density
from persons per square foot to persons per square mile, and then compute population density from POP2000 and the area.
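A sketch of these steps (the dataset name is an assumption, and the conversion assumes a CRS with units in feet):

census <- sp_data("census2000.rds")
census$area <- area(census)               # area in square feet
census$area <- census$area / 27878400     # square feet per square mile (5280^2)
census$dens <- census$POP2000 / census$area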
Now to get the centroids of the census blocks we can use the 'coordinates' function again. Note that it actually does
something quite different (with a SpatialPolygons* object) than in the case above (with a SpatialPoints* object).
p <- coordinates(census)
head(p)
## [,1] [,2]
## 0 6666671 1991720
## 1 6655379 1986903
## 2 6604777 1982474
## 3 6612242 1981881
## 4 6613488 1986776
## 5 6616743 1986446
plot(census)
points(p, col='red', pch=20, cex=.25)
plot(win, add=TRUE, border='blue', lwd=3)
Now we can use ‘Smooth.ppp’ to interpolate. Population density at the points is referred to as the ‘marks’
Note the warning message: “1 point was rejected as lying outside the specified window”. That is odd, there is a
polygon that has a centroid that is outside of the polygon. This can happen with, e.g., kidney shaped polygons.
Let’s find and remove this point that is outside the study area.
plot(census)
points(sp)
points(sp[!i,], col='red', cex=3, pch=20)
You can zoom in using the code below. After running the next line, click on your map twice to zoom to the red dot,
otherwise you cannot continue:
zoom(census)
And add the red points again
points(sp[!i,], col='red')
To only use points that intersect with the window polygon, that is, where ‘i == TRUE’:
s <- Smooth.ppp(pp)
## Warning: Cross-validation criterion was minimised at right-hand end of
## interval [89.7, 3350]; use arguments hmin, hmax to specify a wider interval
plot(s)
plot(city, add=TRUE)
Population density could establish the “population at risk” (to commit a crime) for certain crimes, but not for others.
Maps with the city limits and the incidence of ‘auto-theft’, ‘drunk in public’, ‘DUI’, and ‘Arson’.
Create a marked point pattern object (ppp) for all crimes. It is important to coerce the marks to a factor variable.
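A sketch of how this object might be created (and then split by category, since the code below indexes spp by crime type):

cxy <- coordinates(crime)
mpp <- ppp(cxy[,1], cxy[,2], window=cityOwin, marks=factor(crime$CATEGORY))
spp <- split(mpp)   # a list of point patterns, one per crime category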
plot(density(spp[1:4]), main='')
And produce K-plots (with an envelope) for ‘drunk in public’ and ‘Arson’. Can you explain what they mean?
spatstat.options(checksegments = FALSE)
ktheft <- Kest(spp$"Auto Theft")
ketheft <- envelope(spp$"Auto Theft", Kest)
## Generating 99 simulations of CSR ...
## 1, 2, 3, ..., 98, 99.
##
## Done.
ktheft <- Kest(spp$"Arson")
ketheft <- envelope(spp$"Arson", Kest)
## Generating 99 simulations of CSR ...
## 1, 2, 3, ..., 98, 99.
##
## Done.
par(mfrow=c(1,2))
plot(ktheft)
plot(ketheft)
Let’s try to answer the question you have been wanting to answer all along. Is population density a good predictor of
being (booked for) “drunk in public” and for “Arson”? One approach is to do a Kolmogorov-Smirnov (‘kstest’) on
‘Drunk in Public’ and ‘Arson’, using population density as a covariate:
plot(ekc)
Much more about point pattern analysis with spatstat is available here