Spatial Analysis
Lovelace, Robin
Cheshire, James
Contents
Part I: Introduction
Basic plotting
Attribute data
Changing projection
Attribute joins
R quick reference
Acknowledgements
References
Part I: Introduction
This tutorial is an introduction to spatial data in R and map making with R's base graphics and the popular
graphics package ggplot2. It assumes no prior knowledge of spatial data analysis, but prior understanding of the
R command line would be beneficial. For people new to R, we recommend working through an 'Introduction to
R' type tutorial, such as A (very) short introduction to R (Torfs and Brauer, 2012) or the more geographically
inclined Short introduction to R (Harris, 2012).
Building on such background material, the following set of exercises is concerned with specific functions for
spatial data and visualisation. It is divided into five parts:
1. Introduction, which provides a guide to R's syntax and preparing for the tutorial
2. Spatial data in R, which describes basic spatial functions in R
3. Manipulating spatial data, which includes changing projection, clipping and spatial joins
4. Map making with ggplot2, a recent graphics package for producing beautiful maps quickly
5. Taking spatial analysis in R further, a compilation of resources for furthering your skills
In the above code we first created a new object that we have called x. Any name could have been used, like
xBumkin, but x works just fine here, although it is good practice to give your objects meaningful names. Note
the use of the <- arrow symbol, which tells R to create a new object. We will be using this symbol a lot in the
tutorial (tip: typing Alt - on the keyboard will create it in RStudio.). Each time it is used, a new object is
created (or an old one is overwritten) with a name of your choosing.
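As a minimal sketch of assignment with the <- symbol, the lines below create an object and then overwrite it. The values are made up for illustration and are not the tutorial's own example.

```r
# Hypothetical values, chosen only to illustrate assignment
x <- c(2, 4, 6)   # create a new object called x
x                 # typing the name prints the object
x <- x * 10       # reusing the name overwrites the old object
x
```

Running the last line again would multiply x by ten once more, because each assignment replaces whatever the name held before.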
To distinguish between prose and code, please be aware of the following typographic conventions: R code (e.g.
plot(x, y)) is written in a monospace font and package names (e.g. rgdal) are written in bold. Blocks of
code such as:
c(1:3, 5)^2
## [1]  1  4  9 25
are compiled in-line: the ## indicates this is output from R. Some of the output from the code below is quite
long so some is omitted. It should also be clear when we have decided to omit an image to save space. All images
in this document are small and low-quality to save space; they should display better on your computer screen
and can be saved at any resolution. The code presented here is not the only way to do things: we encourage
so we will focus on these elements in the next two parts of the tutorial, before focussing on creating attractive
maps in Part IV.
In the code above dsn stands for data source name and is an argument of the function readOGR. Note that
each new argument is separated by a comma. The dsn argument, in this case, specifies the directory in which the
dataset is stored. R functions have a default order of arguments, so dsn = does not actually need to be typed. If
the data were stored in the current working directory, one could use readOGR(".", "london_sport"). For
clarity, it is good practice to include argument names, such as dsn, when learning new functions.
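The point about argument order can be tried without any spatial data. The sketch below uses seq(), chosen only because it ships with base R; its first three arguments are from, to and by.

```r
# The same call written three ways
by_name     <- seq(from = 1, to = 10, by = 3)  # arguments named explicitly
by_position <- seq(1, 10, 3)                   # relies on the default argument order
reordered   <- seq(by = 3, to = 10, from = 1)  # named arguments may come in any order

identical(by_name, by_position)
identical(by_name, reordered)
```

Naming arguments costs a few keystrokes but makes the call readable without consulting the help page, which is the reason for spelling out dsn while learning.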
The next argument is a character string. This is simply the name of the file required. There is no need to add a
file extension (e.g. .shp) in this case. The files beginning london_sport from the example dataset contain the
borough population and the percentage of the population participating in sporting activities, taken from
the Active People Survey. The boundary data is from the Ordnance Survey.
For information about how to load different types of spatial data, the help documentation for readOGR is a good
place to start. This can be accessed from within R by typing ?readOGR. For another worked example, in which a
GPS trace is loaded, please see Cheshire and Lovelace (2014).
Basic plotting
We have now created a new spatial object called sport from the london_sport shapefile. Spatial objects are
made up of a number of different slots, mainly the attribute slot and the geometry slot. The attribute slot can
be thought of as an attribute table, and the geometry slot records where the spatial object (and its attributes) lies in
space. Let's now analyse the sport object with some basic commands:
head(sport@data, n = 2)
##   ons_label                 name Partic_Per Pop_2001
## 0      00AF              Bromley       21.7   295535
## 1      00BD Richmond upon Thames       26.6   172330
mean(sport$Partic_Per)
## [1] 20.05
Take a look at this output and notice the table format of the data and the column names. There are two
important symbols at work in the above block of code: the @ symbol in the first line of code is used to refer to
the attribute slot of the dataset; the $ symbol refers to a specific variable (column name) in the attribute slot of
the dataset, which was identified from the result of running the first line of code. If you are using RStudio, test
out the auto-completion functionality by hitting tab before completing the command - this can save you a lot of
time in the long run.
The head function in the first line of the code above simply means show the first few lines of data, i.e. the
head. Its default is to output the first 6 rows of the dataset (try simply head(sport@data)), but we can specify
the number of lines with n = 2 after the comma. The second line of the code above calculates the mean value
of the variable Partic_Per (sports participation per 100 people) across all of the zones in the sport object. To
explore the sport object further, try typing nrow(sport) and record how many zones the dataset contains. You
can also try ncol(sport).
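These commands behave the same way on an ordinary data frame. The sketch below builds a small stand-in table from five rows that appear in this tutorial's outputs; the object name boroughs is an assumption, and the real attribute table has one row per zone.

```r
# Five boroughs copied from the outputs shown in this tutorial
boroughs <- data.frame(
  ons_label  = c("00AF", "00BD", "00AQ", "00BB", "00AA"),
  name       = c("Bromley", "Richmond upon Thames", "Harrow", "Newham",
                 "City of London"),
  Partic_Per = c(21.7, 26.6, 14.8, 13.1, 9.1),
  Pop_2001   = c(295535, 172330, 206822, 243884, 7181),
  stringsAsFactors = FALSE
)

head(boroughs, n = 2)      # first two rows only
mean(boroughs$Partic_Per)  # a single number: the mean across all five rows
nrow(boroughs)             # number of rows
ncol(boroughs)             # number of columns
```

Because only five zones are included, the mean here differs from the value reported for the full dataset.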
Now that we have seen something of the attribute slot of the spatial dataset, let us look at sport's geometry data,
which describes where the polygons are located in space:
plot(sport)
plot is one of the most useful functions in R, as it changes its behaviour depending on the input data (this is
called polymorphism by computer scientists). Inputting another dataset such as plot(sport@data) will generate
an entirely different type of plot. Thus R is intelligent at guessing what you want to do with the data you
provide it with.
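To make the idea of polymorphism concrete, here is a minimal sketch of a function that chooses its behaviour from the class of its input. describe() is a hypothetical function invented for this example; it only illustrates the general mechanism and does not show how plot or the spatial packages are implemented.

```r
# describe() is a made-up generic: R picks a method based on the class of x
describe <- function(x) UseMethod("describe")
describe.default    <- function(x) "a generic object"
describe.numeric    <- function(x) paste("a numeric vector of length", length(x))
describe.data.frame <- function(x) {
  paste("a table with", nrow(x), "rows and", ncol(x), "columns")
}

describe(c(1.5, 2.5, 3.5))               # uses describe.numeric
describe(data.frame(a = 1:3, b = 4:6))   # uses describe.data.frame
describe("some text")                    # falls back to describe.default
```

The caller types the same function name every time; the class of the argument decides which version runs.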
R has powerful subsetting capabilities that can be accessed very concisely using square brackets, as shown in the
following example:
# select rows from attribute slot of sport object, where sports
# participation is less than 15.
sport@data[sport$Partic_Per < 15, ]
##    ons_label           name Partic_Per Pop_2001
## 17      00AQ         Harrow       14.8   206822
## 21      00BB         Newham       13.1   243884
## 32      00AA City of London        9.1     7181
The above line of code asked R to select rows from the sport object where sports participation is lower than
15, in this case rows 17, 21 and 32, which are Harrow, Newham and the City of London respectively. The square
brackets work as follows: anything before the comma refers to the rows that will be selected, anything after the
comma refers to the columns that should be returned. For example, if the dataset had 1000 columns
and you were only interested in the first two columns, you could specify 1:2 after the comma. The : symbol
simply means 'to', i.e. columns 1 to 2. Try experimenting with the square brackets notation (e.g. guess the
result of sport@data[1:2, 1:3] and test it): it will be useful.
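The row and column logic can be practised on a plain data frame. The sketch below reuses five rows taken from the outputs in this tutorial; the object name boroughs is an assumption made for illustration.

```r
# Stand-in attribute table built from rows shown in this tutorial
boroughs <- data.frame(
  ons_label  = c("00AF", "00BD", "00AQ", "00BB", "00AA"),
  name       = c("Bromley", "Richmond upon Thames", "Harrow", "Newham",
                 "City of London"),
  Partic_Per = c(21.7, 26.6, 14.8, 13.1, 9.1),
  Pop_2001   = c(295535, 172330, 206822, 243884, 7181),
  stringsAsFactors = FALSE
)

low <- boroughs[boroughs$Partic_Per < 15, ]  # rows meeting the condition, all columns
low$name
boroughs[1:2, 1:3]                           # rows 1 to 2, columns 1 to 3
boroughs[boroughs$Partic_Per > 25, "name"]   # a single column, selected by name
```

Leaving the space after the comma empty, as in the first subset, returns every column.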
So far we have been interrogating only the attribute slot (@data) of the sport object, but the square brackets
can also be used to subset spatial datasets, i.e. the geometry slot. Using the same logic as before try to plot a
subset of zones with high sports participation.
# plot zones from the sport object where sports participation is greater than
# 25.
plot(sport[sport$Partic_Per > 25, ]) # output not shown in tutorial
This is useful, but it would be great to see these sporty areas in context. To do this, simply use the add = TRUE
argument after the initial plot. (add = T would also work, but we like to spell things out in this tutorial for
clarity.) What does the col argument refer to in the block below? It should be obvious (see Figure 2).
plot(sport)
plot(sport[sport$Partic_Per > 25, ], col = "blue", add = TRUE)
Figure 2: Preliminary plot of London with areas of high sports participation highlighted in blue
Congratulations! You have just interrogated and visualised a spatial dataset: what kind of places have high levels
of sports participation? The map tells us. Do not worry for now about the intricacies of how this was achieved:
you have learned vital basics of how R works as a language; we will cover this in more detail in subsequent
sections.
While we are on the topic of loading data, it is worth pointing out that R can save and load data efficiently into
its own data format (.RData). Try save(sport, file = "sport.RData") and see what happens. If you type
rm(sport) (which removes the object) and then load("sport.RData") you should see how this works. sport
will disappear from the workspace and then reappear.
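The round trip can be sketched with any R object. The example below uses a small stand-in data frame and a temporary file, both assumptions made so that the code runs without the tutorial's data.

```r
# A stand-in object; any R object can be saved the same way
sport_demo <- data.frame(name = c("Bromley", "Harrow"),
                         Partic_Per = c(21.7, 14.8))
path <- file.path(tempdir(), "sport_demo.RData")  # temporary location for this sketch

save(sport_demo, file = path)  # write the object to disk
rm(sport_demo)                 # remove it from the workspace
exists("sport_demo")           # the object is gone
load(path)                     # read it back under its original name
exists("sport_demo")           # and it has reappeared
```

Note that load() restores the object under the name it had when it was saved, so no assignment is needed.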
Attribute data
All shapefiles have both attribute table and geometry data. These are automatically loaded with readOGR. The
loaded attribute data can be treated the same as an R data frame.
R deliberately hides the geometry of spatial data unless you print the entire object (try typing print(sport)).
Let's take a look at the headings of sport, using the following command: names(sport). Remember, the attribute
data contained in spatial objects are kept in a slot that can be accessed using the @ symbol: sport@data. This
is useful if you do not wish to work with the spatial components of the data at all times.
Type summary(sport) to get some additional information about the sport data object. Spatial objects in R
contain much additional information:
summary(sport)
## (output omitted)
The above output tells us that sport is a special spatial class, in this case a SpatialPolygonsDataFrame,
meaning it is composed of various polygons, each of which has attributes. This is the typical class of data
found in administrative zones. The coordinates tell us what the maximum and minimum x and y values are,
for plotting. Finally, we are told something of the coordinate reference system with the Is projected and
proj4string lines. In this case, we have a projected system, which means it is a Cartesian reference system,
relative to some point on the surface of the Earth. We will cover reprojecting data in the next part of the
tutorial.
Changing projection
First things first, before we start data manipulation we will check the reference system of our spatial datasets.
You may have noticed the word proj4string in the summary of the sport object above. This represents the
coordinate reference system used in the data. In this file it has been incorrectly specified so we must change it
with the following:
proj4string(sport) <- CRS("+init=epsg:27700")
You will see a warning. This simply states that you are changing the coordinate reference system, not reprojecting
the data. R uses EPSG codes to refer to different coordinate reference systems. EPSG:27700 is the code for British
National Grid. If we wanted to reproject the data into something like WGS84 for latitude and longitude, we
would use the following code:
sport.wgs84 <- spTransform(sport, CRS("+init=epsg:4326"))
The above line of code uses the function spTransform, from the sp package, to convert the sport object into a
new form, with the Coordinate Reference System (CRS) specified as WGS84. The different EPSG codes are a bit
of a hassle to remember, but you can search for them at spatialreference.org.
Attribute joins
Attribute joins are used to link additional pieces of information to our polygons. In the sport object, for example,
we have 5 attribute variables, which can be found by typing names(sport). But what happens when we want to
add an additional variable from an external data table? We will use the example of recorded crimes by borough
to demonstrate this.
To reaffirm our starting point, let's re-load the london_sport shapefile as a new object and plot it. This is
identical to the sport object in the first instance, but we will give it a new name, in case we ever need to re-use
sport. We will call this new object lnd, short for London:
plot(lnd)
nrow(lnd)
## [1] 33
The aspatial dataset we are going to join to the lnd object is a dataset on recorded crimes. This dataset currently
resides in a comma delimited (.csv) file called mps-recordedcrime-borough with each row representing a single
reported crime. We are going to use a function called aggregate to pre-process this dataset ready to join to our
spatial lnd dataset. First we will create a new object called crimeDat to store this data.
# Create new crimeDat object from crime data and gain an understanding of it
crimeDat <- read.csv("data/mps-recordedcrime-borough.csv", fileEncoding = "UCS-2LE")
head(crimeDat) # display first 6 lines of the crimeDat object (not shown)
summary(crimeDat$MajorText) # summarise the column 'MajorText' for the crimeDat object
# Extract 'Theft & Handling' crimes from crimeDat object and save these as
# crimeTheft
crimeTheft <- crimeDat[crimeDat$MajorText == "Theft & Handling", ]
head(crimeTheft, 2) # take a look at the result (replace 2 with 10 to see more rows)
# Calculate the sum of the crime count for each district and save result as
# a new object
crimeAg <- aggregate(CrimeCount ~ Spatial_DistrictName, FUN = sum, data = crimeTheft)
# Show the first two rows of the aggregated crime data
head(crimeAg, 2)
There is a lot going on in the above block of code and you should not expect to understand all of it upon first
try: simply typing the commands and thinking briefly about the outputs is all that is needed at this stage to
improve your intuitive understanding of R. It is worth pointing out a few things that you may not have seen
before that will likely be useful in the future:
- In the first line of code the fileEncoding argument is used. This is rarely necessary, but in this case the
  file comes in an unusual encoding. Nine times out of ten you can omit this argument, but it's worth knowing
  about.
- The logical test inside the square brackets (crimeDat$MajorText == "Theft & Handling") selects only those
  observations that meet a specific condition, in this case all crimes involving Theft and Handling.
- The ~ symbol means 'by': we aggregated the CrimeCount variable by the district name.
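The filter-then-aggregate pattern can be tried on a tiny table. In the sketch below the column names follow the tutorial, but the rows and counts are made up for illustration.

```r
# Made-up records: column names follow the tutorial, the values do not
crimes <- data.frame(
  Spatial_DistrictName = c("Barnet", "Barnet", "Brent", "Brent", "Camden"),
  MajorText  = c("Theft & Handling", "Burglary", "Theft & Handling",
                 "Theft & Handling", "Theft & Handling"),
  CrimeCount = c(10, 4, 7, 3, 12),
  stringsAsFactors = FALSE
)

# Keep only the theft rows, then sum CrimeCount by district
theft   <- crimes[crimes$MajorText == "Theft & Handling", ]
theftAg <- aggregate(CrimeCount ~ Spatial_DistrictName, FUN = sum, data = theft)
theftAg
```

The result has one row per district, with the burglary record excluded before the sums were taken.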
Now that we have crime data at the borough level (Spatial_DistrictName), the challenge is to join it to the
lnd object. We will base our join on the Spatial_DistrictName variable from the crimeAg object and the
name variable from the lnd object. It is not always straightforward to join objects based on names, as the names
do not always match. Let us see which names in the crimeAg object match the spatial data object, lnd:
# Compare the name column in lnd to Spatial_DistrictName column in crimeAg
# to see which rows match.
lnd$name %in% crimeAg$Spatial_DistrictName
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
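A TRUE/FALSE vector is hard to read at a glance, so it helps to turn it into the names that failed to match. The sketch below uses two short, hypothetical name vectors standing in for lnd$name and crimeAg$Spatial_DistrictName.

```r
# Hypothetical stand-ins for the two name columns
zone_names  <- c("Barnet", "Brent", "City of London")
crime_names <- c("Barnet", "Brent", "NULL")

matched <- zone_names %in% crime_names       # one TRUE/FALSE per element of zone_names
zone_names[!matched]                         # zones with no partner in the crime table
crime_names[!(crime_names %in% zone_names)]  # the reverse check
```

Running the check in both directions shows which entries on each side would be left out of a join.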
##  [1] ...                      "Barnet"
##  [3] ...                      "Brent"
##  [5] ...                      "Camden"
##  [7] ...                      "Ealing"
##  [9] ...                      "Greenwich"
## [11] ...                      "Hammersmith and Fulham"
## [13] ...                      "Harrow"
## [15] ...                      "Hillingdon"
## [17] ...                      "Islington"
## [19] ...                      "Kingston upon Thames"
## [21] ...                      "Lewisham"
## [23] ...                      "Newham"
## [25] "NULL"                   "Redbridge"
## [27] "Richmond upon Thames"   "Southwark"
## [29] "Sutton"                 "Tower Hamlets"
## [31] "Waltham Forest"         "Wandsworth"
## [33] "Westminster"
The above code loads the data correctly, but also shows that there are problems with it: the Coordinate Reference
System (CRS) of stations differs from that of our lnd object. OSGB 1936 (or EPSG 27700) is the official CRS
for the UK, so we will convert the dataset to this:
# Create new stations27700 object, which is the stations object reprojected into
# OSGB36
stations27700 <- spTransform(stations, CRSobj = CRS(proj4string(lnd)))
stations <- stations27700 # overwrite the stations object with stations27700
rm(stations27700) # remove the stations27700 object to clear up
plot(lnd) # plot London for context (see figure 4 below)
points(stations) # overlay the station points on the previous plot (shown in figure 4)
Spatial aggregation
As with R's very terse code for spatial subsetting, the base function aggregate (which provides summaries of
variables based on some grouping variable) also behaves differently when the inputs are spatial objects.
stations.c <- aggregate(x = stations["CODE"], by = lnd, FUN = length)
head(stations.c@data)
##   CODE
## 0   48
## 1   22
## 2   43
## 3   18
## [1] "Railway Station"
## [2] "Rapid Transit Station"
## [3] "Roundabout, A Road Dual Carriageway"
## [4] "Roundabout, A Road Single Carriageway"
## [5] "Roundabout, B Road Dual Carriageway"
## [6] "Roundabout, B Road Single Carriageway"
## [7] "Roundabout, Minor Road over 4 metres wide"
## [8] "Roundabout, Primary Route Dual Carriageway"
## [9] "Roundabout, Primary Route Single C'way"
In the above block of code, we first identified which types of transport points are present in the map with levels
(this command only works on factor data, and tells us the unique names of the factors that the vector can hold).
Next we select a subset of stations using a new command, grepl, to determine which points we want to plot.
Note that grepl's first argument is a text string (hence the quote marks) and that the second is a factor (try
typing class(stations$LEGEND) to test this). grepl uses regular expressions to test whether each element
in a vector of text or factor names matches the text pattern we want. In this case, we are only interested
in roundabouts that are A roads and Rapid Transit systems (RTS). Note the use of the vertical separator | to
indicate that we want to match LEGEND names that contain either A Road or Rapid. Based on the positive
matches (saved as sel, a vector of TRUE and FALSE values), we subset the stations. Finally we plot these as
points, using the integer code of their name to decide the symbol, and add a legend. (See the documentation of
?legend for detail on the complexities of legend creation in R's base graphics.)
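The pattern matching step can be tried on its own, without the stations data. The sketch below applies grepl to the LEGEND names listed in the levels output in this tutorial; the object name legend_names is an assumption.

```r
# LEGEND values copied from the levels() output shown in this tutorial
legend_names <- factor(c(
  "Railway Station",
  "Rapid Transit Station",
  "Roundabout, A Road Dual Carriageway",
  "Roundabout, A Road Single Carriageway",
  "Roundabout, B Road Dual Carriageway",
  "Roundabout, B Road Single Carriageway",
  "Roundabout, Minor Road over 4 metres wide",
  "Roundabout, Primary Route Dual Carriageway",
  "Roundabout, Primary Route Single C'way"
))

sel <- grepl("A Road|Rapid", legend_names)  # TRUE where either pattern occurs
as.character(legend_names[sel])             # the names that matched
as.integer(legend_names)[sel]               # integer codes, usable as plot symbols
```

Only three of the nine names match, because "B Road" does not contain the text "A Road".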
This may seem a frustrating and unintuitive way of altering map graphics compared with something like QGIS.
That's because it is! It may not be worth pulling too much hair out over R's base graphics because there is another
option. Please skip to Part IV if you're itching to see this more intuitive alternative.
##     id   long
## 1 00AA 531027
## 2 00AA 531555
## 3 00AA 532136
## 4 00AA 532946
## 5 00AA 533411
## 6 00AA 533843
It is now straightforward to produce a map using all the built-in tools (such as setting the breaks in the data)
that ggplot2 has to offer. coord_equal() is the equivalent of asp=T in regular plots with R:
Map <- ggplot(sport.f, aes(long, lat, group = group, fill = Partic_Per)) + geom_polygon() +
coord_equal() + labs(x = "Easting (m)", y = "Northing (m)", fill = "% Sport Partic.") +
ggtitle("London Sports Participation")
Now, just typing Map should result in your first ggplot-made map of London! There is a lot going on in the
code above, so think about it line by line: what has each of the elements of code above been designed to do?
Also note how the aes() components can be combined into one set of brackets after ggplot, which has relevance
for all layers, rather than being broken into separate parts as we did above. The different plot functions still
know what to do with these. The group=group points ggplot to the group column added by fortify(), and it
identifies the groups of coordinates that pertain to individual polygons (in this case London Boroughs).
The default colours are really nice, but we may wish to produce the map in black and white. The following
code should produce such a map (try changing the colours too):
Map + scale_fill_gradient(low = "white", high = "black")
The sport object loaded previously is in British National Grid but the ggmap image tiles are in WGS84. We
therefore need to use the sport.wgs84 object created in the reprojection operation earlier.
The first job is to calculate the bounding box (bb for short) of the sport.wgs84 object to identify the geographic
extent of the image tiles that we need.
b <- bbox(sport.wgs84)
b[1, ] <- (b[1, ] - mean(b[1, ])) * 1.05 + mean(b[1, ])
b[2, ] <- (b[2, ] - mean(b[2, ])) * 1.05 + mean(b[2, ])
# scale longitude and latitude (increase bb by 5% for plot) replace 1.05
# with 1.xx for an xx% increase in the plot size
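The arithmetic can be checked on a small matrix with the same layout that bbox returns (rows x and y, columns min and max). The coordinates below are hypothetical and chosen only to make the result easy to verify.

```r
# Hypothetical bounding box in degrees: rows are x and y, columns min and max
b <- matrix(c(-0.5, 51.3, 0.3, 51.7), nrow = 2,
            dimnames = list(c("x", "y"), c("min", "max")))
width_before <- b[, "max"] - b[, "min"]

# Stretch each row about its own midpoint by 5%
b[1, ] <- (b[1, ] - mean(b[1, ])) * 1.05 + mean(b[1, ])
b[2, ] <- (b[2, ] - mean(b[2, ])) * 1.05 + mean(b[2, ])

width_after <- b[, "max"] - b[, "min"]
b
width_after / width_before  # ratio of new to old width in each direction
```

Subtracting the mean centres each row on zero, so multiplying by 1.05 widens the box evenly on both sides before the midpoint is added back.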
This is then fed into the get_map function as the location parameter. The syntax below contains two functions:
get_map provides the base map data and ggmap is required to produce the plot.
lnd.b1 <- ggmap(get_map(location = b))
## Warning: bounding box given to google - spatial extent only approximate.
In much the same way as we did above we can then layer the plot with different geoms.
First fortify the sport.wgs84 object and then merge with the required attribute data (we already did this step to
create the sport.f object).
sport.wgs84.f <- fortify(sport.wgs84, region = "ons_label")
sport.wgs84.f <- merge(sport.wgs84.f, sport.wgs84@data, by.x = "id", by.y = "ons_label")
We can now overlay this on our base map.
lnd.b1 + geom_polygon(data = sport.wgs84.f, aes(x = long, y = lat, group = group,
fill = Partic_Per), alpha = 0.5)
The code above contains a lot of parameters. Use the ggplot2 help pages to find out what they are. The resulting
map looks okay, but it would be improved with a simpler base map in black and white. A design firm called
Stamen provides the tiles we need, and they can be brought into the plot with the get_map function:
lnd.b2 <- ggmap(get_map(location = b, source = "stamen", maptype = "toner",
crop = TRUE))
We can then produce the plot as before:
lnd.b2 + geom_polygon(data = sport.wgs84.f, aes(x = long, y = lat, group = group,
fill = Partic_Per), alpha = 0.5)
Finally, to increase the detail of the base map, we can use get_map's zoom argument (result not shown):
lnd.b3 <- ggmap(get_map(location = b, source = "stamen", maptype = "toner",
crop = TRUE, zoom = 11))
lnd.b3 + geom_polygon(data = sport.wgs84.f, aes(x = long, y = lat, group = group,
fill = Partic_Per), alpha = 0.5)
R quick reference
- #: comments all text until line end
- df <- data.frame(x = 1:9, y = (1:9)^2): create new object of class data.frame, called df, and assign values
- help(plot): ask R for basic help on function, the same as ?plot. Replace plot with any function (e.g.
  spTransform).
- library(ggplot2): load a package (replace ggplot2 with your package name)
- install.packages("ggplot2"): install package - note quotation marks
- setwd("C:/Users/username/Desktop/"): set R's working directory (set it to your project's folder)
- nrow(df): count the number of rows in the object df
- summary(df): summary statistics of the object df
- head(df): display first 6 lines of object df
- plot(df): plot object df
- save(df, file = "C:/Users/username/Desktop/df.RData"): save df object to specified location
- rm(df): remove the df object
- proj4string(df): query coordinate reference system of df object
- spTransform(df, CRS("+init=epsg:4326")): reproject df object to WGS84
Acknowledgements
The tutorial was developed for a series of Short Courses funded by the National Centre for Research Methods
(NCRM), via the TALISMAN node (see geotalisman.org). Thanks to the ESRC for funding applied methods
research. Many thanks to Rachel Oldroyd and Alistair Leak who helped demonstrate these materials on the
NCRM short courses for which this tutorial was developed. Amy O'Neill organised the course and encouraged
feedback from participants. The final thanks is to all users and developers of open source software for making
powerful tools such as R accessible and enjoyable to use.
References
Bivand, R. S., Pebesma, E. J., & Rubio, V. G. (2008). Applied spatial data: analysis with R. Springer.
Cheshire, J. & Lovelace, R. (2014). Manipulating and visualizing spatial data with R. Book chapter in Press.
Harris, R. (2012). A Short Introduction to R. social-statistics.org.
Johnson, P. E. (2013). R Style. An Rchaeological Commentary. The Comprehensive R Archive Network.
Kabacoff, R. (2011). R in Action. Manning Publications Co.
Ramsey, P., & Dubovsky, D. (2013). Geospatial Software's Open Future. GeoInformatics, 16(4).
Torfs, P. & Brauer, C. (2012). A (very) short Introduction to R. The Comprehensive R Archive Network.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer.
Wilkinson, L. (2005). The grammar of graphics. Springer.