Broomspatial
Frank Davenport
March 1, 2013
1 Get Your R On
This preliminary section will cover some basic details about R.
• Lists The most common and flexible type of R object. A list is simply a collection of other objects. For example, a regression object is a list of: 1) coefficient estimates, 2) standard errors, 3) the variance/covariance matrix, 4) the design matrix (data), 5) various measures of fit, et cetera.
We will look at examples of these objects in the next section.
We can explore the data using the names(), summary(), head(), and tail() commands (we will use these
frequently throughout the exercise).
summary(mydat) #basic summary information
tail(mydat) #look at the last few rows
##    Y99Births Y99Brate PopChg BrateChg
## 43     22260    36.12     39      -11
## 44     12940    41.87     38        0
## 45     43240    42.89     36       -8
## 46     23440    42.80     29       -2
## 47     69380    34.48     36      -11
## 48     69380    34.48     36      -11
We will go over ways to index and subscript data.frames later on in the exercise. For now, let's do a
basic regression so you can see an example of a list.
myreg <- lm(Y99Pop ~ Y89Births + Y89Brate, data = mydat) #regress the 1999 population on 1989 births and birth rate
myreg
##
## Call:
## lm(formula = Y99Pop ~ Y89Births + Y89Brate, data = mydat)
##
## Coefficients:
## (Intercept) Y89Births Y89Brate
## 502593 38 -14369
A regression object is an example of a list. We can use the names() command to see what the list
contains, and the summary() command to get the standard regression output (coefficients, standard
errors, et cetera). We can also save that summary as a new object that contains all the elements of the regression summary.
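First we store the summary as its own object (a one-line step; the original chunk is not shown in the text):
myregsum <- summary(myreg) #save the full regression summary as a new object
myregsum #printing it gives the standard regression table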
##
## Call:
## lm(formula = Y99Pop ~ Y89Births + Y89Brate, data = mydat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362649 -117800 -10240 36497 597511
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 502592.59 199219.41 2.52 0.015 *
## Y89Births 38.05 2.03 18.76 <2e-16 ***
## Y89Brate -14369.09 5774.65 -2.49 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 215000 on 45 degrees of freedom
## Multiple R-squared: 0.898,Adjusted R-squared: 0.894
## F-statistic: 199 on 2 and 45 DF, p-value: <2e-16
names(myregsum)
##  [1] "call"          "terms"         "residuals"     "coefficients"
##  [5] "aliased"       "sigma"         "df"            "r.squared"
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
myregsum$adj.r.squared #extract a single element by name
## [1] 0.8938
myregsum[["adj.r.squared"]] #double-bracket indexing works too
## [1] 0.8938
That concludes our basic introduction to data.frames and lists. There is a lot more material out on the
web if you are interested. Later in the exercise we will look at data.frames in more detail.
add <- function(x) {
    x + 1 #the value of the last line evaluated is returned
}
add(4)
## [1] 5
add(5)
## [1] 6
add(6)
## [1] 7
That's about all there is to it. The function will generally return the result of the last line that was
evaluated; however, you can also use return() to specify exactly what the function will return.
Functions can also return other functions. This concept is known as 'closures' and can be a very
powerful tool. Here are some trivial examples (courtesy of H. Wickham's 'R Masters Class'):
power <- function(exponent) {
    function(x) x^exponent #power() returns a new function
}
square <- power(2)
square(2)
## [1] 4
square(4)
## [1] 16
cube <- power(3)
cube(2)
## [1] 8
cube(4)
## [1] 64
# --------------------------BASIC SET UP------------------------------
# ---Clear the workspace
# rm(list=ls())  #commented out for now, but a good way to start most R scripts
# setwd(datdir)  #This sets the working directory (where R looks for files)-
#                #NOT NECESSARY FOR THE BROOM CLASS
# --------------------------LOADING PACKAGES-------------------------------------
library(maptools)  #functions for reading and manipulating spatial data
library(rgdal)     #GDAL bindings: readOGR(), writeOGR(), readGDAL()
# gpclibPermit()  #Makes all of the functions in the maptools package available to us-
#                 #only necessary if rgeos is not installed
library(raster)  #contains a number of useful functions for raster data, especially extract()
# ===================================================================
# ------Extra Note: To install rgdal and rgeos on a Mac:
# setRepositories(ind=1:2)   #set the repository to read from CRAN Extras
# install.packages('rgeos')  #or 'rgdal'
Note: Mention the importance of gpclibPermit()
Note: Mention installing rgdal and rgeos on a Mac
To read in a shapefile we use readOGR(), which takes two main arguments:
• dsn- the directory containing the shapefile (even if this is already your working directory)
• layer- the name of the shapefile, without the file extension
# --------------------------READ IN A SHAPEFILE----------------------------------
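A minimal sketch of the read call (the dsn and layer values here are placeholders; substitute your own):
ds <- readOGR(dsn = ".", layer = "kenya_districts") #hypothetical layer name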
We can explore some basic aspects of the data using summary() and str(). Summary works on almost
all R objects but returns different results depending on the type of object. For example, if the object is the
result of a linear regression, then summary will give you the coefficient estimates, standard errors, t-stats,
R-squared, et cetera.
str(ds, 2) #the second argument limits how many levels of nesting are displayed
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 41 obs. of 2 variables:
## ..@ polygons :List of 41
## ..@ plotOrder : int [1:41] 17 36 21 19 12 15 20 14 26 34 ...
## ..@ bbox : num [1:2, 1:2] 33.91 -4.68 41.9 4.63
## .. ..- attr(*, "dimnames")=List of 2
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
As mentioned above, the summary() command works on virtually all R objects. In this case it gives
some basic information about the projection, coordinates, and data contained in our shapefile.
The str() (structure) command tells us how R is actually storing and organizing our shapefile. This
is a useful way to explore complex objects in R. When we use str() on a spatial polygon object, it tells us
the object has five 'slots':
1. data: This holds the data.frame with the attributes
2. polygons: the polygon geometry itself
3. plotOrder: the order in which the polygons are drawn
4. bbox: the bounding box of the data
5. proj4string: the projection information
The only one we want to worry about is data, because this is where the data.frame() associated with
our spatial object is stored. We access slots using the @ sign.
Note: Mention S3 vs S4 classes?
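For example, we can copy the data slot into an ordinary data.frame (a sketch; the original line is not shown in the text):
dsdat <- ds@data #pull the attribute table out of the data slot
head(dsdat)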
## ip89DId ip89DName
## 0 1010 Nairobi
## 1 2010 Kiambu
## 2 2020 Kirinyaga
## 3 2030 Muranga
## 4 2040 Nyandaura
## 5 2050 Nyeri
ds$new <- 1:nrow(dsdat) #add a new column, just like adding data to a data.frame
head(ds@data)
# --------------------------PLOT THE SHAPEFILE-----------------------------------
plot(ds)
Obviously there are more options to dress up your plot and make a proper map/graphic. A common
method is to use spplot() from the sp package. However, I prefer the functions available in the
ggplot2 package, as I think they are more flexible and intuitive. We will address maps and graphics later
in the class. For now, let us move on to reading in some tabular data and merging that data to our
shapefile (similar to the join operation in ArcGIS).
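First we read the census data in from a csv file (a sketch; the filename is a placeholder):
d <- read.csv("kenya_census.csv") #hypothetical filename
summary(d)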
## Y89Births Y89Brate Y99Pop Y99Births Y99Brate
## Min. : 1680 Min. :22.6 Min. : 72380 Min. : 1760 Min. :19.0
## 1st Qu.: 9350 1st Qu.:33.5 1st Qu.: 392545 1st Qu.:10870 1st Qu.:28.0
## Median :18270 Median :37.4 Median : 629740 Median :21820 Median :31.0
## Mean :23719 Mean :37.0 Mean : 872928 Mean :27562 Mean :31.6
## 3rd Qu.:39855 3rd Qu.:40.9 3rd Qu.:1384665 3rd Qu.:42140 3rd Qu.:36.4
## Max. :57460 Max. :51.0 Max. :2363120 Max. :69380 Max. :42.9
##
## PopChg BrateChg
## Min. :-14.0 Min. :-38.00
## 1st Qu.: 23.8 1st Qu.:-20.00
## Median : 33.5 Median :-14.00
## Mean : 47.7 Mean :-14.56
## 3rd Qu.: 44.2 3rd Qu.: -6.75
## Max. :343.0 Max. : 0.00
##
# --If you are using RStudio-Click on the Workspace Tab, then click on 'd' and you will
# get a spreadsheet view of the data. If you are not using RStudio you can get the same
# result by typing fix(d)
Before we merge the csv file to our shapefile, let's do some basic cleaning. The csv file has some excess
columns and rows; let's get rid of them. We access rows and columns by using square brackets [,].
Here are some examples using our data.frame 'd':
• d[1,] first row, all columns
• d[,1] first column, all rows
Hopefully you get the idea. See the R reference card (http://cran.r-project.org/doc/contrib/Short-refcard.pdf) for more information.
Now we extract only the columns we want and then use the unique() command to get rid of
duplicate rows.
d <- d[, c("ip89DId", "PopChg", "BrateChg", "Y89Pop", "Y99Pop")] #grab only the columns we want
summary(d)
##     ip89DId        PopChg         BrateChg          Y89Pop           Y99Pop
## 1st Qu.:3772   1st Qu.: 23.8   1st Qu.:-20.00   1st Qu.: 222905   1st Qu.: 392545
## Median :6010 Median : 33.5 Median :-14.00 Median : 451510 Median : 629740
## Mean :5207 Mean : 47.7 Mean :-14.56 Mean : 619710 Mean : 872928
## 3rd Qu.:7052 3rd Qu.: 44.2 3rd Qu.: -6.75 3rd Qu.: 947500 3rd Qu.:1384665
## Max. :8030 Max. :343.0 Max. : 0.00 Max. :1476500 Max. :2363120
nrow(d)
## [1] 48
d <- unique(d) #drop the duplicate rows
nrow(d)
## [1] 41
d3 <- merge(d, d2) #they have common column names so we don't have to specify what to join on
head(d3)
# =====================================================
# --------------Now lets do the Table Join: Join csv data to our Shapefile-----
# --We can do the join in one line by using the match() function
ds1 <- ds #make a copy so we can demonstrate 2 ways of doing the join
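# --a sketch of that one-line match() join (see the Stack Overflow link at the end
# of this section; the original line is not shown in the text)
ds@data <- data.frame(ds@data, d[match(ds@data$ip89DId, d$ip89DId), ])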
summary(ds)
# ---Alternatively we can do this. This is the preferred method, but it will only work
# if ds and d have the same number of rows, and the row names are identical and in the
# same order
row.names(d) <- d$ip89DId
row.names(ds1) <- as.character(ds1$ip89DId)
d <- d[order(d$ip89DId), ]
ds1 <- spCbind(ds1, d)
# head(ds@data)
# ==========================================================================
Note that the values from our csv are now in the data attributes of the shapefile. Note also that we have
duplicated the join field 'ip89DId'. We can delete it afterwards, but it's a nice way to double-check and
make sure our join worked correctly. I will go over the details of this approach in class, and you can also
see an explanation here:
http://stackoverflow.com/questions/3650636/how-to-attach-a-simple-data-frame-to-a-spatialpolygondataframe-in-r
win <- bbox(ds) #the bounding box around the Kenya dataset
win
## min max
## x 33.909 41.899
## y -4.678 4.629
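The point-generation step itself is not shown in the text; a minimal sketch with sp's spsample():
dran <- spsample(ds, n = 100, type = "random") #100 random points inside Kenya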
plot(ds)
plot(dran, add = T)
Now that we have created some random points, we will extract the x coordinates (longitude) and y
coordinates (latitude), and then simulate some values to go with them. The purpose of doing this is to create
a file similar to the random points file we used in the ArcGIS exercise: a text file with x, y, and some
values. We will then write those values out as a .csv file, read them back in, convert them to a shapefile,
and then do a point in polygon spatial join.
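A sketch of the extraction step, assuming the points are stored in dran:
dp <- as.data.frame(coordinates(dran)) #point coordinates as a data.frame
names(dp) <- c("x", "y") #make sure the columns are named x and y
head(dp)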
## x y
## 1 40.50 -2.6693
## 2 34.87 0.8920
## 3 40.30 2.7511
## 4 38.32 0.4188
## 5 39.83 -3.6759
## 6 38.61 -0.8531
# Now we will add some values that will be aggregated in the next exercise
dp$values <- rnorm(100, 5, 10) #generate 100 draws from a Normal distribution with mean 5 and sd 10
head(dp)
## x y values
## 1 40.50 -2.6693 9.4331
## 2 34.87 0.8920 -0.6684
## 3 40.30 2.7511 2.5983
## 4 38.32 0.4188 5.0610
## 5 39.83 -3.6759 -5.3906
## 6 38.61 -0.8531 -11.4819
6 Do a Point in Polygon Spatial Join
In the last exercise we generated some random points along with some random values. Now we will read
that data in, convert it to a shapefile (or a SpatialPointsDataFrame object), and then do a point in polygon
spatial join. The command for converting coordinates to spatial points is SpatialPointsDataFrame().
# ---Since the data was generated from a source with the same projection as our Kenya
# data, we will go ahead and define the projection
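A sketch of the conversion and projection step (the original lines are not shown in the text):
dsp <- SpatialPointsDataFrame(coords = dp[, c("x", "y")], data = dp)
proj4string(dsp) <- proj4string(ds) #borrow the CRS from the Kenya shapefile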
Now that we have created some points and defined their projection, we are ready to do a point in
polygon spatial join. We will use the over() command (the successor to the older overlay()).
In the over() command we feed it a spatial polygon object (ds) and a spatial points object (dsp), and tell it
what function we want to use to aggregate the points that fall in each polygon. In this case we will use the
mean (but we could use any function or write our own). The result is a data.frame, and we will then put the
resulting aggregated values back into the data.frame() associated with ds (ds@data).
See ?over() for more information.
# --The resulting data.frame gives, for each polygon, the aggregated value of the
# points that fall inside it
dsdat <- over(ds, dsp, fn = mean) #do the join
head(dsdat) #look at the data
## values
## 0 NA
## 1 NA
## 2 NA
## 3 5.021
## 4 NA
## 5 NA
inds <- row.names(dsdat) #get the row names of dsdat so that we can put the data back into ds
head(inds)
ds@data[inds, "pntvals"] <- dsdat #use the row names from dsdat to add the aggregated point values to ds@data
head(ds@data)
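The chunk that reads and crops the raster is not shown in the text; a minimal sketch with the raster package (the filename is hypothetical):
g <- raster("precip.tif") #read in a raster file
gc <- crop(g, extent(ds)) #crop it to the extent of the Kenya data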
# --plot it
plot(g)
plot(ds, add = T) #plot Kenya on top to get some sense of the extent
[Figure: the raster plotted with the Kenya districts overlaid, followed by the raster cropped to the extent of the Kenya data]
In the last step we read in a raster file and cropped it to the extent of the Kenya data (just to cut down on
the file size and to demonstrate that function). Now we will aggregate the pixel values up to the polygon
level using the extract() function.
# --------------------------PIXEL IN POLY SPATIAL JOIN------------------------------
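# --Unweighted (fast): take the simple mean of the grid cells in each district
# (a sketch; the original call is not shown in the text)
ds@data$precip <- extract(gc, ds, fun = mean)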
# Weighted (more accurate, but slower)- weights aggregation by the amount of the grid
# cell that falls within the district boundary
# ds@data$precip_wght<-extract(gc,ds,fun=mean,weights=TRUE)
# --If you want to see the actual values and the weights associated with them do this:
# rastweight<-extract(gc,ds,weights=TRUE)
# ==========================================================
# ------Examine the Results and Extract the Data-----------
# ---Plot The Results
# spplot(dsp[,c('wrsi','wrsi_wght')])
Now that we’ve added all this data to our shapefile, we’ll write it out as a new shapefile and then load
it in to make some maps in the next exercise.
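A sketch of the write step with rgdal's writeOGR() (the layer name is a placeholder):
writeOGR(ds, dsn = ".", layer = "kenya_joined", driver = "ESRI Shapefile")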
Now we will build the map step by step using ggplot2. We could do it all in one line, but it's easier to
do it one step at a time so you can see how the different elements combine to make the final graphic. In
the code below we first create the basic layer using the ggplot command, and then we customize it.
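The code for the base layer is not shown in the text; a minimal sketch, assuming the polygons are first flattened into a plain data.frame (pds) with ggplot2's fortify() (which requires rgeos or gpclib):
library(ggplot2)
pds <- fortify(ds, region = "ip89DId") #polygons as a plain data.frame for ggplot2
pds$lon <- pds$long #alias; some of the later code refers to 'lon'
p1 <- ggplot(ds@data) + geom_map(aes(fill = PopChg, map_id = ip89DId), map = pds)
p1 <- p1 + expand_limits(x = pds$lon, y = pds$lat) + coord_equal()
p1 + xlab("Basic Map with Default Elements")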
[Figure: Basic Map with Default Elements (choropleth of PopChg with the default legend)]
# ---Change the Colour Scheme-----
p1 <- p1 + scale_fill_gradient(name = "Population \nChange", low = "wheat", high = "steelblue") #set the low and high colours of the gradient
The '\n' in 'Population \nChange' inserts a line break in the legend title.
p1 + xlab("We Changed the Color Scale and Gave the Legend a Proper Name")
[Figure: We Changed the Color Scale and Gave the Legend a Proper Name (legend now reads 'Population Change')]
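The call that converts the legend to a colorbar is not shown in the text; presumably something like ggplot2's guide_colorbar():
p1 <- p1 + guides(fill = guide_colorbar())
p1 + xlab("Now the Legend is a Colorbar\nwhich better Represents Continuous Data")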
[Figure: Now the Legend is a Colorbar, which better Represents Continuous Data]
Now we will get rid of all the unnecessary information in the background.
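One way to do that, as a sketch, is to blank out the axis and panel elements with theme():
p1 <- p1 + theme(axis.text = element_blank(), axis.ticks = element_blank(),
    axis.title = element_blank(), panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(), panel.background = element_blank())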
[Figure: the map with the background elements removed (legend: Population Change)]
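The step that computes the label points is not shown in the text; a sketch using coordinates(), which returns one label point (roughly the centroid) per polygon:
cens <- data.frame(coordinates(ds)) #one label point per district
names(cens)[1:2] <- c("V1", "V2") #match the column names used below
cens$Region <- ds$ip89DName #district names
cens$ip89DId <- ds$ip89DId #join field, used to merge in data values later
head(cens)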
## V1 V2 Region ip89DId
## 0 36.86 -1.2985 Nairobi 1010
## 1 36.82 -1.0744 Kiambu 2010
## 2 37.32 -0.5266 Kirinyaga 2020
## 3 37.03 -0.8108 Muranga 2030
## 4 36.48 -0.3225 Nyandaura 2040
## 5 36.95 -0.3396 Nyeri 2050
p1 <- p1 + geom_text(data = cens, aes(V1, V2, label = Region), size = 2.5, vjust = 1)
p1 + xlab("We added some text labels \nfor the Various Spatial Units")
# -----Add Some value Labels------------
pdlab <- merge(cens, d) #merge the centroids with our data
head(pdlab) #We will use this to label the polygons with their data values
p1 <- p1 + geom_text(data = pdlab, aes(V1, V2, label = paste("(", PopChg, ")", sep = "")),
colour = "black", size = 2, vjust = 3.7)
p1 + xlab("Now we added the actual value labels for the data")
[Figure: We added some text labels for the Various Spatial Units (district names such as Mandera, Turkana, Marsabit, Kilifi, Mombasa, and Kwale)]
[Figure: Now we added the actual value labels for the data (values in parentheses beneath each district name, e.g. Marsabit (102))]
# -------Add a title------------------
p1 <- p1 + labs(title = "Population Change in Kenya \n (1989-1999)")
p1 + xlab("Finally we add a title")
[Figure: the final map, titled 'Population Change in Kenya (1989-1999)']
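The reshaping step that creates pd is not shown in the text; a sketch with the reshape2 package (melt() stacks Y89Pop and Y99Pop into a single 'value' column indexed by 'variable', and pds is the fortified polygon data.frame from earlier):
library(reshape2)
pd <- melt(ds@data[, c("ip89DId", "Y89Pop", "Y99Pop")], id.vars = "ip89DId")
head(pd) #one row per district and variable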
pmap <- ggplot(pd)
p2 <- pmap + geom_map(aes(fill = value, map_id = ip89DId), map = pds) + facet_wrap(~variable)
p2 <- p2 + expand_limits(x = pds$lon, y = pds$lat) + coord_equal()
p2 + xlab("Basic Panel Map")
[Figure: Basic Panel Map; panels Y89Pop and Y99Pop side by side, fill legend 'value']
We can use the `ncol' (number of columns) argument in facet_wrap() to make the panels stack
vertically instead of horizontally.
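For example, a sketch that rebuilds the panel map with a single column of panels:
p2 <- pmap + geom_map(aes(fill = value, map_id = ip89DId), map = pds) +
    facet_wrap(~variable, ncol = 1) #one column, so the panels stack vertically
p2 <- p2 + expand_limits(x = pds$lon, y = pds$lat) + coord_equal()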
[Figure: the panel map with Y89Pop stacked above Y99Pop; caption: We change the option in facet_wrap so the panels are stacked]
Finally we can use the same options we used above to make our final map.
# --We can also adjust the format, theme, et cetera of the panel labels with
# 'strip.text.x'
p2 <- p2 + theme(strip.background = element_blank(), strip.text.x = element_text(size = 12))
p2 + xlab("Our Final Map")
# ===============================================================================
[Figure: Our Final Map; panels Y89Pop and Y99Pop, fill legend renamed 'Population']
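The data.frame df (the raster as x, y, and band1 columns) is not created in the text shown; one possibility, assuming the grid was read with rgdal's readGDAL(), which names the first band 'band1' (the filename is hypothetical):
df <- as.data.frame(readGDAL("precip.tif")) #columns: band1, x, y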
p <- ggplot(pds) + geom_raster(data = df, aes(x = x, y = y, fill = band1)) + theme_bw() #use geom_raster to plot the raster
p <- p + geom_map(map = pds, aes(map_id = id, x = long, y = lat), fill = NA, colour = "black") #then plot the district boundaries on top
p <- p + coord_equal()
p <- p + scale_fill_gradient(low = "wheat", high = "blue") #adjust the colors
p <- p + labs(x = "Longitude", y = "Latitude")
p
[Figure: the raster (band1) with the district boundaries overlaid; axes labelled Longitude and Latitude]
row.names(d) <- d$ip89DId
mod <- lm(Y99Pop ~ Y89Pop, data = d) #run the regression
summary(mod) #examine the results
##
## Call:
## lm(formula = Y99Pop ~ Y89Pop, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -440683 -55015 -16057 14538 816465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.39e+04 4.87e+04 0.29 0.78
## Y89Pop 1.35e+00 7.47e-02 18.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 186000 on 39 degrees of freedom
## Multiple R-squared: 0.893,Adjusted R-squared: 0.891
## F-statistic: 326 on 1 and 39 DF, p-value: <2e-16
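The residuals are collected into a data.frame along with the join field (a sketch; note that the row names carry over from d, which we set to ip89DId above):
res <- data.frame(resid = resid(mod), ip89DId = d$ip89DId) #residuals plus join field
head(res)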
## resid ip89DId
## 1010 282533 1010
## 2010 143572 2010
## 2020 -87412 2020
## 2030 -440683 2030
## 2040 -16057 2040
## 2050 -190207 2050
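To make the map of residuals we join them back to the shapefile and plot; a sketch using match() and sp's spplot():
ds@data$resid <- res$resid[match(ds@data$ip89DId, res$ip89DId)] #join by district id
spplot(ds, "resid", main = "Map of Residuals")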
[Figure: Map of Residuals (legend: Residuals, ranging from -4e+05 to 8e+05)]