Ecological
Data with R
Georg Hörmann
Institute for Natural Resource
Conservation
[email protected]
Ingmar Unkel
Institute for Ecosystem Research
[email protected]
Christian-Albrechts-Universität zu Kiel
Author's Copyright
This book, or whatever one chooses to call it, is subject to the GNU license (GPL, full details
available on every good search engine). It may be further distributed as long as no money is
requested or charged for it.
5 Bivariate Statistics................................................................................68
5.1 Pearson’s Correlation Coefficient.........................................................................................69
5.2 Spearman's Rank Coefficient.................................................................................................75
5.3 Correlograms – correlation matrices...................................................................................77
5.4 Classical Linear Regression....................................................................................................81
5.4.1 Analyzing the Residuals.................................................................................................82
6 Univariate Statistics..............................................................................87
6.1 F-Test.........................................................................................................................................87
6.2 Student's t Test........................................................................................................................88
6.3 Welch's t Test........................................................................................................................90
6.4 χ²-Test – Goodness of fit test..................................................................................................91
6.5 ANOVA – Analysis of Variance..............................................................................................93
8 Cluster Analysis...................................................................................105
8.1 Measures of distance.............................................................................................................105
8.2 Agglomerative hierarchical clustering..............................................................................109
8.2.1 Linkage methods...........................................................................................................109
8.2.2 Clustering Algorithm....................................................................................................110
8.2.3 Clustering in R................................................................................................................111
8.3 Kmeans clustering.................................................................................................................112
8.4 Chapter exercises..................................................................................................................115
8.5 Problems of cluster analysis................................................................................................115
8.6 R code library for cluster analysis......................................................................................117
9 Ordination..........................................................................................118
9.1 Principal Component Analysis (PCA).................................................................................118
9.1.1 The principle of PCA explained...................................................................................118
9.1.2 PCA in R...........................................................................................................................121
9.1.2.1 Selecting the number of components to extract.............................................123
9.1.3 PCA exercises.................................................................................................................124
9.1.4 Problems of PCA and possible alternatives...............................................................125
9.2 Multidimensional scaling (MDS).........................................................................................125
9.2.1 Principle of a NMDS algorithm....................................................................................126
10 Spatial Data.......................................................................................131
10.1 First example........................................................................................................................131
10.2 Background maps................................................................................................................132
10.3 Spatial interpolation...........................................................................................................134
10.3.1 Nearest neighbour.......................................................................................................135
10.3.2 Inverse distances.........................................................................................................135
10.3.3 Akima............................................................................................................................135
10.3.4 Thiessen polygons.......................................................................................................136
10.4 Point Data.............................................................................................................................137
10.4.1 Bubble plots..................................................................................................................138
10.5 Raster data............................................................................................................................138
10.6 Vector Data...........................................................................................................................139
10.7 Working with your own maps...........................................................................................142
12 Practical Exercises.............................................................................158
12.1 Tasks......................................................................................................................................158
12.1.1 Summaries....................................................................................................................158
12.1.2 Regression Line............................................................................................................159
12.1.3 Database Functions.....................................................................................................159
12.1.4 Frequency Analyses....................................................................................................159
13 Applied Analysis................................................................................160
14 Solutions...........................................................................................164
Comments on typography
Along with the technical arguments there is also the current financial situation of schools
and learning institutions of various levels, and of many smaller firms. When the
operating system and the office suite together are more expensive than the computer on
which they are installed, many consider whether they shouldn't just buy two PCs with Linux instead.
We have therefore decided in favor of a dual track: we discuss solutions to problems with
standard packages that are also applicable to open-source software (Excel, LibreOffice) and,
concerning more expensive special software for statistics, graphics, and data processing,
elaborate more upon free software.
Figure 1: Workflow of a data analysis (import data → data types → structure → missing values → check extreme values → compute date/time → advanced statistics)
The version from r-project.org is usually newer, but you have to add the servers
manually to the list of repositories.
2.1.2.1 Rcmdr
To install the Rcmdr interface, use the install.packages command shown below or
select Packages->Install Packages from the R GUI.
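A minimal sketch of the command-line installation (Rcmdr is the CRAN package name; its dependencies are pulled in automatically):
install.packages("Rcmdr")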
If you have never worked with packages before, R will ask you which mirror (file server) it should
use – select one close to you or in the same network (e.g. Göttingen for the German
universities). Next, select all packages of Rcmdr as shown in Fig. 2 and wait for the
installation to finish. After the installation you should first start the interface with
library(Rcmdr)
Before it starts for the first time, it will download additional packages from the internet. After this
process, Rcmdr will come up (Fig. 3) and is ready for work.
For the first steps in R we recommend Rcmdr, because it helps you to import data files and
builds commands for you. Once you are more familiar with R, you can switch to RStudio, which
is the more modern GUI.
We will use the successful import of our data into R (Fig. 4) as an introduction to the basic
philosophy of Rcmdr. The program window consists of three parts: the script window, the
output window and the message window.
• The script window contains the commands sent by Rcmdr to R. This is the easiest way
to study how R works, because Rcmdr translates everything you select with the
mouse in the interface into proper R code. You can also type in any command
manually. To submit a command or a marked block you have to click on the submit
button.
• The output window shows the results of the operation you just submitted. If you type
in the command “3+4” in the script window and submit the line, R confirms the
command line and prints out the result (“7”).
• The message window shows you useful information, e.g. the size of the database we
just imported.
Much of the power of R comes from a clever combination of the script window and the file
menu shown in Fig. 5.
The commands dealing with the "script file" save or load the commands contained in a
simple text file. This means that all commands you or Rcmdr issue in one session can be
saved in a script file for later use. If you e.g. put together a really complex figure, you can
save the script and repeat it whenever you need it with a different data set.
The same procedure can be used for the internal memory, called "workspace" in R. It
contains all variables in memory; you can save it before you quit your session and R reloads
it automatically next time. If you want to reload it manually, you can use the original R
interface or load it from the data menu.
2.1.2.2 Rstudio
The RStudio GUI (http://www.rstudio.com) has to be installed like any other Windows
program (Fig. 6).
For Ubuntu Linux you also have to download and install the software from the website; it is
not part of the software repository.
Figure 8: Download location; search for "tageswerte_01975_*" to get the Hamburg data set (station code: 1975)
1. Always work in the same directory, preferably the one where your code is placed. This is
where R stores all saved files, figures, data bases etc. (Fig. 10).
2. Use “####” to create headings and structure your workflow (Fig. 11)
3. Load all libraries at the start of your code
The data set must have a rectangular form without empty rows or columns
Figure 15: Settings for an import of the climate data set from the clipboard
library(readxl)
Climate = readxl::read_excel("climate_import.xlsx", sheet = 1, col_names = TRUE, na = "-999")
The next step is quite essential: check the structure of the imported file with
str(Climate)
Fig. 16 shows the correct output after a successful import. First, check the
number of variables and rows (observations). Second, check the data type of each variable.
Import of data in CSV format is also available in Rstudio. Figure 17 shows the import
function. The available options are the same as in Rcmdr.
You can also import the file manually with
Climate <- read.csv("Climate_hh.txt", sep=";", na.strings="-999")
The results should look like Fig. 18 if you used the import function of Rcmdr. The whole file
is converted to a variable of type data.frame, which consists of several sub-variables
corresponding to the columns of the file. The data type of all variables is numeric – this is
the normal view.
If you need some help, you have different choices. If you know the name of a
command, you get help with
help("ls")
The command ls() lists all variables in memory;
rm(Climate)
removes a variable.
edit(Climate) or View(Climate)
lets you inspect your data set and change or view values.
names(Climate)
lists the column names.
Climate[1,"Meas_Date"]
selects the value of the column Meas_Date in the first row.
Climate[1:10, c(2:4, 7, 9)]
selects the first ten rows of columns 2-4, 7 and 9. The expression c() creates a vector of indices – most
commands accept it as input.
Climate[Climate$AirTemp_Max>35,]   # all rows (days) with a maximum temperature above 35
You can use
attach(Climate)
to make the variables inside a data frame directly visible, but do not forget to call
detach(Climate)
when you are done.
For our climate data set we need real dates, so we have to convert the input to internal date
values.
Climate$Meas_Date=as.character(Climate$Meas_Date)
A conversion of the integer variable to text makes it easier to create the date.
Convert the text to a real date (see the help page of as.Date for a complete list of all format options)
and extract the year from the date – we need this information later for annual values.
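The conversion commands themselves are missing in this copy; a minimal sketch consistent with the column names used elsewhere in this chapter:
Climate$Date = as.Date(Climate$Meas_Date, "%Y%m%d")      # text like "19500101" becomes a Date
Climate$Year = as.numeric(format(Climate$Date, "%Y"))    # extract the year from the date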
Climate$Dayno = Climate$Date - as.Date(paste0(Climate$Year, "/1/1"), "%Y/%m/%d") + 1   # day of the year
library(lubridate)
Climate$Dayno = yday(Climate$Date)
The easiest way to deal with these problems is to import files with time variables from an
Excel worksheet. There the spreadsheet time variables are usually converted directly to
R-style time and date variables.
library(ggplot2)
qplot(Date,AirTemp_Mean,data=Climate)
In case you do not specify the type of figure you want, ggplot2 makes a guess.
qplot(Date,AirTemp_Mean,data=Climate,geom="line")
The geom parameter defines the type of figure you want to have. In this case line is a good
choice.
qplot(Year,AirTemp_Mean,data=Climate,geom="boxplot")
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="boxplot")
Some commands cannot handle all data types; here we have to convert the numeric variable
Year to a factor variable.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="boxplot")
Boxplots are not always the best method to display data. If the distribution could be
clustered, the jitter type is a good alternative. It displays all points of a data set.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter")
One of the advantages of the qplot command is that you can use colours and matrix plots
out of the box.
qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter",col=Year)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",facets= Month ~ .)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="line",facets= Month ~ .)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Year)
When you click the Knit button, a document is generated that includes both the
content and the output of any embedded R code chunks within the document. Figure
20 shows the basic structure of such a file. The results can be seen in Figure 21. Please note
that the results of the qplot command are saved directly and automatically in the resulting
DOC file.
The markup language has many more options to produce nice output. You can e.g. switch
off the display of the code to show only the results of the R commands, and you can set the
size of the figures to any value.
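These settings are controlled by chunk options; a minimal sketch (echo, fig.width and fig.height are standard knitr options, the plot inside is just an example):
```{r, echo=FALSE, fig.width=5, fig.height=3}
qplot(Date, AirTemp_Mean, data = Climate)
```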
If you want to continue to use the data file in R, you can also save the whole data base in R
format using
save.image("test.Rdata") # saves the complete workspace (all variables in memory)
To export files in Excel format, you have two possibilities. First, you can use
library(WriteXLS)
WriteXLS(Climate, "climate.xls")
However, to use this library you have to install another open-source computer language
named "Perl". For Windows, you can use ActivePerl. In most Linux installations, Perl is part
of the basic system and should already be installed.
The second possibility is the writexl package, which does not need Perl:
library(writexl)
write_xlsx(Climate, "climate.xlsx")
2.9 Exercises
1: calculate a variable Climate$Summer where summer = 1 and winter = 0
2: Plot the summer and winter temperatures in a boxplot
3: create new factor variables for year and month. Do not replace the original values, we will need them later.
• Climate$Year = as.numeric(as.character(Climate$Year))
If you forget this not really obvious step, you get the index numbers of the factor levels, not
the values of the variable.
4: create an additional variable for groups of 50 years. Use the "facets" and "color" keywords to check and display the temperatures.
Figure 23: Example of a good and bad database structure for lab data
Climate$Air_Press[Climate$Air_Press==-999]=NA
The following solution to replace all -999 values works, but is not really obvious:
Climate[Climate==-999]=NA
length(Climate$Air_Press[is.na(Climate$Air_Press)])   # counts the NA values
Some functions in R are complex and have many different options which are difficult to
remember. For the most important functions there are so-called "cheat sheets"
(https://www.rstudio.com/resources/cheatsheets/). The "data wrangling cheat sheet"
summarizes all commands for data management with dplyr.
library(dplyr)
Climate = readxl::read_excel("climate_import.xlsx", sheet = 1, col_names = TRUE, na = "-999")
The selection and filtering of variables requires more typing compared to the basic version,
but the code is quite readable.
# select variables
test=dplyr::select(Climate,AirTemp_Mean) # select variable
You can also use the index of the columns, but the version with names is more readable and
avoids problems if columns are deleted.
You can easily use any R function to change values, but the politically correct way is to use
the mutate function. The calculation of the date from last chapter can be expressed like
this:
test = mutate(Climate, date = as.Date(as.character(Climate$Meas_Date), "%Y%m%d"))
A new method to confuse beginners is the so-called "chaining" of commands. It comes from
the Unix world and is generally known as "piping". The output of one command is the
input of the next. It makes computations more efficient and avoids the use of temporary
variables, but it is difficult to debug. Therefore, we do not recommend it for beginners. Use
piping only if you know that your code works as you expect.
#chaining / piping
test= Climate %>%
mutate(Meas_Date=as.character(Climate$Meas_Date)) %>%
mutate(Date=as.Date(Meas_Date, "%Y%m%d")) %>%
mutate(Month = format(Date, "%m"))
You can use the arrange function to sort the data set by temperature; the 100 hottest days can
then be used to look for signs of global change.
t7=arrange(test,desc(AirTemp_Mean))
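arrange only sorts the data; to actually pick the 100 hottest days you still need the first 100 rows of the sorted data set, e.g. (a sketch):
hottest100 = head(t7, 100)   # first 100 rows = the 100 highest mean temperatures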
6: Select the 100 hottest days and plot the temporal distribution as a histogram in groups of 10 years
7: Select the 100 coldest days and plot the temporal distribution as a histogram in groups of 10 years
The most common application for dplyr is the calculation of mean, sums etc., e.g. for
annual and monthly values. The first step is to create groups
clim_group=group_by(Climate, Year)
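The grouped data set is then passed to summarize; a minimal sketch for annual mean temperatures (the column name T_mean is just an example):
clim_year = summarize(clim_group, T_mean = mean(AirTemp_Mean, na.rm = TRUE))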
The wide format is very common; an example is the Climate data set used for this course
(fig. 24). It is characterised by more than one measured value per row, in this case the different
temperatures. This format is used by many functions in R, e.g. the old graphic functions and
many functions for statistical analysis. The narrow format always contains only one measured
value per row; the description of the different variables goes into a separate column named
variable in fig. 19, which contains the names of the columns from the wide format. At first, the
narrow format seems unnecessarily complex and inefficient, but this structure
combined with dplyr, reshape2 and ggplot2 makes many complex operations easy and
efficient.
In figure 24 you can see how the different variables are transformed with melt into the
new structure shown in Fig. 25. In the melt function, two parameters are important: the id
variables and the measure variables. The id variables remain unchanged and are used as an
index; date variables are typical id variables. The columns of the measure variables are
collapsed into the variable and value columns of the new file. One line of the original
"wide" data set is now converted to 5 lines in the "narrow" data set.
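The melt call that produces Clim_melt (used below) is not reproduced in this copy; a sketch, assuming the date-related columns are the id variables:
library(reshape2)
Clim_melt = melt(Climate, id.vars = c("Meas_Date", "Date", "Year", "Month", "Dayno"))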
Now it is easy to create a quite complex figure with a simple command. Please note the "+"
as the last character of the first line; we need it to continue the graphic command.
qplot(Dayno,value,data=Clim_melt) +
facet_grid(variable ~ .,scales="free")
In the second step we merge the two data bases. We can use the old-style command
# join the two data bases
chem_all = merge(chemie, stations, by = "Scode")
In fig. 26 the process is explained graphically. The two data bases share a common variable
called Scode, the typical numeric code for the sampling site. All other information about
the site is stored in the station data base. The merge process uses the variable Scode
to look up the name and other properties of the sampling site and combines them in a new data
base (chem_all). With this new data base we can e.g. analyse the relation between lake
area and nutrient content.
The syntax of this command, especially the selection of the variables, is quite typical for
many other functions. "Prec ~ Year_fac + Month_fac" means: analyze the data
variable Prec and classify it with the monthly and yearly factor variables.
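The call this formula belongs to did not survive in this copy; presumably an aggregate command computing monthly precipitation sums per year, e.g. (a sketch):
msum = aggregate(Prec ~ Year_fac + Month_fac, data = Climate, FUN = sum)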
You can calculate the same result with the dplyr library
library(reshape2)                              # needed for dcast below
mgroup = group_by(Climate, Year, Month)
msum = summarize(mgroup, sum = sum(Prec))
msum = data.frame(msum)
mmonth = dcast(msum, Year ~ Month, value.var = "sum")   # back to the wide format
or use ggplot2
qplot(Month_fac,AirTemp_Mean,data=Climate,geom="boxplot")
The lattice library contains a lot of useful chart types, e.g. dotplots
library(lattice)
dotplot(Mean_Temp~Year_fac)
# also available in ggplot2
qplot(Year_fac,AirTemp_Mean,data=clim2000)
qplot(Year_fac,AirTemp_Mean,data=clim2000,geom="jitter")
A scatterplot is a version of a line plot with symbols instead of lines. It is a very common
type, used later for regression analysis. For a ggplot2 version of these figures see section 4.4.1.
plot(AirTemp_Max, AirTemp_Min)
abline(0, 1)                                              # 1:1 line (intercept 0, slope 1)
abline(0, 0)                                              # horizontal line at y = 0
abline(lm(AirTemp_Min ~ AirTemp_Max), col="red")          # regression line
lines(AirTemp_Min, AirTemp_Mean, col="green", type="p")   # add a second point set
abline(lm(AirTemp_Max ~ AirTemp_Min), col="green")
There are also some new packages with more advanced functions. Try e.g.
library(car)
scatterplot(AirTemp_Max ~ AirTemp_Min | Year_fac)
For really big datasets the following functions can be quite useful
library(IDPmisc)
iplot(AirTemp_Min, AirTemp_Max)
or
library(hexbin)
bin = hexbin( AirTemp_Min, AirTemp_Max,xbins=50)
plot(bin)
or
with(Climate,smoothScatter( AirTemp_Mean,AirTemp_Max))
or with ggplot2
qplot(data=Climate,AirTemp_Mean,AirTemp_Max, geom="bin2d")
qplot(data=Climate,AirTemp_Mean,AirTemp_Max, geom="hex")
If you do not like the boring blue colours, you can change them to rainbow patterns
qplot(data=Climate,AirTemp_Mean,AirTemp_Max)+
stat_bin2d(bins = 200)+
scale_fill_gradientn(limits=c(0,50), breaks=seq(0, 40, by=10),
colours=rainbow(4))
10: plot time series of max, min and mean temperature with the ggplot2 system (hint: use reshape2)
11: plot a scatterplot with mean temperature vs. (max+min)/2
12: calculate average annual temperatures and average precipitation for the climate data set using the aggregate function.
13: compare the monthly temperatures and precipitation for 1950-1980 and 1980-2010.
library(ggplot2)
14: find out how to change font size and character orientation of an x axis
in ggplot2.
qplot(data=Climate,x=AirTemp_Mean,geom="histogram")
In plain ggplot2 syntax the same figure with classes 2°C wide is produced by
ggplot()+
geom_histogram(data=Climate,aes(x=AirTemp_Mean),binwidth=2)
For the following plots we use monthly temperature values from the climate data base.
mgroup = group_by(Climate, Month)
msum = summarize(mgroup, mean = mean(AirTemp_Mean), max = max(AirTemp_Mean), min = min(AirTemp_Mean))
If you want a display of the values (not the counts) you have to use a barplot (B in Fig. 27).
ggplot()+
geom_bar(data=msum,aes(x=Month, y=mean),stat="identity")
For more than one bar you always have to use narrow data sets
n_msum = reshape2::melt(msum, id.vars = "Month")
However, with the right keyword you get the "standard" grouped bars as we know them from
spreadsheets (D in Fig. 27)
ggplot()+
geom_bar(data=n_msum,aes(x=Month, y=value,fill=variable),
stat="identity",position="dodge")
As usual, the facet keyword makes it easy to produce several figures at once.
ggplot()+
geom_bar(data=n_msum,aes(x=Month, y=value,fill=variable),
stat="identity")+
facet_grid(variable~.)
Here, screens can be addressed separately by their numbers. It is also possible to nest
screens: screen 2 is split into one row and two columns, which get the screen numbers 5 and 6.
split.screen( figs = c( 1, 2 ), screen = 2 )
screen(5)
plot(Prec ~ Date, type="l", col="red", main="Fig 5 inside 2")
screen(6)
plot(Sunshine ~ Date, type="l", col="red", main="Fig 6 inside 2")
close.screen(all=TRUE)
The layout command (see the sketch below) defines a 3x3 matrix with 9 elements. The matrix command assigns each cell of this matrix
to a figure. Thus, the first three elements (the first line) of the matrix are assigned to
(sub-)figure 1. The second line contains subfigure 2 in two elements, the last element is left
free (0). In line 3, only the first cell is assigned to figure 3. The system is shown in Fig. 29.
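The call itself is missing in this copy; a sketch that matches the description above:
layout(matrix(c(1, 1, 1,
                2, 2, 0,
                3, 0, 0), nrow = 3, byrow = TRUE))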
You can control the layout with
layout.show(3)
The results are shown in Fig. 30. For this course we kept the structure of the matrix quite
simple; you can use as many elements as you want and arrange the figures in any order.
[Figure 30: the resulting layout – subfigure 1: Mean_Temp vs. Date, subfigure 2: Max_Temp vs. Date, subfigure 3: Prec vs. Date]
• 1st row: time series of max, min and mean temperature, including a legend; reuse the
code from the earlier plotting task
Lattice is very well suited for the display of data sets with many (factor) variables, but the
syntax is different from normal figures and the display is not very flexible.
library(lattice)
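The lattice example calls themselves were lost in this copy; a typical monthly boxplot might look like this (a sketch, using the variable names from this chapter):
bwplot(AirTemp_Mean ~ Month_fac | Year_fac, data = clim2000)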
Please note how the numeric variables (temperatures) and the factor variables are ordered.
All examples above print monthly plots of a temperature.
Scatterplots are very similar, only the definition of the variables is different:
xyplot(AirTemp_Mean ~ AirTemp_Max + AirTemp_Min | Month_fac, data = clim2000)
Here you can clearly see the difference between summer and winter values.
Another useful feature is the automatic addition of a legend.
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
auto.key=T, data=clim2000)
A combination of all simple features makes it easy to get an overview of the dataset. In our
example it is quite apparent, that something went wrong in the year 2007 (Fig. 31).
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Year_fac,
auto.key=list(title="Year",columns=7), data=clim2000)
Grid needs so-called viewports – you can use any area; here we define the lower left part of
the page
### define first plotting region (viewport)
vp1 <- viewport(x = 0, y = 0, height = 0.5, width = 0.5,
just = c("left", "bottom"), name = "lower left")
Now we define the figure. The qplot command is a simplification of the ggplot2 package;
it makes the transition from older packages easier. It always requires an extra print command to
appear on the page. Now we print monthly boxplots in a separate figure for each year
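The commands that enter vp1 and print this first figure appear to be missing here; a sketch consistent with the later viewports:
pushViewport(vp1)
bw1 <- qplot(data = clim2000, x = Month_fac, y = AirTemp_Mean,
             facets = . ~ Year_fac, geom = "boxplot")
print(bw1, newpage = FALSE)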
Now we move up one step in the hierarchy, all plot commands would now be printed on the
full page.
upViewport(1)
### define second plot area
vp2 <- viewport(x = 1, y = 0, height = 0.5, width = 0.5,
just = c("right", "bottom"), name = "lower right")
### enter vp2
pushViewport(vp2)
### show the plotting region (viewport extent)
### plot another plot
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac,y=Prec,geom="boxplot")
print(bw.lattice, newpage= FALSE)
### leave vp2
upViewport(1)
vp3 <- viewport(x = 0, y = 1, height = 0.5, width = 0.5,
just = c("left", "top"), name = "upper left")
pushViewport(vp3)
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac, y=Sunshine,geom="boxplot")
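The third figure also has to be printed into vp3 and the viewport left again; these lines appear to be missing in this copy:
print(bw.lattice, newpage = FALSE)
upViewport(1)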
If you want additional explanation in a figure you can add text in the margins outside the
plot area with
mtext("Line 1", side=2, line=1, adj=1.0, cex=1, col="green")
The x-y dimensions of the text command are the same as those of the data set. As usual, you can
use any variable containing text, e.g. for automatic annotations.
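A minimal sketch of such an annotation inside the plot area (coordinates and label are made up for illustration):
text(20, 5, "outlier?")   # x and y are given in data units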
A colour can also be given as a number referring to the current palette; for example
plot(Max_Temp, Min_Temp, col=2)
is the same as
plot(Max_Temp, Min_Temp, col="red")
In-depth information about colors in R and in science:
http://research.stowers-institute.org/efg/R/Color/Chart/
4.6.4 Legend
In the basic graphic system, legends are not added automatically; you have to define
them separately, e.g.
plot(AirTemp_Max, AirTemp_Min)
lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")
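For a legend at a fixed position, the first two arguments of legend() are x and y in data coordinates (the values below are only an illustration):
legend(-10, 30, legend = c("Max/Min", "Min/Mean"), col = c("black", "green"), pch = 1)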
Again, the x-y dimensions are the same as those of the data set. If you want to set the location with
the mouse, you can use the following command
legend(locator(1), c("Max/Min", "Min/Mean"), col = c("black", "green"),
       lty = c(0, 0), lwd = c(1, 2), pch = c("o", "o"), bty = "n",
       merge = TRUE, bg = "white")
Now comes the second data set. To avoid a new figure we need to set
par(new=T)
The next lines are quite similar, except that we draw the y-axis on the right side (4).
plot(Prec ~ Date, type="l", col="green", yaxt='n', ylab="")
axis(4, pretty(c(0,max(Prec))), col="green")
mtext("Precipitation", side=4, line=3, col="green")
The recommended procedure is to develop and test a figure on screen and to write it to a file
as soon as the results are as expected.
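For the basic graphic system this means opening a file device around the plotting commands; a minimal sketch:
png("fig1.png", width = 800, height = 600)   # open the PNG file device
plot(AirTemp_Max, AirTemp_Min)               # any plotting commands
dev.off()                                    # close the device and write the file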
For the ggplot library you have to use
fig1 <- qplot(data=clim2000, x=Month_fac, y=AirTemp_Mean, geom="boxplot")
ggsave("fig1.png", plot = fig1, width = 3, height = 3)   # width/height in inches unless units= is given
One of the most useful scatterplot versions is pairs, which can also print the correlation
and the significance level. Unfortunately, pairs does not work with missing values in the
data set, and this cleaning process often removes half of the data set.
t2=t[complete.cases(t),]
pairs(t2[1:3], lower.panel=panel.smooth, upper.panel=panel.cor)
t2=t[complete.cases(t),]
pairs(t2, lower.panel=panel.smooth, upper.panel=panel.cor)
The following code is the definition of functions needed by pairs. You have to execute
them prior to the use of pairs.
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(.8, .8, Signif, cex=cex, col=2)
}
4.9 3d Images
Plotting 3d images is no problem if you already have a grid with regular spacing. The
procedure here also works with irregularly spaced data, but keep in mind that the spatial
interpolation may cause two kinds of problems:
• valleys and/or mountains in the image which are not found in the data,
• information at a smaller scale than the grid size may completely disappear in
the image.
For the spatial interpolation we use the package akima
install.packages("akima")
library(akima)
The data set is not a real spatial data set but a time series of soil water content at different
depths. The third dimension here is time.
g <- read.csv("soil_water.csv", header=TRUE)
attach(g)
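The interpolation call that creates the regular grid ak is missing in this copy; a minimal sketch, assuming the columns of soil_water.csv are called x, y and z:
ak <- interp(x, y, z)   # akima: interpolate the irregular points onto a regular grid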
The variable ak now contains a regular grid. Now we can plot all kinds of impressive
three-dimensional figures. We start with a boring contour plot:
contour(ak$x, ak$y, ak$z)
If you don't like the colours of the rainbow you can also use one of the other palettes:
'heat.colors', 'topo.colors', 'terrain.colors', 'rainbow', 'hsv', 'par'.
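A filled contour plot with one of these palettes could look like this (a sketch):
filled.contour(ak$x, ak$y, ak$z, color.palette = terrain.colors)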
To plot the same data with ggplot you have to convert it to the narrow format
library(RColorBrewer)                 # provides brewer.pal() used below
ak2 = as.data.frame(ak)
ak3 = melt(ak2, id.vars = c("x","y"))
ak3$z = as.numeric(ak3$variable)
ggplot(ak3, aes(x=z, y=y, fill=value)) +
  geom_raster() +
  stat_contour(bins=6, aes(x=z, y, z=value), color="black", size=0.6) +
  scale_fill_gradientn(colours=brewer.pal(6,"YlOrRd"))
4.10 Exercise
In the file Chemie_kielstau.xlsx you find a data set of 10 years of daily measurements of
water quality in the Kielstau catchment. Draw a figure with an overview of the nitrate
concentrations as described in Fig. 34. Use the libraries ggplot2, reshape2, dplyr and
gridExtra. The data set is in narrow format and the procedures to create the figures are
described in sections 4.4 and 4.5.6.
When the two variables are measured on the same object, x is usually identified as the
independent variable, whereas y is the dependent variable. If both variables were
generated in an experiment, the variable manipulated by the experimenter is described as
the independent variable. In some cases, neither variable is manipulated and both are therefore
independent. The methods of bivariate statistics help describe the strength of the
relationship between the two variables, either by a single parameter such as Pearson’s
correlation coefficient for linear relationships or by an equation obtained by regression
analysis (Fig. 35). The equation describing the relationship between x and y can be used to
predict the y-response from arbitrary x’s within the range of original data values used for
regression. This is of particular importance if one of the two parameters is difficult to
measure. Here, the relationship between the two variables is first determined by regression
analysis on a small training set of data. Then, the regression equation is used to calculate
this parameter from the first variable.
Correlation or Regression?!
Correlation: Neither variable has been set (they are both measured) AND there is no implied
causality between the variables.
Regression: Either one of the variables has been specifically set (not measured) OR there is an
implied causality between the variables whereby one variable could influence the other but not the reverse.
Dividing the covariance by the univariate standard deviations removes this effect and leads
to Pearson’s correlation coefficient r.
We load the data from the file agedepth.txt using the import function of Rstudio
(separator “white space”, decimal “.”) and plot the dataset (x=depth, y=age)
16: plot age (x-axis) against depth (y-axis) using either the basic plot command or qplot from the
package ggplot2.
17: assess linearity and bivariate normality using a scatterplot (package: car) with marginal boxplots
(a–b) Positive and negative linear correlation, (c) random scatter without a linear correlation, (d) an outlier
causing a misleading value of r, (e) curvilinear relationship causing a high r since the curve is close to a
straight line, (f) curvilinear relationship clearly not described by r.
Observations on exercises 17 and 18:
We observe a strong linear trend suggesting some dependency between the variables,
depth and age. This trend can be described by Pearson’s correlation coefficient r, where r =1
represents a perfect positive correlation, i.e., age increases with depth, r = 0 suggests no
correlation, and r =–1 indicates a perfect negative correlation.
We use the function cor to compute Pearson’s correlation coefficient:
cor(agedepth, method="pearson")
From these outputs our suspicion is confirmed: x and y have a high positive correlation. But,
as always in statistics, we can test whether this coefficient is significant, using parametric
assumptions (Pearson: dividing the coefficient by its standard error gives a value that
follows a t-distribution):
The cor.test command has the form cor.test(dataset$x, dataset$y, method).
If you attach a dataset, you can also use the form cor.test(x, y):
attach(agedepth)
cor.test(age, depth)
detach(agedepth)
The value of r = 0.9342 suggests that the two variables age and depth depend on each other.
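The following demonstration uses two random, uncorrelated variables x and y; the code that generates them did not survive in this copy. A sketch with 30 values (the outlier added below then becomes element 31):
x = runif(30)
y = runif(30)
plot(x, y)
cor(x, y)     # close to zero for uncorrelated random numbers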
cor.test(x,y)
Now we introduce a single outlier into the data set, an exceptionally high (x,y) value, which
is located precisely on the 1:1 line. The correlation coefficient for the bivariate data
set including the outlier (x,y) = (5,5) is much higher than before.
x[31]=5
y[31]=5
plot(x,y)
cor(x,y)
abline(lm(y ~ x), col="red")
After increasing the absolute (x,y) values of this outlier, the correlation coefficient increases
dramatically.
x[31]=10
y[31]=10
plot(x,y)
cor.test(x,y)
abline(lm(y ~ x), col="red")
Still, the bivariate data set does not provide much evidence for a strong dependence.
However, the combination of the random bivariate (x,y) data with one single outlier results
in a dramatic increase of the correlation coefficient. Whereas outliers are easy to identify in
a bivariate scatter, erroneous values might be overlooked in large multivariate data sets.
abline(lm(y ~ x)) uses the basic graphical function abline ("a-b line") to add a regression trend line
based on the linear model (lm) of x and y.
Exercise 18:
a) import the crab data set (crab.csv, separator ",")
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
c) Calculate Pearson's correlation coefficient and test the null hypothesis H0 that the population
correlation coefficient equals zero
As for Pearson's correlation, you can use the Spearman correlation to explore the
relationship between two variables. Accordingly, the coefficient has a value of -1 for perfect
negative correlation, a value of +1 for perfect positive correlation, and indicates no
correlation at all for values close to 0.
Spearman's correlation coefficient rsp is also called Spearman's rank coefficient, because
there is a tiny but important difference to the classical Pearson's coefficient r: the
correlation is not calculated from the data points themselves but from the ranks of the
data points.
To calculate Spearman's rank coefficient for the same dataset, we (or actually R in the
background) have to calculate the ranks first and leave the actual values for "age" and
"time" aside for a moment. In other words: we work with the "placement on the podium"
instead of the actual running times, and also with the "rank" of the age.
Why is that?
As the Spearman coefficient uses the ranks, the actual distances (in the example: finishing
times) between rank 1, rank 2 etc. do not matter. Hence, it also doesn't matter if the two
variables have no linear relationship! Spearman's coefficient rsp is always 1 if the lowest
x-value is associated with the lowest y-value, and so on.
In mathematical terms, Spearman's coefficient measures the monotone
relationship between two variables, while Pearson's coefficient measures the linear
relationship.
Exercise 19:
a) create an artificial squared age-depth relation in the agedepth dataset
b) assess the new age2-depth relation using qplot and scatterplot
c) Calculate Pearson's and Spearman's correlation coefficients
stat_smooth(method="lm")
Figure 38: picture of the PMM sediment core and the Si-content plotted as red curve along the core.
We load the data from the file PMM.txt (separator “white space”)
options(digits=2)
cor(PMM)
Al Si S Cl K Ca Ti Mn Fe Zn Br Rb
Al 1.0000 0.9073 -0.1005 -0.0740 0.9167 -0.2974 0.9176 0.1688 0.4443 0.4805 -0.3899 0.7344
Si 0.9073 1.0000 -0.2246 -0.2298 0.7381 -0.5379 0.8063 0.1783 0.2953 0.2885 -0.5806 0.4900
S -0.1005 -0.2246 1.0000 0.1782 -0.1026 0.5458 -0.1194 0.0003 0.5452 -0.0443 0.1718 -0.0904
Cl -0.0740 -0.2298 0.1782 1.0000 0.0828 0.5761 0.0331 0.3657 0.1702 0.3451 0.4423 0.2326
K 0.9167 0.7381 -0.1026 0.0828 1.0000 -0.1509 0.9482 0.1429 0.4794 0.6092 -0.2269 0.8973
Ca -0.2974 -0.5379 0.5458 0.5761 -0.1509 1.0000 -0.3021 0.0238 0.0442 0.0480 0.6203 0.0523
Ti 0.9176 0.8063 -0.1194 0.0331 0.9482 -0.3021 1.0000 0.1917 0.5605 0.5442 -0.3700 0.8425
Mn 0.1688 0.1783 0.0003 0.3657 0.1429 0.0238 0.1917 1.0000 0.2784 0.4044 0.0644 0.1225
Fe 0.4443 0.2953 0.5452 0.1702 0.4794 0.0442 0.5605 0.2784 1.0000 0.3898 -0.0458 0.4875
Zn 0.4805 0.2885 -0.0443 0.3451 0.6092 0.0480 0.5442 0.4044 0.3898 1.0000 0.1412 0.6466
Br -0.3899 -0.5806 0.1718 0.4423 -0.2269 0.6203 -0.3700 0.0644 -0.0458 0.1412 1.0000 -0.0314
Rb 0.7344 0.4900 -0.0904 0.2326 0.8973 0.0523 0.8425 0.1225 0.4875 0.6466 -0.0314 1.0000
library(corrgram)
corrgram(PMM)
To interpret this graph (Fig. 39), start with the lower triangle of cells (the cells below the
principal diagonal). By default, a blue color and hashing that goes from lower left to upper
right represents a positive correlation between the two variables that meet at that cell.
Conversely, a red color and hashing that goes from the upper left to the lower right
represents a negative correlation. The darker and more saturated the color, the greater the
magnitude of the correlation. Weak correlations, near zero, will appear washed out.
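The general form of the call described in the next paragraph is roughly the following (the panel functions named here are those provided by the corrgram package):
corrgram(x, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie,
         text.panel = panel.txt, diag.panel = panel.minmax)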
where x is a data frame with one observation per row. When order=TRUE, the variables are
reordered using a principal component analysis of the correlation matrix. Reordering can
help make patterns of bivariate relationships more obvious. The option panel specifies the
type of off-diagonal panels to use. Alternatively, you can use the options lower.panel and
upper.panel to choose different options below and above the main diagonal. The text.panel
and diag.panel options refer to the main diagonal.
Exercise 20:
a) import the STY1 data set (STY1.txt, separator "Tab")
b) plot the following element combinations in a nested (multi-plot) figure of 4 plots, add a
title (main) and a regression line (abline) and a different colour in each respective plot:
Al-Si; Ca-Sr; Ca-Si; and Mn-Fe. Explain what you see.
c) Calculate Pearson's correlation coefficients and Spearman's rank coefficients for each
element pair and evaluate the results
d) produce first an unsorted and then a sorted (order=TRUE) correlation matrix of the
entire STY1 data set, both times only displaying the lower panel as shades.
For creating the same in ggplot2 you need to write every sub-plot into a separate variable,
e.g. p1, p2, p3, p4, ...
p1=ggplot(data=dataset, aes(x=, y=))+
geom_point()
and finally arrange it via:
library(grid)
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol=2, top="Title")
where b0 and b1 are the regression coefficients. The value of b0 is the intercept with the
y-axis and b1 is the slope of the line. The squared sum of the Δy deviations is minimized;
partial differentiation of this sum and setting the result to zero yields a simple equation
for the first regression coefficient b1. The regression line passes through the data centroid
defined by the sample means, so we can then compute the other regression coefficient b0.
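The formulas themselves were lost in this copy; the standard least-squares expressions they refer to are
$$ y = b_0 + b_1 x, \qquad \sum_{i=1}^{n}\Delta y_i^2 = \sum_{i=1}^{n}\bigl(y_i - (b_0 + b_1 x_i)\bigr)^2 $$
$$ b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x} $$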
Exercise 21:
a) import the Nelson data set (nelson.csv, separator ",")
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
comment: the ordinary least squares method is considered appropriate, as there is effectively no
uncertainty (error) in the predictor variable (x-values, relative humidity)
c) fit the simple linear regression model (y = b0 + b1x) and examine the diagnostics:
nelson.lm = lm(WEIGHTLOSS ~ HUMIDITY, nelson)
plot(nelson.lm)
median: the median is often used as an alternative measure of central tendency. The
median is the x-value which is in the middle of the data, i.e., 50% of the
observations are larger than the median and 50% are smaller. The median of a
data set sorted in ascending order is defined as
$$ \tilde{x} = x_{(N+1)/2} \ \text{if } N \text{ is odd}, \qquad \tilde{x} = \tfrac{1}{2}\left(x_{N/2} + x_{N/2+1}\right) \ \text{if } N \text{ is even} $$
Quantiles are a more general way of dividing the data sample into groups containing equal
numbers of observations. For example, quartiles divide the data into four groups,
quintiles divide the observations into five groups and percentiles define one
hundred groups.
degrees of freedom Φ: the number of values in a distribution that are free to vary.
Null hypothesis:
A biological or research hypothesis is a concise statement about the predicted or theorized
nature of a population or populations and usually proposes that there is an effect of a
treatment (e.g. the means of two populations are different). Logically however, theories
(and thus hypotheses) cannot be proved, only disproved (falsification), and thus a null
hypothesis (Ho) is formulated to represent all possibilities except the hypothesized
prediction. For example, if the hypothesis is that there is a difference between (or
relationship among) populations, then the null hypothesis is that there is no difference or
relationship (effect). Evidence against the null hypothesis thereby provides evidence that
the hypothesis is likely to be true. The next step in hypothesis testing is to decide on an
appropriate statistic that describes the nature of the population estimates in the context of the
null hypothesis, taking into account the precision of the estimates. For example, if the
hypothesis is that the mean of one population is different from the mean of another
population, the null hypothesis is that the population means are equal. The null hypothesis
can therefore be represented mathematically as H0: µ1 = µ2 or, equivalently, H0: µ1 − µ2 = 0.
Figure 44: Cartoons explaining F-test and t-test (image source: G. Meixner 1998, www.viag.org)
6.1 F-Test
(Chapter based on Trauth, 2006)
The F distribution was named after the statistician Sir Ronald Fisher (1890–1962). It is used
for hypothesis testing, namely for the F-test . The F distribution has a relatively complex
probability density function.
The F-test by Snedecor and Cochran (1989) compares the variances sa² and sb² of two
distributions, where sa² > sb². An example is the comparison of the natural heterogeneity
of two samples based on replicated measurements. The sample sizes na and nb should be
above 30. The proper test statistic to compare the variances is then
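The statistic itself (its image was lost in this copy) is simply the ratio of the two variances:
$$ F = \frac{s_a^2}{s_b^2} $$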
The two variances are not significantly different, i.e. we reject the alternative hypothesis, if
the measured F-value is lower than the critical F-value, which depends on the degrees of
freedom Φa = na − 1 and Φb = nb − 1, respectively, and on the significance level α.
The single parameter Φ of the t distribution is the degrees of freedom. In the analysis of
univariate data, this parameter is Φ = n–1, where n is the sample size. As Φ→∞, the t
distribution converges to the standard normal distribution. Since the t distribution
approaches the normal distribution for Φ >30, it is not often used for distribution fitting.
However, the t distribution is used for hypothesis testing, namely the t-test.
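The test statistic of the pooled two-sample t-test referred to below (the formula image did not survive) is
$$ t = \frac{\bar{x}_a - \bar{x}_b}{\sqrt{\dfrac{(n_a-1)s_a^2 + (n_b-1)s_b^2}{n_a+n_b-2}\left(\dfrac{1}{n_a}+\dfrac{1}{n_b}\right)}} $$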
where na and nb are the sample sizes and sa² and sb² are the variances of the two samples a and
b. The alternative hypothesis can be rejected if the measured t-value is lower than the
critical t-value, which depends on the degrees of freedom Φ = na + nb − 2 and the significance
level α. If this is the case, we cannot reject the null hypothesis without another cause. The
significance level α of a test is the maximum probability of accidentally rejecting a true null
hypothesis. Note that we cannot prove the null hypothesis; in other words, not guilty is not
the same as innocent.
Conclusions 1
There is no evidence of non-normality (boxplots not grossly asymmetrical) or of
unequal variance (boxplots very similar in size and variances very similar).
Hence the simple Student's t-test is likely to be reliable, and we can use it to test the null
hypothesis as formulated above.
Conclusions 2
Reject the null hypothesis (i.e. egg production is not the same). Egg production was
significantly greater in the mussel zone than in the littorinid zone.
Student, 1908. The Probable Error of a Mean. Biometrika 6, 1-25, stable URL:
http://www.jstor.org/stable/2331554.
Conclusions 1
Whilst there is no evidence of non-normality (boxplots not grossly asymmetrical), the
variances are a little unequal (one of the boxplots is not more than three times smaller than
the other). Hence, a separate-variances t-test (Welch's test) is more appropriate than a
pooled-variances t-test (Student's test).
Conclusions 2
Do not reject the null hypothesis, i.e. the metabolic rate of male fulmars was not found to differ
significantly from that of the females.
The χ²-test introduced by Karl Pearson (1900) involves the comparison of distributions,
permitting a test of whether two distributions were derived from the same population. This
test is independent of the distribution that is being used. Therefore, it can be applied to test
the hypothesis that the observations were drawn from a specific theoretical distribution.
Let us assume that we have a data set that consists of 100 chemical measurements from a
sandstone unit. We could use the χ²-test to test the hypothesis that these measurements
can be described by a Gaussian distribution with a typical central value and a random
dispersion around it. The n data are grouped in K classes, where n should be above 30. The
frequencies within the classes Ok should not be lower than four and should never be zero. Then, the
proper test statistic is
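The statistic itself (lost in this copy) is
$$ \chi^2 = \sum_{k=1}^{K} \frac{(O_k - E_k)^2}{E_k} $$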
where Ek are the frequencies expected from the theoretical distribution. The alternative
hypothesis is that the two distributions are different. This can be rejected if the measured
χ² is lower than the critical χ², which depends on the degrees of freedom Φ = K − Z, where K is
the number of classes and Z is the number of parameters describing the theoretical
distribution plus the number of variables (for instance, Z = 2 + 1 for the mean and the variance
of a Gaussian distribution of a data set of one variable, Z = 1 + 1 for a Poisson distribution of
one variable).
By comparing any given sample chi-square statistic to its appropriate χ²-distribution, the
probability that the observed category frequencies could have been collected from a
population with a specific ratio of frequencies (for example 3:1) can be estimated. As is
the case for most hypothesis tests, probabilities lower than 5% (p < 0.05) are considered
unlikely and suggest that the sample is unlikely to have come from a population
characterized by the null hypothesis. χ²-tests are typically one-tailed tests focusing on the
right-hand tail, as we are primarily interested in the probability of obtaining large
χ²-values. Nevertheless, it is also possible to focus on the left-hand tail so as to investigate
whether the observed values are "too good to be true".
The χ²-distribution takes into account the expected natural variability in a population as
well as the nature of sampling (in which multiple samples should yield slightly different
results). The more categories there are, the more likely it is that the observed and expected
values will differ; when there are a large number of categories, some discrepancy is expected
by chance alone, which is reflected in the degrees of freedom.
First, we create a data frame with the Zar (1999) seed data
COUNT = c(152,39,53,6)
TYPE = c("YellowSmooth", "YellowWrinkled", "GreenSmooth", "GreenWrinkled")
seeds = data.frame(TYPE,COUNT)
We should convert the data frame into a table. Whilst this step is not strictly necessary, it
ensures that columns in various tabular outputs have meaningful names:
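The conversion command itself is not shown here; a sketch with xtabs (the object name seeds.xtab is taken from the chisq.test call below):
seeds.xtab <- xtabs(COUNT ~ TYPE, data = seeds)   # note: the table is ordered by the factor levels of TYPE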
We assess the assumption of sufficient sample size (<20% of expected values <5) for the
specific null hypothesis.
Conclusion 1
All expected values are greater than 5; therefore the chi-squared statistic is likely to be a
reliable approximation of the χ² distribution.
Now, we test the null hypothesis that the samples could have come from a population with
a 9:3:3:1 seed type ratio.
chisq.test(seeds.xtab,p=c(9/16,3/16,3/16,1/16), correct=F)
Conclusion 2
Reject the null hypothesis, because the probability is lower than 0.05. The samples are
unlikely to have come from a population with a 9:3:3:1 ratio.
After focusing on the prediction of relations between variables (correlation and regression
in chapter 5), and after comparing the similarities of only two groups by F- and t-test, we
now want to shift to understanding differences between more than two groups or
between several variables of one group. This methodology is referred to as analysis of
variance (ANOVA). ANOVA methodology is used to analyze a wide variety of experimental
and quasi-experimental designs.
Experimental design in general, and analysis of variance in particular, has its own language.
Before discussing the analysis of these designs, we’ll quickly review some important terms.
In our course, we only focus on one way ANOVA. For more complex study designs please
refer to the specific statistics books on which this course builds up.
Let's try to explain a simple ANOVA with an example of a medical study. Say you’re
interested in studying the treatment of a disease. Two popular therapies for this disease
exist: the CBT (therapy 1) and EMDR (therapy 2). You recruit 10 anxious individuals (s1-s10)
and randomly assign half of them to receive five weeks of CBT (s1-s5) and half to receive
five weeks of EMDR (s6-s10). At the conclusion of therapy, each patient is asked to complete
a self-report as a measure of health improvement (with scores from 1=fully recovered to
10=no change). The design is outlined in figure 46(a). In this design, Treatment is a
between-groups factor with two levels (CBT, EMDR). It’s called a between-groups factor
because patients are assigned to one and only one group. No patient receives both
treatments. The grades (of s1-s10) of the self-report are the dependent variable, and
Treatment is the independent variable. The statistical design in figure 46(a) is called a
one-way ANOVA because there’s a single classification variable. Specifically, it’s a one-way
between-groups ANOVA. Effects in ANOVA designs are primarily evaluated through F-tests.
Suppose instead that you are interested in the effect of CBT over time: you could place all 10 patients (s1-s10) in the CBT group and assess them at the conclusion of therapy
and again six months later. This design is displayed in figure 46(b). Time is a within-groups
factor with two levels (five weeks, six months). It’s called a within-groups factor because
each patient is measured under both levels. The statistical design is a one-way within-
groups ANOVA. Because each subject is measured more than once, the design is also called
a repeated measures ANOVA. If the F-test for Time is significant, you can conclude that
patients’ mean self-report scores changed between five weeks and six months.
In the following, we perform a one-way ANOVA with the example dataset PMM
introduced in chapter 5.3. The dataset contains information on the sedimentary units which
will be used as grouping factors. The age and depth columns are not needed for the ANOVA,
and we will reduce the dataset to the first 5 chemical elements for better visibility. All these
columns will be de-selected using the respective function of the dplyr package.
After setting the working directory (setwd) we load the dataset PMM.txt:
PMM = read.delim("PMM.TXT")
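The de-selection step mentioned above is not reproduced in this copy; a sketch that creates the reduced data set PMM3 used below (the exact column names dropped are an assumption):
library(dplyr)
PMM3 = select(PMM, -age, -depth)   # drop the columns not needed for the ANOVA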
For an initial overview, we prepare a boxplot showing the 5 selected elements and their
variances in the sedimentary units (figure 47):
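The plotting command for Figure 47 is not included in this copy; a sketch with ggplot2 (melting to the narrow format first, column names as assumed above):
library(reshape2)
library(ggplot2)
PMM.narrow = melt(PMM3, id.vars = "unit")
ggplot(PMM.narrow, aes(x = unit, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")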
Figure 47: boxplot of the PMM dataset showing the chemical elements by unit.
To get a numerical overview of the means and standard deviations of the 5 elements
in each group, we group the elements using the aggregate function of the basic stats
package. This is equivalent to the group_by function of the dplyr package; however, dplyr
does not allow grouping and a group calculation of several elements simultaneously.
This grouping is not necessary for the ANOVA, it is only an additional assessment!
# group means
PMM.mean = aggregate(PMM3[,3:8], by=list(PMM$unit), FUN=mean)
View(PMM.mean)
The command for the actual ANOVA is rather short and simple; however, aov can only
perform an ANOVA for one element at a time. We select the elements Calcium (Ca) and
Silicon (Si) and perform two separate ANOVAs. We then compare the results with the basic
summary function:
# ANOVA of Ca:
PMM.Ca = aov(data = PMM3, Ca ~ unit)
anova(PMM.Ca)
# ANOVA of Si:
PMM.Si = aov(data = PMM3, Si ~ unit)
anova(PMM.Si)
Conclusion 1
The F-test for Calcium (PMM.Ca) is significant (p = 0.035) at the 95% confidence level, but
the F-test for Silicon (PMM.Si) is much clearer (p < 0.001). As a result, the mean Ca and Si
values differ significantly between the units.
The plotmeans function in the gplots package can be used to produce a graph of group
means and their confidence intervals:
library(gplots)
plotmeans(PMM3$Ca ~ PMM3$unit, xlab="Units", ylab="Calcium",
main="Mean plot with 95% confidence interval")
Figure 48: The means of Ca, with 95% confidence limits, in each unit of the PMM dataset.
Now, the summary of the ANOVA tells us that the 4 units differ in their mean Ca values
and even more clearly in their mean Si values, but it doesn't tell us how
the units differ from each other! You can use a multiple comparison procedure to answer
this question. For example, the TukeyHSD() function provides a test of all pairwise
differences between group means, as shown next:
TukeyHSD(PMM.Ca)
TukeyHSD(PMM.Si)
The output of the Tukey HSD pairwise group comparison (HSD stands for: Honest
Significant Differences) can be plotted as follows:
Figure 49: The result of the Tukey HSD pairwise group comparison on the differences in mean levels of Si, with
95% confidence limits. (PMM dataset).
1. to develop a better predictive model (equation) than is possible from models based
on single independent variables
2. to investigate the relative individual effects of each of the multiple independent
variables above and beyond the effects of the other variables.
library(car)
scatterplotMatrix(~Ca+Ti+K+Rb+Sr+Mn+Fe, data=example2, diagonal="boxplot")
Conclusion 1
The element Mn is obviously not normally distributed (asymmetrical boxplot). Let us try how a scale transformation (e.g. a logarithm) changes that:
scatterplotMatrix(~Ca+Ti+K+Rb+Sr+log10(Mn)+Fe, data=example2, diagonal="boxplot")
Conclusion 2
The log10 transformation appears successful: there is no longer any evidence of non-normality (symmetrical boxplots).
asin(sqrt(LAP))*180/pi   # arcsine square root transformation of the proportion, converted to degrees
We then have to show that a simple linear regression does not adequately describe the
relationship between Lap94 and distance by examining a scatterplot and a residual plot.
Scatterplot:
scatterplot(LAP ~ DIST, data=mytilus)
residual plot:
plot(lm(LAP ~ DIST, data=mytilus), which=1)
Conclusion 1
The scatterplot smoother suggests a potentially non-linear relationship, and a persistent pattern in the residuals further suggests that the linear model is inadequate for explaining the response variable (Lap94).
Note that trends beyond a third order polynomial are unlikely to have much biological basis and are likely to be over-fit. This is also true for most geoscientific applications.
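The fifth-order model whose coefficients are shown below is fitted along these lines (a sketch; the object name mytilus.lm5 is taken from the plot call further down):
mytilus.lm5 = lm(LAP ~ DIST + I(DIST^2) + I(DIST^3) + I(DIST^4) + I(DIST^5), data=mytilus)
summary(mytilus.lm5)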
Coefficients:
(Intercept) DIST I(DIST^2) I(DIST^3) I(DIST^4) I(DIST^5)
2.224e+01 1.049e+00 -1.517e-01 6.556e-03 -1.033e-04 5.519e-07
plot(mytilus.lm5, which=1)
Conclusion 2
No “wedge” pattern of the residuals is visible (see Fig. 33 in chapter 5.4.1), suggesting homogeneity of variance and that the fitted model is appropriate.
anova(mytilus.lm5)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What is already stated as “information” above is here put into numbers: powers of distance
beyond a cubic (third order, x³) do not make significant contributions to explaining the variation in this data set.
For evaluating the contribution of an additional power (order) we can compare the fit of
higher order models against models one lower in order.
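In R such a comparison of nested models is an F-test with anova(); a sketch (mytilus.lm2 is assumed to be the analogous second-order fit):
anova(mytilus.lm2, mytilus.lm3)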
Conclusion 3
The third order model (lm3) fits the data significantly better than a second order model.
summary(mytilus.lm3)
Call:
lm(formula = LAP ~ DIST + I(DIST^2) + I(DIST^3), data = mytilus)
Residuals:
Min 1Q Median 3Q Max
-6.1661 -2.1360 -0.3908 1.9016 6.0079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.2232524 3.4126910 7.684 3.47e-06 ***
DIST -0.9440845 0.4220118 -2.237 0.04343 *
I(DIST^2) 0.0421452 0.0138001 3.054 0.00923 **
I(DIST^3) -0.0003502 0.0001299 -2.697 0.01830 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Coefficient of determination
r² is equal to the square of the correlation coefficient only in simple linear regression.
r² = SSreg / SStot reflects the explained variance.
Conclusion 4
There is a significant cubic (third order) relationship between the frequency of the Lap94 allele and the distance from Southport. The final regression equation is:
arcsin(sqrt(LAP)) = 26.2233 - 0.944*DIST + 0.042*DIST² - 0.0003*DIST³
# the plot
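# a possible way to draw figure 50 (a sketch with ggplot2, not necessarily the original code):
library(ggplot2)
qplot(DIST, LAP, data=mytilus) +
  stat_smooth(method="lm", formula=y ~ poly(x, 3), se=FALSE)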
Figure 50: Graph showing the 3rd order polynomial (mytilus.lm3) plotted against the data points of the mytilus dataset.
The objects we want to compare can be written as the rows of a data matrix, one row per object and one column per feature:
x⃗ = (x1, x2, ..., xn)
y⃗ = (y1, y2, ..., yn)
z⃗ = (z1, z2, ..., zn)
...
So the Jaccard distance in the above example would be d_Jaccard(A, B) = (2 + 5)/10 = 7/10, meaning that 70% of the features occur in only one of the two objects.
In R we can use the dist() function
DISTANCEMATRIX=dist(DATASET, method="")
to compute the distances between several objects. dist() calculates the distance between
the rows of a matrix, so make sure your DATASET has the right format. The set of
comparison results you get back is called a distance matrix. method="binary" gives you the
binary (Jaccard) distance. In R the input vectors are regarded as binary bits, so all non-zero
elements are ‘on’ and zero elements are ‘off’. In these terms the distance can be seen as the
proportion of bits in which only one is on amongst those in which at least one is on, which
is an equivalent definition to the one given above.
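A tiny worked example with two hypothetical presence/absence vectors:
A = c(1,1,0,1,0)
B = c(1,0,1,1,0)
dist(rbind(A, B), method="binary")   # 2 mismatches among the 4 positions where at least one is 1 -> 0.5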
Another important dissimilarity measure often used in ecology is the Bray-Curtis dissimilarity. It
compares species counts on two sites by summing up the absolute differences between the
counts for each species at the two sites and dividing this by the sum of the total abundances
in the two samples. The general formula for calculating the Bray-Curtis dissimilarity
between samples A and B is as follows, supposing that the counts for species x are denoted
by nAx and nBx:
d_BrayCurtis(A, B) = Σ (x=1..m) |n_Ax - n_Bx| / Σ (x=1..m) (n_Ax + n_Bx)
This measure takes values between 0 (samples identical: n_Ax = n_Bx for all x) and 1 (samples completely disjoint).
Number of individuals in:
                Species 1   Species 2   Species 3   Species 4
Aquarium 1          3           2           4           6
Aquarium 2          6           0           0          11
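For this aquarium example the Bray-Curtis dissimilarity can be computed with the vegan package (a sketch):
library(vegan)
aqua = rbind(Aquarium1 = c(3, 2, 4, 6), Aquarium2 = c(6, 0, 0, 11))
vegdist(aqua, method="bray")   # (3+2+4+5) / (15+17) = 14/32 ≈ 0.44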
If we have quantitative data the two most common distances used are Euclidean distance
and Manhattan (city-block) distance. Let us look at an example: We have analyzed two
different rocks for their content in calcium and silicium and we find the following
R1 = (calcium, silicium) = (r11, r12) = (0, 1) a.u. and R2 = (calcium, silicium) = (r21, r22) = (2, 2) a.u. (a.u. = arbitrary units).
If we plot calcium against silicium (Figure 52) we can see two points which represent the
two different rocks. How different are they? A very intuitive way to think of the distance
between these two points is the direct connection between them (continuous line). This
distance is called euclidean distance and can be easily calculated with the Pythagorean
theorem: d(R1, R2) = √((r11 - r21)² + (r12 - r22)²) = √5, as you know from school. Another way
to calculate the distance is to follow the dotted lines — as if we walked around in
Manhattan. Then we get the Manhattan (city-block) distance:
d(R1, R2) = |r11 - r21| + |r12 - r22| = 3. Obviously, the obtained distances are not the same; the distance between two objects depends very much on how you measure it.
Manhattan: d(x, y) = Σ (k=1..n) |x_k - y_k|. If n=2 we have the case we saw in the example before.
These distance measures depend on the scale. If two components are measured on
different scales you want to consider some standardization first, so that the
components contribute equally to the distance. Otherwise the larger
components will always have more influence.
In ecology the euclidean distance has to be used with care. Mathematically, the difference between 0 and 1 individuals is smaller than the difference between 1 and 4 individuals – ecologically, the step from absence to presence can be a huge difference. For comparing species composition at different sites you would therefore rather use other distance measures like the Bray-Curtis dissimilarity.
Exercise 28: Calculate the Manhattan and euclidean distance between the two
objects a=(1,1, 2, 3) and b=( 2, 2,1, 0) . One way to solve this is to use
R as a normal calculator applying the formulas above (or do it in your head). The
second one is to create two vectors a=c(a1,a2,a3,a4) and b and use
rbind(Vector1, Vector2) to combine them into a matrix. Then you can use the dist()
function.
“range” sets the highest value of each variable to 1 and scales the other values accordingly – they are then expressed as a fraction of the maximum.
Have a look at the dataframe. We are now interested in clustering the elements and not the
observations. The objects we want to cluster have to be in the rows of the dataframe.
Therefore we need to transpose our data matrix (swap rows and columns). This is done with the t() function.
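For example (a sketch; the name PMMn for the standardized data frame is an assumption):
PMMnt = t(PMMn)   # transpose: the elements become rows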
Now let us calculate the distances between the elements. Since these are measured on a
ratio scale it makes sense to use euclidean distances.
distm=dist(PMMnt, method="euclidean")
Now we use this distance matrix as input for our clustering. Let us first use single linkage as the clustering method.
ahclust=hclust(distm, method="single")
A graphical output can easily be obtained by plot(ahclust). This gives you a dendrogram
where we can see how closely the observations are related. The length of branches shows
the similarity of the objects. You can see that a lot of elements are added to existing
clusters in a stepwise fashion i.e. one after the other. This is a peculiarity of the single-
linkage method.
If you have a lot of objects, presenting the result as a dendrogram is no longer very readable. It is more useful to know the assignment of each object to a cluster for a given number of clusters. For that we use the R function cutree(DATA, #CLUSTERS). The
result is a vector with as many components as we have objects and each component tells
you the cluster for that object. If we want to divide our data into two groups we can now use
ahclust_2g = cutree(ahclust, k=2)
to get the assignment of each object to one of these groups. You can type the variable name
ahclust_2g to get some idea about this assignment.
Exercise 29: Try out the other linkage methods. Are there significant differences?
Check by plotting all the dendrograms into one big graph (see Chapter 4.5
Combined figures if you forgot how to do that).
BONUSPOINTS: Cluster the non-standardized dataset PMM. Do the results make
sense?
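The k-means call itself is not printed here; a minimal sketch (the object name km and the input matrix PMMnt are taken from the surrounding text, centers=4 from the sentence below):
km = kmeans(PMMnt, centers=4)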
will sort our objects into 4 different clusters. You don't need to calculate a distance matrix
before because the distances are calculated anew in each computation round. The obtained
data structure is different from before. The vector which contains the assignment of the
objects to the different clusters can be obtained by
kmclust=km$cluster
kmclust
library(ggplot2) # plotting
qplot(names(km$cluster), km$cluster)
will give you a representation of this clustering method. Understandably, we will not get a
dendrogram this time.
Exercise 31: The districts of the Baltic can be grouped by composition of the algae
species. Cluster the sites in the dataset algae_presence.csv with agglomerative
hierarchical clustering and a linkage method of your choice. Use the
presence/absence of species for classification. What distance measure should
you use? Look at the dendrogram. Do the results make sense?
Repeat the exercise after performing a Beals transformation (see the following infobox
or ?beals) of the data. What distance measure should you use? Do your results
make more sense?
Beals transformation:
Beals smoothing is a multivariate transformation specially designed for species
presence/absence community data containing noise and/or a lot of zeros.
This transformation replaces the observed values (i.e. 0 or 1) of the target
species by predictions of occurrence on the basis of its co-occurrences with
the other remaining species (values between 0 and 1). In many applications,
the transformed values are used as input for multivariate analyses.
In R Beals transformation can be performed with the beals() function of the
vegan package.
Further reading:
Afifi, May, Clark (2012): Practical Multivariate Analysis, CRC Press. Chapter 16:
Cluster Analysis
A good, accessible introduction that goes into more depth than this script.
http://www.econ.upf.edu/~michael/stanford/maeb7.pdf
Explanation of hierarchical clustering with examples.
bio.umontreal.ca/legendre/reprints/DeCaceres_&_Legendre_2008.pdf
A discussion about Beals transformation
dist(x, method="")
  x: a numeric matrix, data frame or "dist" object.
  method: the distance measure to be used. Must be "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski".
  Calculates the distance between the rows of a matrix and returns a distance matrix.
9.1.2 PCA in R
There are several possibilities to perform a PCA in R. We use a basic function from the stats
package: princomp(DATASET, cor=TRUE). cor specifies if the PCA should use the covariance
matrix or a correlation matrix. As a rough rule, we use the correlation matrix if the scales of
the variables are unequal. This is a conscious choice of the researcher!
Let us work again with a dataset we already know and love, though in a slightly modified form: PMM2.txt. Load the dataset, create a sub-dataset PMM without the last two columns (depth, phase; a sketch of these steps follows below), and then we can use
PMM_pca = princomp(PMM, cor=TRUE)
to carry out a complete PCA and get 15 principal components, their loadings and the scores
of the data. The first step of a PCA would be to calculate a covariance or correlation matrix.
However, the function will calculate it for us and we can use our raw data as input.
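The preparation steps mentioned above might look like this (a sketch; the import function and the way the two columns are dropped are assumptions):
PMM2 = read.delim("PMM2.txt")
PMM = PMM2[, !(names(PMM2) %in% c("depth", "phase"))]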
A basic summary of our analysis can be obtained by
print(PMM_pca)
summary(PMM_pca)
To get an idea about the data it is common to plot the scores of the 1st PC against the scores
for the 2nd PC
plot(PMM_pca$scores[,c(1,2)], pch=20, col=PMM2$phase)
text(PMM_pca$scores[,1],PMM_pca$scores[,2])
abline(0,0); abline(v=0)
To get an overview, we can create a scatterplot matrix, for example like this:
pairs(PMM_pca$scores[,1:4], col=PMM2$phase, main="Scatterplot Matrix
of the scores of the first 4 PCs")
We will get a scatterplot matrix of all these components against each other.
Since we want to know which variables have the greatest influence on our data, we want to
have a look at the loadings of the PCs. One way to do this is to just type the variable name:
PMM_pca$loadings
which shows which elements have the highest influence on the first PC.
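Scores and loadings can also be displayed together in a biplot (this call is a sketch, it is not part of the original text):
biplot(PMM_pca, choices=c(1,2))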
choices selects the PCs to plot. It is quite useful in analysing and interpreting the results.
alternatively with 'ggplot2'/'ggfortify':
library(ggplot2)
library(ggfortify)
autoplot(PMM_pca, data=PMM2, label=TRUE, colour='phase',
         loadings=TRUE, loadings.label=TRUE, loadings.colour='blue') +
  geom_vline(aes(xintercept = 0)) +
  geom_hline(aes(yintercept = 0))
Exercise 32: Repeat the plots above, but this time looking at the relationship
between the 1st and the 3rd Principal Component.
At the moment, plotting principal components other than PC1 and PC2 does not
yet work with ggfortify::autoplot!
We see that most of the information is in the first component and that from the 6th component onward there is little additional information. From the summary we know that the first 6 components explain 90% of the variance. In the next part we will see methods to determine which PCs are still useful for further analysis.
Variable Explanation
body depth (BD) depth of the body; for females, measured after displacement of the
abdomen, in mm
The main question we will try to answer is: Can we determine the species
and sex of the crabs based on these five morphological measurements?
We would like to have one single variable that allows us to classify a crab we find
correctly. Let's see if that is possible by completing the following subtasks:
1. View your data. From univariate box-plots assess whether any individual
variable is sufficient for discriminating the crabs' species or sex.
hint: melt the data set, then use ggplot with facet_grid
2. How could you determine if there is indeed a significant difference? Test if
there is a significant difference between the RW of the different groups.
➔ BONUSPOINTS: Create a scatterplot matrix of all measured variables
against each other.
hint: Use the parameter australian.crabs$group to color according to the
group they belong to. Group has to be a factor!
hint: use GGally::ggpairs() → see chapter 7
Does this help us in distinguishing groups?
3. Perform a PCA on the dataset. Since the variables are not quite on the same
scale we use the PCA with the correlation matrix (cor=TRUE).
4. Plot the scores of the first PC against the crab group. Hint: use plot(x,y)
Does it help in distinguishing the groups?
So our work in R is rather easy. We load our forest dataset by (watch out for the correct
directory path!)
forests<-read.csv("forests.csv", header=TRUE, row.names=1)
Since metaMDS is a complex function, there are a lot of possible parameters. You will want
to check
?metaMDS
to see what possible parameters there are. metaMDS needs the following structure of the
dataset: columns → variables ; rows → samples
As the original forest dataset is organized the other way around, we need to transpose it:
t_forests=t(forests)
Now a simple NMDS analysis of our dataset with the default settings could look like this:
def_nmds_for=metaMDS(t_forests)
distance is the distance measure used (see 8.1 Measures of distance), k is the number of
dimensions, autotransform specifies if automatical transformations are turned on or off.
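The object nmds_for used below presumably comes from a call with these parameters spelled out; a sketch (the chosen values are assumptions):
nmds_for = metaMDS(t_forests, distance="bray", k=2, autotransform=FALSE)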
You can see which objects the metaMDS function returns by
names(nmds_for)
We can view which parameters were used by writing the output variable name:
nmds_for
The column numbers correspond to the MDS axes, so this will return as many columns as
was specified with the k parameter in the call to metaMDS.
We can obtain a plot by:
plot(nmds_for)
You can even obtain fancier plots by further customization. By specifying type
="n", no sample scores or variable scores will be plotted. These can then be
plotted with the points() and text() commands. For crowded plots, congestion
of points and labels can be alleviated by plotting to a larger window and by
using cex to reduce the size of symbols and text. An example:
plot(nmds_for, type="n")
plots axes, but no symbols or labels
points(nmds_for, display=c("sites"), choices=c(1,2), pch=3,
col="red")
plots points for all samples (specified by “sites”) for MDS axes 1 & 2 (specified
by choices). Note that symbols can be modified with typical adjustments,
such as pch and col.
text(nmds_for, display=c("species"), choices=c(1,2), pos=1,
col="blue", cex=0.7)
plots labels for all variable scores (specified by “species”) for MDS axes 1 & 2.
Typical plotting parameters can also be set, such as using cex to plot smaller
labels or pos to determine the label position.
k: number of dimensions
http://cran.r-project.org/web/views/Spatial.html
Bivand, R.S., Pebesma, E.J., Gómez-Rubio, V. (2008): Applied Spatial Data Analysis with R, Use R! series, Springer, XIV, 378 p., ISBN 978-0-387-78170-9 (2nd edition 2013). Available for students of CAU Kiel as a free ebook.
https://stat.ethz.ch/mailman/listinfo/R-SIG-Geo/
R Special Interest Group on using Geographical data and Mapping
Because of the limited time available for this subject we will focus on the practical aspects
of spatial analysis, i.e. things you might need if you add maps to your statistical project or
final thesis. This includes mainly import of vector and raster maps, plotting of maps and
statistical analyses.
First, we need to define the different types of spatial data
• Point data, e.g. the location and one or more properties like the location of a tree and
its diameter. Normally this type is considered the simplest case of a vector file, but
we treat it separately, because mapping in ecology means frequently going out with
a GPS and writing down (or recording) the position and some properties (e.g. species
composition, occurrence of animals, diameter of trees...)
• Vector data with different sub-types, like a road or river map (normally coming from a vector GIS like ArcGIS).
• Grid or raster data are files with a regular grid, like digital images from a camera, a digital elevation model (DEM) or the results of global models.
str(X)
plot(X)
summary(X)
plot(density(X, 10))
library(ggplot2)
library(reshape2)
library(dplyr)
load(file="wq_map.Rdata")
str(wq_map)
library(ggmap)
# closing arguments of a dplyr::summarise() call (not shown in full here) that computes
# the mean nitrate value NO3_MEAN and the mean coordinates per measuring site:
lon=mean(lon,na.rm=TRUE),lat=mean(lat,na.rm=TRUE))
NO3_Mean$NO3_class=cut(NO3_Mean$NO3_MEAN,10)
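The background map object sh used below is not created in this snippet; a sketch with ggmap::get_map (location and zoom are assumptions):
sh = get_map(location = "Kiel, Germany", zoom = 9)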
# plot values in map
gg <- ggmap(sh)
gg <- gg + geom_point(data=NO3_Mean,
mapping=aes(x=lon, y=lat, size=2,color=NO3_MEAN))+
scale_color_gradientn(colours =rev(heat.colors(10)))
gg
36: Plot the nitrate means for the Kiel region with the following background
maps: Satellite map, OpenStreet (osm) and – for your artist friends -
Stamen maps in watercolor.
library(sp)
library(maptools)
library(spatstat)
NO3_Mean=as.data.frame(NO3_Mean) # mean nitrate content
# make map object
NO3_map=sp::SpatialPointsDataFrame(NO3_Mean[,c("lon","lat")],
data.frame(NO3_Mean$NO3_MEAN))
# convert to a spatstat ppp (point pattern) object using the maptools library
NO3_ppp = maptools::as.ppp.SpatialPointsDataFrame(NO3_map)
NO3_nn=spatstat::nnmark(NO3_ppp,at="pixels")
plot(NO3_nn)
class(NO3_nn)
The result is an image which can be converted easily to a data frame for further analysis.
nn_df=as.data.frame(NO3_nn)
str(nn_df)
NO3_idw=spatstat::idw(NO3_ppp,at="pixels")
plot(NO3_idw)
idw_df=as.data.frame(NO3_idw) # conversion made easy
qplot(data=idw_df,x=x,y=y,fill=value,geom="raster")
10.3.3 Akima
The akima library interpolates irregularly spaced input data onto a regular grid. It uses bilinear or bicubic spline interpolation with different algorithms. Unfortunately it also uses a different data format and therefore requires a little bit more effort.
library(akima)
raw8=NO3_Mean
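The interpolation call itself is not shown in this excerpt; a minimal sketch (the object name ak is chosen here, the columns lon, lat and NO3_MEAN are taken from above):
ak = interp(x = raw8$lon, y = raw8$lat, z = raw8$NO3_MEAN)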
Unfortunately you cannot plot the result directly with ggplot. The following code
transforms the result to a format you can plot with ggplot.
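A sketch of such a conversion, assuming the interp result ak from the sketch above:
ak_df = data.frame(expand.grid(x = ak$x, y = ak$y), value = as.vector(ak$z))
qplot(data = ak_df, x = x, y = y, fill = value, geom = "raster")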
library(spatstat)
library(maptools)
library(sp)
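The tessellation itself is not printed here; a common approach (a sketch) uses spatstat's dirichlet function on the NO3_ppp object created above and then coerces the result:
NO3_dir = dirichlet(NO3_ppp)                    # Dirichlet/Voronoi (Thiessen) tessellation
NO3_thiessen = as(NO3_dir, "SpatialPolygons")   # coercion method provided by maptools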
The tessellation returns a spatial structure (a spatstat tess object) which cannot be used directly later on, so we convert it to SpatialPolygons. This produces the spatial polygons, but they do not yet contain the NO3 values. In the next step we assign the values to the spatial structures.
int.Z = over(NO3_thiessen, NO3_map, fn=mean)
class(int.Z)
data.spdf = SpatialPolygonsDataFrame(NO3_thiessen, int.Z)
names(data.spdf)="NO3_Mean"
class(data.spdf)
plot(NO3_thiessen)
plot(data.spdf)
spplot(data.spdf, "NO3_Mean", col.regions = rainbow(20))
Finally we have to convert the vector file to a raster format and save it to disk.
Point data are possibly the most frequent type for ecologists. Typically, positions are
recorded with a GPS device and then listed in Excel or even as text.
The procedure in R to convert point data to an internal or ESRI-map is straightforward:
• read in the data
• define the columns containing the coordinates
• convert everything to a point shapefile
Following is a brief R script that reads such records from a CSV file, converts them to the
appropriate R internal data format, and writes the location records as an ESRI Shape File.
The file Lakes.csv contains the following columns. 1: LAKE_ID, 2: LAKENAME, 3:
Longitude, 4: Latitude. For compatibility with ArcMap GIS, Longitude must appear
before Latitude.
library(sp)
library(maptools)
library(rgdal)
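The script itself is not reproduced here; a minimal sketch using the packages above (the coordinate reference system is an assumption):
lakes = read.csv("Lakes.csv")
coordinates(lakes) = c("Longitude", "Latitude")          # define the coordinate columns
proj4string(lakes) = CRS("+proj=longlat +datum=WGS84")   # assumed: geographic WGS84 coordinates
writeOGR(lakes, dsn=".", layer="Lakes", driver="ESRI Shapefile")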
The easiest way to import a grid is to use the rgdal library, but we have to convert the result manually to the raster format with the raster package.
library(rgdal)
library(raster)
lu87grd = readGDAL("lu87.asc")
lu87 = raster(lu87grd)
str(lu87)
demgrd = readGDAL("dem.asc")
dem=raster(demgrd)
spplot(dem)
lu07grd = readGDAL("lu07.asc")
lu07=raster(lu07grd)
spplot(lu07grd)
To show you how maps are used for statistics we want to find out the land use type on steep
slopes.
slopegrd = readGDAL("slope.asc")
slope=raster(slopegrd)
spplot(slope)
hist(slope)
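# the selection of steep slopes is not shown in the original; a sketch with an assumed
# threshold of 10 degrees:
steep = slope > 10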
Multiply with the land use map – multiplication by 0 gives 0, and where the value is 1 the land use value is kept.
lu_steep = steep * lu87
freq(lu_steep)
39: Calculate the change in forest cover from 1987 until 2007 in elevations
> 1000m (code for forest: 1), count decreasing and increasing forest
cover
Hints: use logical functions to select data sets
The following tutorial is taken from: Paul Galpern, 2011: Workshop 2: Spatial
Analysis – Introduction, R for Landscape Ecology Workshop Series, Fall 2011,
NRI, University of Manitoba (http://nricaribou.cc.umanitoba.ca/R/)
Unfortunately, R is not very suitable for vector data, therefore we suggest that you prepare
the vector files as far as possible with a real GIS. If you really want to take a close look at
vector maps in R you can read the following help files and the book by Bivand et al. 2008.
library(maptools)
library(raster)
Read the shape files into R. Shape files, i.e. files with the extension .shp, are vector files in an ArcView format which can be used by all GIS packages.
vecBuildings <- readShapeSpatial("patchmap_buildings.shp")
vecRoads <- readShapeSpatial("patchmap_roads.shp")
vecRivers <- readShapeSpatial("patchmap_rivers.shp")
vecLandcover <- readShapeSpatial("patchmap_landcover.shp")
str(vecRivers)
Because R is not good with vector maps we convert everything to a raster. First, we define the size
and extent of the new raster map
rasTemplate <- raster(ncol=110, nrow=110, crs=as.character(NA))
extent(rasTemplate) <- extent(vecLandcover)
The field="GRIDCODE" part defines the variable which contains the code for the land use.
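The land cover layer is where this argument is actually needed; the corresponding call (a sketch; rasLandcover is used further below) would be:
rasLandcover <- rasterize(vecLandcover, rasTemplate, field="GRIDCODE")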
rasBuildings <- rasterize(vecBuildings, rasTemplate)
rasRoads <- rasterize(vecRoads, rasTemplate)
rasRivers <- rasterize(vecRivers, rasTemplate)
plot(rasBuildings)
plot(rasRoads)
plot(rasRivers)
A simple application of map operations is e.g. the creation of a buffer zone around streets or
buildings. This can be done with the boundaries function (formerly edge), which draws a line around the edges of a raster
ras2 <- boundaries(rasRoads, type="outer")
so that only the outer edges are drawn. To add one map to the other we use
rasRoads2 <- cover(rasRoads, ras2)
The final step is to combine the buildings, roads, rivers, and landcover rasters into one. We
will cover the landcover raster with the other three.
rasRoads[rasRoads==0] <- NA
rasRivers[rasRivers==0] <- NA
The features on each of these three rasters have a value of 1. In order to differentiate these features on the final raster we need to give each feature a different value. Recall that our landcover classes are 0 to 4. Let's set rivers to 5, buildings to 6, and roads to 7. It seems to be standard practise to use a continuous set of integers when creating feature classes on rasters.
rasRivers[rasRivers==1] <- 5
rasBuildings[rasBuildings==1] <- 6
rasRoads[rasRoads==1] <- 7
And now we can combine these using the cover function, with the raster on top first, and
the raster on bottom last in the list:
patchmap <- cover(rasBuildings, rasRoads, rasRivers, rasLandcover)
To read the landuse map from the GIS-course you can first check the file
getinfo.shape("landuse.shp")
where you find all attributes of the map.
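To manipulate the attribute table the map has to be read in first, presumably (this call is an assumption, analogous to the examples above):
myLanduse <- readShapeSpatial("landuse.shp")
You can manipulate these variables as usual, e.g.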
myLanduse@data[1,]
40: Calculate the sum and the average size of the different land use classes
(variable GRIDCODE)
41: Recode the variable GRIDCODE (or create a new variable) so that there are only two classes: water and land (water is 5); plot and export the map
11.1 Definitions
Time Series: „In statistics and signal processing, a time series is a sequence of data points,
measured typically at successive times, spaced at (often uniform) time intervals. Time series
analysis comprises methods that attempt to understand such time series, often either to
understand the underlying theory of the data points (where did they come from? what
generated them?), or to make forecasts (predictions). Time series prediction is the use of a
model to predict future events based on known past events: to predict future data points before
they are measured. The standard example is the opening price of a share of stock based on its
past performance. „
Trend: „In statistics, a trend is a long-term movement in time series data after other
components have been accounted for.“
Amplitude: „The amplitude is a non negative scalar measure of a wave's magnitude of
oscillation“
Frequency: „Frequency is the measurement of the number of times that a repeated event
occurs per unit of time. It is also defined as the rate of change of phase of a sinusoidal
waveform. (Measured in Hz) Frequency has an inverse relationship to the concept of
wavelength. „
Autocorrelation: „Autocorrelation is a mathematical tool used frequently in signal processing for analysing
functions or series of values, such as time domain signals. Informally, it is a measure
of how well a signal matches a time-shifted version of itself, as a function of the
amount of time shift” (the Lag). “More precisely, it is the cross-correlation of a signal
with itself. Autocorrelation is useful for finding repeating patterns in a signal, such as
determining the presence of a periodic signal which has been buried under noise, or
identifying the missing fundamental frequency in a signal implied by its harmonic
frequencies. „
Period: time period or cycle duration is the reciprocal value of frequency: T = 1/frequency
All citations from the corresponding keywords at www.wikipedia.org 2006
Name Content
Date Date
Peff Effective precipitation (mm)
Evpo_Edry Evaporation from dry alder carr (mm)
T_air Air temperature (°C)
Sunshine Sunshine duration (h)
Humid_rel Relative Humidity (%)
H_GW Groundwater level (m)
H_ERLdry Water level in dry part of alder carr (m)
H_ERLwet Water level in wet part of alder carr (m)
H_lake Water level in Lake Belau (m)
Infiltra Infiltration into the soil (mm)
The following command sequence converts the text of a German date (“31.12.2013”) to an
internal date variable:
t$date <- as.Date(as.character(t$Date), format="%d.%m.%Y")
The conversion as.character is sometimes necessary, because date values from files are
sometimes read in as factor variables.
It is useful to convert dates into a standard format available on many platforms, the POSIX format, which counts seconds since 1970.
t$posix <- as.POSIXct(t$date)
POSIXct stores the value as a single number and is the form better suited for data frames; the alternative class POSIXlt stores the date components (year, month, day, ...) in a list.
A much easier way to convert text to POSIXct is the anytime library. It recognizes many common formats and converts them to dates without complex format strings.
library("anytime")
anytime("2016-12-30")
Despite the German origin of the library's author, one of the formats it does not recognize by default is the German date format (“31.12.2016”); missing formats can be registered with addFormats(). In the same way you can add any time/date format to the library.
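A sketch of how a missing format, e.g. the German day.month.year format, could be registered (addFormats() is part of the anytime package):
addFormats("%d.%m.%Y")
anytime("31.12.2016")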
Another option is the following function from the package Hmisc, which converts text from a file directly into a date variable.
library(Hmisc)
Last but not least: the easiest way to get date/time variables into R is to import files directly in
Excel format with the readxl library.
More information is available in the descriptions of the packages chron and zoo; the latter is useful for time series with unequally spaced observations. Some important methods are:
DateTimeClasses(base) Date-Time Classes
cut.POSIXt(base)
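The command that creates the year factor is not printed here; a sketch consistent with the cut.Date help referenced below:
t$years = cut(t$date, breaks="years")
which will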
create a factor variable containing the years of the data set. The extraction of months and
weeks is similar (see help(cut.Date) for a summary of all possibilities).
You can check the results with
levels(t$years)
You can now use the factor to classify your data set in many functions. The following command, for example, creates a boxplot of the groundwater level for each year:
qplot(years,H_GW,data=t,geom="boxplot")
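# The command that produces the monthly factor is not shown in the original; a sketch
# that yields labels like "Jan 1978" (the column name yearmon is an assumption):
t$yearmon = factor(format.Date(t$date, format="%b %Y"))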
The command creates a factor for each month, e.g. "Jan 1978", "Feb 1978" etc. Frequently
this is not what you want. If you need a mean value for all months in the data set (e.g. for
seasonal analysis) you have to extract the name or number of the months and then to
convert them into a factor which can be used for a boxplot etc.
mon_tmp <- format.Date(t$date,format="%m")
t$julianday = timeDate::dayOfYear(timeDate::timeDate(t$date))
Once the factors are defined, you can use the dplyr library to create all kinds of summaries
(sums, mean....).
t_annual = dplyr::group_by(t, years)
t_ann_mean = dplyr::summarise(t_annual,
               mean_t = mean(T_air),
               median_t = median(T_air))
qplot(as.Date(years),mean_t,data=t_ann_mean,geom="line")
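The melted data frame t_ann used in the next plot is not created in the shown code; a sketch with reshape2, assuming the annual summary t_ann_mean from above:
t_ann = reshape2::melt(as.data.frame(t_ann_mean), id.vars="years")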
ggplot(data=t_ann,aes(x=as.Date(years),y=value))+
geom_line()+
facet_grid(variable ~ .,scales="free")
42: create boxplots with annual and monthly values of lake water level and
groundwater level
43: create a scatterplot matrix for single months of daily values of lake water levels (hint: use melt/cast functions for wide/narrow conversions, use days of month and years as row identifiers)
A time series X(t) can be decomposed as X(t) = T(t) + Σ S_i(t) + R(t), where
T = Trend, a monotone function of time t
S = one or more seasonal component(s) (cycles) of length/duration i
R = Residuals, the unexplained rest
The analysis of TS is entirely based on this concept. The first step is usually to detect and
eliminate trends. In the following steps, the cyclic components are analysed. Sometimes,
the known seasonal influence is also removed.
44: use a linear model to remove the trend from the air temperature (Hint:
function lm, look at the contents of the results)
11.4.2.2 Filter
Some TS show a high degree of variation and the real information may be hidden in the high
variation of the data set. This is why there are several methods of filtering or smoothing a
data set. Sometimes this process is also called “low pass filtering”, because it removes the
high pitches from a sound-file and lets the low frequencies pass. The most frequently used
methods are splines and moving averages. Moving averages are computed as mean values of a number of records before and after the actual value. The width of the averaging window determines the “smoothness” of the curve. If filtering is used to remove trends, “detrended” means the deviations from the moving averages.
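As an illustration, a centred moving average can be computed with the filter function from the stats package (a sketch; the window width of 31 days is an arbitrary choice):
t$T_air_smooth = stats::filter(t$T_air, rep(1/31, 31), sides=2)
t$T_air_detrended = t$T_air - t$T_air_smooth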
45: Remove the seasonal trend from the air temperature (Hint: use
daynumbers)
plot(ts)
For time series analysis we often need so-called “lag” variables, i.e. the data set shifted back or forth by a number of time steps. A typical example is the unit hydrograph, which relates the current discharge to the effective precipitation of a number of past days. This number is called the “lag”. You can create the corresponding time series with the lag function:
ts_test = as.ts(t$H_GW) # Groundwater
lagtest <- ts_test # temp var
for (i in 1:4) {lagtest <- cbind(lagtest,lag(ts_test,-i))}
Now check the structure and the content of lagtest.
erle_acf
The following, more complex command is better adapted to our data set: it calculates the autocorrelation over a whole year (365 days) and plots the coefficients.
erle_acf <- acf(H_ERLdry, lag.max=365, plot=TRUE)
The cross-correlation analysis is very similar. To analyse the relation between water level and precipitation we use
erle_ccf <- ccf(H_ERLdry, Peff, lag.max=30, plot=TRUE)
By splitting the output screen into several windows you can get a concise overview about
the relations between the different variables:
split.screen(c(2,2))
screen(2)
screen(3)
screen(4)
close.screen(all = TRUE)
The ggplot version looks similar, but uses a different logic: a narrow (long) version of the data set is created with the rbind command.
t1=ccf(t$H_ERLdry, t$Peff, lag.max=30, plot=FALSE)
ccf_all=data.frame(t1$acf,t1$lag,t1$snames)
t1=ccf(t$H_ERLdry, t$Evpo_Edry, lag.max=30, plot=FALSE)
ccf_all=rbind(ccf_all,data.frame(t1$acf,t1$lag,t1$snames))
t1=ccf(t$H_ERLdry, t$H_lake, lag.max=30, plot=FALSE)
ccf_all=rbind(ccf_all,data.frame(t1$acf,t1$lag,t1$snames))
t1=ccf(t$H_ERLdry, t$Infiltra, lag.max=30, plot=FALSE)
ccf_all=rbind(ccf_all,data.frame(t1$acf,t1$lag,t1$snames))
qplot(t1.lag,t1.acf,data=ccf_all,geom="line",col=t1.snames)+
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)
N <- length(Time)
plot(TempAirC ~ Time)
# modulus of the first Fourier coefficient (the DC component)
dc <- Mod(transform[1])/N
ylab="Periodogram",
main="Spectral Density")
ylab="Log(Periodogram)",
print(maxfreq)
cat("Corresponding period\n")
print(1/maxfreq)
Please note that the frequency refers to the whole data set (i.e. 3652 points) and not to a year, a day or the defined time step.
par(oldpar)
Next, we can use a different approach with a different scaling. The base period is now 365
days, i.e. frequency of 1 means one per year.
air =read.csv("http://www.hydrology.uni-
kiel.de/~schorsch/air_temp.csv")
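# conversion of the imported temperature column to a time series with an annual base
# period; the column name T_air is an assumption (not shown in the original):
airtemp = ts(air$T_air, frequency = 365)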
spec.pgram(airtemp,xlim=c(0,10))
abline(v=1:10,col="red")
To compute the residuals, we use the information from spectral analysis to create a linear
model.
x <- (1:3652)/365
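A sketch of such a model with an annual sine and cosine term (assuming the airtemp series from above; this is not necessarily the original model):
seas.lm = lm(airtemp ~ sin(2*pi*x) + cos(2*pi*x))
air_resid = residuals(seas.lm)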
49: analyse the periodogram of the lake water level before and after the
stl analysis
Frequently the required software is not fully installed at the first attempt. If you get an error message, try the installation a second time.
library(changepoint)
bre=cpt.mean(no3$Value)
qplot(data=no3,x=1:length(no3$Date),y=Value)+
geom_vline(xintercept=bre@cpts[1])
no3$Date[bre@cpts[1]]
12.1 Tasks
The central question in the first units of the course is:
• Has the climate of the Hamburg station changed since measurement began?
The question can be divided, for example, into the following sub-questions or sub-tasks:
• Comparing winter precipitation intensities: has the intensity of (winter) precipitation during the years 1959-89 changed in comparison to the years 1929-59?
• Are trends identifiable in annual mean, minimum and maximum temperature (linear
regression with time as the x-axis)?
• Has the difference between summer and winter temperatures changed?
12.1.1 Summaries
The following examples are organized according to level of difficulty.
Select one variable and create a figure of 800x1200 pixels with the following
contents:
• a plot of the original data
• a plot of annual, summer and winter mean values,
• a boxplot of decadal values (use the as.integer to calculate the factors)
• a violinplot of the two periods 1950-1980 and 1981-2010
• a lineplot of the monthly means or sums for the two periods 1950-1980 and 1981-
2010
• a boxplot of the daily values as function of period and month
• put everything together in one figure of 800x1200 pixels, send us the result by email and have a nice Christmas :-)
Hints
• prepare the figures step by step
• use aggregate to calculate the annual and monthly summaries
• Cloud_Cover
• RelHum
• Mean_Temp
• Airpressure
• Min_Temp_5cm
• Min_Temp
• Max_Temp
• prec
• sunshine
• snowdepth
52: Analyse the slope of the different variables. Is there a significant increase?
First Version:
Climate$Summer = 0
Climate$Summer[Climate$Month>5 & Climate$Month<10]=1
Second Version:
Climate$Summer = (Climate$Month>5) & (Climate$Month<10)
The result is a boolean variable
Solution 11:
m2 = (Max_Temp+Min_Temp)/2
scatterplot(Mean_Temp ~ m2)
scatterplot(Mean_Temp ~ m2| Year_fac)
Solution 39:
check=lu07==lu87
How much forest disappeared between 1987 and 2007 at elevations above 1000 m?
ue1000 =dem>1000
t2 = ue1000*dem
spplot(t2)
forest87=lu87==1
forest07=lu07==1
ue1000 =dem>1000
forest87a=forest87*ue1000
forest07a=forest07*ue1000
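# the definition of the decrease raster is not shown in the original; an assumed version:
# forest present in 1987 but not in 2007
diff87_07 = (forest87a==1) & (forest07a==0)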
spplot(diff87_07)
summary(diff87_07)
Cells: 770875
NAs : 378939
Mode "logical"
FALSE "384320"
NA's "378939"
# increase 87=0, 07 = 1
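# assumed definition, analogous to the decrease raster above
diff07_87 = (forest87a==0) & (forest07a==1)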
spplot(diff07_87)
summary(diff07_87)
Cells: 770875
NAs : 378943
Mode "logical"
FALSE "370912"
NA's "378943"
diff= diff87_07-diff07_87
spplot(diff)
Solution 42:
Solution 46:
gw = ts(H_GW, start=c(1989,1),freq=365)
plot(stl(gw,s.window="periodic"))
Solution 49: