
Analysis of

Ecological
Data with R

Georg Hörmann
Institute for Natural Resource
Conservation
[email protected]

Ingmar Unkel
Institute for Ecosystem Research
[email protected]

Christian-Albrechts-Universität zu Kiel

Evaluation Code WS 17/18: 42KYT



Changelog: (what changed when)
• Fall 2001: first version
• March 2002: revision of the structure, minor adjustments, approx. 300 downloads per
month
• September 2002: revision, minor adjustments, conversion to OpenOffice using Debian
• January 2003: revision, new version of phpMyAdmin, pivot table written out
• October 2003: revision for the winter semester course
• October 2004: revision for the course, now four hours long, new orthography (partially)

• June 2011: Translation by Kevin Callon

• October 2011: complete rewrite


• October 2011: revision of the revision. Ingmar joined the team and tried to turn
everything upside down
• October 2012: cluster analysis and ordination added
• October 2013: minor corrections, added simple ggplot2 examples
• Spring 2014-Autumn 2014: removed all references to spreadsheets, replaced by R
operators & functions
• Autumn 2014: started to replace data management functions with dplyr
• Autumn 2015: adapting everything to the Hadley Wickham-Universe (dplyr, tidyr,
reshape2, ggplot2, readxl)
• Winter 2015/16: background maps and spatial interpolation added
• Spring 2016: ANOVA added
• Autumn 2016: minor adaptations to new versions, changepoint analysis and Thiessen
polygons
• Autumn 2017: minor updates, RQGIS added

Author's Copyright

This book, or whatever one chooses to call it, is subject to the GNU license (GPL, full details
available on every good search engine). It may be further distributed as long as no money is
requested or charged for it.



Table of Contents
1 Introduction...........................................................................................9
1.1 Side note: Freedom for Software: The Linux or Microsoft Question...............................10

2 Basics with R.........................................................................................12


2.1 Installation................................................................................................................................14
2.1.1 Base System......................................................................................................................14
2.1.2 User Interfaces.................................................................................................................14
2.1.2.1 Rcmdr........................................................................................................................14
2.1.2.2 Rstudio......................................................................................................................18
2.2 The Hamburg climate data set...............................................................................................19
2.3 Recommended startup procedures.......................................................................................21
2.4 Import of data..........................................................................................................................22
2.4.1 Recommended input format..........................................................................................22
2.4.2 Import of text data (csv, ASCII) with Rcmdr...............................................................24
2.4.3 Direct import of Excel Files............................................................................................25
2.4.4 Import with Rstudio........................................................................................................26
2.4.5 First data checks, basic commands...............................................................................27
2.5 Working with variables..........................................................................................................29
2.5.1 Variable types and conversions....................................................................................29
2.5.2 Accessing variables.........................................................................................................30
2.5.3 Coding date and time......................................................................................................31
2.6 Simple figures...........................................................................................................................32
2.7 R and RMD.................................................................................................................................33
2.8 Export and save data...............................................................................................................35
2.9 Exercises....................................................................................................................................36
2.10 Debugging R scripts...............................................................................................................36

3 Data management with dplyr and reshape2.........................................38


3.1 Data Organization....................................................................................................................38
3.1.1 Optimum structure of data bases..................................................................................38
3.1.2 Dealing with Missing Values..........................................................................................39
3.2 Basic use of dplyr.....................................................................................................................40
3.2.1 Data management with dplyr........................................................................................40
3.3 Reshaping data sets between narrow and wide..................................................................42
3.4 Merging data bases..................................................................................................................44

4 Exploratory Data analysis.....................................................................46


4.1 Simple numeric analyses........................................................................................................47
4.2 Simple graphic methods.........................................................................................................47
4.3 Line and scatter plots..............................................................................................................48
4.4 Plots with ggplot2....................................................................................................................49
4.4.1 Simple plots......................................................................................................................49
4.4.2 Histograms and barplots with ggplot2.........................................................................50
4.5 Combined figures.....................................................................................................................52
4.5.1 Figures in regular matrix with mfrow()......................................................................52
4.5.2 Nested figures with split.screen().................................................................................52
4.5.3 Free definition of figure position..................................................................................53
4.5.4 Presenting simulation results.......................................................................................55



4.5.5 Combined figures with the lattice package.................................................................56
4.5.6 Multiple plots with ggplot2, gridExtra version..........................................................58
4.5.7 Multiple plots with ggplot2, viewport version...........................................................59
4.6 Brushing up your plots made with the standard system..................................................60
4.6.1 Setting margins................................................................................................................60
4.6.2 Text for title and axes.....................................................................................................60
4.6.3 Colors.................................................................................................................................61
4.6.4 Legend...............................................................................................................................61
4.6.5 More than two axes.........................................................................................................61
4.7 Saving Figures..........................................................................................................................62
4.8 Scatterplot matrix plots.........................................................................................................62
4.9 3d Images..................................................................................................................................64
4.10 Exercise...................................................................................................................................65

5 Bivariate Statistics................................................................................68
5.1 Pearson’s Correlation Coefficient.........................................................................................69
5.2 Spearman's Rank Coefficient.................................................................................................75
5.3 Correlograms – correlation matrices...................................................................................77
5.4 Classical Linear Regression....................................................................................................81
5.4.1 Analyzing the Residuals.................................................................................................82

6 Univariate Statistics..............................................................................87
6.1 F-Test.........................................................................................................................................87
6.2 Student's t Test........................................................................................................................88
6.3 Welch's t Test...........................................................................................................90
6.4 χ²-Test – Goodness of fit test..................................................................................91
6.5 ANOVA – Analysis of Variance..............................................................................................93

7 Multiple and curvilinear regression.....................................................98


7.1 Multiple linear regression......................................................................................................98
7.2 Curvilinear regression............................................................................................................98

8 Cluster Analysis...................................................................................105
8.1 Measures of distance.............................................................................................................105
8.2 Agglomerative hierarchical clustering..............................................................................109
8.2.1 Linkage methods...........................................................................................................109
8.2.2 Clustering Algorithm....................................................................................................110
8.2.3 Clustering in R................................................................................................................111
8.3 Kmeans clustering.................................................................................................................112
8.4 Chapter exercises..................................................................................................................115
8.5 Problems of cluster analysis................................................................................................115
8.6 R code library for cluster analysis......................................................................................117

9 Ordination..........................................................................................118
9.1 Principal Component Analysis (PCA).................................................................118
9.1.1 The principle of PCA explained...................................................................................118
9.1.2 PCA in R...........................................................................................................................121
9.1.2.1 Selecting the number of components to extract.............................................123
9.1.3 PCA exercises.................................................................................................................124
9.1.4 Problems of PCA and possible alternatives...............................................................125
9.2 Multidimensional scaling (MDS).........................................................................................125
9.2.1 Principle of a NMDS algorithm....................................................................................126



9.2.2 NMDS in R.......................................................................................................................127
9.2.3 NMDS Exercises.............................................................................................................129
9.2.4 Considerations and problems of NMDS.....................................................................129
9.3 R code library for ordination...............................................................................................130

10 Spatial Data.......................................................................................131
10.1 First example........................................................................................................................131
10.2 Background maps................................................................................................................132
10.3 Spatial interpolation...........................................................................................................134
10.3.1 Nearest neighbour.......................................................................................................135
10.3.2 Inverse distances.........................................................................................................135
10.3.3 Akima............................................................................................................................135
10.3.4 Thiessen polygons.......................................................................................................136
10.4 Point Data.............................................................................................................................137
10.4.1 Bubble plots..................................................................................................................138
10.5 Raster data............................................................................................................................138
10.6 Vector Data...........................................................................................................................139
10.7 Working with your own maps...........................................................................................142

11 Time Series Analysis..........................................................................144


11.1 Definitions............................................................................................................................144
11.2 Data sets................................................................................................................................145
11.3 Data management of TS......................................................................................................145
11.3.1 Conversion of variables to TS....................................................................................145
11.3.2 Creating factors from time series.............................................................................146
11.4 Statistical Analysis of TS.....................................................................................................148
11.4.1 Statistical definition of TS..........................................................................................148
11.4.2 Trend Analysis.............................................................................................................149
11.4.2.1 Regression Trends...............................................................................................149
11.4.2.2 Filter......................................................................................................................149
11.5 Removing seasonal influences..........................................................................................149
11.6 Irregular time series............................................................................................................150
11.7 Practical TS analysis in R....................................................................................................150
11.7.1 Auto- and Crosscorrelation........................................................................................151
11.7.2 Fourier- or spectral analysis......................................................................................152
11.7.3 Breakout Detection.....................................................................................................155
11.8 Sample data set for TS analysis.........................................................................................156

12 Practical Exercises.............................................................................158
12.1 Tasks......................................................................................................................................158
12.1.1 Summaries....................................................................................................................158
12.1.2 Regression Line............................................................................................................159
12.1.3 Database Functions.....................................................................................................159
12.1.4 Frequency Analyses....................................................................................................159

13 Applied Analysis................................................................................160
14 Solutions...........................................................................................164



Illustration Index
Figure 1: Workflow of a data analysis...............................................................................................12
Figure 2: Installation of Rcmdr..........................................................................................................14
Figure 3: Interface of Rcmdr..............................................................................................................15
Figure 4: After a successful import of the Climate data base........................................................16
Figure 5: File menu of Rcmdr, used to save commands and data......................................................17
Figure 6: Rstudio user interface.........................................................................................................18
Figure 7: Source of the climate data set for Hamburg-Fuhlsbüttel.............................................19
Figure 8: Download location, search for “tageswerte_01975_*” to get the Hamburg data set
(station code: 1975)..............................................................................................................19
Figure 9: Contents of the climate archive data...............................................................................20
Figure 10: Set work directory.............................................................................................................20
Figure 11: Creating outlines in source files.....................................................................................21
Figure 12: Common problems in spreadsheet files........................................................................22
Figure 13: Structure of our climate data base (Hamburg).............................................................23
Figure 14: Import of data....................................................................................................................23
Figure 15: Settings for an import of the climate data set from the clipboard...........................24
Figure 16: Result of a data import.....................................................................................................25
Figure 17: Data import with RStudio................................................................................................26
Figure 18: Control of variable type...................................................................................................27
Figure 19: Frequent problem with a conversion of mixed variables...........................................28
Figure 20: Basic structure of a markdown file................................................................................33
Figure 21: Result of the Markdown code from previous figure in DOC-Format.......................34
Figure 22: Example of a good and bad database structure for daily time series.......................38
Figure 23: Example of a good and bad database structure for lab data.......................................38
Figure 24: Conversion from wide to narrow....................................................................................41
Figure 25: Structure of the "molten" data set.................................................................................42
Figure 26: Merging two data bases....................................................................................................43
Figure 27: Different version of histograms and barplots with ggplot2......................................49
Figure 28: Combining figures with the split.screen() command..................................................51
Figure 29: Layout frame......................................................................................................................52
Figure 30: Result of the layout commands.......................................................................................53
Figure 31: Common display of hydrological simulation results...................................................54
Figure 32: Scatterplot of air temperatures with annual grouping...............................................56
Figure 33: Structure of a figure of NO3-N overview......................................................................64
Figure 34: Final version of the NO3 overview.................................................................................64
Figure 35: Display of a bivariate data set.........................................................................................66
Figure 36: Pearson’s correlation coefficient r for various sample data......................................70
Figure 37: Pachygrapsus crassipes – striped shore crab................................................................73
Figure 38: picture of the PMM sediment core and the Si-content plotted as red curve along
the core..................................................................................................................................75
Figure 39: Correlogram of the correlations among the variables in the PMM data frame......77
Figure 40: Different correlogram layouts using the corrgram package with the variables in
the PMM data frame............................................................................................................78
Figure 41: Linear regression...............................................................................................................80
Figure 42: Schematic residual plots depicting characteristic patterns of residuals: (a) random
scatter of points – homogeneity of variance and linearity met, (b) “wedge-shaped” –
homogeneity of variance not met, (c) linear pattern remaining – erroneously
calculated residuals or additional variable(s) required, and (d) curved pattern
remaining – linear function applied to a curvilinear relationship. Modified from Zar
(1999)......................................................................................................................81
Figure 43: (a) probability density function f(x) and (b) cumulative distribution function F(x) of
a χ² distribution with different values for Φ..................................................84
Figure 44: Cartoons explaining F-test and t-test (image source: G. Meixner 1998,
www.viag.org).......................................................................................................................85
Figure 45: (a) probability density function f(x) and (b) cumulative distribution function F(x) of
a Student’s t distribution with different values for Φ....................................................87
Figure 46: (a - left) one-way between-groups ANOVA (b – right) one-way within-groups
ANOVA...................................................................................................................................92
Figure 47: boxplot of the PMM dataset showing the chemical elements by unit.....................93
Figure 48: The means of Ca, with 95% confidence limits, in each unit of the PMM dataset....94
Figure 49: The result of the Tukey HSD pairwise group comparison on the differences in
mean levels of Ca, with 95% confidence limits. (PMM dataset)....................................95
Figure 50: Graph showing the polynomial of 3rd order (mytilus.lm3) plotted against the data
points of the mytilus dataset............................................................................................100
Figure 51: Illustration of Jaccard distance.....................................................................................104
Figure 52: Illustration of Euclidean and Manhattan distance...................................................105
Figure 53: Illustration of different ways to determine the distance between two clusters, for
example by single-linkage (A) or complete-linkage (B)...............................................107
Figure 54: Illustration of the agglomerative hierarchical clustering algorithm.....................108
Figure 55: Illustration of the kmeans clustering algorithm........................................................110
Figure 56: Illustration of the PCA principles.................................................................................116
Figure 57: Graphic result of a PCA...................................................................................................117
Figure 58: Leptograpsus variegatus................................................................................................120
Figure 59: Structure of the water quality data base.....................................................................128
Figure 60: Spatial distribution of average Nitrate content in freshwater systems of
Schleswig-Holstein.............................................................................................................130
Figure 61: Summary of one variable (Mean Temperature).........................................................158



1 Introduction
There are many books on statistics, most of them difficult to digest and with a tendency to
reprint formulas. There are even more books for every possible type of software, in which
formatting and graphic creation are explained at length. What is lacking is a compilation of
the methods and tools used daily in practice.
This book is not meant to replace statistical textbooks and programming handbooks, but is
rather meant as a summary for ecologists containing numerous practical tips which
otherwise would have to be gathered from many different sources. It was conceived as an
accompaniment to a course at the Ecology Center of the University of Kiel, in which
students of geography, biology, and agricultural science are introduced to analyzing data
records.
Most students have had an introduction to statistics and a basic course in data processing.
The scope of both is usually limited and the connection between the two has rarely been
made – even though this knowledge is fundamental and, at the latest, a prerequisite for the
diploma thesis.
The aim of this book as well as that of our course is to give students an overview of the
methods and tools used to analyze data records based on measurements and modeling. The
structure of this book is built on the work flow used in the analysis of data.
In the review of tools we've made a point of emphasizing open-source software. This is
partly for financial reasons: small institutions and engineering firms often cannot even
afford large and expensive packages, whose range of functions is moreover often oriented
more toward the needs of bookkeepers than those of scientists. Software from the realm of
the natural sciences may often be arduous to learn but is in return more flexible and
productive in the long run.
The data sets for this course are available on a website in the internal e-learning system of
Kiel University (OLAT, https://www.uni-kiel.de/lms/dmz/) where example data and files as
well as the latest version of this book are available for download. Current links to the
recommended software can of course also be found there.
Presuppositions: this book doesn't provide an introduction to the various programs; rather,
it presupposes basic knowledge of user software and operating systems. We cover the
things a user needs in practical situations but which are rarely mentioned in the respective
introductory courses.
The authors of this book have seen everything that can go right or wrong. They also take
the point of view that irony, amusement, and the regular enjoyment of Monty Python films
are fundamental requirements for survival when doing scientific work.

Comments on typography



Warnings are displayed like this. They point out frequent everyday mistakes that can set
off an hour-long search for the cause

1: Exercises and homework for courses are marked like this

Further information, literature and Internet addresses

Formulas for R look like this

1.1 Side note: Freedom for Software: The Linux or Microsoft Question
Word may have gotten around by now that we don't live in the best of all worlds; for one
thing PCs and software would be freely available if we did. On the one hand we as users
want to pay as little as possible, but at the same time the programmers of our software can't
live on air and appreciation alone, at least not for long. The completely normal capitalistic
model has software sold as any other merchandise and the people who design and build it
paid as any other worker – that's the Microsoft version.
The other side views things somewhat more idealistically: software is a human right and
should flow freely in the free stream of ideas. Users and programmers constitute an organic
unit and continually develop the product together. The programmer earns his/her money
not (only) through the software but through the related rendering of services. Then there
are people who develop software out of idealism and for whom an elegant piece of software
affords the same pleasure as a good concert – that's the Linux version. It's significantly
more prevalent in the academic world because many programs developed with
governmental financial support are passed around free of charge.
Why Linux ?
• Linux is available freely or inexpensively – even for commercial application.
• Linux is a modular operating system, so unused functions take up no storage space
and can't crash. It's thus possible, for example, when using systems for data logging
or when using a pure database system, to avoid graphical user interfaces altogether.
• Linux systems are also serviceable remotely through slow connections – no ifs or
buts. In case of an emergency, just about the whole system can be reinstalled online.
• Linux systems run more stably and are less demanding on hardware.
• Linux systems are fully documented - all interfaces etc.

Along with the technical arguments there's also the current financial situation of schools
and learning institutions of various levels, and also of many smaller firms. When the
operating system and the office suite together are more expensive than the computer on
which they're installed, many consider whether they shouldn't just buy two PCs with Linux



and LibreOffice. Then come the exorbitant prices for software in the technical/academic
world. When simulation software, geographical information systems (GIS), statistics
packages, and databases are all needed, then the price of the hardware becomes negligible.
Worse still, the further development of office suites along with expensive updates for
technical/scientific versions has not improved anything for scientific work.

We have therefore decided in favor of a dual track: we discuss solutions to problems with
standard packages that are also applicable to open-source software (Excel, LibreOffice) and,
concerning more expensive special software for statistics, graphics, and data processing,
elaborate more upon free software.

http://iso.linuxquestions.org/ contains ISO images of all important Linux
distributions. The data can be downloaded, burned to CD-ROM and used
as an installation medium.
http://www.ubuntu.com/ is the Linux we prefer – it has very good
compatibility and is very user friendly. You can install it within a Windows
system or run it from a CD to test it.



2 Basics with R
R is fully documented; there are many tutorials for beginners as well as quite advanced
manuals for special problems such as regression or time series analysis. This is why we limit
this introduction to a basic practical work session in which you can see how things are
handled in R.
In our course, this session is the first session with R. In practice, you may have checked your
data set already with a spreadsheet and are now ready to start with the real work. It is also
quite common to check things out with a spreadsheet and then transfer the whole process to
R where it can be automated for future use. This is why we repeat here things you can also
do with a spreadsheet, but we are sure that you will soon prefer the minimalistic beauty of
an R command line over the silly and redundant mouse clicking of the uninitiated.
The structure of this chapter strictly follows the workflow of a typical session in R (see Fig.
1). We explain things when they first appear in the workflow – even if this sometimes breaks
the logic of the program and/or the interface.
If you google for solutions to problems discussed in this book you might get very confused,
because the solution on the internet is completely different from ours. This is one of the
disadvantages of open source software: if someone gets upset enough about something, he
can always publish a better solution. This has happened quite frequently to R, especially for
graphics and data management. There are at least three different graphics subsystems, each
with a different philosophy, grammar and look. Sometimes not even x and y have the same
position in the different procedures. For this book we try to use the modern ggplot2
library and fall back on the older ones where needed.
In data management the situation was even worse. Only in recent years has something like
a common denominator emerged with the dplyr library. Unfortunately, this library is hard
for a beginner to master and changes quite fast. We try to use it as often as possible because
it offers a very consistent interface to data management and is similar to other database
languages like SQL.
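As a small taste of this style, here is a minimal sketch only – it assumes the Hamburg climate
data have already been imported into a data frame called Climate, as described later in this
chapter, and that the Year column has been created from the date:

library(dplyr)

# mean daily air temperature after the year 2000, computed in one pipe
Climate %>%
  filter(Year > 2000) %>%
  summarise(MeanTemp = mean(AirTemp_Mean, na.rm = TRUE))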

Figure 1: Workflow of a data analysis – prepare the data (structure, missing values,
reformatting), import the data, control and clean the import (data types, structure, missing
values, extreme values, date/time), summarize the data with statistics (annual summaries,
factor summaries, frequencies) and with plots (time series, scatter plots, boxplots), and
finally apply advanced statistics.



2.1 Installation

2.1.1 Base System


In Windows you have to download the program from www.r-project.org and install it.
In Linux (Ubuntu and other Debian systems) you can install R together with a user interface
from the Software Center or directly from the command line with

sudo apt-get install r-cran-rcmdr

The version from r-project.org is usually newer, but you have to add the servers
manually to the list of repositories.

2.1.2 User Interfaces


During the rest of this course we will use the basic R installation with Rcmdr and Rstudio
as additional user interfaces. Each program has its fans; our personal experiences are:
• Rcmdr is suitable for absolute beginners because you can use the menus to carry out
the first steps.
• Rstudio is the software we use for daily work; it is a modern interface/editor for a
programming language, but you have to type everything manually. You can inspect
plots and the contents of variables directly.

2.1.2.1 Rcmdr
To install the Rcmdr interface, select Packages -> Install Packages from the R GUI.
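If you prefer the console, the same installation can be started with a single command (a
minimal sketch; the dependencies argument pulls in the additional packages Rcmdr needs):

# install Rcmdr and its dependencies from CRAN
install.packages("Rcmdr", dependencies = TRUE)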

If you have never worked with packages before, R will ask which mirror server it should
use – select one close to you or in the same network (e.g. Göttingen for the German
universities). Next, select all packages of Rcmdr as shown in Fig. 2 and wait for the
installation to finish. After the installation, start the interface with
library(Rcmdr)

Before it starts for the first time it will load additional packages from the internet. After
this process, Rcmdr comes up (Fig. 3) and is ready for work.

For the first steps in R we recommend Rcmdr, because it helps you to import data files and
builds commands for you. If you are more familiar with R you can switch to Rstudio, which
is the more modern GUI.



Figure 2: Installation of Rcmdr



Figure 3: Interface of Rcmdr

We will use the successful import of our data to R (Fig. 4) as an introduction to the basic
philosophy of Rcmdr. The program window consists of three parts: the script window, the
output window and the message window.
• The script window contains the commands sent by Rcmdr to R. This is the easiest way
to study how R works, because Rcmdr translates all commands you select with the
mouse in the interface into proper R code. You can also type in any command
manually. To submit a command or a marked block you have to click on the submit
button.
• The output window shows the results of the operation you just submitted. If you type
in the command “3+4” in the script window and submit the line, R confirms the
command line and prints out the result (“7”).
• The message window shows you useful information, e.g. the size of the database we
just imported.

Much of the power of R comes from a clever combination of the script windows and the file
menu shown in Fig. 5.



Figure 4: After a successful import of the Climate data base

The commands dealing with the “script file” save or load the commands contained in a
simple text file. This means that all commands you or Rcmdr issue in one session can be
saved in a script file for later use. If you e.g. put together a really complex figure, you can
save the script and repeat it whenever you need it with a different data set.
The same procedure can be used for the internal memory, called “workspace” in R. It
contains all variables in memory; you can save it before you quit your session and R reloads
it automatically next time. If you want to reload it manually you can use the original R
interface or load it from the data menu.
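The same can also be done from the console with base R functions; a minimal sketch (the
file names are only examples):

# save all variables currently in memory (the workspace) to a file
save.image("climate_session.RData")

# reload a saved workspace in a later session
load("climate_session.RData")

# save the command history of this session to a script file
savehistory("climate_session.R")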



Figure 5: File menu of Rmcdr, used to save command and data

2.1.2.2 Rstudio
The Rstudio GUI (http://www.rstudio.com) has to be installed like any other Windows
program (Fig. 6).
For Ubuntu Linux you also have to download and install the software from the website; it is
not part of the software repositories.



Figure 6: Rstudio user interface

2.2 The Hamburg climate data set


The data set we use in our course is a climate data set from the Hamburg station, starting in
the year 1891. The data set is part of the free climate data sets of the German Weather
Service (http://www.dwd.de/ and Fig. 7). You can download the latest version from the
(horribly structured) ftp site shown in Fig. 8. The contents of the download are shown in
Fig. 9. You only need the file marked with the red circle; all other files are documentation
(in German). Rename the data file to something reasonable like climate_hh.txt (see Fig.
13). Do not worry about the German descriptions of the variables; we will change them
immediately after the import.



Figure 7: Source of the climate data set for Hamburg-Fuhlsbüttel

Figure 8: Download location, search for “tageswerte_01975_*” to get the Hamburg data set (sta-
tion code: 1975)



Figure 9: Contents of the climate archive data

2.3 Recommended startup procedures


There are a few startup procedures which are not really essential for the survival of our
species but which make life with R a little bit easier.

Figure 10: Set work directory

1. Always work in the same directory, preferably the one where your code is placed. This is
where R stores all saved files, figures, data bases etc. (Fig. 10).
2. Use “####” to create headings and structure your workflow (Fig. 11)
3. Load all libraries at the start of your code



4. Do not work only at the command line; put all your code in a source file and try to
keep it clean, so that it can be executed at any time without further intervention or
errors. This keeps all your results together and creates a nice documentation of your
analysis. You can also use R to write a complete paper, but this is beyond the scope
of this book (see R Markdown for reference, http://rmarkdown.rstudio.com/). A
minimal script skeleton following these recommendations is sketched below.
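A sketch of such a script header (the working directory path and the file name are only
placeholders, not part of the course material):

#### Setup ######################################################
# load all libraries at the start of the script
library(ggplot2)
library(dplyr)
library(readxl)

# always work in the same directory, preferably where the code lives
setwd("~/statistics_course")

#### Import data ################################################
Climate <- read.csv("climate_hh.txt", sep = ";", na.strings = "-999")

#### Analysis ###################################################
# ... the actual work goes here ...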

Figure 11: Creating outlines in source files

2.4 Import of data


For a good start of this lesson we need the data base and an interface to the R program.
The structure of the data base is shown in Fig. 13, but you can use any climate data set.

• Column (variable) names in spreadsheets should not contain spaces
or special characters (umlauts etc.).
• Do not mix text and numbers (e.g. to mark missing values). Use -999
or a similar code for missing values in numeric columns

2.4.1 Recommended input format


Before you think about importing data from worksheets you should check the structure of
the data base in the worksheet.

Make sure that

• the data base has a proper, rectangular structure
• all old mixed data columns are deleted; use numeric codes for missing
values (“-999”) and avoid text like “NA”. Empty cells are ok.
• variable/column names are clean (no special characters, no spaces
and/or operator symbols such as + or -)



Fig. 12 shows some common problems in spreadsheet files which cause trouble later on in
R. First, check that there is only one rectangular matrix per worksheet. Remove all old
intermediate steps like the ones shown in columns E-G in Fig. 12. Check also the lower end
of the spreadsheet for sums and other grouping lines. Second, check the variable/column
names. They should only contain good old-fashioned ASCII characters: no spaces (Fig. 12),
umlauts, operator symbols or other characters which have a special meaning (“/”, “(”).
Third, check the columns for text, especially text used to mark missing values (“-”) etc.

Figure 12: Common problems in spreadsheet files



Figure 13: Structure of our climate data base (Hamburg)

2.4.2 Import of text data (csv, ASCII) with Rcmdr


The first step of an analysis is to import the data. Usually, the data set is already available as
spreadsheet or text file and you need only the commands shown in Fig. 14.

Figure 14: Import of data

The data set must have a rectangular form without empty rows or columns



If you import the data set from the clipboard you should take care to fill out properly the
fields marked in red in Fig. 15, especially the decimal-point character and the field
separator.

Figure 15: Settings for an import of the climate data set from the clip-
board

2.4.3 Direct import of Excel Files


There are several so-called libraries (packages) to import worksheets directly, but many
have their problems or work only with the outdated 32-bit version. We had the best
experience with the new readxl library: it does not require additional software and works
lightning fast.

library(readxl)
Climate = readxl::read_excel("climate_import.xlsx", sheet = 1, col_names = TRUE, na = "-999")

The next step is quite essential: control the structure of the import file with

str(Climate)

Fig. 16 shows the correct output after a successful import. First, check the number of
variables and rows (observations). Second, check the data type of the variables.



In our case they should be numeric, i.e. numeric or integer. Other types like character or
factor occur if there is text in at least one row of the file. In this case you should go back to
chapter 2.4.1. Do not worry about the “Factor” type for columns with date and time; we will
deal with this problem in section 2.5.3.

Figure 16: Result of a data import

2.4.4 Import with Rstudio



Figure 17: Data import with RStudio

Import of data in CSV format is also available in Rstudio. Figure 17 shows the import
function. The available options are the same as in Rcmdr.
You can also import the file manually with
Climate <- read.csv("Climate_hh.txt", sep=";", na.strings="-999")

2.4.5 First data checks, basic commands


After the import of the data set you should first check some basic things: the variable types
(“structure”) with
str(Climate)

The results should look like Fig. 18 if you used the import function of Rcmdr. The whole file
is converted to a variable of type data.frame, which consists of several sub-variables
corresponding to the columns of the file. The data type of all variables is numeric – this is
the normal view.
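Two more base R commands that are often useful for a first check (not shown in the figure):

head(Climate)      # shows the first rows of the data set
summary(Climate)   # minimum, maximum, quartiles and mean of every numeric column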
If you need help you have different choices. If you know the name of a command, you get
help with

help("ls")



You can search in the help files with
??ls

Other useful commands:


ls()

lists all variables currently in memory.


rm()

removes a variable.
edit(Climate) or View(Climate)

lets you check your data set and change (edit) or view (View) its values.
names(Climate)

lists the names of the sub-variables (columns) contained in a variable


Climate

If you type the name of a variable, the contents are displayed

Figure 18: Control of variable type



Fig. 19 shows a frequent problem: If the imported file contains not only numbers, but also
text (see left side, text “Missing”), then the whole column is converted to a factor variable,
i.e. the variable cannot be used for computation, only for classification.

Figure 19: Frequent problem with a conversion of mixed variables

2.5 Working with variables

2.5.1 Variable types and conversions


In spreadsheets you can mix variable types as you like, e.g. text and numbers. For statistical
analysis this does not make sense: you cannot calculate a mean value between numbers
and text. This is why statistical programs (and data bases) are stricter when it comes to
the types of variables. The following types of variables are quite common in programming
languages and in R:
• real numbers (1.33)
• integer numbers (1, 2, 3, 4, ...)
• boolean (yes/no, 0/1)
• text (“This is a text”)
• character (“A”, “b”, but also “1”, “2”)
Another important type in R is a factor variable. You can think of it as a header in a pivot
table, it is used for classification of values. A factor variable can be a text like “Forest” or a
number (e.g. a year).
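Since this section is about types and conversions, here is a minimal sketch of the functions
typically used to check and convert types. The column names refer to the Climate data set
used in this chapter (Year is only created later in section 2.5.3, and Year_f is an illustrative
name):

# check the type of a single column
class(Climate$AirTemp_Max)                # e.g. "numeric"

# convert a number to a factor (e.g. for grouping by year)
Climate$Year_f = as.factor(Climate$Year)

# convert a factor that actually contains numbers back to numeric;
# going through as.character() avoids getting the internal factor codes
Climate$AirTemp_Max = as.numeric(as.character(Climate$AirTemp_Max))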



2.5.2 Accessing variables
Frequently we do not use the whole data set but only parts of it. The following examples
show how to use parts of the climate data set from Fig. 18. The commands look simple at
first sight, but they are able to replace a data base and can even replace advanced filters in
Excel.
Climate$AirTemp_Max

Displays the content of the AirTemp_Max column.


The output of the next commands is quite obvious: you select rows and columns by numbers.
Climate[1,1]

First value of first line

Climate[1,"Meas_Date"]

Same as before, but with the variable name instead of the number


Climate[1,]

All values of first record


Climate[,1]

All values of first variable


The next commands are hard to understand at first sight, but they are the source of the
unmatched elegance and flexibility of R.
Climate[-1,]

All values except the first line


Climate[1:10,]

The first ten lines


Climate[1:10, c(2:4,7,9)]

The first ten lines of columns 2-4, 7 and 9. The expression c() creates a vector – most
commands accept it as input.
Climate[Climate$AirTemp_Max>35,]

Get only data sets with AirTemp_Max>35


Climate$AirTemp_Max[(Climate$AirTemp_Max>19 &
Climate$AirTemp_Max<20)]

Get values of AirTemp_Max between 19 and 20 (logical AND). The logical OR condition is
handled by the operator “|”.
If you want to keep the results and save them in a variable, you can use the “=” operator.
climax = Climate[(Climate$AirTemp_Max>19 & Climate$AirTemp_Max<20),]

Creates a new variable climax with the contents of the selection.


If you do not want to type the name of the data matrix each time you need a variable, you
can use
attach(Climate)

to make the variables inside the data frame directly visible, but do not forget to call
detach(Climate)

after you have finished.


After attaching a data frame, the command
AirTemp_Max[(AirTemp_Max>19 & AirTemp_Max<20)]
will give the same result as the command above with full names. Another (politically
correct) method to access variables inside a data frame is to use the with function
with(Climate, AirTemp_Max[(AirTemp_Max>19 & AirTemp_Max<20)])

2.5.3 Coding date and time


The format of the date in Germany is usually of the form “DD.MM.YYYY”; in international
publications it is written in ANSI form as “YYYY.MM.DD”. These text-formatted dates are usually
converted into numbers. Normally the day count is the integer part of the coded number, and the
decimal fraction represents the time of day as the fractional day since midnight. What makes
handling of date values difficult is that different programs use different base values for the
day count. ANSI e.g. uses 1601-01-01 as day no. 1, while some spreadsheets use 1900-01-01 on PC
and 1904-01-01 on Mac computers. It is therefore highly recommended to use the text format for
data exchange. The commands are explained in chapter 11.3.1 on page 139.

For our climate data set we need real dates, so we have to convert the input to internal date
values.

Climate$Meas_Date=as.character(Climate$Meas_Date)

A conversion of the integer variable to text makes it easier to create the date.

Climate$Date= as.Date(Climate$Meas_Date, "%Y%m%d")

Convert the text to a real date. See the help for a complete list of all format options.

Climate$Year = format(Climate$Date, "%Y")

Extract years from the date – we need this information later for annual values.

# check data type!


Climate$Year = as.numeric(Climate$Year)

The format function returns text, we convert it back to a number.



Climate$Month = format(Climate$Date, "%m")
Climate$Month = as.numeric(Climate$Month)

In ecology we frequently need the day number from 1 to 365. We get it in R by subtracting
the date of the 1st of January from the current date, or with the yday function from the
lubridate library.

Climate$Dayno = Climate$Date - as.Date(paste(Climate$Year, "/1/1", sep=""), "%Y/%m/%d") + 1
library(lubridate)
Climate$Dayno = yday(Climate$Date)

The easiest way to avoid these problems is to import files with time variables directly from an
Excel worksheet. There the spreadsheet time variables are usually converted directly to
R-style time and date variables.

2.6 Simple figures


R has three different graphic subsystems. For a first overview we recommend the new
ggplot2 library.

library(ggplot2)
qplot(Date,AirTemp_Mean,data=Climate)

If you do not specify the type of figure you want, ggplot2 makes a guess.

qplot(Date,AirTemp_Mean,data=Climate,geom="line")

The geom parameter defines the type of figure you want to have. In this case line is a good
choice.

qplot(Year,AirTemp_Mean,data=Climate,geom="boxplot")
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="boxplot")

Some commands cannot handle all data types; here we have to convert the numeric variable
Year to a factor variable.

qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="boxplot")

Boxplots are not always the best method to display data. If the distribution might be
clustered, the jitter type is a good alternative: it displays all points of a data set.

qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter")

One of the advantages of the qplot command is that you can use colours and matrix plots
out of the box.

qplot(as.factor(Month),AirTemp_Mean,data=Climate,geom="jitter",col=Year)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",col=Month)



It can be quite useful to plot data sets in annual or monthly subplots, with the facets
option you can plot one- or two-dimensional matrices of plots.

qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="jitter",facets= Month ~ .)
qplot(as.factor(Year),AirTemp_Mean,data=Climate,geom="line",facets= Month ~ .)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Month)
qplot(as.factor(Dayno),AirTemp_Mean,data=Climate,geom="jitter",col=Year)

2.7 R and RMD


The normal way to program in R is to write code in a file and execute it. The results are
printed on screen, and if you want to save them for later use you have to do it manually or
program it explicitly. Especially for figures this can be a boring and time consuming job.
The solution to this problem is called "R Markdown", or in short "rmd". So-called markup
languages are quite commonly used; the best known example is probably Wikipedia. A
markup file is a text file with embedded commands. These commands can be simple text
formatting characters or complete R scripts. An rmd file is a mixture of R commands, normal
text and formatting code. If you execute this file, the R code is executed, the results are
saved (including all figures) and the output is written as a DOC, PDF or HTML file. If you want, you
can write your complete paper/thesis in RMD. The main advantage of rmd is that
everything is saved and you do not need to take special precautions to save your results. A
nice by-product is that your research is nicely documented, and if something changes
(e.g. values are deleted), you just execute the file again and everything is updated.
The markdown syntax is quite easy to learn and fits on the two pages of the "R Markdown Cheat
Sheet" which is part of RStudio. For more details on using R Markdown see
<http://rmarkdown.rstudio.com>.
To execute the commands you have two options: for programming you can execute the code
line by line; once the code is ready you can "knit" the whole document in a single step and
produce updated output for all commands.
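A minimal sketch of such a file (title, output format and chunk contents are only placeholders) could look like this:

---
title: "Climate report"
output: word_document
---

Some normal text written in markdown.

```{r}
library(ggplot2)
qplot(Date, AirTemp_Mean, data = Climate, geom = "line")
```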



Figure 20: Basic structure of a markdown file

When you click the **Knit** button a document will be generated that includes both the
content and the output of any embedded R code chunks within the document. Figure
20 shows the basic structure of such a file. The results can be seen in Figure 21. Please note
that the results of the qplot command are saved directly and automatically in the resulting
DOC file.
The markdown language has many more options to produce a nice output. You can e.g. switch
off the display of the code to show only the results of the R commands, and you can set the
figures to any size you like.



Figure 21: Result of the Markdown code from previous figure in DOC-Format
For this course we will continue to use pure R-Code, but for practical projects we
recommend the use of rmd-files.

2.8 Export and save data


If you want to export your results, it is best to write them to a CSV file:
write.table(Climate,"climtest.csv", sep=";")

If you want to continue to use the data in R, you can also save the whole workspace in R
format using
save.image("test.Rdata") # saves the whole workspace

or save only specific variables with


save(Climate,file="climate.Rdata")

To export files in Excel format, you have two possibilities. First, you can use

library(WriteXLS)
WriteXLS(Climate, "climate.xls")

However, to use this library you have to install another open source computer language
named “Perl”. For Windows, you can use ActivePerl. In most Linux installations, Perl is part
of the basic system and should be already installed.



The second, easier to use library is

library(writexl)
write_xlsx(Climate, "climate.xlsx")

2.9 Exercises
1: calculate a variable Climate$Summer where summer=1 and
winter = 0
2: Plot the summer and winter temperatures in a boxplot

To convert one type to another, there are as.xxx() functions. To convert our
summer/winter classification to a factor variable we could type
Climate$Summer = as.factor(Climate$Summer)

3: create new factor variables for year and month. Do not replace
the original values, we will need them later.

• To convert a factor variable back to its original values you need to
convert it first to a text variable:

• Climate$Year = as.numeric(as.character(Climate$Year))

If you forget this not really obvious step, you get the index of the factor level, not the value of
the variable.
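A small sketch of this pitfall:

years = as.factor(c(1998, 2000, 2005))
as.numeric(years)                # returns the level index: 1 2 3
as.numeric(as.character(years))  # returns the values: 1998 2000 2005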
4: create an additional variable for groups of 50
years. Use the "facets" and "color" options to check
and display the temperatures.

2.10 Debugging R scripts


Many beginners are frustrated by R's error messages, but with some patience and trust in R
you will get over it. Especially when you start to use R, the first rule is:
The problem always sits in front of the screen.
Normally, R is only doing what you ask for; the problem is that your commands are broken.
The positive side is that you will learn a lot if you try to find errors on your own without
whining at your instructor that R is broken. Usually, R is not broken.
There are a few rules which help you over the first problems:
1. Check spelling of variable names. R is case sensitive. “Year” is a different variable
than “year”.
2. Check and correct the first error message first; later errors are often caused by it



3. Check variable types. If a variable is of the “factor” type you cannot compute
anything with it and you cannot use a numeric variable to group boxplots.
4. Execute the code line by line and check the results
5. Googling the error message often helps.
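A few commands that help with rules 1 and 3 (a sketch, assuming the Climate data frame from the previous chapter):

str(Climate)          # lists all variables and their types
class(Climate$Year)   # type of a single variable
names(Climate)        # exact spelling of all variable names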



3 Data management with dplyr and
reshape2
During many years of the development of R, different standards for data management have
emerged. There are a lot of different solutions for single problems – all with a different
syntax and philosophy. As always, when confusion was at its highest, a redeemer appeared and gave us
dplyr, the one and only interface for data management in R. You can replace nearly all
features of dplyr with other (old) R-functions, but dplyr offers a clean and easy to
understand interface to data.

3.1 Data Organization


Many data evaluations fail already in the preparatory stages – and with them often a
hopeful junior scientist. It's one of the most moving scenes in the life of a scientist when a
USB stick or notebook computer with freshly processed data (or data deleted in the logger)
sinks in a swamp, ditch, or lake.
Taking heed of our own painful experience, we've placed a chapter before the actual focal
point of the book in which data organization and data safeguarding are gone over. Along
with it there's also a short overview of the vexing set of problems associated with various
date and time formats when working with time series.

3.1.1 Optimum structure of data bases


You can avoid many problems if you create a well structured data base. This normally means:
one case per record. Everything else may consume less space, but you will have to
restructure the whole thing if you need a different analysis. Some common errors are
explained below. Figure 22 shows time series. Sometimes the daily values are arranged in
the horizontal direction and the months in the vertical direction (left part). With this structure, you
already run into problems if you want to create monthly values, because different months
have a different number of days. The right side version looks more complicated at first
sight, but is in fact a more elegant and useful structure: you can create monthly values with
one single command. Figure 23 shows a similar data base with lab data: the repetitions are
located in the horizontal direction, the sample number in the vertical direction. Again, putting only
one sample in a record facilitates later analysis.
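As a sketch of such a single command (using the Climate data set of this course, assuming Year and Month have been created as in the previous chapter):

aggregate(Prec ~ Month + Year, data = Climate, FUN = sum)   # monthly precipitation sums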



Figure 22: Example of a good and bad database structure for daily time series

Figure 23: Example of a good and bad database structure for lab data

3.1.2 Dealing with Missing Values


A gap-free data record is as rare as a winning lottery ticket, and yet the majority of
practical applications and many computer models require one. The
methods used to touch up or fill in a record are complex and often apply only to a
specific field or only to one single variable.
The handling of missing values is a little bit complicated because they cannot be treated like
regular numbers. The basic functions are:
Replace numbers by the code for missing values (NA)

Climate$Air_Press[Climate$Air_Press==-999]=NA

The following solution replaces all -999 values in the whole data set; it works, but is not really obvious
Climate[Climate==-999]=NA

List all records where Air_Press is missing



Climate[is.na(Climate$Air_Press),]

count all records where Air_Press is missing

length(Climate$Air_Press[is.na(Climate$Air_Press)])
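Shorter equivalents for counting missing values (a sketch):

sum(is.na(Climate$Air_Press))   # missing values in one variable
colSums(is.na(Climate))         # missing values per column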

3.2 Basic use of dplyr


In short, dplyr offers you the basic functions of data management:
• filter: select parts of the data set defined by filter conditions
• select: pick variables (columns) of a data set
• arrange: sort data sets
• mutate: change values, calculate and create new variables
• group_by: divide the data set into groups (e.g. years, months)
• summarise: calculate means, sums etc. for groups
• join: combine two data bases

https://www.rstudio.com/resources/cheatsheets/
Some functions in R are complex and have many different options which
are difficult to remember. For the most important functions there are so
called "cheat sheets". The "data wrangling cheat sheet" summarizes all
commands for data management with dplyr.

3.2.1 Data management with dplyr


Data management in R is a basic task, but there are many different approaches which use a
different syntax and are difficult to understand for beginners. The dplyr library offers a
consistent interface to all data management tasks, starting with simple filter operations and
going up to complex combinations of data bases.
The following code repeats parts of the last chapter and gives you an impression of how things
are handled in dplyr.

library(dplyr)
Climate=readxl::read_excel("climate_import.xlsx",sheet=1,col_names=TRUE,na="-999")

The selection and filtering of variables requires more typing compared to the basic version,
but the code is quite readable.

# select variables
test=dplyr::select(Climate,AirTemp_Mean) # select variable

You can also use the index of columns, but the version with names is more readable and
avoids problems if columns are deleted



test = dplyr::select(Climate, c(18:21,4))

The use of the filter function is straightforward:

test = filter(Climate, AirTemp_Mean>5)

You can easily use any R function to change values, but the politically correct way is to use
the mutate function. The calculation of the date from the last chapter can be expressed like
this:

test=mutate(Climate,date=as.Date(as.character(Climate$Meas_Date),
"%Y%m%d"))

A common method in meteorology is to use the average of minimum and maximum
temperature as a replacement for the mean temperature.

test = mutate(Climate,New_Mean=(AirTemp_Max+AirTemp_Min)/2)

In pure R you get the same result with


test$New_Mean = (Climate$AirTemp_Max+Climate$AirTemp_Min)/2

5: Draw a figure (scatterplot) with New_Mean and the measured mean.

A new method to confuse beginners is the so-called "chaining" of commands. It comes from
the Unix world and is generally known as "piping". The output of one command is the
input of the next. It makes computations more efficient and avoids the use of temporary
variables, but it is difficult to debug. Therefore, we do not recommend it for beginners. Use
piping only if you know that your code is working as you expect.

#chaining / piping
test= Climate %>%
mutate(Meas_Date=as.character(Climate$Meas_Date)) %>%
mutate(Date=as.Date(Meas_Date, "%Y%m%d")) %>%
mutate(Month = format(Date, "%m"))

You can use the arrange function to sort the data set and find the 100 hottest days, e.g. to
look for signs of global change:
t7=arrange(test,desc(AirTemp_Mean))
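To keep only the 100 hottest days, combine this with head() (a sketch):

t7_top100 = head(t7, 100)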

6: Select the 100 hottest days and plot the temporal distribution as a
histogram in groups of 10 years
7: Select the 100 coldest days and plot the temporal distribution as a
histogram in groups of 10 years

The most common application for dplyr is the calculation of means, sums etc., e.g. for
annual and monthly values. The first step is to create groups:
clim_group=group_by(Climate, Year)



With this grouped data set you can use any function to calculate new values based on the
grouping
airtemp=summarise(clim_group,mean=mean(AirTemp_Mean),
median=median(AirTemp_Mean))

8: Calculate and draw monthly mean values

3.3 Reshaping data sets between narrow and wide


Much of the power of R in data management comes from a clever combination of dplyr,
reshape2 and ggplot2.

The wide format is very common; an example is the Climate data set used for this course
(fig. 24). It is characterised by more than one measured value per row, in this case the different
temperatures. This format is used by many functions in R, e.g. the old graphic functions and
many functions for statistical analysis. The narrow format always contains only one measured
value per row; the description of the different variables goes to a separate column named
variable (fig. 25), which contains the names of the columns from the wide format. At first, the
narrow format seems to be unnecessarily complex and inefficient, but this structure
combined with dplyr, reshape2 and ggplot2 makes many complex operations easy and
efficient.



The reshape2 library is a utility to convert from the wide to the narrow format (melt) and back
from narrow to wide (cast). Both formats are used frequently. For all ggplot2 functions, the
narrow format of a data set is the better choice.
library(reshape2)
id=c("Year","Month","Dayno","Date")
measure=c("AirTemp_Mean","AirTemp_Max","AirTemp_Min","Prec","Hum_Rel")
Clim_melt=melt(Climate,id.vars=id,measure.vars=measure)

In figure 24 you can see how the different variables are transformed with melt into the
new structure shown in Fig. 25. In the melt function, two parameters are important: id
variables and measure variables. The id variables remain unchanged and are used as an
index. Date variables are typical id variables. The columns of the measure variables are
collapsed into the variable and value columns of the new data set. One line of the original,
"wide" data set is now converted to 5 lines in the "narrow" data set.

Figure 25: Structure of the "molten" data set
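The reverse operation, back from the narrow to the wide format, is a one-liner with dcast (a sketch):

Clim_wide = dcast(Clim_melt, Year + Month + Dayno + Date ~ variable)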

Now it's easy to create a quite complex figure with a simple command. Please note the "+"
as the last character of the first line; we need it to continue the graphic command.
qplot(Dayno,value,data=Clim_melt) +
facet_grid(variable ~ .,scales="free")

3.4 Merging data bases


To demonstrate how to merge two data bases we use a different data set. The
administration of the Plön district supervises a monitoring program of all lakes in the Plön
district (Edith Reck-Mieth). We have a data base of the annual measurements of chemical
properties (chemie) and a second data base of the static properties of the lakes like depth,
area etc. (stations).
First, we read the two data bases and check content and structure.
chemie <- read.csv("chemie.csv", sep=";", dec=",")
stations <- read.csv("stations.csv", sep=";", dec=",")

In the second step we merge the two data bases. We can use the old style command
# join the two data bases
chem_all=merge(chemie,stations,by.x="Scode")



In dplyr syntax, the same result is produced by
chem_all2=dplyr::inner_join(chemie,stations,by="Scode")

In fig. 26 the process is explained graphically. The two data bases share a common variable
called Scode, the typical numeric code for the sampling site. All other information about
the site is stored in the stations data base. The merge process now uses the variable Scode
to look up the name and other properties of the sampling site and combines them in a new data
base (chem_all). With this new data base we can e.g. analyse the relationship between lake
area and nutrient content.

9: Check if there is a relation between lake depth, area and a chemical
variable of your choice.

• Remember that you have a lot of choices to code values in ggplot2


◦ size
◦ col
◦ facets
◦ linetypes/point types



4 Exploratory Data analysis
The expression exploratory data analysis (EDA) goes back to Tukey 1977. It is an approach
to analyzing data sets to summarize their main characteristics in an easy-to-understand
form, often with visual graphs. The main purpose is to get a “feeling” of the data set –
similar to a first “exploratory” walk in an unknown city.
We will also use this chapter to introduce the three different ways of creating figures in R.
As mentioned earlier, R has three different graphic subsystems: ggplot2, lattice and the
original base system – unfortunately they are not compatible, and each system has a few
unique features. The situation is similar with dplyr: sometimes the older functions are
easier to use and more widespread.
To shorten the data base a little bit, we use the following lines to restrict our data set to the
years from 2000 onward and to create some factor variables.
clim2000 = Climate[Climate$Year >= 2000,]
str(clim2000)
# calculate/Update factor variables
clim2000$Season = as.factor(clim2000$Season)
clim2000$Year_fac = as.factor(clim2000$Year)
clim2000$Month_fac = as.factor(clim2000$Month)

A frequent source of errors is an earlier figure definition which is still
active. A good start for each test is therefore to switch off and reset the old
device settings with
dev.off()
From now on we will not explain all options of a command; please refer to
the help function for more information, e.g.
help(max) or
?max

Murrell, Paul, 2006: R Graphics, Computer Science and Data Analysis


Series, Chapman & Hall/CRC, 291p
Tukey, John Wilder, 1977: Exploratory Data Analysis. Addison-Wesley.
ISBN 0-201-07616-0.
Tufte, Edward (1983), The Visual Display of Quantitative Information,
Graphics Press.
http://www.itl.nist.gov/div898/handbook/eda/eda.htm, online
textbook
Engineering Statistics Handbook: Exploratory Data Analysis

4.1 Simple numeric analyses


The obvious method to get a summary of the whole data set is
summary(Climate)
If you want a first overview, a pivot table is always a good choice if you work with
spreadsheets. With R you can get a table with monthly and annual sums with
xtabs(Climate$Prec ~ Climate$Year + Climate$Month)

The syntax of this command, especially the selection of the variables, is quite typical for
many other functions. "Prec ~ Year + Month" means: analyse the data
variable Prec and classify it by the year and month variables.
You can calculate the same result with the dplyr and reshape2 libraries
mgroup=group_by(Climate,Year,Month)
msum=summarize(mgroup, sum=sum(Prec))
msum=data.frame(msum)
mmonth=dcast(msum,Year~Month)

4.2 Simple graphic methods


The best plots for an overview of a data set are a scatterplot and a boxplot. If you want an
overview of the monthly mean temperature you can type, old style,
boxplot(Climate$AirTemp_Mean ~ Climate$Month_fac, ylab="Temp.")

or use ggplot2

qplot(Month_fac,AirTemp_Mean,data=Climate,geom="boxplot")

It is also never a bad idea to plot a histogram with a frequency distribution


hist(Climate$Prec)
qplot(Prec,data=Climate,geom="histogram")

The lattice library contains a lot of useful chart types, e.g. dotplots

library(lattice)
dotplot(AirTemp_Mean ~ Year_fac, data=clim2000)
# also available in ggplot2
qplot(Year_fac,AirTemp_Mean,data=clim2000)
qplot(Year_fac,AirTemp_Mean,data=clim2000,geom="jitter")

A nice version of a boxplot is a violinplot


library(vioplot)
vioplot(clim2000$AirTemp_Mean, clim2000$AirTemp_Max,
        clim2000$AirTemp_Min, names=c("Mean","Max","Min"))
library(violinmplot)
violinmplot( Year_fac ~ Prec, data=clim2000 )
violinmplot( Year_fac ~ AirTemp_Mean, data=clim2000 )

There is also a ggplot2 version


qplot(data=clim2000,x=Year_fac,y=AirTemp_Mean,
geom="violin")



4.3 Line and scatter plots
Type in the following commands and watch how the figure changes
attach(clim2000)
plot(AirTemp_Mean)
plot(AirTemp_Mean, type="l")
plot(AirTemp_Mean, type="l", col="red")
plot(AirTemp_Mean ~ Date, type="l", col="red")
plot(AirTemp_Mean ~ Date, type="l", col="red",
ylab="Temperature", xlab="Day")

A scatterplot is a version of a line plot with symbols instead of lines. It is a very common
type used for later regression analysis. For a ggplot2 version of these figures see section 4.4.1.
plot(AirTemp_Max, AirTemp_Min)
abline(0,1)
abline(0,0)
abline(lm(AirTemp_Min ~ AirTemp_Max), col="red")
lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")
abline(lm(AirTemp_Max ~ AirTemp_Min), col="green")

There are also some new packages with more advanced functions. Try e.g.
library(car)
scatterplot(AirTemp_Max ~ AirTemp_Min | Year_fac)

For really big datasets the following functions can be quite useful
library(IDPmisc)
iplot(AirTemp_Min, AirTemp_Max)

or
library(hexbin)
bin = hexbin( AirTemp_Min, AirTemp_Max,xbins=50)
plot(bin)

or
with(Climate,smoothScatter( AirTemp_Mean,AirTemp_Max))

or with ggplot2
qplot(data=Climate,AirTemp_Mean,AirTemp_Max, geom="bin2d")
qplot(data=Climate,AirTemp_Mean,AirTemp_Max, geom="hex")

If you do not like the boring blue colours, you can change them to rainbow patterns
qplot(data=Climate,AirTemp_Mean,AirTemp_Max)+
stat_bin2d(bins = 200)+
scale_fill_gradientn(limits=c(0,50), breaks=seq(0, 40, by=10),
colours=rainbow(4))

plot() opens a new figure
lines() adds a line to an existing figure
abline() draws a straight line

10: plot time series of max, min and mean temperature with ggplot2 system
(hint: use reshape2)
11: plot a scatterplot with mean temperature vs. (max+min)/2
12: calculate average annual temperatures and average precipitation for
the climate data set using the aggregate function.

13: compare the monthly temperatures and precipitation for 1950-1980 and
1980-2010.

4.4 Plots with ggplot2


Unfortunately, the best graphic system in R is also the most complicated. Because disasters
always strike twice, ggplot2 does not work with the old system for multiple figures like layout
but requires you to learn a new system from the grid package.
http://ggplot2.org/ The Website for the package
https://www.stat.auckland.ac.nz/~paul/grid/grid.html The website
explaining the grid layout package
http://shiny.stat.ubc.ca/r-graph-catalog/ A website with example
code and figures
Chang, W., 2012. R graphics cookbook. Sebastopol, CA: O’Reilly Media.
Wickham, H., 2009. Ggplot2 elegant graphics for data analysis.
Dordrecht; New York: Springer.
Cheat sheet data visualisation with ggplot2 (part of RStudio)

library(ggplot2)

4.4.1 Simple plots


We already used the ggplot2 library a few times: the qplot function is part of the
ggplot2 library. Originally, the library was programmed to implement the "Grammar of
Graphics". Like all other grammars, the result was quite complex and difficult to understand.
This is why qplot was added – it facilitates the transition from the traditional graphic
subsystem to ggplot2. However, qplot offers fewer options for many functions. If you want
to change details, you normally have to move to the original ggplot version.
Because there are a lot of good introductions to ggplot2, we limit the explanations in this book
to one example. If you want to know more details we recommend the book by the package
author, Wickham (2009); it is available for download.
The following command plots an annual time series from our lake data set with a non-linear
regression line.
chem_all$Area_fac = as.factor(as.integer(chem_all$MeanDepth/2.5)*2.5)
qplot(Year,NO3.N,data=chem_all,geom=c("smooth", "point"),
facets= Area_fac~ .)



The following lines translate the command above to true ggplot. Please note that all of the
following code lines belong together and create a single figure. The "+" sign adds another
graphic component to a figure.
All figures start with a definition of the data base and the "aesthetics", the definition of the
axes.
ggplot(chem_all, aes(x=Year, y=NO3.N)) +
# defines how the data set is displayed: point
geom_point(color = "red", size = 3) +
# adds a statistic, in this case a linear regression line
stat_smooth(method="lm") +
# create sub-plots for area
facet_wrap(~Area_fac,scales="free") +
# changes the background to white
theme_bw()

14: find out how to change font size and character orientation of an x axis
in ggplot2.

4.4.2 Histograms and barplots with ggplot2


Unfortunately the simplest plots are the most difficult to produce with ggplot2:
histograms and barplots. This is why we give you a short overview in Fig. 27.
The easiest application is a histogram: we get the number of counts for temperature classes
(A in Fig. 27).

qplot(data=Climate,x=AirTemp_Mean,geom="histogram")

In plain ggplot2 syntax the same figure with classes 2°C wide is produced by

ggplot()+
geom_histogram(data=Climate,aes(x=AirTemp_Mean),binwidth=2)

For the following plots we use monthly temperature values from the climate data base.
mgroup=group_by(Climate,Month)
msum=summarize(mgroup, mean=mean(AirTemp_Mean),
max=max(AirTemp_Mean), min=min(AirTemp_Mean))

If you want a display of the values (not the counts) you have to use a barplot (B in Fig. 27).
ggplot()+
geom_bar(data=msum,aes(x=Month, y=mean),stat="identity")

For more than one bar you always have to use narrow data sets
n_msum=reshape2::melt(msum,id.var=("Month"))



Unfortunately the standard for a barplot with more than one variable is a stacked version (C
in Fig. 27)
ggplot()+
geom_bar(data=n_msum,aes(x=Month,
y=value,fill=variable),stat="identity")

However, with the right keyword you get the "standard" version as we know it from spreadsheets (D in
Fig. 27)
ggplot()+
geom_bar(data=n_msum,aes(x=Month, y=value,fill=variable),
stat="identity",position="dodge")

As usual, the facet keyword makes it easy to produce several figures at once.
ggplot()+
geom_bar(data=n_msum,aes(x=Month, y=value,fill=variable),
stat="identity")+
facet_grid(variable~.)

Figure 27: Different versions of histograms and barplots with ggplot2

4.5 Combined figures


For more complex and combined figures there are basically two choices in R: an easy to
understand matrix approach where all subfigures have the same size, and a complex
approach where you can place your figures freely on a grid.



4.5.1 Figures in a regular matrix with mfrow()
For a figure with 4 elements (2 rows, 2 columns) we write
par(mfrow=c(2,2))
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
plot(AirTemp_Max ~ Date, type="l", col="red", main="Fig 2")
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
plot(Hum_Rel ~ Date, type="l", col="red", main="Fig 4")

4.5.2 Nested figures with split.screen()


A similar effect is produced with
split.screen(c (2, 2) )
screen(3)
plot(Prec ~ Date, type="l", col="red", main="Fig 3")
screen(1)
plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
screen(4)
plot(Hum_Rel ~ Date, type="l", col="red", main="Fig 4")

Here, screens can be addressed separately by their numbers. It is also possible to nest
screens. Screen 2 is split into one row and two columns, which get the screen numbers 5 and 6.
split.screen( figs = c( 1, 2 ), screen = 2 )
screen(5)
plot(Prec ~ Date, type="l", col="red", main="Fig 5 inside 2")
screen(6)
plot(Sunshine ~ Date, type="l", col="red", main="Fig 6 inside 2")
close.screen(all=TRUE)

The result should look like Fig. 28.



Figure 28: Combining figures with the split.screen() command

4.5.3 Free definition of figure position


The most complicated, but also the most flexible definition of combined figures is the
layout function.

The basic idea is shown in Fig. 29. The command


layout(matrix(c(1,1,1,2,2,0,3,0,0), 3, 3, byrow = TRUE))

defines a 3x3 matrix with 9 elements. The matrix command assigns each cell of this matrix
to a figure. Thus, the first three elements (the first line) of the matrix are assigned to
(sub-)figure 1. The second line contains subfigure 2 in two elements, the last element is left
free (0). In line 3, only the first cell is assigned to figure 3. The system is shown in Fig. 29.
You can control the layout with
layout.show(3)

The figure is filled with


plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")
plot(AirTemp_Max ~ Date, type="l", col="red", main="Fig 2")
plot(Prec ~ Date, type="l", col="red", main="Fig 3")

The results are shown in Fig. 30. For this course we kept the structure of the matrix quite
simple. You can use as many elements as you want and arrange the figures in any order you
like.



Figure 29: Layout frame


Figure 30: Result of the layout commands



4.5.4 Presenting simulation results
As a kind of final example we show you how to create a very common figure in hydrological
modeling, as shown in Fig. 31: we compare the simulated and observed values and add
precipitation on top of the discharge.
Xiangxi <-
read.table("Xiangxi.txt", header=TRUE, sep=",", na.strings="NA",
dec=".", strip.white=TRUE)
attach(Xiangxi)
DateCal = as.POSIXct(DateCal)

Figure 31: Common display of hydrological simulation results


par(mar=c(5,5,2,5))
plot(DateCal, QsimCal, ylab="Streamflow [m^3/s]", xlab="Date", type =
"l", col="green", ylim=(c(0,1200)))
lines(DateCal, QobsCal, col="black")
par(new=T)
plot(DateCal, Precip, xlab="", ylab="", col="red", type="n", axes=F,
ylim=rev(c(0,120)))
lines(DateCal, Precip, col="red",lty=3)
axis(4)
mtext("Rain (mm)", side=4, line=3 )

15: create a combined figure with

• 1st row: time series of max, min and mean temperature, including a legend; use
code from task 10



• 2nd row: time series of the difference of max-min
• 3rd row: two scatterplots with 1) min vs. max temperature and 2) the mean
temperature vs. (max+min)/2 (use the plot command, not scatterplot)

4.5.5 Combined figures with the lattice package


R has three different graphic systems: the base system, lattice and ggplot2. Up to now
we worked mainly with the base system which is very flexible. Another frequently used
system is the lattice library.
Deepayan Sarkar, 2008: Lattice: multivariate data visualization with R. Springer
Use R! Series

Lattice is very well suited for the display of data sets with many (factor) variables, but the
syntax is different from normal figures and the display is not very flexible.
library(lattice)

First, let us start with some descriptive figures.


densityplot( ~ Climate$AirTemp_Max | Climate$Month_fac , data=Climate)
histogram( ~ AirTemp_Max | Month_fac , data=clim2000)
histogram( ~ AirTemp_Max+AirTemp_Min | Month_fac , data=clim2000)

Please note how the numeric variables (temperatures) and the factor variables are ordered.
All examples above print monthly plots of a temperature.
Scatterplots are very similar, only the definition of the variables is different:
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min | Month_fac ,
data=clim2000)

A very useful keyword is the grouping inside a figure.


xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
data=clim2000)

Here you can clearly see the difference between summer and winter values.
Another useful feature is the automatic addition of a legend.
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,
auto.key=T, data=clim2000)

xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Summer,


auto.key=(list(title="Summer?")), data=clim2000)

A combination of all these simple features makes it easy to get an overview of the dataset. In our
example it is quite apparent that something went wrong in the year 2007 (Fig. 32).
xyplot(AirTemp_Mean ~ AirTemp_Max+AirTemp_Min, groups=Year_fac,
auto.key=list(title="Year",columns=7), data=clim2000)



Figure 32: Scatterplot of air temperatures with annual grouping

4.5.6 Multiple plots with ggplot2, gridExtra version


Sometimes different figures have to be arranged in a regular layout; for ggplot2 there is a
simple solution and a more complex but more flexible one.
library(ggplot2)
library(grid)
library(gridExtra)
p1 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=AirTemp_Mean,geom="boxplot")
p2 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=Prec,geom="boxplot")
p3 <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,
y=Sunshine,geom="boxplot")
p4 = qplot(data=clim2000,x=Date,y=AirTemp_Mean,geom="line")
grid.arrange(p1, p2, p3, p4, ncol = 2, top = "Main title")
dev.off()



The multiplot function delivers nearly the same result.

4.5.7 Multiple plots with ggplot2, viewport version


library(grid)

We start with a new page


grid.newpage()

Grid needs so-called viewports – you can use any area; here we define the lower left part of
the page.
### define first plotting region (viewport)
vp1 <- viewport(x = 0, y = 0, height = 0.5, width = 0.5,
just = c("left", "bottom"), name = "lower left")

From now on, everything is drawn in the lower left part


pushViewport(vp1)
### show the plotting region (viewport extent)
### plot a plot - needs to be printed (and newpage set to FALSE)!!!

Now we define the figure. The qplot command is a simplification of the ggplot2 package;
it makes the transition from the old packages easier. Inside a viewport it requires an extra print
command to appear on the page. Now we print monthly boxplots in a separate figure for each year.

bw.lattice <- qplot(data=clim2000,x=Month_fac,facets= . ~ Year_fac,


y=AirTemp_Mean,geom="boxplot")
print(bw.lattice, newpage= FALSE)

Now we move up one step in the hierarchy; all plot commands would now be printed on the
full page.
upViewport(1)
### define second plot area
vp2 <- viewport(x = 1, y = 0, height = 0.5, width = 0.5,
just = c("right", "bottom"), name = "lower right")
### enter vp2
pushViewport(vp2)
### show the plotting region (viewport extent)
### plot another plot
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac,y=Prec,geom="boxplot")
print(bw.lattice, newpage= FALSE)
### leave vp2
upViewport(1)
vp3 <- viewport(x = 0, y = 1, height = 0.5, width = 0.5,
just = c("left", "top"), name = "upper left")
pushViewport(vp3)
bw.lattice <- qplot(data=clim2000,x=Month_fac,
facets= . ~ Year_fac, y=Sunshine,geom="boxplot")



print(bw.lattice, newpage= FALSE)
### show the plotting region (viewport extent)
upViewport(1)
vp4 <- viewport(x = 1, y = 1, height = 0.5, width = 0.5,
just = c("right", "top"), name = "upper right")
pushViewport(vp4)
bw.lattice=qplot(data=clim2000,x=Date,y=AirTemp_Mean,geom="line")
print(bw.lattice, newpage= FALSE)
upViewport(1)

4.6 Brushing up your plots made with the standard system

4.6.1 Setting margins


The setting of the margins is a nightmare for the beginner because there are several
possibilities, none of which is easy to grasp.
To recall the default margin setting you type
par()$mar

for inner margins and


par()$oma

for outer margins. New margins are set by


par(mar=c(4, 4, 4, 4))

The numbers are the margins at the bottom, left, top and right.


The unit of these numbers is "lines"; there is also a parameter for margins in inches (but no
metric units).
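A sketch of the inch-based version (the values are arbitrary):

par(mai=c(1.0, 0.8, 0.8, 0.4))   # bottom, left, top, right in inches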

4.6.2 Text for title and axes


Adding text (labels) for axes and main title is quite straightforward
plot(AirTemp_Max, AirTemp_Min, ylab="Minimum Temperature
[°C]",xlab="Maximum Temperature [°C]", main="Temperature")

If you want additional explanation in a figure you can add text in the margins outside the
plot area with
mtext("Line 1", side=2, line=1, adj=1.0, cex=1, col="green")

or inside the plot region with


text(5,5, "Plot", col="red", cex=2)

The x-y coordinates of the text command use the same units as the data set. As usual, you can
use any variable containing text, e.g. for automatic annotations.



4.6.3 Colors
Colors in all pictures can be referred to by number or text.
plot(AirTemp_Max, AirTemp_Min, col=2)

is the same as
plot(AirTemp_Max, AirTemp_Min, col="red")

A list of all colors is printed with


colors()

If you want only shades of red


colors()[grep("red",colors())]

http://research.stowers-institute.org/efg/R/Color/Chart/
In depth information about colors in R and science

4.6.4 Legend
In the basic graphic system, the legends are not added automatically, you have to define
them separately like
plot(AirTemp_Max, AirTemp_Min)

lines(AirTemp_Min,AirTemp_Mean,col="green", type="p")

legend(20,-10, c("Max/Min", "Min/Mean"), col = c("black","green"),
       lty = c(0,0), lwd=c(1,2), pch=c("o","o"), bty="n", merge = TRUE,
       bg = 'white')

locator(1) # get the coordinates of a position in the figure

Again, the xy dimensions are the same as the data set. If you want to set the location with
the mouse you can use the following command
legend(locator(1), c("Max/Min", "Min/Mean"), col = c("black","green"),
lty = c(0,0), lwd=c(1,2), pch=c("o","o"), bty="n",merge = TRUE, bg =
'white' )

4.6.5 More than two axes


Each plot command sets the scales for the whole figure; the next plot command would
create a new figure. To avoid this, you have to create a new reference system in the same
figure.
First, we need more space on the right side of the plot and set margins for the second Y-
axis.
par(mar=c(5,5,5,5))
plot(AirTemp_Mean ~ Date, type="l", col="red", yaxt="n", ylab="")

As the y-axis is not drawn (yaxt="n"), we do it manually



axis(2, pretty(c(min(AirTemp_Mean),max(AirTemp_Mean))), col="red")

and finally add a title for the left axis


mtext("Mean Temp", side=2, line=3, col="red")

Now comes the second data set. To avoid a new figure we need to set
par(new=T)

The next lines are quite similar, except that we draw the y-axis on the right side (4).
plot(Prec ~ Date, type="l", col="green", yaxt='n', ylab="")
axis(4, pretty(c(0,max(Prec))), col="green")
mtext("Precipitation", side=4, line=3, col="green")

4.7 Saving Figures


The easiest way to save figures produced with R is to copy them via the clipboard (copy
and paste) directly into your text or presentation, or to save them with File → Save as to
an image file. However, if you have more than one image or if you have to produce the same
image over and over, it is better to save the figures automatically to a file. You can save
figures in different formats; below we show the commands to open a file in PDF, PNG and
JPG format respectively.
pdf(file = "FDC.pdf", width=5, height=4, pointsize=1);
png("acf_catments.png", width=900) # dim in pixels
jpeg("test.jpg",width=600,height=300)

plot(AirTemp_Mean ~ Date, type="l", col="red", main="Fig 1")

All graphic devices are closed with the command


dev.off()

The recommended procedure is to develop and test a figure on screen and wrap it in a file
as soon as the results are as expected.
For the ggplot library you have to use
fig1 <- qplot(data=clim2000,x=Month_fac,y=AirTemp_Mean,geom="boxplot")
ggsave("fig1.png",width=3, height=3) # dim in inches by default (use units="cm" for centimetres)

4.8 Scatterplot matrix plots


Scatterplot matrices belong to the bivariate methods and are discussed in more detail in the
chapter on bivariate statistics. However, they are frequently used in EDA for a short and
detailed overview of data sets with correlations. A good example is our data set of lake
chemistry: it is very probable that some substances are correlated. This method is also a
good way to identify outliers and extreme values in data sets.
As always in R, there are three ways to heaven. Because they all have different unique
features, we will introduce all three here. You also need the helper functions printed below the
corresponding command lines.
library(car)



library(lattice)
library(GGally)

We start with the grandfather of all scatterplot matrices


splom(chemie)
splom(chemie,groups=as.factor(chemie$Year))

The GGally library, which extends ggplot2, also contains a scatterplot matrix function


ggpairs(chemie, columns = 5:8)
t=chem_all[,c(6:9,35)]
ggpairs(t,columns=1:3)

You can also integrate density plots in ggpairs


ggpairs(t,columns=1:3,
upper = list(continuous = "density"),
lower = list(combo = "facetdensity"))

One of the most useful scatterplot versions is pairs, which also prints out the correlation
and the significance level. Unfortunately, pairs does not work with missing values in the
data set, so incomplete cases have to be removed first. This cleaning process often removes half of the data set.
t2=t[complete.cases(t),]
pairs(t2[1:3], lower.panel=panel.smooth, upper.panel=panel.plot)
t2=t[complete.cases(t),]
pairs(t2, lower.panel=panel.smooth, upper.panel=panel.cor)

The following code defines the panel functions needed by pairs. You have to execute
them prior to the use of pairs.
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(.8, .8, Signif, cex=cex, col=2)
}

# based mostly on http://gallery.r-enthusiasts.com/RGraphGallery.php?graph=137
panel.plot <- function(x, y) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))



ct <- cor.test(x,y)
sig <- symnum(ct$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
r <- ct$estimate
rt <- format(r, digits=2)[1]
cex <- 0.5/strwidth(rt)

text(.5, .5, rt, cex=cex * abs(r))


text(.8, .8, sig, cex=cex, col='blue')
}
panel.smooth <- function (x, y) {
points(x, y)
abline(lm(y~x), col="red")
lines(stats::lowess(y~x), col="blue")
}

4.9 3d Images
Plotting 3d images is no problem if you already have a grid with regular spacing. The
procedure here also works with irregularly spaced data, but keep in mind that the spatial
interpolation may cause two kinds of problems:
• valleys and/or mountains in the image which are not found in the data.
• Information at a smaller scale than the grid size may completely disappear in
the image
For the spatial interpolation we use the package akima
install.packages("akima")
library(akima)

The data set is not a real spatial data set but a time series of soil water content at different
depths. The third dimension here is time.
g <- read.csv("soil_water.csv", header=TRUE)
attach(g)

Define range of values


x0 <- -180:0
y0 <- 0:367
ak <- interp(g$Depth, g$Day, g$SWC, xo=x0, yo=y0)

The ranges can also be derived automatically from the data:


x0 <- min(Depth):max(Depth)
y0 <- min(Day):max(Day)

The variable ak now contains a regular grid. Now we can plot all kinds of impressive
three-dimensional figures. We start with a boring contour plot:
contour(ak$x, ak$y, ak$z)



A more colourful version codes the values of the z-column with all colours of the rainbow:
image(ak$x, ak$y, ak$z, col=rainbow(50))

A similar picture comes out from


filled.contour(ak$x, ak$y, ak$z, col=rainbow(50))

If you don't like the colours of the rainbow you can also use:
'heat.colors','topo.colors','terrain.colors', 'rainbow', 'hsv', 'par'.

A 3D view is created by:


persp(ak$x, ak$y, ak$z, expand=0.25, theta=60, phi=30, xlab="Depth",
ylab="Day", zlab="SWC", ticktype="detailed", col="lightblue")

The ggplot2 version of these figures is:


library(RColorBrewer)   # needed for brewer.pal below
ggplot(g,aes(x=Day,y=Depth,z=SWC)) +
geom_contour()
ggplot(g,aes(x=Day,y=as.factor(Depth),col=SWC)) +
geom_raster(aes(fill=SWC),interpolate=TRUE)+
stat_contour(bins=6,aes(Day,as.factor(Depth),z=SWC), color="black",
size=0.6)+
scale_fill_gradientn(colours=brewer.pal(6,"YlOrRd"))

To plot the same data with ggplot you have to convert it to narrow format
ak2=as.data.frame(ak)
ak3=melt(ak2,id.vars=c("x","y"))
ak3$z=as.numeric(ak3$variable)
ggplot(ak3,aes(x=z,y=y,fill=value)) +
geom_raster()+
stat_contour(bins=6,aes(x=z,y,z=value), color="black", size=0.6)+
scale_fill_gradientn(colours=brewer.pal(6,"YlOrRd"))

4.10 Exercise
In the file Chemie_kielstau.xlsx you find a data set of 10 years of daily measurements of
water quality in the Kielstau catchment. Draw a figure with an overview of the nitrate
concentrations as described in Fig. 33; the final result is shown in Fig. 34.
Use the libraries ggplot2, reshape2, dplyr and
gridExtra. The data set is in narrow format and the
procedures to create the figures are described in
sections 4.4 and 4.5.6.



The four panels of the overview figure:
• top left: monthly average (each month, time series)
• top right: monthly average (month, all data, seasonal variation)
• bottom left: frequency distribution of concentrations for summer and winter
• bottom right: violin plot of summer and winter (use color)

Figure 33: Structure of a figure of NO3-N overview

Figure 34: Final version of the NO3 overview



5 Bivariate Statistics
This chapter is based on the following literature sources:
Kabacoff, R.I., 2011. R in Action - Data Analysis and Graphics with R. Manning Publications
Co., Shelter Island, NY.
http://www.manning.com/kabacoff/
Logan, M., 2010. Biostatistical Design and Analysis Using R: A Practical Guide. Wiley-
Blackwell, Chichester, West Sussex.
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1405190086.html
Trauth, M.H., 2006. MATLAB Recipes for Earth Sciences. Springer, Berlin Heidelberg New
York.
http://www.springer.com/earth+sciences+and+geography/book/978-3-642-12761-8?changeHeader

Bivariate analysis aims to understand the relationship between two variables x and y.
Examples are
• length and width of a fossil
• sodium and potassium content of a soil
• organic matter content along a sediment core
• run-off and precipitation in a catchment

When the two variables are measured on the same object, x is usually identified as the
independent variable, whereas y is the dependent variable. If both variables were
generated in an experiment, the variable manipulated by the experimenter is described as
the independent variable. In some cases, both variables are not manipulated and therefore
independent. The methods of bivariate statistics help describe the strength of the
relationship between the two variables, either by a single parameter such as Pearson’s
correlation coefficient for linear relationships or by an equation obtained by regression
analysis (Fig. 35). The equation describing the relationship between x and y can be used to
predict the y-response from arbitrary x’s within the range of original data values used for
regression. This is of particular importance if one of the two parameters is difficult to
measure. Here, the relationship between the two variables is first determined by regression
analysis on a small training set of data. Then, the regression equation is used to calculate
this parameter from the first variable.

Correlation or Regression?!

Correlation: Neither variable has been set (they are both measured) AND there is no implied
causality between the variables.

Regression: Either one of the variables has been specifically set (not measured) OR there is an
implied causality between the variables whereby one variable could influence the other but the
reverse is unlikely.

Figure 35: Display of a bivariate data set.


The thirty data points represent the age of a sediment (in kiloyears before present) at a certain depth (in
meters) below the sediment-water interface. The joint distribution of the two variables suggests a linear
relationship between age and depth, i.e., the increase of the sediment age with depth is constant. Pearson's
correlation coefficient (explained in the text) of r = 0.96 supports the strong linear dependency of the
two variables. Linear regression yields the equation age = 6.6 + 5.1 · depth. This equation indicates an
increase of the sediment age of 5.1 kyrs per meter of sediment depth (the slope of the regression line). The
inverse of the slope is the sedimentation rate of ca. 0.2 meters/kyr. Furthermore, the equation defines the
age of the sediment surface of 6.6 kyrs (the intercept of the regression line with the y-axis). The deviation
of the surface age from zero can be attributed either to the statistical uncertainty of the regression or to a
natural process such as erosion or bioturbation.

5.1 Pearson’s Correlation Coefficient


Correlation coefficients are often used at the exploration stage of bivariate statistics. They
are only a very rough estimate of a (recti-)linear trend in the bivariate data set.
Unfortunately, the literature is full of examples where the importance of correlation
coefficients is overestimated and outliers in the data set lead to an extremely biased
estimator of the population correlation coefficient. The most popular correlation coefficient
is Pearson's linear product-moment correlation coefficient r (Fig. 36). We estimate the
population's correlation coefficient ρ from the sample data, i.e., we compute the sample
correlation coefficient r, which is defined as

r = Σ (xi − x̄)(yi − ȳ) / ((n − 1) · sx · sy)

where n is the number of xy pairs of data points, x̄ and ȳ are the sample means, and sx and sy are the univariate standard
deviations. The numerator of Pearson’s correlation coefficient is known as the corrected
sum of products of the bivariate data set. Dividing the numerator by (n–1) yields the
covariance which is the summed products of deviations of the data from the sample means,
divided by (n–1). The covariance is a widely-used measure in bivariate statistics, although it
has the disadvantage of depending on the dimension of the data.

Dividing the covariance by the univariate standard deviations removes this effect and leads
to Pearson’s correlation coefficient r.

Pearson’s correlation coefficient is very sensitive to various disturbances in the bivariate


data set. The following example illustrates the use of the correlation coefficients and
highlights the potential pitfalls when using this measure of linear trends. It also describes
the resampling methods that can be used to explore the confidence of the estimate for r.

The dataset: agedepth.txt


The synthetic data consist of two variables, the age of a sediment in kiloyears before
present and the depth below the sediment-water interface in meters. The use of synthetic
data sets has the advantage that we fully understand the linear model behind the data.
The data are represented as two columns contained in file agedepth.txt. These data have
been generated using a series of thirty random levels (in meters) below the sediment
surface. The linear relationship age = 5.6 · depth + 1.2 was used to compute noise-free values
for the variable age. This is the equation of a straight line with a slope of 5.6 and an
intercept with the y-axis of 1.2. Finally, some Gaussian noise of amplitude 10 was added to
the age data.

We load the data from the file agedepth.txt using the import function of Rstudio
(separator “white space”, decimal “.”) and plot the dataset (x=depth, y=age)

16: plot age (x-axis) against depth (y-axis) using either the basic plot command or qplot from the
package ggplot2.

17: assess linearity and bivariate normality using a scatterplot (package: car) with marginal boxplots

plot(x, y) - opens a new figure in the basic graphic system

library() - opens/loads a specific package of R



help(name) - opens the help in R with the information on a function or package

qplot(data=dataset, x, y) - opens a new figure in ggplot2



Figure 36: Pearson’s correlation coefficient r for various sample data.

(a–b) Positive and negative linear correlation, (c) random scatter without a linear correlation, (d) an
outlier causing a misleading value of r, (e) curvilinear relationship causing a high r since the curve is close
to a straight line, (f) curvilinear relationship clearly not described by r.
Observations on exercises 16 and 17:

We observe a strong linear trend suggesting some dependency between the variables,
depth and age. This trend can be described by Pearson’s correlation coefficient r, where r =1
represents a perfect positive correlation, i.e., age increases with depth, r = 0 suggests no
correlation, and r =–1 indicates a perfect negative correlation.
We use the function cor to compute Pearson’s correlation coefficient:

cor(agedepth, method="pearson")

From this output our suspicion is confirmed: x and y have a high positive correlation. But,
as always in statistics, we can test whether this coefficient is significant. Using parametric
assumptions (Pearson; dividing the coefficient by its standard error gives a value that
follows a t-distribution):

cor.test(agedepth$age, agedepth$depth, method="pearson")

The cor.test command has the form cor.test(dataset$x, dataset$y, method).
If you attach a dataset, you can also use the form cor.test(x, y):
attach(agedepth)
cor.test(age, depth)
detach(agedepth)

The program's output looks like this:

Pearson's product-moment correlation

data: depth and age


t = 13.8535, df = 28, p-value = 4.685e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8650355 0.9684927
sample estimates:
cor
0.9341736

The value of r = 0.9342 suggests that the two variables age and depth depend on each other.

However, Pearson's correlation coefficient is highly sensitive to outliers. This can be
illustrated by the following example. Let us generate a normally-distributed cluster of
thirty (x,y) data with zero mean and standard deviation one.

x=rnorm(30, mean=0, sd=1)


y=rnorm(30, mean=0, sd=1)
plot(x,y)

As expected, the correlation coefficient of these random data is very low.

cor.test(x,y)

Pearson's product-moment correlation



data: y and x
t = -0.5961, df = 28, p-value = 0.5559
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4539087 0.2587592
sample estimates:
cor
-0.111946

Now we introduce one single outlier to the data set, an exceptionally high (x,y) value, which
is located precisely on the 1:1 line. The correlation coefficient for the bivariate data
set including the outlier (x,y) = (5,5) is much higher than before.

x[31]=5
y[31]=5
plot(x,y)
cor(x,y)
abline(lm(y ~ x), col="red")

Result: cor = 0.3777136

After increasing the absolute (x,y) values of this outlier, the correlation coefficient increases
dramatically.
x[31]=10
y[31]=10
plot(x,y)
cor.test(x,y)
abline(lm(y ~ x), col="red")

Result: cor = 0.7266505

Still, the bivariate data set does not provide much evidence for a strong dependence.
However, the combination of the random bivariate (x,y) data with one single outlier results
in a dramatic increase of the correlation coefficient. Whereas outliers are easy to identify in
a bivariate scatter, erroneous values might be overlooked in large multivariate data sets.

abline(lm(y ~ x)) uses the basic graphics function abline() ("A-B line") to add a regression trendline
based on the linear model (lm) of x and y.

The dataset: crab.csv


Sokal and Rohlf (1997) present an unpublished data set (L. Miller) in which the correlation
between gill weight and body weight of the crab (Pachygrapsus crassipes) is investigated.

Exercise 18:
a) import the crab data set (crab.csv, separator ",")
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
c) Calculate Pearson's correlation coefficient and test the null hypothesis H0 that the population
correlation coefficient equals zero



Figure 37: Pachygrapsus crassipes – striped shore crab
(image source: wikimedia commons)

5.2 Spearman's Rank Coefficient

As for Pearson's correlation, you can use the Spearman correlation to explore the
relationship between two variables. Accordingly, the coefficient has a value of -1 for perfect
negative correlation, a value of +1 for perfect positive correlation, and indicates no
correlation at all for values close to 0.
Spearman's correlation coefficient rsp is also called Spearman's rank coefficient, because
there is a tiny but important difference to the classical Pearson's coefficient r: the
correlation is not calculated using the data points themselves but using the ranks of the
data points.

The dataset: age vs. performance in a 100m run


We want to explore the relation between the age of a person and his or her performance in a
100 m sprint. Therefore we take the age in years and the sprint time in seconds of six sample
persons (see dataset runners.txt, taken from:
http://www.crashkurs-statistik.de/spearman-korrelation-rangkorrelation/ )

First we calculate Pearson's correlation coefficient:

cor(runners$age, runners$time, method="pearson")


Result: cor = 0.7301

To calculate Spearman's rank coefficient for the same dataset, we (or actually R in the
background) have to calculate the ranks first and leave the actual values for „age“ and
„time“ aside for a moment. In other words: we work with the „placement on the podium“
instead of the actual running times and also with the „rank“ of the age.



Table illustrating age, time and the respective „ranks“

Now we calculate Spearman's rank correlation coefficient:


cor(runners$age, runners$time, method="spearman")
Result: cor = 0.8285

Why is that?
As the Spearman coefficient uses the ranks, the actual distances (in the example: finishing
times) between rank 1, rank 2 etc. do not matter. Hence, it also doesn't matter if the two
variables have no linear relationship! Spearman's coefficient rsp is always 1 if the lowest
x-value is associated with the lowest y-value, the second lowest x-value with the second
lowest y-value, and so on.
In mathematical terms one can say that Spearman's coefficient measures the monotone
relationship between two variables, while Pearson's coefficient measures the linear
relationship.
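
This can be checked directly in R (a small sketch of our own, assuming the runners data set is already loaded): Pearson's r computed on the ranks equals Spearman's rsp.

cor(rank(runners$age), rank(runners$time))          # Pearson's r on the ranks
cor(runners$age, runners$time, method="spearman")   # same value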

Exercise 19:
a) create an artificial squared age-depth relation in the agedepth dataset (e.g. a new column age2 = age²)
b) assess the new age2-depth relation using qplot and scatterplot
c) Calculate Pearson's and Spearman's correlation coefficients

Creating a nice scatterplot with ggplot2:


ggplot(data=agedepth, aes(x=-age2, y=-depth)) +

geom_point(color = "red", size = 3) +

stat_smooth(method="lm")

5.3 Correlograms – correlation matrices

Correlation matrices are a fundamental aspect of multivariate statistics. Which variables



under consideration are strongly related to each other and which aren’t? Are there clusters
of variables that relate in specific ways? As the number of variables grow, such questions
can be harder to answer. Correlograms are a relatively recent tool for visualizing the data
in correlation matrices.
The dataset: PMM.txt
It’s easier to explain a correlogram once you’ve seen one. Consider the correlations among
the variables in the PMM data set. Here you have 15 variables, namely different chemical
elements measured during an XRF (X-ray fluorescence) scan of a sediment core taken from
an Alpine peat bog (Plan da Mattun Moor, PMM) in 2010. The core had a length of 143 cm
and was scanned in 1-cm-resolution. Hence, 143 data values are available for each element.

Figure 38: picture of the PMM sediment core and the Si-content plotted as red curve along the core.

We load the data from the file PMM.txt (separator “white space”)

You can get the correlations using the following code:

options(digits=2)
cor(PMM)

     Al      Si      S       Cl      K       Ca      Ti      Mn      Fe      Zn      Br      Rb
Al   1.0000  0.9073 -0.1005 -0.0740  0.9167 -0.2974  0.9176  0.1688  0.4443  0.4805 -0.3899  0.7344
Si   0.9073  1.0000 -0.2246 -0.2298  0.7381 -0.5379  0.8063  0.1783  0.2953  0.2885 -0.5806  0.4900
S   -0.1005 -0.2246  1.0000  0.1782 -0.1026  0.5458 -0.1194  0.0003  0.5452 -0.0443  0.1718 -0.0904
Cl  -0.0740 -0.2298  0.1782  1.0000  0.0828  0.5761  0.0331  0.3657  0.1702  0.3451  0.4423  0.2326
K    0.9167  0.7381 -0.1026  0.0828  1.0000 -0.1509  0.9482  0.1429  0.4794  0.6092 -0.2269  0.8973
Ca  -0.2974 -0.5379  0.5458  0.5761 -0.1509  1.0000 -0.3021  0.0238  0.0442  0.0480  0.6203  0.0523
Ti   0.9176  0.8063 -0.1194  0.0331  0.9482 -0.3021  1.0000  0.1917  0.5605  0.5442 -0.3700  0.8425
Mn   0.1688  0.1783  0.0003  0.3657  0.1429  0.0238  0.1917  1.0000  0.2784  0.4044  0.0644  0.1225
Fe   0.4443  0.2953  0.5452  0.1702  0.4794  0.0442  0.5605  0.2784  1.0000  0.3898 -0.0458  0.4875
Zn   0.4805  0.2885 -0.0443  0.3451  0.6092  0.0480  0.5442  0.4044  0.3898  1.0000  0.1412  0.6466
Br  -0.3899 -0.5806  0.1718  0.4423 -0.2269  0.6203 -0.3700  0.0644 -0.0458  0.1412  1.0000 -0.0314
Rb   0.7344  0.4900 -0.0904  0.2326  0.8973  0.0523  0.8425  0.1225  0.4875  0.6466 -0.0314  1.0000

• Which variables are most related?


• Which variables are relatively independent?
• Are there any patterns?



It isn’t that easy to tell from the correlation matrix without significant time and effort (and
probably a set of colored pens to make notations). You can display that same correlation
matrix using the corrgram() function in the corrgram package.

You have to install the corrgram package first!

library(corrgram)
corrgram(PMM)

To interpret this graph (Fig. 39), start with the lower triangle of cells (the cells below the
principal diagonal). By default, a blue color and hashing that goes from lower left to upper
right represents a positive correlation between the two variables that meet at that cell.
Conversely, a red color and hashing that goes from the upper left to the lower right
represents a negative correlation. The darker and more saturated the color, the greater the
magnitude of the correlation. Weak correlations, near zero, will appear washed out.

The format of the corrgram() function is:

corrgram(x, order=, panel=, text.panel=, diag.panel=)

where x is a data frame with one observation per row. When order=TRUE, the variables are
reordered using a principal component analysis of the correlation matrix. Reordering can
help make patterns of bivariate relationships more obvious. The option panel specifies the
type of off-diagonal panels to use. Alternatively, you can use the options lower.panel and
upper.panel to choose different options below and above the main diagonal. The text.panel
and diag.panel options refer to the main diagonal.
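
A possible call combining these options could look like the following sketch; the panel functions panel.shade, panel.pie and panel.txt are supplied by the corrgram package, and the settings shown here are only one of many sensible choices, not the ones prescribed by this script:

library(corrgram)
corrgram(PMM, order=TRUE,              # reorder variables by a PCA of the correlation matrix
         lower.panel=panel.shade,      # shaded cells below the diagonal
         upper.panel=panel.pie,        # pie symbols above the diagonal
         text.panel=panel.txt,         # variable names on the diagonal
         main="Correlogram of the PMM data set")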



Figure 39: Correlogram of the correlations among the variables in the PMM data frame



Figure 40: Different correlogram layouts using the corrgram package with the variables in the PMM data frame

The dataset: STY1.txt


Equivalent to the PMM data set, the STY1 data set consists of geochemical data produced
by XRF scanning of a sediment core from Lake Stymphalia in Greece.
Reference:
Heymann, C., Nelle, O., Dörfler, W., Zagana, H., Nowaczyk, N., Xue, J., & Unkel, I. (2013). Late
Glacial to mid-Holocene palaeoclimate development of Southern Greece inferred from the
sediment sequence of Lake Stymphalia (NE-Peloponnese). Quaternary International, 302, 42–
60. doi:10.1016/j.quaint.2013.02.014

Exercise 20:
a) import the STY1 data set (STY1.txt, separator="Tab")
b) plot the following element combinations in a nested (multi-plot) figure of 4 plots,
add a title (main) and a regression line (abline) and a different color in each
respective plot:
Al-Si; Ca-Sr; Ca-Si; and Mn-Fe
explain what you see.
c) Calculate the Pearson's correlation coefficient and the Spearman's rank coefficients
for each element pair and evaluate the results
d) produce first an unsorted and then a sorted (order=TRUE) correlation matrix of the
entire STY1 data set, both times only displaying the lower panel as shades.



Multi-plots:
For creating a multi-plot of 2x2 sub-plots in the basic plotting system of R, you have multiple
possibilities. An easy one for the start is:
par(mfrow=c(2,2))
To switch this adjustment of the plotting parameters back to normal you need to type
dev.off()

For creating the same in ggplot2 you need to write every sub-plot into a separate variable,
e.g. p1, p2, p3, p4, ...
p1=ggplot(data=dataset, aes(x=, y=))+
geom_point()
and finally arrange it via:
library(grid)
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol=2, top="Title")

5.4 Classical Linear Regression


Linear regression provides another way of describing the dependence between the two
variables x and y. Whereas Pearson's correlation coefficient provides only a rough measure
of a linear trend, a linear model obtained by regression analysis allows us to predict arbitrary y
values for any given value of x within the data range. Statistical testing of the significance of
the linear model provides some insight into the quality of the prediction. Classical regression
assumes that y responds to x, and that the entire dispersion in the data set is in the y-values (Fig.
41). Then x is the independent, regressor or predictor variable. The values of x are defined
by the experimenter and are often regarded as free of errors. An example is the
location x of a sample in a sediment core. The dependent variable y contains errors, as its
magnitude cannot be determined accurately. Linear regression minimizes the ∆y deviations
between the xy data points and the values predicted by the best-fit line using a least-squares
criterion. The basic equation for a general linear model is

y = b0 + b1 · x

where b0 and b1 are the regression coefficients. The value of b0 is the intercept with the y-
axis and b1 is the slope of the line. The squared sum of the ∆y deviations to be minimized is

SSE = Σi ( yi − (b0 + b1 · xi) )²

Partial differentiation of this term with respect to b1 and equating the result to zero yields a
simple equation for the first regression coefficient b1:

b1 = Σi (xi − x̄)·(yi − ȳ) / Σi (xi − x̄)²

The regression line passes through the data centroid defined by the sample means. We can
therefore compute the other regression coefficient b0,

b0 = ȳ − b1 · x̄



using the univariate sample means and the slope b1 computed earlier.
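
A quick way to convince yourself of these formulas is to compute b1 and b0 by hand and compare them with the coefficients returned by lm(); a minimal sketch of our own, assuming the agedepth data set is loaded and age is regressed on depth:

b1 = cov(agedepth$depth, agedepth$age) / var(agedepth$depth)  # slope: corrected sum of products / sum of squares
b0 = mean(agedepth$age) - b1 * mean(agedepth$depth)           # intercept: the line passes through the centroid
c(b0, b1)
coef(lm(age ~ depth, data=agedepth))                          # should return the same two values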

Figure 41: Linear regression.


Whereas classical regression minimizes the ∆y deviations, reduced major axis regression minimizes the
triangular area 0.5·(∆x·∆y) between the points and the regression line, where ∆x and ∆y are the distances
between the predicted and the true x and y values. The intercept of the line with the y-axis is b0, whereas the
slope is b1. These two parameters define the equation of the regression line.

5.4.1 Analyzing the Residuals


When you compare how far the predicted values are from the actual or observed values,
you are performing an analysis of the residuals. The statistics of the residuals provide
valuable information on the quality of a model fitted to the data. For instance, a significant
trend in the residuals suggests that the model does not fully describe the data. In such a case,
a more complex model, such as a polynomial of a higher degree, should be fitted to the data.
Ideally, residuals are purely random, i.e., Gaussian distributed with zero mean. Therefore,
we can test the hypothesis that our residuals are Gaussian distributed by visual inspection
of the histogram and by employing a χ² test introduced later (chapter 6.4).

assessing the residual plot in R:


dataset.lm = lm (y ~ x, dataset)
plot (dataset.lm, which = 1)
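
To check the assumption that the residuals are Gaussian distributed with zero mean, as described above, the residuals can also be extracted and inspected directly; a minimal sketch using the generic object name dataset.lm from the call above:

res = resid(dataset.lm)                       # extract the residuals of the fitted model
mean(res)                                     # should be numerically close to zero
hist(res, main="Histogram of residuals")      # visual check for a Gaussian shape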



Figure 42: Schematic residual plots depicting characteristic patterns of residuals: (a) random scatter of points -
homogeneity of variance and linearity met, (b) "wedge-shaped" - homogeneity of variance not met, (c) linear
pattern remaining - erroneously calculated residuals or additional variable(s) required, and (d) curved pattern
remaining - linear function applied to a curvilinear relationship. Modified from Zar (1999).

The dataset: nelson.csv


As part of a Ph.D. study into the effect of starvation and humidity on water loss in the confused
flour beetle (Tribolium confusum), Nelson (1964) investigated the linear relationship
between humidity and water loss by measuring the amount of water loss (mg) by nine
batches of beetles kept at different relative humidities (ranging from 0 to 93%) for a period
of six days (in: Sokal, R. and Rohlf, F.J. (1997). Biometry, 3rd edition. W.H. Freeman, San
Francisco.)

Exercise 21:
a) import the Nelson data set (nelson.csv, separator=",")
b) assess linearity and bivariate normality using a scatterplot with marginal boxplots
comment: the ordinary least squares method is considered appropriate, as there is effectively no
uncertainty (error) in the predictor variable (x-values, relative humidity)
c) fit the simple linear regression model (y = b0 + b1·x) and examine the diagnostics:
nelson.lm = lm(WEIGHTLOSS~HUMIDITY, nelson)
plot(nelson.lm)



Box 1: Basic statistical principles
Table from Logan (2011), figure from Trauth (2006)



mean: The most popular indicator of central tendency is the arithmetic mean, which is
the sum of all data points divided by the number of observations

median: the median is often used as an alternative measure of central tendency. The
median is the x-value which is in the middle of the data, i.e., 50% of the
observations are larger than the median and 50% are smaller. The median of a
data set sorted in ascending order is defined as

median = x_((N+1)/2)                        if N is odd
median = ( x_(N/2) + x_(N/2+1) ) / 2        if N is even

Quantiles are a more general way of dividing the data sample into groups containing equal
numbers of observations. For example, quartiles divide the data into four groups,
quintiles divide the observations in five groups and percentiles define one
hundred groups.
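
In R these measures are available as ready-made functions; a minimal sketch with an arbitrary sample:

x = c(2, 4, 4, 5, 7, 9, 12)                 # arbitrary example data
mean(x)                                      # arithmetic mean
median(x)                                    # middle value of the sorted data
quantile(x, probs=c(0.25, 0.5, 0.75))        # quartiles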

degrees of freedom Φ : the number of values in a distribution that are free to be varied.



Figure 43: (a) Probability density function f(x) and (b) cumulative distribution function F(x) of a χ²
distribution with different values for Φ.

Null hypothesis:
A biological or research hypothesis is a concise statement about the predicted or theorized
nature of a population or populations and usually proposes that there is an effect of a
treatment (e.g. the means of two populations are different). Logically, however, theories
(and thus hypotheses) cannot be proved, only disproved (falsification), and thus a null
hypothesis (H0) is formulated to represent all possibilities except the hypothesized
prediction. For example, if the hypothesis is that there is a difference between (or
relationship among) populations, then the null hypothesis is that there is no difference or
relationship (effect). Evidence against the null hypothesis thereby provides evidence that
the hypothesis is likely to be true. The next step in hypothesis testing is to decide on an
appropriate statistic that describes the nature of the population estimates in the context of the
null hypothesis, taking into account the precision of the estimates. For example, if the
hypothesis is that the mean of one population is different from the mean of another
population, the null hypothesis is that the population means are equal. The null hypothesis
can therefore be represented mathematically as: H0: µ1 = µ2, or equivalently: H0: µ1 − µ2 = 0.



6 Univariate Statistics

Comparing variances → F-test Comparing means → t-test

Figure 44: Cartoons explaining F-test and t-test (image source: G. Meixner 1998, www.viag.org)

6.1 F-Test
(Chapter based on Trauth, 2006)
The F distribution was named after the statistician Sir Ronald Fisher (1890–1962). It is used
for hypothesis testing, namely for the F-test. The F distribution has a relatively complex
probability density function.

The F-test by Snedecor and Cochran (1989) compares the variances sa² and sb² of two
distributions, where sa² > sb². An example is the comparison of the natural heterogeneity
of two samples based on replicated measurements. The sample sizes na and nb should be
above 30. Then, the proper test statistic to compare the variances is

F = sa² / sb²

The two variances are not significantly different, i.e., we reject the alternative hypothesis, if
the measured F-value is lower than the critical F-value, which depends on the degrees of
freedom Φa = na–1 and Φb = nb–1, respectively, and the significance level α.

Function for F-test in R: var.test(y~x, data=dataset)

Dataset 1: ward.csv – the gastropods (Logan page 142, 6A)


Ward and Quinn (1988) investigated differences in the fecundity (fertility, as measured by
egg production) of a predatory intertidal gastropod (Lepsiella vinosa) in two different
intertidal zones (mussel zone and the higher littorinid zone).
Reference: Ward, S., Quinn, G.P., 1988. Preliminary investigations of the ecology of the
intertidal predatory gastropod Lepsiella vinosa (Lamarck) (Gastropoda Muricidae). Journal of
Molluscan Studies 54, 109-117, doi:10.1093/mollus/54.1.109.
Dataset 2: furness.csv – the sea birds (Logan page 142, 6B)



Furness and Bryant (1996) measured the metabolic rates of eight male and six female
breeding northern fulmars (a sea bird species) and were interested in testing the null
hypothesis (H0) that there was no difference in metabolic rate between the sexes.
Reference: Furness, R.W., Bryant, D.M., 1996. Effect of Wind on Field Metabolic Rates of
Breeding Northern Fulmars. Ecology 77, 1181-1188, doi:10.2307/2265587.

22: import/load the datasets ward.csv and furness.csv

a) assess the ward and the furness datasets in boxplots:
boxplot(EGGS~ZONE, ward)
boxplot(METRATE~SEX, furness)
b) perform an F-test for the Ward data set (EGGS by ZONE)
perform an F-test for the Furness data set (METRATE by SEX)
c) create the following artificial data set and perform an F-test:
x <- rnorm(50, mean = 0, sd = 1)
y <- rnorm(50, mean = 1, sd = 1)
var.test(x, y)
now vary the standard deviation (sd) and the number of data points and describe what you see.

6.2 Student's t Test


(Chapter text based on Trauth, 2006)
The Student’s t distribution was first introduced by William Gosset (1876–1937) who
needed a distribution for small samples (Fig. 45). W. Gosset was an Irish Guinness Brewery
employee and was not allowed to publish research results. For that reason he published his
t distribution under the pseudonym Student (Student, 1908). The probability density
function is

f(x) = Γ( (Φ+1)/2 ) / ( √(Φ·π) · Γ(Φ/2) ) · ( 1 + x²/Φ )^( −(Φ+1)/2 )

where Γ is the Gamma function.

The single parameter Φ of the t distribution is the degrees of freedom. In the analysis of
univariate data, this parameter is Φ = n–1, where n is the sample size. As Φ→∞, the t
distribution converges to the standard normal distribution. Since the t distribution
approaches the normal distribution for Φ >30, it is not often used for distribution fitting.
However, the t distribution is used for hypothesis testing, namely the t-test.



Figure 45: (a) probability density function f(x) and (b) cumulative distribution function F(x) of a Student's t
distribution with different values for Φ.

The Student's t-test by Gosset compares the means of two distributions.

Let us assume that two independent sets of na and nb measurements have been carried
out on the same object. For instance, several samples were taken from two different
outcrops. The t-test can be used to test the hypothesis that both samples come from the
same population, e.g., the same lithologic unit (null hypothesis), or from two different
populations (alternative hypothesis). Both the sample and the population distribution have to
be Gaussian. The variances of the two sets of measurements should be similar. Then, the
proper test statistic for the difference of the two means is

t = (x̄a − x̄b) / √( ( (na−1)·sa² + (nb−1)·sb² ) / (na+nb−2) · (1/na + 1/nb) )

where na and nb are the sample sizes, sa² and sb² are the variances of the two samples a and
b. The alternative hypothesis can be rejected if the measured t-value is lower than the
critical t-value, which depends on the degrees of freedom Φ = na+nb–2 and the significance
level α . If this is the case, we cannot reject the null hypothesis without another cause. The
significance level α of a test is the maximum probability of accidentally rejecting a true null
hypothesis. Note that we cannot prove the null hypothesis, in other words not guilty is not
the same as innocent.
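
The test statistic above can be reproduced by hand and compared with R's t.test (with var.equal=TRUE for the pooled version); a minimal sketch of our own with two artificial samples:

set.seed(1)
a = rnorm(20, mean=0, sd=1)
b = rnorm(25, mean=0.5, sd=1)
na = length(a); nb = length(b)
sp2 = ((na-1)*var(a) + (nb-1)*var(b)) / (na+nb-2)          # pooled variance
t.manual = (mean(a) - mean(b)) / sqrt(sp2 * (1/na + 1/nb)) # t statistic by hand
t.manual
t.test(a, b, var.equal=TRUE)$statistic                     # same value as t.manual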

Exercise 23: the ward dataset

We assess the assumptions of normality and homogeneity of variance for the null hypothesis that the
population mean egg production is the same for both littorinid and mussel zone Lepsiella:
boxplot(EGGS~ZONE, data=ward)
with(ward,
rbind(MEAN=tapply(EGGS,ZONE,mean), VAR=tapply(EGGS,ZONE,var)))



with(data, expr, …)
is a generic function that evaluates an expression in an environment constructed from data,
possibly modifying (a copy of) the original data.
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
applies a function to each cell of a ragged array, that is to each (non-empty) group of values given
by a unique combination of the levels of certain factors.

Conclusions 1
there is no evidence of non-normality (boxplots not grossly asymmetrical) or of
unequal variance (boxplots very similar in size and variances very similar).
Hence the simple Student's t-test is likely to be reliable. We test the null hypothesis
as formulated above:

t.test(EGGS~ZONE, ward, var.equal=T)

Conclusions 2
reject the null hypothesis (i.e. egg production is not the same). Egg production was
significantly greater in mussel zone than in littorinid zone.

Student, 1908. The Probable Error of a Mean. Biometrika 6, 1-25, stable URL:
http://www.jstor.org/stable/2331554.

6.3 Welch's t Test

The separate variances t-test (Welch's test) represents an improvement of the Student's t-test
in that it more appropriately accommodates samples with modestly unequal variances.

Exercise 24: the furness dataset

We assess the assumptions of normality and homogeneity of variance for the null hypothesis that the
population mean metabolic rate is the same for both male and female fulmars.
boxplot(METRATE~SEX, data=furness)
with(furness, rbind(MEAN=tapply(METRATE,SEX,mean),
VAR=tapply(METRATE, SEX,var)))

Conclusions 1
Whilst there is no evidence of non-normality (boxplots not grossly asymmetrical), the
variances are a little unequal (one of the boxplots is not more than three times smaller than
the other). Hence, a separate variances t-test (Welch's test) is more appropriate than a
pooled variances t-test (Student's test).

We perform a (Welch's) t-test to test the null hypothesis as described above:

t.test(METRATE~SEX, furness, var.equal=F)

Conclusions 2
do not reject the null hypothesis, i.e. metabolic rate of male fulmars was not found to differ



significantly from that of females.

6.4 χ²-Test – Goodness of fit test

As Trauth (2006) explains it:

The χ2-test introduced by Karl Pearson (1900) involves the comparison of distributions,
permitting a test that two distributions were derived from the same population. This
test is independent of the distribution that is being used. Therefore, it can be applied to test
the hypothesis that the observations were drawn from a specific theoretical distribution.
Let us assume that we have a data set that consists of 100 chemical measurements from a
sandstone unit. We could use the χ2-test to test the hypothesis that these measurements
can be described by a Gaussian distribution with a typical central value and a random
dispersion around. The n data are grouped in K classes, where n should be above 30. The
frequencies within the classes Ok should not be lower than four and never be zero. Then, the
proper statistic is

χ² = Σk (Ok − Ek)² / Ek
where Ek are the frequencies expected from the theoretical distribution. The alternative
hypothesis is that the two distributions are different. This can be rejected if the measured
χ2 is lower than the critical χ2, which depends on the degrees of freedom Φ=K–Z, where K is
the number of classes and Z is the number of parameters describing the theoretical
distribution plus the number of variables (for instance, Z=2+1 for the mean and the variance
for a Gaussian distribution of a data set of one variable, Z=1+1 for a Poisson distribution of
one variable)
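
The statistic can easily be computed by hand in R; a minimal sketch using the seed counts of the example below and a 9:3:3:1 null hypothesis (here df = K − 1, because no distribution parameters are estimated from the data):

O = c(152, 39, 53, 6)                        # observed frequencies (seed types, see below)
E = sum(O) * c(9, 3, 3, 1) / 16              # expected frequencies under a 9:3:3:1 ratio
chi2 = sum((O - E)^2 / E)                    # chi-square statistic
chi2
pchisq(chi2, df=length(O)-1, lower.tail=FALSE)   # p-value of the one-tailed test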

as Logan (2011) explains it:

By comparing any given sample chi-square statistic to its appropriate χ²-distribution, the
probability that the observed category frequencies could have been collected from a
population with a specific ratio of frequencies (for example 3:1) can be estimated. As is
the case for most hypothesis tests, probabilities lower than 5% (p<0.05) are considered
unlikely and suggest that the sample is unlikely to have come from a population
characterized by the null hypothesis. χ²-tests are typically one-tailed tests focusing on the
right-hand tail, as we are primarily interested in the probability of obtaining large χ²-
values. Nevertheless, it is also possible to focus on the left-hand tail so as to investigate
whether the observed values are "too good to be true".
The χ²-distribution takes into account the expected natural variability in a population as
well as the nature of sampling (in which multiple samples should yield slightly different
results). The more categories there are, the more likely it is that the observed and expected
values will differ. It could be argued that when there are a large number of categories,
samples in which all the observed frequencies are very close to the expected frequencies are
a little suspicious and may represent dishonesty on the part of the researcher.

Dataset 3: the plant seeds (Logan page 477, 16A)


Zar (1999) presented a data set that depicted the classification of 250 plants into one of four
categories on the basis of seed type (yellow smooth, yellow wrinkled, green smooth, and
green wrinkled). Zar used these data to test the null hypothesis that the samples came
from a population that had a 9:3:3:1 ratio of these seed types.

First, we create a data frame with the Zar (1999) seed data

COUNT = c(152,39,53,6)
TYPE = c("YellowSmooth", "YellowWrinkled", "GreenSmooth", "GreenWrinkled")
seeds = data.frame(TYPE,COUNT)

We should convert the data frame into a table. Whilst this step is not strictly necessary, it
ensures that columns in various tabular outputs have meaningful names:

seeds.xtab = xtabs(COUNT~TYPE, data=seeds)

We assess the assumption of sufficient sample size (<20% of expected values <5) for the
specific null hypothesis.

chisq.test(seeds.xtab, p=c(9/16,3/16,3/16,1/16), correct=F)$exp

Conclusion 1
all expected values are greater than 5, therefore the chi-squared statistic is likely to be a
reliable approximation of the χ² distribution.

Now, we test the null hypothesis that the samples could have come from a population with
a 9:3:3:1 seed type ratio.

chisq.test(seeds.xtab,p=c(9/16,3/16,3/16,1/16), correct=F)

Conclusion 2
reject the null hypothesis, because the probability is lower than 0.05. The samples are
unlikely to have come from a population with a 9:3:3:1 ratio.

25: a) import the example2 data set (example2.txt)

b) describe what information is given in the data set; check whether the dataset is "clean" and "ready
to use"



c) investigate the variability of the different elemental proxies depending on the different phases
by using boxplots
d) report the respective values for "mean" and "variance" for each element and each phase
e) choose an appropriate t-test and test all elements for the null hypothesis that the element
content is the same in the Glacial and in the Holocene sediments. Interpret the results
f) perform an F-test for Ca and Sr to test the null hypothesis that each has the same variability in
the Glacial and in the Holocene sediments respectively.
g) calculate the Rb-Sr ratio and add it as a new column to the example2 data set. Plot the Rb-Sr
ratio against depth. If high ratios indicate wet climate and low ratios dry climate, what can you
read out of the dataset?

6.5 ANOVA – Analysis of Variance

(based on Kabacoff (2015): R in Action, chapter 9)

Libraries/packages used in this chapter:


• ggplot2
• reshape2
• dplyr
• gplots

After focusing on the prediction of relations between variables (correlation and regression
in chapter 5), and after comparing the similarities of only two groups by F- and t-test, we
now want to shift to understanding differences between more than two groups or
between several variables of one group. This methodology is referred to as analysis of
variance (ANOVA). ANOVA methodology is used to analyze a wide variety of experimental
and quasi-experimental designs.
Experimental design in general, and analysis of variance in particular, has its own language.
Before discussing the analysis of these designs, we’ll quickly review some important terms.
In our course, we only focus on one-way ANOVA. For more complex study designs please
refer to the specific statistics books on which this course builds.

Let's try to explain a simple ANOVA with an example of a medical study. Say you’re
interested in studying the treatment of a disease. Two popular therapies for this disease
exist: the CBT (therapy 1) and EMDR (therapy 2). You recruit 10 anxious individuals (s1-s10)
and randomly assign half of them to receive five weeks of CBT (s1-s5) and half to receive
five weeks of EMDR (s6-s10). At the conclusion of therapy, each patient is asked to complete
a self-report as a measure of health improvement (with scores from 1=fully recovered to
10=no change). The design is outlined in figure 46(a). In this design, Treatment is a
between-groups factor with two levels (CBT, EMDR). It’s called a between-groups factor
because patients are assigned to one and only one group. No patient receives both
treatments. The grades (of s1-s10) of the self-report are the dependent variable, and
Treatment is the independent variable. The statistical design in figure 46(a) is called a
one-way ANOVA because there’s a single classification variable. Specifically, it’s a one-way
between-groups ANOVA. Effects in ANOVA designs are primarily evaluated through F-tests.



If the F-test for Treatment is significant, you can conclude that the mean self-report scores
for the two therapies differed after the five weeks of treatment.

Now suppose instead that you're interested in the effect of CBT over time. You could then
place all 10 patients (s1-s10) in the CBT group and assess them at the conclusion of therapy
and again six months later. This design is displayed in figure 46(b). Time is a within-groups
factor with two levels (five weeks, six months). It's called a within-groups factor because
each patient is measured under both levels. The statistical design is a one-way within-
groups ANOVA. Because each subject is measured more than once, the design is also called
a repeated measures ANOVA. If the F-test for Time is significant, you can conclude that
patients' mean self-report scores changed between five weeks and six months.
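
In R, the between-groups design of figure 46(a) boils down to a single aov() call; a minimal sketch with made-up (purely hypothetical) self-report scores:

therapy = data.frame(
  treatment = rep(c("CBT", "EMDR"), each=5),             # between-groups factor
  score     = c(3, 2, 4, 3, 2, 6, 5, 7, 6, 5)            # invented self-report scores of s1-s10
)
therapy.aov = aov(score ~ treatment, data=therapy)
anova(therapy.aov)    # a significant F-test indicates different group means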

In the following, we perform a one-way ANOVA with the example dataset PMM
introduced in chapter 5.3. The dataset contains information on the sedimentary units which
will be used as grouping factors. The age and depth columns are not needed for the ANOVA,
and we will reduce the dataset to the first 5 chemical elements for better visibility. All these
columns will be de-selected using the respective function of the dplyr package.

After setting the working directory (setwd) we load the dataset PMM.txt:

PMM = read.delim("PMM.TXT")

For an initial overview, we prepare a boxplot showing the 5 selected elements and their
variances in the sedimentary units (figure 47):



library(reshape2)
library(dplyr)
PMM3 = dplyr::select(PMM, 1:2, 5:10) # select columns 1:2 and 5:10
PMM4 = reshape2::melt(PMM3, id.var=c("siteID","unit"))
# boxplot
library(ggplot2)
qplot(data=PMM4, x=variable, y=value, geom='boxplot', fill=unit)+
facet_grid(.~variable, scales='free')

Figure 47: boxplot of the PMM dataset showing the chemical elements by unit.

For getting a numerical overview on the means and standard deviations of the 5 elements
by each group we group the elements using the aggregate function of the basic stats
package. This is equivalent to the group_by function of the dplyr package, however, dplyr
does not allow grouping and a group calculation of several elements simultaneously.
This grouping is not necessary for the ANOVA, it is only an additional assessment!

# group means
PMM.mean = aggregate(PMM3[,3:8], by=list(PMM$unit), FUN=mean)
View(PMM.mean)

# group standard deviation


PMM.sd = aggregate(PMM3[,3:8], by=list(PMM$unit), FUN=sd)
View(PMM.sd)
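
For a single element, the dplyr equivalent of the aggregate() calls above is a group_by() followed by summarise(); a minimal sketch (not needed for the ANOVA itself):

library(dplyr)
PMM3 %>%
  group_by(unit) %>%
  summarise(mean_Ca = mean(Ca), sd_Ca = sd(Ca))   # group mean and standard deviation of Ca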

The command for the actual ANOVA is rather short and simple; however, aov can only
perform an ANOVA for one element at a time. We select the elements Calcium (Ca) and
Silicium (Si) and perform two separate ANOVAs. We then inspect the resulting ANOVA
tables with the anova function:

# ANOVA of Ca:



PMM.Ca = aov(data = PMM3, Ca ~ unit)
anova(PMM.Ca)

# ANOVA of Si
PMM.Si = aov(data = PMM3, Si ~ unit)
anova(PMM.Si)

Conclusion 1
The F-test for Calcium (PMM.Ca) is significant (p=0.035) at the 95% confidence level, and
the F-test for Silicium (PMM.Si) is even clearer (p<0.001). As a result, the mean content of
both Ca and Si differs significantly between the units.

The plotmeans function in the gplots package can be used to produce a graph of group
means and their confidence intervals:

library(gplots)
plotmeans(PMM3$Ca ~ PMM3$unit, xlab="Units", ylab="Calcium",
main="Mean plot with 95% confidence interval")

Figure 48: The means of Ca, with 95% confidence limits, in each unit of the PMM dataset.

Now, the ANOVA results tell us that the 4 units differ significantly in their mean Ca content,
and even more clearly in their mean Si content, but they don't tell us how the units differ
from each other! You can use a multiple comparison procedure to answer
this question. For example, the TukeyHSD() function provides a test of all pairwise
differences between group means, as shown next:

TukeyHSD(PMM.Ca)
TukeyHSD(PMM.Si)

The output of the Tukey HSD pairwise group comparison (HSD stands for: Honest
Significant Differences) can be plotted as follows:



tky.Si = as.data.frame(TukeyHSD(PMM.Si)$unit)
tky.Si$pair = rownames(tky.Si)

ggplot(data=tky.Si, aes(colour=cut(`p adj`, c(0, 0.01, 0.05, 1),


label=c("p<0.01","p<0.05","Non-Sig")))) +
geom_hline(yintercept=0, lty="11", colour="grey30") +
geom_errorbar(aes(pair, ymin=lwr, ymax=upr), width=0.2) +
geom_point(aes(pair, diff)) +
labs(colour="")+
ggtitle("95% family-wise confidence levels")

Figure 49: The result of the Tukey HSD pairwise group comparison on the differences in mean levels of Si, with
95% confidence limits. (PMM dataset).



7 Multiple and curvilinear regression

7.1 Multiple linear regression


(Chapter based on Logan 2011, chapter 9)
Multiple regression is an extension of simple linear regression whereby a response variable
is modeled against a linear combination of two or more simultaneously measured
continuous predictor variables. There are two main purposes of multiple linear regression:

1. to develop a better predictive model (equation) than is possible from models based
on single independent variables
2. to investigate the relative individual effects of each of the multiple independent
variables above and beyond the effects of the other variables.

Example: scatterplot matrix

Import or load the sample data set example2.txt

library(car)
scatterplot.matrix(~Ca+Ti+K+Rb+Sr+Mn+Fe,data=example2, diag="boxplot")

Conclusion 1
The element Mn obviously varies in a non-normal way (asymmetrical boxplot). Let us try out how a
scale transformation (e.g. a logarithm) changes that:

scatterplot.matrix(~Ca+Ti+K+Rb+Sr+log10(Mn)+Fe, data=example2,
diag="boxplot")

Conclusion 2
log10 transformation appears successful, no evidence of non-normality (symmetrical
boxplots)

7.2 Curvilinear regression


It has become apparent from our previous analysis that a linear regression model provides a
good way of describing the scaling properties of the data. However, we may wish to check
whether the data could be equally well described by a polynomial fit of a higher degree.

Example- Polynomial regression: (Logan 9F)



Dataset: mytilus.txt (Logan 9F)
Sokal and Rohlf (1997) present an unpublished data set in which the nature of the
relationship between the Lap94 allele (a gene variant) frequency in Mytilus edulis (blue
mussel) and distance (in miles) from Southport was investigated.
For biologists: Sokal and Rohlf (1997) transformed frequencies using angular transformations
(arcsin transformations). Hence, the dataset we are using has been transformed by:

asin(sqrt(LAP))*180/pi

We then have to show that a simple linear regression does not adequately describe the
relationship between Lap94 and distance by examining a scatterplot and a residual plot.

Scatterplot:
scatterplot(LAP ~ DIST, data=mytilus)

residual plot:
plot(lm(LAP ~ DIST, data=mytilus), which=1)

Conclusion 1
the scatterplot smoother suggests a potentially non-linear relationship and a persisting
pattern in the residuals further suggests that the linear model is inadequate for explaining
the response variable (Lap94).

We try to fit a polynomial regression (additive multiple regression) model incorporating up
to the fifth power (5th order polynomial).

Note that trends beyond a third order polynomial are unlikely to have much biological basis and
are likely to be over-fit. This is also true for most geoscientific applications.

mytilus.lm5 = lm(LAP ~ DIST + I(DIST^2) +I(DIST^3) + I(DIST^4) +


I(DIST^5), data=mytilus)

We check the output typing mytilus.lm5:

Coefficients:
(Intercept) DIST I(DIST^2) I(DIST^3) I(DIST^4) I(DIST^5)
2.224e+01 1.049e+00 -1.517e-01 6.556e-03 -1.033e-04 5.519e-07
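
As a side note (not part of the original recipe), the same fifth order model can be written more compactly with poly(); with raw=TRUE the fitted coefficients are identical to those of the I() terms above, only their names differ:

mytilus.lm5b = lm(LAP ~ poly(DIST, 5, raw=TRUE), data=mytilus)
coef(mytilus.lm5b)    # same values as the coefficients of mytilus.lm5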

examining the diagnostics by typing

plot(mytilus.lm5, which=1)

Conclusion 2
no "wedge" pattern of the residuals (see Fig. 42 in chapter 5.4.1), suggesting
homogeneity of variance and that the fitted model is appropriate.



Now, we want to examine the fit of the model with respect to the contribution of the
different powers:

anova(mytilus.lm5)

Analysis of Variance Table

Response: asin(sqrt(LAP)) * 180/pi

Df Sum Sq Mean Sq F value Pr(>F)

DIST 1 1418.37 1418.37 125.5532 2.346e-07 ***

I(DIST^2) 1 57.28 57.28 5.0701 0.04575 *

I(DIST^3) 1 85.11 85.11 7.5336 0.01907 *

I(DIST^4) 1 11.85 11.85 1.0493 0.32767

I(DIST^5) 1 15.99 15.99 1.4158 0.25915

Residuals 11 124.27 11.30

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What was already stated as "information" above is here put into numbers: powers of distance
beyond a cubic (third order, x³) do not make significant contributions to explaining the
variation of this data set.

For evaluating the contribution of an additional power (order) we can compare the fit of
higher order models against models one lower in order.

Comparing first order against second order:

mytilus.lm1<- lm(LAP ~ DIST, data=mytilus)


mytilus.lm2<- lm(LAP ~ DIST+I(DIST^2), data=mytilus)
anova(mytilus.lm1, mytilus.lm2)

adding a model for third order:


mytilus.lm3<- lm(LAP ~ DIST+I(DIST^2)+I(DIST^3), data=mytilus)

Comparing second order against third order:


anova(mytilus.lm2, mytilus.lm3)

Conclusion 3
the third order model (lm3) fits the data significantly better that a second order model



(lm2) (P=0.018), while the second order model is not really better than a linear model (lm1)
(P=0.087). Hence, we focus on the third order model and estimate the model parameters
from the summary:

summary (mytilus.lm3)

Call:
lm(formula = LAP ~ DIST + I(DIST^2) + I(DIST^3), data = mytilus)
Residuals:
Min 1Q Median 3Q Max
-6.1661 -2.1360 -0.3908 1.9016 6.0079

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.2232524 3.4126910 7.684 3.47e-06 ***
DIST -0.9440845 0.4220118 -2.237 0.04343 *
I(DIST^2) 0.0421452 0.0138001 3.054 0.00923 **
I(DIST^3) -0.0003502 0.0001299 -2.697 0.01830 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.421 on 13 degrees of freedom


Multiple R-squared: 0.9112, Adjusted R-squared: 0.8907
F-statistic: 44.46 on 3 and 13 DF, p-value: 4.268e-07

Coefficient of determination
r² is equal to the square of the correlation coefficient only within simple linear regression.
r² = SSreg / SStot reflects the explained variance.

Conclusion 4
there was a significant cubic (third order) relationship between the frequency of the Lap94
allele and the distance from Southport. The final equation of the regression is:

arcsin(sqrt(LAP)) = 26.2233 − 0.944·DIST + 0.042·DIST² − 0.00035·DIST³

We can now construct a summary figure:

# the y-axes label has some advanced layout:


ytext=expression(paste("Arcsin ", sqrt(paste("freq. of allele ",
italic("Lap"))^{94})))

# preparing the regression polynom:


x=seq(0,80,l=1000)
y=predict(mytilus.lm3, data.frame(DIST=x))
curve.lm3=data.frame(x,y)

# the plot



library(ggplot2)
ggplot(data=mytilus, aes(x=DIST, y=LAP))+
geom_point(colour='red', size=3)+
ylab(ytext)+
xlab("Distance (km)")+
geom_line(data=curve.lm3, aes(x=x, y=y), colour='blue')+
geom_text(x=15, y=45, label='y = 26.22 - 0.94x + 0.042x^2 - 0.00035x^3')

Figure 50: Graph showing the 3rd order polynomial (mytilus.lm3) plotted against the data points of the mytilus
dataset.

Dataset: bariumcont.txt (Trauth, 2006)


The data set consist of synthetic data resembling the barium content (in wt.%) down a
sediment core (in meters)

26: a) import the bariumcont data set (bariumcont.txt)

b) investigate what type of regression would fit the relation between barium content and
sediment depth best by using appropriate plotting tools and assessing at least two different
polynomial models.
c) extract the equation of the best regression curve from the summary of the respective model
and plot the data set together with the regression curve (500 points, x-max=20)



Box 2: useful non-linear functions
Table taken from Logan (2011) – Table 9.1



8 Cluster Analysis
Cluster Analysis is a technique for grouping a set of individuals or objects into previously
unknown groups (called clusters). The task is to assign these objects to clusters in a way that
the objects in the same cluster are more similar (in some sense or another) to each other
than to those in other clusters. In biology, cluster analysis has been used for decades in the
area of taxonomy where organisms are classified into arbitrary groups based upon their
characteristics. Such characteristics can be binary (i.e. feature present or absent), numerical
(quantity of a feature) or factorial (e.g. color of feature). We as humans cluster things
intuitively based on rules derived from experience, a computer however needs more
precise rules, namely:
1. a measure of distance between objects or clusters
2. a method to determine this distance between clusters
3. an algorithm for clustering
We will now go through these requirements and use R to cluster different datasets.
Some terms:
Objects/observations/samples or whatever you want to call them are
described by certain properties and these properties have values. In
mathematical terms we can see the observation as a vector and the values of
its properties as components (or coordinates) of that vector. We may use
the following mathematical notation:
Object = (value of Property 1, value of Property 2, ..., value of Property n)
Or in short: x⃗ = (x1, x2, ..., xn).
Vectors are elements of a vector space. In the simplest case an observation
therefore can be seen as a point in a n dimensional coordinate system, where
n is the number of components of that vector.
Our set of observations then can be arranged in a matrix. The observations are
usually in the rows and the variables in the columns.

( x⃗ )    ( x1  x2  ..  xn )
( y⃗ ) =  ( y1  y2  ..  yn )
( z⃗ )    ( z1  z2  ..  zn )
( .. )    ( ..  ..  ..  .. )

8.1 Measures of distance


Before we can attempt to group objects based on how similar they are we have to somehow
measure the proximity or distance of the one object relative to the other. There are several
ways to determine distances mathematically, depending amongst others on the type of data
at hand. Some of these measures are outlined in the following.
If we have binary qualitative data, say, two objects that are characterized by the presence
or absence of certain features we can use the Jaccard distance to determine how dissimilar



these two objects are. In Figure 51 we can see two objects A and B. Some features (points in
the figure) are present only in A, some only in B, and some in both.

We define the Jaccard distance as:

d_Jaccard(A, B) = (No. of features occurring only in A + No. of features occurring only in B)
                  / (No. of features occurring in A or B)
                = ( |A∖B| + |B∖A| ) / |A∪B|

So the Jaccard distance in the above example would be d_Jaccard(A, B) = (2 + 5) / 10 = 7/10, meaning
that 70% of the features occur only in one of the objects.
In R we can use the dist() function
DISTANCEMATRIX=dist(DATASET, method="")

to compute the distances between several objects. dist() calculates the distance between
the rows of a matrix, so make sure your DATASET has the right format. The set of
comparison results you get back is called distance matrix. method="binary" gives you the
binary (Jaccard) distance. In R the input vectors are regarded as binary bits, so all non-zero
elements are ‘on’ and zero elements are ‘off’. In these terms the distance can be seen as the
proportion of bits in which only one is on amongst those in which at least one is on, which
is an equivalent definition to the one given above.
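
A minimal sketch with two short, made-up presence/absence vectors illustrates the call:

A = c(1, 1, 1, 0, 0, 1)        # hypothetical presence/absence of six features in object A
B = c(1, 0, 1, 1, 0, 0)        # the same six features in object B
dist(rbind(A, B), method="binary")
# result: 3/5 = 0.6 - of the 5 features present in at least one object, 3 occur in only one of them
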
Another important dissimilarity measure often used in ecology is the Bray-Curtis dissimilarity. It
compares species counts at two sites by summing up the absolute differences between the
counts for each species at the two sites and dividing this by the sum of the total abundances
in the two samples. The general formula for calculating the Bray-Curtis dissimilarity
between samples A and B is as follows, supposing that the counts for species x are denoted
by nAx and nBx:
d_BrayCurtis(A, B) = Σx |nAx − nBx| / Σx (nAx + nBx), where the sums run over all m species x.

This measure takes on values between 0 (samples identical: nAx = nBx for all x) and 1 (samples
completely disjoint).
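
If the vegan package (used again in chapter 8.2.3) is available, its vegdist() function computes this measure directly; a minimal sketch with made-up species counts:

library(vegan)
site1 = c(10, 2, 0, 4)         # hypothetical counts of four species at site 1
site2 = c( 6, 0, 3, 4)         # the same four species at site 2
vegdist(rbind(site1, site2), method="bray")
# = (|10-6|+|2-0|+|0-3|+|4-4|) / (16+13) = 9/29 ≈ 0.31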



Exercise 27: You want to compare how similar two aquariums are. Calculate Jaccard
distance and Bray-Curtis dissimilarity with the formulas given above and the data
given below.

Number of individuals

Aquarium 1:   3   2   4   6
Aquarium 2:   6   0   0   11

If we have quantitative data the two most common distances used are Euclidean distance
and Manhattan (city-block) distance. Let us look at an example: We have analyzed two
different rocks for their content in calcium and silicium and we find the following

R1 = (calcium, silicium) = (r11, r12) = (0, 1) a.u. and R2 = (calcium, silicium) = (r21, r22) = (2, 2) a.u.
(a.u. = arbitrary units).

If we plot calcium against silicium (Figure 52) we can see two points which represent the
two different rocks. How different are they? A very intuitive way to think of the distance
between these two points is the direct connection between them (continuous line). This
distance is called the euclidean distance and can easily be calculated with the Pythagorean
theorem: d(R1, R2) = √( (r11 − r21)² + (r12 − r22)² ) = √5, as you know from school. Another way
to calculate the distance is to follow the dotted lines, as if we walked around in Manhattan.
Then we get the Manhattan (city-block) distance:
d(R1, R2) = |r11 − r21| + |r12 − r22| = 3. Obviously, the obtained distances are not the same;
the distance between two objects very much depends on how you measure it.

This concept can be extended to n dimensions. If we have two objects (vectors)
x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) that denote two points in an n-dimensional space,
the dissimilarity (distance) is defined as follows:

Euclidean: d(x, y) = √( Σk (xk − yk)² ), summed over the n components k. If n=2 we have the case
we saw in the example before.

Manhattan: d(x, y) = Σk |xk − yk|, summed over the n components k. If n=2 we have the case we
saw in the example before.

In the dist() function method="euclidean" and method="manhattan" respectively would


give you the aforementioned distances.

These distance measures depend on the scale. If two components are measured on
different scales you want to consider some standardization first, so that the
components contribute equally to the distance. Otherwise the larger
components will always have more influence.

In ecology euclidean distance has to be used with care. The difference between
none (0) and 1 individuals occurring mathematically is less than the difference
between 1 and 4 individuals occurring – ecologically this can be a huge
difference. In ecology, for example for comparing species composition at
different sites, you would therefore rather use other distance measures like
Bray-Curtis dissimilarity.

Exercise 28: Calculate the Manhattan and euclidean distance between the two
objects a = (1, 1, 2, 3) and b = (2, 2, 1, 0). One way to solve this is to use
R as a normal calculator applying the formulas above (or do it in your head). The
second one is to create two vectors a=c(a1,a2,a3,a4) and b and use
rbind(Vector1, Vector2) to combine them to a matrix. Then you can use the dist()
function.

Further reading about distance measures:


http://www.econ.upf.edu/~michael/stanford/maeb4.pdf
explanation of euclidean distance
http://www.econ.upf.edu/~michael/stanford/maeb5.pdf
explanation of non-euclidean distance

8.2 Agglomerative hierarchical clustering


One possible way of clustering is called agglomerative hierarchical clustering. It starts out
with each observation being a one-item cluster by itself. Then the clusters are merged until
only one large cluster remains which contains all the observations. At each stage the two
nearest clusters are combined to form one larger cluster. There are different methods to
determine which clusters are the nearest; these are called linkage methods.

8.2.1 Linkage methods


Have a look at Figure 53. Method A considers the (Euclidean) distance between the nearest
two objects of two clusters as the distance between these clusters. This is called
single-linkage (or nearest-neighbor). Analogously, method B is called complete-linkage (or



farthest-neighbor) and takes the maximum distance between objects in different clusters as
the distance. This method brings about more compact clusters than single-linkage. In
practice you would often use more sophisticated methods, which lead to better results. You
can take the average distance between all pairs of objects in the two clusters (called
average linkage) or you could use Ward linkage, which leads to rather homogeneous groups. All
of these linkage methods can be combined with any of the distance measures introduced
before.

8.2.2 Clustering Algorithm


A simple algorithm for agglomerative hierarchical clustering could look like this:
1. Initially all objects you want to cluster are alone in their cluster
2. Calculate the distances between all clusters using your linkage method
3. Join the closest two clusters
4. Go back to step 2 until all objects are in one comprehensive cluster



8.2.3 Clustering in R
In R we can use the function hclust(DISTMATRIX, method="") of the stats package. With
DISTMATRIX being a distance matrix obtained by the dist() function and method="" being
one of the following: "single", "complete" or "ward". Further linkage methods exist but
are now of no concern for us, you can look them up under ?hclust.
Let's get started:
Import the dataset PMM.txt in R. The variables are measured in grossly different ranges.
This might result in faulty clustering (try it if you want). Therefore we want to standardize
the range of the data first. We can use decostand() of the package vegan.
library(vegan)                  # provides decostand()
PMMn=decostand(PMM,"range")

“range” sets the highest value of each variable to 1 and scales the others accordingly – so
they are now percentages of the maximum.
Have a look at the dataframe. We are now interested in clustering the elements and not the
observations. The objects we want to cluster have to be in the rows of the dataframe.
Therefore we need to transpose our datamatrix (change rows and columns). This is done by



PMMnt=t(PMMn)

Now let us calculate the distances between the elements. Since these are measured on a
ratio scale it makes sense to use euclidean distances.
distm=dist(PMMnt, method="euclidean")

Now we use this distance matrix as input for our clustering. Let us first use single as
clustering method.
ahclust=hclust(distm, method="single")

A graphical output can easily be obtained by plot(ahclust). This gives you a dendrogram
where we can see how closely the observations are related. The length of the branches shows
the similarity of the objects. You can see that a lot of elements are added to existing
clusters in a stepwise fashion, i.e. one after the other. This so-called chaining is a
peculiarity of the single-linkage method.
If you have a lot of objects, presenting the result as a dendrogram is not very useful anymore.
It is then more informative to know the assignment of each object to a certain cluster for a
given number of clusters. For that we use the R function cutree(DATA,#CLUSTERS). The
result is a vector with as many components as we have objects, and each component tells
you the cluster for that object. If we want to divide our data into two groups we therefore
can now use
ahclust_2g=cutree(ahclust,k=2)

to get the assignment of each object to one of these two groups. You can type the variable name
ahclust_2g to get some idea about this assignment.
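
To get a quick overview of the result, you can tabulate the assignment vector and, after
plotting the dendrogram, draw the clusters into it (a minimal sketch):
table(ahclust_2g)                         # number of objects in each of the two clusters
plot(ahclust)                             # dendrogram
rect.hclust(ahclust, k=2, border="red")   # mark the two clusters in the dendrogram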
Exercise 29: Try out the other linkage methods. Are there significant differences?
Check by plotting all the dendrograms into one big graph (see Chapter 4.5
Combined figures if you forgot how to do that).
BONUSPOINTS: Cluster the non-standardized dataset PMM. Do the results make
sense?

Another function for hierarchical clustering with interesting possibilities is agnes


which can be found in the package cluster. You may want to check ?agnes for
details

8.3 Kmeans clustering


Another approach to clustering which is gaining importance is kmeans clustering. The basic
idea is to assign objects randomly to clusters and then reorder this assignment until you
find the best solution.
The procedure is an easy way to classify a given data set into a certain number of
k clusters which you determine beforehand. In the beginning you define k centroids in an n-
dimensional coordinate system, one for each cluster. The next step is to take each point
of the data set and associate it with its nearest centroid. After completion of this
first step we calculate k new centroids as barycenters (means) of the clusters resulting from the
previous step. After we have these k new centroids, a new assignment of the data points to



the now nearest centroids follows. This is repeated, and the k centroids change their location
step by step until no more changes occur.

An “algorithm” for kmeans clustering might be


1. Choose k points in the space which is represented by the objects that are being
clustered. These points represent initial cluster centroids.
2. Calculate the distance between the centroids and the objects.
3. Assign each object to the group with the centroid they are closest to.
4. When all objects have been assigned, recalculate the positions of the k centroids by
averaging the vectors of all objects assigned to it.
5. If there has been a change repeat Steps 2, 3 and 4 until the centroids no longer move,
else you are finished.

The basic function in R is kmeans(DATA,#CLUSTERS) of the stats package.



Let us use the same dataset as above then
km=kmeans(PMMnt,4)

will sort our objects into 4 different clusters. You don't need to calculate a distance matrix
beforehand because the distances are calculated anew in each computation round. The obtained
data structure is different from before. The vector which contains the assignment of the
objects to the different clusters can be obtained by
kmclust=km$cluster
kmclust
library(ggplot2) # plotting
qplot(names(km$cluster), km$cluster)

will give you a representation of this clustering method. Understandably, we will not get a
dendrogram this time.
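
Note that kmeans starts from randomly chosen centroids, so repeated runs can give slightly
different clusters. A minimal sketch to make the result reproducible and more stable
(set.seed() fixes the random start, nstart tries several starts and keeps the best solution):
set.seed(42)                        # reproducible random start
km=kmeans(PMMnt, 4, nstart=25)      # 25 random starts, the best one is kept
table(km$cluster)                   # cluster sizes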

8.4 Chapter exercises


Exercise 30: Forest sites characterized by coverage (in %) of different plants shall
be classified. Cluster the dataset forests.csv with the kmeans and the hclust
method.
When importing the data make sure to set row.names=1. Example command:
forests<-read.csv("forests.csv", header=TRUE, row.names=1)
Check the data first. Is some transformation beforehand necessary? Is the dataset
in the right format for the dist() function?
You may use any linkage procedure, use 4 classes in the kmeans clustering.
Create appropriate representations of your results.
BONUSPOINTS: Are the results of hclust and kmeans clustering on the level of 4
clusters the same? Compare the results of both algorithms qualitatively by
1. creating a table which shows the assignment in both cases. cbind() might
come in handy.
2. plotting both result vectors in the same plot. You can use lines(DATA,
type="p", col="red") to add the second plot to the first.
Would you say clustering is an objective statistical method?

Exercise 31: The districts of the Baltic can be grouped by composition of the algae
species. Cluster the sites in the dataset algae_presence.csv with agglomerative
hierarchical clustering and a linkage method of your choice. Use the
presence/absence of species for classification. What distance measure should
you use? Look at the dendrogram. Do the results make sense?
Repeat the exercise after performing a Beals transformation (see the following infobox
or ?beals) of the data. What distance measure should you use? Do your results
make more sense?
Beals transformation:
Beals smoothing is a multivariate transformation specially designed for species
presence/absence community data containing noise and/or a lot of zeros.
This transformation replaces the observed values (i.e. 0 or 1) of the target
species by predictions of occurrence on the basis of its co-occurrences with
the other remaining species (values between 0 and 1). In many applications,
the transformed values are used as input for multivariate analyses.
In R Beals transformation can be performed with the beals() function of the
vegan package.



8.5 Problems of cluster analysis
Cluster analysis cannot be regarded as an objective statistical method because:
• The choice of the similarity index is made by the user.
• Different linkage procedures give different results.
• The number of groups is chosen by the researcher.

Further reading:
Afifi, May, Clark (2012): Practical Multivariate Analysis, CRC Press. Chapter 16:
Cluster Analysis
A good introduction which is well understandable but more in-depth than in this
script.

http://www.econ.upf.edu/~michael/stanford/maeb7.pdf
Explanation of hierarchical clustering with examples.

bio.umontreal.ca/legendre/reprints/DeCaceres_&_Legendre_2008.pdf
A discussion about Beals transformation



8.6 R code library for cluster analysis
library(stats)
    Use: stats contains many basic statistical tools.

library(vegan)
    Use: vegan contains specific tools for ecologists.

dist(x, method="")
    Use: calculates the distances between the rows of a matrix and returns a distance matrix.
    x: a numeric matrix, data frame or "dist" object.
    method: the distance measure to be used. Must be "euclidean", "maximum", "manhattan",
    "canberra", "binary" or "minkowski".

decostand(x, method)
    Use: standardization.
    x: community data in a matrix.
    method: the standardization method, e.g. 'normalize', 'standardize', 'range'.
    See ?decostand for details.

t(x)
    Use: transposes a matrix.

hclust(d, method="")
    Use: agglomerative hierarchical clustering.
    d: a dissimilarity structure (distance matrix) as produced by dist().
    method: the agglomeration method to be used. This should be one of "ward", "single",
    "complete", "average", or others.

cutree(tree, k = , h = )
    Use: “cuts” a tree created by hierarchical clustering at a certain height or cluster number.
    tree: a tree as produced by hclust().
    k: desired number of groups.
    h: height where the tree should be cut.
    At least one of k or h must be specified; k overrides h if both are given.

kmeans(x, centers)
    Use: kmeans clustering.
    x: your input; has to be a numeric matrix of data.
    centers: the number of clusters, say k.

beals(x)
    Use: performs a Beals transformation of the data.
    x: input community data frame or matrix.
    Further parameters and details can be looked up with ?beals.



9 Ordination
One of the most challenging aspects of multivariate data analysis is the sheer complexity of
the information. If you have a dataset with 100 variables, how do you make sense of all the
interrelationships present? The goal of ordination methods is to simplify complex datasets
by reducing the number of dimensions of the objects. Recall that in the cluster analysis part
we defined objects with features as vectors with components. These objects can be thought
of as points in an n-dimensional space, with the values of the respective components giving
you the coordinates on n different coordinate axes. Up to n=3 this is easily conceivable,
but it works in exactly the same way for n>3.
The easiest way of dimension reduction would be to only consider one variable, e.g. the first
component of each vector and discard the rest for your analysis. This is of course not very
reasonable because you will lose a lot of information. Therefore different ordination
techniques have been developed that minimize the distortion of such a dimension
reduction. We will focus on two of these methods: Principal Component Analysis (PCA) and
non-metric multidimensional scaling (NMDS).

9.1 Principal Component Analysis (PCA)


Principal Component Analysis (PCA) is a powerful tool when you have many variables and
you want to look into things that these variables can explain. The basic idea behind PCA is
that the variables of the observations are correlated amongst each other, so the full dataset
contains redundant information. PCA is useful to reduce the number of
variables, because with PCA we can look for “supervariables” that sum up the information
of several variables without losing much of the information that the original data have.
Furthermore, we can also use PCA for finding structures and relationships in the data, for
example outliers.
More mathematically, PCA uses an orthogonal linear transformation to transform your
data of possibly correlated variables to a new coordinate system defined by a set of linearly
uncorrelated variables, called principal components. So it finds linear projections of your
data which preserve as much of the information in your data as possible.

9.1.1 The principle of PCA explained


A possible, graphic way to describe the general procedure of PCA is outlined in the
following and for the easy case of having only two variables the procedure is illustrated in
Figure 56:
1. First we standardize our variables in a way so that they have a mean of zero.
If we think of our variables as coordinates we move the origin of our coordinate
system to the mean of all variables.
2. Similar to creating a regression line we now look for our first Principal Component
(PC) by finding a combination (called a linear combination) of the original components



in such a way that the new PC variable accounts for the most variance in our data.
If we think of it graphically, each variable represents a coordinate axis. We now find
a coordinate axis in a way that it points into the main direction of the data spread.
This new axis is a combination of our original axes.
3. We repeat step 2. The next principal component is again a linear combination that
accounts for the most variance in the original variables, under the additional
constraint that it has to be orthogonal — that means uncorrelated — to all the other
PCs we already calculated.
If we think of it graphically, we create a further coordinate axis but it has to be
orthogonal to all the others we already created (in the example case of only having 2
variables, there is now only one possible option). We can repeat this step and find as
many PCs as we have original variables.



As a result of the PCA we get a new coordinate system with the axes represented by the PCs,
as seen in Figure 57. We get as many PCs as we had original variables. These PCs are
combinations of these original variables. In this new reference frame, note that variance is
greater along axis 1 than it is on axis 2. Secondly, note that the spatial relationships of the
points are unchanged, the process has merely rotated the data, but no information was lost.
Finally, note that our new vectors, or axes, are uncorrelated.



A measure for the amount of information a principal component carries is its variance. In
the process of PCA the covariance (or correlation) matrix of all variables is calculated; its
eigenvalues, one for each PC, tell you how much variance each
particular PC represents. Therefore PCs are arranged in order of decreasing eigenvalues:
the most informative PC is the first and the least informative the last. Now you can reduce
the dimensionality (and also the complexity) of your problem without losing much
information by only looking at the first few PCs. A further advantage is that, unlike your
original variables, the PCs are not correlated with each other at all.
We need to introduce two more definitions that are used in discussing the results of a PCA.
The first is component scores, sometimes called factor scores. Scores are the transformed
variable values corresponding to a particular data point i.e. its new coordinates. Loading is
the weight by which each original variable is multiplied to get the component score. The
loadings tell you about the contribution of each original variable to a PC, so a high loading
means the variable determines the PC to a large extent.
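
In formula form (a sketch, with the variables centered and, if the correlation matrix is
used, also scaled): the score of sample i on component k is the weighted sum
score(i,k) = ∑ loading(j,k) · x(i,j), summed over all original variables j.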
The exact mathematical reasoning and procedure of PCA shall be of no concern for us here.
We want to focus more on the application and interpretation of the results, so let's rather
get started in R.
Further reading:
http://yatani.jp/HCIstats/PCA
a simple explanation of PCA which also explains how to interpret the results.
http://strata.uga.edu/software/pdf/pcaTutorial.pdf
well comprehensible, more advanced description
http://ordination.okstate.edu/overview.htm
PCA and other ordination techniques for ecologists.

9.1.2 PCA in R
There are several possibilities to perform a PCA in R. We use a basic function from the stats
package: princomp(DATASET, cor=TRUE). cor specifies if the PCA should use the covariance
matrix or a correlation matrix. As a rough rule, we use the correlation matrix if the scales of
the variables are unequal. This is a conscious choice of the researcher!



An alternative to princomp() is prcomp() or the command principal() from
the 'psych' package.

Let us work again with a dataset we already know and love, however in a slightly modified
way: PMM2.txt. Load the dataset, create a sub-dataset PMM without the last two columns
(depth, phase), and then we can use
PMM_pca = princomp(PMM, cor=TRUE)

to carry out a complete PCA and get 15 principal components, their loadings and the scores
of the data. The first step of a PCA would be to calculate a covariance or correlation matrix.
However, the function will calculate it for us and we can use our raw data as input.
A basic summary of our analysis can be obtained by
print(PMM_pca)
summary(PMM_pca)

To get an idea about the data it is common to plot the scores of the 1st PC against the scores
for the 2nd PC
plot(PMM_pca$scores[,c(1,2)], pch=20, col=PMM2$phase)
text(PMM_pca$scores[,1],PMM_pca$scores[,2])
abline(0,0); abline(v=0)

alternatively using the 'ggplot2' package and its support-package 'ggfortify'.


library(ggplot2)
library(ggfortify)
autoplot(PMM_pca, data=PMM2, label=TRUE, colour='phase')+
geom_vline(aes(xintercept = 0))+
geom_hline(aes(yintercept = 0))

To get an overview, we can create a scatterplot matrix, for example like this:
pairs(PMM_pca$scores[,1:4], col=PMM2$phase, main="Scatterplot Matrix
of the scores of the first 4 PCs")

We will get a scatterplot matrix of all these components against each other.

Since we want to know which variables have the greatest influence on our data, we want to
have a look at the loadings of the PCs. One way to do this is to just type the variable name:
PMM_pca$loadings

A graphical representation can be obtained by:


barplot(PMM_pca$loadings[,1], ylim=c(-0.5,0.5),ylab="Loading",
xlab="Original Variables", main="Loadings for PC1")

which shows which elements have the highest influence on the first PC.



A very common display for PCA results with scores as points in a coordinate system (e.g. 1st
and 2nd PC) and the loadings as vectors in the same graph is called a biplot. The biplot is
easily obtained:
biplot(PMM_pca, choices=1:2); abline(0,0); abline(v=0)

choices selects the PCs to plot. It is quite useful in analysing and interpreting the results.
alternatively with 'ggplot2'/'ggfortify':
autoplot(PMM_pca, data=PMM2, label=TRUE, colour='phase',
loadings = TRUE, loadings.label = TRUE, loadings.colour =
'blue') +
geom_vline(aes(xintercept = 0))+
geom_hline(aes(yintercept = 0))

Exercise 32: Repeat the plots above, but this time looking at the relationship
between the 1st and the 3rd Principal Component.

At the moment, plotting principal components other than PC1 and PC2 does not
yet work with ggfortify::autoplot!

9.1.2.1 Selecting the number of components to extract


So, how many Principal Components are useful for further analysis? As often, there is no
absolute truth, but there are a couple of methods that generally lead to good results. The
most common approaches are based on the eigenvalues. As we know, the first PC is
associated with the largest eigenvalue, the second PC with the second-largest eigenvalue,
and so on. One criterion suggests retaining all components with eigenvalues greater than 1
because components with eigenvalues less than 1 explain less variance than contained in a
single original variable. Another possibility is a scree test. The scree plot, plotting
eigenvalues or variance against the PC number, will typically demonstrate a bend or elbow,
and all the components above plus one below this sharp break are retained:
plot(PMM_pca, type="lines")

alternatively and nice with the 'psych' package:


psych::scree(PMM, factors=FALSE, pc=TRUE, main="Scree plot")

We see that most of the information is in the first component and that from the 6th onward there
is not much information left in the components. From the summary we know that the
first 6 components explain 90% of the variance. In the next part we will see methods to
determine which PCs are still useful for further analysis.
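
The proportion of variance explained can also be checked numerically (a minimal sketch; the
standard deviations of the PCs are stored in PMM_pca$sdev):
var_explained = PMM_pca$sdev^2 / sum(PMM_pca$sdev^2)  # proportion of variance per PC
cumsum(var_explained)                                 # cumulative proportion of variance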

9.1.3 PCA exercises


Exercise 33 (based on an example in: Cook, Swayne: Interactive and Dynamic
Graphics for Data Analysis, Springer, 2007):



In western Australia rock crabs of the genus Leptograpsus occur. One species L.
variegatus has split into two new species, previously grouped by color: orange or
blue.

Figure 58: Leptograpsus variegatus


Preserved specimens lose their color, so it was hoped that morphological
differences would enable museum specimens to be classified. 50 specimens of each
sex of each species were collected on site at Fremantle, Western Australia. For each
specimen, five measurements were made which can be found in the dataset
australian-crabs.csv:

Variable Explanation

species orange or blue

sex male or female

group 1-4: orange&male, blue&male, etc.

frontal lip (FL) length, in mm

rear width (RW) width, in mm

carapace length (CL) length of midline of the carapace, in mm

carapace width (CW) maximum width of carapace, in mm

body depth (BD) depth of the body; for females, measured after displacement of the
abdomen, in mm

The main question we will try to answer is: Can we determine the species
and sex of the crabs based on these five morphological measurements?
We would like to have one single variable that allows us to classify a crab we find
correctly. Let's see if that is possible by completing the following subtasks:
1. View your data. From univariate box-plots assess whether any individual
variable is sufficient for discriminating the crabs's species or sex.
hint: melt the data set, then use ggplot + facet_grid
2. How could you determine if there is indeed a significant difference? Test if
there is a significant difference between the RW of the different groups.
➔ BONUSPOINTS: Create a scatterplot matrix of all measured variables
against each other.
hint: Use the parameter australian.crabs$group to color according to the
group they belong to. Group has to be a factor!
hint: use GGally::ggpairs(). → see chapter 7
Does this help us in distinguishing groups?
3. Perform a PCA on the dataset. Since the variables are not quite on the same
scale we use the PCA with the correlation matrix (cor=TRUE).
4. Plot the scores of the first PC against the crab group. Hint: use plot(x,y)
Does it help in distinguishing the groups?



Create a plot of the scores of the first two PCs. Does this help?
Create a scatterplot matrix of the scores of the first 4 PCs. Which ones
distinguish the groups best?
Can we determine sex and species of a crab we find by looking at one PC?
Hint: You can always color the groups by using the parameter
col=australian.crabs$group
5. Look at the loadings of all PCs. Can you explain why they are like that?
Create a plot of the loadings of the first two PCs. Which variables have the
highest influence on the PCs?
Create a biplot for PC1 and PC2, as well as PC2 and PC3. Can you give an
interpretation of what the new axes represent?
6. How many components would you retain for further analysis? Why?
7. BONUS: Your colleague has found a crab. He noted down the following
variables FL=0.91, RW=0.62, CL=0.81, CW=0.86, BD=0.90, but forgot to write
down which species and sex it was. Can you help him?
Hint: predict(pca-model, newdata)
8. BONUS: Perform a cluster analysis on the data set. Do you get the same 4
groups that were empirically determined?

9.1.4 Problems of PCA and possible alternatives


Principal Component Analysis is not suited for all data. The main problem is that it assumes
linear relationships between the variables. This is often not justified,
especially in ecology: species, for example, often show a unimodal response to
environmental factors. Other ordination techniques exist which might be more suitable.
One alternative is to use a higher-order, polynomial PCA. Other possibilities are
techniques that are summarized under the term multidimensional scaling (MDS), which will be
covered in the next part of this chapter.

9.2 Multidimensional scaling (MDS)


Multidimensional scaling (MDS) is a set of related ordination techniques often used for
exploring underlying dimensions that explain the similarities and distances between a set
of measured objects. Similar to PCA, these methods are based on calculating distances
between objects. In PCA many axes are calculated from the data, but only a few are viewed,
owing to graphical limitations. In MDS, a small number of axes is explicitly chosen prior to
the analysis and the data are fitted to those dimensions. So basically the distances of the
data are directly projected into a new coordinate system. Unlike other ordination methods,
such as PCA which assumes linear relationships in the data, MDS makes few
assumptions about the nature of the data, so it is very robust, more flexible and well suited
for a wide variety of data. MDS also allows the use of any distance measure between the
samples, unlike other methods which specify particular measures, such as covariance or
correlation in PCA. So MDS procedures solve most of the problems of PCA and are gaining in
significance also in environmental and ecological applications. In community ecology non-
metric multidimensional scaling (NMDS) is a very common method, which we will
discuss in the following.
Before you go on, you might find it helpful to review the concepts of distance
and similarity of objects in 8.1 Measures of distance.



9.2.1 Principle of a NMDS algorithm
You start out with a matrix of data consisting of n rows of samples and p columns of
variables, such as taxa for ecological data. From this, a n x n distance matrix of all pairwise
distances among samples is calculated with an appropriate distance measure, such as
Euclidean distance, Manhattan distance or, most common in ecology, Bray-Curtis distance.
The NMDS ordination will be performed on this distance matrix. In NMDS, only the rank
order of entries in the distance matrix (not the actual dissimilarities) is assumed to contain
the significant information. Thus, the purpose of the non-metric MDS algorithm is to find a
configuration of points whose distances reflect as closely as possible the rank order of the
original data, meaning that the two objects farthest apart in the original data should also be
farthest apart after NMDS and so on.
Next, a desired number of m dimensions is chosen for the ordination. The MDS algorithm
begins by assigning an initial location to each item of the samples in these m dimensions.
This initial configuration can be entirely random, though the chances of reaching the
correct solution are enhanced if the configuration is derived from another ordination
method. Since the final ordination is partly dependent on this initial configuration, a
program performs several ordinations, each starting from a different random arrangement
of points, and then selects the ordination with the best fit, or applies other procedures in
order to avoid the problem of local minima.
Distances among samples in this starting configuration are calculated and then regressed
against (“compared” with) the original distance matrix. In a perfect ordination, all
ordinated distances would fall exactly on the regression, that is, they would match the
rank-order of distances in the original distance matrix perfectly. The goodness of fit of the
regression is measured based on the sum of squared differences between ordination-based
distances and the distances predicted by the regression. This goodness of fit is called stress.
It can be seen as the mismatch between the rank order of distances in the data, and the
rank order of distances in the ordination. The lower your stress value is, the better is your
ordination.
The configuration is then improved by moving the positions of samples in ordination space
by a small amount in the direction in which stress decreases most rapidly. The ordination
distance matrix is recalculated, the regression performed again, and stress recalculated.
This entire procedure of nudging samples and recalculating stress is repeated until the
stress value seems to have reached a (perhaps local) minimum.
Further reading:
http://strata.uga.edu/software/pdf/mdsTutorial.pdf
Excellent, well comprehensible introduction which covers both the principles and
the application in R in a more comprehensive manner than in this text.
http://www.unesco.org/webworld/idams/advguide/Chapt8_1.htm
Provides a little more mathematical background, but still on introductory level.
http://ordination.okstate.edu/overview.htm
NMDS and other ordination techniques for ecologists.



9.2.2 NMDS in R
The function we want to use in R is called metaMDS of the package 'vegan'. In order to
perform NMDS we first need to calculate the distances between items. metaMDS is a smart
function and will take on this task for you as well, using vegdist(). However, if you want to
scale your data and calculate the distances with a different function, metaMDS also accepts
a distance matrix as input.
The vegan package is designed for ecological data, so the metaMDS default
settings are set with this in mind. For example, the default distance metric is
Bray-Curtis and common ecological data transformations are turned on by
default. For non-ecological data, these settings may distort the ordination.
One alternative possibility for NMDS analysis would be to use the isoMDS()
function in the MASS package.

So our work in R is rather easy. We load our forest dataset by (watch out for the correct
directory path!)
forests<-read.csv("forests.csv", header=TRUE, row.names=1)

Since metaMDS is a complex function, there are a lot of possible parameters. You will want
to check
?metaMDS

to see what possible parameters there are. MetaMDS needs the following structure of the
dataset: columns → variables ; rows → samples
As the original forest dataset is organized the other way around, we need to transpose it:
t_forests=t(forests)

Now a simple NMDS analysis of our dataset with the default settings could look like this:
def_nmds_for=metaMDS(t_forests)

We might wish to specify some parameters:


nmds_for=metaMDS(t_forests, distance = "euclidean", k = 3,
autotransform=FALSE)

distance is the distance measure used (see 8.1 Measures of distance), k is the number of
dimensions, and autotransform specifies whether automatic transformations are turned on or off.
You can see which objects the metaMDS function returns by
names(nmds_for)

the important ones are


nmds_for$points #sample/site scores
nmds_for$species #scores of variables (species / taxa in ecology)
nmds_for$stress #stress value of final solution
nmds_for$dims #number of MDS axes or dimensions
nmds_for$data #what was ordinated, including any transformations
nmds_for$distance #distance metric used

We can view which parametes were used by writing the output variable name:
nmds_for



Important for us are the sample and variable scores, which we can extract by
variableScores <- nmds_for$species
sampleScores <- nmds_for$points

The column numbers correspond to the MDS axes, so this will return as many columns as
was specified with the k parameter in the call to metaMDS.
We can obtain a plot by:
plot(nmds_for)

Sites/samples are shown by black circles, the taxa by red crosses.


MDS plots can be customized by selecting either "sites" or "species" in display=, by
displaying labels instead of symbols by specifying type="t" and by choosing the dimensions
you want to display in choices=.
plot(nmds_for, display = "species", type = "t", choices = c(2, 3))

You can even obtain fancier plots by further customization. By specifying type
="n", no sample scores or variable scores will be plotted. These can then be
plotted with the points() and text() commands. For crowded plots, congestion
of points and labels can be alleviated by plotting to a larger window and by
using cex to reduce the size of symbols and text. An example:
plot(nmds_for, type="n")
plots axes, but no symbols or labels
points(nmds_for, display=c("sites"), choices=c(1,2), pch=3,
col="red")
plots points for all samples (specified by “sites”) for MDS axes 1 & 2 (specified
by choices). Note that symbols can be modified with typical adjustments,
such as pch and col.
text(nmds_for, display=c("species"), choices=c(1,2), pos=1,
col="blue", cex=0.7)
plots labels for all variable scores (specified by “species”) for MDS axes 1 & 2.
Typical plotting parameters can also be set, such as using cex to plot smaller
labels or pos to determine the label position.

9.2.3 NMDS Exercises


Exercise 34: Ordinate the forests data with different distance measures (Euclidean,
Manhattan, Bray-Curtis). Which one fits the data best? At which parameter do you
have to look and what is its value?
Exercise 35: Ordinate the algae_presence.csv data set before and after doing a
Beals transformation. Compare by creating appropriate plots.

9.2.4 Considerations and problems of NMDS


(Based on a tutorial by S. Holland:
http://strata.uga.edu/software/pdf/mdsTutorial.pdf)



The ordination will be sensitive to the number of dimensions that is chosen, so this choice
must be made with care. Choosing too few dimensions will force multiple axes of variation
to be expressed on a single ordination dimension. Choosing too many dimensions is no
better in that it can cause a single source of variation to be expressed on more than one
dimension. One way to choose an appropriate number of dimensions is perform ordinations
of progressively higher numbers of dimensions. A scree diagram (stress versus number of
dimensions) can then be plotted, on which one can identify the point beyond which
additional dimensions do not substantially lower the stress value. A second criterion for
the appropriate number of dimensions is the interpretability of the ordination, that is,
whether the results make sense.
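
Such a scree diagram of stress against the number of dimensions can be produced with a
small loop (a minimal sketch, assuming the transposed forest data t_forests from above;
trace=FALSE only suppresses the console output):
stress_k = sapply(1:5, function(k) metaMDS(t_forests, k=k, trace=FALSE)$stress)
plot(1:5, stress_k, type="b", xlab="Number of dimensions", ylab="Stress")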
The stress value reflects how well the ordination summarizes the observed distances among
the samples. Several “rules of thumb” for stress have been proposed, but have been
criticized for being over-simplistic. Stress increases both with the number of samples and
with the number of variables. For the same underlying data structure, a larger data set will
necessarily result in a higher stress value, so use caution when comparing stress among
data sets. Stress can also be highly influenced by one or a few poorly fit samples, so it is
important to check the contributions to stress among samples in an ordination.
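
In vegan this check is straightforward (a minimal sketch for the metaMDS result from above):
stressplot(nmds_for)   # Shepard diagram: ordination distances vs. original dissimilarities
goodness(nmds_for)     # goodness of fit per sample; larger values indicate poorer fit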
Although MDS seeks to preserve the distance relationships among the samples, it is still
necessary to perform any required transformations to obtain a meaningful ordination. For example,
in ecological data, samples should be standardized by sample size to avoid ordinations that
primarily reflect sample size, which is generally not of interest.
A real disadvantage of NMDS is its use of ranks: if a PCA already fits the data well,
the NMDS results will be comparatively worse.



9.3 R code library for ordination
library(vegan)
    Use: vegan contains specific tools for ecologists.

princomp(x, cor=)
    Use: Principal Component Analysis.
    x: dataset.
    cor: use the correlation matrix (TRUE/FALSE).

screeplot(x, type=, main="")
    Use: creates a “scree plot”.
    x: a PCA result (e.g. from princomp).
    type: barchart or lines.
    main: title of the graph.

biplot(x, choices=)
    Use: creates a so-called “biplot”.
    x: a PCA result (e.g. from princomp).
    choices: a vector with two components, specifies which PCs to use in the biplot.

metaMDS(x, distance="", k=, autotransform=)
    Use: performs nonmetric multidimensional scaling.
    x: dataset or distance matrix.
    distance: the distance measure that should be used, e.g. "manhattan", "euclidean",
    "bray", "jaccard".
    k: number of dimensions.
    autotransform: (TRUE/FALSE) performs automatic transformations suited for
    community ecology. Default is TRUE!


10 Spatial Data
The analysis of spatial data is a very hot topic in the R community. Things change very fast
and it is very difficult to get an overview of the wide range of methods and packages. If
you are interested in the subject, we recommend subscribing to the R-SIG-Geo
mailing list and reading the book by Bivand et al. 2013.

http://cran.r-project.org/web/views/Spatial.html
Bivand, Roger S., Pebesma, Edzer J., Gómez-Rubio, Virgilio, 2013: Applied
Spatial Data Analysis with R, Series: Use R!, Springer, 2008, XIV, 378 p., ISBN
978-0-387-78170-9 - Available for students of CAU Kiel as free ebook
https://stat.ethz.ch/mailman/listinfo/R-SIG-Geo/
R Special Interest Group on using Geographical data and Mapping

Because of the limited time available for this subject we will focus on the practical aspects
of spatial analysis, i.e. things you might need if you add maps to your statistical project or
final thesis. This includes mainly import of vector and raster maps, plotting of maps and
statistical analyses.
First, we need to define the different types of spatial data
• Point data, e.g. the location and one or more properties like the location of a tree and
its diameter. Normally this type is considered the simplest case of a vector file, but
we treat it separately, because mapping in ecology means frequently going out with
a GPS and writing down (or recording) the position and some properties (e.g. species
composition, occurrence of animals, diameter of trees...)
• Vector data with different sub-types like a road or river map (normally coming
from a vector GIS like ArcGIS).
• Grid data or raster data are files with a regular grid, like digital images from a camera, a
digital elevation model (DEM) or the results of global models.

10.1 First example


First, install the packages spatstat and gpclib, then load spatstat
library (spatstat)

To get an overview about the main functions you can use


demo(spatstat)

A first example shows us the location of trees


data(swedishpines)



X <- swedishpines

str(X)

plot(X)

summary(X)

plot(density(X, 10))

10.2 Background maps


One of the most frequent applications for GIS is the use of background maps to display
spatial values. A typical example is the data base of water quality of freshwater systems in
Schleswig-Holstein (source: M. Trepl, Landesamt für Landwirtschaft, Umwelt und ländliche
Räume, LLUR). Fig. 59 shows an overview of the data base. It contains values of NH4-N,
Nitrate-Nitrogen, ortho-Phosphat-P, total Nitrogen and total P from over 400 rivers in
Schleswig-Holstein. The variables lon and lat contain information about geographical
longitude and latitude.

Figure 59: Structure of the water quality data base

First, we load the libraries and the data set.

library(ggplot2)
library(reshape2)
library(dplyr)
load(file="wq_map.Rdata")
str(wq_map)

Background maps are handled by

library(ggmap)

To get a background map of Schleswig-Holstein we can use the name

sh <- get_map(location="Schleswig-Holstein", zoom=8)


gg <- ggmap(sh)
gg



The get_map function has a lot of options and is quite flexible. You can use not only Google
maps, but also satellite images and you can enter the location in coordinates.
Fig. 60 shows a common use of maps: the spatial distribution of mean content of NO3 in
freshwater systems. It is produced with the code below. First we calculate the mean nitrate
content for each sampling point.

# calculate mean values


NO3_raw = dplyr::filter(wq_map,PARAMETER=="Nitrat-N")
NO3_Mess_1 = dplyr::group_by(NO3_raw,SamplePoint)
NO3_Mean = dplyr::summarise(NO3_Mess_1,
                            NO3_MEAN=mean(Value,na.rm=TRUE),
                            lon=mean(lon,na.rm=TRUE),
                            lat=mean(lat,na.rm=TRUE))

NO3_Mean$NO3_class=cut(NO3_Mean$NO3_MEAN,10)
# plot values in map
gg <- ggmap(sh)
gg <- gg + geom_point(data=NO3_Mean,
mapping=aes(x=lon, y=lat, size=2,color=NO3_MEAN))+
scale_color_gradientn(colours =rev(heat.colors(10)))
gg

Figure 60: Spatial distribution of average nitrate content in freshwater systems of Schleswig-Holstein

Exercise 36: Plot the nitrate means for the Kiel region with the following background
maps: satellite map, OpenStreetMap (osm) and – for your artist friends –
Stamen maps in watercolor.



Exercise 37: Plot the spatial distribution of the number of nitrate samples since 1992;
select only the 100 most frequently sampled points to avoid a cluttered map.

10.3 Spatial interpolation


A common question in ecology is how to interpolate from point data to spatial averages
and/or sums to get e.g. an areal average of precipitation based on station data. As always,
there are too many ways to happiness in R which use different approaches and produce
different results in different output formats.
For all methods we use the average nitrate content of freshwater systems in Schleswig-
Holstein from the last chapter.
First, we have to prepare the correct data formats

library(sp)
library(maptools)
library(spatstat)
NO3_Mean=as.data.frame(NO3_Mean) # mean nitrate content
# make map object
NO3_map=sp::SpatialPointsDataFrame(NO3_Mean[,c("lon","lat")],
data.frame(NO3_Mean$NO3_MEAN))
# convert to a ppp object used with the maptool library
NO3_ppp = maptools::as.ppp.SpatialPointsDataFrame(NO3_map)

10.3.1 Nearest neighbour


The nearest neighbour method is in fact no interpolation at all: it just takes the value from
the nearest point.

NO3_nn=spatstat::nnmark(NO3_ppp,at="pixels")
plot(NO3_nn)
class(NO3_nn)

The result is an image which can be converted easily to a data frame for further analysis.

nn_df=as.data.frame(NO3_nn)
str(nn_df)

10.3.2 Inverse distances


The inverse distance method is the most frequently used interpolation method. The value
of a pixel is calculated as a weighted average of the neighbouring points, with weights
decreasing with distance.

NO3_idw=spatstat::idw(NO3_ppp,at="pixels")
plot(NO3_idw)
idw_df=as.data.frame(NO3_idw) # conversion made easy
qplot(data=idw_df,x=x,y=y,fill=value,geom="raster")

ggplot(idw_df, aes(x, y)) +


geom_raster(aes(fill = value)) +
scale_fill_gradientn(colours = terrain.colors(10))



Exercise 38: Calculate and plot the differences between the different spatial interpolation
methods.

10.3.3 Akima
The akima library interpolates irregularly spaced input data onto a regular grid. It uses bilinear or
bicubic spline interpolation with different algorithms. Unfortunately it also uses a different
data format and therefore requires a little bit more effort.

library(akima)
raw8=NO3_Mean

The central function is the interpolation of spatial data.

ak <- interp(raw8$lon, raw8$lat, raw8$NO3_MEAN, duplicate="mean", linear=TRUE)
image(ak)

Unfortunately you cannot plot the result directly with ggplot. The following code
transforms the result to a format you can plot with ggplot.

# transform ak to a data frame
t=data.frame(x=1,y=1,z=1)            # create a new data frame with a dummy first row
for (j in 1:length(ak$x)) {
  for (k in 1:length(ak$y)) {
    t=rbind(t,c(x=ak$x[j],y=ak$y[k],z=ak$z[j,k]))
  }
}
t=t[-1,]                             # remove the dummy first row again
t=t[!is.na(t$z),]                    # remove missing values


contour(ak$x, ak$y, ak$z)
ggplot()+
geom_raster(data=t,aes(x=x,y=y,fill=z),alpha=0.5)+
scale_fill_gradientn(colours=rainbow(30))

10.3.4 Thiessen polygons


A common method for spatial interpolation is the so-called “Thiessen” polygon method,
sometimes called “Dirichlet” or “Voronoi” method. The problem in R is that the results of
this method are polygons and not raster data as before. This means that we will have to
convert the results in a rather complex procedure.

library(spatstat)
library(maptools)
library(sp)

Based on the former point frame, the calculation is quite straightforward.


NO3_thiessen=dirichlet(NO3_ppp)



class(NO3_thiessen)
NO3_thiessen=as(NO3_thiessen, "SpatialPolygons")
class(NO3_thiessen)

The result is a spatial structure which cannot be used directly later on, so we convert it to
SpatialPolygons. This produces the spatial structures, but they do not contain the NO3
values yet. In the next step we assign the values to the spatial structures.
int.Z = over(NO3_thiessen, NO3_map, fn=mean)
class(int.Z)
data.spdf = SpatialPolygonsDataFrame(NO3_thiessen, int.Z)
names(data.spdf)="NO3_Mean"
class(data.spdf)
plot(NO3_thiessen)
plot(data.spdf)
spplot(data.spdf, "NO3_Mean", col.regions = rainbow(20))

Finally we have to convert the vector file to a raster format and save it to disk.

#conversion to a raster file


library(raster)
r <- raster(ncol=300, nrow=300) # create structure
extent(r) <- extent(data.spdf) # assign coordinates
rp <- rasterize(data.spdf, r, 'NO3_Mean') # assign values
plot(rp, col=rainbow(20))
writeRaster(rp, "rjj.grd",format="GTiff") #raster

10.4 Point Data

Example from http://help.nceas.ucsb.edu/R:_Spatial

Point data are possibly the most frequent application for ecologists. Typically, positions are
recorded with a GPS device and then listed in Excel or even as plain text.
The procedure in R to convert point data to an internal or ESRI map is straightforward:
• read in the data
• define the columns containing the coordinates
• convert everything to a point shapefile
Following is a brief R script that reads such records from a CSV file, converts them to the
appropriate R internal data format, and writes the location records as an ESRI Shape File.
The file Lakes.csv contains the following columns: 1: LAKE_ID, 2: LAKENAME, 3:
Longitude, 4: Latitude. For compatibility with ArcMap GIS, Longitude must appear
before Latitude.
library(sp)

library(maptools)



LakePoints = read.csv("Lakes.csv")

Columns 3 and 4 contain the geographical coordinates.


LakePointsSPDF =
SpatialPointsDataFrame(LakePoints[,3:4],data.frame(LakePoints[,1:4]))
plot(LakePointsSPDF)

Now write a shape-file for ESRI Software.


maptools:::write.pointShape(coordinates(LakePointsSPDF), data.frame(LakePointsSPDF), "LakePointsShapeRev")
writeSpatialShape(LakePointsSPDF,"LakePointsShapeRev2")

10.4.1 Bubble plots


A quite useful chart type is the spatial bubble plot – the size of the bubble is proportional to
the value of the variable
library(sp)
library(lattice)
#data(meuse)
#coordinates(meuse) = ~x+y
## bubble plots for cadmium and zinc
data(meuse)
coordinates(meuse) <- c("x", "y") # promote to SpatialPointsDataFrame
bubble(meuse, "cadmium", maxsize = 1.5, main = "cadmium concentrations
(ppm)", key.entries = 2^(-1:4))
bubble(meuse, "zinc", maxsize = 1.5, main = "zinc concentrations
(ppm)", key.entries = 100 * 2^(0:4))

10.5 Raster data


Raster data are quite common in ecology. They can come as a digital elevation model (DEM)
or a satellite image. R offers a full range of functions for them; first load the libraries:
library(raster)

library(rgdal)

The easiest way to import a grid is to use the gdal library, but we then have to convert the
result manually to the raster format.
lu87grd = readGDAL("lu87.asc")

Check type and structure of the variable


str(lu87grd)

lu87=raster(lu87grd)

str(lu87)

The structure of raster is different.



spplot(lu87)

demgrd = readGDAL("dem.asc")

dem=raster(demgrd)

spplot(dem)

First, let us check the frequency distribution of elevation


hist(dem)

lu07grd = readGDAL("lu07.asc")

lu07=raster(lu07grd)

spplot(lu07grd)

An analysis of different land use frequencies is similar


hist(lu07)

To show you how maps are used for statistics we want to find out the land use type on steep
slopes.
slopegrd = readGDAL("slope.asc")

slope=raster(slopegrd)

spplot(slope)

hist(slope)

Extract all cells with slope >4


steep = slope>4

Multiply with the land use map – multiplication by 0 gives 0, for 1 the land use value is kept.
lu_steep = steep * lu87

Finally, count the different classes

freq(lu_steep)

Exercise 39: Calculate the change in forest cover from 1987 until 2007 at elevations
> 1000 m (code for forest: 1); count the cells with decreasing and increasing forest
cover.
Hint: use logical functions to select data sets.



10.6 Vector Data
Vector data are normally handled with a GIS. A very common software product is ArcGIS,
but this package is also very expensive and due to the copy protection very complicated to
install and use. Therefore, some open source packages are worth a try.

Some open source or free GIS packages


• http://grass.fbk.eu/ GRASS, available for many operating systems,
the oldest and biggest system
• http://www.qgis.org/ QGIS, for Windows, Linux and MacOS X

• http://52north.org/communities/ilwis the ILWIS GIS, similar to


ArcView 3.2

The following tutorial is taken from: Paul Galpern, 2011: Workshop 2: Spatial
Analysis – Introduction, R for Landscape Ecology Workshop Series, Fall 2011,
NRI, University of Manitoba (http://nricaribou.cc.umanitoba.ca/R/)

Unfortunately, R is not very suitable for vector data; we therefore suggest that you prepare
the vector files as far as possible with a real GIS. If you really want to take a close look at
vector maps in R you can read the following help files and the book by Bivand et al. 2008.

Adrian Baddeley, 2011: “Handling shapefiles in the spatstat package”


http://cran.r-project.org/web/packages/spatstat/vignettes/shapefiles.pdf

First, load the required libraries


library(rgeos)

library(maptools)

library(raster)

Read the shape file in a R map. Shape-files, i.e. files with an extension .shp are vector files
in an ArcView format which can be used by all GIS packages.
vecBuildings <- readShapeSpatial("patchmap_buildings.shp")
vecRoads <- readShapeSpatial("patchmap_roads.shp")
vecRivers <- readShapeSpatial("patchmap_rivers.shp")
vecLandcover <- readShapeSpatial("patchmap_landcover.shp")
str(vecRivers)

Next, plot the four maps


plot(vecLandcover, col="grey90", border="grey40", lwd=2)



plot(vecRoads, col="firebrick1", lwd=3, add=TRUE)
plot(vecRivers, col="deepskyblue2", lwd=10, add=TRUE)
plot(vecBuildings, cex=2, pch=22, add=TRUE)

Because R is not good with vector maps we convert everything to a raster. First, we define the size
and extent of the new raster map
rasTemplate <- raster(ncol=110, nrow=110, crs=as.character(NA))
extent(rasTemplate) <- extent(vecLandcover)

The final conversion is


rasLandcover <- rasterize(vecLandcover, rasTemplate, field="GRIDCODE")

The field="GRIDCODE" part defines the variable which contains the code for the land use.
rasBuildings <- rasterize(vecBuildings, rasTemplate)
rasRoads <- rasterize(vecRoads, rasTemplate)
rasRivers <- rasterize(vecRivers, rasTemplate)

Finally, check the result with a plot


plot(rasLandcover)

plot(rasBuildings)

plot(rasRoads)

plot(rasRivers)

A simple application of map operations is e.g. the creation of a buffer zone around streets or
buildings. This can be done with the boundaries() function (formerly called edge()), which
draws a line around the edges of a raster
ras2 <- boundaries(rasRoads, type="outer")

but you can check with


plot(ras2)

that only the edges are drawn. To add one map to the other we use
rasRoads2 <- cover(rasRoads, ras2)

You can also join the commands above into one:

ras2 = raster(rasBuildings, layer=2)
ras3 = boundaries(ras2, type="outer")
# note: wrong data type
rasBuildings <- cover(rasBuildings, ras3)
rasBuildings <- cover(rasBuildings, boundaries(rasBuildings, type="outer"))
rasRoads <- cover(rasRoads, boundaries(rasRoads, type="outer"))

The final step is to combine the buildings, roads, rivers, and landcover rasters into one. We
will cover the landcover raster with the other three.



Examining e.g. the rasRoads plot, you will notice that the roads are assigned a value of 1
and non-roads a value of 0. In order to cover one raster over another, we
need to set these 0 values to NA. On a raster, NA implies that a cell is transparent. So let’s do
this for all the covering rasters:
rasBuildings[rasBuildings==0] <- NA

rasRoads[rasRoads==0] <- NA

rasRivers[rasRivers==0] <- NA

The features on each of these three rasters have a value of 1. In order to differentiate these
features on the final raster we need to give each feature a different value. Recall that our
landcover classes are 0 to 4. Let’s set rivers to 5, buildings to 6, and roads to 7. It seems to
be standard practice to use a continuous set of integers when creating feature classes on
rasters.
rasRivers[rasRivers==1] <- 5

rasBuildings[rasBuildings==1] <- 6

rasRoads[rasRoads==1] <- 7

And now we can combine these using the cover function, with the raster on top first, and
the raster on bottom last in the list:
patchmap <- cover(rasBuildings, rasRoads, rasRivers, rasLandcover)

You can now plot the map


plot(patchmap, axes = TRUE, xlab = "Easting (km)", ylab =
"Northing (km)", col = c(terrain.colors(5), "blue", "black", "red"))

or export it to an ArcGIS format:


writeRaster(patchmap, filename="myPatchmap.asc", format="ascii")

10.7 Working with your own maps


If you want to work with your own maps, it is always a good idea to check the maps first
library(rgeos)
library(maptools)

To read the landuse map from the GIS-course you can first check the file
getinfo.shape("landuse.shp")

and then read it in and plot it.


myLanduse <- readShapeSpatial("landuse.shp")
plot(myLanduse)

Writing back a shapefile is also easy


writePolyShape(myLanduse,"testshape.shp")



The attributes of a map are stored in so-called slots. You get a list of them with
slotNames(myLanduse)

The slot we are interested in is data


str(myLanduse@data)

where you find all attributes of the map. You can manipulate these variables as usual, e.g.
myLanduse@data[1,]

If you want to manipulate or select data you can


attach(myLanduse@data)
myLanduse@data[GRIDCODE==1,1]

You could also use the rgdal library


library(rgdal)
myLand2 <- readOGR(dsn="landuse.shp", layer="landuse")

To plot the landuse (and any other attribute) you can


spplot(myLanduse, "GRIDCODE")
or convert it first to a raster map:
LUTemplate <- raster(ncol=110, nrow=110, crs=as.character(NA))
extent(LUTemplate) <- extent(myLanduse)

The final conversion is


LUraster <- rasterize(myLanduse, LUTemplate, field="GRIDCODE")

40: Calculate the sum and the average size of the different land use classes
(variable GRIDCODE)
41: Recode the GRIDCODE variable (or create a new one) so that there are only two
classes, water and land (water is 5); plot and export the map



11 Time Series Analysis
http://cran.r-project.org/web/views/TimeSeries.html - The current
task view of time series analysis in R
http://www.statsoft.com/textbook/sttimser.html: Tutorial on Time
Series Analysis
http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
Chair of Statistics, 2011: A First Course on Time Series Analysis (Open
Source Book),
http://statistik.mathematik.uni-wuerzburg.de/timeseries/

11.1 Definitions
Time Series: "In statistics and signal processing, a time series is a sequence of data points,
measured typically at successive times, spaced at (often uniform) time intervals. Time series
analysis comprises methods that attempt to understand such time series, often either to
understand the underlying theory of the data points (where did they come from? what
generated them?), or to make forecasts (predictions). Time series prediction is the use of a
model to predict future events based on known past events: to predict future data points before
they are measured. The standard example is the opening price of a share of stock based on its
past performance."
Trend: "In statistics, a trend is a long-term movement in time series data after other
components have been accounted for."
Amplitude: "The amplitude is a non-negative scalar measure of a wave's magnitude of
oscillation."
Frequency: "Frequency is the measurement of the number of times that a repeated event
occurs per unit of time. It is also defined as the rate of change of phase of a sinusoidal
waveform (measured in Hz). Frequency has an inverse relationship to the concept of
wavelength."
Autocorrelation: "Autocorrelation is a mathematical tool used frequently in signal processing
for analysing functions or series of values, such as time domain signals. Informally, it is a
measure of how well a signal matches a time-shifted version of itself, as a function of the
amount of time shift" (the lag). "More precisely, it is the cross-correlation of a signal
with itself. Autocorrelation is useful for finding repeating patterns in a signal, such as
determining the presence of a periodic signal which has been buried under noise, or
identifying the missing fundamental frequency in a signal implied by its harmonic
frequencies."
Period: the time period or cycle duration is the reciprocal value of the frequency: T = 1/frequency
All citations are from the corresponding keywords at www.wikipedia.org (2006).



11.2 Data sets
The data set for this part of the course is erle_stat_engl.csv; it contains the following
columns:

Name Content
Date Date
Peff Effective precipitation (mm)
Evpo_Edry Evaporation from dry alder carr (mm)
T_air Air temperature (°C)
Sunshine Sunshine duration (h)
Humid_rel Relative Humidity (%)
H_GW Groundwater level (m)
H_ERLdry Water level in dry part of alder carr (m)
H_ERLwet Water level in wet part of alder carr (m)
H_lake Water level in Lake Belau (m)
Infiltra Infiltration into the soil (mm)

11.3 Data management of TS

11.3.1 Conversion of variables to TS


First, read the file into a data frame in R:
t <- read.csv("erle_stat_engl.csv")

The following command converts the text of a German date ("31.12.2013") into an
internal date variable:
t$date <- as.Date(as.character(t$Date), format="%d.%m.%Y")

Another common format is “2013-12-31” which would be converted with :


t$date <- as.Date(as.character(t$Date), format="%Y-%m-%d")

The conversion with as.character is sometimes necessary, because date values from files are
sometimes read in as factor variables.
It is useful to convert dates into a standard format available on many platforms, the POSIX
format, which counts seconds from 1970.
t$posix <- as.POSIXct(t$date)
POSIXct stores the value as a single number and is the variant best suited for data frames,
whereas the alternative version POSIXlt stores the individual date/time components
(year, month, day, ...) in a list.
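As a small illustration of the difference between the two variants (a sketch, assuming the data frame t from above):
p <- as.POSIXlt(t$date[1])   # one value as POSIXlt
unclass(p)                   # shows the list of components (sec, min, hour, mday, mon, year, ...)
p$year + 1900                # years are counted from 1900
p$mon + 1                    # months are counted from 0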

A much easier way to convert text to POSIXct is the anytime library. It takes many
common formats and converts them to dates without complex format strings.
library("anytime")
anytime("2016-12-30")
Despite the German origin of the author of the library, one of the formats it does not



understand is the common German date format. Fortunately, it is quite easy to add it to the
list of available date formats.
anytime("30.12.2016") # does not work

getFormats() # lists all available date formats

addFormats(c("%d.%m.%Y","%d.%m.%y")) # define the German formats

anytime("30.12.2016") # German format with 4 digit year

anytime("30.12.16") # German format with 2 digit year

In the same way you can add any time/date format to the library.
An easier way is to use the csv.get function from the package Hmisc, which converts the text
from a file directly into a date variable.
library(Hmisc)

t3 <- csv.get("erle_stat_engl.csv", datevars="Date", dateformat="%d.%m.%Y")

Last but not least: the easiest way to get date/time variables into R is to import files directly in
Excel format with the readxl library.
More information is available in the description of the packages chron and zoo, which are
useful for time series with unequal spacing. Some important methods are:
DateTimeClasses(base)    Date-Time Classes

as.POSIXct(base)         Date-time Conversion Functions

cut.POSIXt(base)         Convert a Date or Date-Time Object to a Factor

format.Date(base)        Date Conversion Functions to and from Character

round.POSIXt(base)       Round / Truncate Date-Time Objects

axis.POSIXct(graphics)   Date and Date-time Plotting Functions

hist.POSIXt(graphics)    Histogram of a Date or Date-Time Object

11.3.2 Creating factors from time series


“Factors” play an important role in the classification of a data set. They correspond roughly
to the horizontal and vertical headers in a pivot-table in Excel. Examples of frequently used
factor variables are e.g. names of species in biology or years and months in time series.
Some factor-variables are already identified automatically when R reads a file.
Unfortunately, R also thinks that date variables in text form (“1.1.2006”) are factor
variables. The conversion of these variables is explained in chapter 11.3.1. To create a factor
based on time series we mainly use the following functions:
cut() and factor()

cut.POSIXt(base)



The creation of a factor variable with years is easy:
t$years = cut.Date(t$date, "years")

creates a factor variable containing the years of the data set. The extraction of months and
weeks is similar (see help(cut.Date) for a summary of all possibilities).
You can check the results with
levels(t$years)

You can now use the factor to classify your data set for many functions. With the command
qplot(years,H_GW,data=t,geom="boxplot")

you get a separate boxplot for each year.


The creation of a monthly variable is done by:
t$months = cut.Date(t$date, "months")

The command creates a separate factor level for each month of each year (e.g. January 1978,
February 1978, ...). Frequently this is not what you want. If you need a mean value for all
months in the data set (e.g. for seasonal analysis), you have to extract the name or number
of the months and then convert them into a factor which can be used for a boxplot etc.
mon_tmp <- format.Date(t$date,format="%m")

t$months <- factor(mon_tmp)

year_tmp <- format.Date(t$date,format="%Y")

t$years <- factor(year_tmp)

boxplot (t$T_air ~ t$months)

Another variable we use frequently is the day number from 1 to 365 (the day of the year):


library(timeDate)

t$julianday=timeDate::dayOfYear(timeDate(t$date))

qplot (data=t, y=T_air, x=julianday,geom="jitter",col=years)

Once the factors are defined, you can use the dplyr library to create all kinds of summaries
(sums, means, ...). Note that as.Date(years) below assumes that t$years was created with
cut.Date (levels such as "1989-01-01").
t_annual = dplyr::group_by(t, years)

t_ann_mean = dplyr::summarise(t_annual,
                              mean_t   = mean(T_air, na.rm=TRUE),
                              median_t = median(T_air, na.rm=TRUE),
                              sum_prec = sum(Peff, na.rm=TRUE))

qplot(as.Date(years), mean_t, data=t_ann_mean, geom="line")



qplot(as.Date(years),sum_prec,data=t_ann_mean,geom="line")

In a more modern syntax you can plot all summary variables at once:


t_ann = reshape2::melt(data=t_ann_mean,id.vars=c("years"))

ggplot(data=t_ann,aes(x=as.Date(years),y=value))+

geom_line()+

facet_grid(variable ~ .,scales="free")

42: create boxplots with annual and monthly values of lake water level and
groundwater level
43: create a scatterplot matrix for single months of daily values of lake
water levels (hint: use melt/cast functions for wide/narrow conversions,
use day of month and year as row identifiers)

11.4 Statistical Analysis of TS

11.4.1 Statistical definition of TS


In statistics, TS are composed of the following subcomponents:
Y_t = T_t + S_i + R_t

where
T = Trend, a monotone function of time t
S = one or more seasonal component(s) (cycles) of length/duration i
R = Residuals, the unexplained rest
The analysis of TS is entirely based on this concept. The first step is usually to detect and
eliminate trends. In the following steps, the cyclic components are analysed. Sometimes,
the known seasonal influence is also removed.
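A minimal sketch of this decomposition in R, assuming the daily lake level from the data set above has been read into the data frame t (the stl function used in chapter 11.7 does the same job more flexibly):
lake <- ts(t$H_lake, start=c(1989,1), frequency=365)  # daily series, one cycle per year
dec  <- decompose(lake)                               # additive split into trend, seasonal and random part
plot(dec)                                             # observed, trend, seasonal and remainder in one figure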

11.4.2 Trend Analysis


Normally, trend analysis is a linear or non-linear regression analysis with time as x-axis or
independent variable. Many authors also use the term for the various filtering algorithms
that are normally used to make plots of the data look smoother.

11.4.2.1 Regression Trends


For the analysis of trends you can use the regression methods from Excel or R. Some
packages offer “detrended” TS, which are the residuals of a regression analysis. The basic
procedure is:



● compute the regression equation (linear or non-linear)
Y_t = b_0 + b_1·x_t

● compute the detrended residuals (see the sketch below)

Y*_t = Y_t - (b_0 + b_1·x_t)
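A minimal sketch of both steps for the lake water level (assuming the data frame t with the columns date and H_lake from above); the air temperature in the exercise below works in exactly the same way:
trend <- lm(H_lake ~ as.numeric(date), data=t, na.action=na.exclude)  # linear trend over time
summary(trend)                                   # b0 (intercept) and b1 (slope)
t$H_lake_detrend <- residuals(trend)             # Y*_t = Y_t - (b0 + b1*x_t)
plot(H_lake_detrend ~ date, data=t, type="l")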

44: use a linear model to remove the trend from the air temperature (Hint:
function lm, look at the contents of the results)

11.4.2.2 Filter
Some TS show a high degree of variation and the real information may be hidden in it. This is
why there are several methods for filtering or smoothing a data set. Sometimes this process is
also called "low-pass filtering", because it removes the high pitches from a sound file and lets
the low frequencies pass. The most frequently used methods are splines and moving averages.
Moving averages are computed as mean values of a number of records before and after the
actual value; the width of the averaging window determines the "smoothness" of the curve.
If filtering is used to remove trends, "detrended" means the deviations from the moving
averages.
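A minimal sketch of a centred moving average with the built-in filter function (assuming the data frame t from above); the window width k controls the smoothness:
k <- 31                                                                       # window width in days
t$H_lake_smooth <- as.numeric(stats::filter(t$H_lake, rep(1/k, k), sides=2))  # centred moving average
plot(H_lake ~ date, data=t, type="l", col="grey60")
lines(H_lake_smooth ~ date, data=t, col="red", lwd=2)                         # the smoothed (low-pass filtered) curve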

11.5 Removing seasonal influences


Seasonal influences are known effects that are reasonably stable in terms of annual timing,
direction, and magnitude. Possible causes include natural factors (the weather),
administrative measures (starting and ending dates of the school year), and social, cultural
or religious traditions (fixed holidays such as Christmas).
You can remove this influence as follows:
● calculate the mean values for seasonal components (e.g. monthly mean values in a
data set with monthly values)
● subtract this mean value from the original data set
Please keep in mind that the removal of seasonal influences is a very complicated process –
there are many possibilities and methods.
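A minimal sketch of the two steps for the lake level, using monthly means as the seasonal component (assuming the data frame t from above):
t$months <- factor(format.Date(t$date, "%m"))                               # month as factor (01..12)
month_mean <- ave(t$H_lake, t$months, FUN=function(x) mean(x, na.rm=TRUE))  # mean of the corresponding month
t$H_lake_deseason <- t$H_lake - month_mean                                  # subtract the seasonal mean
boxplot(H_lake_deseason ~ months, data=t)                                   # the seasonal signal is (mostly) gone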

45: Remove the seasonal trend from the air temperature (Hint: use
daynumbers)



11.6 Irregular time series
In ecology, time series often have an irregular spacing. There are several packages which
can be used to produce a regularly spaced data set.
help(zoo)
help(its)
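A minimal sketch of how zoo can be used to regularize an irregular series; obs and obs_dates are hypothetical names for the measured values and their (Date) time stamps:
library(zoo)
z     <- zoo(obs, order.by=obs_dates)              # irregular series
grid  <- zoo(, seq(start(z), end(z), by="day"))    # empty zoo object with a regular daily index
z_reg <- na.approx(merge(z, grid))                 # merge onto the grid and interpolate the gaps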

11.7 Practical TS analysis in R


First, we have to define the data set as a time series.
attach(t)

lake = ts(H_lake, start=c(1989,1),freq=365)

Next we can already plot an overview of the analysis


lake_stl = stl(lake, s.window="periodic")

plot(lake_stl)

or look at the text summary:


summary(lake_stl)
A look at the structure of the results
str(lake_stl)
reveals that you can extract the detrended and deseasonalized remainder with
clean_ts = lake_stl$time.series[,3]
for further analysis. Please take a look at the help-page of the procedure to understand
what happens below the surface.

For time series analysis we often need so-called "lag" variables, i.e. the data set shifted
backwards or forwards by a number of time steps. A typical example is the unit hydrograph,
which relates the actual discharge to the effective precipitation of a number of past days.
This number of days is called the "lag". You can create the corresponding time series with the lag function:
ts_test = as.ts(t$H_GW) # Groundwater
lagtest <- ts_test # temp var
for (i in 1:4) {lagtest <- cbind(lagtest,lag(ts_test,-i))}
Now check the structure and the content of lagtest.

46: analyse groundwater water level (detrend, remove seasonal trends)

11.7.1 Auto- and Crosscorrelation


We continue to use our data set with the water level in the lake Belau.



The function to calculate the autocorrelation is acf; you can check the help file for syntax and
parameters with
help(acf)

A simple function call without parameters uses the default maximum lag:


erle_acf <- acf(H_ERLdry)

erle_acf

The following, more complex command is more useful and adapted to our data set; it
calculates the autocorrelation for a whole year (365 days) and plots the coefficients.
erle_acf <- acf(H_ERLdry, lag.max=365, plot=TRUE)

The cross-correlation analysis is very similar. To analyse the relation between water level
and precipitation:
erle_ccf <- ccf(H_ERLdry, Peff, lag.max=30, plot=TRUE)

By splitting the output screen into several windows you can get a concise overview of
the relations between the different variables:
split.screen(c(2,2))

ccf(H_ERLdry, Peff, lag.max=30, plot=TRUE)

screen(2)

ccf(H_ERLdry, Evpo_Edry, lag.max=30, plot=TRUE)

screen(3)

ccf(H_ERLdry, H_lake, lag.max=30, plot=TRUE)

screen(4)

ccf(H_ERLdry, Infiltra, lag.max=30, plot=TRUE)

close.screen(all = TRUE)

The ggplot version looks similar, but uses a different logic: it builds a long (narrow) version
of the data set using the rbind command.
t1=ccf(t$H_ERLdry, t$Peff, lag.max=30, plot=FALSE)
ccf_all=data.frame(t1$acf,t1$lag,t1$snames)
t1=ccf(t$H_ERLdry, t$Evpo_Edry, lag.max=30, plot=FALSE)
ccf_all=rbind(ccf_all,data.frame(t1$acf,t1$lag,t1$snames))
t1=ccf(t$H_ERLdry, t$H_lake, lag.max=30, plot=FALSE)
ccf_all=rbind(ccf_all,data.frame(t1$acf,t1$lag,t1$snames))
t1=ccf(t$H_ERLdry, t$Infiltra, lag.max=30, plot=FALSE)
ccf_all=rbind(ccf_all,data.frame(t1$acf,t1$lag,t1$snames))

qplot(t1.lag,t1.acf,data=ccf_all,geom="line",col=t1.snames)+
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)



47: analyse the influence of the lake level (H_lake) and the groundwater
level (H_GW) on the water level in the wet part of the alder carr (H_ERLwet)
48: analyse the autocorrelation of different nutrients from the wqual.data
(see page 149 for a description)

11.7.2 Fourier- or spectral analysis


One of the most frequent problems in time series analysis is the detection and identification
of the cycles or periods in a data set. In real life this is similar to the analysis of the different
notes in a sound file.
The steps for a spectral analysis are given below (the original code was created by Earl F. Glynn <[email protected]>):
air = read.csv(“air_temp.csv")

TempAirC <- air$T_air

Time <- as.Date(air$Date, "%d.%m.%Y")

N <- length(Time)

oldpar <- par(mfrow=c(4,1))

plot(TempAirC ~ Time)

# Using fft (fast Fourier Transform)

transform <- fft(TempAirC)

# Extract DC component from transform

# modulus of the first element

dc <- Mod(transform[1])/N

# for help see help(spec.pgram)

periodogram <- round( Mod(transform)^2/N, 3)

# Drop first element, which is the mean

periodogram <- periodogram[-1]

# keep first half up to Nyquist limit

# The Nyquist frequency is half the sampling frequency




periodogram <- periodogram[1:(N/2)]

# Approximate number of data points in single cycle:

print( N / which(max(periodogram) == periodogram) )

# plot spectrum against Fourier Frequency index

plot(periodogram, col="red", type="o",

xlab="Fourier Frequency Index", xlim=c(0,25),

ylab="Periodogram",

main="Periodogram derived from 'fft'")


The plot reads as follows: a frequency index of 10 means that there are ten periods in the
data set; the duration of one period is therefore N/10, which is about 365 days – one year.
A second possibility to find the frequency distribution is the spectrum
function.

# The same thing, this time using spectrum function

s <- spectrum(TempAirC, taper=0, detrend=FALSE, col="red",

main="Spectral Density")

# this time with log scale

plot(log(s$spec) ~ s$freq, col="red", type="o",

xlab="Fourier Frequency", xlim=c(0.0, 0.005),

ylab="Log(Periodogram)",

main="Periodogram from 'spectrum'")

cat("Max frequency: ")

maxfreq <- s$freq[ which(max(s$spec) == s$spec) ]

print(maxfreq)

# Period will be 1/frequency:

cat("Corresponding period\n")

print(1/maxfreq)
Please note that the frequency refers to the whole data set (i.e. 3652 points) and
not to a year, a day, or the defined time step.



# restore old graphics parameter

par(oldpar)

Next, we can use a different approach with a different scaling. The base period is now 365
days, i.e. a frequency of 1 means one cycle per year.
air = read.csv("http://www.hydrology.uni-kiel.de/~schorsch/air_temp.csv")

airtemp = ts(air$T_air, start=c(1989,1), freq = 365)

spec.pgram(airtemp, xlim=c(0,10))

# draw lines for better visibility

abline(v=1:10,col="red")

To compute the residuals, we use the information from spectral analysis to create a linear
model.

x <- (1:3652)/365

summary(lm(air$T_air ~ sin(2*pi*x)+cos(2*pi*x)+ sin(4*pi*x)


+cos(4*pi*x) + sin(6*pi*x)+cos(6*pi*x)+x))

49: analyse the periodogram of the lake water level before and after the
stl analysis

11.7.3 Breakout Detection


A common problem in ecology is to find the point where a system changes, e.g. as a result of
political measures or changes in the environment; this point is called a "breakpoint". Twitter
developed the BreakoutDetection package to detect "trending tweets", but it can also be used
to detect such changes in ecological time series. Unfortunately, the installation is a little more
complex, because it requires the devtools package.

install.packages("devtools") # install development package


devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

Frequently the required software is not fully installed the first time; if you get an error
message, simply try again.

load("wq_map.Rdata") # data set from the map exercises


no3=wq_map[((as.character(wq_map$PARAMETER))=="Nitrat-N") &
           (wq_map$Regions=="Elbe SH"),]   # select data from the Elbe River



no3=droplevels.data.frame(no3)
no3=no3[no3$SamplePoint==120015,] # select one point
Z=no3[,c("Date","Value")]
Z$Date=as.POSIXct(Z$Date)
names(Z)=c("timestamp","count")
bre=breakout(Z,min.size=30,method="multi",plot=TRUE)
bre$plot
str(Z)

Another library for changepoint analysis is:

library(changepoint)
bre=cpt.mean(no3$Value)
qplot(data=no3,x=1:length(no3$Date),y=Value)+
geom_vline(xintercept=bre@cpts[1])
no3$Date[bre@cpts[1]]

Last but not least, a newer package:


library(breakpoint)
obj1 <- CE.Normal.Mean(as.data.frame(no3$Value), distyp = 1,
penalty = "mBIC", parallel =TRUE)
profilePlot(obj1, as.data.frame(no3$Value))

50: Check the changepoint(s) for phosphate

11.8 Sample data set for TS analysis


As a final example you can analyze the water quality data set of the Kielstau catchment, our
UNESCO-Ecohydrology demo site.
The sampling was carried out on a daily basis with an automatic sampler (daily sample) and
on a weekly basis (manual sampling, "Schöpfprobe"). The water of the automatic sampler was
not cooled; the collected samples were taken to the lab once a week, at which time the manual
sample was taken. The parameters of the data set comprise water quality indicators
(nutrients), climate variables (temperature, rain) and hydrologic variables (discharge).

Possible questions about the data set are


• what is the auto- or cross-correlation of the variables (i.e. how stable is the system,
does it react very fast...)
• what is the relation between the variables (correlation analysis). Is this relation
independent of other variables (summer/winter)
• the values for the manual and automatic sampling should be quite similar. Is this
true?



• does the climate influence the nutrient contents (temperature, precipitation)?
• do the hydrologic variables influence the nutrient contents?

x select a statistical question, present the results (max. 5 figures)
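As a starting point for the comparison of manual and automatic sampling, a minimal sketch (assuming the data set has been read into a data frame wq; the column names are listed in Table 1 below):
library(ggplot2)
qplot(NH4_N, S_NH4_N, data=wq) + geom_abline(intercept=0, slope=1, col="red")  # points should lie near the 1:1 line
cor(wq$NH4_N, wq$S_NH4_N, use="complete.obs")                                  # agreement expressed as a correlation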



Field Content
Datum_Jkorrekt Date, readable
Datum Date numeric
week
Month
Year
NH4_N NH4 Sampler
S_NH4_N NH4 Manual Sample
NO3
NO3_N
S_NO3_N
PO4_P
S_PO4_P
Ptot Total P
S_Ptot
Chlorid
S_Chlorid
Sulfat
S_Sulfat
Filt_vol Filtered volume
Sed Sediment
Q Discharge
W Water level
S_Watertemp
Quality_level
CloudCover
REL_HUMIDITY
VAPOURPRESSURE
AIRTEMP
AIR_PRESSURE
WINDSPEED
TEMP_MIN_SOIL
AIRTEMP_MINIMUM
AIRTEMP_MAXIMUM
WIND_MAXIMUM
PREC_INDEX
PRECIPITATION
SUNSHINE
SNOWDEPTH
S_Index Distance to manual sample
S_Number
reverse reverse distance to manual sample
Summer Summer yes/no
Table 1: Variables in the water quality data set („S_“ = from weekly manual sampling)



12 Practical Exercises

12.1 Tasks
The central question in the first units of the course is:
• Has the climate of the Hamburg station changed since measurement began?
The question can be divided, for example, into the following sub-questions or sub-tasks:
• Comparing winter precipitation intensities
• Has the intensity of (winter) precipitation during the years 1959-89 changed in
comparison to the years 1929-59?
• Are trends identifiable in annual mean, minimum and maximum temperature (linear
regression with time as the x-axis)?
• Has the difference between summer and winter temperatures changed?

12.1.1 Summaries
The following examples are organized according to level of difficulty.



Calculation of the mean and sum of annual temperatures (pivot table with year as outline variable,
which must be created from the date field).
Calculation of annual and monthly means or sums (pivot table with year and month as outline
variables)
Calculation of mean daily variation (calculation of the difference between minimum and maximum,
followed by pivot table)
Creation of cumulative curves of precipitation, calculation of the mean cumulative curve over the
whole data record.
Calculation of the sum of temperatures within a given range (temperatures above 5°C, obtained
via the “if” function).
Calculation of summer and winter precipitation (creation of a binary (0/1) variable for summer and
winter months using the if-function, followed by a pivot table with year and the binary variable as
outline variables).
Calculation of the onset of the vegetation period, defined as the day with a temperature sum >200,
coding with binary variables, extraction of the day-number using a pivot table.
Analysis of the vegetation period, defined by the time from temperature sum >200 until 11/01
(trees), calculation of the precipitation during the vegetation period.

12.1.2 Regression Line


All previously created pivot tables can be used for trend analysis. It is possible to vary the time
period of the analysis, e.g. the last 5, 10, and 30 years.

12.1.3 Database Functions


Selection of the time of year
Number of summer days or days with frost (selection with temperature<0, subsequently calculate
using a pivot table)
Obtain the mean precipitation intensities (selection of all days with precipitation>0, subsequently
calculate the mean value using a pivot table)
Calculation of the onset of the vegetation period, selection of day using the filter function

12.1.4 Frequency Analyses


• Analysis of extreme precipitation
Analysis of rain duration: a rain period is defined as consecutive days with precipitation>0; create an "if"
formula (add), if necessary distribute into multiple columns, calculate the mean duration using a
crosstab (without day 0!), and obtain the frequency (see the sketch below).
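In R, the runs of consecutive wet days can be found conveniently with rle; a minimal sketch, assuming a data frame Climate with a daily precipitation column prec:
wet  <- !is.na(Climate$prec) & Climate$prec > 0   # TRUE on days with precipitation
runs <- rle(wet)                                  # lengths of consecutive TRUE/FALSE runs
rain_duration <- runs$lengths[runs$values]        # keep only the wet runs
mean(rain_duration)                               # mean duration of a rain period
table(rain_duration)                              # frequency of each duration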



13 Applied Analysis
Use the Hamburg climate dataset to compare the climate of the years 1950-1980 with 1981-
2010. The data set is already converted to R format; please load the workspace
climate.rdata to avoid formatting and conversion problems.

Select one variable and create a figure of 800x1200 pixels with the following
contents:
• a plot of the original data
• a plot of annual, summer and winter mean values,
• a boxplot of decadal values (use as.integer to calculate the decade factors)
• a violinplot of the two periods 1950-1980 and 1981-2010
• a lineplot of the monthly means or sums for the two periods 1950-1980 and 1981-
2010
• a boxplot of the daily values as function of period and month
• put everything together in one figure of 800x1200 pixels (see Figure 61), send us
the result by email and have a nice Christmas :-)

Hints
• prepare the figures step by step
• use aggregate to calculate the annual and monthly summaries (a minimal sketch follows below)
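A minimal sketch of the aggregate hint for the annual and monthly means (assuming the workspace provides a data frame Climate with the columns Year, Month and Mean_Temp):
ann <- aggregate(Mean_Temp ~ Year, data=Climate, FUN=mean)           # annual means
mon <- aggregate(Mean_Temp ~ Year + Month, data=Climate, FUN=mean)   # monthly means per year
plot(Mean_Temp ~ Year, data=ann, type="l")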

The following variables should be analyzed:

• Cloud_Cover
• RelHum
• Mean_Temp
• Airpressure
• Min_Temp_5cm
• Min_Temp
• Max_Temp
• prec
• sunshine
• snowdepth

If you have some time left:



51: Use ANOVA to compare annual means and variances of different variables

52: Analyse the slope of the different variables. Is there a significant increase?



Figure 61: Summary of one variable (Mean Temperature)



14 Solutions
Solution 2:
Climate$Year_fac = as.factor(Climate$Year)
Climate$Month_fac = as.factor(Climate$Month)

Solution Error: Reference source not found:

First Version:
Climate$Summer = 0
Climate$Summer[Climate$Month>5 & Climate$Month<10]=1

An alternative Version of the first command:


Climate$Summer[!(Climate$Month>5 & Climate$Month<10)]=0
The result is a numeric variable

Second Version:
Climate$Summer = (Climate$Month>5) & (Climate$Month<10)
The result is a boolean variable

Solution Error: Reference source not found:


plot(Mean_Temp ~ Date, type="l")
lines(Max_Temp ~ Date, type="l", col="red")
lines(Min_Temp ~ Date, type = "l", col="blue")

Solution 11:
m2 = (Max_Temp+Min_Temp)/2
scatterplot(Mean_Temp ~ m2)
scatterplot(Mean_Temp ~ m2| Year_fac)

Solution Error: Reference source not found:


attach(Climate_original)
from1950=Climate_original[Year>1949 & Year <1981,]
from1981=Climate_original[Year>1980 & Year <2011,]
detach(Climate_original)
t1950 = aggregate(x = from1950$Mean_Temp, by =
list(from1950$Month),FUN = mean, simplify = TRUE)
t1981 = aggregate(x = from1981$Mean_Temp, by =
list(from1981$Month),FUN = mean, simplify = TRUE)
ymax=max(t1981$x,t1950$x)
ymin=min(t1981$x,t1950$x)
plot(t1950,ylim=c(ymin, ymax))
lines(t1981)



scatterplotMatrix(~ Mean_Temp + Max_Temp + Mean_RelHum + Prec + Sunshine_h |Summer)

Solution 39:

Count the changed land use cells

check=lu07==lu87

How much forest disappeared between 1987 and 2007 at elevations above
2000 m?

Remove all elevations < 1000 m

ue1000 = dem > 1000

t2 = ue1000 * dem

spplot(t2)

ue1000b = (dem > 1000) * dem

forest87=lu87==1

forest07=lu07==1

ue1000 =dem>1000

forest87a=forest87*ue1000

forest07a=forest07*ue1000

# decrease: forest in 1987 (=1), no forest in 2007 (=0)

diff87_07 = (forest87a ==1) & (forest07a == 0)

spplot(diff87_07)

summary(diff87_07)

Cells: 770875

NAs : 378939

Mode "logical"

FALSE "384320"



TRUE "7616" → Decrease

NA's "378939"

# increase: no forest in 1987 (=0), forest in 2007 (=1)

diff07_87 = (forest87a ==0) & (forest07a == 1)

spplot(diff07_87)

summary(diff07_87)

Cells: 770875

NAs : 378943

Mode "logical"

FALSE "370912"

TRUE "21020" → increase

NA's "378943"

# any spatial patterns?

diff= diff87_07-diff07_87

spplot(diff)

Solution 42:

boxplot (H_lake ~ months)

boxplot (H_GW ~ months)

boxplot (H_lake ~ years)

boxplot (H_GW ~ years)

Solution Error: Reference source not found:


aggregate(H_lake, list(n = months), mean)

Solution 46:

gw = ts(H_GW, start=c(1989,1),freq=365)

plot(stl(gw,s.window="periodic"))



Solution 47:

ccf(H_ERLwet, H_lake, lag.max=365, plot=TRUE)

ccf(H_ERLwet, H_GW, lag.max=365, plot=TRUE)

Solution 49:
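One possible way (a sketch, assuming the lake series and the stl result from chapter 11.7):
lake <- ts(H_lake, start=c(1989,1), freq=365)
lake_stl <- stl(lake, s.window="periodic")
spec.pgram(lake, xlim=c(0,10))                                   # before: clear peak at frequency 1 (one cycle per year)
spec.pgram(lake_stl$time.series[,"remainder"], xlim=c(0,10))     # after: the annual peak is largely removed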

