Using R For Scientific Computing: Karline Soetaert
Karline Soetaert
Centre for Estuarine and Marine Ecology
Netherlands Institute of Ecology
The Netherlands
June 2009
Abstract
R (R Development Core Team 2008) is the open-source (read: free-of-charge) version
of the language S. It is best known as a package that performs statistical analysis and
graphics. However, R is so much more: it is a high-level language in which one can perform
complex calculations, implement new methods, and make high-quality figures.
R has high-level functions to operate on matrices, perform numerical integration, advanced statistics, and more. These are easy to invoke, which makes R ideally suited for data visualization, statistical analysis and mathematical modeling.
It is the aim of these lecture notes to make you acquainted with the R language. The
lecture notes are based on a book (Soetaert and Herman 2009) about ecological modelling
in which R is extensively used for developing, applying and visualizing simulation models.
There are many excellent sources for learning the R (or S) language. R comes with
several manuals that can be consulted from the main R program (Help/Manuals). R-intro.pdf is a good start. Many other good introductions to R are available, some freely
on the web, and accessible via the R web site (www.r-project.org). My favorite is the
R introduction by Petra Kuhnert and Bill Venables (Kuhnert and Venables 2005), but
beware: this introduction comprises more than 300 pages!
1. Introduction
1.1. The R-software
Installing R
R is downloadable from the following web site: http://www.r-project.org/ Choose the precompiled binary distribution. On this website, you will also find useful documentation. To
use R for the examples in this course, several packages need to be downloaded.
deSolve. Performs integration. (Soetaert, Petzoldt, and Setzer 2009c)
rootSolve. Finds the root of equations (Soetaert 2009).
If you run R within Windows, downloading specific packages can best be done within the R
program itself. Select menu item packages / install packages, choose a nearby site (e.g.
France (Paris)) and select the package you need. If you install package marelacTeaching then
all other packages will be automatically installed as well.
Here sqrt and log are built-in functions in R; pi is a built-in constant; the semi-colon
(;) is used to separate R-commands.
In the console window, the <UP> and <DOWN> arrow keys are used to navigate
through previously typed commands.
2. Alternatively, we can create R-scripts in an editor (e.g. Tinn-R) and save them in a file
(filename.R) for later re-use. R-scripts are sequences of R-commands and expressions.
These scripts should be submitted to R before they are executed. This can be done in
several ways:
by typing, in the R-console window:
> source ("filename.R")
by opening the file, copying the R-script to the clipboard (ctrl-C) and pasting it
(ctrl-V) into the R-console window
If you do not use the Tinn-R editor, the file is opened as an R-script from within the
R console. After selecting the script and pressing the send button, the statements
are executed and the cursor moves to the next line.
If you do use the Tinn-R editor, you can either submit the entire file (buttons 1,2),
selected parts of the text (buttons 3,4), marked blocks (buttons 5,6), or submit the
file line-by-line (last button).
Throughout these notes, the following convention is used:
> 3/2
denotes input to the console window (> is the prompt)
[1] 1.5
is R output, as written in the console window
getwd()
is an R statement in a script file (it lacks the prompt).
A screen capture of a typical Tinn-R session, with the Tinn-R editor (upper window) and the
R-console (lower window) is given below. A script file is opened in the Tinn-R editor. Note
the context-sensitive syntax (green=comments, blue= reserved words, rose = R-parameters).
Several lines of R-code have been selected (blue area) and sent to the R-console, which has
produced the graphics window that floats independently from the other windows.
?log
?sin
?sqrt
?round
?Special
will explain about logarithms and exponential functions, trigonometric functions, and other
functions.
> ?Arithmetic
lists the arithmetic operations in R.
> help.search("factor")
will list occurrences of the word <factor> in R-commands.
Sometimes the best help is provided by the very active mailing list. If you have a specific problem, just type R: <problem> on your search engine. Chances are that someone encountered
the problem and it was already solved.
Most of the help files also include examples. You can run all of them by using R-statement
example.
For instance, typing into the console window:
> example(matrix)
will run all the examples from the matrix help file.
> example(pairs)
will run all the examples from the pairs help file. (! try this ! pairs is a very powerful way
of visualizing pair-wise relationships).
Alternatively, you may select one example, copy it to the clipboard (ctrl-C for windows users)
and then paste it (ctrl-V) in the console window. In addition, the R main software and many
R-packages come with demonstration material. Typing
> demo()
will give a list of available demonstrations in the main software.
> demo(graphics)
will demonstrate some simple graphical capabilities.
Be careful if you want to split a complex statement over several lines ! These errors are
very difficult to trace, so it is best to avoid them.
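The pitfall can be illustrated with a minimal sketch (the variable names a and b are ours, chosen only for illustration). R treats a line as finished as soon as it forms a complete statement, so where the line break falls matters:

```r
# The first line ends with an operator, so R knows the statement continues.
a <- 1 + 2 +
     3

# Here the first line is already a complete statement; the second line is
# evaluated separately (unary plus applied to 3) and is NOT added to b.
b <- 1 + 2
   + 3

a   # 6
b   # 3
```

A safe habit is to break long statements after an operator, a comma or an opening bracket.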
log2 (4096)
23
!
Tip: you may need to look at the help files for some of these functions. Typing ?"+" will
open a help file with the common arithmetic operators.
2. Now write the R-statements in a script file, using the Tinn-R editor. Try the various
ways in which to submit the statements to R.
2. R-variables
R calculates as easily with vectors, matrices and arrays as with single numbers.
R also includes more complex structures such as data frames and lists, which allow you to combine
several types of data.
Learning how to create these variables, how to address them and modify them is essential if
you want to make good use of the R software.
> 1/0
> 0/0
> 1e-8 * 1000
(where the e-8 notation denotes 10^-8).
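As a small sketch of what these three statements produce (the variable names x, y and z are ours): dividing by zero does not raise an error in R but yields the special values Inf and NaN, which can be tested for explicitly.

```r
x <- 1/0           # Inf: infinity
y <- 0/0           # NaN: "not a number"
z <- 1e-8 * 1000   # 1e-05

is.infinite(x)     # TRUE
is.nan(y)          # TRUE
z
```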
Vectors
Vectors can be created in many ways:
Using R-function vector
The function c() combines numbers into a vector
The operator : creates a sequence of values, each 1 larger (or smaller) than the previous
one
A more general sequence can be generated by R-function seq
The same quantity is repeated using R-function rep
For instance, the commands:
>c(0, pi/2, pi, 3*pi/2, 2*pi)
[1] 0.000000 1.570796 3.141593 4.712389 6.283185
>seq(from=0,to=2*pi, by=pi/2 )
[1] 0.000000 1.570796 3.141593 4.712389 6.283185
>seq(0, 2*pi, pi/2 )
[1] 0.000000 1.570796 3.141593 4.712389 6.283185
will all create a vector, consisting of: 0, π/2, π, 3π/2, 2π.
Note that R-function seq takes as input (amongst others) parameters from, to and by (2nd
example). If the order is kept, they need not be specified by name (3rd example).
The next command calculates the sine of this vector and outputs the result:
>sin( seq(0, 2*pi, pi/2 ))
[1]  0.000000e+00  1.000000e+00  1.224647e-16 -1.000000e+00
[5] -2.449294e-16
>rep(1,times=5)
[1] 1 1 1 1 1
>rep(c(1,2),times=5)
[1] 1 2 1 2 1 2 1 2 1 2
>c(rep(1,5),rep(2,5))
[1] 1 1 1 1 1 2 2 2 2 2
The next statements:
> V <- 1:20
> sqrt(V)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278 3.316625 3.464102 3.605551 3.741657
[15] 3.872983 4.000000 4.123106 4.242641 4.358899 4.472136
create a sequence of integers between 1 and 20 and take the square root of all of them,
displaying the result to the screen. The operator <- assigns the sequence to V.
Some other examples of the : operator are:
>(V <- 0.5:10.5)
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
>6:1
[1] 6 5 4 3 2 1
Finally, the statements:
>V <- vector(length=10)
>FF<- vector()
generate a vector V comprising 10 elements, and a vector FF of unknown length.
Note: a peculiar feature of R is that the elements of a vector can also be given names:
>(fruit <- c(banana=1, apple=2, orange =3))
banana  apple orange
     1      2      3
>names(fruit)
[1] "banana" "apple"  "orange"
Matrices
Matrices can also be created in several ways:
By means of R-function matrix
By means of R-function diag which constructs a diagonal matrix
The functions cbind and rbind add columns and rows to an existing matrix, or to
another vector
The statement:
>A <-matrix(nrow=2,data=c(1,2,3,4))
creates a matrix A, with two rows, and, as there are four elements, two columns. Note that
the data are inputted as a vector (using the c() function).
The next two statements display the matrix followed by the square root of its elements:
>A
     [,1] [,2]
[1,]    1    3
[2,]    2    4
>sqrt(A)
         [,1]     [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000
By default, R fills a matrix column-wise (see the example above). However, this can easily be
overruled, using parameter byrow:
>(M <-matrix(nrow=4, ncol=3, byrow=TRUE, data=1:12))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
>diag(1,nrow=2)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
The names of columns and rows are set as follows:
>rownames(A) <- c("x","y")
>colnames(A) <- c("c","b")
>A
c b
x 1 3
y 2 4
Note that we also use the c() function here! Row names and column names are in fact vectors
containing strings.
Matrices can also be created by combining (binding) vectors, e.g. rowwise:
>V <- 0.5:5.5
>rbind(V,sqrt(V))
       [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
V 0.5000000 1.500000 2.500000 3.500000 4.500000 5.500000
  0.7071068 1.224745 1.581139 1.870829 2.121320 2.345208
t(A) will transpose matrix A (interchange rows and columns).
>t(A)
x y
c 1 2
b 3 4
Arrays
Arrays are multidimensional generalizations of matrices; matrices and arrays in R are actually
vectors with a dimension attribute. A multi-dimensional array is created as follows:
>AR <-array(dim=c(2,3,2),data=1)
In this case AR is a 2*3*2 array, and its elements are all 1.
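The remark that matrices and arrays are vectors with a dimension attribute can be checked directly. The following is a minimal sketch; the vector v is ours and serves only as an illustration.

```r
v <- 1:6
dim(v) <- c(2, 3)    # adding a dimension attribute turns v into a 2 x 3 matrix
is.matrix(v)         # TRUE
v                    # filled column-wise, like matrix()

AR <- array(dim = c(2, 3, 2), data = 1)
dim(AR)              # 2 3 2
length(AR)           # 12 elements in total
```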
2.2. Dimensions
The commands
> length(V)
> dim(A)
> ncol(M)
> nrow(M)
will return the length (total number of elements) of (vector or matrix) V, the dimension of
matrix or array A, and the number of columns and rows of matrix M respectively.
Simple indexing
The elements of vectors, matrices and arrays are indexed using the [] operator:
M[1 , 1]
M[1 , 1:2]
M[1:3 , c(2,4)]
Takes the element on the first row, first column of a matrix M (1st line), then selects the
entries in the first row and first two columns (2nd line) and then the elements on the first
three rows, and 2nd and 4th column of matrix M (3rd line).
If an index is omitted, then all the rows (1st index omitted) or columns (2nd index omitted)
are selected. In the following:
M[ ,2] <- 0
M[1:3, ] <- M[1:3, ] * 2
first all the elements on the 2nd column (1st line) of M are zeroed and then the elements on
the first three rows of M multiplied with 2 (2nd line). Similar selection methods apply to
vectors:
V[1:10]
V[seq(from=1,to=5,by=2)]
The statement on the 1st line takes the first 10 elements of vector V, whilst on the 2nd line,
the 1st , 3rd and 5th element of vector V are selected.
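The indexing statements above can be tried on a small matrix. This is only a sketch: the 3 x 4 matrix M and the vector V below are hypothetical, created so that every index used above exists.

```r
M <- matrix(nrow = 3, ncol = 4, data = 1:12)   # hypothetical 3 x 4 matrix
M[1, 1]                  # single element: 1
M[1, 1:2]                # first row, first two columns: 1 4
sel <- M[1:3, c(2, 4)]   # rows 1 to 3, columns 2 and 4 (a 3 x 2 matrix)

M[, 2] <- 0                   # zero the entire 2nd column
M[1:3, ] <- M[1:3, ] * 2      # double the first three rows
M

V <- 1:20
V[1:10]                            # first 10 elements
V[seq(from = 1, to = 5, by = 2)]   # 1st, 3rd and 5th element
```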
Logical expressions
Logical expressions are often used to select elements from vectors and matrices that obey
certain criteria.
R distinguishes logical variables TRUE and FALSE, represented by the integers 1 and 0.
> ?Comparison
> ?Logic
will list the relational and logic operators available in R. The following will return TRUE for
values of sequence V that are positive:
>(V <- seq(-2,2,0.5))
[1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
> V>0
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
while
>V [V > 0]
[1] 0.5 1.0 1.5 2.0
will select the positive values from V,
>V [V > 0] <- 0
will zero all positive elements in V,
> sum(V < 0)
[1] 4
will return the number of negative elements: it sums the TRUE (=1) values, and
>V [V != 0]
[1] -2.0 -1.5 -1.0 -0.5
will display all nonzero elements from V ( ! is the not operator). Logical tests can also be
combined, using | (the or operator), and & (and).
>V [V<(-1) | V>1]
[1] -2.0 -1.5
will display all values from V that are < -1 or > 1. Note that we have enclosed -1 in
parentheses (can you see why this is necessary?) Finally,
>which (V == 0)
[1] 5 6 7 8 9
>which.min (V)
[1] 1
will return the element indices of the 0-values, and of the minimum.
Lists
A list is a combination of several objects; each object can be of different length:
> list(Array = AR, Matrix = M)
will combine the previously defined array AR and matrix M.
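A short sketch of how the elements of such a list can be retrieved again; the name L for the list is ours, and AR and M are re-created as defined earlier so that the example is self-contained.

```r
AR <- array(dim = c(2, 3, 2), data = 1)
M  <- matrix(nrow = 4, ncol = 3, byrow = TRUE, data = 1:12)
L  <- list(Array = AR, Matrix = M)

names(L)            # "Array" "Matrix"
L$Matrix[2, 3]      # element of the matrix stored in the list: 6
dim(L[["Array"]])   # [[ ]] extracts one element by name or position: 2 3 2
```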
Data.frames
These are combinations of different data types (e.g. characters, integers, logicals, reals),
arranged in tabular format:
>genus <- c("Sabatieria","Molgolaimus")
>dens <- c(1,2)
>Nematode <-data.frame(genus=genus,density=dens)
>Nematode
        genus density
1  Sabatieria       1
2 Molgolaimus       2
In the example above, the data.frame Nematode contains two columns, one with strings (the
genus name), one with values (the densities). Data.frames are in fact special cases of lists,
consisting of vectors with equal length. Many matrix-operations work on data.frames with a
single data type, but there exist also special operations on data.frames.
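Because a data.frame is a list of equal-length vectors, its columns can be addressed by name as well as by matrix-like indices. A minimal sketch, re-creating the Nematode example:

```r
Nematode <- data.frame(genus = c("Sabatieria", "Molgolaimus"),
                       density = c(1, 2))

Nematode$density                   # column selected by name: 1 2
Nematode[2, "genus"]               # row 2 of column "genus"
Nematode[Nematode$density > 1, ]   # rows that satisfy a logical condition
is.list(Nematode)                  # TRUE: a data.frame is a special list
```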
> is.data.frame(M)
> is.vector(A)
Or you can display the data type by:
> class(M)
2.8. Exercises
Creating and manipulating matrices and vectors is essential if we want to use R as a mathematical tool. Although this has been implemented in a consistent way in R, it is not simple
for novice users! Practice is the best teacher, so you will get plenty of exercise.
Most of the exercises can be answered with one single R-statement. However, as these
statements may be quite complicated, it is often simpler to first break them up into smaller
parts, after which they are merged into one.
Vectors, sequences.
Use R-function mean to estimate the mean of two numbers, 9 and 17. (you may notice
that this is not as simple as you might think!).
Vector V
Create a vector, called V, with even numbers, between 16 and 56. Do not use
loops. (tip: use R-function seq )
Display this vector
What is the sum of all elements of V? Do not use loops; there exists an R-function
that does this; the name of this function is trivial.
Display the first 4 elements of V
Calculate the product of the first 4 elements of V
Display the 4th , 9th and 11th element of V . (tip: use the c() function).
Vector W
Create a new vector, W, which equals vector V, multiplied with 3; display its
content.
Matrices
Use R-function matrix to create a matrix with the following contents:
3 9
7 4
1 1/2 1/3
Create a new matrix, B, by extracting the first two rows and first two columns of
A. Display it to the screen.
Use diag to create the following matrix, called D:
1 0 0
0 2 0
0 0 3
1 0 0
0 2 0
0 0 3
4
5 5 5 5
You may also open the file in EXCEL, but do not forget to close it before proceeding. EXCEL
is very territorial, and will not allow another program, such as R , to access a file that is open
in EXCEL.
On the first line is the heading (the names of the stations), the first column contains the
species names. Before importing the file in R , check the working directory:
> getwd()
If the file called nemaspec.csv is not in this directory, you may need to change the working
directory:
> setwd("directory name")
(do not forget that R requires / where Windows uses \).
Make a script file in which you write the next steps; submit each line to R to check its
correctness. Read the comma-delimited file, using R-command read.csv. Type ?read.csv
if you need help.
Specify that the first row is the heading (header=TRUE) and the first column contains the
rownames (row.names=1).
Put the data in data.frame Nemaspec.
Nemaspec <- read.csv("nemaspec.csv", header=TRUE, row.names=1)
Check the contents of Nemaspec. As the dataset is quite substantial, it is best to output only
the first part of the data:
head(Nemaspec)
The rest is up to you:
Select the data from station M160b (the 2nd column of Nemaspec); put these data in a
vector called dens.
(remember: to select a complete column, you select all rows by leaving the first index
blank).
Remove from vector dens, the densities that are 0. Display this vector on the screen.
(Answer: [1] 6.580261 5.919719 etc. . .)
Calculate N, the total nematode density of this station. The total density is simply the
sum of all species densities (i.e. the sum of values in vector dens). What is the value of
N? (Answer: 699).
Divide the values in vector dens by the total nematode density N. Put the results in
vector p, which now contains the relative proportions for all species. The sum of all
values in p should now equal 1. Check that.
Calculate S, the number of species: this is simply the length of p.
(Answer: S=126)
You can calculate each of these values using only one R statement ! (A: 90.15358,
66.77841, 22.56157)
The 126 nematode species per 10 cm^2 were obtained by looking at all 699 individuals.
Of course, the fewer individuals are determined to species, the fewer species will be
encountered. Some researchers determine 100 individuals, other 200 individuals. To
standardize their results, the expected number of species in a sample can be recalculated
based on a common number of individuals. The expected number of species in a sample
with size n, drawn from a population with size N, which has S species, is given by:

$$ES(n) = \sum_{i=1}^{S} \left[ 1 - \frac{\binom{N - N_i}{n}}{\binom{N}{n}} \right]$$

where N_i is the number of individuals in the ith species in the full sample and
$\binom{N}{n}$ is the so-called binomial coefficient, the number of different sets with size n that can be
chosen from a set with total size N.
In R, binomial coefficients are estimated with statement choose(N,n).
What is the expected number of species per 100 individuals ? (n=100,N=699). (A:
ES(100) = 60.68971).
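To see how such a formula translates into R, here is a minimal sketch on a tiny made-up community. The function name ES and the vector counts are ours; the counts are hypothetical and are not the nematode data. Because choose is vectorised, no loop is needed.

```r
# Expected number of species in a subsample of n individuals,
# given a vector with the number of individuals per species.
ES <- function(counts, n) {
  N <- sum(counts)                                # total number of individuals
  sum(1 - choose(N - counts, n) / choose(N, n))   # sum over all species
}

counts <- c(2, 1, 1)   # hypothetical: 3 species, 4 individuals
ES(counts, n = 2)      # 11/6, i.e. about 1.83 species
ES(counts, n = 4)      # the full sample returns all 3 species
```

The small example can be verified by hand: of the six possible pairs of individuals, one contains a single species and five contain two species, giving a mean of 11/6.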
Print all diversity indices to the screen, which should look like:
        N        N0        N1        N2        Ni       ESS
699.00000 126.00000  90.15358  66.77841  22.56157  60.68971
3. R functions
One of the strengths of R is that one can make user-defined functions that add to R's built-in
functions.
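The definition of the function Circlesurface that is used in the following statements is not reproduced in this section. A minimal sketch that is consistent with the output shown (this is our assumption, not necessarily the original definition) would be:

```r
# Assumed definition: surface of a circle with given radius.
# Works on a single number as well as on a vector of radii.
Circlesurface <- function(radius) pi * radius^2
```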
After submitting this function to R , we can use it to calculate the surfaces of circles with
given radius:
>Circlesurface(10)
[1] 314.1593
>Circlesurface(1:20)
 [1]    3.141593   12.566371   28.274334   50.265482   78.539816
 [6]  113.097336  153.938040  201.061930  254.469005  314.159265
[11]  380.132711  452.389342  530.929158  615.752160  706.858347
[16]  804.247719  907.920277 1017.876020 1134.114948 1256.637061
The latter statement will calculate the surface of circles with radii 1, 2, ..., 20.
More complicated functions may return more than one element:
Sphere <- function(radius)
{
volume <- 4/3*pi*radius^3
surface <- 4 *pi*radius^2
return(list(volume=volume,surface=surface))
}
Here we recognize
the function heading (1st line), specifying the name of the function (Sphere) and the
input parameter (radius)
the function specification. As the function comprises multiple statements, the function
specification is embraced by curly braces {. . .}.
The return values (last line). Sphere will return the volume and surface of a sphere, as
a list.
The earth has approximate radius 6371 km, so its volume (km^3) and surface (km^2) are:
>Sphere(6371)
$volume
[1] 1.083207e+12
$surface
[1] 510064472
The next statement will only display the volume of spheres with radius 1, 2, . . . 5
>Sphere(1:5)$volume
[1]   4.18879  33.51032 113.09734 268.08257 523.59878
3.2. Programming
R has all the features of a high-level programming language:
Dummy <- function (x)
{
  if ( x<0 ) string <- "x<0" else
  if ( x<2 ) string <- "0<=x<2" else
  string <- "x>=2"
  print(string)
}
>Dummy(-1)
[1] "x<0"
>Dummy(1)
[1] "0<=x<2"
>Dummy(2)
[1] "x>=2"
Note that we have specified the else clause on the same line as the if part so that R knows
that the statement is continued on the next line!
If and else constructs involving only one statement can be combined:
>x<-2
>ifelse (x>0, "positive", "negative,0")
[1] "positive"
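Unlike if, the function ifelse is vectorised: it evaluates the test for every element of a vector and returns a vector of the same length. A small sketch, in which the vector V and the name res are ours:

```r
V <- seq(-2, 2, 0.5)
res <- ifelse(V > 0, "positive", "negative,0")
res          # one label per element of V
table(res)   # counts how often each label occurs
```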
Loops
Loops allow a set of statements to be executed multiple times:
The for loop iterates over a specified set of values. In the example below, the variable i
takes on the values (1,2,3):
>for (i in 1:3) print(c(i,2*i,3*i))
[1] 1 2 3
[1] 2 4 6
[1] 3 6 9
while and repeat will execute until a specified condition is met.
>i<-1 ; while(i<3) {print(i); i<-i+1}
[1] 1
[1] 2
break exits the loop
next stops the current iteration and advances to the next iteration.
>i<-1
>repeat
+ {
+   print(i)
+   i <-i+1
+   if(i>2) break
+ }
[1] 1
[1] 2
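The example above uses break; the effect of next can be sketched in a similar way. The vector odd is ours and only serves to collect the values that are not skipped.

```r
odd <- vector()
for (i in 1:5) {
  if (i %% 2 == 0) next   # even value: skip the rest of this iteration
  odd <- c(odd, i)        # only reached for odd values of i
}
odd   # 1 3 5
```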
The curly braces {. . .} embrace multiple statements that are executed in each iteration.
Note: loops are implemented very inefficiently in R and should be avoided as often as possible.
Fortunately, R offers many high-level commands that operate on vectors and matrices. These
should be used as much as possible!
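As an illustration of this advice, the sketch below computes the same quantity (a sum of squares, chosen by us as an arbitrary example) first with a loop and then with a single vectorised statement:

```r
x <- 1:1000

# loop version: one element at a time
total <- 0
for (xi in x) total <- total + xi^2

# vectorised version: one statement, no explicit loop
total_vec <- sum(x^2)

total == total_vec   # TRUE
```

The vectorised form is shorter and is usually also much faster for long vectors.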
For more information about if constructs and loops, type
> ?Control
3.3. R-packages
A package in R is a file containing many functions that perform certain related tasks. Packages
can be downloaded from the R website.
Once installed, we generate a list of all available packages, we load a package and we obtain
a list with its contents by the following commands:
>library()
>library(deSolve)
>library(help=deSolve)
>help(package=deSolve)
3.4. Exercises
R-function Sphere
Extend the Sphere function with the circumference of the sphere at the place of maximal
radius. The formula for estimating the circumference of a circle with radius r is: 2πr.
What is the circumference of the earth near the equator?
Loops
The Fibonacci numbers are calculated by the following relation: F_n = F_{n-1} + F_{n-2},
with F_1 = F_2 = 1
Tasks:
Compute the first 50 Fibonacci numbers; store the results in a vector (use R-command
vector to create it). You have to use a loop here.
For large n, the ratio F_n/F_{n-1} approaches the golden mean: (1 + √5)/2
What is the value of F_50/F_49; is it equal to the golden mean?
When is n large enough? (i.e. sufficiently close (<1e-6) to the golden mean)
Rarefaction diversity
If you still have time and the courage: try an alternative way of estimating the number of
species per 100 individuals by taking random subsamples of 100 individuals and estimating
the number of species from this subsample.
If the procedure is repeated often enough, the mean value should converge to the expected
number of species, ESS(100); this is the rarefaction method of Sanders (1968).
You may need the following R -functions:
round (converting reals to integers),
cumsum (take a cumulative sum),
sample (take random selection of elements),
table (to make a table of counts),
as well as length, mean.
Hurlbert (1971) showed that rarefaction generally overestimates the true estimated number
of species; can you corroborate this finding?
This question requires significant thought and imagination; there are several ways to do this.
4. Statistics
R originated as a statistical package, and it is still predominantly used for this purpose.
You can do virtually any statistical analysis in R .
As there exist many documents that may help you with statistical analyses in R, we will not
deal with the subject here.
Statistics is used here just to show you how to use R efficiently, in cases where you have no
clue where to begin!
5. Graphics
R has extensive graphical capabilities, and allows making simple (1-D, x-y), image-like (2-D)
and perspective (3-D) figures.
Try:
> demo(graphics)
> demo(image)
> demo(persp)
to obtain a display of R's simple (1-D, x-y), image-like (2-D) and perspective (3-D) capabilities.
Graphics are plotted in the figure window which floats independently from the other windows.
If not already present, it is launched by writing (in Windows):
> windows()
or
> x11()
A figure consists of a plot region surrounded by 4 margins, which are numbered clockwise,
from 1 to 4, starting from the bottom. R distinguishes between:
1. high-level commands. By default, these create a new figure, e.g.
hist, barplot, pie, boxplot, ... (1-D plot)
plot, curve, matplot, pairs,... ((x-y)plots)
image, contour, filled.contour,... (2-D surface plots)
persp, scatterplot3d,... (3-D plots).
2. low-level commands that add new objects to an existing figure, e.g.
lines, points, segments, polygon, rect, text, arrows, legend, abline, locator,
rug, ... These add objects within the plot region
box, axis, mtext (text in margin), title, ... which add objects in the plot margin
3. graphical parameters that control the appearance of plotting objects:
cex (size of text and symbols), col (colors), font, las (axis label orientation), lty
[Figure: sin(a) plotted against cos(a)]
?plot.default
?par
?plot.window
?points
plot(cos(a),sin(a),type="l",lwd=2,xlab="",ylab="",axes=FALSE,
asp=1)
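The plot statement above uses a vector a whose definition is not shown in this section. A plausible sketch, assuming that a holds angles running once around the circle (the choice of 100 points is ours), is:

```r
# Assumed definition: 100 angles between 0 and 2*pi, so that
# (cos(a), sin(a)) traces the outline of a circle with radius 1.
a <- seq(0, 2 * pi, length.out = 100)

plot(cos(a), sin(a), type = "l", lwd = 2, xlab = "", ylab = "", axes = FALSE,
     asp = 1)
```

The setting asp=1 keeps the same scale on both axes, so the circle is not drawn as an ellipse.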
To this figure, we can now add several low-level objects:
a series of lines, representing smaller and smaller circles (lines).
for (i in seq( 0.1,0.9,by=0.1)) lines(i*sin(a), i*cos(a))
an innermost red polygon (polygon).
polygon(sin(a)*0.1,cos(a)*0.1,col="red")
point marks as text labels, ranging from 10 to 1 (text). The closer to the centre,
the higher the score.
for (i in 1:10) text(x=0,y=i/10-0.025,labels=11-i,font=2)
Now two archers take 10 shots at the target face.
We mimic their arrows by generating normally distributed (x,y) numbers, with mean=0
(the centre!) and where the experience of the archer is mimicked by the standard
deviation. The more experienced, the closer the arrows will be to the centre, i.e. the
lower the standard deviation.
R-statement rnorm generates normally distributed numbers; we need 20 of them, arranged as a matrix with 2 columns.
shots1 <- matrix(ncol=2, data=rnorm(n=20,sd=0.2))
shots2 <- matrix(ncol=2, data=rnorm(n=20,sd=0.5))
The shots are added to the plot as points, colored darkblue (experienced archer) and
darkgreen (beginner's level). Note that we choose a 50% enlarged point size (cex), and
we choose a circular shaped point (pch=16)
points(shots1,col="darkblue",pch=16,cex=1.5)
points(shots2,col="darkgreen",pch=16,cex=1.5)
Finally, we add a legend, explaining who has done the shooting:
legend("topright",legend=c("A","B"),pch=16,
col=c("darkblue","darkgreen"),pt.cex=1.5)
Figure 2: Figure with several low-level objects - see text for R-code
Note that the legend text and the colors are inputted as a vector of strings, using the
c() function (e.g. c("A", "B")).
>tail(Orange)
   Tree  age circumference
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177
Figure 3: Simple plot of the orange dataset - see text for R-code
and make a rough plot of circumference versus age:
>plot(Orange$age, Orange$circumference,xlab="age, days",
+     ylab="circumference, mm", main= "Orange tree growth")
(as Orange is a dataframe, columns can be addressed by their names, Orange$age and
Orange$circumference).
The output (figure) shows that there is a lot of scatter, which is due to the fact that the five
trees did not grow at the same rate.
It is instructive to plot the relationship between circumference and age differently for each
tree. In R, this is simple: we can make some graphical parameters (symbol types, colors,
size,...) conditional to certain factors.
Factors play a very important part in the statistical applications of R; for our application, it
suffices to know that the factors are integers, starting from 1.
In the R-statement below, we simply use different symbols (pch) and colors (col) for each
tree: pch=(15:20)[Orange$Tree] means that, depending on the value of Orange$Tree (i.e.
the integer code of the tree factor), the symbol (pch) will take on the value 15 (code 1), 16 (code 2),... 19
(code 5). col=(1:5) [Orange$Tree] does the same for the point color. The final statement
adds a legend, positioned at the bottom, right.
>plot(Orange$age, Orange$circumference,xlab="age, days",
+     ylab="circumference, mm", main= "Orange tree growth",
+     pch=(15:20)[Orange$Tree],col=(1:5) [Orange$Tree],cex=1.3)
>legend("bottomright",pch=15:20,col=1:5,legend=1:5)
The output shows that tree number 5 grows fastest, tree number 1 is slowest growing. (note:
it is also instructive to run the examples in the Orange help file. )
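One point is worth checking when indexing with a factor: R uses the internal integer codes of the factor, which follow the order of its levels and not necessarily the numbers printed as labels. As far as we can tell, Tree in the Orange dataset is an ordered factor whose levels are not in numerical order, so it is safer to build the legend from the levels themselves. The sketch below (the names lev and sym are ours) shows one way to do this:

```r
lev <- levels(Orange$Tree)     # labels of the trees, in the order of the factor levels
lev
as.integer(Orange$Tree)[1:7]   # integer codes used when indexing with the factor

sym <- (15:19)[Orange$Tree]    # one symbol per observation, chosen by factor code
plot(Orange$age, Orange$circumference, xlab = "age, days",
     ylab = "circumference, mm", main = "Orange tree growth",
     pch = sym, col = (1:5)[Orange$Tree], cex = 1.3)
legend("bottomright", pch = 15:19, col = 1:5, legend = lev, title = "Tree")
```

With legend=lev, each symbol in the legend is labelled with the tree it actually represents.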
[Figure: Orange tree growth; circumference (mm) versus age (days), with a different symbol and color for each tree and a legend for trees 1 to 5]
[Figure: curve of sin(3 * pi * x) versus x]
x values ranging between 0 and 2 (from, to), adding a main title (main) and x- and y-axis
labels (xlab, ylab) (1st sentence).
The 2nd R-sentence adds the function y = cos(3πx), as a red (col) dashed line (lty). Note
the use of parameter add=TRUE, as by default curve creates a new plot.
The final statements add the x-axis, i.e. a horizontal, dashed (lty=2), line (abline) at y=0
and a legend.
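The statements that this description refers to are not reproduced in this section. The following is a sketch of what they may have looked like, reconstructed from the description only; the title, the axis labels and the legend position are our assumptions.

```r
# 1st statement: new plot with the sine curve, for x between 0 and 2
curve(sin(3 * pi * x), from = 0, to = 2,
      main = "curve", xlab = "x", ylab = "f(x)")

# 2nd statement: add the cosine curve to the existing plot
curve(cos(3 * pi * x), add = TRUE, col = "red", lty = 2)

# final statements: the x-axis as a dashed horizontal line, and a legend
abline(h = 0, lty = 2)
legend("bottomleft", legend = c("sin", "cos"),
       col = c("black", "red"), lty = 1:2)
```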
5.7. Exercises
Simple curves
Create a script file which draws a curve of the function y = x^3 sin^2(3x) in the interval
[-2, 2].
Make a curve of the function y = 1/cos(1 + x^2) in the interval [-5,5].
$$N_t = \frac{K}{1 + \left[ \frac{K - N_{t_0}}{N_{t_0}} \right] e^{-a (t - t_0)}}$$
For the US, the population density in 1900 (N0) was 76.1 million; the population growth can
be described with parameter values: a=0.02 yr^-1, K = 500 million of people.
Actual population values are:
1900 1910 1920 1930 1940 1950 1960 1970 1980
76.1 92.4 106.5 123.1 132.6 152.3 180.7 204.9 226.5
Tasks:
1. Plot the population density curve as a thick line, using the US parameter values.
2. Add the measured population values as points. Finish the graph with titles, labels etc...
Toxic ammonia
Ammonia nitrogen is present in two forms: the ammonium ion (NH4+) and unionized ammonia (NH3). As ammonia can be toxic at sufficiently high levels, it is often desirable to know
its concentration.
The relative importance of ammonia (the contribution of ammonia to total ammonia nitrogen,
NH3/(NH3 + NH4+)) is a function of the proton concentration [H+] and a parameter KN,
the so-called stoichiometric equilibrium constant:
$$p[NH_3] = \frac{K_N}{K_N + [H^+]}$$
Tasks:
Plot the relative fraction of toxic ammonia to the total ammonia concentration as a
function of pH, where pH = -log10([H+]) and for a temperature of 30 °C. Use a range
of pH from 4 to 9.
The value of KN is 8×10^-10 at a temperature of 30 °C.
Add to this plot the relative fraction of ammonia at 0 °C; the value of KN at that
temperature is 8×10^-11 mol kg^-1.
Produce a scatter plot of petal length against petal width; produce an informative title
and labels of the two axes.
Repeat the same graph, using different symbol colours for the three species.
Add a legend to the graph. Copy-paste the result to a WORD document. If you do not
have WORD, make a PDF file of the graph.
Create a box-and-whisker plot for sepal length where the data values are split into
species groups; use as template the first example in the boxplot help file.
Now produce a similar box-and-whisker plot for all four morphological measurements,
arranged in two rows and two columns. First specify the graphical parameter that
arranges the plots two by two.
6. Matrix algebra
Matrix algebra is very simple in R. Practically everything is possible! Here are the most
important R-functions that operate on matrices:
%*% Matrix multiplication
t(A) transpose of A
diag(A) diagonal of A
solve(A) inverse of A
solve(A,B) solving Ax=B for x
eigen(A) eigenvalues and eigenvectors for A
det(A) determinant of A
For instance the following first inverts matrix A (solve(A)), and then multiplies the inverse
with A, giving the identity matrix:
>(A <-matrix(nrow=2,data=c(1,2,3,4)))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
>solve(A) %*% A
     [,1] [,2]
[1,]    1    0
[2,]    0    1
     [,1] [,2]
[1,]    1    2
[2,]    3    4
The next set of statements will solve the linear system Ax=B for the unknown vector x:
>B <- c(5,6)
>solve(A,B)
[1] -1  2
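As a quick check of the statements above, the following sketch substitutes the solution back into the system and also tries t, diag and det on the same matrix. The object names x and residual are introduced here for illustration only.

```r
# Same matrix and right-hand side as in the text
A <- matrix(nrow = 2, data = c(1, 2, 3, 4))
B <- c(5, 6)

# x is a name chosen for this sketch: the solution of A x = B
x <- solve(A, B)

# Substituting x back should reproduce B (up to rounding error)
residual <- A %*% x - B
print(max(abs(residual)))

# Other functions from the list above, applied to A
print(t(A))      # transpose: rows and columns swapped
print(diag(A))   # diagonal elements: 1 and 4
print(det(A))    # determinant: 1*4 - 3*2 = -2
```

Checking the residual in this way is a cheap safeguard when the matrix is large or nearly singular.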
Finally, the eigenvalues and eigenvectors of A are computed with R-function eigen. This
function returns a list that contains both the eigenvalues ($values) and the eigenvectors
($vectors, stored as the columns of a matrix).
>eigen(A)
$values
[1] 5.3722813 -0.3722813
$vectors
           [,1]       [,2]
[1,] -0.5657675 -0.9093767
[2,] -0.8245648  0.4159736
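The defining property of an eigenvalue-eigenvector pair, A v = lambda v, can be verified directly. The sketch below does this for the first (dominant) eigenvalue; the names ev, lambda1, v1, lhs and rhs are chosen here for illustration.

```r
A  <- matrix(nrow = 2, data = c(1, 2, 3, 4))
ev <- eigen(A)

# First (dominant) eigenvalue and its eigenvector (first column of $vectors)
lambda1 <- ev$values[1]
v1      <- ev$vectors[, 1]

# Both sides of A v = lambda v should agree up to rounding error
lhs <- as.vector(A %*% v1)
rhs <- lambda1 * v1
print(cbind(lhs, rhs))

# eigen scales each eigenvector to unit length
print(sum(v1^2))
```

Because an eigenvector is only defined up to a constant factor, the sign of the returned columns carries no meaning.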
6.1. Exercises
Matrix algebra exercise 1
Use R-function matrix to create the matrices called A and B:
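$$A = \begin{pmatrix} 1 & 2 & 3 \\ 6 & 4 & 1 \\ 2 & 1 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$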
$$\begin{pmatrix} 0 & 0.0043 & 0.1132 & 0 \\ 0.9775 & 0.9111 & 0 & 0 \\ 0 & 0.0736 & 0.9534 & 0 \\ 0 & 0 & 0.0452 & 0.9804 \end{pmatrix}$$
What is the value of the largest eigenvalue (the so-called dominant eigenvalue) and the
corresponding eigenvector?
Note: this is a stage-model of a killer whale (Caswell 2001). The dominant eigenvalue and its eigenvector estimate the rate
of increase and the stable age distribution; the matrix N contains the mean time spent in each stage.
7. Roots of functions
7.1. Roots of a simple function
Suppose we want to solve the following problem: $\cos(x) = 2x$ for x.
Mathematically, we seek the root of the function $y = \cos(x) - 2x$; this is the value of x for
which y = 0.
As the equation mixes a trigonometric and a linear term, it is not possible to find an exact solution (an explicit
expression) for this root.
It is always a good idea to plot the equation (1st line), and add the x-axis (2nd line).
curve(cos(x)-2*x,-10,10)
abline(h=0,lty=2)
This figure shows that there indeed exists a value x, for which y = 0.
Now R-function uniroot can be used to locate this value.
Functions that seek a root from a nonlinear equation generally work iteratively, i.e. they
move closer and closer to the root in successive steps (iterations).
It is usually not feasible to find this root exactly, so it is approximated up to a certain
accuracy (tol, a very small number; see footnote 7).
For the method to work, the function values at the two ends of the interval must have opposite signs, so that the interval contains at least one root.
The statement below solves for the root; it returns several values, as a list.
>(rr<-uniroot(f = function(x) cos(x)-2*x, interval=c(-10,10)))
$root
[1] 0.4501686
$f.root
[1] 3.655945e-05
$iter
[1] 5
$estim.prec
[1] 6.103516e-05
The most important value is the root itself ($root), which is 0.4501686;
the function value at the root was 3.66e-5, and the function performed 5 iterations.
In this example, the function was simple enough to include it in the call to uniroot.
The next chapter gives a more complex example from aquatic chemistry, where the equation
to solve is significantly more complex.
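The effect of the tolerance can be seen by solving the same equation twice. The sketch below first checks that the interval brackets the root and then compares the default tolerance with a stricter one; the names f, r_default and r_strict and the value 1e-12 are chosen here for illustration.

```r
f <- function(x) cos(x) - 2 * x

# The interval must bracket the root: f changes sign between its end points
print(f(-10) * f(10) < 0)

# Default tolerance versus a much stricter one
r_default <- uniroot(f, interval = c(-10, 10))
r_strict  <- uniroot(f, interval = c(-10, 10), tol = 1e-12)

# The stricter tolerance gives a function value much closer to zero
print(c(default = r_default$f.root, strict = r_strict$f.root))
print(c(default = r_default$root,   strict = r_strict$root))
```

A stricter tolerance usually costs only a few extra iterations, so it is worth setting when the root is used in further calculations.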
Finally, we add the root to the figure:
points(rr$root,0,pch=16,cex=2)
Figure 6: Function drawn with curve and the root of the function plotted - see text for R-code
7 More specifically: the root of $y = \cos(x) - 2x$ is the value x for which |cos(x)-2*x| < tol or for which
successive changes of x are < tol.
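$$[HCO_3^-] = \frac{K_{C1} \cdot [H^+]}{[H^+] \cdot [H^+] + K_{C1} \cdot [H^+] + K_{C1} \cdot K_{C2}} \cdot DIC \qquad (1)$$
$$[CO_3^{2-}] = \frac{K_{C1} \cdot K_{C2}}{[H^+] \cdot [H^+] + K_{C1} \cdot [H^+] + K_{C1} \cdot K_{C2}} \cdot DIC \qquad (2)$$
$$TA = 2 [CO_3^{2-}] + [HCO_3^-] - [H^+] \qquad (3)$$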
Here is how to solve for the proton concentration $[H^+]$ (or the pH value) in R.
The trick is to estimate alkalinity based on a guess of proton concentration, using equation (3),
and compare that with the measured alkalinity value. If both are equal within the tolerance
level, the proton concentration has been found.
In the implementation below, the dissociation constants for carbonate (kc1, kc2) at
salinity 0, temperature 20, and pressure 0 are calculated with R's package seacarb, which has
to be loaded first (require).
We then define a function whose root has to be solved (pHfunction). In this function we
estimate total alkalinity, based on the guess of pH, the dissociation constants (kc1,kc2) and
the DIC concentration. The difference of this calculated alkalinity (EstimatedAlk) with the
true alkalinity is then returned; if pH is correctly estimated, then true and estimated alkalinity
will be equal, and the difference will be zero. So, to find the pH, we need to find the root of
this function.
Note that the conversion from pH to $[H^+]$ gives the proton concentration in mol kg$^{-1}$. As
the concentrations of the other substances are in µmol kg$^{-1}$, we convert using a factor $10^6$.
We restrict the region of the pH root in between 0 and 12 (which is more than large enough),
and we set the tolerance (tol) to a very small number to increase precision.
8 In practice, it is possible to merge these 3 equations such that only one equation is obtained, but this is
neither didactically clearer nor computationally more efficient.
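require(seacarb)
kc1 <- K1(S=0,T=20,P=0)   # Carbonate k1
kc2 <- K2(S=0,T=20,P=0)   # Carbonate k2
# The argument names DIC and Alkalinity are assumed here (both in umol/kg)
pHfunction <- function(pH, kc1, kc2, DIC, Alkalinity)
{
  H    <- 10^(-pH)                              # proton concentration, mol/kg
  HCO3 <- kc1*H   /(H*H + kc1*H + kc1*kc2)*DIC  # bicarbonate, umol/kg
  CO3  <- kc1*kc2 /(H*H + kc1*H + kc1*kc2)*DIC  # carbonate, umol/kg
  EstimatedAlk <- 2*CO3 + HCO3 - H*1e6          # H converted to umol/kg
  return(EstimatedAlk - Alkalinity)             # zero when the pH guess is correct
}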
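The procedure described above can be sketched without package seacarb by supplying the constants by hand. In the sketch below the values of kc1 and kc2 are assumed, rounded numbers, and the DIC and alkalinity values are hypothetical; none of them are taken from the text. The tolerance 1e-12 is likewise a choice made for this illustration.

```r
# Assumed, rounded dissociation constants (mol/kg) for fresh water near 20 dg C;
# in the text these come from seacarb's K1 and K2
kc1 <- 4e-7
kc2 <- 4e-11

# Illustrative (hypothetical) concentrations, in umol/kg
DIC        <- 2000
Alkalinity <- 2000

# Difference between estimated and measured alkalinity for a guessed pH
pHfunction <- function(pH, kc1, kc2, DIC, Alkalinity) {
  H    <- 10^(-pH)                       # mol/kg
  den  <- H * H + kc1 * H + kc1 * kc2
  HCO3 <- kc1 * H   / den * DIC          # umol/kg
  CO3  <- kc1 * kc2 / den * DIC          # umol/kg
  EstimatedAlk <- HCO3 + 2 * CO3 - H * 1e6
  EstimatedAlk - Alkalinity
}

# Search for the root between pH 0 and 12 with a strict tolerance
sol <- uniroot(pHfunction, lower = 0, upper = 12, tol = 1e-12,
               kc1 = kc1, kc2 = kc2, DIC = DIC, Alkalinity = Alkalinity)
pH <- sol$root
print(pH)

# The residual at the root should be close to zero
print(pHfunction(pH, kc1, kc2, DIC, Alkalinity))
```

The extra arguments (kc1, kc2, DIC, Alkalinity) are passed by uniroot to pHfunction unchanged, so the same function can be reused for other water compositions.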
7.3. Exercises
Simple functions
[CO2 ]
[H + ]
[HCO3 ]
[H + ]
pCO2 relates to $[CO_2]$ through Henry's constant, Kh, which can also be estimated as a function
of salinity, temperature and pressure, using R-package seacarb:
$$pCO_2 = \frac{[CO_2]}{K_h}$$
Estimate the pH at equilibrium with alkalinity 2300 µmol kg$^{-1}$ and the current pCO2
of 360 ppm.
Use package seacarb to estimate the dissociation constants and Henry's constant at
temperature 20 °C, salinity 0, and pressure 0. (A: pH=8.19)
The Intergovernmental Panel on Climate Change predicts for 2100 an atmospheric CO2
concentration ranging between 490 and 1250 ppmv, depending on the socio-economic
scenario (IPCC, 2007). These increases of pCO2 make the water more acidic. Make a
plot of pH as a function of these increased atmospheric pCO2 levels. (Assume that the
pCO2 of the ocean is at equilibrium with the atmospheric pCO2 ). What is the maximal
drop of pH ? (A: at pCO2 of 1250 ppmv, pH=7.68).
8. Interpolation, smoothing
Interpolating and smoothing in R can be done in several ways:
approx linearly interpolates through points
spline uses spline interpolation, which is smoother
smooth.spline smooths data sets; this means that the fitted curve does not necessarily pass through the original
points.
The use of these functions is exemplified in the following script and corresponding output:
>x <- 1:10
>y <- c(9,8,6,7,5,8,9,6,3,5)
>plot(x,y,pch=16,cex=2,main="interpolation,smoothing")
>lines (spline(x,y, n=100),lty=1)
>points(approx(x,y, xout=seq(1,10,0.1)),pch=1)
>lines (smooth.spline(x,y),lty=2)
>legend("bottomleft",lty=c(1,NA,2),pch=c(NA,1,NA),
+       legend=c("spline","approx","smooth.spline"))
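Besides plotting, the three functions can be asked for values at specific positions. The sketch below evaluates each of them at x = 2.5 for the data used above; the names lin, spl and sm are chosen here for illustration.

```r
x <- 1:10
y <- c(9, 8, 6, 7, 5, 8, 9, 6, 3, 5)

# Linear interpolation halfway between x = 2 (y = 8) and x = 3 (y = 6)
lin <- approx(x, y, xout = 2.5)
print(lin$y)   # midpoint of 8 and 6, i.e. 7

# Spline interpolation at the same position; it still passes through the data
spl <- spline(x, y, xout = c(2, 2.5, 3))
print(spl$y)

# smooth.spline returns a fitted object; predict() evaluates it at new x values
sm <- smooth.spline(x, y)
print(predict(sm, x = 2.5)$y)
```

Note that approx and spline return a list with elements x and y, whereas smooth.spline returns a model object that needs predict.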
As an example, we now fit the US population density values, at 10-year intervals, with the
logistic growth model (see previous chapter). The model was:
$$N(t) = \frac{K}{1 + \left[\frac{K - N_{t0}}{N_{t0}}\right] e^{-a (t - t_0)}}$$
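A fit of this model can be sketched with R-function nls (nonlinear least squares). The sketch below does not use the US population data: it generates synthetic values from the model itself, with assumed parameter values (K = 100, a = 0.1, N0 = 5) and a small deterministic perturbation, and then recovers the parameters. The function name logistic and all object names are chosen here for illustration.

```r
# Logistic growth model from the text; N0 stands for N_t0
logistic <- function(t, K, a, N0, t0 = 0) {
  K / (1 + (K - N0) / N0 * exp(-a * (t - t0)))
}

# Synthetic "observations" (NOT the US population data): model values
# with a small deterministic perturbation of at most 2 percent
tt  <- seq(0, 100, by = 10)
obs <- logistic(tt, K = 100, a = 0.1, N0 = 5) * (1 + 0.02 * sin(tt))

# Nonlinear least squares; starting values must be reasonably close
fit <- nls(obs ~ logistic(tt, K, a, N0),
           start = list(K = 90, a = 0.08, N0 = 6))
est <- coef(fit)
print(est)

# Fitted curve on a fine time grid, e.g. for plotting with lines()
tfine <- seq(0, 100, by = 1)
Nfit  <- logistic(tfine, est["K"], est["a"], est["N0"])
```

With real data, the starting values deserve some care: nls may fail to converge when they are far from the best-fit parameters.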
8.2. Exercises
Smoothing
An anemometer measures wind velocity at three-hourly intervals. On a certain day, these
velocities are: 5,6,7,9,4,6,3,7,9 at time 0, 3, ..., 24 o'clock respectively. In order to estimate
air-sea exchange, we need hourly measures.
Tasks:
Interpolate the three-hourly measurements to hourly measurements.
Make a plot of the interpolated values
Fitting
Primary production is measured by $^{14}$C incubations of phytoplankton samples, at different
light intensities.
The data are:
>ll <- c(0.,1,10,20,40,80,120,160,300,480,700)
>pp <- c(0.,1,3,4,6,8,10,11,10,9,8)
Fit the resulting production estimates (pp), as a function of light intensity (ll) with the
3-parameter Eilers-Peeters equation. The primary production is calculated as:
$$pp = p_{max} \frac{2 (1 + \beta) I/I_{opt}}{(I/I_{opt})^2 + 2 \beta I/I_{opt} + 1}$$
9. Differential equations
Differential equations express the rate of change of a constituent (C) along one or more
dimensions, usually time and/or space.
Consider the following set of two differential equations:
$$\frac{dA}{dt} = r \cdot (x - A) - k \cdot A \cdot B$$
$$\frac{dB}{dt} = r \cdot (y - B) + k \cdot A \cdot B$$
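A derivative function for these two equations can be sketched as follows. It is written in the (time, state, parameters) form that ode from package deSolve expects, returning the derivatives as the first element of a list. The parameter values, the initial state and the time sequence are hypothetical, chosen only for illustration. So that the sketch runs without extra packages, the integration is done here with a simple fixed-step fourth-order Runge-Kutta loop, which is a stand-in for deSolve's solvers and not the method used in the text.

```r
# Derivative function in the form used by deSolve: (time, state, parameters)
model <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dA <- r * (x - A) - k * A * B
    dB <- r * (y - B) + k * A * B
    list(c(dA, dB))
  })
}

# Hypothetical values, chosen only for illustration
parms <- c(r = 0.05, x = 1, y = 0.1, k = 0.05)
state <- c(A = 1, B = 1)
times <- seq(0, 300, by = 1)

# Fixed-step fourth-order Runge-Kutta (a stand-in for deSolve's ode)
out <- matrix(NA, nrow = length(times), ncol = 3,
              dimnames = list(NULL, c("time", "A", "B")))
out[1, ] <- c(times[1], state)
for (i in 2:length(times)) {
  h  <- times[i] - times[i - 1]
  t0 <- times[i - 1]
  k1 <- model(t0,         state,              parms)[[1]]
  k2 <- model(t0 + h / 2, state + h / 2 * k1, parms)[[1]]
  k3 <- model(t0 + h / 2, state + h / 2 * k2, parms)[[1]]
  k4 <- model(t0 + h,     state + h * k3,     parms)[[1]]
  state <- state + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
  out[i, ] <- c(times[i], state)
}
out <- as.data.frame(out)
print(head(out))
```

Adding the two equations shows that the sum A+B relaxes towards x+y at rate r, which gives a simple check on the numerical solution.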
At each time t, ode will call function model, with the current values of the state variables
and the parameter values.
The output is stored in a data.frame, called out.
out <- as.data.frame(ode(state,times,model,parms))
All we need to do now is to plot the model output. Before we do so, we have a look at
data.frame out:
>head(out)
  time         A        B
1    0 1.0000000 1.000000
2    1 0.9523189 1.003787
3    2 0.9090687 1.005285
4    3 0.8699226 1.004715
5    4 0.8345728 1.002285
6    5 0.8027203 0.998201
The data are arranged in three columns: first the time, then the concentrations of A and B.
As out is a data frame we can extract the data using their names (out$time, out$A, out$B).
Before plotting the model output, the range of concentrations of substances A and B is
estimated; this is used to set the limits of the y-axis (ylim).
R-function plot creates a new plot; lines adds a line to this plot; lty selects a line type;
lwd=2 makes the lines twice as thick as the default. Finally a legend is added.
ylim <- range(c(out$A,out$B))
plot(out$time,out$A,xlab="time",ylab="concentration",
lwd=2,type="l",ylim=ylim,main="model")
lines(out$time,out$B,lwd=2,lty=2)
legend("topright",legend=c("A","B"),lwd=2,lty=c(1,2))
9.1. Exercises
Lotka-volterra model
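Write a script file that solves the following system of ODEs (see footnote 9):
$$\frac{dx}{dt} = a \cdot x \cdot \left(1 - \frac{x}{K}\right) - b \cdot x \cdot y$$
$$\frac{dy}{dt} = g \cdot b \cdot x \cdot y - e \cdot y$$
for initial values x=300, y=10 and parameter values: a=0.05, K=500, b=0.0002, g=0.8,
e=0.03.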
9 The Lotka-Volterra models are a famous type of model that describe either predator-prey interactions
or competitive interactions between two species. A.J. Lotka and V. Volterra formulated the original model in
the 1920s almost simultaneously (Lotka 1925; Volterra 1926).
Make three plots, one for x and one for y as a function of time, and one plot expressing
y as a function of x (this is called a phase-plane plot). Arrange these plots in 2 rows
and 2 columns.
Now run the model with other initial values (x=200, y=50); add the (x,y) trajectories
to the phase-plane plot
Butterfly
The Lorenz equations (Lorenz 1963) were the first chaotic system of differential equations to
be discovered. They are three differential equations that were derived to represent idealized
behavior of the earth's atmosphere.
$$\frac{dx}{dt} = -\frac{8}{3} x + y z$$
$$\frac{dy}{dt} = -10 (y - z)$$
$$\frac{dz}{dt} = -x y + 28 y - z$$
It takes about 10 lines of R-code to generate the solutions and plot them.
Function scatterplot3d from the package scatterplot3d generates 3-D scatterplots.
Can you recreate the following butterfly ? Use as initial conditions x=y=z=1; create
output for a time sequence ranging from 0 to 100, and with a time step of 0.005.
[Figure: "Lorenz butterfly", 3-D scatterplot of the solution; axes labelled out$x and out$z]
10. Finally
10.1. The questions
These lecture notes have been generated with LaTeX, making use of R-package Sweave
(Leisch 2002), which allows merging LaTeX with R-code.
If you do not like the layout, a PDF version (ScientificComputing.pdf) made with WORD
(Microsoft) can be found in the /inst/lecture subdirectory of package marelacTeaching.
References
Caswell H (2001). Matrix population models: construction, analysis, and interpretation. Sinauer, Sunderland, second edition.
Hurlbert SH (1971). The nonconcept of species diversity: critique and alternative parameters. Ecology, 52, 577-586.
Kuhnert P, Venables W (2005). An introduction to R: software for statistical modelling &
computing. URL www.r-project.org.
Leisch F (2002). Sweave: Dynamic Generation of Statistical Reports Using Literate Data
Analysis. In W Härdle, B Rönz (eds.), Compstat 2002 - Proceedings in Computational
Statistics, pp. 575-580. Physica Verlag, Heidelberg. ISBN 3-7908-1517-9, URL http://www.stat.uni-muenchen.de/~leisch/Sweave.
Ligges U, Mächler M (2003). Scatterplot3d - an R Package for Visualizing Multivariate
Data. Journal of Statistical Software, 8(11), 1-20.
Lorenz E (1963). Deterministic non-periodic flows. J. Atmos. Sci, 20, 130-141.
Lotka AJ (1925). Elements of Physical Biology. Williams & Wilkins Co., Baltimore.
Millero F, Poisson A (1981). International one-atmosphere equation of state for seawater.
Deep-Sea Research, 28(6), 625-629.
Proye A, Gattuso JP, Epitalon JM, Gentili B, Orr J, Soetaert K (2007). seacarb: Calculates
parameters of the seawater carbonate system. R package version 1.2.3, URL http://www.obs-vlfr.fr/~gattuso/seacarb.php.
R Development Core Team (2008). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Sanders H (1968). Marine benthic diversity: a comparative study. American Naturalist,
102, 243-282.
Soetaert K (2009). rootSolve: Nonlinear root finding, equilibrium and steady-state analysis of
ordinary differential equations. R package version 1.4.
Soetaert K, Heip C, Vincx M (1991). Diversity of nematode assemblages along a mediterranean deep-sea transect. Marine Ecology Progress Series, 75, 275-282.
Soetaert K, Herman PMJ (2009). A Practical Guide to Ecological Modelling. Using R as a
Simulation Platform. Springer. ISBN 978-1-4020-8623-6.
Soetaert K, Petzoldt T, Meysman F (2009a). marelac: Constants, conversion factors, utilities
for the MArine, Riverine, Estuarine, LAcustrine and Coastal sciences. R package version
1.4.
Soetaert K, Petzoldt T, Meysman F (2009b). marelacTeaching: Datasets and tutorials for use
in the MArine, Riverine, Estuarine, LAcustrine and Coastal sciences. R package version
1.0.
Soetaert K, Petzoldt T, Setzer RW (2009c). deSolve: General solvers for initial value problems of ordinary differential equations (ODE), partial differential equations (PDE) and
differential algebraic equations (DAE). R package version 1.3.
Verhulst PF (1838). Notice sur la loi que la population poursuit dans son accroissement.
Correspondance mathématique et physique, 10, 113-121.
Volterra V (1926). Variazioni e fluttuazioni del numero d'individui in specie animali conviventi. Mem. R. Accad. Naz. dei Lincei. Ser. VI, 2, 31-113.
Affiliation:
Karline Soetaert
Centre for Estuarine and Marine Ecology (CEME)
Netherlands Institute of Ecology (NIOO)
4401 NT Yerseke, Netherlands
URL: http://www.nioo.knaw.nl/users/ksoetaert