Phylogenetic Comparative Methods in R Using Phylogeny As A Statistical Fix
Phylogenetic Comparative Methods in R Using Phylogeny As A Statistical Fix
The aims of this problem set are to learn how to use R to answer phylogenetic comparative
questions without panicking! I am assuming that people have little or no experience with
R so more experienced users may want to skip quickly through the first few sections.
By the end of this problem set you should be able to:
We will be using the evolution of primate life-history variables as an example. These data
come from the PanTHERIA database (Jones et al. 2009) and 10kTrees (Arnold et al. 2010).
Note that this is an old version of 10kTrees, so if you want to use it in your research
please download the newest version.
Throughout, R code will be in shaded boxes:
install.packages("ape")
Note that many things in R can be done in multiple ways. You should choose the
methods you feel most comfortable with, and do not panic if someone is doing the
same analyses as you in a different way! This workshop will be full of different ways
to do things.
1
Preparations
First you need to download all the files for this problem set into a folder somewhere on
your computer, let’s call it “RAnalyses2014” and pop it on the Desktop. We will use this
folder throughout the problem set. You’ll need to know what the path of the folder is.
For example on my Windows machine, the path is:
C:/Users/Natalie/Desktop/RAnalyses2014
The path is really easy to find in a Windows machine, just click on the address bar of the
folder and the whole path will appear.
In Windows, paths usually include \ but R can’t read these. It’s easy to fix in your R
code, just change any \ in the path to / or \\.
~/Desktop/RAnalyses2014
It’s a bit trickier to find the path on a Mac, so ask if you need help. Note that the tilde ~
is a shorthand for /Users/Natalie.
Next, open a text editor. R has an inbuilt editor that works pretty well, but NotePad and
TextEdit are fine too. However, in the future I highly recommend using something that
will highlight code for you. My personal favorite is Sublime Text 2, because you can also
use it for any other kind of text editing like LaTeX, html etc.
You should type (or copy and paste) your code into the text editor, edit it until you think
it’ll work, and then paste it into R’s console window. Saving the text file lets you keep a
record of the code you used, which can be a great timesaver if you want to use it again,
especially as you know this code will work!
You can cut and paste code from this handout into your text editor, or straight into R.
You don’t need to retype everything!
2
If you want to add comments to the file (i.e., notes to remind yourself what the code is
doing), put a hash/pound sign (#) in front of the comment.
# Comments are ignored by R but can remind you what the code is doing.
# You need a hash sign at the start of each line of a comment.
To run comparative analyses (or any specialized analysis) in R, you need to download one
or more additional packages from the basic R installation. For this problem set you will
need to install the following packages: ape, picante and caper. To install the package
ape:
install.packages("ape")
Loading packages in R
You’ve installed the packages but they don’t automatically get loaded into your R session.
Instead you need to tell R to load them every time you start a new R session and want to
use functions from these packages. To load the package ape into your current R session:
library(ape)
3
Reading your data and phylogeny into R
Next we need to load the data we are going to use for the analysis. R can read files in lots
of formats, including comma-delimited and tab-delimited files. Excel (and many other
applications) can output files in this format (it’s an option in the “Save As” dialog box
under the “File” menu). To save time I have given you a tab-delimited text file called
“Primatedata.txt” which we are going to use. Load these data as follows.
You will need to replace MYPATH with the name of the path to the folder containing
the data and tree on your computer.
Note that sep = "\t" indicates that you have a tab-delimited file, sep = "," would
indicate a comma-delimited csv file. You can also use read.delim for tab delimited files
or read.csv for comma delimited files. header = TRUE, indicates that the first line of the
data contains column headings.
This is a good point to note that unless you tell R you want to do something, it won’t
do it automatically. So here if you successfully entered the data, R won’t give you any
indication that it worked. Instead you need to specifically ask R to look at the data.
We can look at the data by typing:
str(primatedata)
This shows the structure of the data frame (this can be a really useful command when you
have a big data file). It also tells you what kind of variables R thinks you have (characters,
4
integers, numeric, factors etc.). Some R functions need the data to be certain kinds of
variables so it’s useful to check this.
As you can see, the data contains the following variables: Order, Family, Binomial,
AdultBodyMass_g, GestationLen_d, HomeRange_km2, MaxLongevity_m, and Social-
GroupSize.
head(primatedata)
This gives you the first few rows of data along with the column headings.
names(primatedata)
primatedata
5
Reading and displaying your phylogeny in R
To load a tree you need either the function read.tree or read.nexus. read.tree can deal
with a number of different types of data (including DNA) whereas read.nexus reads
NEXUS files. We will use a NEXUS file of the consensus tree from 10kTrees.
You will need to replace MYPATH with the name of the path to the folder containing
the data and tree on your computer.
primatetree
str(primatetree)
## List of 4
## $ edge : int [1:446, 1:2] 227 228 229 230 231 232 ...
## $ edge.length: num [1:446] 4.95 17.69 19.65 8.12 4.82 ...
## $ Nnode : int 221
## $ tip.label : chr [1:226] "Allenopithecus_nigroviridis" ...
## - attr(*, "class")= chr "phylo"
## - attr(*, "order")= chr "cladewise"
primatetree is a fully resolved tree with branch lengths. There are 226 species and 221
internal nodes. We can plot the tree by using the plot function of ape:
plot(primatetree)
Note that the tree has too many species for us to read the species names. You can make
the tip labels smaller, but that doesn’t help much here:
6
plot(primatetree, cex = 0.5)
Alternatively you can zoom into different sections of the tree that you’re interested in:
This just gives you the tree for Cercopithecus species but you can also see how the species
fit into the rest of the tree using:
Note that zoom automatically sets the plotting window to display two plots at once. To
reset this to one plot only use:
You can also remove species from the tree very easily using the ape function drop.tip:
str(primatetree2)
## List of 4
## $ edge : int [1:444, 1:2] 226 227 228 229 ...
## $ edge.length: num [1:444] 4.95 17.69 19.65 8.12 ...
## $ Nnode : int 220
## $ tip.label : chr [1:225] "Allenopithecus_nigroviridis" ...
## - attr(*, "class")= chr "phylo"
## - attr(*, "order")= chr "cladewise"
To remove more than one species see below. To get further options for the plotting of
phylogenies:
?(plot.phylo)
Note that although you can use plot to plot the phylogeny, you need to specify
plot.phylo to find out the options for plotting trees. You can change the style of
the tree (type), the color of the branches and tips (edge.color, tip.color). Here’s an fun
example!
7
par(mfrow = c(1, 1))
plot(primatetree)
axisPhylo()
Most R functions require your tree to be dichotomous, i.e. to have no polytomies. To check
whether your tree is dichotomous use is.binary.tree. If this is FALSE, use multi2di to
make the tree dichotomous. This function works by randomly resolving polytomies with
zero-length branches.
## [1] FALSE
Most functions also require the tree to be rooted, i.e., to have one taxon designated as the
outgroup. Our tree is rooted but if you wanted to change the root, or root an unrooted
tree use root. Note that here I’ve just chosen a random species (Saimiri sciureus) to be the
root.
plot(primatetree.reroot)
Species names in the tree cannot contain spaces so they are generally written as
Genus_species (the gap between the genus name and species name replaced by _). If the
species names in the data are written as Genus species with a space, then you will have
to replace the spaces with _ so that they match up with the species names in the tree.
You can do this as follows:
8
primatedata$Binomial <- gsub(" ", "_", primatedata$Binomial)
gsub means general substitution. It replaces any instance of the first item (here it’s a space)
with the second item (_) but only in the variable you tell it to (primatedata$Binomial).
A number of R functions require that species names are the row names of the data. This
is really easy to fix:
Often you will have data for species which are not in your phylogeny and/or species
in your phylogeny which are not in your data. Some functions in R can deal with this,
others will produce an error telling you the tree and data do not match (e.g., most ape
functions). It’s useful to know how to deal with this so I have provided code below.
Note that the caper function comparative.data (see below) matches up species names
in the tree and data for you before you run any analyses. All geiger functions match
the tree and the data too. However, these functions are only as good as their inputs.
If you have even slightly misspelled a species name in the tree or the data it will
automatically be dropped from the analyses. It is therefore very important to check
this before running an analysis.
setdiff(primatetree$tip.label, primatedata$Binomial)
## [1] "Allenopithecus_nigroviridis"
## [2] "Cercopithecus_cephus_cephus"
## [3] "Cercopithecus_cephus_ngottoensis"
## [4] "Cercopithecus_diana"
etc.
9
setdiff tells you what is found in the first list, but not in the second. Here, it has listed
the species that are found in the tree but not in the data. We can then use setdiff with
drop.tip to prune the tree to just the species we have data for.
Note that you need to list the species which you do not want to select and then drop
them from the tree instead of selecting the species you want.
Again these are easy to identify using setdiff, just swap the inputs around:
setdiff(primatedata$Binomial, primatetree$tip.label)
## character(0)
In this case we don’t have any species in the tree missing from the data. However, if you
do, to remove species from the data which are not in the tree you can use match and
subset:
Remember != means “does not equal”. So this line of code only selects species which do
appear in the tree, i.e. their value from matches is not 0.
Always check this has worked as expected by checking the data and the phylogeny. In
the first instance you can just use str to make sure you have the expected number of
species in each:
str(primatedata2)
str(primatetree2)
10
Controlling for phylogenetic non-independence
OK let’s do a linear regression of log gestation length against log adult body mass:
summary(model.ols)
11
The output should look like this:
##
## Call:
## lm(formula = log(GestationLen_d) ~ log(AdultBodyMass_g), data = primatedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6161 -0.0828 0.0065 0.1141 0.5056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1038 0.1108 37.04 < 2e-16 ***
## log(AdultBodyMass_g) 0.1204 0.0141 8.54 1.1e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.198 on 75 degrees of freedom
## Multiple R-squared: 0.493, Adjusted R-squared: 0.487
## F-statistic: 73 on 1 and 75 DF, p-value: 1.1e-12
The slope is positive (0.1204 ± 0.0141), and very significant (note that R can’t give values
lower than <2e-16 which is why it shows up so often).
We can also plot the regression line on a scatter plot:
abline(model.ols)
points(log(GestationLen_d[Family == "Cercopithecidae"]) ~
log(AdultBodyMass_g[Family == "Cercopithecidae"]),
data = primatedata, col = "blue", pch = 16)
12
Phylogenetic generalized least squares models (PGLS)
Phylogenetic generalized least squares (PGLS) models offer some important advantages
over independent contrasts. The model of trait evolution can be more flexible i.e., it can
depart from a strict Brownian motion process (λ or K = 1). Different scaling parameters
(λ, κ, and δ) can be incorporated in the analysis, which can significantly improve the fit
of the data to the model and thus also improve the estimation of the trait correlation.
Another advantage of PGLS is that the intercept of the regression is not forced to be zero.
To perform PGLS models in R, caper requires you to first combine the phylogeny and
data into one object using the function comparative.data.
Note that vcv = TRUE stores a variance covariance matrix of your tree (you will need
this for the pgls function). na.omit = FALSE stops the function from removing species
without data for all variables. warn.dropped = TRUE will tell you if any species are not in
both the tree and the data and are therefore dropped from the comparative data object.
Here we will be dropping the 149 species from the tree that we don’t have any data for,
but we won’t drop any species from the data.
This will give a warning telling you that some species have been dropped.
## Warning message:
## In comparative.data(phy = primatetree, data = primatedata, :
## Data dropped in compiling comparative data object
primate$dropped$tips
primate$dropped$unmatched.rows
Always make sure you check the list of dropped species is what you expected, it often
reveals typos in your species names, or mismatches in taxonomies used etc.
13
The function for PGLS analyses in caper is pgls. To fit a model which uses the maximum
likelihood estimate of λ we use the following code:
summary(model.pgls)
##
## Call:
## pgls(formula = log(GestationLen_d) ~ log(AdultBodyMass_g),
## data = primate, lambda = "ML")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09890 -0.01166 0.00308 0.01776 0.07513
##
## Branch length transformations:
##
## kappa [Fix] : 1.000
## lambda [ ML] : 0.892
## lower bound : 0.000, p = 1.1e-14
## upper bound : 1.000, p = 0.00046
## 95.0% CI : (0.753, 0.967)
## delta [Fix] : 1.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2902 0.1604 26.75 < 2e-16 ***
## log(AdultBodyMass_g) 0.1049 0.0196 5.34 9.5e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0261 on 75 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.266
## F-statistic: 28.5 on 1 and 75 DF, p-value: 9.48e-07
As well as the standard regression outputs, the output includes the estimated ML value
of λ (0.892) and p values from likelihood ratio tests showing whether the ML λ is
significantly different from 0 or 1. κ and δ are also tree transformations which can
14
improve the fit of the data to the tree. It is also possible to use pgls to optimise κ or δ
(using kappa = "ML" or delta = "ML" instead of lambda = "ML" in the code above). We
will not cover this today. Note that optimizing more than one of these parameters at the
same time is not advisable because it would be impossible to interpret the results!
We can also plot the results as follows:
abline(model.pgls)
log(GestationLen_d)
5.4
●●●● ●
●●
●● ●
● ● ● ●● ●
● ● ●●●
●●●●●●
●
● ● ●●● ● ●
● ●●● ●
● ●
●● ●
● ●● ● ● ● ●●● ●●●
4.8
● ● ● ● ●●
●
●
●
4.2
●
●● ●
4 5 6 7 8 9
log(AdultBodyMass_g)
15
Sometimes you will find that pgls will not work and you get an optim error. This
is much more common when using a Mac. To fix it all you need to do is change
the bounds (upper and lower values) on the parameter being optimized, in this case
lambda. Just change the lower bound of lambda to something a little bigger than 1e-6
until it works. For example:
plot(lambda.profile)
20 25 30 35 40 45
logLik
lambda
Data: primate; Model: log(GestationLen_d) ~ log(AdultBodyMass_g)
kappa 1.00; lambda 0.89; delta 1.00
16
This graph shows the likelihood profile of λ in our model. Ideally you want a line with
an obvious peak/optimum like this, rather than a flat line which would suggest λ could
be anything. You can see that the optimum (the peak of the curve) is at 0.892 as estimated
in our PGLS model. The dotted red lines are the 95% confidence intervals on λ for our
model. pgls.confint prints out these numbers in $ci.val
pgls.confint(model.pgls, "lambda")$ci.val
You will often find strange λ profiles when you don’t have a lot of species in your data,
because λ (and Blomberg’s K - see below) has very low power to detect phylogenetic
signal for less than 20-30 data points (see Freckleton et al. 2002 Am Nat). This means
that using PGLS on small datasets is tricky - you almost always get ML λ of zero but
the λ profile will show a pretty flat likelihood surface. Unfortunately people often
forget to look at the λ profile so erroneously conclude that there is no phylogenetic
autocorrelation in their data.
Generally I’d say don’t use small datasets, however, this seems unavoidable in many
areas of anthropology. Therefore my advice is to (only in this situation!) ignore one
of Freckleton’s deadly sins (2009, JEB) and report the results from an OLS model
(equivalent of PGLS with lambda = 0) and also report the results from a PGLS model
with lambda set to 1 (equivalent to independent contrasts). This problem comes
up every year at AnthroTree and consensus among the PCM community (well Rob
Freckleton, Charlie Nunn and others!) is that this is best solution at present, if
collecting more data is really not an option!
To set λ to 1 you just replace "ML" with 1:
17
Model diagnostics for PGLS models
You should always check model diagnostic plots whenever you fit a model in R to check
that your data meet the assumptions of the model. The method for this in PGLS is the
same for OLS, independent contrasts and PGLS models (though the graphs are slightly
different). To get model diagnostic plots for PGLS:
plot(model.pgls)
0 1 2 3
0.0 0.1 0.2 0.3 0.4
●
●
●●
●
●●
●
●●
●
●●
●
●
●●
●
●●
●
●●
●●
●
●●
●
●
●●
●●
●
●
●●
●●
●●
●●
●●●
●
−2
●●
−4
−4 −2 0 2 4 −2 −1 0 1 2
0 1 2 3
● ●
5.3
●●
●●
● ●● ●●●● ● ●
●● ● ● ● ●●●
● ● ●
●
●●
● ● ●● ● ● ●●●●●●●●●
Fitted value
● ● ● ●●●
● ●● ● ● ●● ●●
● ● ● ● ● ●●
●
●●
5.1
● ●● ● ●
● ● ●
● ● ●● ●●●
● ● ● ●● ● ● ● ● ● ● ●●
● ●● ● ● ● ● ●● ●
● ● ●
● ●● ●● ●
● ● ● ●
●● ● ● ●● ●
● ●●
● ● ●
●
4.9
● ●
−2
● ●
● ● ● ●
● ●●
●● ●
● ●
4.7
−4
● ●
18
Without going into the statistical details, what you are looking for in these plots is:
1. In plot one you shouldn’t see any data with a studentized residual > ± 3. Any
species with such large residuals should be removed as these outliers may overly
influence the results of the regression (see Jones and Purvis 1997). Often these are
the result of measurement error associated with species pairs joined by very short
branches. You should generally report results with and without outliers unless the
results remain qualitatively the same.
2. The points of the Q-Q plot (plot 2) should approximately fall on the line.
3. Plots 3 and 4 should show a fairly random scattering of points. You want to avoid
any clear patterns.
It takes practice to know what is “good”, “bad” and “acceptable” with these plots. I
would say the plots above are fine, but there appear to be a couple of data points with
studentized residuals > ± 3 in plot 1 that should be removed, or at least checked for
errors.
Phylogenetic signal is merely the pattern where close relatives have more similar trait
values than more distant relatives. Often people will mention that they corrected for
phylogeny because of the phylogenetic signal in their variables. However, we do not
correct for phylogeny because our variables show phylogenetic signal. We correct for
phylogeny because the residuals from our models show phylogenetic signal – see Revell
2010 Methods in Ecology and Evolution. λ shown in PGLS models above is the λ for
the model residuals not the individual variables. Also see Kamilar and Cooper 2013 Phil
Trans B.
Sometimes however, you might be interested in the phylogenetic signal of just one trait.
λ is really easy to estimate using caper. To do this for log GestationLen_d:
summary(est.lambda)
19
## Call:
## pgls(formula = log(GestationLen_d) ~ 1, data = primate, lambda = "ML")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12835 -0.01446 0.00139 0.01757 0.07433
##
## Branch length transformations:
##
## kappa [Fix] : 1.000
## lambda [ ML] : 0.948
## lower bound : 0.000, p = <2e-16
## upper bound : 1.000, p = 0.03
## 95.0% CI : (0.859, 0.996)
## delta [Fix] : 1.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.005 0.117 42.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0334 on 76 degrees of freedom
## Multiple R-squared: 0, Adjusted R-squared: 0
## F-statistic: NaN on 0 and 76 DF, p-value: NA
Note that by replacing the explanatory variables with 1 you are just investigating the
relationship between log gestation length and the phylogeny. Thus the λ value is the λ
estimate for log GestationLen_d and the p values are from likelihood ratio tests showing
whether the ML λ is significantly different from 0 (no phylogenetic signal) or 1 (the
expectation under Brownian motion).
20
We can then calculate K for log gestation length:
Kcalc(lngest[primatetree$tip.label], primatetree)
## [1] "Dropping taxa from the data because they are not present in the phylogeny:"
## [1] NA
## [1] "Dropping tips from the tree because they are not present in the data:"
## [1] "Allenopithecus_nigroviridis"
## [2] "Cercopithecus_cephus_cephus"
## [3] "Cercopithecus_cephus_ngottoensis"
## [4] "Cercopithecus_diana"
etc.
## [,1]
## [1,] 0.7758
Note that we are using the original data and tree here, not the pruned down versions, so
lots of tips are being dropped.
Kcalc (and phylosignal) require the trait data to be in the same order as the tree tip
labels. lngest[primatetree$tip.label] selects gestation lengths from the species in the
same order as they are in the tree (square brackets [ ] are used in R to subset data).
K for log gestation length is 0.7758. As with λ above, we’re interested in whether the value
of K is significantly different from what we’d expect by chance. We can do this using the
function phylosignal.This function randomly assigns the trait values to the species and
then calculates K. This is repeated 1000 times (if reps = 1000) and the observed value of
K is then compared to the randomized values to determine its significance.
After the warnings about which tips have been dropped, the output looks like this:
In this case the observed K is significantly higher than the random values of K (Z = -2.281,
p < 0.001, mean of the random values = 0.012).
Note this is a randomization test so your output here will not be identical to mine.
21
Practice exercises
The code above only runs these analyses on a few variables. If you have time, try and
run some analyses on the other variables. Here are some example questions to get you
started (or you can head straight to testing it on your own data!).
2. Make a plot of social group size against home range size. Color the Cebidae in
green.
4. Does home range size differ significantly among the different primate genera?
7. Run a PGLS analysis to determine the relationship between social group size and
gestation length. How does this differ from the result obtained with an OLS
regression? Plot the result.
22