
Estimating Phylogenetic Trees With Phangorn (Version 1.6-0) : Klaus P. Schliep April 5, 2012

This document provides an overview of using the phangorn package in R to estimate phylogenetic trees from sequence alignment data using different reconstruction methods, including distance-based methods, maximum parsimony, and maximum likelihood. It demonstrates how to read in alignment data, build initial trees with UPGMA and NJ, optimize trees with parsimony and likelihood methods, compare models using likelihood ratio tests and AIC, and calculate bootstrap support values. The goal is to enable users to perform phylogenetic analyses and tree reconstruction through examples of common workflows and functions in phangorn.

Estimating phylogenetic trees with phangorn (Version 1.6-0)

Klaus P. Schliep, April 5, 2012

Introduction

These notes should enable the user to estimate phylogenetic trees from alignment data with different methods using the phangorn package [10]. For more background on all the methods see e.g. [2, 12]. This document illustrates some of the phangorn features for estimating phylogenetic trees with different reconstruction methods. Small adaptations to the scripts in the appendix should enable the user to perform phylogenetic analyses.

Getting started

The first thing we have to do is to read in an alignment. Unfortunately there exist many different file formats that alignments can be stored in. The function read.phyDat is used to read in an alignment. There are several functions in the ape package [5] and in phangorn to read in alignments, depending on the format of the dataset (nexus, phylip, fasta) and the kind of data (amino acid or nucleotides). The function read.phyDat calls these other functions. For the specific parameter settings available look in the help files of the functions read.dna (for phylip, fasta, clustal format) and read.nexus.data (for nexus files). For amino acid data read.aa is called additionally. We start our analysis by loading the phangorn package and then reading in an alignment.
> library(phangorn)
> primates = read.phyDat("primates.dna", format="phylip", type="DNA")
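As a minimal sketch of the same reading step for another format, the following writes a tiny FASTA alignment to a temporary file and reads it back with read.phyDat. The sequences, the taxon names t1 to t4 and the object name toy are invented for illustration and are not part of the primates data.

```r
library(phangorn)

# Hypothetical toy alignment (invented sequences) written to a temporary FASTA file
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">t1", "ACGTACGTAC",
             ">t2", "ACGTACGTAA",
             ">t3", "ACGAACGTAA",
             ">t4", "ACGAACTTAA"), fasta_file)

# format selects the underlying reader, type declares the kind of data
toy <- read.phyDat(fasta_file, format = "fasta", type = "DNA")
toy          # prints a short summary of the alignment
names(toy)   # taxon labels stored in the phyDat object
```

Only the format and type arguments change between file formats; the resulting phyDat object is used in the same way afterwards.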


Distance based methods

After reading in the alignment we can build a first tree with distance based methods. The function dist.dna from the ape package computes distances for many DNA substitution models. To use the function dist.dna we have to transform the data to class DNAbin. For amino acids the function dist.ml offers common substitution models (WAG, JTT, LG, Dayhoff, cpREV, mtmam, mtArt, MtZoa and mtREV24). After constructing a distance matrix we reconstruct a rooted tree with UPGMA and alternatively an unrooted tree using Neighbor Joining [9, 11].
> dm = dist.dna(as.DNAbin(primates))
> treeUPGMA = upgma(dm)
> treeNJ = NJ(dm)
We can plot the trees treeUPGMA and treeNJ (figure 1) with the commands:
> layout(matrix(c(1,2), 2, 1), heights=c(1,2))
> par(mar = c(.1,.1,.1,.1))
> plot(treeUPGMA, main="UPGMA")
> plot(treeNJ, "unrooted", main="NJ")

Distance based methods are very fast and we will use the UPGMA and NJ tree as starting trees for the maximum parsimony and maximum likelihood analyses.
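The difference between the two distance methods can be checked directly on the resulting trees. The sketch below is an illustration under assumptions: it uses the Laurasiatherian example alignment shipped with phangorn instead of the primates file (which may not be at hand), computes distances with dist.ml and its default model, and the object names are arbitrary.

```r
library(phangorn)   # attaches ape as well

data(Laurasiatherian)                 # example alignment shipped with phangorn
dm_ex <- dist.ml(Laurasiatherian)     # pairwise distances under the default model

tree_upgma <- upgma(dm_ex)
tree_nj    <- NJ(dm_ex)

is.rooted(tree_upgma)        # UPGMA returns a rooted tree
is.ultrametric(tree_upgma)   # whose tips are all equidistant from the root
is.rooted(tree_nj)           # NJ returns an unrooted tree
```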

Parsimony

The function parsimony returns the parsimony score, that is the minimum number of changes necessary to describe the data for a given tree. We can compare the parsimony score of the two trees we computed so far:
> parsimony(treeUPGMA, primates)
[1] 751
> parsimony(treeNJ, primates)
[1] 746
The function optim.parsimony performs tree rearrangements to find trees with a lower parsimony score. So far the only tree rearrangement implemented is nearest-neighbor interchanges (NNI). However, a version of the parsimony ratchet [4] is also implemented, which is likely to find better trees than NNI rearrangements alone.
> treePars = optim.parsimony(treeUPGMA, primates)
Final p-score 746 after 1 nni operations
> treeRatchet = pratchet(primates, trace = 0)
> parsimony(c(treePars, treeRatchet), primates)
[1] 746 746

Figure 1: Rooted UPGMA tree and unrooted NJ tree

For small datasets it is also possible to find all most parsimonious trees using a branch and bound algorithm [3]. For datasets with more than 10 taxa this can take a long time and depends strongly on how tree-like the data are.
> (trees <- bab(subset(primates,1:10)))
[1] "lower bound: 368"
[1] "upper bound: 580"
[1] "4 species added; 3 trees retained"
[1] "lower bound: 391"
[1] "5 species added; 15 trees retained"
[1] "lower bound: 436"
[1] "6 species added; 105 trees retained"
[1] "lower bound: 482"
[1] "7 species added; 945 trees retained"
[1] "lower bound: 517"
[1] "8 species added; 1612 trees retained"
[1] "lower bound: 548"
[1] "9 species added; 49 trees retained"
[1] "lower bound: 572"
[1] "10 species added; 1 trees retained"
[1] "lower bound: 580"
1 phylogenetic trees
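The counts in the branch and bound trace can be related to the size of the search space. The number of unrooted binary trees on n labelled tips is (2n - 5)!!, and the sketch below (base R only, with a hypothetical helper name) evaluates it. That the first four "trees retained" values equal these counts is an interpretation of the trace, not something the output states; under that reading the bound only starts to discard partial trees once the eighth species is added.

```r
# Number of unrooted binary trees on n labelled tips: (2n - 5)!! for n >= 3
n_unrooted_trees <- function(n) {
  stopifnot(n >= 3)
  prod(seq(1, 2 * n - 5, by = 2))
}

sapply(4:7, n_unrooted_trees)   # 3 15 105 945
n_unrooted_trees(8)             # 10395, compared with 1612 trees retained
n_unrooted_trees(10)            # 2027025 candidate trees for 10 taxa
```

This growth is why an exhaustive search quickly becomes infeasible and why the run time of branch and bound depends on how early the bound can exclude partial trees.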

Maximum likelihood

The last method we will describe in this vignette is Maximum Likelihood (ML) as introduced by Felsenstein [1]. We can easily compute the likelihood for a tree given the data
> fit = pml(treeNJ, data=primates)
> fit
loglikelihood: -3077.846
unconstrained loglikelihood: -1230.335
Rate matrix:
  a c g t
a 0 1 1 1
c 1 0 1 1
g 1 1 0 1
t 1 1 1 0
Base frequencies:
0.25 0.25 0.25 0.25
The function pml returns an object of class pml. This object contains the data, the tree and many different parameters of the model like the likelihood etc. There are many generic functions available for the class pml, which allow the handling of these objects.
> methods(class="pml")
[1] anova.pml*  logLik.pml* plot.pml*   print.pml*  update.pml*
[6] vcov.pml*
Non-visible functions are asterisked
The object fit just holds the likelihood for the tree it was supplied with; the branch lengths are not yet optimized for the Jukes-Cantor model, which can be done with the function optim.pml.
> fitJC = optim.pml(fit, TRUE)
> logLik(fitJC)
With the default values pml will estimate a Jukes-Cantor model. The function update.pml allows us to change parameters. We will change the model to the GTR + Γ(4) + I model and then optimize all the parameters.
> fitGTR = update(fit, k=4, inv=0.2)
> fitGTR = optim.pml(fitGTR, TRUE,TRUE, TRUE, TRUE, TRUE,
+ control = pml.control(trace = 0))
> fitGTR
loglikelihood: -2609.593
unconstrained loglikelihood: -1230.335
Proportion of invariant sites: 0.006054318
Discrete gamma model
Number of rate categories: 4
Shape parameter: 3.175015
Rate matrix:
           a            c            g          t
a  0.0000000  0.646824114 33.615468343  0.4052634
c  0.6468241  0.000000000  0.008337981 14.3652888
g 33.6154683  0.008337981  0.000000000  1.0000000
t  0.4052634 14.365288779  1.000000000  0.0000000
Base frequencies:
0.3917047 0.3796838 0.04024865 0.1883629
We can compare the objects for the JC and GTR + Γ(4) + I model using the likelihood ratio statistic
> anova(fitJC, fitGTR)
Likelihood Ratio Test Table
  Log lik. Df Df change Diff log lik. Pr(>|Chi|)
1  -3068.3 25
2  -2609.6 35        10         917.4  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with the AIC
> AIC(fitGTR)
[1] 5289.186
> AIC(fitJC)
[1] 6186.59
or the Shimodaira-Hasegawa test.
> SH.test(fitGTR, fitJC)
     Trees      ln L Diff ln L p-value
[1,]     1 -2609.593    0.0000  0.5039
[2,]     2 -3068.295  458.7019  0.0000
An alternative is to use the function modelTest to compare different models with the AIC or BIC, similar to the popular programs ModelTest and jModelTest [7, 8].
> mt = modelTest(primates)
The results are illustrated in table 1. The thresholds for the optimisation in modelTest are not as strict as for optim.pml and no tree rearrangements are performed. As modelTest computes and optimises a lot of models it would be a waste of computer time not to save these results. The results are saved as calls, together with the optimised trees, in an environment, and such a call can be evaluated to get a pml object back to use for further optimisation or analysis.
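The likelihood ratio statistic and the AIC values reported by anova and AIC above follow directly from the two log-likelihoods and the numbers of free parameters. The following base R sketch redoes that arithmetic with the values copied from the output; the variable names are arbitrary, and it illustrates the formulas rather than how phangorn computes them internally.

```r
# Values copied from the output above
ll_jc  <- -3068.295; df_jc  <- 25   # fitJC
ll_gtr <- -2609.593; df_gtr <- 35   # fitGTR

# Likelihood ratio statistic: twice the difference in log-likelihood,
# compared to a chi-square distribution with df_gtr - df_jc degrees of freedom
lr_stat <- 2 * (ll_gtr - ll_jc)
p_value <- pchisq(lr_stat, df = df_gtr - df_jc, lower.tail = FALSE)

# AIC = -2 * logLik + 2 * (number of free parameters)
aic_jc  <- -2 * ll_jc  + 2 * df_jc
aic_gtr <- -2 * ll_gtr + 2 * df_gtr

round(lr_stat, 1)               # 917.4
c(JC = aic_jc, GTR = aic_gtr)   # 6186.59 and 5289.186
```

The 25 parameters of the JC model are the branch lengths of the unrooted tree for the 14 sequences; the ten additional parameters of GTR + Γ(4) + I are five substitution rates, three base frequencies, the gamma shape parameter and the proportion of invariant sites.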

    Model       df    logLik      AIC      BIC
 1  JC       25.00  -3068.42  6186.83  6273.00
 2  JC+I     26.00  -3062.63  6177.26  6266.87
 3  JC+G     26.00  -3066.92  6185.83  6275.45
 4  JC+G+I   27.00  -3062.71  6179.43  6272.49
 5  F81      28.00  -2918.17  5892.33  5988.84
 6  F81+I    29.00  -2909.12  5876.24  5976.20
 7  F81+G    29.00  -2912.58  5883.17  5983.12
 8  F81+G+I  30.00  -2908.52  5877.04  5980.44
 9  K80      26.00  -2952.94  5957.89  6047.50
10  K80+I    27.00  -2944.51  5943.02  6036.08
11  K80+G    27.00  -2944.99  5943.99  6037.05
12  K80+G+I  28.00  -2942.38  5940.76  6037.27
13  HKY      29.00  -2647.74  5353.48  5453.43
14  HKY+I    30.00  -2629.83  5319.67  5423.07
15  HKY+G    30.00  -2618.49  5296.99  5400.39
16  HKY+G+I  31.00  -2615.15  5292.30  5399.15
17  SYM      30.00  -2813.91  5687.83  5791.23
18  SYM+I    31.00  -2811.73  5685.45  5792.30
19  SYM+G    31.00  -2804.76  5671.53  5778.38
20  SYM+G+I  32.00  -2804.68  5673.36  5783.65
21  GTR      33.00  -2642.89  5351.78  5465.52
22  GTR+I    34.00  -2624.07  5316.15  5433.34
23  GTR+G    34.00  -2613.65  5295.30  5412.49
24  GTR+G+I  35.00  -2610.31  5290.62  5411.26

Table 1: Summary table of modelTest

> env <- attr(mt, "env")
> ls(envir=env)
 [1] "F81"          "F81+G"        "F81+G+I"      "F81+I"
 [5] "GTR"          "GTR+G"        "GTR+G+I"      "GTR+I"
 [9] "HKY"          "HKY+G"        "HKY+G+I"      "HKY+I"
[13] "JC"           "JC+G"         "JC+G+I"       "JC+I"
[17] "K80"          "K80+G"        "K80+G+I"      "K80+I"
[21] "SYM"          "SYM+G"        "SYM+G+I"      "SYM+I"
[25] "data"         "tree_F81"     "tree_F81+G"   "tree_F81+G+I"
[29] "tree_F81+I"   "tree_GTR"     "tree_GTR+G"   "tree_GTR+G+I"
[33] "tree_GTR+I"   "tree_HKY"     "tree_HKY+G"   "tree_HKY+G+I"
[37] "tree_HKY+I"   "tree_JC"      "tree_JC+G"    "tree_JC+G+I"
[41] "tree_JC+I"    "tree_K80"     "tree_K80+G"   "tree_K80+G+I"
[45] "tree_K80+I"   "tree_SYM"     "tree_SYM+G"   "tree_SYM+G+I"
[49] "tree_SYM+I"
> (fit <- eval(get("HKY+G+I", env), env))
loglikelihood: -2615.149
unconstrained loglikelihood: -1230.335
Proportion of invariant sites: 0.003869274
Discrete gamma model
Number of rate categories: 4
Shape parameter: 2.911518
Rate matrix:
         a        c        g        t
a  0.00000  1.00000 33.58626  1.00000
c  1.00000  0.00000  1.00000 33.58626
g 33.58626  1.00000  0.00000  1.00000
t  1.00000 33.58626  1.00000  0.00000
Base frequencies:
0.4129084 0.3650499 0.04424032 0.1778014
Finally, we may want to apply the bootstrap to test how well the edges of the tree are supported:
> bs = bootstrap.pml(fitJC, bs=100, optNni=TRUE,
+ control = pml.control(trace = 0))
Now we can plot the tree with the bootstrap support values on the edges (figure 2)
> par(mar=c(.1,.1,.1,.1))
> plotBS(fitJC$tree, bs)
Several analyses, e.g. bootstrap and modelTest, can be computationally demanding, but as most computers nowadays have several cores one can distribute the computations using the multicore package. However, this approach is only possible if R is running from the command line (X11), not from a GUI (for example Aqua on Macs), and unfortunately the multicore package does not work at all under Windows.
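The bootstrap step can be tried out quickly on a small example before committing to a long run. The sketch below rests on several assumptions: it uses the first eight taxa of the Laurasiatherian alignment shipped with phangorn rather than the primates data, only ten replicates, and arbitrary object names. Ten replicates are far too few for meaningful support values and only show the mechanics.

```r
library(phangorn)

data(Laurasiatherian)
small <- subset(Laurasiatherian, 1:8)   # first 8 taxa, to keep the run short

# Starting tree and Jukes-Cantor fit with optimised branch lengths
tree0 <- NJ(dist.ml(small))
fit0  <- optim.pml(pml(tree0, data = small),
                   control = pml.control(trace = 0))

# Resample alignment columns, re-optimise each replicate with NNI moves
set.seed(1)
bs_small <- bootstrap.pml(fit0, bs = 10, optNni = TRUE,
                          control = pml.control(trace = 0))

length(bs_small)               # one tree per bootstrap replicate
plotBS(fit0$tree, bs_small)    # support values drawn on the edges
```

For a real analysis the number of replicates would be raised to 100 or more, as in the call above, and the fit with the selected substitution model would be passed instead of the Jukes-Cantor fit.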

Appendix: Standard scripts for nucleotide or amino acid analysis

Here we provide two standard scripts which can be adapted for the most common tasks. Most likely the arguments for read.phyDat have to be adapted to accommodate your file format. Both scripts assume that the multicore package can be used, see the comments above.

Figure 2: Unrooted tree with bootstrap support values

library(parallel) # supports parallel computing
library(phangorn)
file="myfile"
dat = read.phyDat(file)
dm = dist.ml(dat)
tree = NJ(dm)
# 1. alternative: estimate a GTR model
fitStart = pml(tree, dat, k=4, inv=.2)
fit = optim.pml(fitStart, TRUE, TRUE, TRUE, TRUE, TRUE)
# 2. alternative: modelTest
(mt <- modelTest(dat, multicore=TRUE))
mt$Model[which.min(mt$BIC)]
# choose best model from the table, assume now GTR+G+I
env = attr(mt, "env")
fitStart = eval(get("GTR+G+I", env), env)
# or take the model with the smallest BIC directly
fitStart = eval(get(mt$Model[which.min(mt$BIC)], env), env)
fit = optim.pml(fitStart, optNni=TRUE, optGamma=TRUE, optInv=TRUE, model="GTR")
bs = bootstrap.pml(fit, bs=100, optNni=TRUE, multicore=TRUE)

For amino acid data you can specify one of several built-in models, e.g. WAG, JTT, Dayhoff, LG. Optimising the rate matrix for amino acids is possible, but would take a long, a very long time. So make sure to set optBf=FALSE and optQ=FALSE in the function optim.pml, which is also the default.

library(parallel) # supports parallel computing
library(phangorn)
file="myfile"
dat = read.phyDat(file, type = "AA")
dm = dist.ml(dat, model="JTT")
tree = NJ(dm)
(mt <- modelTest(dat, model=c("JTT", "LG", "WAG"), multicore=TRUE))
# the environment holding the saved calls must be extracted before use
env = attr(mt, "env")
fitStart = eval(get(mt$Model[which.min(mt$BIC)], env), env)
# alternatively start from the NJ tree with a fixed model
fitNJ = pml(tree, dat, model="JTT", k=4, inv=.2)
fit = optim.pml(fitNJ, optNni=TRUE, optInv=TRUE, optGamma=TRUE)
fit
bs = bootstrap.pml(fit, bs=100, optNni=TRUE, multicore=TRUE)
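The model selection step in the scripts, mt$Model[which.min(mt$BIC)], can be illustrated without running modelTest. The sketch below rebuilds a few rows of Table 1 by hand in a data frame (the object name tab and the choice of rows are arbitrary) and picks the best model by each criterion. For these data the two criteria disagree.

```r
# A few rows copied from Table 1 (not the full modelTest result)
tab <- data.frame(
  Model  = c("JC", "HKY+G", "HKY+G+I", "GTR+G", "GTR+G+I"),
  df     = c(25, 30, 31, 34, 35),
  logLik = c(-3068.42, -2618.49, -2615.15, -2613.65, -2610.31),
  AIC    = c(6186.83, 5296.99, 5292.30, 5295.30, 5290.62),
  BIC    = c(6273.00, 5400.39, 5399.15, 5412.49, 5411.26),
  stringsAsFactors = FALSE
)

# Smaller is better for both criteria
best_aic <- tab$Model[which.min(tab$AIC)]
best_bic <- tab$Model[which.min(tab$BIC)]
c(AIC = best_aic, BIC = best_bic)   # "GTR+G+I" by AIC, "HKY+G+I" by BIC
```

BIC penalises each additional parameter more heavily than AIC for an alignment of this size, so it prefers the smaller HKY+G+I model, which is the model evaluated from the environment in the maximum likelihood section.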

References
[1] Joseph Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17:368-376, 1981.
[2] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, 2004.
[3] M.D. Hendy and D. Penny. Branch and bound algorithms to determine minimal evolutionary trees. Math. Biosc., 59:277-290, 1982.
[4] K. Nixon. The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics, 15:407-414, 1999.
[5] E. Paradis, J. Claude, and K. Strimmer. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2):289-290, 2004.
[6] Emmanuel Paradis. Analysis of Phylogenetics and Evolution with R. Springer, New York, 2006.
[7] D. Posada and K.A. Crandall. Modeltest: testing the model of DNA substitution. Bioinformatics, 14(9):817-818, 1998.
[8] David Posada. jModelTest: Phylogenetic model averaging. Molecular Biology and Evolution, 25(7):1253-1256, 2008.
[9] N. Saitou and M. Nei. The neighbor-joining method - a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406-425, 1987.
[10] Klaus Peter Schliep. phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4):592-593, 2011.
[11] J.A. Studier and K.J. Keppler. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution, 5(6):729-731, 1988.
[12] Ziheng Yang. Computational Molecular Evolution. Oxford University Press, Oxford, 2006.

Session Information
The version number of R and packages loaded for generating the vignette were:

R version 2.14.0 (2011-10-31), i686-pc-linux-gnu
Locale: LC_CTYPE=en_NZ.UTF-8, LC_NUMERIC=C, LC_TIME=en_NZ.UTF-8, LC_COLLATE=C, LC_MONETARY=en_NZ.UTF-8, LC_MESSAGES=en_NZ.UTF-8, LC_PAPER=C, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_NZ.UTF-8, LC_IDENTIFICATION=C
Base packages: base, datasets, grDevices, graphics, grid, methods, stats, utils
Other packages: Matrix 1.0-4, ape 3.0-1, igraph 0.5.5-4, lattice 0.20-6, phangorn 1.6-0, seqLogo 1.20.0, xtable 1.7-0
Loaded via a namespace (and not attached): gee 4.13-17, nlme 3.1-103, tools 2.14.0
