Yang and Rannala 2012 Molecular Phylogenetics.
STUDY DESIGNS
Molecular phylogenetics:
principles and practice
Ziheng Yang1,2 and Bruce Rannala1,3
Abstract | Phylogenies are important for addressing various biological questions such
as relationships among species or genes, the origin and spread of viral infection and
the demographic changes and migration patterns of species. The advancement of
sequencing technologies has taken phylogenetic analysis to a new height. Phylogenies
have permeated nearly every branch of biology, and the plethora of phylogenetic
methods and software packages that are now available may seem daunting to an
experimental biologist. Here, we review the major methods of phylogenetic analysis,
including parsimony, distance, likelihood and Bayesian methods. We discuss their
strengths and weaknesses and provide guidance for their use.
Systematics
The inference of phylogenetic
relationships among species
and the use of such information
to classify species.
Taxonomy
The description, classification
and naming of species.
Coalescent
The process of joining ancestral
lineages when the genealogical
relationships of a random
sample of sequences from
a modern population are
traced back.
REVIEWS
Box 1 | Tree concepts
[Figure: a, rooted tree; b, unrooted tree, for species 1, 2 and 3, with branch lengths b0, b1, b2 and b3 and a time axis.]
A phylogeny is a model of genealogical history in which the lengths of the branches are unknown parameters. For example, the phylogeny on the left is generated by two speciation events that occurred at time points 0 and 1. The branch lengths (b0, b1, b2 and b3) are typically expressed in units of expected number of substitutions per site and measure the amount of evolution along the branches.
If the substitution rate is constant over time or among lineages, we say that the molecular clock holds60. The tree will then have a root and be ultrametric, meaning that the distances from the tips of the tree to the root are all equal (for example, b0 + b1 = b0 + b2 = b3). A rooted tree for s species can then be represented by the ages of the s − 1 ancestral nodes and thus involves s − 1 branch-length parameters. The procedure of inferring rooted trees by assuming the molecular clock is called molecular clock rooting.
For distantly related species, the clock hypothesis should not be assumed. Most phylogenetic analyses are therefore
conducted without the assumption of the clock. If every branch on the tree is allowed to have an independent
evolutionary rate, commonly used models and methods are unable to identify the location of the root, so only unrooted
trees are inferred. An unrooted tree for s species then has 2s − 3 branch length parameters. A commonly used strategy to
root the tree is to include outgroup species in the analysis, which are known to be more distantly related than the species
of interest. Although the inferred tree for all species is unrooted, the root is believed to be located along the branch that
leads to the outgroup so that the tree for the ingroup species is rooted. This strategy is called outgroup rooting.
Gene trees
The phylogenetic or
genealogical tree of
sequences at a gene locus
or genomic region.
Statistical phylogeography
The statistical analysis of
population data from closely
related species to infer
population parameters and
processes such as population
sizes, demography, migration
patterns and rates.
Species tree
A phylogenetic tree for a set
of species that underlies the
gene trees at individual loci.
Systematic errors
Errors that are due to an
incorrect model assumption.
They are exacerbated when
the data size increases.
Cluster algorithm
An algorithm of assigning a
set of individuals to groups (or
clusters) so that objects of the
same cluster are more similar
to each other than those from
different clusters. Hierarchical
cluster analysis can be
agglomerative (starting
with single elements and
successively joining them into
clusters) or divisive (starting
with all objects and successively
dividing them into partitions).
Markov chain
A stochastic sequence (or chain)
of states with the property that,
given the current state, the
probabilities for the next state
do not depend on the past
states.
Transitions
Substitutions between the two
pyrimidines (T↔C) or between
the two purines (A↔G).
Transversions
Substitutions between a
pyrimidine and a purine
(T or C ↔ A or G).
impact of missing data and strategies of data partitioning. The literature of molecular phylogenetics is large
and complex23,24; the aim of this Review is to provide a
starting point for exploring the methods further.
www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved
[Figure: panels for the HKY85, K80 and JC69 substitution models.]
$$Q = \sum_{i=1}^{s} \sum_{j=1}^{s} \left( d_{ij} - \hat{d}_{ij} \right)^2 \qquad (1)$$
This is the same least squares method used in statistics for fitting a straight line y=a+bx to a scatter plot.
Optimizing branch lengths (or \hat{d}_{ij}) leads to the score Q
for the given tree, and the tree with the smallest score is
the least squares estimate of the true tree.
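The least squares criterion can be sketched for the four-species case as follows. This is a minimal illustration, not the procedure of any particular program: the taxon names, the distance values and the helper functions are hypothetical, d_ij is taken to be the calculated pairwise distance and \hat{d}_{ij} the sum of branch lengths on the path between species i and j, and the branch lengths are fitted by unconstrained ordinary least squares (so negative estimates are not excluded).

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D"]
PAIRS = list(combinations(TAXA, 2))

# Hypothetical pairwise distances, roughly additive on the tree ((A,B),(C,D)).
DIST = {
    frozenset("AB"): 0.31, frozenset("AC"): 0.49, frozenset("AD"): 0.61,
    frozenset("BC"): 0.60, frozenset("BD"): 0.69, frozenset("CD"): 0.30,
}

# Each unrooted four-taxon tree is identified by the pair grouped with A.
SPLITS = [("A", "B"), ("A", "C"), ("A", "D")]


def design_matrix(split):
    """One row per taxon pair; columns are the 4 tip branches and 1 internal branch."""
    rows = []
    for x, y in PAIRS:
        row = [1.0 if t in (x, y) else 0.0 for t in TAXA]
        # The path uses the internal branch when x and y lie on opposite sides.
        row.append(1.0 if (x in split) != (y in split) else 0.0)
        rows.append(row)
    return rows


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares_score(split, dist):
    """Return (Q, branch lengths) for the tree defined by `split`.

    Each unordered pair is counted once; counting both (i, j) and (j, i)
    would double Q without changing the ranking of trees.
    """
    x = design_matrix(split)
    y = [dist[frozenset(p)] for p in PAIRS]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    branches = solve(xtx, xty)
    fitted = [sum(b * c for b, c in zip(branches, r)) for r in x]
    q = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return q, branches


for split in SPLITS:
    q, _ = least_squares_score(split, DIST)
    print(split, round(q, 6))
```

The tree with the smallest printed score is the least squares estimate for these hypothetical distances.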
The minimum evolution method34,35 uses the tree
length (which is the sum of branch lengths) instead of
Q for tree selection, even though the branch lengths can
still be estimated using the least squares criterion. Under
the minimum evolution criterion, shorter trees are more
likely to be correct than longer trees are.
The most widely used distance method is neighbour
joining 25. This is a cluster algorithm and operates by
starting with a star tree and successively choosing a pair
of taxa to join together (based on the taxon distances),
until a fully resolved tree is obtained. The taxa to be
joined are chosen in order to minimize an estimate of
tree length36. The two joined taxa (for example, species
1 and 2 in FIG.2) are then represented by their ancestor (for example, node y in FIG.2), and the number of
taxa that are connected to the root (node x in FIG. 2)
is reduced by one (FIG.2). The distance matrix is then
updated with the joined taxa replacing the two original taxa. See REF.36 for a discussion of the neighbour
joining updating formula. An efficient implementation
of neighbour joining is found in the program MEGA37
(TABLE1).
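A single joining step can be sketched as below. The text does not spell out the selection criterion or the updating formula, so this sketch assumes the commonly cited form of the neighbour joining criterion, Q(i, j) = (n − 2) d(i, j) − Σ_k d(i, k) − Σ_k d(j, k), together with the usual distance update; the function names and the five-taxon distance matrix are hypothetical.

```python
NAMES = ["a", "b", "c", "d", "e"]
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]


def nj_pick_pair(d):
    """Return (q, i, j) for the pair of taxa with the smallest criterion value."""
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


def nj_join(names, d, i, j):
    """Replace taxa i and j by their ancestor and update the distance matrix."""
    n = len(d)
    totals = [sum(row) for row in d]
    # Branch lengths from the new ancestral node to the two joined taxa.
    len_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    keep = [k for k in range(n) if k not in (i, j)]
    new_names = ["(%s,%s)" % (names[i], names[j])] + [names[k] for k in keep]
    # Distance from the new node to every remaining taxon.
    new_row = [0.0] + [(d[i][k] + d[j][k] - d[i][j]) / 2 for k in keep]
    new_d = [new_row] + [
        [new_row[a + 1]] + [d[k][m] for m in keep] for a, k in enumerate(keep)
    ]
    return new_names, new_d, (len_i, len_j)


q, i, j = nj_pick_pair(D)
names2, d2, branch_lengths = nj_join(NAMES, D, i, j)
print(names2, branch_lengths)
```

Repeating the two calls on the reduced matrix until three taxa remain would give a fully resolved unrooted tree.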
Unrooted trees
Phylogenetic trees for
which the location of
the root is unspecified.
Note that it might be important to use a realistic substitution model to calculate the pairwise distances.
Distance methods can perform poorly for very divergent
sequences because large distances involve large sampling
errors, and most distance methods (such as neighbour
joining) do not account for the high variances of large
distance estimates. Distance methods are also sensitive
to gaps in the sequence alignment 38.
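The effect of large distances on sampling error can be illustrated with the simplest substitution model. The sketch below assumes the JC69 model, under which the distance is d = −(3/4) ln(1 − 4p/3) for an observed proportion p of differing sites, with the usual large-sample variance p(1 − p) / [n(1 − 4p/3)^2] for n sites; the function names and the example values are hypothetical.

```python
import math


def jc69_distance(p):
    """JC69 distance (expected substitutions per site) from the proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 distance to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_variance(p, n_sites):
    """Approximate sampling variance of the JC69 distance estimated from n_sites sites."""
    return p * (1.0 - p) / (n_sites * (1.0 - 4.0 * p / 3.0) ** 2)


for p in (0.1, 0.3, 0.6):
    d = jc69_distance(p)
    sd = math.sqrt(jc69_variance(p, 500))
    print("p = %.1f  distance = %.3f  standard error = %.3f" % (p, d, sd))
```

The standard error grows much faster than the distance itself as p approaches 0.75, which is the behaviour described above for very divergent sequences.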
Maximum parsimony
Parsimony tree score. The maximum parsimony method
minimizes the number of changes on a phylogenetic
tree by assigning character states to interior nodes
on the tree. The character (or site) length is the minimum number of changes required for that site, whereas
the tree score is the sum of character lengths over all
sites. The maximum parsimony tree is the tree that
minimizes the tree score.
Some sites are not useful for tree comparison by
parsimony. For example, constant sites, for which the
same nucleotide occurs in all species, have a character
length of zero on any tree. Singleton sites, at which only
one of the species has a distinct nucleotide, whereas all
others are the same, can also be ignored, as the character length is always one. The parsimony-informative
sites are those at which at least two distinct characters
are observed, each at least twice. For four species, only
three site patterns are informative: xxyy, xyxy and xyyx,
where x and y are any two distinct nucleotides. There
are three possible unrooted trees for four species, and
which of them is the maximum parsimony tree depends
on which of the three site patterns occurs most often in
the alignment.
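The four-species argument can be sketched in a few lines. The alignment below is hypothetical and assumed to contain no gaps or ambiguous characters; the function names are illustrative only.

```python
from collections import Counter

# Hypothetical alignment of four species, in the order 1, 2, 3, 4.
ALIGNMENT = ["AAGCTA", "AAGCCA", "AGACTC", "AGATCC"]

# Unrooted tree supported by each informative site pattern.
TREE_FOR_PATTERN = {
    "xxyy": "((1,2),(3,4))",
    "xyxy": "((1,3),(2,4))",
    "xyyx": "((1,4),(2,3))",
}


def is_parsimony_informative(column):
    """True if at least two distinct characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def four_taxon_pattern(column):
    """Return 'xxyy', 'xyxy' or 'xyyx' for an informative four-species site, else None."""
    if not is_parsimony_informative(column):
        return None
    a, b, c, d = column
    if a == b and c == d:
        return "xxyy"
    if a == c and b == d:
        return "xyxy"
    return "xyyx"


columns = ["".join(site) for site in zip(*ALIGNMENT)]
pattern_counts = Counter(
    p for p in (four_taxon_pattern(col) for col in columns) if p is not None
)
best_pattern = max(pattern_counts, key=pattern_counts.get)
print(dict(pattern_counts), "->", TREE_FOR_PATTERN[best_pattern])
```

Constant and singleton sites are skipped because `four_taxon_pattern` returns `None` for them, mirroring the reasoning in the text.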
An algorithm for finding the minimum number of
changes on a binary tree (and for reconstructing the
ancestral states to achieve the minimum) was developed
by Fitch39 and Hartigan40. PAUP41, MEGA37 and TNT42
are commonly used parsimony programs.
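The counting step of this algorithm can be sketched as follows: a minimal version of the bottom-up pass for a single site on a binary tree, summed over sites to give the tree score. It is an illustration rather than the implementation used in any of the programs named above; trees are written as nested tuples with an arbitrary root (which does not affect the count), and the sequences are hypothetical.

```python
SEQUENCES = {"s1": "AAGCTA", "s2": "AAGCCA", "s3": "AGACTC", "s4": "AGATCC"}

TREES = {
    "T1": (("s1", "s2"), ("s3", "s4")),
    "T2": (("s1", "s3"), ("s2", "s4")),
    "T3": (("s1", "s4"), ("s2", "s3")),
}


def fitch(tree, states):
    """Return (possible states at this node, minimum number of changes below it)."""
    if isinstance(tree, str):  # a tip: its state is observed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:  # the two children can agree, so no extra change is needed
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_score(tree, sequences):
    """Sum of character lengths over all sites of the alignment."""
    n_sites = len(next(iter(sequences.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in sequences.items()}
        total += fitch(tree, states)[1]
    return total


scores = {name: tree_score(tree, SEQUENCES) for name, tree in TREES.items()}
print(scores)
```

The tree with the smallest score in `scores` is the maximum parsimony tree for this toy alignment.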
Parsimony was originally developed for use in analysing discrete morphological characters. During the
late 1970s, it began to be applied to molecular data.
A controversy arose concerning whether parsimony
(without explicit assumptions) or likelihood (with an
explicit evolutionary model) was a better method for
phylogenetic analysis23. The controversy has subsided,
and the importance of model-based inference methods
is broadly recognized. The use of parsimony is still common: not because it is believed to be assumption-free,
but because it often produces reasonable results and is
computationally efficient.
Strengths and weaknesses of parsimony. A strength of
parsimony is its simplicity; it is easy to describe and to
understand, and it is amenable to rigorous mathematical
analysis. The simplicity also helps in the development of
efficient computer algorithms.
A major weakness of parsimony is its lack of explicit
assumptions, which makes it nearly impossible to incorporate any knowledge of the process of sequence evolution in tree reconstruction. The failure of parsimony
to correct for multiple substitutions at the same site
Long-branch attraction
The phenomenon of inferring
an incorrect tree with long
branches grouped together by
parsimony or by model-based
methods under simplistic
models.
Maximum likelihood
Basis of maximum likelihood. Maximum likelihood was
developed by R. A. Fisher in the 1920s as a statistical
methodology for estimating unknown parameters in a
model. The likelihood function is defined as the probability of the data given the parameters but is viewed as
a function of the parameters with the data observed and
fixed. It represents all information in the data about the
parameters. The maximum likelihood estimates (MLEs)
of parameters are the parameter values that maximize the
likelihood. Most often, the MLEs are found numerically
Table 1 | Functionalities of a few commonly used phylogenetic programs
Name | Brief description | Link | Refs
Bayesian evolutionary analysis sampling trees (BEAST) | A Bayesian MCMC program for inferring rooted trees under the clock or relaxed-clock models. It can be used to analyse nucleotide and amino acid sequences, as well as morphological data. A suite of programs, such as Tracer and FigTree, are also provided to diagnose, summarize and visualize results | http://beast.bio.ed.ac.uk | 135
 | | | 55
 | | http://www.hyphy.org | 136
Molecular evolutionary genetic analysis (MEGA) | | http://www.megasoftware.net | 37
MrBayes | | http://mrbayes.net | 71
Phylogenetic analysis by maximum likelihood (PAML) | | | 137
Phylogenetic analysis using parsimony* and other methods (PAUP* 4.0) | PAUP* 4.0 is still a beta version (at the time of writing). It implements parsimony, distance and likelihood methods of phylogeny reconstruction | http://www.sinauer.com/detail.php?id=8060 |
PHYLIP | | http://evolution.gs.washington.edu/phylip.html |
PhyML | A fast program for searching for the maximum likelihood trees using nucleotide or protein sequence data | http://www.atgc-montpellier.fr/phyml/binaries.php | 53
RAxML | A fast program for searching for the maximum likelihood trees under the GTR model using nucleotide or amino acid sequences. The parallel versions are particularly powerful | http://sco.h-its.org/exelixis/software.html | 54
 | | http://www.zmuc.dk/public/phylogeny/TNT | 42
Note: all programs can run on Windows, Mac OSX and Unix or Linux platforms. Except for PAUP*, which charges a nominal fee, all packages are free for download.
See Felsenstein's comprehensive list of programs at http://evolution.genetics.washington.edu/phylip/software.html. GTR, general time reversible; MCMC, Markov chain Monte Carlo.
Molecular clock
The hypothesis or observation
that the evolutionary rate
is constant over time or
across lineages.
Prior distribution
The distribution assigned
to parameters before the
analysis of the data.
Posterior distribution
The distribution of the
parameters (or models)
conditional on the data. It
combines the information
in the prior and in the data
(likelihood).
for the model to accommodate variable amino acid substitution rates among sites56 or even different amino acid
frequencies among sites57,58.
Maximum likelihood has a clear advantage over distance or parsimony methods if the aim is to understand
the process of sequence evolution. The likelihood ratio test
can be used to examine the fit of evolutionary models59
and to test interesting biological hypotheses, such as the
molecular clock49,60 and Darwinian selection affecting protein evolution61–63. See REFS 22,24,64,65 for summaries of
such tests in phylogenetics.
The main drawback of maximum likelihood is that
the likelihood calculation and, in particular, tree search
under the likelihood criterion is computationally
demanding. Another drawback is that the method has
potentially poor statistical properties if the model is misspecified. This is also true for Bayesian analysis (TABLE2).
Bayesian methods
Basis of Bayesian inference. Bayesian inference is a general methodology of statistical inference. It differs from
maximum likelihood in that parameters in the model
are considered to be random variables with statistical
[Figure 3, panels: a, correct tree, T1; b, wrong tree, T2; c, the Gnepine tree; d, the GneCup tree. Panels c and d show seed plant phylogenies with bootstrap proportions on the nodes; scale bars, 0.1.]
Figure 3 | Long-branch attraction in theory and in practice. Panels a and b show the four-species case analysed
by Felsenstein43. If the correct tree (T1 in a) has two long branches separated by a short internal branch, parsimony
(as well as model-based methods such as likelihood and Bayesian methods under simplistic models) tends to recover a
wrong tree (T2 in b), in which the two long branches are grouped together. Panels c and d show a similar phenomenon
in a real data set, concerning the phylogeny of seed plants134. The Gnetales is a morphologically and ecologically
diverse group of Gymnosperms including three genera (Ephedra, Gnetum and Welwitschia), but its phylogenetic
position has been controversial. Maximum likelihood analysis of 56 chloroplast proteins produced the GneCup tree
(d), in which the Gnetales are grouped with Cupressophyta, apparently owing to a long-branch attraction artefact.
However, the Gnepine tree (c), in which the Gnetales joins the Pinaceae, was inferred by excluding the fastest-evolving
18 proteins as well as three proteins (namely, psbC, rpl2 and rps7) that had experienced many parallel substitutions
between the Cryptomeria branch and the branch ancestral to the Gnetales. The Gnepine tree (c) is also supported by
two proteins from the nuclear genome and appears to be the correct tree. Branch lengths and bootstrap proportions
are all calculated using RAxML. See REF. 134 for details.
Table 2 | A summary of strengths and weaknesses of different tree reconstruction methods
Method | Strengths | Weaknesses
Parsimony methods | Simplicity and intuitive appeal. The only framework appropriate for some data (such as SINEs and LINEs) |
Distance methods | Fast computational speed. Can be applied to any type of data as long as a genetic distance can be defined. Models for distance calculation can be chosen to fit data |
Likelihood methods | Can use complex substitution models to approach biological reality. Powerful framework for estimating parameters and testing hypotheses |
Bayesian methods | Can use realistic substitution models, as in maximum likelihood. Prior probability allows the incorporation of information or expert knowledge. Posterior probabilities for trees and clades have easy interpretations |
Clades
Groups of species that
have descended from a
common ancestor.
data and model. By contrast, concepts such as the confidence interval in a likelihood analysis have a contrived
interpretation that eludes many users of statistics. In
phylogenetics, it has not been possible to define a confidence interval for the tree. The widely used bootstrap
method73 (BOX3) has been difficult to interpret despite
numerous efforts74–77. However, the odds are not entirely
against maximum likelihood. Posterior probabilities for
trees and clades that have been calculated from real data
sets often appear to be too high66,78–80. In many analyses,
nearly all nodes had posterior probabilities of ~100%.
Posterior tree probabilities are also sensitive to model
violations, and use of simplistic models may lead to
inflated posterior probabilities81.
Second, the prior probability allows incorporation
of a priori information about the trees or parameters.
However, such information is rarely available, and
specification of the prior is most often a burden on the
user; almost all data analyses are conducted using
the default priors in the computer program. High-dimensional priors are notoriously hard to specify, and
an innocent-looking prior can have an undue and unexpected influence on the posterior. For example, it has
recently been pointed out that the independent exponential prior on branch lengths used by MrBayes can
induce a strongly informative and unreasonable prior
on the tree length, producing unreasonably long trees in
some data sets82–84. It is therefore important to conduct
Bayesian robustness analysis to assess the impact of the
prior on the posterior estimates.
Box 2 | Markov chain Monte Carlo
Markov chain Monte Carlo (MCMC) is a simulation algorithm in which one moves
from one tree (or parameter value) to another and, in the long run, visits the trees
(or parameters) in proportion to their posterior probabilities. The tree–parameter set
(T, θ) constitutes the state of the algorithm. Here, the parameters θ may include the
branch lengths of the tree and parameters in the evolutionary model, such as
the transition/transversion rate ratio. The following scheme demonstrates the main
features of MCMC algorithms.
Step 1. Initialization. Choose a starting tree and starting parameters at random (T, θ).
Step 2. Main loop.
Step 2a. Proposal to change the tree T. Propose a new tree, T*, by changing the
current tree, T. If T* has higher posterior probability than the current tree,
P(T*, θ|D) > P(T, θ|D), accept the new tree T*. Otherwise, accept T* with probability:
$$\frac{P(T^*, \theta \mid D)}{P(T, \theta \mid D)} = \frac{P(T^*, \theta)\, P(D \mid T^*, \theta)}{P(T, \theta)\, P(D \mid T, \theta)} \qquad (3)$$
For a proposed change of the parameters from θ to θ*, with the tree T unchanged, the corresponding ratio is:
$$\frac{P(T, \theta^* \mid D)}{P(T, \theta \mid D)} = \frac{P(T, \theta^*)\, P(D \mid T, \theta^*)}{P(T, \theta)\, P(D \mid T, \theta)} \qquad (4)$$
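The accept or reject rule can be sketched for a single continuous parameter. The example below is a deliberately reduced illustration, not the algorithm of any phylogenetic program: there is no tree, the only parameter is the JC69 distance between two sequences, the prior is assumed to be exponential, the proposal is a symmetric sliding window, and the site counts, window width and function names are all hypothetical.

```python
import math
import random


def log_posterior(theta, n_sites, n_diff, prior_mean):
    """Unnormalized log posterior of the JC69 distance theta."""
    if theta <= 0.0:
        return float("-inf")  # outside the support of the prior
    # Probability that a site differs between the two sequences under JC69.
    p = 0.75 * (1.0 - math.exp(-4.0 * theta / 3.0))
    log_likelihood = n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)
    log_prior = -theta / prior_mean  # exponential prior, up to a constant
    return log_likelihood + log_prior


def metropolis(n_sites, n_diff, n_iter=20000, window=0.05, prior_mean=0.2, seed=1):
    """Sample theta by proposing symmetric moves and applying the accept/reject rule."""
    rng = random.Random(seed)
    theta = 0.5  # arbitrary starting value
    current = log_posterior(theta, n_sites, n_diff, prior_mean)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.uniform(-window / 2.0, window / 2.0)
        candidate = log_posterior(proposal, n_sites, n_diff, prior_mean)
        # Accept if the posterior is higher; otherwise accept with
        # probability equal to the posterior ratio.
        if candidate >= current or rng.random() < math.exp(candidate - current):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(n_sites=948, n_diff=90)
kept = samples[2000:]  # discard the early iterations as burn-in
posterior_mean = sum(kept) / len(kept)
print("posterior mean of the distance: %.4f" % posterior_mean)
```

Because the proposal is symmetric, the acceptance probability involves only the ratio of prior times likelihood, as in the ratios above; the normalizing constant of the posterior cancels and never has to be computed.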
Box 3 | Sampling error in the estimated tree and bootstrap analysis
[Figure: a sequence alignment for human, chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and gibbon is resampled into bootstrap data sets 1, 2, ..., 1,000; each is analysed to give maximum likelihood trees 1, 2, ..., 1,000, and support values are displayed on the estimated tree.]
In traditional parameter estimation, we attach a confidence interval to indicate the uncertainty involved in the point
estimate of the parameter. This has not been possible in molecular phylogenetics, as concepts such as the variance and
confidence interval are not meaningful when applied to trees. For distance, parsimony and likelihood methods, the
most commonly used procedure to assess the confidence in a tree topology estimate is the bootstrap analysis73. In this
approach, the sites in the sequence alignment are resampled with replacement as many times as the sequence length,
generating a bootstrap pseudo-sample that is of the same size as the original data set. Typically, 100 or 1,000 bootstrap
samples are generated in this way, and each one is analysed in the same way as the original sequence alignment. An
example that uses the maximum likelihood method is illustrated in the figure. The inferred trees from those bootstrap
samples are then tabulated to calculate the bootstrap support values. For every clade in the estimated tree, its
bootstrap support value is simply the proportion of bootstrap trees that include that clade24,65,133. The commonly used
but less satisfactory approach is to use the bootstrap trees to generate a majority-rule consensus tree, which shows a
clade if and only if it occurs in more than half of the bootstrap trees.
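The resampling and tabulation steps can be sketched as below. Tree inference itself is left out: in practice each pseudo-sample would be passed to a tree reconstruction program. The toy alignment, the representation of a tree as a set of clades and the function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def clade_support(clade, bootstrap_trees):
    """Proportion of bootstrap trees that include the clade.

    Each tree is represented as a set of clades, and each clade as a
    frozenset of species names.
    """
    hits = sum(1 for tree in bootstrap_trees if clade in tree)
    return hits / len(bootstrap_trees)


# Hypothetical toy alignment.
alignment = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTACGAAC",
    "sp3": "ACCTACGAAT",
    "sp4": "TCCTACGAAT",
}
rng = random.Random(0)
pseudo_samples = [bootstrap_alignment(alignment, rng) for _ in range(100)]

# Hypothetical trees standing in for the trees inferred from three pseudo-samples.
trees = [
    {frozenset({"sp1", "sp2"}), frozenset({"sp3", "sp4"})},
    {frozenset({"sp1", "sp2"}), frozenset({"sp3", "sp4"})},
    {frozenset({"sp1", "sp3"}), frozenset({"sp2", "sp4"})},
]
print(clade_support(frozenset({"sp1", "sp2"}), trees))
```

Whole columns are resampled, so every pseudo-sample keeps the species aligned with one another at each site.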
Perspectives
We focus here on three research areas that are currently
the focus of much methodological development. The
first is multiple sequence alignment. Many heuristic
methods and programs for aligning sequences exist 99,100,
and improved algorithms continue to appear 101,102.
Efforts have also been taken to infer alignment statistically under an explicit model of insertions and deletions103,104 and to infer alignment and phylogeny jointly
in a Bayesian framework105,106. An advantage of those
model-based alignment methods is that they produce
estimates of insertion and deletion rates. For now, those
algorithms are based on simplistic insertion–deletion
models and involve heavy computation, and so they
do not compare favourably against good heuristic algorithms either in computational efficiency or alignment
quality. Nevertheless, they are biologically appealing,
and improvements are very likely.
The second area of development is molecular clock
estimation of divergence dates. Under the clock assumption, the distance between sequences increases linearly
with the time of divergence, and if a particular divergence can be assigned an absolute geological age based
on the fossil record, the substitution rate can be calculated, and all divergences on the tree can be dated.
Similar ideas can be used to estimate divergence times
of viral strains when sample dates for viral sequences are
available and act as calibrations. However, in practice, the
molecular clock may be violated, especially for distantly
related species, and the fossil record can never provide
unambiguous times of lineage divergence. In the past
several years, advancements have been made using the
Bayesian framework to deal with those issues. Since
the pioneering work of Thorne and colleagues 107,108,
models of evolutionary rate drift over time have been
developed to relax the molecular clock72,109. Soft age
bounds and flexible probability distributions have been
implemented to accommodate uncertainties in fossil
calibrations72,110,111. The fossil record (that is, the presence
and absence of fossils in the rock layers) has also been
statistically analysed to generate calibration densities for
molecular dating analysis112,113.
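The strict-clock calculation described at the start of this paragraph amounts to simple arithmetic, sketched below. The sketch assumes a perfect clock and an error-free calibration, which the text notes are unrealistic in practice; the distances, the calibration age and the function names are hypothetical.

```python
def clock_rate(calibration_distance, calibration_age):
    """Substitution rate per site per unit time from one calibrated divergence.

    Two lineages each accumulate changes for `calibration_age` time units,
    so the pairwise distance is 2 * rate * age.
    """
    return calibration_distance / (2.0 * calibration_age)


def divergence_time(distance, rate):
    """Age of a divergence from a pairwise distance under the strict clock."""
    return distance / (2.0 * rate)


# Hypothetical calibration: a pair 0.20 substitutions per site apart whose
# divergence is assigned an age of 10 million years.
rate = clock_rate(0.20, 10.0)
# Hypothetical pair of interest, 0.06 substitutions per site apart.
age = divergence_time(0.06, rate)
print("rate = %.3f per site per Myr, age = %.1f Myr" % (rate, age))
```

Relaxed-clock models and soft calibration bounds replace the single rate and the fixed calibration age used here with probability distributions.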
The third area of exciting development, which was
mentioned at the beginning of this Review, is statistical
phylogeography20,114–116. The availability of genomic data
at species and population levels offers unprecedented
46. Yang, Z. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367–372 (1996).
47. Philippe, H. et al. Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470, 255–258 (2011).
48. Zhong, B. et al. Systematic error in seed plant phylogenomics. Genome Biol. Evol. 3, 1340–1348 (2011).
49. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981).
This paper introduces the pruning algorithm for likelihood calculation on a tree. This approach forms the basis for modern likelihood and Bayesian methods of phylogenetic analysis.
50. Yang, Z. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42, 294–307 (1996).
51. Felsenstein, J. Phylip: Phylogenetic Inference Program, Version 3.6. (Univ. of Washington, Seattle, 2005).
52. Adachi, J. & Hasegawa, M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28, 1–150 (1996).
53. Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003).
54. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).
55. Zwickl, D. Genetic Algorithm Approaches for the Phylogenetic Analysis of Large Biological Sequence Datasets Under the Maximum Likelihood Criterion. Thesis, Univ. Texas at Austin (2006).
56. Yang, Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314 (1994).
57. Lartillot, N. & Philippe, H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21, 1095–1109 (2004).
58. Blanquart, S. & Lartillot, N. A site- and time-heterogeneous model of amino acid replacement. Mol. Biol. Evol. 25, 842–858 (2008).
59. Goldman, N. Statistical tests of models of DNA substitution. J. Mol. Evol. 36, 182–198 (1993).
60. Zuckerkandl, E. & Pauling, L. in Evolving Genes and Proteins (eds Bryson, V. & Vogel, H. J.) 97–166 (Academic Press, New York, 1965).
61. Nielsen, R. & Yang, Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148, 929–936 (1998).
62. Yang, Z. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15, 568–573 (1998).
63. Yang, Z. & Nielsen, R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19, 908–917 (2002).
64. Huelsenbeck, J. P. & Rannala, B. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276, 227–232 (1997).
65. Whelan, S., Li, P. & Goldman, N. Molecular phylogenetics: state of the art methods for looking into the past. Trends Genet. 17, 262–272 (2001).
66. Rannala, B. & Yang, Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol. 43, 304–311 (1996).
67. Yang, Z. & Rannala, B. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo Method. Mol. Biol. Evol. 14, 717–724 (1997).
68. Mau, B. & Newton, M. A. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. 6, 122–131 (1997).
69. Li, S., Pearl, D. & Doss, H. Phylogenetic tree reconstruction using Markov chain Monte Carlo. J. Am. Stat. Assoc. 95, 493–508 (2000).
70. Larget, B. & Simon, D. L. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16, 750–759 (1999).
71. Huelsenbeck, J. P. & Ronquist, F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001).
122. Yang, Z. & Rannala, B. Bayesian species delimitation using multilocus sequence data. Proc. Natl Acad. Sci. USA 107, 9264–9269 (2010).
This paper describes a Bayesian MCMC method for delimiting species using sequence data from multiple loci under the multi-species coalescent model.
123. Rohland, N. et al. Genomic DNA sequences from mastodon and woolly mammoth reveal deep speciation of forest and savanna elephants. PLoS Biol. 8, e1000564 (2010).
124. Bos, K. I. et al. A draft genome of Yersinia pestis from victims of the Black Death. Nature 478, 506–510 (2011).
125. Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. & Reich, D. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103–1108 (2006).
126. Innan, H. & Watanabe, H. The effect of gene flow on the coalescent time in the human–chimpanzee ancestral population. Mol. Biol. Evol. 23, 1040–1047 (2006).
127. Becquet, C. & Przeworski, M. A new approach to estimate parameters of speciation models with application to apes. Genome Res. 17, 1505–1519 (2007).
128. Hobolth, A., Christensen, O. F., Mailund, T. & Schierup, M. H. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet. 3, e7 (2007).
Acknowledgements
FURTHER INFORMATION
Ziheng Yang's homepage: http://abacus.gene.ucl.ac.uk
Bruce Rannala's homepage: http://www.rannala.org
A comprehensive list of phylogenetic programs maintained by Joe Felsenstein: http://evolution.genetics.washington.edu/phylip/software.html
Nature Reviews Genetics article series on Study designs: http://www.nature.com/nrg/series/studydesigns/index.html