Yang and Rannala 2012 Molecular Phylogenetics.

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

REVIEWS

S T U DY D E S I G N S

Molecular phylogenetics:
principles and practice
Ziheng Yang1,2 and Bruce Rannala1,3

Abstract | Phylogenies are important for addressing various biological questions such
as relationships among species or genes, the origin and spread of viral infection and
the demographic changes and migration patterns of species. The advancement of
sequencing technologies has taken phylogenetic analysis to a new height. Phylogenies
have permeated nearly every branch of biology, and the plethora of phylogenetic
methods and software packages that are now available may seem daunting to an
experimental biologist. Here, we review the major methods of phylogenetic analysis,
including parsimony, distance, likelihood and Bayesian methods. We discuss their
strengths and weaknesses and provide guidance for their use.
Systematics
The inference of phylogenetic
relationships among species
and the use of such information
to classify species.

Taxonomy
The description, classification
and naming of species.

Coalescent
The process of joining ancestral
lineages when the genealogical
relationships of a random
sample of sequences from
a modern population are
traced back.

Center for Computational


and Evolutionary Biology,
Institute of Zoology,
Chinese Academy of Sciences,
Beijing 100101, China.
2
Department of Biology,
University College London,
Darwin Building, Gower
Street, London WC1E 6BT, UK.
3
Genome Center and
Department of Evolution
and Ecology, University of
California, Davis, California
95616, USA.
Correspondence to Z.Y.
email: [email protected]
doi:10.1038/nrg3186
Published online
28 March 2012
1

Before the advent of DNA sequencing technologies,


phylogenetic trees were used almost exclusively to
describe relationships among species in systematics and
taxonomy. Today, phylogenies are used in almost every
branch of biology. Besides representing the relationships among species on the tree of life, phylogenies are
used to describe relationships between paralogues in a
gene family 1, histories of populations2, the evolutionary and epidemiological dynamics of pathogens3,4, the
genealogical relationship of somatic cells during differentiation and cancer development 5 and the evolution
of language6. More recently, molecular phylogenetics
has become an indispensible tool for genome comparisons. In this context, it is used: to classify metagenomic
sequences7; to identify genes, regulatory elements and
non-coding RNAs in newly sequenced genomes810; to
interpret modern and ancient individual genomes1113;
and to reconstruct ancestral genomes14,15.
In other applications, the phylogeny itself may not be
of direct interest but must nevertheless be accounted for
in the analysis. This tree thinking has transformed many
branches of biology. In population genetics, the development of the coalescent theory 16,17 and the widespread
availability of gene sequences for multiple individuals
from the same species have prompted the development
of genealogy-based inference methods, which have revolutionized modern computational population genetics. Here, the gene trees that describe the genealogy of
sequences in a sample are highly uncertain; they are not
of direct interest but nevertheless contain valuable information about parameters in the model. Tree thinking has
also forged a deep synthesis of population genetics and

phylogenetics, creating the emerging field of statistical


phylogeography. In species tree methods2,18,19, the gene
trees at individual loci may not be of direct interest and
may be in conflict with the species tree. By averaging
over the unobserved gene trees under the multi-species
coalescent model20, those methods infer the species tree
despite uncertainty in the gene trees. In comparative
analysis, inference of associations between traits (for
example, testis size and sexual promiscuity) using the
observed traits of modern species should consider
the species phylogeny to avoid misinterpreting historical
contingencies as causal relationships21. In the inference
of adaptive protein evolution, the phylogeny is used to
trace the synonymous and nonsynonymous substitutions along branches to identify cases of accelerated
amino acid change22, even though the phylogeny is not
of direct interest.
Nowadays, every biologist needs to know something
about phylogenetic inference. However, to an experimental biologist who is unfamiliar with the field, the
existence of many analytical methods and software
packages might seem daunting. In this Review, we
describe the suite of current methodologies for phylogenetic inference using sequence data. We also discuss
various statistical criteria that are useful for choosing the
methods that are best suited for a particular question
and data type. Next-generation sequencing (NGS) technologies are generating huge data sets. In the analysis of
such data sets, reducing systematic errors and increasing
robustness to model violations are much more important than reducing random sampling errors. We discuss
several issues in the analysis of large data sets, such as the

NATURE REVIEWS | GENETICS

VOLUME 13 | MAY 2012 | 303


2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
Box 1 | Tree concepts
A phylogeny is a model of genealogical history in
a Rooted tree
b Unrooted tree
which the lengths of the branches are unknown
Time
2
parameters. For example, the phylogeny on the left
0
is generated by two speciation events that occurred
at time points 0 and 1. The branch lengths (b0, b1, b2
b0
b2
and b3) are typically expressed in units of expected

1
number of substitutions per site and measure the
amount of evolution along the branches.
If the substitution rate is constant over time or among
b2
b1
b3
b1
b3
lineages, we say that the molecular clock holds60. The
tree will then have a root and be ultrametric, meaning
1
2
3
1
3
that the distances from the tips of the tree to the root
are all equal (for example, b0+b1=b0+b2=b3). A rooted
tree for s species can then be represented by the ages of the s1 ancestral nodes and thus involves
s1
branch-length
Nature
Reviews
| Genetics
parameters. The procedure of inferring rooted trees by assuming the molecular clock is called molecular clock rooting.
For distantly related species, the clock hypothesis should not be assumed. Most phylogenetic analyses are therefore
conducted without the assumption of the clock. If every branch on the tree is allowed to have an independent
evolutionary rate, commonly used models and methods are unable to identify the location of the root, so only unrooted
trees are inferred. An unrooted tree for s species then has 2s3 branch length parameters. A commonly used strategy to
root the tree is to include outgroup species in the analysis, which are known to be more distantly related than the species
of interest. Although the inferred tree for all species is unrooted, the root is believed to be located along the branch that
leads to the outgroup so that the tree for the ingroup species is rooted. This strategy is called outgroup rooting.

Gene trees
The phylogenetic or
genealogical tree of
sequences at a gene locus
or genomic region.

Statistical phylogeography
The statistical analysis of
population data from closely
related species to infer
population parameters and
processes such as population
sizes, demography, migration
patterns and rates.

Species tree
A phylogenetic tree for a set
of species that underlies the
gene trees at individual loci.

Systematic errors
Errors that are due to an
incorrect model assumption.
They are exacerbated when
the data size increases.

Random sampling errors


Errors or uncertainties in
parameter estimates owing
to limited data.

Cluster algorithm
An algorithm of assigning a
set of individuals to groups (or
clusters) so that objects of the
same cluster are more similar
to each other than those from
different clusters. Hierarchical
cluster analysis can be
agglomerative (starting
with single elements and
successively joining them into
clusters) or divisive (starting
with all objects and successively
dividing them into partitions).

Markov chain
A stochastic sequence (or chain)
of states with the property that,
given the current state, the
probabilities for the next state
do not depend on the past
states.

Transitions
Substitutions between the two
pyrimidines (TC) or between
the two purines (AG).

Transversions
Substitutions between a
pyrimidine and a purine
(T or CA or G).

impact of missing data and strategies of data partitioning. The literature of molecular phylogenetics is large
and complex 23,24; the aim of this Review is to provide a
starting point for exploring the methodsfurther.

Phylogenetic tree reconstruction: basic concepts


A phylogeny is a tree containing nodes that are connected by branches. Each branch represents the persistence of a genetic lineage through time, and each
node represents the birth of a new lineage (BOX1). If
the tree represents the relationship among a group of
species, then the nodes represent speciation events. In
other contexts, the interpretation might be different.
For example, in a gene tree of sequences sampled from a
population, the nodes represent birth events of individuals who are ancestral to the sample, whereas in a tree
of paralogous gene families, the nodes might represent
gene duplicationevents.
Phylogenetic trees are not directly observed and are
instead inferred from sequence or other data. Phylogeny
reconstruction methods are either distance-based or
character-based. In distance matrix methods, the distance between every pair of sequences is calculated,
and the resulting distance matrix is used for tree reconstruction. For instance, neighbour joining 25 applies
a cluster algorithm to the distance matrix to arrive at a
fully resolved phylogeny. Character-based methods
include maximum parsimony, maximum likelihood
and Bayesian inference methods. These approaches
simultaneously compare all sequences in the alignment,
considering one character (a site in the alignment) at a
time to calculate a score for each tree. The tree score is
the minimum number of changes for maximum parsimony, the log-likelihood value for maximum likelihood
and the posterior probability for Bayesian inference. In
theory, the tree with the best score should be identified
by comparing all possible trees. In practice, because of

the huge number of possible trees, such an exhaustive


search is not computationally feasible except for very
small data sets. Instead, heuristic tree search algorithms
are used. These approaches often generate a starting tree
using a fast algorithm and then perform local rearrangements to attempt to improve the tree score. A heuristic
tree search is not guaranteed to find the best tree under
the criterion, but it makes it feasible to analyse large data
sets. To describe the data, distance matrix, maximum
likelihood and Bayesian inference all make use of a substitution model and are therefore model-based, whereas
maximum parsimony does not have an explicit model
and its assumptions are implicit.

Distance matrix method


Distance calculation. Pairwise sequence distances are
calculated assuming a Markov chain model of nucleotide
substitution. Several commonly used models are illustrated in FIG.1. The JC69 model26 assumes an equal rate
of substitution between any two nucleotides, whereas
the K80 model27 assumes different rates for transitions
and transversions. Both models predict equal frequencies of the four nucleotides. The assumption of equal
base frequencies is relaxed in the HKY85 model28 and
the general time reversible (GTR) model29,30. Because
of the variation in local mutation rate and in selective
constraint, different sites in a DNA or protein sequence
often evolve at different rates. In distance calculation,
such rate variation is accommodated by assuming a
gamma () distribution of rates for sites31, leading to
models such as JC69+, HKY85+ or GTR+.
Distance matrix methods. After the distances have been
calculated, the sequence alignment is no longer used
in distance matrix methods. Here we mention three
such methods: least squares, minimum evolution and
neighbour joining. The least squares method32 (see also

304 | MAY 2012 | VOLUME 13

www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
HKY85

K80

JC69
T

Figure 1 | Markov models of nucleotide substitution. The thickness of the arrows


indicates the substitution rates of the four nucleotides (T, C, ANature
and G),
and the| Genetics
sizes
Reviews
of the circles represent the nucleotide frequencies when the substitution process
is in equilibrium. Note that both JC69 and K80 predict equal proportions of the
four nucleotides.

REF.33) minimizes a measure of the differences between

the calculated distances (dij) in the distance matrix


and the expected distances (d ij) on the tree (that is,
the sum of branch lengths on the tree linking the two
species i and j):
s

Q = (dij dij)2
i =1 i =1

(1)

This is the same least squares method used in statistics for fitting a straight line y=a+bx to a scatter plot.
Optimizing branch lengths (or d ij) leads to the score Q
for the given tree, and the tree with the smallest score is
the least squares estimate of the truetree.
The minimum evolution method34,35 uses the tree
length (which is the sum of branch lengths) instead of
Q for tree selection, even though the branch lengths can
still be estimated using the least squares criterion. Under
the minimum evolution criterion, shorter trees are more
likely to be correct than longer treesare.
The most widely used distance method is neighbour
joining 25. This is a cluster algorithm and operates by
starting with a star tree and successively choosing a pair
of taxa to join together (based on the taxon distances),
until a fully resolved tree is obtained. The taxa to be
joined are chosen in order to minimize an estimate of
tree length36. The two joined taxa (for example, species
1 and 2 in FIG.2) are then represented by their ancestor (for example, node y in FIG.2), and the number of
taxa that are connected to the root (node x in FIG. 2)
is reduced by one (FIG.2). The distance matrix is then
updated with the joined taxa replacing the two original taxa. See REF.36 for a discussion of the neighbour
joining updating formula. An efficient implementation
of neighbour joining is found in the program MEGA37
(TABLE1).

Unrooted trees
Phylogenetic trees for
which the location of
the root is unspecified.

Strengths and weaknesses of distance methods. One


advantage of distance methods (especially of neighbour
joining) is their computational efficiency. The cluster
algorithm is fast because it does not need to compare
as many trees under an optimality criterion as maximum parsimony and maximum likelihood do. For this
reason, neighbour joining is useful for analysing large
data sets that have low levels of sequence divergence.

Note that it might be important to use a realistic substitution model to calculate the pairwise distances.
Distance methods can perform poorly for very divergent
sequences because large distances involve large sampling
errors, and most distance methods (such as neighbour
joining) do not account for the high variances of large
distance estimates. Distance methods are also sensitive
to gaps in the sequence alignment 38.

Maximum parsimony
Parsimony tree score. The maximum parsimony method
minimizes the number of changes on a phylogenetic
tree by assigning character states to interior nodes
on the tree. The character (or site) length is the minimum number of changes required for that site, whereas
the tree score is the sum of character lengths over all
sites. The maximum parsimony tree is the tree that
minimizes the treescore.
Some sites are not useful for tree comparison by
parsimony. For example, constant sites, for which the
same nucleotide occurs in all species, have a character
length of zero on any tree. Singleton sites, at which only
one of the species has a distinct nucleotide, whereas all
others are the same, can also be ignored, as the character length is always one. The parsimony-informative
sites are those at which at least two distinct characters
are observed, each at least twice. For four species, only
three site patterns are informative: xxyy, xyxy and xyyx,
where x and y are any two distinct nucleotides. There
are three possible unrooted trees for four species, and
which of them is the maximum parsimony tree depends
on which of the three site patterns occurs most often in
the alignment.
An algorithm for finding the minimum number of
changes on a binary tree (and for reconstructing the
ancestral states to achieve the minimum) was developed
by Fitch39 and Hartigan40. PAUP41, MEGA37 and TNT42
are commonly used parsimony programs.
Parsimony was originally developed for use in analysing discrete morphological characters. During the
late 1970s, it began to be applied to molecular data.
A controversy arose concerning whether parsimony
(without explicit assumptions) or likelihood (with an
explicit evolutionary model) was a better method for
phylogenetic analysis23. The controversy has subsided,
and the importance of model-based inference methods
is broadly recognized. The use of parsimony is still common: not because it is believed to be assumption-free,
but because it often produces reasonable results and is
computationally efficient.
Strengths and weaknesses of parsimony. A strength of
parsimony is its simplicity; it is easy to describe and to
understand, and it is amenable to rigorous mathematical
analysis. The simplicity also helps in the development of
efficient computer algorithms.
A major weakness of parsimony is its lack of explicit
assumptions, which makes it nearly impossible to incorporate any knowledge of the process of sequence evolution in tree reconstruction. The failure of parsimony
to correct for multiple substitutions at the same site

NATURE REVIEWS | GENETICS

VOLUME 13 | MAY 2012 | 305


2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
3
2

4
5

2
x

1
8

6
8

Figure 2 | The neighbour joining algorithm. The neighbour joining algorithm is a


| Genetics
divisive cluster algorithm. It starts from a star tree: two nodes Nature
are thenReviews
joined together
on this tree (in this example, nodes 1 and 2), reducing the number of nodes at the root
(node x) by one. The process is repeated until a fully resolved tree is generated.

makes it suffer from a problem known as long-branch


attraction43. If the correct tree (T1 in FIG.3a) has two
long external branches separated by a short internal
branch, parsimony tends to infer the incorrect tree (T2
in FIG.3b), and the long branches are grouped together.
When the branch lengths in T1 are extreme enough, the
probability for site pattern xxyy, which supports the correct tree T1, may be smaller than that for xyxy, which
supports the incorrect tree T2. Thus, the more sites
there are in the sequence, the more probable it is for
the pattern xxyy to be observed at fewer sites than xyxy,
and the more certain that the incorrect tree T2 will be
chosen to be the maximum parsimony tree. Parsimony
thus converges to a wrong tree and is statistically inconsistent. Long-branch attraction has been demonstrated
in many real and simulated data sets44 and is due to the
failure of parsimony to correct for multiple changes at
the same site or to accommodate parallel changes on the
two long branches. See REFS24,45 for more discussions
of theissue.
Note that model-based methods (namely, distance,
likelihood and Bayesian methods) also suffer from longbranch attraction if the assumed model is too simplistic
and ignores among-site rate variation46. In the reconstruction of deep phylogenies, long-branch attraction
(as well as unequal nucleotide or amino acid frequencies among species) is an important source of systematic
error 47,48 (FIG.3c,d). In such analyses, it is advisable to use
realistic substitution models and likelihood or Bayesian
methodologies. Dense taxon sampling to break long
branches and removing fast-evolving proteins or sites
can also behelpful.

Long-branch attraction
The phenomenon of inferring
an incorrect tree with long
branches grouped together by
parsimony or by model-based
methods under simplistic
models.

Maximum likelihood
Basis of maximum likelihood. Maximum likelihood was
developed by R. A. Fisher in the 1920s as a statistical
methodology for estimating unknown parameters in a
model. The likelihood function is defined as the probability of the data given the parameters but is viewed as
a function of the parameters with the data observed and
fixed. It represents all information in the data about the
parameters. The maximum likelihood estimates (MLEs)
of parameters are the parameter values that maximize the
likelihood. Most often, the MLEs are found numerically

using iterative optimization algorithms. The MLEs have


desirable asymptotic (large-sample) properties: they are
unbiased, consistent (they approach the true values)
and efficient (they have the smallest variance among
unbiased estimates).
Maximum likelihood tree reconstruction. The first
algorithm for maximum likelihood analysis of DNA
sequence data was developed by Felsenstein49. The
method is now widely used owing to the increased computing power and software implementations and to the
development of increasingly realistic models of sequence
evolution. Note that two optimization steps are involved
in maximum likelihood tree estimation: optimization of
branch lengths to calculate the tree score for each candidate tree and a search in the tree space for the maximum likelihood tree. From a statistical point of view,
the tree (topology) is a model instead of a parameter,
whereas branch lengths on the given tree and substitution parameters are parameters in the model. Maximum
likelihood tree inference is thus equivalent to comparing many statistical models, each with the same number
of parameters. The attractive asymptotic properties of
MLEs mentioned above apply to parameter estimation
when the true tree is given but not to the maximum
likelihood tree24,50.
Calculation of the likelihood on a given tree under
various substitution models is explained in REFS23,24.
All substitution models used in distance calculation
can be used here. Indeed, joint comparison of many
sequences by likelihood makes it feasible to accommodate much more sophisticated models of sequence
evolution. Most models used in molecular phylogenetics
assume independent evolution of sites in the sequence
so that the likelihood is a product of the probabilities
for different sites. The probability at any particular site
is an average over the unobserved character states at the
ancestral nodes. Likelihood and parsimony analyses are
similar in this respect, although parsimony only uses
the optimal ancestral states, whereas likelihood averages
over all possiblestates.
Early maximum likelihood implementations include
PHYLIP51, MOLPHY52 and PAUP* 4.0 (REF.41). Modern
implementations, such as PhyML53, RAxML 54 and
GARLI55, are not only computationally much faster but
are also more effective in finding trees with high likelihood scores. The recent inclusion of maximum
likelihood in MEGA 5 (REF.37) has made the method
more accessible to biologists who are not experienced
computer users (TABLE1).
Strengths and weaknesses of the maximum likelihood
method. One advantage of the maximum likelihood
method is that all of its model assumptions are explicit,
so that they can be evaluated and improved. The availability of a rich repertoire of sophisticated evolutionary models in the likelihood (and Bayesian) method is
one of its major advantages over maximum parsimony.
Modern inferences of deep phylogenies using conserved
proteins almost exclusively rely on likelihood and
Bayesian methods. For such inference, it is important

306 | MAY 2012 | VOLUME 13

www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
Table 1 | Functionalities of a few commonly used phylogenetic programs
Name

Brief description

Link

Refs

Bayesian evolutionary
analysis sampling trees
(BEAST)

A Bayesian MCMC program for inferring rooted trees under the clock or
relaxed-clock models. It can be used to analyse nucleotide and amino acid
sequences, as well as morphological data. A suite of programs, such as Tracer
and FigTree, are also provided to diagnose, summarize and visualize results

http://beast.bio.ed.ac.uk

Genetic algorithm for


A program that uses genetic algorithms to search for maximum likelihood trees. http://code.google.com/p/
rapid likelihood inference It includes the GTR+ model and special cases and can analyse nucleotide,
garli
(GARLI)
amino acid and codon sequences. A parallel version is also available

135

55

Hypothesis testing using


phylogenies (HYPHY)

A maximum likelihood program for fitting models of molecular evolution. It


implements a high-level language that the user can use to specify models and
to set up likelihood ratio tests

http://www.hyphy.org

136

Molecular evolutionary
genetic analysis (MEGA)

A Windows-based program with a full graphical user interface that can be


run under Mac OSX or Linux using Windows emulators. It includes distance,
parsimony and likelihood methods of phylogeny reconstruction, although its
strength lies in the distance methods. It incorporates the alignment program
ClustalW and can retrieve data from GenBank

http://www.megasoftware.net

37

MrBayes

A Bayesian MCMC program for phylogenetic inference. It includes all of the


models of nucleotide, amino acid and codon substitution developed for
likelihood analysis

http://mrbayes.net

71

Phylogenetic analysis
by maximum likelihood
(PAML)

A collection of programs for estimating parameters and testing hypotheses


http://abacus.gene.ucl.ac.uk/
using likelihood. It is mostly used for tests of positive selection, ancestral
software
reconstruction and molecular clock dating. It is not appropriate for tree searches

137

Phylogenetic analysis
PAUP* 4.0 is still a beta version (at the time of writing). It implements parsimony, http://www.sinauer.com/
using parsimony* and
distance and likelihood methods of phylogeny reconstruction
detail.php?id=8060
other methods (PAUP* 4.0)
PHYLIP

A package of programs for phylogenetic inference by distance, parsimony and


likelihood methods

http://evolution.
gs.washington.edu/phylip.html

PhyML

A fast program for searching for the maximum likelihood trees using nucleotide http://www.atgc-montpellier.
or protein sequence data
fr/phyml/binaries.php

53

RAxML

A fast program for searching for the maximum likelihood trees under the GTR
model using nucleotide or amino acid sequences. The parallel versions are
particularly powerful

http://scoh-its.org/exelixis/
software.html

54

Tree analysis using new


technology (TNT)

A fast parsimony program intended for very large data sets

http://www.zmuc.dk/public/
phylogeny/TNT

42

Note: all programs can run on Windows, Mac OSX and Unix or Linux platforms. Except for PAUP*, which charges a nominal fee, all packages are free for download.
See Felsensteins comprehensive list of programs at http://evolution.genetics.washington.edu/phylip/software.html. GTR, general time reversible; MCMC, Markov
chain Monte Carlo.

Likelihood ratio test


A general hypothesis-testing
method that uses the
likelihood to compare two
nested hypotheses, often using
the 2 distribution to assess
significance.

Molecular clock
The hypothesis or observation
that the evolutionary rate
is constant over time or
across lineages.

Prior distribution
The distribution assigned
to parameters before the
analysis of the data.

Posterior distribution
The distribution of the
parameters (or models)
conditional on the data. It
combines the information
in the prior and in the data
(likelihood).

for the model to accommodate variable amino acid substitution rates among sites56 or even different amino acid
frequencies among sites57,58.
Maximum likelihood has a clear advantage over distance or parsimony methods if the aim is to understand
the process of sequence evolution. The likelihood ratio test
can be used to examine the fit of evolutionary models59
and to test interesting biological hypotheses, such as the
molecular clock60,49 and Darwinian selection affecting protein evolution6163. See REFS 22,24,64,65 for summaries of
such tests in phylogenetics.
The main drawback of maximum likelihood is that
the likelihood calculation and, in particular, tree search
under the likelihood criterion is computationally
demanding. Another drawback is that the method has
potentially poor statistical properties if the model is misspecified. This is also true for Bayesian analysis (TABLE2).

Bayesian methods
Basis of Bayesian inference. Bayesian inference is a general methodology of statistical inference. It differs from
maximum likelihood in that parameters in the model
are considered to be random variables with statistical

distributions, whereas in maximum likelihood they


are unknown fixed constants. Before the analysis of the
data, parameters are assigned a prior distribution, which
is combined with the data (or likelihood) to generate
the posterior distribution. All inferences concerning the
parameters are then based on the posterior distribution. In the past two decades, Bayesian inference has
gained popularity thanks to advances in computational
methods, especially Markov chain Monte Carlo algorithms
(MCMC algorithms).
Bayesian phylogenetics. Bayesian inference was introduced to molecular phylogenetics in the late 1990s6669.
The early methods assumed a molecular clock 60.
Development of more efficient MCMC algorithms 70
that eliminate the clock assumption (allowing independent branch lengths on unrooted trees) and the
release of the program MrBayes 71 made the method
popular among molecular systematists. A more recent
Bayesian implementation in the program BEAST72 uses
the so-called relaxed-clock models to infer rooted trees
even though the model allows substitution rates to vary
across lineages (TABLE1).

NATURE REVIEWS | GENETICS

VOLUME 13 | MAY 2012 | 307


2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
c The Gnepine tree

a Correct tree, T1
1

100

Ginkgo
Cycas

100
92

Amborella
Cryptomeria

100

100

68

81

4
100

b Wrong tree, T2

100

Keteleeria
Pinus

100

Cupressophyta
Ephedra
Welwitschia
Gnetum

Gnetales
Pinaceae

Psilotum
Huperzia
Marchantia
Physcomitrella
0.1

d The GneCup tree


Psilotum

2
90

100

100

Amborella
Cryptomeria
100

100
100
100

100

Keteleeria
100
Pinus
Ginkgo
Cycas
Huperzia

100

Cupressophyta
Ephedra
Welwitschia Gnetales
Gnetum
Pinaceae

Marchantia
Physcomitrella
0.1

Figure 3 | Long-branch attraction in theory and in practice. Panels a and b show the fourspecies case analysed
Nature Reviews | Genetics
by Felsenstein43. If the correct tree (T1 in a) has two long branches separated by a short internal branch, parsimony
(as well as model-based methods such as likelihood and Bayesian methods under simplistic models) tends to recover a
wrong tree (T2 in b), in which the two long branches are grouped together. Panels c and d show a similar phenomenon
in a real data set, concerning the phylogeny of seed plants134. The Gnetales is a morphologically and ecologically
diverse group of Gymnosperms including three genera (Ephedra, Gnetum and Welwitschia), but its phylogenetic
position has been controversial. Maximum likelihood analysis of 56 chloroplast proteins produced the GneCup tree
(d), in which the Gnetales are grouped with Cupressophyta, apparently owing to a long-branch attraction artefact.
However, the Gnepine tree (c), in which the Gnetales joins the Pinaceae, was inferred by excluding the fastest-evolving
18 proteins as well as three proteins (namely, psbC, rpl2 and rps7) that had experienced many parallel substitutions
between the Cryptomeria branch and the branch ancestral to the Gnetales. The Gnepine tree (c) is also supported by
two proteins from the nuclear genome and appears to be the correct tree. Branch lengths and bootstrap proportions
are all calculated using RAxML. See REF.134 for details.

Bayesian inference relies on Bayess theorem, which


statesthat
P(T,)P(D|T,)
(2)
P(T,|D) =
P(D)

Markov chain Monte Carlo


algorithms
(MCMC algorithms). A Monte
Carlo simulation is a computer
simulation of a biological
process using random
numbers. An MCMC algorithm
is a Monte Carlo simulation
algorithm that generates
a sample from a target
distribution (often a Bayesian
posterior distribution).

where P(T,) is the prior probability for tree T and


parameter , P(D|T,) is the likelihood or probability
of the data given the tree and parameter, and P(T,|D)
is the posterior probability. The denominator P(D) is
a normalizing constant, as its role is to ensure that
P(T,|D) sums over the trees and integrates over the
parameters to one. The theorem states that the posterior
is proportional to the prior times the likelihood, or the
posterior information is the prior information plus
the data information.
In general, the posterior probabilities of trees cannot be directly calculated. In particular, the normalizing
constant P(D) involves high-dimensional integrals (over
all possible parameter values) and summation over all
possible trees. Instead, Bayesian phylogenetic inference

relies on MCMC algorithms to generate a sample from


the posterior distribution. This is illustrated in BOX2.
See Chapter 5 of REF.24 for an introduction toMCMC.
Strengths and weaknesses of the Bayesian inference
method. Both likelihood and Bayesian methods use
the likelihood function and thus share many statistical
properties, such as consistency and efficiency. However,
maximum likelihood and Bayesian inference represent opposing philosophies of statistical inference. The
same feature of Bayesian inference may thus be viewed
as either a strength or a weakness, depending on ones
philosophy. See REF.24 for a brief description of the controversy. Here we comment on two issues relating to the
interpretability of the results and to the practicalities of
incorporating prior information in themodel.
First, Bayesian statistics is known to answer the
biological questions directly and yields results that are
easy to interpret: the posterior probability of a tree is
simply the probability that the tree is correct, given the

308 | MAY 2012 | VOLUME 13

www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
Table 2 | A summary of strengths and weaknesses of different tree reconstruction methods
Strengths

Weaknesses

Parsimony methods
Simplicity and intuitive appeal
The only framework appropriate for some data
(such as SINES and LINES)

Assumptions are implicit and poorly understood


Lack of a model makes it nearly impossible to incorporate our
knowledge of sequence evolution
Branch lengths are substantially underestimated when
substitution rates are high
Maximum parsimony may suffer from long-branch attraction

Distance methods
Fast computational speed
Can be applied to any type of data as long as a
genetic distance can be defined
Models for distance calculation can be chosen
to fit data

Most distance methods, such as neighbour joining, do not


consider variances of distance estimates
Distance calculation is problematic when sequences are
divergent and involve many alignment gaps
Negative branch lengths are not meaningful

Likelihood methods
Can use complex substitution models to
approach biological reality
Powerful framework for estimating parameters
and testing hypotheses

Maximum likelihood iteration involves heavy computation


The topology is not a parameter so that it is difficult to apply
maximum likelihood theory for its estimation. Bootstrap
proportions are hard to interpret

Bayesian methods
Can use realistic substitution models, as in
maximum likelihood
Prior probability allows the incorporation of
information or expert knowledge
Posterior probabilities for trees and clades have
easy interpretations

Clades
Groups of species that
have descended from a
common ancestor.

Markov chain Monte Carlo (MCMC) involves heavy computation


In large data sets, MCMC convergence and mixing problems can
be hard to identify or rectify
Uninformative prior probabilities may be difficult to specify.
Multidimensional priors may have undue influence on the
posterior without the investigators knowledge
Posterior probabilities often appear too high
Model selection involves challenging computation138,139

data and model. By contrast, concepts such as the confidence interval in a likelihood analysis have a contrived
interpretation that eludes many users of statistics. In
phylogenetics, it has not been possible to define a confidence interval for the tree. The widely used bootstrap
method73 (BOX3) has been difficult to interpret despite
numerous efforts7477. However, the odds are not entirely
against maximum likelihood. Posterior probabilities for
trees and clades that have been calculated from real data
sets often appear to be too high66,7880. In many analyses,
nearly all nodes had posterior probabilities of ~100%.
Posterior tree probabilities are also sensitive to model
violations, and use of simplistic models may lead to
inflated posterior probabilities81.
Second, the prior probability allows incorporation
of apriori information about the trees or parameters.
However, such information is rarely available, and
specification of the prior is most often a burden on the
user; almost all data analyses are conducted using
the default priors in the computer program. Highdimensional priors are notoriously hard to specify, and
an innocent-looking prior can have an undue and unexpected influence on the posterior. For example, it has
recently been pointed out that the independent exponential prior on branch lengths used by MrBayes can
induce a strongly informative and unreasonable prior
on the tree length, producing unreasonably long trees in
some data sets8284. It is therefore important to conduct
Bayesian robustness analysis to assess the impact of the
prior on the posterior estimates.

Statistical assessments of phylogenetic methods


The aim of phylogenetic inference is to estimate the tree
topology and possibly also the branch lengths. Four
criteria have been used to judge tree reconstruction
methods.
Consistency. An estimation method is said to be consistent if the estimate converges to the true parameter value
when the amount of data approaches infinity. A tree
reconstruction method is consistent if the estimated
tree converges to the true tree when the number of sites
in the sequence grows. Model-based methods (that is,
distance matrix, maximum likelihood and Bayesian
inference) are consistent if the assumed model is correct.
Parsimony may be inconsistent under some model tree
combinations; Felsensteins43 demonstration of this has
spurred much heated discussion.
Efficiency. In the statistical estimation of a parameter,
an unbiased estimate with a smaller variance is more
efficient than one with a larger variance. In phylogenetics, efficiency may be measured by the probability of
recovering the correct tree or subtree given the number
of sites. This can be estimated by computer simulation.
The complexity of tree reconstruction means that the
asymptotic theory of MLEs does not apply. Nevertheless,
computer simulations have generally found a higher
efficiency of maximum likelihood than maximum
parsimony or neighbour joining in recovering the
correcttree23.

NATURE REVIEWS | GENETICS

VOLUME 13 | MAY 2012 | 309


2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
Box 2 | Markov chain Monte Carlo
Markov chain Monte Carlo (MCMC) is a simulation algorithm in which one moves
from one tree (or parameter value) to another and, in the long run, visits the trees
(or parameters) in proportion to their posterior probabilities. The tree parameter set
(T,) constitutes the state of the algorithm. Here, parameters may include the
branch lengths of the tree and parameters in the evolutionary model, such as
the transition/transversion rate ratio. The following scheme demonstrates the main
features of MCMC algorithms.
Step 1. Initialization. Choose a starting tree and starting parameters at random (T,).
Step 2. Main loop.
Step 2a. Proposal to change the tree T. Propose a new tree, T*, by changing the
current tree, T. If T* has higher posterior probability than the current tree,
P(T*,|D) > P(T,|D), accept the new tree T*. Otherwise, accept T* with probability:
P(T*,|D)
P(T,|D)

P(T*,)P(D|T*,)
P(T,)P(D|T,)

(3)

If T* is accepted, set T=T*.


Step 2b. Proposal to change parameters . Propose new parameter value, *, by
changing the current . Here, for simplicity, we assume that the proposals are
symmetrical so that the probability of proposing * from equals the probability of
proposing from *. If P(T,*|D) > P(T,|D), accept new *. Otherwise, accept *
with probability:
P(T,*|D)
P(T,|D)

P(T,*)P(D|T,*)
P(T,)P(D|T,)

(4)

If the new * is accepted, set =*.


Step 2c. Sample from the chain. Print out (T,).
Note that first the algorithm does not need calculation of the normalizing constant
P(D), as it cancels in the posterior ratios in proposal steps 2a and 2b. Second, in the
long run, a tree parameter set (T1,1) will be visited more often by the algorithm than
another set (T2,2) if its posterior probability is higher: P(T1,1|D) > P(T2,2|D). Indeed,
the expected proportion of time that the algorithm spends in any tree T is exactly
its posterior probability: P(T|D). Thus, by counting the frequencies at which each tree
is visited in the algorithm, we get an MCMC estimate of the posterior probabilities
for the trees.
The sequence (or chain) of values for (T,) generated by the algorithm has the
property that, given the current state (T,), the probabilities by which it moves to new
states do not depend on past states. This memory-less property is known as the
Markovian property, which states that given the present, the future does not depend
on the past. The generated sequence is called a Markov chain. The algorithm is
called Markov chain Monte Carlo because the Markov chain is generated by Monte
Carlo simulation.

Robustness. A method is robust if it gives correct


answers even when its assumptions are violated. Some
assumptions matter more than others. With the rapid
accumulation of sequence data, sampling errors in tree
reconstruction are considerably reduced, so systematic errors or robustness of the method become more
important.

Graphical processing units


(GPU). Specialized units that
are traditionally used to
manipulate output on a video
display and have recently been
explored for use in parallel
computation.

Computational speed. This property is easy to assess.


Neighbour joining uses a cluster algorithm to arrive
at a tree and is very fast. Methods that search for the
best tree under a criterion, such as maximum evolution, maximum parsimony and maximum likelihood, are slower. The computational speed of the
Bayesian method depends on the length of the chain
(generated by MCMC algorithms), which is highly

data-dependent. As calculation of the phylogenetic


likelihood is expensive, maximum likelihood and
Bayesian inference are typically slower than maximum parsimony. Nevertheless, considerable advancements in computational algorithms 5355 have made
likelihood-based methods feasible for the analysis of
large data sets. Algorithms that take advantage of new
computers with multicore processors and graphical
processing units (GPUs)85,86 are pushing the boundary
even further.

Phylogenomic analysis of large data sets


With the advent of new sequencing technologies and
the completion of various genome projects, phylo
genetics has entered the era of genome-scale data sets.
Here we discuss a few issues relating to the analysis of
such large datasets.
Super tree and supermatrix approaches. Two
approaches have been advocated for the phylogenetic
analysis of hundreds or thousands of genes or proteins,
especially when some loci are missing in some species. The supertree approach separately analyses each
gene and then uses heuristic algorithms to assemble
the subtrees for individual genes into a supertree for
all species87. The separate analysis is useful for studying the differences in the reconstructed subtrees or the
prevalence of horizontal gene transfer. However, it is
inefficient for estimating a common phylogeny that
underlies allgenes.
In the supermatrix approach, sequences for multiple
genes are concatenated to generate a data supermatrix, in
which missing data are replaced by question marks, and
the supermatrix is then used for tree reconstruction88.
Most supermatrix analyses ignore differences in evolutionary dynamics among the genes. Note that a supermatrix analysis that assumes different evolutionary
models and different trees and branch lengths for the
genes is equivalent to a separate or supertree analysis. When a common tree underlies all genes, the ideal
approach should be a combined (supermatrix) analysis of
all genes, using the likelihood to accommodate the amonggene heterogeneity in the evolutionary process8991. Our
comments below relate to this combined approach.
Impact of missing data. Many genomic data sets are
highly incomplete, and so most cells in the species by
gene matrix will be empty. Although, in theory, the likelihood function (in the maximum likelihood and Bayesian
methods) can properly accommodate missing data23,24,
the impact of such large-scale missing data and alignment gaps is not well-understood. Simulations suggest
that maximum likelihood and Bayesian inference generally perform better than neighbour joining or maximum
parsimony in dealing with missing data, and Bayesian
inference was found to perform the best 9294. The poor
performance of neighbour joining may be understood
if one considers extreme cases in which, after removal
of alignment gaps, the pairwise distances are calculated
from different sets of genes or sites, some of which are
fast-evolving, whereas others are slowly evolving.

310 | MAY 2012 | VOLUME 13

www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
Box 3 | Sampling error in the estimated tree and bootstrap analysis
Sequence alignment

NENLFASFIA
NENLFASFAA
NENLFASFAA
NENLFASFIA
NEDLFTPFTT
NESLFTPFIT
NENLFTSFAT

PTVLGLPAAV
PTILGLPAAV
PTILGLPAAV
PTILGLPAAV
PTVLGLPAAI
PTVLGLPAAV
PTILGLPAAV

Chimpanzee
Bonobo
99.4 00
1
Gorilla

...
...
...
...
...
...
...

Human

100

Human
Chimpanzee
Bonobo
Gorilla
Bornean orangutan
Sumatran orangutan
Gibbon

Maximum likelihood tree inferred from original data

100

Gibbon
Bootstrap
data set 1

Bootstrap
data set 2

...

Bootstrap
data set 1,000

Maximum
likelihood
tree 1

Maximum
likelihood
tree 2

...

Maximum
likelihood
tree 1,000

Bornean orangutan
Sumatran orangutan

Use maximum likelihood trees


from the bootstrap data sets to
place support values on the
original maximum likelihood tree

Nature
Reviews
| Genetics
In traditional parameter estimation, we attach a confidence interval to indicate the uncertainty
involved
in the
point
estimate of the parameter. This has not been possible in molecular phylogenetics, as concepts such as the variance and
confidence interval are not meaningful when applied to trees. For distance, parsimony and likelihood methods, the
most commonly used procedure to assess the confidence in a tree topology estimate is the bootstrap analysis73. In this
approach, the sites in the sequence alignment are resampled with replacement as many times as the sequence length,
generating a bootstrap pseudo-sample that is of the same size as the original data set. Typically, 100 or 1,000 bootstrap
samples are generated in this way, and each one is analysed in the same way as the original sequence alignment. An
example that uses the maximum likelihood method is illustrated in the figure. The inferred trees from those bootstrap
samples are then tabulated to calculate the bootstrap support values. For every clade in the estimated tree, its
bootstrap support value is simply the proportion of bootstrap trees that include that clade24,65,133. The commonly used
but less satisfactory approach is to use the bootstrap trees to generate a majority-rule consensus tree, which shows a
clade if and only if it occurs in more than half of the bootstrap trees.

Importance of systematic errors. In the analysis of very


large data sets, almost all bootstrap support values or
Bayesian posterior probabilities are calculated to be
100%, even though the inferred phylogenies might
be conflicting across genes or might depend on the
method and model used47. Systematic biases are thus
much more important than random sampling errors in
such analyses, and methods that are robust to violations
of model assumptions, even if they are less efficient,
should be preferable.

Current strategies for data partitioning include


partitioning genes according to their relative substitution rates97 and separating the three codon positions in
coding genes into different partitions89. The likelihood
ratio test has also been used to decide whether two genes
should be in the same or different partitions98. In summary, data partitioning is more of an art than a science,
and it should rely on our knowledge of the biological system: for example, on whether it is reasonable to assume
that the same phylogeny underlies allgenes.

Data-partitioning strategies. The rationale for data


partitioning is to group genes or sites with similar
evolutionary characteristics into the same partition so
that all sites in the same partition are described using
the same model, and different partitions use different
models89,90. Partitioning too finely increases computation time and can cause over-fitting, but partitioning
too coarsely may lead to under-fitting or model violation. However, the situation is complicated, as some
models instead allow random variation among sites in
substitution rate31,56,61, in amino acid frequencies57,58,95 or
in the pattern of substitution96. Such mixture models use
a statistical distribution to accommodate the amongsite heterogeneity without data partitioning. Often, the
choice between using partition or mixture models is a
philosophical one: it corresponds, respectively, to the
preference for fixed-effects models or random-effects
models in statistics.

Perspectives
We focus here on three research areas that are currently
the focus of much methodological development. The
first is multiple sequence alignment. Many heuristic
methods and programs for aligning sequences exist 99,100,
and improved algorithms continue to appear 101,102.
Efforts have also been taken to infer alignment statistically under an explicit model of insertions and deletions103,104 and to infer alignment and phylogeny jointly
in a Bayesian framework105,106. An advantage of those
model-based alignment methods is that they produce
estimates of insertion and deletion rates. For now, those
algorithms are based on simplistic insertiondeletion
models and involve heavy computation, and so they
do not compare favourably against good heuristic algorithms either in computational efficiency or alignment
quality. Nevertheless, they are biologically appealing,
and improvements are verylikely.

NATURE REVIEWS | GENETICS

VOLUME 13 | MAY 2012 | 311


2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
The second area of development is molecular clock
estimation of divergence dates. Under the clock assumption, the distance between sequences increases linearly
with the time of divergence, and if a particular divergence can be assigned an absolute geological age based
on the fossil record, the substitution rate can be calculated, and all divergences on the tree can be dated.
Similar ideas can be used to estimate divergence times
of viral strains when sample dates for viral sequences are
available and act as calibrations. However, in practice, the
molecular clock may be violated, especially for distantly
related species, and the fossil record can never provide
unambiguous times of lineage divergence. In the past
several years, advancements have been made using the
Bayesian framework to deal with those issues. Since
the pioneering work of Thorne and colleagues 107,108,
models of evolutionary rate drift over time have been
developed to relax the molecular clock72,109. Soft age
bounds and flexible probability distributions have been
implemented to accommodate uncertainties in fossil
calibrations72,110,111. The fossil record (that is, the presence
and absence of fossils in the rock layers) has also been
statistically analysed to generate calibration densities for
molecular dating analysis112,113.
The third area of exciting development, which was
mentioned at the beginning of this Review, is statistical
phylogeography 20,114116. The availability of genomic data
at species and population levels offers unprecedented
1.
2.
3.
4.
5.
6.

7.
8.

9.
10.
11.
12.

13.
14.
15.

Maser, P. etal. Phylogenetic relationships within cation


transporter families of Arabidopsis. Plant Physiol.
126, 16461667 (2001).
Edwards, S.V. Is a new and general theory of molecular
systematics emerging? Evolution 63, 119 (2009).
Marra, M.A. etal. The genome sequence of the SARSassociated coronavirus. Science 300, 13991404
(2003).
Grenfell, B.T. etal. Unifying the epidemiological and
evolutionary dynamics of pathogens. Science 303,
327332 (2004).
Salipante, S.J. & Horwitz, M.S. Phylogenetic fate
mapping. Proc. Natl Acad. Sci. USA 103, 54485453
(2006).
Gray, R.D., Drummond, A.J. & Greenhill, S.J.
Language phylogenies reveal expansion pulses and
pauses in pacific settlement. Science 323, 479483
(2009).
Brady, A. & Salzberg, S. PhymmBL expanded:
confidence scores, custom databases, parallelization
and more. Nature Methods 8, 367 (2011).
Kellis, M., Patterson, N., Endrizzi, M., Birren, B. &
Lander, E.S. Sequencing and comparison of yeast
species to identify genes and regulatory elements.
Nature 423, 241254 (2003).
Pedersen, J.S. etal. Identification and classification
of conserved RNA secondary structures in the human
genome. PLoS Comput. Biol. 2, e33 (2006).
Lindblad-Toh, K. etal. A high-resolution map of human
evolutionary constraint using 29 mammals. Nature
478, 476482 (2011).
Green, R.E. etal. A draft sequence of the Neandertal
genome. Science 328, 710722 (2010).
Gronau, I., Hubisz, M.J., Gulko, B., Danko, C.G. &
Siepel, A. Bayesian inference of ancient human
demography from individual genome sequences.
Nature Genet. 43, 10311034 (2011).
Li, H. & Durbin, R. Inference of human population
history from individual whole-genome sequences.
Nature 475, 493496 (2011).
Paten, B. etal. Genome-wide nucleotide-level
mammalian ancestor reconstruction. Genome Res. 18,
18291843 (2008).
Ma, J. Reconstructing the history of large-scale genomic
changes: biological questions and computational
challenges. J.Comput. Biol. 18, 879893 (2011).

opportunities for addressing interesting questions in


evolutionary biology. Multi-locus sequence data can
be used to estimate divergence times between closely
related species and the sizes of both extant and extinct
populations 117,118 to infer population demographic
changes and to estimate migration patterns and
rates119,120. Such data can also be used to delimit species (that is, to determine whether a population consists of one or two species, for example)121,122. The past
few years have seen the appearance of many individual
genome sequences and the rise of population genomics.
Currently, the data are mostly from humans and their
close relatives, but genomes from other species are
being sequenced as well, such as mastodons and mammoths123 and the bacterium Yersinia pestis from Black
Death victims124. Genomic sequence data from humans
and apes are used to infer the species divergence times
and to test for possible hybridization during the
humanchimpanzee separation125131. Comparison of
a few human individual genomes provides insights
into the recent demographic history of our species12,13,
whereas sequencing of the Neanderthal genome allows
estimation of the Neanderthal contribution to the
genome of modern humans11,132. The size of the data and
the complexity of the model pose great statistical
and computational challenges. Again, Bayesian MCMC
algorithms, under the multi-species coalescent model118,
provide the natural framework for inference.

16. Kingman, J.F.C. On the genealogy of large


populations. J.Appl. Probab. 19A, 2743 (1982).
17. Kingman, J.F.C. The coalescent. Stoch. Process. Appl.
13, 235248 (1982).
18. Edwards, S.V., Liu, L. & Pearl, D.K. High-resolution
species trees without concatenation. Proc. Natl Acad.
Sci. USA 104, 59365941 (2007).
This paper introduces a method for estimating the
species tree despite the presence of conflicting
gene trees.
19. Than, C. & Nakhleh, L. Species tree inference by
minimizing deep coalescences. PLoS Comput. Biol. 5,
e1000501 (2009).
20. Rannala, B. & Yang, Z. Phylogenetic inference using
whole genomes. Annu. Rev. Genomics Hum. Genet. 9,
217231 (2008).
21. Felsenstein, J. Phylogenies and the comparative
method. Am. Nat. 125, 115 (1985).
This paper introduces the bootstrap approach to
phylogenetic analysis. This is the most commonly
used method for assessing sampling errors in
estimated phylogenies.
22. Yang, Z. in Handbook of Statistical Genetics
(eds Balding, D., Bishop, M. & Cannings, C.) 377406
(Wiley, New York, 2007).
23. Felsenstein, J. Inferring Phylogenies (Sinauer
Associates, Sunderland, Massachusetts, 2004).
24. Yang, Z. Computational Molecular Evolution (Oxford
Univ. Press, UK, 2006).
25. Saitou, N. & Nei, M. The neighbor-joining method:
a new method for reconstructing phylogenetic trees.
Mol. Biol. Evol. 4, 406425 (1987).
26. Jukes, T.H. & Cantor, C.R. in Mammalian Protein
Metabolism (ed. Munro, H.N.) 21123 (Academic
Press, New York, 1969).
27. Kimura, M. A simple method for estimating
evolutionary rate of base substitution through
comparative studies of nucleotide sequences. J.Mol.
Evol. 16, 111120 (1980).
28. Hasegawa, M., Kishino, H. & Yano, T. Dating the
humanape splitting by a molecular clock of
mitochondrial DNA.J.Mol. Evol. 22, 160174
(1985).
29. Tavar, S. Some probabilistic and statistical problems
on the analysis of DNA sequences. Lect. Math. Life Sci.
17, 5786 (1986).

312 | MAY 2012 | VOLUME 13

30. Yang, Z. Estimating the pattern of nucleotide


substitution. J.Mol. Evol. 39, 105111 (1994).
31. Yang, Z. Maximum-likelihood estimation of phylogeny
from DNA sequences when substitution rates differ
over sites. Mol. Biol. Evol. 10, 13961401 (1993).
32. Cavalli-Sforza, L.L. & Edwards, A.W.F. Phylogenetic
analysis: models and estimation procedures. Evolution
21, 550570 (1967).
33. Fitch, W.M. & Margoliash, E. Construction of
phylogenetic trees. Science 155, 279284 (1967).
34. Rzhetsky, A. & Nei, M. A simple method for estimating
and testing minimum-evolution trees. Mol. Biol. Evol.
9, 945967 (1992).
35. Desper, R. & Gascuel, O. Fast and accurate phylogeny
reconstruction algorithms based on the minimumevolution principle. J.Comput. Biol. 9, 687705 (2002).
36. Gascuel, O. & Steel, M. Neighbor-joining revealed.
Mol. Biol. Evol. 23, 19972000 (2006).
37. Tamura, K. etal. MEGA5: molecular evolutionary
genetics analysis using maximum likelihood,
evolutionary distance, and maximum parsimony
methods. Mol. Biol. Evol. 28, 27312739 (2011).
38. Bruno, W.J., Socci, N.D. & Halpern, A.L. Weighted
neighbor joining: a likelihood-based approach to
distance-based phylogeny reconstruction. Mol. Biol.
Evol. 17, 189197 (2000).
39. Fitch, W.M. Toward defining the course of evolution:
minimum change for a specific tree topology. Syst.
Zool. 20, 406416 (1971).
40. Hartigan, J.A. Minimum evolution fits to a given tree.
Biometrics 29, 5365 (1973).
41. Swofford, D.L. PAUP*: Phylogenetic Analysis
by Parsimony (and Other Methods) 4.0 Beta
(Sinauer Associates, Massachusetts, 2000).
42. Goloboff, P.A., Farris, J.S. & Nixon, K.C.
TNT, a free program for phylogenetic analysis.
Cladistics 24, 774786 (2008).
43. Felsenstein, J. Cases in which parsimony and
compatibility methods will be positively misleading.
Syst. Zool. 27, 401410 (1978).
44. Huelsenbeck, J.P. Systematic bias in phylogenetic
analysis: is the Strepsiptera problem solved? Syst.
Biol. 47, 519537 (1998).
45. Swofford, D.L. etal. Bias in phylogenetic estimation
and its relevance to the choice between parsimony and
likelihood methods. Syst. Biol. 50, 525539 (2001).

www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
46. Yang, Z. Among-site rate variation and its impact
on phylogenetic analyses. Trends Ecol. Evol. 11,
367372 (1996).
47. Philippe, H. etal. Acoelomorph flatworms are
deuterostomes related to Xenoturbella. Nature 470,
255258 (2011).
48. Zhong, B. etal. Systematic error in seed plant
phylogenomics. Genome Biol. Evol. 3, 13401348
(2011).
49. Felsenstein, J. Evolutionary trees from DNA
sequences: a maximum likelihood approach. J.Mol.
Evol. 17, 368376 (1981).
This paper introduces the pruning algorithm for
likelihood calculation on a tree. This approach
forms the basis for modern likelihood and Bayesian
methods of phylogenetic analysis.
50. Yang, Z. Phylogenetic analysis using parsimony and
likelihood methods. J.Mol. Evol. 42, 294307
(1996).
51. Felsenstein, J. Phylip: Phylogenetic Inference Program,
Version 3.6. (Univ. of Washington, Seattle, 2005).
52. Adachi, J. & Hasegawa, M. MOLPHY version 2.3:
programs for molecular phylogenetics based on
maximum likelihood. Comput. Sci. Monogr. 28,
1150 (1996).
53. Guindon, S. & Gascuel, O. A simple, fast, and
accurate algorithm to estimate large phylogenies
by maximum likelihood. Syst. Biol. 52, 696704
(2003).
54. Stamatakis, A. RAxMLVIHPC: maximum likelihoodbased phylogenetic analyses with thousands of
taxa and mixed models. Bioinformatics 22,
26882690 (2006).
55. Zwickl, D. Genetic Algorithm Approaches for the
Phylogenetic Analysis of Large Biological Sequence
Datasets Under the Maximum Likelihood Criterion.
Thesis, Univ. Texas at Austin (2006).
56. Yang, Z. Maximum likelihood phylogenetic estimation
from DNA sequences with variable rates over sites:
approximate methods. J.Mol. Evol. 39, 306314
(1994).
57. Lartillot, N. & Philippe, H. A Bayesian mixture model
for across-site heterogeneities in the amino-acid
replacement process. Mol. Biol. Evol. 21, 10951109
(2004).
58. Blanquart, S. & Lartillot, N. A site- and timeheterogeneous model of amino acid replacement.
Mol. Biol. Evol. 25, 842858 (2008).
59. Goldman, N. Statistical tests of models of DNA
substitution. J.Mol. Evol. 36, 182198 (1993).
60. Zuckerkandl, E. & Pauling, L. in Evolving Genes and
Proteins (eds Bryson, V. & Vogel, H.J.) 97166
(Academic Press, New York, 1965).
61. Nielsen, R. & Yang, Z. Likelihood models for
detecting positively selected amino acid sites and
applications to the HIV1 envelope gene. Genetics
148, 929936 (1998).
62. Yang, Z. Likelihood ratio tests for detecting positive
selection and application to primate lysozyme
evolution. Mol. Biol. Evol. 15, 568573 (1998).
63. Yang, Z. & Nielsen, R. Codon-substitution models
for detecting molecular adaptation at individual
sites along specific lineages. Mol. Biol. Evol. 19,
908917 (2002).
64. Huelsenbeck, J.P. & Rannala, B. Phylogenetic methods
come of age: testing hypotheses in an evolutionary
context. Science 276, 227232 (1997).
65. Whelan, S., Li, P. & Goldman, N. Molecular
phylogenetics: state of the art methods for looking
into the past. Trends Genet. 17, 262272 (2001).
66. Rannala, B. & Yang, Z. Probability distribution
of molecular evolutionary trees: a new method of
phylogenetic inference. J.Mol. Evol. 43, 304311
(1996).
67. Yang, Z. & Rannala, B. Bayesian phylogenetic
inference using DNA sequences: a Markov
chain Monte Carlo Method. Mol. Biol. Evol. 14,
717724 (1997).
68. Mau, B. & Newton, M.A. Phylogenetic inference for
binary data on dendrograms using Markov chain
Monte Carlo. J.Comput. Graph. Stat. 6, 122131
(1997).
69. Li, S., Pearl, D. & Doss, H. Phylogenetic tree
reconstruction using Markov chain Monte Carlo.
J.Am. Stat. Assoc. 95, 493508 (2000).
70. Larget, B. & Simon, D.L. Markov chain Monte Carlo
algorithms for the Bayesian analysis of phylogenetic
trees. Mol. Biol. Evol. 16, 750759 (1999).
71. Huelsenbeck, J.P. & Ronquist, F. MrBayes: Bayesian
inference of phylogenetic trees. Bioinformatics 17,
754755 (2001).

72. Drummond, A.J., Ho, S.Y.W., Phillips, M.J. &


Rambaut, A. Relaxed phylogenetics and dating with
confidence. PLoS Biol. 4, e88 (2006).
This paper introduces a Bayesian MCMC algorithm
(the BEAST program) to estimate rooted trees
under relaxed-clock models.
73. Felsenstein, J. Confidence limits on phylogenies:
an approach using the bootstrap. Evolution 39,
783791 (1985).
74. Felsenstein, J. & Kishino, H. Is there something wrong
with the bootstrap on phylogenies? A reply to Hillis
and Bull. Syst. Biol. 42, 193200 (1993).
75. Efron, B., Halloran, E. & Holmes, S. Bootstrap
confidence levels for phylogenetic trees. Proc. Natl
Acad. Sci. USA 93, 70857090 (1996); corrected
article Proc. Natl Acad. Sci. USA 93, 1342913434
(1996).
76. Berry, V. & Gascuel, O. On the interpretation of
bootstrap trees: appropriate threshold of clade
selection and induced gain. Mol. Biol. Evol. 13,
9991011 (1996).
77. Susko, E. First-order correct bootstrap support
adjustments for splits that allow hypothesis testing
when using maximum likelihood estimation. Mol. Biol.
Evol. 27, 16211629 (2010).
78. Suzuki, Y., Glazko, G.V. & Nei, M. Overcredibility of
molecular phylogenies obtained by Bayesian
phylogenetics. Proc. Natl Acad. Sci. USA 99,
1613816143 (2002).
79. Lewis, P.O., Holder, M.T. & Holsinger, K.E.
Polytomies and Bayesian phylogenetic inference.
Syst. Biol. 54, 241253 (2005).
80. Yang, Z. & Rannala, B. Branch-length prior influences
Bayesian posterior probability of phylogeny. Syst. Biol.
54, 455470 (2005).
81. Huelsenbeck, J.P. & Rannala, B. Frequentist
properties of Bayesian posterior probabilities of
phylogenetic trees under simple and complex
substitution models. Syst. Biol. 53, 904913 (2004).
82. Brown, J.M., Hedtke, S.M., Lemmon, A.R. &
Lemmon, E.M. When trees grow too long:
investigating the causes of highly inaccurate Bayesian
branch-length estimates. Syst. Biol. 59, 145161
(2010).
83. Rannala, B., Zhu, T. & Yang, Z. Tail paradox, partial
identifiability and influential priors in Bayesian branch
length inference. Mol. Biol. Evol. 29, 325335 (2012).
84. Zhang, C., Rannala, B. & Yang, Z. Robustness of
compound Dirichlet priors for Bayesian inference
of branch lengths. Syst. Biol. 10Feb 2012
(doi:10.1093/sysbio/sys030).
85. Suchard, M. & Rambaut, A. Many-core algorithms
for statistical phylogenetics. Bioinformatics 25,
13701376 (2009).
86. Zierke, S. & Bakos, J. FPGA acceleration of the
phylogenetic likelihood function for Bayesian MCMC
inference methods. BMC Bioinform. 11, 184 (2010).
87. Bininda-Emonds, O.R.P. Phylogenetic Supertrees:
Combining Information to Reveal the Tree of Life
(Kluwer Academic, the Netherlands, 2004).
88. de Queiroz, A. & Gatesy, J. The supermatrix approach
to systematics. Trends Ecol. Evol. 22, 3441 (2007).
89. Yang, Z. Maximum-likelihood models for combined
analyses of multiple sequence data. J.Mol. Evol. 42,
587596 (1996).
90. Shapiro, B., Rambaut, A. & Drummond, A.J.
Choosing appropriate substitution models for the
phylogenetic analysis of protein-coding sequences.
Mol. Biol. Evol. 23, 79 (2006).
91. Ren, F., Tanaka, H. & Yang, Z. A likelihood look at
the supermatrixsupertree controversy. Gene 441,
119125 (2009).
92. Criscuolo, A., Berry, V., Douzery, E.J. & Gascuel, O.
SDM: a fast distance-based approach for (super) tree
building in phylogenomics. Syst. Biol. 55, 740755
(2006).
93. Wiens, J.J. & Moen, D.S. Missing data and the
accuracy of Bayesian phylogenetics. J.Syst. Evol. 46,
307314 (2008).
94. Dwivedi, B. & Gadagkar, S. Phylogenetic inference
under varying proportions of indel-induced alignment
gaps. BMC Evol. Biol. 9, 14712148 (2009).
95. Rodrigue, N., Philippe, H. & Lartillot, N.
Mutation-selection models of coding sequence
evolution with site-heterogeneous amino acid fitness
profiles. Proc. Natl Acad. Sci. USA 107, 46294634
(2010).
96. Pagel, M. & Meade, A. A phylogenetic mixture model
for detecting pattern-heterogeneity in gene sequence
or character-state data. Syst. Biol. 53, 571581
(2004).

NATURE REVIEWS | GENETICS

97. Nishihara, H., Okada, N. & Hasegawa, M.


Rooting the Eutherian tree the power and pitfalls
of phylogenomics. Genome Biol. 8, R199 (2007).
98. Leigh, J.W., Susko, E., Baumgartner, M. & Roger, A.J.
Testing congruence in phylogenomic analysis. Syst.
Biol. 57, 104115 (2008).
99. Higgins, D.G. & Sharp, P.M. CLUSTAL: a package for
performing multiple sequence alignment on a
microcomputer. Gene 73, 237244 (1988).
100. Edgar, R.C. MUSCLE: multiple sequence alignment
with high accuracy and high throughput. Nucleic Acids
Res. 32, 17921797 (2004).
101. Lytynoja, A. & Goldman, N. An algorithm for
progressive multiple alignment of sequences
with insertions. Proc. Natl Acad. Sci. USA 102,
1055710562 (2005).
102. Lytynoja, A. & Goldman, N. Phylogeny-aware gap
placement prevents errors in sequence alignment and
evolutionary analysis. Science 320, 16321635 (2008).
103. Thorne, J.L., Kishino, H. & Felsenstein, J.
An evolutionary model for maximum likelihood
alignment of DNA sequences. J.Mol. Evol. 33,
114124 (1991); erratum J. Mol. Evol. 34, 91 (1992).
104. Hein, J., Jensen, J.L. & Pedersen, C.N. Recursions for
statistical multiple alignment. Proc. Natl Acad. Sci.
USA 100, 1496014965 (2003).
105. Redelings, B.D. & Suchard, M.A. Joint Bayesian
estimation of alignment and phylogeny. Syst. Biol. 54,
401418 (2005).
106. Lunter, G., Miklos, I., Drummond, A., Jensen, J.L. &
Hein, J. Bayesian coestimation of phylogeny and
sequence alignment. BMC Bioinformatics 6, 83 (2005).
107. Thorne, J.L., Kishino, H. & Painter, I.S. Estimating the
rate of evolution of the rate of molecular evolution.
Mol. Biol. Evol. 15, 16471657 (1998).
This paper describes the first Bayesian MCMC
method for dating species divergence using
minimum and maximum bounds to incorporate
fossil calibrations.
108. Kishino, H., Thorne, J.L. & Bruno, W.J.
Performance of a divergence time estimation
method under a probabilistic model of rate evolution.
Mol. Biol. Evol. 18, 352361 (2001).
109. Rannala, B. & Yang, Z. Inferring speciation times under
an episodic molecular clock. Syst. Biol. 56, 453466
(2007).
110. Yang, Z. & Rannala, B. Bayesian estimation of species
divergence times under a molecular clock using
multiple fossil calibrations with soft bounds. Mol. Biol.
Evol. 23, 212226 (2006).
111. Inoue, J., Donoghue, P.C.H. & Yang, Z. The impact of
the representation of fossil calibrations on Bayesian
estimation of species divergence times. Syst. Biol. 59,
7489 (2010).
112. Tavar, S., Marshall, C.R., Will, O., Soligos, C. &
Martin, R.D. Using the fossil record to estimate the
age of the last common ancestor of extant primates.
Nature 416, 726729 (2002).
113. Wilkinson, R.D. etal. Dating primate divergences
through an integrated analysis of palaeontological
and molecular data. Syst. Biol. 60, 1631 (2011).
114. Knowles, L.L. Statistical phylogeography. Annu. Rev.
Ecol. Syst. 40, 593612 (2009).
115. Lemey, P., Rambaut, A., Drummond, A.J. &
Suchard, M.A. Bayesian phylogeography finds its
roots. PLoS Comp. Biol. 5, e1000520 (2009).
116. Lemey, P., Rambaut, A., Welch, J.J. & Suchard, M.A.
Phylogeography takes a relaxed random walk in
continuous space and time. Mol. Biol. Evol. 27,
18771885 (2010).
117. Takahata, N., Satta, Y. & Klein, J. Divergence time and
population size in the lineage leading to modern
humans. Theor. Popul. Biol. 48, 198221 (1995).
118. Rannala, B. & Yang, Z. Bayes estimation of species
divergence times and ancestral population sizes using
DNA sequences from multiple loci. Genetics 164,
16451656 (2003).
This study describes the multi-species coalescent
model. This is the basis for carrying out comparative
analyses of individual genomes and phylogeographic
studies and for applying species tree methods.
119. Drummond, A.J., Nicholls, G.K., Rodrigo, A.G. &
Solomon, W. Estimating mutation parameters,
population history and genealogy simultaneously from
temporally spaced sequence data. Genetics 161,
13071320 (2002).
120. Hey, J. Isolation with migration models for more than
two populations. Mol. Biol. Evol. 27, 905920 (2010).
121. Knowles, L.L. & Carstens, B.C. Delimiting species
without monophyletic gene trees. Syst. Biol. 56,
887895 (2007).

VOLUME 13 | MAY 2012 | 313


2012 Macmillan Publishers Limited. All rights reserved

REVIEWS
122. Yang, Z. & Rannala, B. Bayesian species delimitation
using multilocus sequence data. Proc. Natl Acad. Sci.
USA 107, 92649269 (2010).
This paper describes a Bayesian MCMC method for
delimiting species using sequence data from multiple
loci under the multi-species coalescent model.
123. Rohland, N. etal. Genomic DNA sequences from
mastodon and woolly mammoth reveal deep
speciation of forest and savanna elephants. PLoS Biol.
8, e1000564 (2010).
124. Bos, K.I. etal. A draft genome of Yersinia pestis from
victims of the Black Death. Nature 478, 506510
(2011).
125. Patterson, N., Richter, D.J., Gnerre, S., Lander, E.S.
& Reich, D. Genetic evidence for complex
speciation of humans and chimpanzees. Nature 441,
11031108 (2006).
126. Innan, H. & Watanabe, H. The effect of gene flow
on the coalescent time in the humanchimpanzee
ancestral population. Mol. Biol. Evol. 23,
10401047 (2006).
127. Becquet, C. & Przeworski, M. A new approach to
estimate parameters of speciation models with
application to apes. Genome Res. 17, 15051519
(2007).
128. Hobolth, A., Christensen, O.F., Mailund, T. &
Schierup, M.H. Genomic relationships and
speciation times of human, chimpanzee, and gorilla
inferred from a coalescent hidden Markov model.
PLoS Genet. 3, e7 (2007).

129. Burgess, R. & Yang, Z. Estimation of hominoid


ancestral population sizes under Bayesian coalescent
models incorporating mutation rate variation and
sequencing errors. Mol. Biol. Evol. 25, 19791994
(2008).
130. Becquet, C. & Przeworski, M. Learning about modes
of speciation by computational approaches. Evolution
63, 25472562 (2009).
131. Yang, Z. A likelihood ratio test of speciation with gene
flow using genomic sequence data. Genome Biol. Evol.
2, 200211 (2010).
132. Reich, D. etal. Genetic history of an archaic hominin
group from Denisova Cave in Siberia. Nature 468,
10531060 (2010).
133. Sitnikova, T., Rzhetsky, A. & Nei, M. Interior-branch
and bootstrap tests of phylogenetic trees. Mol. Biol.
Evol. 12, 319333 (1995).
134. Zhong, B., Yonezawa, T., Zhong, Y. & Hasegawa, M.
The position of gnetales among seed plants:
overcoming pitfalls of chloroplast phylogenomics.
Mol. Biol. Evol. 27, 28552863 (2010).
135. Drummond, A.J. & Rambaut, A. BEAST: Bayesian
evolutionary analysis by sampling trees. BMC Evol.
Biol. 7, 214 (2007).
136. Kosakovsky Pond, S.L., Frost, S.D.W. & Muse, S.V.
HyPhy: hypothesis testing using phylogenies.
Bioinformatics 21, 676679 (2005).
137. Yang, Z. PAML 4: phylogenetic analysis by
maximum likelihood. Mol. Biol. Evol. 24,
15861591 (2007).

314 | MAY 2012 | VOLUME 13

138. Lartillot, N. & Philippe, H. Computing Bayes factors


using thermodynamic integration. Syst. Biol. 55,
195207 (2006).
139. Xie, W., Lewis, P.O., Fan, Y., Kuo, L. & Chen, M.H.
Improving marginal likelihood estimation for Bayesian
phylogenetic model selection. Syst. Biol. 60,
150160 (2011).

Acknowledgements

We thank the three referees for their constructive comments


and M. Hasegawa and B. Zhong for providing the seed-plant
phylogenies of Fig.3. Z.Y. is supported by a UK Biotechnology
and Biological Sciences Research Council grant and a Royal
Society Wolfson Research Merit Award. B.R. is supported by
a US National Institutes of Healthgrant.

Competing interests statement

The authors declare no competing financial interests.

FURTHER INFORMATION
Ziheng Yangs homepage: http://abacus.gene.ucl.ac.uk
Bruce Rannalas homepage: http://www.rannala.org
A comprehensive list of phylogenetic programs maintained
by Joe Felsenstein: http://evolution.genetics.washington.
edu/phylip/software.html
Nature Reviews Genetics article series on Study designs:
http://www.nature.com/nrg/series/studydesigns/index.html
ALL LINKS ARE ACTIVE IN THE ONLINE PDF

www.nature.com/reviews/genetics
2012 Macmillan Publishers Limited. All rights reserved

You might also like