Phylogenetic Tree Reconstruction: I519 Introduction To Bioinformatics, 2012
[Figure: two candidate trees for four taxa S1-S4, grouping them as (S1,S2), (S3,S4) and as (S1,S3), (S2,S4); panel labels: mutation=2, mutation=1]
We can EASILY get different trees
(the “reality check” paper)
Input sequences
Multiple alignment programs
Substitution models
Phylogenetic tree reconstruction methods
Trees – what might they mean?
[Figure: a species tree with leaves Species A, B, C, D, and a gene tree with leaves Seq A, Seq D, Seq C, Seq B]
Lack of resolution
[Figure: tree with leaves Seq A, Seq D, Seq C, Seq B]
Gene transfer
[Figure: gene tree with leaves Seq A, Seq B, Seq C, Seq D; event labels: speciation, gene transfer]
Gene duplication & loss
[Figure: species tree with leaves species A, B, C, D and a marked duplication event, and the corresponding gene tree with leaves Seq A, Seq B, Seq C, Seq D and Seq B’, Seq C’, Seq D’; a loss is marked]
Orthologs and paralogs
(important for function annotation)
Orthologs: sequences diverged after a speciation event
Paralogs: sequences diverged after a duplication event
Xenologs: sequences diverged after a horizontal transfer
[Figure: gene tree with leaves Seq A, Seq B, Seq C, Seq D, Seq B, Seq C, Seq D; a gene duplication node is marked]
Transversion: R to Y, or Y to R,
where R = A,G (purines)
and Y = C,T (pyrimidines)
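The R/Y definition above can be turned into a small classifier. This is a minimal sketch: the function name `classify_substitution` is made up for illustration, and it also assumes the usual complementary definition of a transition (R to R, or Y to Y).

```python
PURINES = {"A", "G"}      # R
PYRIMIDINES = {"C", "T"}  # Y


def classify_substitution(x: str, y: str) -> str:
    """Classify the change x -> y as 'identical', 'transition' or 'transversion'."""
    x, y = x.upper(), y.upper()
    valid = PURINES | PYRIMIDINES
    if x not in valid or y not in valid:
        raise ValueError("expected bases from A, C, G, T")
    if x == y:
        return "identical"
    # Same chemical class (R->R or Y->Y) is a transition; R<->Y is a transversion.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```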
Distance based phylogeny reconstruction
Phylogeny reconstruction for 3 sequences is EASY
– There is a single tree topology
– The branch lengths can be calculated as follows:
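Input: D_AB, D_BC and D_AC (pairwise distances)
To compute: branch lengths a, b and c (from A, B and C to the single internal node), such that
D_AB = a + b, D_AC = a + c, D_BC = b + c
Output:
a = (D_AB + D_AC - D_BC) / 2
b = (D_AB + D_BC - D_AC) / 2
c = (D_AC + D_BC - D_AB) / 2
[Figure: three-leaf tree with leaves A, B, C and branch lengths a, b, c]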
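For three sequences, the branch lengths must satisfy D_AB = a + b, D_AC = a + c and D_BC = b + c, which can be solved directly. The sketch below does this; the function name and the example distances are illustrative assumptions, not values from the slides.

```python
def three_leaf_branch_lengths(d_ab, d_ac, d_bc):
    """Solve D_AB = a+b, D_AC = a+c, D_BC = b+c for the branch lengths (a, b, c)."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Illustrative distances (made up for this example).
a, b, c = three_leaf_branch_lengths(d_ab=22, d_ac=39, d_bc=41)
print(a, b, c)  # 10.0 12.0 29.0
```

Adding the lengths back up recovers the inputs (10 + 12 = 22, 10 + 29 = 39, 12 + 29 = 41), which is a quick sanity check on any distance triple.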
Fitch-Margoliash (FM) algorithm
For phylogeny reconstruction with more than 3 sequences
For example, given 5 sequences, A, B, C, D and E, the tree can be reconstructed as follows:
– First choose the closest sequence pair; suppose it is D and E (based on the input pairwise distances; e.g., D_DE = 10)
– To calculate the branch lengths from D and E to their common ancestor (denoted as d and e), we combine the remaining three sequences (A, B and C) and treat them as a single composite sequence (and define the distance from D to the composite as the average of D_DA, D_DB and D_DC, and so on), so again we are dealing with 3 sequences, and we can easily calculate the branch lengths
– Then merge D and E into a cluster and treat it as a composite sequence, and update the distance table so that the distance from the cluster (D,E) to A is the average of D_DA and D_EA, and so on.
– Repeat the above steps until there are no more clusters to merge
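The first Fitch-Margoliash step can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `fm_closest_pair_step` is hypothetical, the distance table is an assumed example chosen so that D_DE = 10, and the distance to the composite sequence is taken to be the average pairwise distance.

```python
from itertools import combinations

# Assumed example distances (only D_DE = 10 is taken from the text).
EXAMPLE_DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}


def fm_closest_pair_step(names, pair_dist):
    """Pick the closest pair and compute its two branch lengths (one FM step)."""
    d = {frozenset(p): v for p, v in pair_dist.items()}
    x, y = min(combinations(names, 2), key=lambda p: d[frozenset(p)])
    rest = [n for n in names if n not in (x, y)]
    # Distance from x (and y) to the composite of all remaining sequences.
    d_x_rest = sum(d[frozenset((x, r))] for r in rest) / len(rest)
    d_y_rest = sum(d[frozenset((y, r))] for r in rest) / len(rest)
    d_xy = d[frozenset((x, y))]
    # Three-sequence formula applied to x, y and the composite.
    len_x = (d_xy + d_x_rest - d_y_rest) / 2
    len_y = (d_xy + d_y_rest - d_x_rest) / 2
    return (x, y), (len_x, len_y)


pair, (d_len, e_len) = fm_closest_pair_step("ABCDE", EXAMPLE_DIST)
print(pair, round(d_len, 6), round(e_len, 6))  # ('D', 'E') 4.0 6.0
```

With these assumed distances the two branches come out unequal (4 and 6), because D is on average slightly closer than E to the remaining sequences.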
Distance based method: UPGMA
UPGMA: Unweighted Pair Group Method with Arithmetic Mean
Assumes the same rate of evolution on all branches (molecular clock hypothesis)
The length from the root to each leaf is the same (ultrametric).
It is similar to the Fitch-Margoliash algorithm (merge the two most similar sequences or clusters first), but the calculation of branch lengths is even simpler.
For example, for the same five-sequence example shown above (D_DE = 10), d = e = D_DE/2 = 5 (d and e are the branch lengths from sequences D and E to their common ancestor)
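A compact UPGMA sketch follows. The function name `upgma`, the string naming of merged clusters, and the distance table are assumptions for illustration (only D_DE = 10 comes from the text). Each merge places the new node at a height of half the cluster distance, so the branch from a child to its parent is the parent's height minus the child's height.

```python
from itertools import combinations

# Assumed example distances (only D_DE = 10 is taken from the text).
EXAMPLE_DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}


def upgma(names, pair_dist):
    """Return the list of merges as (cluster_x, cluster_y, node_height)."""
    d = {frozenset(p): float(v) for p, v in pair_dist.items()}
    size = {n: 1 for n in names}  # current clusters and their leaf counts
    merges = []
    while len(size) > 1:
        # Merge the two closest clusters.
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((x, y))] / 2  # molecular clock: equal depth on both sides
        nx, ny = size.pop(x), size.pop(y)
        merged = f"({x},{y})"
        # Size-weighted mean = average over all leaf pairs between clusters.
        for z in size:
            d[frozenset((merged, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        size[merged] = nx + ny
        merges.append((x, y, height))
    return merges


for x, y, h in upgma("ABCDE", EXAMPLE_DIST):
    print(x, y, round(h, 3))
```

The first merge printed is D and E at height 5.0, matching d = e = 5.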
Neighbor-joining (NJ)
(Saitou & Nei algorithm)
Minimum evolution -- the least total branch length (distance-based)
Bottom-up clustering method
Does not assume the same rate of evolution
Fast & produces reasonable trees
NJ method
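[Figure: star tree with leaves 1-5 joined at a central node X, and the tree after two leaves are joined at a new internal node Y]
At each step, join the two clusters C1 and C2 that are most similar to each other while being most dissimilar to the other clusters (far from the others)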
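One common way to formalize "close to each other, far from the others" is the standard neighbor-joining criterion Q(i,j) = (n-2)·d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other clusters. The sketch below assumes that criterion (the slide's own formula did not survive extraction); the function name and the distance values are illustrative.

```python
from itertools import combinations

# Illustrative distances (made up for this example).
EXAMPLE_DIST = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}


def nj_pick_pair(names, pair_dist):
    """One NJ step: choose the pair minimizing Q and compute its branch lengths."""
    n = len(names)  # requires n > 2
    d = {frozenset(p): v for p, v in pair_dist.items()}
    # r[x]: total distance from x to every other cluster.
    r = {x: sum(d[frozenset((x, k))] for k in names if k != x) for x in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

    i, j = min(combinations(names, 2), key=q)
    d_ij = d[frozenset((i, j))]
    # Branch lengths from i and j to the new internal node.
    len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d_ij - len_i
    return (i, j), (len_i, len_j)


print(nj_pick_pair("abcde", EXAMPLE_DIST))  # (('a', 'b'), (2.0, 3.0))
```

In this example the pair with the smallest raw distance is d and e (distance 3), but the criterion selects a and b, since they are comparatively farther from everything else. The two branch lengths are also unequal, consistent with NJ not assuming a molecular clock.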
Comparison of FM, UPGMA and NJ methods
All are hierarchical clustering methods
All define the distance between two clusters as the
average pairwise distance
[Figure: Fitch algorithm example on a tree with leaves C, T, G, C, A, T; left panel: calculate Fitch score (Cost=4); right panel: internal node labeling]
Sankoff Algorithm
More general than the Fitch algorithm.
Assumes we have a table of costs c_ab for all possible changes between states a and b (A, T, C or G for DNA)
For each node i in the tree we compute S(i,a), the minimal cost given that node i is assigned state a.
In particular we can compute the minimum value over a of S(root,a), which will be the cost of the tree.
Sankoff algorithm: DP
[Figure: Sankoff DP on a six-leaf example tree. Cost vectors S(i,a) in the order (A, C, G, T): leaves, e.g. G = (inf, inf, 0, inf); internal nodes (1, 4, 1, 4), (4, 1, 4, 1), (2, 5, 1, 5), (5, 2, 5, 1); root (5, 5, 4, 4)]
Cost table c_ab:
    A  C  G  T
A   0  2  1  2
C   2  0  2  1
G   1  2  0  2
T   2  1  2  0
Initialization (for each leaf i and each state a in {A,T,C,G}):
S(i,a) = 0 if i is labeled by a, or
S(i,a) = inf otherwise
Iteration (for an internal node i with children j and k):
S(i,a) = min_x(S(j,x) + c(a,x)) + min_y(S(k,y) + c(a,y))
Termination:
min_a S(root,a)
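The recurrence can be sketched as a short recursive function. The cost table is the one given above; the tree topology (((A,G),G),((C,T),T)) is an assumption, chosen because it reproduces the cost vectors shown in the figure. The nested-tuple tree encoding and the function name are also illustrative choices.

```python
INF = float("inf")
STATES = "ACGT"

# Substitution costs c_ab: transitions cost 1, transversions cost 2.
COST = {
    "A": {"A": 0, "C": 2, "G": 1, "T": 2},
    "C": {"A": 2, "C": 0, "G": 2, "T": 1},
    "G": {"A": 1, "C": 2, "G": 0, "T": 2},
    "T": {"A": 2, "C": 1, "G": 2, "T": 0},
}


def sankoff(node):
    """Return {a: S(node, a)}; node is a leaf label or a (left, right) pair."""
    if isinstance(node, str):
        # Initialization: 0 for the observed state, infinity otherwise.
        return {a: (0 if a == node else INF) for a in STATES}
    left, right = (sankoff(child) for child in node)
    # Iteration: best state for each child given the parent state a.
    return {
        a: min(left[x] + COST[a][x] for x in STATES)
        + min(right[y] + COST[a][y] for y in STATES)
        for a in STATES
    }


tree = ((("A", "G"), "G"), (("C", "T"), "T"))  # assumed topology
root = sankoff(tree)
print(root)                # {'A': 5, 'C': 5, 'G': 4, 'T': 4}
print(min(root.values()))  # 4  (termination: minimum over root states)
```

Each node is visited once and does a constant amount of work per state pair, which is why scoring a given tree takes time linear in the tree size.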
Large Parsimony Problem
The small parsimony problem – to find the score
of a given tree - can be solved in linear time in
the size of the tree.
The large parsimony problem is to find the tree
with minimum score.
It is known to be NP-Hard.
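To get a feel for why searching all trees quickly becomes infeasible, one can count the candidate topologies. The sketch below uses the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 · 5 · ... · (2n-5); the function name is illustrative.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of distinct unrooted binary tree topologies on n labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Double factorial (2n-5)!!: multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

The count is 3 for four taxa and 15 for five, and already exceeds two million for ten taxa.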
Tree search strategies
Exact search
– possible for small n only
Branch and Bound
– Use “clever” rules to avoid some branches of trees
– up to ~20 (25) taxa
http://evolution.gs.washington.edu/gs541/2005/lecture25.pdf
Branch and bound
This branch can be safely “neglected”!
Nearest neighbor interchange
Probabilistic approaches to phylogeny
Rank trees according to their likelihood P(data|tree), or their posterior probability P(tree|data) (Bayesian)
Maximum likelihood methods
Sampling methods
Calculate likelihood
Felsenstein’s algorithm for likelihood
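Initialization: i = 2n-1
Recursion: Compute P(L_i|a) for all a as follows
if i is a leaf
    P(L_i|a) = 1 if a is the label of the leaf, otherwise 0
else
    Compute P(L_j|a), P(L_k|a) for all a, where j and k are the children of i
    P(L_i|a) = Σ_{b,c} P(b|a,t_j) P(L_j|b) P(c|a,t_k) P(L_k|c)
[Figure: node i in state a, with children j and k in states b and c]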
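The recursion can be sketched for a single alignment column as follows. Several pieces here are assumptions that the slide does not specify: the Jukes-Cantor model for P(b|a,t) with t measured in expected substitutions per site, the uniform root distribution used in the final sum, the nested-tuple tree encoding, and all function names.

```python
import math

STATES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor P(b | a, t); an assumed substitution model."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional_likelihoods(node):
    """Return {a: P(L_i | a)}; node is a leaf label or ((left, t_j), (right, t_k))."""
    if isinstance(node, str):
        return {a: (1.0 if a == node else 0.0) for a in STATES}
    (j, t_j), (k, t_k) = node
    p_j, p_k = conditional_likelihoods(j), conditional_likelihoods(k)
    # The double sum over (b, c) factorizes into a product of two single sums.
    return {
        a: sum(jc69(a, b, t_j) * p_j[b] for b in STATES)
        * sum(jc69(a, c, t_k) * p_k[c] for c in STATES)
        for a in STATES
    }


def site_likelihood(tree):
    """Sum over root states, weighting each by an assumed uniform prior of 1/4."""
    root = conditional_likelihoods(tree)
    return sum(0.25 * root[a] for a in STATES)


# Hypothetical three-leaf tree with made-up branch lengths.
tree = (((("A", 0.1), ("A", 0.1)), 0.2), ("G", 0.3))
print(site_likelihood(tree))
```

Because each internal node only combines the vectors of its two children, the cost is linear in the number of nodes rather than exponential in the number of internal-state assignments.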
How can one tell if a tree is significant?
Biological knowledge