
Lecture 11

Phylogenetic trees

Principles of Computational
Biology
Teresa Przytycka, PhD
Phylogenetic (evolutionary) Tree
•  A tree showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor.
•  Each node is called a taxonomic unit.
•  Internal nodes are generally called hypothetical taxonomic units.
•  In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants.
•  Edge lengths (if present) correspond to time estimates.
Methods to construct phylogenetic trees
•  Parsimony
•  Distance matrix based
•  Maximum likelihood
Parsimony methods
The preferred evolutionary tree is
the one that requires
“the minimum net amount of evolution”
[Edwards and Cavalli-Sforza, 1963]
Assumptions of character-based parsimony
•  Each taxon is described by a set of characters
•  Each character can be in one of a finite number of states
•  In one step, certain changes are allowed in character states
•  Goal: find the evolutionary tree that explains the states of the taxa with a minimal number of changes
Example

           Char 1  Char 2  Char 3
Taxon 1    Yes     Yes     No
Taxon 2    Yes     Yes     Yes
Taxon 3    Yes     No      No
Taxon 4    Yes     No      No
Taxon 5    Yes     No      No... 
Taxon 5    Yes     No      Yes
Taxon 6    No      No      Yes

[Figure: a tree over the six taxa with inferred ancestral states; the most parsimonious labeling requires 4 changes.]
Variants of parsimony models:
•  Character states
–  Binary: states are 0 and 1, usually interpreted as presence or absence of an attribute (e.g. the character is a gene that can be present or absent in a genome)
–  Multistate: any number of states (e.g. characters are positions in a multiple sequence alignment and the states are A, C, T, G)
•  Types of changes:
–  Characters are ordered (the changes have to happen in a particular order) or unordered
–  The changes are reversible or not
Variants of parsimony

•  Fitch Parsimony: unordered, multistate characters with reversibility
•  Wagner Parsimony: ordered, multistate characters with reversibility
•  Dollo Parsimony: ordered, binary characters with reversibility, but only one insertion allowed per character; suited to characters that are relatively hard to gain but easy to lose (like introns)
•  Camin-Sokal Parsimony: no reversals; derived states arise only once
•  (Binary) perfect phylogeny: binary and non-reversible; each character changes at most once
[Figure: the example tree tested against each model.]
•  Perfect – No (the triangle is gained and then lost)
•  Dollo – Yes
•  Camin-Sokal – No (for the same reason as perfect)

Camin-Sokal Parsimony
[Figure: an alternative tree with 3 changes in which the triangle is inserted twice; this tree is Camin-Sokal but neither perfect nor Dollo.]
Homoplasy
Having some states arise more than once is called homoplasy.

Example: the triangle in the tree on the previous slide.
Finding the most parsimonious tree
•  There are exponentially many trees with n nodes
•  Finding the most parsimonious tree is NP-complete (for most variants of parsimony models)
•  Exception: a perfect phylogeny, if it exists, can be found quickly. Problem: perfect phylogeny is too restrictive in practice.
Perfect phylogeny
•  Each change can happen only once and is not reversible
•  Can be directed or not

Example: Consider binary characters where each character corresponds to a gene:
0 – gene absent
1 – gene present
It makes sense to assume directed changes, only from 0 to 1.
The root then has to be all zeros.
Perfect phylogeny
Example: characters = genes; 0 = absent; 1 = present
Taxa: genomes (A, B, C, D, E)

     genes
A    000110
B    110000
C    000111
D    101000
E    000100

[Figure: a perfect phylogeny tree with leaves ordered B, D, E, A, C; each edge is labeled by the character that changes 0→1 along it.]
Perfect phylogeny tree

Goal: for a given character-state matrix, construct a tree topology that provides a perfect phylogeny.
Does there exist a perfect parsimony tree for our example with geometrical shapes?

There is a simple test.
Character Compatibility
•  Two characters A, B are compatible if there do not exist four taxa containing all four combinations as in the table:

        A   B
   T1   1   1
   T2   1   0
   T3   0   1
   T4   0   0

•  Fact: there exists a perfect phylogeny if and only if all pairs of characters are compatible.
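The pairwise test above is easy to sketch in code. The following is a minimal illustration, not from the lecture; the function names and the column-per-character matrix layout are my own:

```python
from itertools import combinations

def compatible(col_a, col_b):
    """Two binary characters are compatible iff the four state
    combinations (0,0), (0,1), (1,0), (1,1) do NOT all occur."""
    pairs = set(zip(col_a, col_b))
    return len(pairs & {(0, 0), (0, 1), (1, 0), (1, 1)}) < 4

def perfect_phylogeny_exists(matrix):
    """matrix[i][j] = state of character j in taxon i.
    By the fact above, a perfect phylogeny exists iff every
    pair of character columns is compatible."""
    columns = list(zip(*matrix))
    return all(compatible(a, b) for a, b in combinations(columns, 2))

# The table above contains all four combinations -> incompatible
table = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(perfect_phylogeny_exists(table))  # False
```

This brute-force check costs O(n m²) for n taxa and m characters.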
Example: characters that are not compatible

           Char 1  Char 2  Char 3
Taxon 1    Yes     Yes     No
Taxon 2    Yes     Yes     Yes
Taxon 3    Yes     No      No
Taxon 4    Yes     No      No
Taxon 5    Yes     No      Yes
Taxon 6    No      No      Yes

One cannot add the triangle to the tree so that no character changes its state twice: if we add it to one of the left branches, it will be inserted twice; if to the rightmost branch, the circle would have to be deleted (an insertion and then a deletion of the circle).
Ordered characters and perfect phylogeny
•  Assume that in the last common ancestor all characters had state 0
•  This assumption makes sense for many characters, for example genes
•  Then the compatibility criterion is even simpler: characters are compatible if and only if there do not exist three taxa containing the combinations (1,0), (0,1), (1,1)
Example
A 000110
B 110000
C 000111
D 101000
E 000100

Under the assumption that changes are directed from 0 to 1: if i and j are two different genes, then the set of species containing i is either disjoint from the set of species containing j, or one of these sets contains the other.

•  The above property is necessary and sufficient for perfect phylogeny under the 0-to-1 ordering
•  Why it works: associated with each character is a subtree. These subtrees have to be nested.
Simple test for perfect phylogeny
•  Fact: there exists a perfect phylogeny if and only if all pairs of characters are compatible
•  Special case: if we assume directed parsimony (0→1 only), then characters are compatible if and only if there do not exist three taxa containing the combinations (1,0), (0,1), (1,1)
•  Observe that the last condition is equivalent to the non-overlapping criterion
•  Optimal algorithm: Gusfield, O(nm), where n = #taxa and m = #characters
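The directed special case reduces to checking that the sets of taxa carrying state 1 are pairwise nested or disjoint. A brute-force O(nm²) sketch follows (Gusfield's optimal O(nm) algorithm sorts the columns first and is not reproduced here; all names are illustrative):

```python
def directed_compatible(matrix):
    """Under the 0->1 ordering, a perfect phylogeny exists iff for
    every pair of characters the taxon sets carrying state 1 are
    nested or disjoint (equivalently: no three taxa show the
    combinations (1,0), (0,1), (1,1))."""
    cols = list(zip(*matrix))
    ones = [{i for i, s in enumerate(c) if s == 1} for c in cols]
    for a in range(len(ones)):
        for b in range(a + 1, len(ones)):
            A, B = ones[a], ones[b]
            if A & B and not (A <= B or B <= A):
                return False
    return True

# Matrix from the example slide: a perfect phylogeny exists
M = [[0, 0, 0, 1, 1, 0],   # A
     [1, 1, 0, 0, 0, 0],   # B
     [0, 0, 0, 1, 1, 1],   # C
     [1, 0, 1, 0, 0, 0],   # D
     [0, 0, 0, 1, 0, 0]]   # E
print(directed_compatible(M))  # True
```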
Two versions of the optimization problem:
Small parsimony: the tree is given and we want to find the labeling that minimizes #changes – there are good algorithms to do this.
Large parsimony: find the tree that minimizes the number of evolutionary changes. For most models this is NP-complete.
One approach to large parsimony requires:
-  generating all possible trees
-  finding an optimal labeling of internal nodes for each tree.
Fact 1: the number of tree topologies grows exponentially with the number of nodes.
Fact 2: there may be many possible labelings leading to the same score.
Clique method for large parsimony

•  Consider the following graph:
–  nodes – characters
–  edges – between pairs of compatible characters

       characters
       1 2 3 4 5 6
   α   1 0 0 1 1 0
   β   0 0 1 0 0 0
   γ   1 1 0 0 0 0
   δ   1 1 0 1 1 1
   ε   0 0 1 1 1 0
   ω   0 0 0 0 0 0

[Figure: the compatibility graph on characters 1–6; characters 3 and 5 are INCOMPATIBLE, so that edge is missing. A maximum compatible set of characters is a maximum clique.]
Clique method (Meacham 1981)
•  Find a maximal compatible clique (an NP-complete problem)
•  Each character defines a partition of the taxon set into two subsets

[Figure: the characters of the clique successively partition the taxa {α, β, γ, δ, ε, ω}: character 1 separates {α, γ, δ} from {β, ε, ω}, and further characters refine these groups.]
Small parsimony
•  Assumption: the tree is known
•  Goal: find the optimal labeling of the tree (optimal = minimizing cost under the given parsimony assumption)
“Small” parsimony

Infer the labels of internal nodes.
Applications of the small parsimony problem

•  errors in data
•  loss of function
•  convergent evolution (a trait developed independently by two evolutionary pathways, e.g. wings in birds and bats)
•  lateral gene transfer (transferring genes across species not by inheritance)

[Figure: a phylogeny with the gene encoding N-acetylneuraminate lyase marked in red.]
From the paper: Are There Bugs in Our Genome?, Anderson, Doolittle, Nesbo, Science 292 (2001) 1848–51.
Dynamic programming algorithm for the small parsimony problem
•  Sankoff (1975) came up with the DP approach (Fitch provided an earlier non-DP algorithm)
•  Assumptions:
–  one character with multiple states
–  the cost of a change from state v to state w is δ(v,w)
   (note that this is a generalization; so far we talked about every change having cost 1)

DP algorithm, continued
st(v) = minimum parsimony cost of the subtree rooted at v, under the assumption that the character state at v is t.
If v is a leaf, st(v) = 0 when the observed state of v is t (and infinite otherwise). Otherwise, let u, w be the children of v; then
st(v) = min i {si(u) + δ(i,t)} + min j {sj(w) + δ(j,t)}

[Figure: node v in state t with children u and w; the edges contribute δ(i,t) and δ(j,t). All possible states at u and w are tried.]

The cost is O(nk²), where n = number of nodes and k = number of states: each of the k values at a node takes a minimum over k states in each of the two children.
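The recurrence above translates directly into a post-order traversal. A minimal sketch, assuming a rooted binary tree stored as a child dictionary (the data layout and names are illustrative, not from the lecture):

```python
import math

def sankoff(tree, leaf_state, states, delta):
    """Sankoff DP for one character on a rooted binary tree.
    tree: dict node -> (left_child, right_child); leaves are absent
    from the dict. leaf_state: dict leaf -> observed state.
    delta(i, t): cost of a change from state i to state t.
    Returns (cost, root) where cost[v][t] = s_t(v)."""
    cost = {}

    def visit(v):
        if v not in tree:                          # leaf
            cost[v] = {t: (0 if leaf_state[v] == t else math.inf)
                       for t in states}
            return
        u, w = tree[v]
        visit(u); visit(w)
        # s_t(v) = min_i {s_i(u)+delta(i,t)} + min_j {s_j(w)+delta(j,t)}
        cost[v] = {t: min(cost[u][i] + delta(i, t) for i in states)
                    + min(cost[w][j] + delta(j, t) for j in states)
                   for t in states}

    # the root is the internal node that is nobody's child
    root = next(v for v in tree if all(v not in ch for ch in tree.values()))
    visit(root)
    return cost, root

# Unit cost (every change costs 1), pattern 0011 on tree ((A,B),(C,D))
unit = lambda a, b: 0 if a == b else 1
tree = {'r': ('x', 'y'), 'x': ('A', 'B'), 'y': ('C', 'D')}
cost, root = sankoff(tree, {'A': 0, 'B': 0, 'C': 1, 'D': 1}, [0, 1], unit)
print(min(cost[root].values()))  # 1
```

The minimum over the root's entries is the parsimony score of the tree for this character.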
Exercise
[Figure: a tree with nodes numbered 1–7 (7 at the root); fill in the table of St(v) for each state t and node v = 1 … 7.]

The left and right characters are independent; we will compute the left one.
Branch lengths
•  Numbers that indicate the number of changes along each branch
•  Problem – there may be many most parsimonious trees
•  Method 1: average over all most parsimonious trees
•  Still a problem – the branch lengths are frequently underestimated
Character patterns and parsimony

•  Assume 2-state characters (0/1) and four taxa A, B, C, D
•  The possible topologies are: ((A,B),(C,D)), ((A,C),(B,D)) and ((A,D),(B,C))

Changes needed in each topology:
   A B C D    changes
   0 0 0 0    0, 0, 0
   0 0 0 1    1, 1, 1
   0 0 1 0    1, 1, 1
   0 0 1 1    1, 2, 2   ← informative character (helps to decide the tree topology)

Informative characters: xxyy, xyxy, xyyx
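The informative patterns are exactly the columns in which both states appear twice. A small sketch of this classification (the function name and labels are mine):

```python
def pattern_type(column):
    """Classify a 4-taxon binary character column (A, B, C, D).
    A character is parsimony-informative iff it matches xxyy, xyxy
    or xyyx, i.e. each of the two states appears exactly twice."""
    if len(set(column)) < 2:
        return "constant"
    if sorted(column.count(s) for s in set(column)) == [2, 2]:
        return "informative"
    return "uninformative"

print(pattern_type([0, 0, 1, 1]))  # informative (the xxyy pattern)
print(pattern_type([0, 0, 0, 1]))  # uninformative
print(pattern_type([0, 0, 0, 0]))  # constant
```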


Inconsistency
[Figure: a quartet tree in which the branches leading to A and C have change probability p, while the internal branch and the branches to B and D have change probability q.]

•  Let p, q be the character change probabilities on the corresponding branches
•  Consider the three informative patterns xxyy, xyxy, xyyx
•  The tree selected by parsimony depends on which pattern has the highest fraction
•  If q(1−q) < p², then the most frequent pattern is xyxy, leading to an incorrect tree
Distance based methods

•  When two sequences are similar, they are likely to originate from the same ancestor
•  Sequence similarity can approximate evolutionary distances

        GA(A/G)T(C/T)
        /           \
    GAATC          GAGTT
Distance Method
•  Assume that for any pair of species we have an estimate of the evolutionary distance between them
–  e.g. alignment score
•  Goal: construct a tree which best approximates these distances
Tree from distance matrix

   M   A   B   C   D   E
   A   0   2   7   7  12
   B   2   0   7   7  12
   C   7   7   0   4  11
   D   7   7   4   0  11
   E  12  12  11  11   0

[Figure: a weighted tree with leaves A, B, C, D, E realizing the matrix; e.g. the length of the path from A to D is 1+3+1+2 = 7.]

Consider weighted trees: w(e) = weight of edge e.
Recall: in a tree there is a unique path between any two nodes.
Let e1, e2, …, ek be the edges of the path connecting u and v; then the distance between u and v in the tree is:
d(u,v) = w(e1) + w(e2) + … + w(ek)
Can one always represent a distance matrix as a weighted tree?

       a   b   c   d
   a   0  10   5  10
   b  10   0   9   5
   c   5   9   0   8
   d  10   5   8   0

[Figure: a partial tree realizing the distances among a, b, c (edge lengths 3, 7, 2); there is no way to add d to the tree and preserve the distances.]
Quadrangle inequality
[Figure: four taxa a, b, c, d on a tree.]

d(a,c) + d(b,d) = d(a,d) + d(b,c) >= d(a,b) + d(c,d)

•  A matrix that satisfies the quadrangle inequality (also called the four-point condition) for every four taxa is called additive.
•  Theorem: a distance matrix can be represented precisely as a weighted tree if and only if it is additive.
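The four-point condition says that, among the three pairwise sums for any quadruple, the two largest must be equal. A direct O(n⁴) check (a sketch; the dict-of-dicts layout is my own):

```python
from itertools import combinations

def is_additive(d, taxa):
    """Four-point condition: for every quadruple of taxa, the two
    largest of the three pairwise sums must be equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([d[a][b] + d[c][e],
                       d[a][c] + d[b][e],
                       d[a][e] + d[b][c]])
        if sums[1] != sums[2]:
            return False
    return True

# The 5-taxon matrix from the earlier slide: additive
D = {'A': {'A': 0, 'B': 2, 'C': 7, 'D': 7, 'E': 12},
     'B': {'A': 2, 'B': 0, 'C': 7, 'D': 7, 'E': 12},
     'C': {'A': 7, 'B': 7, 'C': 0, 'D': 4, 'E': 11},
     'D': {'A': 7, 'B': 7, 'C': 4, 'D': 0, 'E': 11},
     'E': {'A': 12, 'B': 12, 'C': 11, 'D': 11, 'E': 0}}
print(is_additive(D, 'ABCDE'))  # True
```

The a, b, c, d matrix from the previous slide fails this test, which is why d cannot be attached to the tree.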
Constructing the tree representing an additive matrix (one of several methods)
1.  Start from the 2-leaf tree a,b where a,b are any two elements.
2.  For i = 3 to n (iteratively add vertices):
    1.  Take any vertex z not yet in the tree, consider 2 vertices x,y that are in the tree, and compute
        d(z,c) = (d(z,x) + d(z,y) − d(x,y))/2
        d(x,c) = (d(x,z) + d(x,y) − d(y,z))/2
    2.  From step 1 we know the position of c and the length of the branch (c,z).
        If c did not hit exactly a branching point, add c and z;
        else take as y any node from the sub-tree that branches at c and repeat steps 1, 2.
Example

       u   v   x   y
   u   0  10   5   9
   v  10   0   9   5
   x   5   9   0   9
   y   9   5   9   0

Adding x:
d(x,c) = (d(u,x) + d(v,x) − d(u,v))/2 = (5+9−10)/2 = 2
d(u,c) = (d(u,x) + d(u,v) − d(x,v))/2 = (5+10−9)/2 = 3
[Figure: u —3— c —7— v, with x attached to c by an edge of length 2.]

Adding y:
d(y,c’) = (d(u,y) + d(v,y) − d(u,v))/2 = (9+5−10)/2 = 2
d(u,c’) = (d(u,y) + d(u,v) − d(y,v))/2 = (9+10−5)/2 = 7
[Figure: u —3— c —4— c’ —3— v, with x attached to c (length 2) and y attached to c’ (length 2).]
Real matrices are almost never additive
•  Goal: find a tree that minimizes the error; optimizing the error is hard
•  Heuristics:
–  Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
–  Neighborhood Joining (NJ)
Hierarchical Clustering

•  Clustering problem: group items (e.g. genes) with similar properties (e.g. expression pattern, sequence similarity) so that
–  the clusters are homogeneous (the items in each cluster are highly similar, as measured by the given property)
–  the clusters are well separated (the items in different clusters are different)
•  Hierarchical clustering: many clusters have natural sub-clusters which are often easier to identify, e.g. cats are a sub-cluster of carnivores, a sub-cluster of mammals. Organize the elements into a tree rather than forming an explicit partitioning.
The basic algorithm
Input: distance array d; cluster-to-cluster distance function
Initialize:
1.  Put every element in a one-element cluster
2.  Initialize a forest T of one-node trees (each tree corresponds to one cluster)
while there is more than one cluster:
1.  Find the two closest clusters C1 and C2 and merge them into C
2.  Compute the distance from C to all other clusters
3.  Add a new vertex corresponding to C to the forest T and make the nodes corresponding to C1, C2 children of this node
4.  Remove from d the columns corresponding to C1, C2
5.  Add to d a column corresponding to C
A distance function

•  dave(C1,C2) = 1/(|C1||C2|) Σ_{x in C1, y in C2} d(x,y)

Average over all pairwise distances.
Example (on blackboard)

   d   A   B   C   D   E
   A   0   2   7   7  12
   B   2   0   7   7  12
   C   7   7   0   4  11
   D   7   7   4   0  11
   E  12  12  11  11   0
Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
•  Idea:
–  Combine hierarchical clustering with a method to put weights on the edges
–  Distance function used:
   dave(C1,C2) = 1/(|C1||C2|) Σ_{x in C1, y in C2} d(x,y)
–  We need to come up with a method of computing branch lengths
Ultrametric trees

•  The distance from any internal node C to any of its leaves is constant and equal to h(C)
•  For each node v we keep a variable h – the height of the node in the tree; h(v) = 0 for all leaves
UPGMA algorithm
Initialization (as in hierarchical clustering); h(v) = 0
while there is more than one cluster:
1.  Find the two closest clusters C1 and C2 and merge them into C
2.  Compute dave from C to all other clusters
3.  Add a new vertex corresponding to C to the forest T and make the nodes corresponding to C1, C2 children of this node
4.  Remove from d the columns corresponding to C1, C2
5.  Add to d a column corresponding to C
6.  h(C) = d(C1,C2)/2
7.  Assign length h(C) − h(C1) to edge (C1,C)
8.  Assign length h(C) − h(C2) to edge (C2,C)
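The steps above can be sketched as follows, using the blackboard matrix. This is a minimal illustration under my own data layout (dict-of-dicts distances, internal nodes named "N0", "N1", …); the average-linkage update uses cluster sizes so that dave stays the mean over all leaf pairs:

```python
from itertools import combinations

def upgma(dist):
    """UPGMA sketch. dist: {leaf: {other_leaf: distance}}.
    Returns (children, height): children maps each internal node to
    its two merged clusters; height[v] = h(v) as in the slides."""
    d = {a: dict(row) for a, row in dist.items()}   # working copy
    size = {a: 1 for a in d}
    height = {a: 0.0 for a in d}
    children, next_id = {}, 0
    while len(d) > 1:
        # step 1: the closest pair of clusters
        c1, c2 = min(combinations(d, 2), key=lambda p: d[p[0]][p[1]])
        new = f"N{next_id}"; next_id += 1
        height[new] = d[c1][c2] / 2                 # step 6: h(C) = d(C1,C2)/2
        children[new] = (c1, c2)
        size[new] = size[c1] + size[c2]
        # step 2: average-linkage distance from C to every other cluster
        d[new] = {}
        for c in d:
            if c in (c1, c2, new):
                continue
            d[new][c] = (size[c1] * d[c1][c] + size[c2] * d[c2][c]) / size[new]
            d[c][new] = d[new][c]
        # steps 4-5: drop the merged columns
        for c in (c1, c2):
            del d[c]
            for row in d.values():
                row.pop(c, None)
    return children, height

D = {'A': {'B': 2, 'C': 7, 'D': 7, 'E': 12},
     'B': {'A': 2, 'C': 7, 'D': 7, 'E': 12},
     'C': {'A': 7, 'B': 7, 'D': 4, 'E': 11},
     'D': {'A': 7, 'B': 7, 'C': 4, 'E': 11},
     'E': {'A': 12, 'B': 12, 'C': 11, 'D': 11}}
children, height = upgma(D)
print(height)  # A,B merge at height 1.0, then C,D at 2.0, ...
```

Branch lengths follow as h(C) − h(child), exactly as in steps 7 and 8.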
Neighbor Joining

   d   A   B   C   D
   A   0   5   7  10
   B   5   0   4   7
   C   7   4   0   5
   D  10   7   5   0

•  Idea: construct the tree by iteratively combining nodes that are neighbors in the tree
•  Trick: figuring out a pair of neighboring vertices takes a trick – the closest pair won't always do:
[Figure: the tree realizing the matrix, with leaf branch lengths A: 4, B: 1, C: 1, D: 4 and an internal edge of length 2.]
•  B and C are the closest pair but are NOT neighbors
Finding Neighbors

•  Let u(C) = 1/(#clusters − 2) Σ_{all clusters C’} d(C,C’)
•  Find a pair C1, C2 that minimizes
   f(C1,C2) = d(C1,C2) − (u(C1) + u(C2))
•  Motivation: keep d(C1,C2) small while (u(C1) + u(C2)) is large
•  For the data from the example:
   u(CA) = u(CD) = 1/2 (5+7+10) = 11
   u(CB) = u(CC) = 1/2 (5+4+7) = 8
   f(CA,CB) = 5 − 11 − 8 = −14
   f(CB,CC) = 4 − 8 − 8 = −12
NJ algorithm
Initialization (as in hierarchical clustering)
while there is more than one cluster:
1.  Find clusters C1 and C2 minimizing f(C1,C2) and merge them into C
2.  Compute for all C*: d(C,C*) = (d(C1,C*) + d(C2,C*) − d(C1,C2))/2
3.  Add a new vertex corresponding to C to the forest T and connect it to C1, C2
4.  Remove from d the columns corresponding to C1, C2
5.  Add to d a column corresponding to C
6.  Assign length ½(d(C1,C2) + u(C1) − u(C2)) to edge (C1,C)
7.  Assign length ½(d(C1,C2) + u(C2) − u(C1)) to edge (C2,C)
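A single neighbor-selection step, reproducing the numbers computed above, can be sketched as follows (the function name is mine; a full NJ run would repeat this inside the merge loop):

```python
def nj_pick(d, clusters):
    """One NJ selection step: compute u(C) for every cluster and pick
    the pair minimizing f(C1,C2) = d(C1,C2) - u(C1) - u(C2)."""
    n = len(clusters)
    u = {c: sum(d[c][o] for o in clusters if o != c) / (n - 2)
         for c in clusters}
    pair = min(((a, b) for a in clusters for b in clusters if a < b),
               key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
    return u, pair

d = {'A': {'B': 5, 'C': 7, 'D': 10},
     'B': {'A': 5, 'C': 4, 'D': 7},
     'C': {'A': 7, 'B': 4, 'D': 5},
     'D': {'A': 10, 'B': 7, 'C': 5}}
u, pair = nj_pick(d, 'ABCD')
print(u)     # {'A': 11.0, 'B': 8.0, 'C': 8.0, 'D': 11.0}
print(pair)  # ('A', 'B'): f = 5 - 11 - 8 = -14, the minimum
```

Note that B and C, the closest pair, lose to (A, B): f(B,C) = 4 − 8 − 8 = −12 > −14, so NJ correctly avoids joining non-neighbors. (The pair (C, D) ties at −14 and is an equally valid neighbor pair.)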
The NJ tree is not rooted
The order of construction of internal nodes of NJ does not suggest an ancestral relation.

[Figure: an unrooted tree with leaves 1–5.]
Rooting a tree
•  Choose one distant organism as an out-group

[Figure: the root is placed on the branch joining the out-group to the species of interest.]
Bootstrapping
•  Estimating confidence in the tree topology
•  Are we sure this is correct?
•  Is there enough evidence that A is a successor of B, not the other way around?
Bootstrapping, continued

•  Assume that the tree is built from a multiple sequence alignment
•  Select columns of the alignment randomly (with replacement) to form a randomized alignment

[Figure: the initial tree (built from the original columns) vs. a new tree built from resampled columns; an edge grouping A and B appears in 59% of the new trees.]

Repeat, say, 1000 times.
For each edge of the initial tree calculate the % of times it is present in the new tree.
Summary
•  Assume you have a multiple alignment of length N. Let T be the NJ tree built from this alignment.
•  Repeat, say, 1000 times the following process:
–  Select randomly with replacement N columns of the alignment to produce a randomized alignment
–  Build the tree for this randomized alignment
•  For each edge of T report the % of times it was present in a tree built from a randomized alignment. This is called the bootstrap value.
•  Trusted edges: 80% or better bootstrap.
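The resampling step itself is simple. A sketch of generating bootstrap replicates of an alignment (the tree-building step, which would consume each replicate, is omitted; names are illustrative):

```python
import random

def bootstrap_alignments(alignment, replicates, seed=0):
    """Yield bootstrap replicates of a multiple alignment.
    alignment: list of equal-length strings (rows = taxa).
    Each replicate samples N columns with replacement, where N is
    the alignment length."""
    rng = random.Random(seed)
    n = len(alignment[0])
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]   # with replacement
        yield [''.join(seq[c] for c in cols) for seq in alignment]

aln = ["ACGTACGT", "ACGAACGT", "TCGAACGA"]
for rep in bootstrap_alignments(aln, 2):
    print(rep)
```

Each replicate would then be fed to the tree builder, and the fraction of replicates containing each edge of T is its bootstrap value.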
Maximum Likelihood Method

•  Given a multiple sequence alignment and a probabilistic model of substitutions (like the PAM model), find the tree which has the highest probability of generating the data.
•  Simplifying assumptions:
–  Positions evolve independently
–  After species diverge, they evolve independently
Formally:
•  Find the tree T such that, assuming evolution model M,
   Pr[Data | T,M] is maximized
•  From the independence of positions:
   Pr[Data | T,M] = Π i Pr[Di | T,M]
   where the product is taken over all characters (columns) i, and Di is the value of character i over all taxa
Computing Pr[Di | T,M]
p(x,y,t) = probability of a mutation from x to y in time t (from the model)
[Figure: a rooted 4-taxon tree. The root in state x has children y (over branch t1) and z (over branch t3); y has leaves A (t2) and B (t6); z has leaves C (t4) and D (t5). Column Di assigns states to the leaves; all possible assignments of x, y, z to the internal nodes are considered.]

Pr[Di | T,M] = Σx Σy Σz p(x) p(x,y,t1) p(y,A,t2) p(y,B,t6) p(x,z,t3) p(z,C,t4) p(z,D,t5)
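The sum above can be evaluated directly for this small tree. A sketch, with two labeled assumptions that are mine and not from the slides: the substitution model is Jukes-Cantor (a simple stand-in for PAM), and all branches share one length t with a uniform root distribution p(x) = 1/4:

```python
import math

def jc_prob(x, y, t, k=4):
    """Jukes-Cantor transition probability between states x and y
    after time t, with k states (an assumed model)."""
    e = math.exp(-k * t / (k - 1))
    return (1 + (k - 1) * e) / k if x == y else (1 - e) / k

def column_likelihood(A, B, C, D, t, states="ACGT"):
    """Pr[Di | T,M] for the slide's 4-taxon tree: root x with
    children y (leaves A, B) and z (leaves C, D), summed over the
    unobserved internal states x, y, z. Assumes (for simplicity)
    equal branch lengths t and a uniform root distribution."""
    total = 0.0
    for x in states:
        for y in states:
            for z in states:
                total += (1 / len(states)
                          * jc_prob(x, y, t) * jc_prob(y, A, t) * jc_prob(y, B, t)
                          * jc_prob(x, z, t) * jc_prob(z, C, t) * jc_prob(z, D, t))
    return total

L_same = column_likelihood('A', 'A', 'A', 'A', 0.1)
L_diff = column_likelihood('A', 'C', 'G', 'T', 0.1)
print(L_same > L_diff)  # an all-identical column is more likely: True
```

Summed over all 4⁴ possible leaf columns, these likelihoods add to 1, which is a useful sanity check on the model. Felsenstein's pruning algorithm computes the same quantity without enumerating all internal assignments.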
Discovering the tree of life
•  “Tree of life” – the evolutionary tree of all organisms
•  Construction: choose a gene universally present in all organisms; good examples: the small rRNA subunit, mitochondrial sequences
•  Things to keep in mind while constructing the tree of life from sequence distances:
–  Lateral (or horizontal) gene transfer
–  Gene duplication: a genome may contain similar genes that may evolve along different pathways. The phylogeny tree needs to be derived from orthologous genes.
Where we go with it…
•  We now know how to compute Pr[Di | T,M] for a given column and a given tree
•  Multiply over all columns to get Pr[Data | T,M]
•  Now, explore the space of possible trees

Problem:
•  Bad news: the space of all possible trees is HUGE
•  Various heuristic approaches are used.
Metropolis algorithm: Random Walk in Energy Space
•  Consider the network of possible trees: there is an edge between “similar” trees, i.e. tree i can be obtained from tree j by a “local” change
•  A state is a tree; a move goes from state i to state j (another tree)
•  Goal: design transition probabilities so that the probability of arriving at state j is
   P(j) = q(j)/Z
   (typically q(S) = e^(−E(S)/kT), where E is the energy and T the temperature)
•  Z – the partition function = the sum over all states S of the terms q(S). Z cannot be computed analytically since the space is too large.
Monte Carlo, Metropolis Algorithm
•  At each state i, choose uniformly at random one of the neighboring states j
•  Compute p(i,j) = min(1, q(j)/q(i))
•  With probability p(i,j) move to state j
•  Iterate
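The key point is that only the ratio q(j)/q(i) is needed, so Z never has to be computed. A sketch on a toy 3-state space (not trees; all names are illustrative), where the visit frequencies converge to q(s)/Z:

```python
import random

def metropolis(q, neighbors, start, steps, seed=0):
    """Metropolis random walk sketch. q: unnormalized weight of a
    state; neighbors: state -> list of neighbor states (the proposal
    must be symmetric). Returns visit counts per state."""
    rng = random.Random(seed)
    state, visits = start, {}
    for _ in range(steps):
        j = rng.choice(neighbors[state])
        # accept the move with probability min(1, q(j)/q(i))
        if rng.random() < min(1.0, q(j) / q(state)):
            state = j
        visits[state] = visits.get(state, 0) + 1
    return visits

# Toy state space: 3 states on a cycle with weights 1 : 2 : 4
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
w = {0: 1.0, 1: 2.0, 2: 4.0}
visits = metropolis(lambda s: w[s], nbrs, 0, 50_000)
print({s: round(v / 50_000, 2) for s, v in sorted(visits.items())})
# close to the target distribution 1/7, 2/7, 4/7 ≈ 0.14, 0.29, 0.57
```

In the tree setting, a "neighbor" would be a tree reachable by one local rearrangement, and q(i) the likelihood of tree i.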
MrBayes
•  A program developed by J. Huelsenbeck & F. Ronquist
•  Assumption: q(i) = Pr[D | Ti,M]
•  Prior probabilities: all trees are equally likely
•  The proportion of time a given tree is visited approximates its posterior probability
Most Popular Phylogeny Software

•  PAUP
•  PHYLIP
