Computational Biology B. Tech - Bio-Tech (VI Semester)
Choose sequences -> Multiple sequence alignment -> Strong similarity?
  Yes: Maximum parsimony methods
  No: Recognizable similarity?
    Yes: Distance methods
    No: Maximum likelihood methods
UPGMA algorithm
Input: an n x n distance matrix D
1. Initialize a set C to consist of n sequences (singleton clusters)
2. Initialize the distance matrix:
dist({i},{j}) = D(i, j)
3. Repeat the following n-1 times:
a) Determine a pair c, d of clusters in C such that dist(c, d)
is minimal;
define dmin = dist(c, d)
b) Define a new cluster e = c U d;
define C = (C \ {c, d}) U {e}
c) Define a node with label e and daughters c and d,
where e has distance dmin/2 to its leaves
d) Define for all f in C with f different from e
dist(e,f) = dist(f,e) = [|c| dist(c,f) + |d| dist(d,f)] / (|c| + |d|)
(this equals [dist(c,f) + dist(d,f)]/2 when c and d contain the same number of sequences)
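The UPGMA steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the tie-breaking rule (the first minimal pair wins) are assumptions made here. The update in step d) uses the size-weighted average, which is the usual UPGMA rule and matches the simple average whenever the two merged clusters have equal size. The example matrix is the five-sequence table (A to E) used later in these notes.

```python
from itertools import combinations

def upgma(names, D):
    """Cluster with UPGMA; return (tree, root_height).

    The tree is a nested tuple of leaf names. root_height is the
    distance from the root to the leaves (dmin / 2 of the last merge).
    """
    # Each current cluster maps to (number of sequences, node height).
    clusters = {name: (1, 0.0) for name in names}
    dist = {frozenset((a, b)): float(D[i][j])
            for (i, a), (j, b) in combinations(enumerate(names), 2)}
    while len(clusters) > 1:
        # Step a: closest pair of clusters (first minimal pair on ties).
        c, d = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        dmin = dist[frozenset((c, d))]
        # Step b: replace c and d by the merged cluster e.
        size_c, _ = clusters.pop(c)
        size_d, _ = clusters.pop(d)
        e = (c, d)
        # Step d: size-weighted average distance from e to every other cluster.
        for f in clusters:
            dist[frozenset((e, f))] = (
                size_c * dist[frozenset((c, f))]
                + size_d * dist[frozenset((d, f))]
            ) / (size_c + size_d)
        # Step c: the new node sits dmin / 2 above its leaves.
        clusters[e] = (size_c + size_d, dmin / 2)
    tree, (_, height) = next(iter(clusters.items()))
    return tree, height

names = ["A", "B", "C", "D", "E"]
D = [[0, 22, 39, 39, 41],
     [22, 0, 41, 41, 43],
     [39, 41, 0, 18, 20],
     [39, 41, 18, 0, 10],
     [41, 43, 20, 10, 0]]
tree, height = upgma(names, D)
print(tree, round(height, 2))
```

On this matrix the sketch joins D with E first, then C with D/E, then A with B, and finally the two groups.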
Fitch-Margoliash Method
Uses a distance table
Based on least squares
Branch lengths are assumed to be additive
Method is accurate for trees with short branches
Tries different tree topologies, swapping branches among
closely related sequences, and recalculates the distances
Fitch-Margoliash Algorithm:
Find the most closely related pair of sequences, e.g. A and B
Treat the rest of the sequences as a single composite sequence and
calculate the average distance from A to all other sequences,
and from B to all other sequences
Calculate the branch lengths of A and B from their common node
Now treat A and B as a composite sequence AB and calculate the
average distance between AB and all other sequences
Make a new distance table
Identify the next most closely related pair, repeat the process,
and build a tree
Repeat the entire procedure starting from every other possible pair of
sequences: A and C, A and D, etc.
Calculate the predicted distances between each pair of sequences in
each tree to find the tree that best fits the original data
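Distance Matrix
       A     B     C     D     E
A      -
B     22     -
C     39    41     -
D     39    41    18     -
E     41    43    20    10     -

D and E joined; A, B and C treated as the composite A/B/C
         D      E      A/B/C
D        -
E       10      -
A/B/C   32.7   34.7    -
D to A/B/C = (39+41+18)/3 = 32.7
E to A/B/C = (41+43+20)/3 = 34.7

D + E = 10           (1)
D + A/B/C = 32.7     (2)
E + A/B/C = 34.7     (3)

[Figure: partial tree with branches labeled c, d and e]

D and E replaced by the composite D/E
        A     B     C     D/E
A       -
B      22     -
C      39    41     -
D/E    40    42    19     -

A and B replaced by the composite A/B
        C     D/E   A/B
C       -
D/E    19     -
A/B    40    41     -

Internal branch x leading to the D/E node:
x + [ ( e + d ) / 2 ] = 10
x + 5 = 10
x = 5

A and B joined; C, D and E treated as the composite C/D/E
         A      B      C/D/E
A        -
B       22      -
C/D/E   39.5   41.5    -

A + B = 22           (1)
A + C/D/E = 39.5     (2)
B + C/D/E = 41.5     (3)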
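Each set of three equations above (two sequences plus one composite) can be solved in closed form. The sketch below is only an illustration of that arithmetic; the function name `three_point_branches` is a hypothetical helper, and the inputs are the distances from the tables above.

```python
def three_point_branches(d_xy, d_xz, d_yz):
    """Solve x + y = d_xy, x + z = d_xz, y + z = d_yz for (x, y, z).

    x, y, z are the lengths of the three branches that meet at one node.
    """
    x = (d_xy + d_xz - d_yz) / 2
    y = (d_xy + d_yz - d_xz) / 2
    z = (d_xz + d_yz - d_xy) / 2
    return x, y, z

# D, E and the composite A/B/C
d, e, abc = three_point_branches(10, 32.7, 34.7)
# A, B and the composite C/D/E
a, b, cde = three_point_branches(22, 39.5, 41.5)

print(round(d, 1), round(e, 1), round(a, 1), round(b, 1))
```

With the values from the tables this gives d = 4 and e = 6 for the first system, and a = 10 and b = 12 for the second.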
Advantages of FM method
Tests more than one tree
Fast to evaluate for a single tree
Can use empirical substitution scoring methods
Disadvantages of FM method
Long execution time when many trees are tested
Does not consider intermediate ancestors
Misses homoplasies
Long evolutionary distances are underestimated
Neighbor-joining (N-J) algorithm
1. The raw data are provided as a distance matrix and the initial tree is a star tree.
2. A modified distance matrix is constructed in which the separation between
each pair of nodes is adjusted on the basis of their average divergence from all
other nodes.
3. The tree is constructed by linking the least-distant pair of nodes in this
modified matrix (using the FM method).
4. When two nodes are linked, their common ancestral node is added to the tree
and the terminal nodes with their respective branches are removed from the tree.
5. This pruning process converts the newly added common ancestor into a
terminal node on a tree of reduced size.
6. At each stage in the process two terminal nodes are replaced by one new node.
7. The process is complete when two nodes remain, separated by a single branch.
Distance Matrix
      A     B     C     D     E     F
A     0     5     4     7     6     8
B     5     0     7    10     9    11
C     4     7     0     7     6     8
D     7    10     7     0     5     9
E     6     9     6     5     0     8
F     8    11     8     9     8     0

[Figure: Star Tree, with A, B, C, D, E and F each joined directly to a single central node]
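Modified distance matrix
      A      B      C      D      E      F
A     0    -13    -11.5  -10    -10    -10.5
B            0    -11.5  -10    -10    -10.5
C                   0    -10.5  -10.5  -11.0
D                          0    -13    -11.5
E                                 0    -11.5
F                                        0

Distance matrix after A and B are joined at the new node U
      U     C     D     E     F
U     0     3     6     5     7
C           0     7     6     8
D                 0     5     9
E                       0     8
F                             0

[Figure: tree after the first joining step, with A and B attached to a new internal node and C, D, E and F still attached to the central node]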
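One N-J step on the six-taxon matrix can be reproduced with a short sketch. It assumes the common formulation in which the modified distance is M(i,j) = d(i,j) - (r_i + r_j)/(n - 2), where r_i is the sum of row i; this reproduces the value -13 for the pair A, B. The function names `nj_modified_matrix` and `nj_join` are hypothetical helpers introduced for illustration.

```python
def nj_modified_matrix(D):
    """Modified matrix M[i][j] = D[i][j] - (r_i + r_j) / (n - 2)."""
    n = len(D)
    r = [sum(row) for row in D]
    return [[0.0 if i == j else D[i][j] - (r[i] + r[j]) / (n - 2)
             for j in range(n)]
            for i in range(n)]

def nj_join(D, i, j):
    """Join nodes i and j at a new node u.

    Returns (branch i-u, branch j-u, {k: distance from u to node k}).
    """
    n = len(D)
    r = [sum(row) for row in D]
    branch_i = D[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    branch_j = D[i][j] - branch_i
    to_u = {k: (D[i][k] + D[j][k] - D[i][j]) / 2
            for k in range(n) if k not in (i, j)}
    return branch_i, branch_j, to_u

taxa = ["A", "B", "C", "D", "E", "F"]
D = [[0, 5, 4, 7, 6, 8],
     [5, 0, 7, 10, 9, 11],
     [4, 7, 0, 7, 6, 8],
     [7, 10, 7, 0, 5, 9],
     [6, 9, 6, 5, 0, 8],
     [8, 11, 8, 9, 8, 0]]

M = nj_modified_matrix(D)
branch_a, branch_b, to_u = nj_join(D, 0, 1)  # join A and B
print(M[0][1], branch_a, branch_b, {taxa[k]: v for k, v in to_u.items()})
```

The pairs A, B and D, E tie for the smallest modified distance; the sketch joins A and B, which yields the distances from U to C, D, E and F listed in the reduced matrix.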
[Figure: trees for taxa 1, 2, 3 and 4 with the bases G, A and C at one site; a marked change along a branch is a substitution]
Maximum Parsimony
Input: four aligned sequences ACT, ACA, GTT, GTA
Four sequences can be joined in three different unrooted trees:
(ACT, GTA) | (ACA, GTT)    MP score = 7 (with optimal internal labels this tree scores 6)
(ACT, GTT) | (ACA, GTA)    MP score = 5
(ACA, ACT) | (GTA, GTT)    MP score = 4    Optimal MP tree
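The score of a fixed tree can be computed site by site with Fitch's small-parsimony procedure. The sketch below is a minimal illustration under stated assumptions: each unrooted four-sequence tree is rooted on its internal branch (the Fitch count does not depend on where the root is placed), leaves are written as the sequences themselves, and the names `fitch_site` and `parsimony_score` are hypothetical.

```python
def fitch_site(node, site):
    """Return (candidate state set, substitution count) for one site."""
    if isinstance(node, str):          # leaf: the sequence itself
        return {node[site]}, 0
    left, right = node
    left_set, left_cost = fitch_site(left, site)
    right_set, right_cost = fitch_site(right, site)
    common = left_set & right_set
    if common:                         # children agree: no new change
        return common, left_cost + right_cost
    # children disagree: one substitution is needed at this node
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, n_sites):
    """Minimum number of substitutions summed over all sites."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))

trees = {
    "(ACA, ACT) | (GTA, GTT)": (("ACA", "ACT"), ("GTA", "GTT")),
    "(ACT, GTT) | (ACA, GTA)": (("ACT", "GTT"), ("ACA", "GTA")),
    "(ACT, GTA) | (ACA, GTT)": (("ACT", "GTA"), ("ACA", "GTT")),
}
scores = {label: parsimony_score(tree, 3) for label, tree in trees.items()}
for label, score in scores.items():
    print(label, score)
```

This gives minimum scores of 4, 5 and 6 for the three trees. The value 7 quoted for the third tree is higher than this minimum, presumably because it was counted for one particular choice of internal labels.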
Maximum Likelihood
Goal: Construct a phylogenetic tree from DNA sequences whose
likelihood is a maximum. (Felsenstein 1981)
Procedure
Start with a given topology and use the maximum likelihood method to
optimize branch lengths
Make local modifications to the topology and re-optimize the branch
lengths
New taxa are added one by one, optimizing branch lengths and
topologies each time
Assumes an evolutionary process that is a reversible Markov process
Very computationally expensive to use
Likelihood of a Tree
We want to find L(tree) = Pr[data|tree]
Given the data: a1=CT, a2=CG and a3=AT
Consider the tree
Multiply likelihood for each character position
Recursive definition of Likelihood
Saves computational time
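The recursive definition can be sketched as follows. Because the tree figure is not reproduced here, everything specific in this example is an assumption: the rooted shape ((a1, a2), a3), the branch lengths 0.1, 0.1, 0.2 and 0.3, the Jukes-Cantor substitution model (one simple reversible Markov model), and the function names. Only the data a1=CT, a2=CG, a3=AT come from the text.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of base x changing to y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def partial(node, site, data):
    """Conditional likelihoods: Pr[data below node | node has base x]."""
    if isinstance(node, str):                      # leaf name
        return {x: 1.0 if data[node][site] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                          # (subtree, branch length)
        below = partial(child, site, data)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def tree_likelihood(tree, data):
    """L(tree) = product over sites of the per-site likelihood."""
    n_sites = len(next(iter(data.values())))
    likelihood = 1.0
    for site in range(n_sites):
        root = partial(tree, site, data)
        likelihood *= sum(0.25 * root[x] for x in BASES)  # uniform root prior
    return likelihood

data = {"a1": "CT", "a2": "CG", "a3": "AT"}
# Hypothetical tree: a1 and a2 share an internal node, a3 joins at the root.
inner = (("a1", 0.1), ("a2", 0.1))
tree = ((inner, 0.2), ("a3", 0.3))
print(tree_likelihood(tree, data))
```

Each internal node is visited once per site, which is where the saving in computational time comes from compared with summing over every assignment of bases to the internal nodes.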