Computational Biology B. Tech - Bio-Tech (VI Semester)

(1) The document discusses various phylogenetic tree building methods including distance-based, character-based, and probabilistic methods. (2) Distance-based methods like UPGMA, Neighbor-Joining, and Fitch-Margoliash first calculate genetic distances between sequences and then input the distance matrix into a tree-building algorithm. (3) Maximum parsimony is a character-based method that identifies the tree requiring the fewest evolutionary changes, or steps, to explain the observed sequence variation between taxa. (4) Maximum likelihood is a probabilistic method that selects the tree topology that has the highest probability of producing the given sequence data under an assumed model of sequence evolution.

Uploaded by

sd

COMPUTATIONAL BIOLOGY

B. Tech Bio-Tech (VIth Semester)


Module 3

Phylogenetic Tree Building

Flowchart: choosing a tree-building method
Choose sequences -> Multiple sequence alignment -> Strong similarity?
  Yes: Maximum parsimony methods
  No: Recognizable similarity?
    Yes: Distance methods
    No: Maximum likelihood methods
After any method: How well does the data support the prediction?

Distance Based Methods


Distance methods first convert the aligned sequences into a pairwise
distance matrix, then input that matrix into a tree-building
method.
Drawback:
The major objections to distance methods are that summarizing
a set of sequences as distance data loses information, and that branch
lengths estimated by some distance methods might not be
evolutionarily interpretable.

Types of Distance Based Methods


Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
Fitch-Margoliash Method (FM)
Neighbour Joining Method (NJ)
Minimum Evolution Method (ME)

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

Clustering algorithm
Joins tree branches based on the criterion of greatest similarity
among pairs and averages of joined pairs
Not strictly an evolutionary distance method
Follows the molecular clock hypothesis, i.e. the rate of change
along the branches of the tree is constant and distances are
approximately ultrametric

UPGMA algorithm
Input an n x n distance matrix D
1. Initialize a set C to consist of the n sequences (singleton clusters)
2. Initialize the distance matrix
dist({i},{j}) = D(i, j)
3. Repeat the following n-1 times
a) Determine a pair c, d of clusters in C such that dist(c, d)
is minimal;
define dmin = dist(c, d)
b) Define a new cluster e = c U d;
define C = (C - {c, d}) U {e}
c) Define a node with label e and daughters c and d,
where e has distance dmin/2 to its leaves
d) Define for all f in C with f different from e
dist(e,f) = dist(f,e) = [|c| dist(c,f) + |d| dist(d,f)] / (|c| + |d|)
where |c| and |d| are the numbers of sequences in c and d
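
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name upgma, the pair-keyed input dictionary and the bracketed cluster labels are assumptions made for the example. The distance update weights each merged cluster by its size, which reduces to the plain average when both clusters hold the same number of sequences. The input is the five-sequence distance table used later in the Fitch-Margoliash example.

```python
def upgma(D):
    """Cluster taxa from D, a dict mapping (a, b) label pairs to distances.

    Returns (root_label, heights), where heights maps every merged
    cluster label to its height above the leaves (dmin / 2).
    """
    size = {n: 1 for pair in D for n in pair}           # singleton clusters
    dist = {frozenset(k): float(v) for k, v in D.items()}
    heights = {}
    while len(size) > 1:
        # a) find the closest pair of clusters
        pair = min(dist, key=dist.get)
        c, d = sorted(pair)
        dmin = dist[pair]
        # b) merge them into a new cluster e
        e = f"({c},{d})"
        # d) size-weighted average distance from e to every other cluster
        for f in size:
            if f in pair:
                continue
            dist[frozenset((e, f))] = (
                size[c] * dist[frozenset((c, f))]
                + size[d] * dist[frozenset((d, f))]
            ) / (size[c] + size[d])
        # discard every distance that still mentions c or d
        dist = {k: v for k, v in dist.items() if c not in k and d not in k}
        size[e] = size.pop(c) + size.pop(d)
        # c) the new node sits dmin / 2 above its leaves
        heights[e] = dmin / 2
    (root,) = size
    return root, heights


# Hypothetical usage with the A-E distance table from these slides.
EXAMPLE = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}
root, heights = upgma(EXAMPLE)
print(root)
print(heights)
```

Because UPGMA assumes a molecular clock, every leaf under a node ends up at the same height, so the tree produced this way is always ultrametric even when the input distances are not.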

Fitch-Margoliash Method
Uses a distance table
Based on least squares
Branch lengths are assumed to be additive
Method is accurate for trees with short branches
Tries different tree topologies, swapping branches among
closely related sequences, and recalculates the distances

Fitch-Margoliash Algorithm:
Find the most closely related pair of sequences, e.g. A and B
Treat the rest of the sequences as a single composite sequence and
calculate the average distance from A to all other sequences,
and from B to all other sequences
Calculate the branch lengths of A and B from their common node
Now treat A and B as a composite sequence AB and calculate the
average distance between AB and all other sequences
Make a new distance table
Identify the next most closely related pair and repeat the process
to build a tree
Repeat the entire procedure starting from all possible pairs of
sequences: A and B, A and C, A and D, etc.
Calculate the predicted distances between each pair of sequences on
each tree to find the tree that best fits the original data

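Distance Matrix

      A     B     C     D     E
A     -
B    22     -
C    39    41     -
D    39    41    18     -
E    41    43    20    10     -

Step 1: D and E are closest (10). Treat A, B and C as the composite A/B/C.

          D       E       A/B/C
D         0
E        10       0
A/B/C    32.7    34.7     0

D to A/B/C = (39+41+18)/3 = 32.7
E to A/B/C = (41+43+20)/3 = 34.7

Let d and e be the branch lengths from D and E to their common node.

d + e = 10           (1)
d + A/B/C = 32.7     (2)
e + A/B/C = 34.7     (3)

Subtract (2) from (3)
e - d = 2            (4)

Add (1) and (4)
2e = 12
Therefore, e = 6, d = 4

Step 2: Replace D and E by the composite D/E and make a new distance table.

        A     B     C     D/E
A       0
B      22     0
C      39    41     0
D/E    40    42    19     0

C and D/E are closest (19). Treat A and B as the composite A/B.

        C     D/E    A/B
C       0
D/E    19     0
A/B    40    41      0

c + D/E = 19         (1)
c + A/B = 40         (2)
D/E + A/B = 41       (3)

Subtract (2) from (3)
D/E - c = 1          (4)

Add (1) and (4)
2 D/E = 20
Therefore D/E = 10, c = 9

D/E = 10 is the distance from the new node to the D/E composite. It consists
of the internal branch x plus the average of the branches d and e:
x + [(e + d)/2] = 10
x + 5 = 10
x = 5

[Figure: partial tree showing branches c, d and e]

Step 3: Replace C and D/E by the composite C/D/E.

          A       B       C/D/E
A         0
B        22       0
C/D/E    39.5    41.5     0

A to C/D/E = (39+40)/2 = 39.5
B to C/D/E = (41+42)/2 = 41.5

a + b = 22           (1)
a + C/D/E = 39.5     (2)
b + C/D/E = 41.5     (3)

Subtract (2) from (3)
b - a = 2            (4)

Add (1) and (4)
2b = 24
Therefore, b = 12, a = 10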
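
Each round of the worked example solves the same three equations in three unknowns. A short sketch of that calculation follows. The function name three_point is an assumption for illustration, and the closed-form expressions are simply what the add-and-subtract steps above produce.

```python
def three_point(d_ab, d_ac, d_bc):
    """Branch lengths from a common node to A, B and C.

    Solves a + b = d_ab, a + c = d_ac, b + c = d_bc, where C may be a
    composite of all remaining sequences.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Round 1: D, E and the composite A/B/C
d, e, _ = three_point(10, (39 + 41 + 18) / 3, (41 + 43 + 20) / 3)

# Round 2: C, the composite D/E, and the composite A/B
c, de, _ = three_point(19, 40, 41)
x = de - (d + e) / 2          # internal branch between the two nodes

# Round 3: A, B and the composite C/D/E
a, b, _ = three_point(22, 39.5, 41.5)

print(d, e, c, x, a, b)
```

The third value returned in each round is the distance from the node to the composite, which is why it is only used in round 2, where it is split into x and the average of d and e.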

Advantages of FM method
Tests more than one tree
Fast in execution
Can use empirical substitution scoring methods
Disadvantages of FM method
Long execution time
Does not consider intermediate ancestors
Misses homoplasies
Long evolutionary distances are underestimated

Neighbor Joining Method


Is a cluster method but does not require the data to be ultrametric
Suitable when rate of evolution of separate lineages (taxa) under
consideration varies
Very much like F-M Method
Choice of which sequences to pair is determined by a different
algorithm
The following programs are available
- Neighbor of the Phylip package
- ClustalW
- Distnj in the Protml package

N-J algorithm
1. The raw data are provided as a distance matrix and the initial tree is a star tree.
2. A modified distance matrix is constructed in which the separation between
each pair of nodes is adjusted on the basis of their average divergence from all
other nodes.
3. The tree is constructed by linking the least-distant pair of nodes in this
modified matrix (Using FM method).
4. When two nodes are linked, their common ancestral node is added to the tree
and the terminal nodes with their respective branches are removed from tree.
5. This pruning process converts the newly added common ancestor into a
terminal node on a tree of reduced size.
6. At each stage in the process two terminal nodes are replaced by one new node.
7. The process is complete when two nodes remain, separated by a single branch.

Distance Matrix

     A    B    C    D    E    F
A    0    5    4    7    6    8
B    5    0    7   10    9   11
C    4    7    0    7    6    8
D    7   10    7    0    5    9
E    6    9    6    5    0    8
F    8   11    8    9    8    0

Star Tree

[Figure: star tree in which all six OTUs (A to F) radiate from a single central node]

Total 6 OTUs (N=6)

Step 1: Calculate the net divergence r(i) of each OTU from all
other OTUs
r(A) = 5+4+7+6+8 = 30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44
Step 2: Calculate a new distance matrix using the
following formula for each pair of OTUs
M(ij) = d(ij) - [r(i) + r(j)]/(N-2)
ex: For AB
M(AB) = d(AB) - [r(A) + r(B)]/(N-2) = 5 - (30 + 42)/4 = -13
Step 3: Choose as neighbors the two OTUs for
which M(ij) is the smallest
Here A and B, and D and E, tie at M = -13

Modified distance matrix M(ij)

       A       B       C       D       E       F
A      0     -13     -11.5   -10     -10     -10.5
B              0     -11.5   -10     -10     -10.5
C                      0     -10.5   -10.5   -11.0
D                              0     -13     -11.5
E                                      0     -11.5
F                                              0
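
Steps 1 to 3 are easy to check by machine. The sketch below recomputes the net divergences and the modified matrix for the six-OTU example. The function names net_divergence and modified_matrix are assumptions for illustration, not names from any package.

```python
LABELS = "ABCDEF"
DIST = [
    [0, 5, 4, 7, 6, 8],
    [5, 0, 7, 10, 9, 11],
    [4, 7, 0, 7, 6, 8],
    [7, 10, 7, 0, 5, 9],
    [6, 9, 6, 5, 0, 8],
    [8, 11, 8, 9, 8, 0],
]


def net_divergence(d):
    """r(i): sum of the distances from OTU i to all other OTUs."""
    return [sum(row) for row in d]


def modified_matrix(d):
    """M(ij) = d(ij) - [r(i) + r(j)] / (N - 2) for every pair i < j."""
    n = len(d)
    r = net_divergence(d)
    return {
        (i, j): d[i][j] - (r[i] + r[j]) / (n - 2)
        for i in range(n)
        for j in range(i + 1, n)
    }


r = net_divergence(DIST)
M = modified_matrix(DIST)
smallest = min(M.values())
# All pairs that tie for the smallest M value are candidate neighbors.
neighbors = [LABELS[i] + LABELS[j] for (i, j), m in M.items() if m == smallest]
print(r, smallest, neighbors)
```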

Step 4: Take A and B as neighbors and form a new node
called U, such that A and B are descendants of U
Step 5: Apply the F-M method to calculate the branch
lengths of A and B using the initial distance matrix,
which gives AU = 1, BU = 4 (the two must sum to d(AB) = 5)
Step 6: Create a new distance matrix by
replacing A and B by U
Step 7: Repeat from Step 1,
now with N = N-1 = 5

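New distance matrix

     U    C    D    E    F
U    0    3    6    5    7
C         0    7    6    8
D              0    5    9
E                   0    8
F                        0

[Figure: tree after the first joining step. A and B hang from the new node U; C, D, E and F are still attached to the central node.]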
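
The joining step can be sketched as follows, assuming the standard neighbor-joining formulas: the branch from i to the new node is d(ij)/2 + [r(i) - r(j)] / [2(N-2)], which is the same value the F-M three-point calculation gives when all other OTUs are treated as one composite, and the distance from the new node to any other OTU k is [d(ik) + d(jk) - d(ij)]/2. The function name join_pair is hypothetical.

```python
DIST = [
    [0, 5, 4, 7, 6, 8],
    [5, 0, 7, 10, 9, 11],
    [4, 7, 0, 7, 6, 8],
    [7, 10, 7, 0, 5, 9],
    [6, 9, 6, 5, 0, 8],
    [8, 11, 8, 9, 8, 0],
]


def join_pair(d, i, j):
    """Join OTUs i and j under a new node U.

    Returns (branch i-U, branch j-U, {k: distance from U to k}).
    """
    n = len(d)
    r = [sum(row) for row in d]
    s_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_j = d[i][j] - s_i          # the two branches add up to d(ij)
    to_u = {
        k: (d[i][k] + d[j][k] - d[i][j]) / 2
        for k in range(n)
        if k not in (i, j)
    }
    return s_i, s_j, to_u


# Join A (index 0) and B (index 1)
au, bu, u_row = join_pair(DIST, 0, 1)
print(au, bu, u_row)
```

The dictionary returned for U is the first row of the reduced matrix: indices 2 to 5 correspond to C, D, E and F.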

Advantages and disadvantages of the neighbor-joining method

Advantages
Is fast and thus suited for large datasets
Permits lineages with largely different branch lengths
Permits correction for multiple substitutions
Disadvantages
Sequence information is reduced
Gives only one possible tree
Strongly dependent on the model of evolution used
Misses homoplasies

Maximum Parsimony Method


Finds the tree which minimizes the number of steps required to generate the
observed variation in the sequences
Also referred to as the minimum evolution method (not to be confused with the
distance-based Minimum Evolution method listed earlier)
Identify the informative sites in the MSA
For each position, the phylogenetic trees that require the smallest number of
evolutionary changes to produce the observed sequence changes are identified
Method is useful for sequences that are quite similar and for small
numbers of sequences
With an exhaustive search it is guaranteed to find the best tree, because all
possible trees relating a group of sequences are examined
Time-consuming
Not useful for data that include a large number of sequences

Maximum Parsimony Example

1  A A G A G T G C A
2  A G C C G T G C G
3  A G A T A T C C A
4  A G A G A T C C G

Four sequences, three possible unrooted trees:
((1,2),(3,4))   ((1,3),(2,4))   ((1,4),(2,3))

Some sites are informative, others are not
An informative site has at least two different characters, each
present in at least two sequences
Only informative sites are considered
Three informative columns (positions 5, 7 and 9):

Sequence   Site 5   Site 7   Site 9
1          G        G        A
2          G        G        G
3          A        C        A
4          A        C        G

[Figure: the three unrooted trees for one informative site, with each required substitution marked on a branch]
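
Picking out the informative columns is mechanical. A minimal sketch follows, assuming the usual definition (at least two different characters, each occurring in at least two sequences). The function name informative_sites is an assumption for illustration.

```python
from collections import Counter

SEQS = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]


def informative_sites(seqs):
    """Return the 1-based positions of parsimony-informative columns."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # characters that occur in two or more sequences
        shared = [ch for ch, n in counts.items() if n >= 2]
        if len(shared) >= 2:
            sites.append(pos)
    return sites


print(informative_sites(SEQS))
```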

The optimal tree is obtained by adding the numbers of
changes at each informative site for each tree, and
picking the tree requiring the least number of
changes.
A scoring matrix is generally used instead of simply
counting each change as 1.

Maximum parsimony (example)

Input: Four sequences
ACT
ACA
GTT
GTA

Question: which of the three trees has the best MP score?

Tree 1: ((ACT,GTT),(ACA,GTA)), internal nodes GTT and GTA, changes 2 + 1 + 2
MP score = 5

Tree 2: ((ACT,GTA),(GTT,ACA)), internal nodes ACT and ACA, changes 3 + 1 + 3
MP score = 7 as labeled (a better choice of internal labels needs only 6
changes, which is still worse than Tree 3)

Tree 3: ((ACT,ACA),(GTT,GTA)), internal nodes ACA and GTA, changes 1 + 2 + 1
MP score = 4
Optimal MP tree
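
The score of a fixed tree can be computed without drawing internal labels by hand. The sketch below uses Fitch's small-parsimony algorithm, a standard technique that these slides do not name, so treat it as a swapped-in method. It returns the minimum number of changes over all internal labelings, so its result for a topology can be lower than the total for one particular labeling drawn on a slide. Trees are written as nested pairs rooted on the internal edge, and the function names are assumptions for illustration.

```python
def fitch_site(node, column):
    """Return (possible states, changes) for one alignment column."""
    if isinstance(node, str):                 # leaf: look up its character
        return {column[node]}, 0
    left, right = node
    l_states, l_changes = fitch_site(left, column)
    r_states, r_changes = fitch_site(right, column)
    common = l_states & r_states
    if common:                                # children agree: no new change
        return common, l_changes + r_changes
    return l_states | r_states, l_changes + r_changes + 1


def parsimony_score(tree, seqs):
    """Sum the Fitch change counts over every column of the alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(n_sites)
    )


# Each sequence doubles as its own leaf label.
SEQS = {s: s for s in ("ACT", "ACA", "GTT", "GTA")}
TREES = [
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("GTT", "ACA")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
scores = [parsimony_score(t, SEQS) for t in TREES]
print(scores)
```

The tree that groups ACT with ACA and GTT with GTA has the lowest score, so it is the optimal MP tree whichever way the other two trees are labeled.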

Advantages and Disadvantages of MP method


Advantages:
Reconstructs ancestral nodes
Performs better than distance methods
Provides numerous parsimonious trees
Disadvantages:
Branch lengths cannot be determined
Slower than distance matrix methods
Provides numerous parsimonious trees
Sensitive to order in which sequences are added to tree

Maximum Likelihood
Goal: Construct a phylogenetic tree from DNA sequences whose
likelihood is a maximum. (Felsenstein 1981)
Procedure
Start with a given topology and use the maximum likelihood method to
optimize branch lengths
Make local modifications to the topology and re-optimize the branch
lengths
New taxa are added one by one, optimizing branch lengths and
topologies each time
Assumes an evolutionary process that is a reversible Markov process
Very computationally expensive to use

Likelihood of a Tree
We want to find L(tree) = Pr[data|tree]
Given the data: a1=CT, a2=CG and a3=AT
Consider the tree

We can calculate the likelihood of this tree if we
fill in the internal nodes

Maximum Likelihood Methods


Given a probabilistic model for nucleotide
substitution (e.g., the Jukes and Cantor model), pick
the tree that has the highest probability of generating
the observed data.
In other words, given character data D and a model
M, we want to find the tree T that maximizes the
expression
Pr(D | T, M)
Much more computationally intensive than parsimony

Maximum Likelihood Methods


Assumptions
Different characters evolve independently.
It is also assumed that after species have
diverged, they evolve independently.
Thus, if Di is the data for the ith character,
then:
Pr(D | T, M) = ∏_i Pr(Di | T, M)

Likelihood of a Tree
L(tree) = Pr [data|tree]
Multiply likelihood for each character position
Recursive definition of Likelihood
Saves computational time
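
A minimal sketch of the recursive likelihood calculation follows, for the three sequences a1=CT, a2=CG and a3=AT. Everything beyond those sequences is an assumption made for illustration: the rooted shape ((a1,a2),a3), the branch lengths 0.1 and 0.3, the Jukes-Cantor model, and equal base frequencies of 1/4 at the root. The recursion computes, for every node, the probability of the data below it given each possible state at that node, then multiplies the per-site likelihoods.

```python
import math

BASES = "ACGT"


def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {(x, y): (same if x == y else diff) for x in BASES for y in BASES}


def partial(node, site):
    """P(data below node | state x at node) for each state x."""
    if isinstance(node, str):                  # leaf: node is the sequence
        return {x: 1.0 if x == node[site] else 0.0 for x in BASES}
    out = {x: 1.0 for x in BASES}
    for child, t in node:                      # (child, branch length) pairs
        p, below = jc69(t), partial(child, site)
        for x in BASES:
            out[x] *= sum(p[x, y] * below[y] for y in BASES)
    return out


def likelihood(tree, n_sites):
    """L(tree) = product over sites of sum_x 1/4 * P(data | root state x)."""
    total = 1.0
    for site in range(n_sites):
        cond = partial(tree, site)
        total *= sum(0.25 * cond[x] for x in BASES)
    return total


# Assumed topology and branch lengths: ((a1:0.1, a2:0.1):0.1, a3:0.3)
INNER = (("CT", 0.1), ("CG", 0.1))
TREE = ((INNER, 0.1), ("AT", 0.3))
L_example = likelihood(TREE, 2)
print(L_example)
```

Reusing the per-node results is what saves time: a naive calculation sums over every combination of internal states, while the recursion handles one node at a time.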

Example Tree for Maximum Likelihood
