
Tutorial Note 9 (Week 11) Phylogenetic Tree

The document provides an overview of phylogenetic tree reconstruction methods including maximum parsimony, maximum likelihood, distance-based methods like UPGMA and neighbor-joining. It discusses computing likelihood values using dynamic programming and hill climbing to find local maxima. Examples are given to demonstrate computing likelihoods on sample trees using the Jukes-Cantor nucleotide substitution model. The UPGMA clustering algorithm is also summarized.

Uploaded by

Romario Tim Vaz

Tutorial Note 9 (Week 11)

Phylogenetic Tree

The Chinese University of Hong Kong


CSCI3220 Algorithms for Bioinformatics
TA: Zhenghao Zhang
Agenda
1. Phylogenetic tree reconstruction
– Problem definition
2. Sequence-based methods
– Maximum parsimony
– Maximum likelihood
3. Distance-based methods
– UPGMA
– Neighbor-joining
4. Questions from Assignment3

CSCI3220 Algorithms for Bioinformatics Tutorial Notes 2


Small likelihood
• We try different divergence times and mutation
rates, changing them slightly in each iteration
• After many trials, the likelihood converges to
some local (possibly global) maximum

• You need to know how to:

– Evaluate the likelihood for given divergence times and
mutation rate using dynamic programming
– Compare likelihood values
Image credit: http://www.absoluteastronomy.com/topics/Hill_climbing
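The search loop described above can be sketched in a few lines. This is only an illustration: `toy_score` is a made-up stand-in for the tree likelihood, and the step size, iteration count and function names are assumptions, not part of the tutorial.

```python
import random

def hill_climb(score, start, step=0.05, iterations=5000, seed=0):
    """Greedy local search: keep a small random change only if it raises the score."""
    rng = random.Random(seed)
    current = list(start)
    best = score(current)
    for _ in range(iterations):
        # Perturb every parameter a little, keeping it positive.
        candidate = [max(1e-6, v + rng.uniform(-step, step)) for v in current]
        value = score(candidate)
        if value > best:
            current, best = candidate, value
    return current, best

def toy_score(params):
    """Hypothetical smooth surface with a single peak at rate=0.3, time=1.5."""
    rate, time = params
    return -((rate - 0.3) ** 2) - ((time - 1.5) ** 2)

params, value = hill_climb(toy_score, start=[1.0, 1.0])
print(params, value)
```

With a real likelihood surface there can be several peaks, so the result depends on the starting point; restarting from different starting values is a common way to reduce that risk.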



Exercise1: Computing Likelihood
• Using the Jukes-Cantor model with mutation rate =
0.01 and the divergence times drawn on the branches
of the tree, and assuming A, C, G and T are equally
likely (0.25 each) for sequence g, calculate the
likelihood of the tree.

              g:G
          1 /     \ 1
         e:G       f:G
      1 /  \ 1  1 /  \ 1
     a:G  b:G  c:T   d:G



Answer: Computing Likelihood

Node  Likelihood
a     0.97 (no mutation, a:G == e:G)
b     0.97 (no mutation, b:G == e:G)
c     0.01 (mutation, c:T != f:G)
d     0.97 (no mutation, d:G == f:G)
e     (0.97 a:G)(0.97 b:G)(0.97 e:G == g:G) = 0.912673
f     (0.97 d:G)(0.01 c:T)(0.97 f:G == g:G) = 0.009409
g     (0.25 g:G)(0.912673 e:G)(0.009409 f:G) = 0.00214683506

Assume the probability of having A, C, G and T are equal (0.25) for the sequence g.

A Particular Case   Probability
1 mutation          0.01
1 no change         0.97

Mutations of G include G→A, G→C, G→T
Thus, 3*P(Mutation) + P(No Change) = 1
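The arithmetic in the table can be checked with a short script. It is a sketch that multiplies one factor per branch plus the 0.25 prior at g; the dictionary layout and names are chosen here for illustration only.

```python
RATE = 0.01                 # probability of one specific mutation on a unit branch
NO_CHANGE = 1 - 3 * RATE    # 0.97, from 3*P(Mutation) + P(No Change) = 1

def branch_prob(parent, child):
    """Probability of seeing `child` below `parent` on a branch of length 1."""
    return NO_CHANGE if parent == child else RATE

# Characters at every node, as given in the exercise.
bases = {"a": "G", "b": "G", "c": "T", "d": "G", "e": "G", "f": "G", "g": "G"}
# (parent, child) pairs, one per branch of the tree.
edges = [("e", "a"), ("e", "b"), ("f", "c"), ("f", "d"), ("g", "e"), ("g", "f")]

likelihood = 0.25           # Pr(g:G)
for parent, child in edges:
    likelihood *= branch_prob(bases[parent], bases[child])

print(likelihood)
```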



Computing likelihood
• Define table V, where entry V(i,x) is the likelihood of the sub-tree
rooted at i when the parent of i takes character x
• Examples:
– Likelihood =
    Pr(g:A) * V(e,A) * V(f,A) +
    Pr(g:C) * V(e,C) * V(f,C) +
    Pr(g:G) * V(e,G) * V(f,G) +
    Pr(g:T) * V(e,T) * V(f,T)
– V(e, A) =
    Pr(e:A|g:A,teg) * V(a,A) * V(b,A) +
    Pr(e:C|g:A,teg) * V(a,C) * V(b,C) +
    Pr(e:G|g:A,teg) * V(a,G) * V(b,G) +
    Pr(e:T|g:A,teg) * V(a,T) * V(b,T)
– V(a,A) = Pr(a:G|e:A, tae)
– …

                g:?
          teg /     \ tfg
           e:?       f:?
      tae /  \ tbe  tcf /  \ tdf
       a:G  b:G    c:T   d:G
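Written generally, the examples above follow one recurrence. The notation below is an assumption made here for illustration: s_i is the observed character at leaf i, p(i) is the parent of i, t_i is the length of the branch from i to its parent, and l and r are the two children of an internal node.

```latex
\begin{align*}
V(i, x) &= \Pr(i{:}s_i \mid p(i){:}x,\ t_i)
  && \text{if } i \text{ is a leaf} \\
V(i, x) &= \sum_{y \in \{A,C,G,T\}} \Pr(i{:}y \mid p(i){:}x,\ t_i)\, V(l, y)\, V(r, y)
  && \text{if } i \text{ is an internal node} \\
\text{Likelihood} &= \sum_{x \in \{A,C,G,T\}} \Pr(\text{root}{:}x)\, V(l_{\text{root}}, x)\, V(r_{\text{root}}, x)
\end{align*}
```

Each V(i, x) depends only on the tables of the children of i, so the tables can be filled from the leaves upward.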



Exercise2: Likelihood
• Given a mutation rate of 0.1 and the divergence times
drawn on the branches of the tree, and assuming A, C,
G and T are equally likely for sequence g, use the
Jukes-Cantor model to find the likelihood of this
tree.

              g:?
          2 /     \ 1
         e:?       f:?
      1 /  \ 1  2 /  \ 2
     a:G  b:G  c:T   d:G



Answer: Likelihood
A Particular Case            Probability
1 mutation                   0.1
1 no change                  0.7
1 mutation + 1 no change     (0.1)(0.7)
2 mutations                  (0.1)(0.1)
2 no changes                 (0.7)(0.7)

X (parent takes)    A       C       G       T
V(a,X)              0.1     0.1     0.7     0.1
V(b,X)              0.1     0.1     0.7     0.1
V(c,X)              0.16    0.16    0.16    0.52
V(d,X)              0.16    0.16    0.52    0.16
V(e,X)              0.0868  0.0868  0.2596  0.0868
V(f,X)              0.0371  0.0371  0.0717  0.0717

V(c,A) = V(c,C) = V(c,G) = 2(0.1)(0.7) + 2(0.1)(0.1) = 0.16   (for A: AAT, ACT, AGT, ATT)
V(c,T) = (0.7)(0.7) + 3(0.1)(0.1) = 0.52                      (TAT, TCT, TGT, TTT)
V(d,A) = V(d,C) = V(d,T) = 2(0.1)(0.7) + 2(0.1)(0.1) = 0.16
V(d,G) = (0.7)(0.7) + 3(0.1)(0.1) = 0.52

V(e,A) = (0.52)(0.1)(0.1) + (0.16)(0.1)(0.1) + (0.16)(0.7)(0.7) + (0.16)(0.1)(0.1) = 0.0868
V(e,C) = (0.16)(0.1)(0.1) + (0.52)(0.1)(0.1) + (0.16)(0.7)(0.7) + (0.16)(0.1)(0.1) = 0.0868
V(e,G) = (0.16)(0.1)(0.1) + (0.16)(0.1)(0.1) + (0.52)(0.7)(0.7) + (0.16)(0.1)(0.1) = 0.2596
V(e,T) = (0.16)(0.1)(0.1) + (0.16)(0.1)(0.1) + (0.16)(0.7)(0.7) + (0.52)(0.1)(0.1) = 0.0868

V(f,A) = (0.7)(0.16)(0.16) + (0.1)(0.16)(0.16) + (0.1)(0.16)(0.52) + (0.1)(0.52)(0.16) = 0.0371
V(f,C) = (0.1)(0.16)(0.16) + (0.7)(0.16)(0.16) + (0.1)(0.16)(0.52) + (0.1)(0.52)(0.16) = 0.0371
V(f,G) = (0.1)(0.16)(0.16) + (0.1)(0.16)(0.16) + (0.7)(0.16)(0.52) + (0.1)(0.52)(0.16) = 0.0717
V(f,T) = (0.1)(0.16)(0.16) + (0.1)(0.16)(0.16) + (0.1)(0.16)(0.52) + (0.7)(0.52)(0.16) = 0.0717

Likelihood = (0.25)(0.0868)(0.0371) + (0.25)(0.0868)(0.0371)
           + (0.25)(0.2596)(0.0717) + (0.25)(0.0868)(0.0717) = 0.0078
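The table can be cross-checked with a small dynamic-programming script. This is a sketch under the simplification used in the answer, where each unit of time is one step with probability 0.1 per specific mutation and 0.7 for no change; the tuple-based tree encoding and all function names are made up for illustration.

```python
BASES = "ACGT"

def one_step(rate):
    """Transition matrix for one unit of time."""
    return [[1 - 3 * rate if i == j else rate for j in range(4)] for i in range(4)]

def mat_mul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition(rate, t):
    """Transition matrix after t whole units of time (t >= 1)."""
    step = one_step(rate)
    result = step
    for _ in range(t - 1):
        result = mat_mul(result, step)
    return result

def below(left, right, rate):
    """Likelihood of both child subtrees for each character at their parent."""
    vl, vr = v_table(left, rate), v_table(right, rate)
    return [vl[y] * vr[y] for y in range(4)]

def v_table(node, rate):
    """Return [V(node, x) for x in ACGT].

    A leaf is (base, t); an internal node is (left, right, t), where t is
    the length of the branch from the node to its parent.
    """
    p = transition(rate, node[-1])
    if len(node) == 2:                      # leaf with an observed base
        j = BASES.index(node[0])
        return [p[x][j] for x in range(4)]
    inner = below(node[0], node[1], rate)   # internal node: sum over its character
    return [sum(p[x][y] * inner[y] for y in range(4)) for x in range(4)]

def tree_likelihood(left, right, rate, prior=0.25):
    """Sum over the root character, with a uniform prior."""
    return sum(prior * value for value in below(left, right, rate))

e = (("G", 1), ("G", 1), 2)   # children a:G and b:G, branch of length 2 to g
f = (("T", 2), ("G", 2), 1)   # children c:T and d:G, branch of length 1 to g
print([round(v, 4) for v in v_table(e, 0.1)])     # [0.0868, 0.0868, 0.2596, 0.0868]
print(round(tree_likelihood(e, f, rate=0.1), 4))  # 0.0078
```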



UPGMA Algorithm
1. Obtain / compute the distance matrix of all pairs
of sequences, for example, alignment length
minus best alignment score
2. Treat each sequence as a set (cluster)
3. Pick the two closest sets, group them, and
recalculate the distance between the new set and each
other set as the average of the distances of all pairs of
sequences involved
4. Repeat step 3 until only one set is left
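The four steps can be sketched as follows. This is a minimal illustration that returns only the topology and the cluster distance at each merge (no branch lengths); the `frozenset`-keyed distance dictionary and the function name are assumptions made for the example, and ties are broken by taking the first closest pair found.

```python
from itertools import combinations

def upgma(names, dist):
    """Return (newick_topology, merges).

    dist maps frozenset({x, y}) to the distance between sequences x and y.
    """
    clusters = {name: (name,) for name in names}   # label -> member sequences

    def avg(a, b):
        # Average distance over all pairs of sequences, one from each cluster.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: avg(*ab))
        merges.append((a, b, avg(a, b)))
        members = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = members
    (label,) = clusters
    return label + ";", merges

names = ["A", "B", "C"]
dist = {frozenset("AB"): 4, frozenset("AC"): 6, frozenset("BC"): 12}
newick, merges = upgma(names, dist)
print(newick)   # (C,(A,B));
print(merges)   # [('A', 'B', 4.0), ('C', '(A,B)', 9.0)]
```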



Example: UPGMA
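• Given the distance matrix below, for which the tree is
non-additive, construct the phylogenetic tree with
UPGMA.
• Note: we will specify in the question if we want the
branch lengths.

     A   B   C            AB   C
A    0   4   6       AB    0   9
B    4   0  12       C     9   0
C    6  12   0

[Figure: the tree is built in steps. A and B are joined first with branch lengths 2 and 2; the AB cluster is then joined with C, with branch lengths 3.5 and 3.5.]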



Exercise3: UPGMA
• Given the distance matrix below, for which the tree is
additive, construct the phylogenetic tree with the UPGMA
algorithm. Represent the tree in Newick format.

A B C D
A 0 11 4 11
B 11 0 13 4
C 4 13 0 13
D 11 4 13 0



Answer: UPGMA
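     A   B   C   D
A    0  11   4  11
B   11   0  13   4
C    4  13   0  13
D   11   4  13   0

     AC   B   D
AC    0  12  12
B    12   0   4
D    12   4   0

     AC  BD
AC    0  12
BD   12   0

[Figure: the UPGMA tree built in steps. A and C are joined, then B and D, each leaf on a branch of length 2; the AC and BD clusters are then joined by branches of length 4 and 4.]

Newick:
((A:1,C:3):4,(B:2,D:2):4);

[Figure: the tree with branch lengths A:1, C:3, B:2 and D:2.]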
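UPGMA – Ultrametric distances
1. d(x, y) ≥ 0
2. d(x, y) = 0 if x = y
3. d(x, y) = d(y, x)
4. d(x, y) + d(y, z) ≥ d(x, z)
5. d(x, y) ≤ max{d(x, z), d(y, z)}

(1) Satisfy all: all leaf nodes have equal distance from root (additive)
(2) Satisfy 1-4 only: branch lengths represent sequence distance (additive)
(3) Satisfy 1-3 only: branch lengths represent cluster distance only (non-additive)

(1)                  (2)                  (3)
d   A   B   C        d   A   B   C        d   A   B   C
A   0  10  20        A   0  13  19        A   0   4  13
B  10   0  20        B  13   0  22        B   4   0  19
C  20  20   0        C  19  22   0        C  13  19   0

[Figure: one tree per case. (1) A and B on branches of length 5, joined through an internal branch of length 5, with C on a branch of length 10. (2) A on a branch of length 5 and B on a branch of length 8, with 7 and 7 on the path to C. (3) A on a branch of length –1 (marked "–1!?") and B on a branch of length 5, with 7 and 7 on the path to C.]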
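The three cases can be told apart mechanically by testing the triangle inequality and the ultrametric condition on every triple. The sketch below assumes non-negativity, identity and symmetry already hold; the function names and the nested-dictionary matrix layout are chosen here for illustration.

```python
from itertools import permutations

def classify(names, d):
    """Report which of the listed conditions a symmetric distance matrix satisfies."""
    triples = list(permutations(names, 3))
    triangle = all(d[x][y] + d[y][z] >= d[x][z] for x, y, z in triples)
    ultrametric = all(d[x][y] <= max(d[x][z], d[y][z]) for x, y, z in triples)
    if ultrametric:
        return "satisfies all"
    return "satisfies 1-4 only" if triangle else "satisfies 1-3 only"

def three_point_matrix(ab, ac, bc):
    """Build the nested dict for a 3 x 3 matrix over A, B, C."""
    return {"A": {"A": 0, "B": ab, "C": ac},
            "B": {"A": ab, "B": 0, "C": bc},
            "C": {"A": ac, "B": bc, "C": 0}}

names = ["A", "B", "C"]
print(classify(names, three_point_matrix(10, 20, 20)))  # satisfies all
print(classify(names, three_point_matrix(13, 19, 22)))  # satisfies 1-4 only
print(classify(names, three_point_matrix(4, 13, 19)))   # satisfies 1-3 only
```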
How to assign branch lengths
• Additive tree:
– Find the branch lengths by solving a system of linear
equations
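For three leaves the system is small enough to solve by hand. As a sketch, if a, b and c are the lengths of the branches from A, B and C to the single internal node, the equations are a + b = d(A,B), a + c = d(A,C) and b + c = d(B,C); the function name is made up for the example.

```python
def three_leaf_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for (a, b, c)."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

# Matrix (2) from the ultrametric-distances slide: d(A,B)=13, d(A,C)=19, d(B,C)=22
print(three_leaf_branches(13, 19, 22))   # (5.0, 8.0, 14.0)
# Matrix (3): d(A,B)=4, d(A,C)=13, d(B,C)=19 breaks the triangle inequality
print(three_leaf_branches(4, 13, 19))    # (-1.0, 5.0, 14.0)
```

The negative length in the second call is the "–1!?" case: the equations still have a solution, but it is not a valid set of branch lengths.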



Introduction to Neighbour Joining
• Problems with UPGMA
– Hard to assign branch length
– Chaining effect
– Not always unique

     A   B   C
A    0   4   6
B    4   0   4
C    6   4   0

[Figure: two alternative trees for this matrix, one with A and B joined first OR one with B and C joined first.]

• Idea of Neighbour Joining
– Group some species that are relatively close to each other and
distant from the other species
Image credit: ENGG5103/CSCI5180 Lecture Notes



Neighbor Joining
• Initialize with step 0, then repeat steps 1~3
until all branch lengths are assigned and the
whole tree topology is determined
0. Initially, every node is connected to a “hub”.
1. Find the two sets connected to the hub with minimum Q, say
sets C_i and C_j
2. Insert an internal node C_k between C_i, C_j and the hub
3. Compute the distances with the following equations

d(C_i, C_k) = \frac{d(C_i, C_j)}{2} + \frac{u(C_i) - u(C_j)}{2(r - 2)}

d(C_j, C_k) = \frac{d(C_i, C_j)}{2} + \frac{u(C_j) - u(C_i)}{2(r - 2)}

(r is the number of clusters before the merge)

Q(i, j) = (r - 2)\, d(C_i, C_j) - u(C_i) - u(C_j), \qquad u(C_x) = \sum_{y} d(C_x, C_y)


Example: Neighbor Joining
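d    A   B   C   D    u        Q    A    B    C    D
A    0  14  37   7   58        A       -94  -94 -112
B   14   0  31  19   64        B           -112  -94
C   37  31   0  42  110        C                 -94
D    7  19  42   0   68        D

Join A and D:
d(C_A, C_{AD}) = \frac{7}{2} + \frac{58 - 68}{2(4 - 2)} = 1
d(C_D, C_{AD}) = \frac{7}{2} + \frac{68 - 58}{2(4 - 2)} = 6

d    AD   B   C    u        Q    AD    B    C
AD    0  13  36   49        AD        -80  -80
B    13   0  31   44        B              -80
C    36  31   0   67        C

Join AD and B:
d(C_{AD}, C_{ABD}) = \frac{13}{2} + \frac{49 - 44}{2(3 - 2)} = 9
d(C_B, C_{ABD}) = \frac{13}{2} + \frac{44 - 49}{2(3 - 2)} = 4

d    ABD   C
ABD    0  27
C     27   0

[Figure: the tree after each step. Final tree: A and D attach to AD with branch lengths 1 and 6; AD and B attach to ABD with branch lengths 9 and 4; C attaches to ABD with branch length 27.]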



Example: Neighbor Joining
• Repeat the previous example, but in the first step join
B and C instead of A and D.

d    A   B   C   D    u        Q    A    B    C    D
A    0  14  37   7   58        A       -94  -94 -112
B   14   0  31  19   64        B           -112  -94
C   37  31   0  42  110        C                 -94
D    7  19  42   0   68        D

• Note: you will get the same tree.

• Think about: can we represent the tree in Newick
format?
Exercise4: Neighbor Joining
• Use the Neighbor Joining algorithm to construct the
phylogenetic tree for the following distance
matrix.
d A B C D E
A 0 4 5 11 16
B 4 0 7 13 18
C 5 7 0 8 13
D 11 13 8 0 11
E 16 18 13 11 0



Answer: Neighbor Joining
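d    A   B   C   D   E    u        Q    A    B    C    D    E
A    0   4   5  11  16   36        A       -66  -54  -46  -46
B    4   0   7  13  18   42        B            -54  -46  -46
C    5   7   0   8  13   33        C                 -52  -52
D   11  13   8   0  11   43        D                      -68
E   16  18  13  11   0   58        E

Join D and E:
d(C_D, C_{DE}) = \frac{11}{2} + \frac{43 - 58}{2(5 - 2)} = 3
d(C_E, C_{DE}) = \frac{11}{2} + \frac{58 - 43}{2(5 - 2)} = 8

[Figure: A, B, C, D and E attached to the hub, then D and E attached to the new node DE with branch lengths 3 and 8.]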



Answer: Neighbor Joining
d    A   B   C   D   E    u        d    A   B   C   DE
A    0   4   5  11  16   36        A    0   4   5    8
B    4   0   7  13  18   42        B    4   0   7   10
C    5   7   0   8  13   33        C    5   7   0    5
D   11  13   8   0  11   43        DE   8  10   5    0
E   16  18  13  11   0   58

d(C_A, C_{DE}) = \frac{d(C_A, C_D) + d(C_A, C_E) - d(C_D, C_E)}{2} = \frac{11 + 16 - 11}{2} = 8

[Figure: A, B, C and DE attached to the hub; D and E attached to DE with branch lengths 3 and 8.]
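The same update gives the whole DE row of the reduced matrix. A small sketch, with a function name chosen here for illustration:

```python
def merged_distance(d_ix, d_jx, d_ij):
    """Distance from the new node joining i and j to another cluster x."""
    return (d_ix + d_jx - d_ij) / 2

# (d(D, x), d(E, x)) for x = A, B, C, with d(D, E) = 11
pairs = [(11, 16), (13, 18), (8, 13)]
row_de = [merged_distance(d_dx, d_ex, 11) for d_dx, d_ex in pairs]
print(row_de)   # [8.0, 10.0, 5.0]
```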



Answer: Neighbor Joining
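d    A   B   C   DE    u        Q    A    B    C    DE
A    0   4   5    8   17        A       -30  -24  -24
B    4   0   7   10   21        B            -24  -24
C    5   7   0    5   17        C                 -30
DE   8  10   5    0   23        DE

Here u(DE) = 8 + 10 + 5 = 23, so Q(A,B) and Q(C,DE) tie at -30. Join A and B:
d(C_A, C_{AB}) = \frac{4}{2} + \frac{17 - 21}{2(4 - 2)} = 1
d(C_B, C_{AB}) = \frac{4}{2} + \frac{21 - 17}{2(4 - 2)} = 3

[Figure: A and B attached to the new node AB with branch lengths 1 and 3; D and E attached to DE with branch lengths 3 and 8.]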



Answer: Neighbor Joining
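d    AB   C   DE    u        Q    AB    C    DE        d     ABC  DE
AB    0   4    7   11        AB        -16  -16        ABC     0   4
C     4   0    5    9        C              -16        DE      4   0
DE    7   5    0   12        DE

[Figure: the last two steps. Final tree: A and B attach to AB with branch lengths 1 and 3; AB and C attach to ABC with branch lengths 3 and 1; ABC and DE are joined by a branch of length 4; D and E attach to DE with branch lengths 3 and 8.]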



Check List
• Why are biologists interested in
phylogeny?
• What are the similarities and differences
among the four phylogenetic tree
reconstruction algorithms?
• For an additive tree, if all leaf nodes lie on
the same line, why is this feature interesting?
