Introduction - Arbres - Phylogénique

The document introduces basic concepts and vocabulary related to phylogenetic inference including phylogenetic trees, characters, and molecular sequences. It discusses how phylogenetic trees are inferred from character data by comparing character states among taxa and selecting trees that optimize criteria like parsimony or probability under an evolutionary model.

Uploaded by

Hajar Mahir

Introduction to phylogenetic inference

Properties and vocabulary

01
Phylogenetic tree: basic concepts
A useful tool for representing
evolutionary relationships between
objects arising from a common ancestor

02
Content

Phylogenetic tree: concepts and vocabulary

Phylogenetic data: concepts and vocabulary

From phylogenetic data to phylogenetic inference: general concepts

The NEWICK coding to store phylogenetic trees

03
A phylogenetic tree is a combinatorial object

A graph is an ordered pair G = (V, E) where
● V is a set of vertices (nodes)
● E is a set of edges (arcs), each joining two vertices from V

The degree (valency) of a vertex is the number of edges that connect to it.

A path between two vertices is a sequence of edges that connects them.

A connected component is a subgraph in which any two vertices are connected to each other by a path, and which is connected to no additional vertex.

A cycle is a subgraph in which there exist at least two paths between each pair of vertices.
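The definitions above can be illustrated with a minimal Python sketch. The vertex names and the helper functions `build_adjacency`, `degree` and `connected_components` are hypothetical and only serve the illustration; the graph is assumed to be simple and undirected.

```python
def build_adjacency(vertices, edges):
    """Store G = (V, E) as a dict mapping each vertex to its set of neighbours."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree(adj, v):
    """Degree (valency) of v: the number of edges that connect to it."""
    return len(adj[v])


def connected_components(adj):
    """Return the list of connected components, each as a set of vertices."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical example: two components, {a, b, x} and {c, d}.
V = ["a", "b", "c", "d", "x"]
E = [("a", "x"), ("b", "x"), ("c", "d")]
adj = build_adjacency(V, E)
components = connected_components(adj)
```

Here `degree(adj, "x")` is 2 and `components` contains two sets, so this graph is not a tree in the sense of the next slide.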

04
A phylogenetic tree is a combinatorial object

A tree is a special graph satisfying the following properties:
● only one connected component
● no cycle

An X-tree is a special tree satisfying the following properties:
● no vertex of degree 2
● every vertex of degree 1 is distinctly labelled

[Figure: an X-tree with leaves labelled a, b, c, d, e, f]

05
A phylogenetic tree is a combinatorial object
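A binary X-tree is an X-tree with only vertices of degree 1 or 3.

[Figure: a binary X-tree with leaves a, b, c, d, e]

A rooted X-tree is an X-tree with one additional vertex of degree 2, called the root.

[Figure: a rooted X-tree with leaves a, b, c, d, e, f and root r]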
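As a rough sketch, the tree, X-tree and binary X-tree properties can be checked on an edge list. The function names are hypothetical, the internal vertex names u, v, w are assumptions, and the edge list is assumed to contain no duplicate edges; the five-taxon example reuses the taxa A to E from the NEWICK slides.

```python
def adjacency(edges):
    """Build a vertex -> set of neighbours mapping from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_tree(edges):
    """One connected component and no cycle (connected with |V| - 1 edges)."""
    adj = adjacency(edges)
    if not adj:
        return False
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj) and len(edges) == len(adj) - 1


def is_x_tree(edges, taxa):
    """Tree with no vertex of degree 2 whose degree-1 vertices are all labelled taxa."""
    adj = adjacency(edges)
    return (is_tree(edges)
            and all(len(nb) != 2 for nb in adj.values())
            and all(v in taxa for v, nb in adj.items() if len(nb) == 1))


def is_binary_x_tree(edges, taxa):
    """X-tree with only vertices of degree 1 or 3."""
    adj = adjacency(edges)
    return is_x_tree(edges, taxa) and all(len(nb) in (1, 3) for nb in adj.values())


taxa = {"A", "B", "C", "D", "E"}
unrooted = [("A", "u"), ("B", "u"), ("u", "w"),
            ("C", "v"), ("D", "v"), ("v", "w"), ("E", "w")]
# Rooting on the branch u-w adds a vertex r of degree 2.
rooted = [("A", "u"), ("B", "u"), ("u", "r"), ("r", "w"),
          ("C", "v"), ("D", "v"), ("v", "w"), ("E", "w")]
```

With these definitions, `unrooted` passes all three checks, while `rooted` is still a tree but fails `is_x_tree` because of the degree-2 root, which is why rooted X-trees are defined separately.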

06
A phylogenetic tree is a combinatorial object
A phylogenetic tree is an X-tree:
● labelled by gene, population or species names,
● sometimes rooted,
● often binary,
● often with branch lengths.

[Figure: example phylogenetic trees on taxa a to f, unrooted and rooted at r]

07
Phylogenetic tree: current terms

[Figure: a rooted tree on taxa a, b, c, d, f annotated with the following terms]
● external node, also called leaf, operational taxonomic unit (OTU) or taxon
● internal node
● external branch
● internal branch
● root

08
Phylogenetic inference: general concepts

09
Inferring phylogenetic trees from characters
Key hypothesis:
each considered character is evolving by descent with modification (i.e. homology)
Key idea:
evolutionary history is reconstructed by comparing observed character states from contemporary taxa

Example of a character that is comparable across taxa, with each character state drawn from the alphabet of the corresponding character:
a: 0
b: 0
f: 0
e: 0
c: 1
d: 1
Putative ancestral character state (root r): 1
Putative evolutionary event: 1>0

10
Inferring phylogenetic trees from characters
A set of characters (e.g. number, nucleotide, binary, codon, amino acid) for each taxon. Each column is a character that is comparable across taxa; each entry is a character state drawn from the alphabet of the corresponding character.

taxon  characters
a:     2  G  0  T  ATG  M
b:     2  A  0  T  TTG  L
f:     2  A  0  T  CTG  L
e:     4  G  0  T  ATA  I
c:     6  C  1  F  AAA  K
d:     6  T  1  F  AGA  R

11
Inferring phylogenetic trees from molecular
characters
Key hypothesis:
each character has evolved by descent with modification; therefore, homologous sequences are quite
similar with some local differences (i.e. primary homology)
Key idea:
looking for putative homologous sequences by similarity searches (e.g. BLAST); when performing a multiple
sequence alignment, all possible homologous characters are estimated by comparing all character states
and optimizing some sequence similarity criteria

12
Phylogenetic data: current terms
ACGTGCATGCATGACGTGATATGCGTGACGTGAACGTGTAACGTG
ACGTGCATGCATCACGAGATATGCGTGAGGTGATCGTGTAACGTG
ACGTGCATGCTTGACGTGATATGCGTGACGTGAACCTGACCCGTG
ACGACCATGCTTGACGTGATATGCGTGACGTGAACGTGACCCGTG
ACGTGCATGCTTTACCTGATTTGRGTCACGTGA---TGATCCGTG

● a (molecular) sequence: one row of the alignment
● an aligned character or site: one column of the alignment
● a (known) character state: e.g. A, C, G or T
● a degenerate character state: e.g. R in the last sequence
● a gap: e.g. --- in the last sequence

13
Phylogenetic reconstruction in practice
Key approach:
1. associating an optimality criterion to a given tree topology
2. selecting the tree(s) that optimize(s) this criterion

ACGTGCATGCATGACGTGATATGCGTGACGTGAACGTGTAACGTG
ACGTGCATGCATCACGAGATATGCGTGAGGTGATCGTGTAACGTG
ACGTGCATGCTTGACGTGATATGCGTGACGTGAACCTGACCCGTG
ACGACCATGCTTGACGTGATATGCGTGACGTGAACGTGACCCGTG
ACGTGCATGCTTTACCTGATTTGRGTCACGTGA---TGATCCGTG

[Figure: a phylogenetic dataset (the alignment) and a phylogenetic tree inferred from it]

14
Phylogenetic reconstruction in practice
Three main classes of criteria:
1. distance-based tree reconstruction
● estimating the genetic distance between each pair of aligned sequences
● searching for the tree that best represents the estimated distances
2. maximum parsimony
● searching for the tree(s) that minimize(s) the number of substitutions needed to represent the evolutionary history of the aligned characters
3. probabilistic criteria
● given an evolutionary model, searching for the tree that maximizes the probability of observing the aligned characters given this tree
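The first step of the distance-based class can be sketched as follows. This is only an illustration using the simplest possible estimate, the uncorrected p-distance (proportion of differing sites); the slides do not say which distance is intended. The taxon labels s1 to s5 are hypothetical, and skipping every site with a gap or a degenerate state is an assumption of this sketch.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of differing sites among sites where both states are A, C, G or T."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    compared = differing = 0
    for x, y in zip(seq1, seq2):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and degenerate character states
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no comparable site")
    return differing / compared


# The alignment shown in the slides, with hypothetical labels.
alignment = {
    "s1": "ACGTGCATGCATGACGTGATATGCGTGACGTGAACGTGTAACGTG",
    "s2": "ACGTGCATGCATCACGAGATATGCGTGAGGTGATCGTGTAACGTG",
    "s3": "ACGTGCATGCTTGACGTGATATGCGTGACGTGAACCTGACCCGTG",
    "s4": "ACGACCATGCTTGACGTGATATGCGTGACGTGAACGTGACCCGTG",
    "s5": "ACGTGCATGCTTTACCTGATTTGRGTCACGTGA---TGATCCGTG",
}

# One estimated distance per pair of aligned sequences.
distances = {
    (a, b): p_distance(alignment[a], alignment[b])
    for a, b in combinations(alignment, 2)
}
```

A distance-based method would then search for the tree whose path lengths best fit the values in `distances`.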

15
Phylogenetic reconstruction in practice
In practice, a ‘good’ criterion should be:
● easy to compare and interpret, e.g. a numerical value to minimize or maximize
● fast to compute from the data for a given tree topology, e.g. sum of branch lengths (distance), number of substitution steps (parsimony)...
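As an example of a criterion that is fast to compute for a given topology, the number of substitution steps for one character can be counted with Fitch's small-parsimony algorithm. The sketch below is only an illustration: the function name is hypothetical, trees are assumed to be rooted binary trees written as nested tuples, and the two topologies are made up because the tree drawn in the slides cannot be recovered. The character states are those of the binary character shown earlier (a, b, f, e: 0; c, d: 1).

```python
def fitch_steps(tree, states):
    """Return (candidate ancestral states, number of steps) for one character.

    `tree` is a leaf name or a pair (left, right); `states` maps each
    leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra substitution.
        return common, left_steps + right_steps
    # Disjoint state sets: one substitution is needed on this node's branches.
    return left_set | right_set, left_steps + right_steps + 1


states = {"a": 0, "b": 0, "f": 0, "e": 0, "c": 1, "d": 1}

# Two hypothetical topologies on the same six taxa.
tree_1 = ((("a", "b"), ("f", "e")), ("c", "d"))
tree_2 = ((("a", "c"), ("b", "d")), ("f", "e"))

_, steps_1 = fitch_steps(tree_1, states)  # c and d grouped together
_, steps_2 = fitch_steps(tree_2, states)  # c and d separated
```

Under the parsimony criterion, `tree_1` (1 step) is preferred to `tree_2` (2 steps) for this character; a full analysis would sum the steps over all aligned characters.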

16
Phylogenetic reconstruction in practice
A naive approach could therefore be conducted as follows:
● generating every possible tree topology
● estimating the optimality criterion for each generated tree topology
● selecting the one(s) that optimize(s) the considered criterion
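The naive approach can be sketched with a stepwise-insertion enumeration. This is one possible way to generate every topology, not necessarily the one the author has in mind: an unrooted binary tree on n taxa is represented here as a rooted binary tree on the last n−1 taxa hung on the first taxon, and trees are nested tuples. All function names, and the `score` callable, are hypothetical.

```python
def insert_leaf(tree, leaf):
    """Yield every rooted binary tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)  # attach on the branch above the current (sub)tree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def rooted_topologies(taxa):
    """All rooted binary topologies on `taxa`, built by adding one leaf at a time."""
    trees = [taxa[0]]
    for leaf in taxa[1:]:
        trees = [new for tree in trees for new in insert_leaf(tree, leaf)]
    return trees


def unrooted_topologies(taxa):
    """All unrooted binary topologies, each written as (first taxon, rooted subtree)."""
    return [(taxa[0], subtree) for subtree in rooted_topologies(taxa[1:])]


def naive_search(taxa, score):
    """Evaluate `score` on every topology and return one that minimizes it."""
    return min(unrooted_topologies(taxa), key=score)


four_taxa = unrooted_topologies(["A", "B", "C", "D"])
five_taxa = unrooted_topologies(["A", "B", "C", "D", "E"])
```

For four taxa this produces 3 topologies and for five taxa 15, so the list grows very quickly with the number of taxa.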

17
How many different phylogenetic trees?
Number of distinct binary X-trees on n taxa:
t(n) := (2n−5)!! := (2n−5) × (2n−7) × (2n−9) × ... × 7 × 5 × 3

n     t(n)
4     3
5     15
6     105
7     945
8     10,395
9     135,135
10    2,027,025
11    34,459,425
12    654,729,075
13    13,749,310,575
…     …

[Figure: for n = 4, the t(n) = 3 distinct binary X-trees on taxa A, B, C, D]
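The values of t(n) can be reproduced with a few lines of Python. The function name is hypothetical; the product simply follows the double-factorial formula given on this slide.

```python
def count_binary_x_trees(n):
    """t(n) = (2n-5)!! = 3 x 5 x ... x (2n-5), for n >= 3 taxa."""
    if n < 3:
        raise ValueError("the formula is used here for n >= 3 taxa")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n-5
        count *= factor
    return count


# Values for n = 4 to 13, in the same order as the table.
table = {n: count_binary_x_trees(n) for n in range(4, 14)}
```

Each additional taxon multiplies the count by 2n−5, which is why the number of topologies grows faster than exponentially.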

18
Phylogenetic reconstruction in practice
The naive approach is impossible in practice: there are too many different tree topologies.

For example, if 1 ms is required to estimate the criterion value for a given tree topology on n = 13 taxa, then ~13,749,310 s (~5 months) are required to evaluate all 13,749,310,575 topologies.
Parallelization only postpones the difficulty to larger values of n.
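The arithmetic behind this estimate can be checked with a short sketch. The function names, the default of 1 ms per topology and the choice of 1,000 workers for the parallel case are assumptions made for illustration.

```python
import math


def n_topologies(n):
    """(2n-5)!! = 3 x 5 x ... x (2n-5), the number of binary X-trees on n taxa."""
    return math.prod(range(3, 2 * n - 4, 2))


def exhaustive_search_days(n, ms_per_tree=1.0, workers=1):
    """Days needed to score every topology, assuming perfect parallel scaling."""
    seconds = n_topologies(n) * ms_per_tree / 1000 / workers
    return seconds / 86_400


days_13 = exhaustive_search_days(13)                          # single worker, n = 13
days_20_parallel = exhaustive_search_days(20, workers=1000)   # 1,000 workers, n = 20
```

`days_13` is about 159 days, i.e. roughly 5 months, while even with 1,000 workers the n = 20 case takes far longer than the single-worker n = 13 case.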

19
Phylogenetic reconstruction in practice
In practice, a ‘good’ criterion should also be:
● easy to use within algorithmic tree-search strategies

20
(Criscuolo & Gribaldo 2011)
A note on the root

21
A note on the root
For practical reasons, most phylogenetic tree reconstruction strategies lead to unrooted trees…

[Figure: an unrooted tree on taxa A, B, C, D, E]

22
A note on the root
A binary X-tree on n taxa contains 2n−3 branches; there are therefore 2n−3 possible rootings.
Unfortunately, very few optimality criteria exist to assess the most likely root of a phylogenetic tree (e.g. probabilistic models, which require substantial computing resources).
However, there exist some tricks...

[Figure: an unrooted tree on taxa A, B, C, D, E; with n = 5, there are 2n−3 = 7 branches]

23
Midpoint rooting
If every sequence has evolved at approximately the same substitution rate, a simple approach is to take the midpoint of the associated phylogenetic tree as the putative root.
The midpoint is defined as the middle of the longest path in the tree, i.e. the middle of the unique path joining the two most distantly related taxa x and y.
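A minimal sketch of midpoint rooting on a weighted edge list is given below. It is a brute-force illustration (all leaf-to-leaf paths are examined), not the way any particular program implements it. The example reuses the five-taxon tree and branch lengths from the NEWICK slides; the internal vertex names u, v, w and the function names are hypothetical.

```python
def paths_from(adj, start):
    """Return {vertex: (distance, path)} for every vertex reachable from `start`."""
    result = {start: (0.0, [start])}
    stack = [start]
    while stack:
        u = stack.pop()
        dist, path = result[u]
        for v, length in adj[u].items():
            if v not in result:
                result[v] = (dist + length, path + [v])
                stack.append(v)
    return result


def midpoint(edges):
    """Locate the middle of the longest leaf-to-leaf path.

    Returns (u, v, offset, total): the midpoint lies on branch u-v,
    at distance `offset` from u; `total` is the length of the longest path.
    """
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    leaves = {v for v in adj if len(adj[v]) == 1}
    best_total, best_path = -1.0, None
    for x in sorted(leaves):
        for y, (dist, path) in paths_from(adj, x).items():
            # x < y: each pair of taxa is considered once
            if y in leaves and x < y and dist > best_total:
                best_total, best_path = dist, path
    half, walked = best_total / 2, 0.0
    for u, v in zip(best_path, best_path[1:]):
        if walked + adj[u][v] >= half:
            return u, v, half - walked, best_total
        walked += adj[u][v]


edges = [("A", "u", 0.0068), ("B", "u", 0.0820), ("u", "w", 0.0686),
         ("C", "v", 0.0218), ("D", "v", 0.0448), ("v", "w", 0.0202),
         ("E", "w", 0.1012)]
branch_start, branch_end, offset, longest = midpoint(edges)
```

On this example the two most distant taxa are B and E (path length 0.2518), so the midpoint falls on the internal branch u–w, about 0.0439 away from u.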

24
Midpoint rooting
Of note, NJplot (tree editor within SeaView) automatically performs midpoint rooting when displaying a phylogenetic tree.

However, this approach often leads to erroneous rooting when the substitution rate is far from constant across lineages.

[Figure: a tree on which the midpoint root differs from the real root]

25
Outgroup rooting
Outgroup rooting is a simple, more general and more robust approach that consists in adding several homologous but distantly related sequences to the considered dataset. The root is then placed on the branch that joins the outgroup to the ingroup.
Be careful: when too distant, the outgroup can sometimes cause artefactual relationships (e.g. long branch attraction).

[Figure: two trees; in both, the outgroup C. albicans is clearly distinct from the ingroup (Saccharomyces taxa) (from Criscuolo 2011)]

26
Storing phylogenetic trees

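[Figure: an unrooted tree on taxa A, B, C, D, E with branch lengths A: 0.0068, B: 0.0820, C: 0.0218, D: 0.0448, E: 0.1012, internal branch leading to (A, B): 0.0686, internal branch leading to (C, D): 0.0202]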

27
The NEWICK format for storing a tree

[Figure: the tree annotated with the NEWICK substring of each group of leaves]

(A:0.0068,B:0.0820)
(C:0.0218,D:0.0448)
E:0.1012

28
The NEWICK format for storing a tree

[Figure: the tree annotated with the NEWICK substring of each subtree, now followed by the length of its internal branch]

(A:0.0068,B:0.0820):0.0686
(C:0.0218,D:0.0448):0.0202
E:0.1012

29
The NEWICK format for storing a tree

((A:0.0068,B:0.0820):0.0686,(C:0.0218,D:0.0448):0.0202,E:0.1012);
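The construction shown over the last slides can be sketched as a small recursive writer. The data layout is an assumption of this sketch: each node is a pair (label or list of children, branch length), with `None` when no branch length is stored; the function name is hypothetical.

```python
def to_newick(node, is_root=True):
    """Serialize a node given as (leaf name or list of child nodes, branch length)."""
    content, length = node
    if isinstance(content, str):
        text = content  # a leaf: just its name
    else:
        # an internal node: its children, comma-separated, inside parentheses
        text = "(" + ",".join(to_newick(child, False) for child in content) + ")"
    if length is not None:
        text += f":{length:.4f}"
    return text + (";" if is_root else "")


# The five-taxon tree of the slides.
tree = ([
    ([("A", 0.0068), ("B", 0.0820)], 0.0686),
    ([("C", 0.0218), ("D", 0.0448)], 0.0202),
    ("E", 0.1012),
], None)

newick = to_newick(tree)

# The same topology without branch lengths.
topology_only = to_newick(([([("A", None), ("B", None)], None),
                            ([("C", None), ("D", None)], None),
                            ("E", None)], None))
```

Here `newick` reproduces the string shown on the slide, and `topology_only` gives `((A,B),(C,D),E);`.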

30
The NEWICK format for storing a tree
((A:0.0068,B:0.0820):0.0686,(C:0.0218,D:0.0448):0.0202,E:0.1012);

[Figure: the NEWICK string above drawn as a diagram with leaves A, B, C, D, E and branch lengths 0.0068, 0.0820, 0.0218, 0.0448, 0.1012, 0.0202, 0.0686, next to the corresponding unrooted tree]

31
The NEWICK format for storing a tree
The NEWICK format is a simple way to store phylogenetic trees:
● with branch lengths

((A:0.0068,B:0.0820):0.0686,(C:0.0218,D:0.0448):0.0202,E:0.1012);


32
The NEWICK format for storing a tree
The NEWICK format is a simple way to store phylogenetic trees:
● without branch length (only topology)

((A,B),(C,D),E);


33
The NEWICK format for storing a tree
The NEWICK format is a simple way to store phylogenetic trees:
● with internal node supports

((A,B)90,(C,D)76,E);

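[Figure: the tree on leaves A, B, C, D, E with support 90% on the node grouping A and B, and 76% on the node grouping C and D]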

34
The NEWICK format for storing a tree
The NEWICK format is a simple way to store phylogenetic trees:
● with branch lengths and internal node supports

((A:0.0068,B:0.0820)90:0.0686,(C:0.0218,D:0.0448)76:0.0202,E:0.1012);
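Reading such a string back can be sketched with a small recursive-descent parser. This is a simplified illustration that covers only the variants shown in these slides (names, branch lengths, internal node labels used as supports); quoted labels, comments and whitespace inside the string are not handled. The function names and the dictionary layout are hypothetical.

```python
def parse_newick(text):
    """Parse a NEWICK string into nested dicts with keys label, length, children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a NEWICK string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # Optional label: a taxon name for a leaf, a support for an internal node.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        label = text[start:pos]
        # Optional branch length after ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"label": label, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """List the leaf labels from left to right."""
    if not node["children"]:
        return [node["label"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick(
    "((A:0.0068,B:0.0820)90:0.0686,(C:0.0218,D:0.0448)76:0.0202,E:0.1012);"
)
```

In the parsed tree, the first child of the root carries the label `"90"` (its support) and the branch length 0.0686, and `leaf_names(tree)` lists the five taxa in order.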


35
Conclusion

36
