
Intro To Phylogenetics

Uploaded by

mcinerneyjames
Copyright © All Rights Reserved
An Introduction To Phylogeny

Conor Meehan (he/they)


@con_meehan
Learning outcomes
Identify the various sections of a phylogenetic tree

Define various phylogenetic terminology

Describe the process of parsimony tree building

Translate a distance matrix into a phylogeny

Discuss the drawbacks of non- and semi-parametric phylogenetic methods

Describe the process of tree space searching

Recognise the maximum likelihood and bootstrap approaches


What is a phylogeny?
The evolutionary relationship between lineages

Lineages can be:


Genes
Individuals
Populations
Species

Phylogenetics is the study of these relationships


Derived from the Greek meaning “tribe origin”
Often depicted as a tree with sampled taxa at the tips
Tree Terminology
Internal nodes are also called ancestors or parents
The node labelled (internal) node is the most recent common ancestor (MRCA) of A, B and C
Figure labels: Root, (internal) node, Internal branch, Peripheral branch, Taxon (plural: taxa) or Terminal node
Writing trees on a computer
Trees are represented in a text file using parentheses and commas (Newick format)
Every internal node is written as a pair of parentheses ( ) around the nodes that descend from it, with commas separating those descendants
This tree is represented as:
(((A,B),C),(D,(E,F)));
If we know the lengths of the branches, they are written after a colon following each label or )
(((A:0.5,B:0.1):0.2,C:0.5):0.3,(D:0.1,(E:0.2,F:0.2)));
A complete Newick string ends with a semicolon
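To make the format concrete, here is a minimal sketch of a Newick reader in Python. It is an illustration only, not a full implementation of the format (no quoted labels or comments), and the function names `parse_newick` and `leaves` are made up for this example. Each node is returned as a `(name, branch_length, children)` tuple.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # step over "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # step over ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # step over ")"
        # Optional label (taxon name for a leaf, usually empty for internal nodes)
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        # Optional branch length after a colon
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return root


def leaves(node):
    """List the taxa at the tips, left to right."""
    name, _, children = node
    if not children:
        return [name]
    return [tip for child in children for tip in leaves(child)]


tree = parse_newick("(((A:0.5,B:0.1):0.2,C:0.5):0.3,(D:0.1,(E:0.2,F:0.2)));")
print(leaves(tree))
```

For real analyses an established library is a safer choice than a hand-written parser; the sketch is only meant to show how parentheses map to internal nodes and commas to siblings.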
Phylogenetic tree representation
Cladogram: Branch lengths have no meaning (sometimes have straight or curved lines)

Phylogram: Branch lengths have meaning (e.g. estimated/expected/observed amount of change or time)

Unrooted: No common ancestor is known or time is not known


Phylogenetic relationships terminology

Monophyletic
A group of taxa that contains an ancestor and all its descendants
Monophyletic group is also referred to as a clade

Polyphyletic
A group of taxa brought together by convergent evolution

Paraphyletic
A monophyletic group where a subgroup has been removed
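As a rough illustration of the monophyly definition, the sketch below checks whether a set of taxa forms a clade in a rooted tree. The tree is written as nested Python tuples with taxon names as strings; this representation and the helper names are assumptions made for the example.

```python
def collect_clades(node, found):
    """Record the set of tip names below every node; return the set for this node."""
    if isinstance(node, str):
        tips = frozenset([node])
    else:
        tips = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(tips)
    return tips


def is_monophyletic(tree, group):
    """True if some node has exactly `group` as its complete set of descendants."""
    found = set()
    collect_clades(tree, found)
    return frozenset(group) in found


# The tree (((A,B),C),(D,(E,F))) as nested tuples
tree = ((("A", "B"), "C"), ("D", ("E", "F")))

print(is_monophyletic(tree, {"A", "B", "C"}))  # a clade
print(is_monophyletic(tree, {"A", "C"}))       # the clade A, B, C with B removed
```

Under the definitions above, {A, C} would be paraphyletic: it is the monophyletic group {A, B, C} with the subgroup B taken out. The check itself only distinguishes monophyletic from not monophyletic.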
Phylogenetic relationships terminology

(Figure credit: Wikipedia)
Phylogenetic terminology

Monophyletic
M. avium

Polyphyletic
M. simiae
(M. kubicae separate)

Paraphyletic
M. simiae without M. kubicae
(M. avium shares the same ancestor)

Fedrizzi, Meehan et al, Sci Rep 2017
Phylogenetic nodes terminology
Bifurcating
A node that connects only 2 branches

Multifurcating (polytomy)
A node in a tree which connects more than two branches

In-group/out-group
In-group is the set of taxa of interest. Assumed to be monophyletic
Out-group is a related set of taxa to the in-group, used for rooting
Phylogenetic nodes terminology
Multifurcating node

In-group

Bifurcating node

Out-group
(if we only used
it to place root)
In-group vs out-group example
Research into L8 and its relationship to L1-L7
In-group
M. canettii used as outgroup
Allows us to see placement of L8 samples

Ngabonziza et al, Nat Comms 2020
Tasks 1
Do sampled taxa sit at the end of internal or external branches of a phylogenetic
tree?

What does monophyletic mean?

How many offspring does a bifurcating node have?

How do you write this tree in newick format?


I.e. using the ( X, Y ) format
How do we create a phylogeny?
Use an optimality criterion to define how we measure the fit of the data to a
solution

Tree-search method: how do we decide between possible solutions?

Simplest is parsimony
Referred to as non-parametric
The tree that represents the minimum number of character changes between taxa is the
optimal solution
A morphological example
A morphological parsimony example
Parsimony looks for the fewest changes that explain the relationships

https://www.khanacademy.org/science/biology/her/tree-of-life/a/building-an-evolutionary-tree
The primary problem with parsimony
Convergent evolution: the evolution of the same trait multiple times
E.g. wings in bats and birds

https://www.khanacademy.org/science/biology/her/tree-of-life/a/building-an-evolutionary-tree
Use of molecular data in phylogenetics
Phylogeny is most often undertaken using molecular data
Nucleotides/DNA sequence (sometimes RNA sequences)
Amino acids/Protein sequence
Less sensitive to convergent evolution
Parsimony also can be used on molecular data
Use G, T, C, A or amino acids as the characters
Fine on very closely related isolates (e.g. local transmission cluster separated by a
couple of SNPs)
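One common way to count the minimum number of character changes on a fixed tree is Fitch's algorithm. The lecture does not name a specific algorithm, so treat this as one possible sketch rather than the method used in the course. The five short sequences are the ones shown on this lecture's maximum likelihood slide; the two candidate trees are hypothetical and written as nested tuples.

```python
def fitch_site(node, states):
    """Return (possible ancestral states, number of changes) for one alignment column."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here
        return common, left_changes + right_changes
    # Children disagree: one change is needed on a branch below this node
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {
    "S1": "TTCAG",
    "S2": "TTCGG",
    "S3": "TGTTG",
    "S4": "TGATG",
    "S5": "AACTG",
}

# Two hypothetical rooted, bifurcating candidate trees
tree_a = ((("S1", "S2"), ("S3", "S4")), "S5")
tree_b = ((("S1", "S3"), ("S2", "S4")), "S5")

print(parsimony_score(tree_a, alignment), parsimony_score(tree_b, alignment))
```

Here `tree_a` needs 7 changes and `tree_b` needs 8, so under parsimony `tree_a` is the better of the two. A full analysis would score many more candidate trees.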
Early tree building used distance based methods
Semi-parametric
Get a distance between 2 sequences
Sequences with shortest distance are most related
Most often uses UPGMA or neighbour joining method
UPGMA (sequences)

S1: TTCAG
S2: TTCGG
S3: TTTTG
S4: TTATG
S5: AACTG
UPGMA (distance matrix)
UPGMA (filled distance matrix)

Note: count as differences, not as number of characters in common

Dissimilarity is differences/total sites
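The distance matrix for the five example sequences can be filled in with a few lines of Python. This is a sketch: the names `p_distance` and `distance_matrix` are chosen for this example, and the matrix is stored as a dictionary keyed by sequence pairs.

```python
from itertools import combinations

sequences = {
    "S1": "TTCAG",
    "S2": "TTCGG",
    "S3": "TTTTG",
    "S4": "TTATG",
    "S5": "AACTG",
}


def p_distance(a, b):
    """Dissimilarity: number of differing sites divided by total sites."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(seqs):
    """Distance for every unordered pair of sequences."""
    return {
        (first, second): p_distance(seqs[first], seqs[second])
        for first, second in combinations(seqs, 2)
    }


matrix = distance_matrix(sequences)
for pair, distance in matrix.items():
    print(pair, distance)
```

For these sequences, S1-S2 and S3-S4 come out at 0.2, every other pair among S1 to S4 at 0.4, and every pair involving S5 at 0.6.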
UPGMA tree building steps
1. Start with every sequence in its own cluster
2. Select the smallest distance between clusters
   a) Create a new cluster by joining these two
   b) Branch lengths are distance/2
3. Get the distance from that cluster to all others
   a) Use a proportional average distance (i.e. using size of cluster*distance)
4. Repeat steps 2 and 3 until complete

We will do this in the hands on session
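The steps above can be sketched in Python as follows. This is a minimal illustration using the five example sequences, not a production implementation. Two assumptions are worth flagging: ties are broken by taking the first smallest pair rather than a random one, and each new node is placed at height distance/2, so a branch leading to an existing cluster has length (new height minus that cluster's height).

```python
from itertools import combinations


def p_distance(a, b):
    """Differences divided by total sites."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Return a Newick string for the UPGMA tree of aligned sequences."""
    # Step 1: every sequence starts in its own cluster: id -> (size, height)
    clusters = {name: (1, 0.0) for name in seqs}
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }
    while len(clusters) > 1:
        # Step 2: select the smallest distance (first one found on ties)
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        d = dist.pop(pair)
        size_a, height_a = clusters.pop(a)
        size_b, height_b = clusters.pop(b)
        height = d / 2  # the new node sits at distance/2
        merged = f"({a}:{height - height_a:.2f},{b}:{height - height_b:.2f})"
        # Step 3: size-weighted average distance from the new cluster to all others
        for other in clusters:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (
                size_a * d_a + size_b * d_b
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
        # Step 4: the loop repeats steps 2 and 3 until one cluster remains
    (newick,) = clusters
    return newick + ";"


sequences = {
    "S1": "TTCAG",
    "S2": "TTCGG",
    "S3": "TTTTG",
    "S4": "TTATG",
    "S5": "AACTG",
}
print(upgma(sequences))
```

The cluster identifier doubles as its Newick text, which keeps the bookkeeping short at the cost of some elegance.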


UPGMA tree building steps
1. S1-S2 and S3-S4 are both smallest so randomly choose S1-S2
2. New distance to S3 is (0.4+0.4)/2 = 0.4
   Repeat for all sequences
3. Get smallest distance and repeat 1 and 2 to build entire tree
UPGMA tree building steps
What about unobserved back mutations over long periods of time?
A->G->A in a sequence
Underestimated distances

Are all substitutions equal?
Some more likely than others
Transitions vs transversions

Need to correct distances
Use a model of evolution
See optional separate lecture
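As an example of what correcting a distance looks like, the sketch below applies the Jukes-Cantor (JC69) correction. The slides do not say which model to use, so JC69 is an assumption here, picked because it is the simplest one. It accounts for unobserved repeat and back mutations but treats all substitutions as equally likely, so it does not address transitions vs transversions.

```python
import math


def jukes_cantor(p):
    """Convert an observed dissimilarity p into a JC69-corrected distance.

    d = -3/4 * ln(1 - 4p/3), defined for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


for observed in (0.2, 0.4, 0.6):
    print(observed, round(jukes_cantor(observed), 3))
```

The corrected distance is always at least as large as the observed one, and the gap grows as sequences become more different, which is the underestimation described above.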
Tasks 2
Is parsimony a non-parametric or semi-parametric method?

What is the main drawback of parsimony methods?

Is UPGMA a non-parametric or semi-parametric method?

Do you pick the samples with the smallest or largest distance at each step of the
distance approach?
Break
Parametric methods of phylogenetic inference
Previously seen methods take 2 sequences and calculate a distance and continue in
a pairwise manner through the whole set of sequences
Parametric methods take a column in an alignment and calculate the optimality
criterion per position in alignment
Positions are independent
There are two main parametric methods of phylogenetic inference:
Maximum likelihood
Bayesian analysis
Both methods:
Search tree space to find the best tree
Require an explicit model of evolution
Usually GTR
Can incorporate rate heterogeneity
Tree space searching
Imagine a blind person is dropped randomly in the world and told to find Mount Everest
(the global maximum)

They walk in a random direction until they find a section that is sloped upwards

They continue to walk upwards until every direction around them is a downwards slope

They conclude that since they are at the highest point, they must be on Everest

Thus, the highest position they stand at has the maximum likelihood of being Everest,
given the data and starting point
Tree space searching

[Figure: likelihood (y-axis) plotted across trees (x-axis)]
X Random starting tree
Search for better trees
Reach the maximum likelihood tree
(In theory, closest to true tree)
Tree space searching

[Figure: the same plot of likelihood across trees, with a different random starting tree]
X Random starting tree
Search for better trees
Stuck in local maximum
A local maximum tree is better than those that are close by (maybe only differ by one or two
branch positions) but is not the best overall tree.
Tree space searching

[Figure: likelihood across trees, marking the local maximum and the maximum likelihood tree (global maximum)]
Tree space searching

The problem is that if they only walk upwards they could get stuck in a local
maximum

Computer programs will implement different strategies to try and get around
this
Multiple starting points
Multiple searches at once; can switch between searching chains
Allow large and small rearrangements
Allow some steps backwards to try to improve the score
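The multiple starting points strategy can be illustrated on a toy example. In this sketch the "tree space" is just a hypothetical list of scores, where neighbouring positions stand in for similar trees. Real programs move between trees by rearranging branches, which is not modelled here.

```python
import random


def hill_climb(scores, start):
    """Walk uphill from `start` until no neighbour has a better score."""
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[position]:
            return position  # every direction is flat or downhill
        position = best


def multi_start(scores, starts):
    """Run one climb per starting point and keep the best end point."""
    end_points = [hill_climb(scores, start) for start in starts]
    return max(end_points, key=lambda i: scores[i])


# Hypothetical landscape: local maximum at index 2, global maximum at index 7
landscape = [1, 3, 5, 4, 2, 3, 6, 9, 7, 4]

print(hill_climb(landscape, 0))        # a single climb from the left edge
print(multi_start(landscape, [0, 9]))  # two starting points

# Random starting points, as a search program might choose them
rng = random.Random(42)
print(multi_start(landscape, [rng.randrange(len(landscape)) for _ in range(5)]))
```

A single climb starting at index 0 stops at the local maximum (index 2), while adding a second start at index 9 reaches the global maximum (index 7).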
Maximum Likelihood (ML)
A tree topology is proposed
A likelihood score is calculated for each position and combined (in practice, log-likelihoods are added up) to get an overall
likelihood score for the data for a given tree topology
Uses P(v) for the proposed branches (see models of evolution lecture)
Searches tree space and tries to find the tree that has the maximum likelihood of
generating the given data
Compare topologies through optimising variables for each to fit data
Example programs: RAxML, IQ-TREE, PAUP*, PhyML

S1: TTCAG
S2: TTCGG
S3: TGTTG
S4: TGATG
S5: AACTG
Per-column scores: 100, 350, etc
Total likelihood: 1225
Repeat for new topology
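To show what scoring one proposed topology involves, here is a small sketch that computes a log-likelihood column by column. Several things are assumptions for illustration: the Jukes-Cantor (JC69) model is used instead of GTR because its probabilities fit in one line, all branch lengths are fixed at 0.1 rather than optimised, and the tree is written as nested tuples of `(child, branch_length)` pairs. The per-column calculation is Felsenstein's pruning algorithm.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """Probability of seeing base b after branch length t, starting from base a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column):
    """Likelihood of the data below `node` for each possible base at `node`."""
    if isinstance(node, str):
        return {base: 1.0 if column[node] == base else 0.0 for base in BASES}
    result = {base: 1.0 for base in BASES}
    for child, length in node:
        child_partial = partial(child, column)
        for base in BASES:
            result[base] *= sum(
                jc69_prob(base, other, length) * child_partial[other] for other in BASES
            )
    return result


def log_likelihood(tree, alignment):
    """Add up per-column log-likelihoods (columns are treated as independent)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partial(tree, column)
        total += math.log(sum(0.25 * root[base] for base in BASES))
    return total


alignment = {"S1": "TTCAG", "S2": "TTCGG", "S3": "TGTTG", "S4": "TGATG", "S5": "AACTG"}


def three_way(x, y, z, w, out, t=0.1):
    """Hypothetical helper: topology ((x,y),(z,w),out) with every branch of length t."""
    return ((((x, t), (y, t)), t), (((z, t), (w, t)), t), (out, t))


tree_a = three_way("S1", "S2", "S3", "S4", "S5")
tree_b = three_way("S1", "S3", "S2", "S4", "S5")
print(log_likelihood(tree_a, alignment), log_likelihood(tree_b, alignment))
```

The topology with the higher (less negative) log-likelihood fits the data better under this model. A real program would also optimise branch lengths and model parameters for each topology before comparing them.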

Bootstrapping
Bootstrapping is a method to test the reliability of an inferred tree
Algorithm (for an alignment of M taxa and N columns):
Create a new alignment of length N from original, sampling columns at random with replacement
Create a new topology for this using the same method
Often in ML analysis this is done with some extra heuristics to speed up the process
Compare topology to original
Every branch that is the same is given a score of 1, any that are not present are given a score of 0
Repeat hundred(s) of times and report as a percentage found for each branch
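The resampling and scoring parts of this algorithm can be sketched as below. Tree building on each resampled alignment is left out, so `branch_support` simply takes the clades found in each replicate tree as input. All function names are assumptions made for this example.

```python
import random
from collections import Counter


def resample_columns(alignment, columns):
    """Build a new alignment from the given column indices (0-based)."""
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_alignment(alignment, rng):
    """Sample N columns at random with replacement from an alignment of N columns."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return resample_columns(alignment, columns)


def branch_support(original_clades, replicate_clades):
    """Percentage of replicate trees that contain each clade of the original tree."""
    counts = Counter()
    for clades in replicate_clades:
        for clade in original_clades:
            if clade in clades:
                counts[clade] += 1  # score 1 if present, 0 otherwise
    return {
        clade: 100 * counts[clade] / len(replicate_clades) for clade in original_clades
    }


alignment = {"S1": "TTCAG", "S2": "TTCGG", "S3": "TGTTG"}

# Picking columns 5, 4, 3, 4, 3 (1-based) gives one possible resampled alignment
print(resample_columns(alignment, [4, 3, 2, 3, 2]))

# A random replicate; the same columns are used for every sequence
print(bootstrap_alignment(alignment, random.Random(1)))
```

Because sampling is with replacement, some columns appear several times in a replicate and others not at all, while every sequence still receives the same set of columns.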

Original alignment    Resampled alignment
S1: TTCAG             S1: GACAC
S2: TTCGG             S2: GGCGC
S3: TGTTG             S3: GTTTT
Phylogenetic tree building summary
Can be done on morphological or molecular data
Molecular less likely to be affected by convergent evolution
Many methods for building trees exist, each with its own criteria for
the best fit for the data
Parsimony
Distance
Maximum Likelihood/Bayesian
Most methods require a model of evolution to give information on
how the sequences evolved (see models of evolution learning package)
Complex algorithms such as ML or Bayesian require efficient
searching of the tree space
Tasks 3
What are the main ways to avoid getting stuck in a local maximum in
tree searching?

Does maximum likelihood go sequence by sequence or column by column?

In ML, at each step do you change the alignment or the tree?

In bootstrapping is sampling done with or without replacement?


Learning outcomes
Identify the various sections of a phylogenetic tree

Define various phylogenetic terminology

Describe the process of parsimony tree building

Translate a distance matrix into a phylogeny

Discuss the drawbacks of non- and semi-parametric phylogenetic methods

Describe the process of tree space searching

Recognise the maximum likelihood and bootstrap approaches
