Intro to Phylogenetics
Monophyletic
A group of taxa that contains an ancestor and all its descendants
Monophyletic group is also referred to as a clade
Polyphyletic
A group of taxa that does not include the most recent common ancestor of its members, often brought together by convergent evolution
Paraphyletic
A group containing an ancestor and only some of its descendants, i.e. a monophyletic group from which a subgroup has been removed
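The definitions above can be checked mechanically. The following is a minimal sketch, assuming a rooted tree written as nested tuples and hypothetical taxon names A to E (none of these come from the slides). It only tests whether a group is monophyletic; it does not decide between paraphyletic and polyphyletic on its own.

```python
def leaves(node):
    """Return the set of leaf names below a node (a leaf is a string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node's descendants are exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree: ((A, B), (C, (D, E)))
tree = (("A", "B"), ("C", ("D", "E")))

print(is_monophyletic(tree, {"D", "E"}))       # an ancestor and all its descendants
print(is_monophyletic(tree, {"C", "D", "E"}))  # a larger clade
print(is_monophyletic(tree, {"C", "D"}))       # clade {C, D, E} with E removed
print(is_monophyletic(tree, {"B", "D"}))       # taxa from two separate clades
```

The third group is the paraphyletic case (a clade minus a subgroup) and the fourth is the polyphyletic case; both fail the monophyly test.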
Phylogenetic relationships terminology
Wikipedia
Phylogenetic terminology
Monophyletic
M. avium
Polyphyletic
M. simiae
(M. kubicae separate)
Paraphyletic
M. simiae without M. kubicae
(M. avium shares the same ancestor)
Multifurcating (polytomy)
A node in a tree which connects more than two branches
In-group/out-group
In-group is the set of taxa of interest. Assumed to be monophyletic
Out-group is a related set of taxa to the in-group, used for rooting
Phylogenetic nodes terminology
Multifurcating node
In-group
Bifurcating node
Out-group
(if we only used it to place root)
In-group vs out-group example
Research into L8 and its relationship to L1-L7
In-group
M. canettii used as outgroup
Allows us to see placement of L8 samples
The simplest tree-building method is parsimony
Referred to as non-parametric
The tree that requires the minimum number of character changes between taxa is the optimal solution
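As an illustration of the parsimony criterion, the sketch below counts the minimum number of changes each candidate tree needs and prefers the tree with the lower total. It uses Fitch's small-parsimony algorithm, which is an assumption on my part (the slides do not name an algorithm), and the taxa A to D with their 0/1 characters are hypothetical.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one character.

    `node` is a leaf name (str) or a pair of child nodes (bifurcating tree).
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_cost = fitch(node[0], states)
    right_set, right_cost = fitch(node[1], states)
    common = left_set & right_set
    if common:
        # Children can share a state: no extra change needed at this node.
        return common, left_cost + right_cost
    # Children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, characters):
    """Sum the minimum number of changes over all characters."""
    n_chars = len(next(iter(characters.values())))
    total = 0
    for i in range(n_chars):
        column = {name: chars[i] for name, chars in characters.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical presence/absence data for three characters in four taxa.
characters = {"A": "110", "B": "110", "C": "011", "D": "001"}

tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))

print(parsimony_score(tree1, characters))
print(parsimony_score(tree2, characters))
```

Under these assumptions `tree1` needs fewer changes than `tree2`, so parsimony would select `tree1`.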
A morphological example
A morphological parsimony example
Parsimony looks for the fewest changes that explain the relationships
https://www.khanacademy.org/science/biology/her/tree-of-life/a/building-an-evolutionary-tree
The primary problem with parsimony
Convergent evolution: the evolution of the same trait multiple times
E.g. wings in bats and birds
https://www.khanacademy.org/science/biology/her/tree-of-life/a/building-an-evolutionary-tree
Use of molecular data in phylogenetics
Phylogeny is most often undertaken using molecular data
Nucleotides/DNA sequence (sometimes RNA sequences)
Amino acids/Protein sequence
Less sensitive to convergent evolution
Parsimony also can be used on molecular data
Use G, T, C, A or amino acids as the characters
Fine on very closely related isolates (e.g. local transmission cluster separated by a couple of SNPs)
Early tree building used distance based methods
Semi-parametric
Get a distance between 2 sequences
Sequences with the shortest distance are the most closely related
Most often uses the UPGMA or neighbour-joining method
UPGMA (sequences)
S1: TTCAG
S2: TTCGG
S3: TTTTG
S4: TTATG
S5: AACTG
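A pairwise distance matrix for the five sequences above can be sketched as follows. This assumes the simplest possible distance, the number of differing positions (Hamming distance); the slides do not state which distance measure was used, so treat that choice as illustrative.

```python
from itertools import combinations

seqs = {
    "S1": "TTCAG",
    "S2": "TTCGG",
    "S3": "TTTTG",
    "S4": "TTATG",
    "S5": "AACTG",
}


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Distance for every unordered pair of sequences.
distances = {
    (a, b): hamming(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)
}

for (a, b), d in distances.items():
    print(a, b, d)
```

With this measure the closest pairs are S1/S2 and S3/S4, and S5 is the most distant from every other sequence.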
UPGMA (distance matrix)
UPGMA tree building steps
5. Get smallest distance and repeat 1 and 2 to build entire tree
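Since only the final step survives in the text, here is a hedged sketch of the usual UPGMA procedure applied to the five example sequences: repeatedly join the two closest clusters, then recompute distances to the new cluster as a size-weighted average. The Hamming distance, the nested-tuple tree format and the arbitrary handling of ties are assumptions of this sketch, not details taken from the slides.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Return (tree as nested tuples, list of (cluster label, node height))."""
    # Each cluster label maps to (subtree, number of sequences in it).
    clusters = {name: (name, 1) for name in seqs}
    dist = {
        frozenset((a, b)): float(hamming(seqs[a], seqs[b]))
        for a, b in combinations(seqs, 2)
    }
    merges = []
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest distance (ties: first found).
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        height = dist.pop(pair) / 2
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        # Distance to the new cluster is the size-weighted average of the old ones.
        for other in clusters:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((new, other))] = (
                size_a * d_a + size_b * d_b
            ) / (size_a + size_b)
        clusters[new] = ((tree_a, tree_b), size_a + size_b)
        merges.append((new, height))
    tree, _ = next(iter(clusters.values()))
    return tree, merges


seqs = {"S1": "TTCAG", "S2": "TTCGG", "S3": "TTTTG", "S4": "TTATG", "S5": "AACTG"}
tree, merges = upgma(seqs)
print(tree)
for label, height in merges:
    print(label, height)
```

Note that the smallest distance is chosen at every step, and that S5 joins last because it is equally distant from all other sequences.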
What about unobserved back mutations over long periods of time?
A->G->A in a sequence
Underestimated distances
Are all substitutions equal?
Some more likely than others
Transitions vs transversions
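Both points can be illustrated with a short sketch. The first function classifies each difference as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion. The second applies the Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3), which is one simple way to compensate for unobserved multiple changes; JC69 is my choice for illustration and is not named in the slides.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Label a pair of aligned bases as match, transition or transversion."""
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_changes(s1, s2):
    """Count matches, transitions and transversions between two sequences."""
    counts = {"match": 0, "transition": 0, "transversion": 0}
    for a, b in zip(s1, s2):
        counts[classify(a, b)] += 1
    return counts


def jc69_distance(s1, s2):
    """Jukes-Cantor corrected distance from the observed proportion of differences."""
    p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


print(count_changes("TTCAG", "TTCGG"))  # S1 vs S2
print(count_changes("TTCAG", "AACTG"))  # S1 vs S5
print(jc69_distance("TTCAG", "TTCGG"))  # larger than the observed 1/5
```

The corrected distance is always at least as large as the observed proportion of differences, which reflects the underestimation caused by hidden back mutations.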
Do you pick the samples with the smallest or largest distance at each step of the distance approach?
Break
Parametric methods of phylogenetic inference
Previously seen methods take 2 sequences, calculate a distance, and continue in a pairwise manner through the whole set of sequences
Parametric methods take a column in an alignment and calculate the optimality criterion per position in the alignment
Positions are assumed to be independent
There are two main parametric methods of phylogenetic inference:
Maximum likelihood
Bayesian analysis
Both methods:
Search tree space to find the best tree
Require an explicit model of evolution
Usually GTR (general time reversible)
Can incorporate rate heterogeneity
Tree space searching
Imagine a blind person is dropped randomly in the world and told to find Mount Everest
(the global maximum)
They walk in a random direction until they find a section that is sloped upwards
They continue to walk upwards until every direction around them is a downwards slope
They conclude that since they are at the highest point, they must be on Everest
Thus, the highest position they stand at has the maximum likelihood of being Everest,
given the data and starting point
Tree space searching
[Figure: likelihood (y-axis) against trees (x-axis), shown over several slides]
Random starting tree
Reach the maximum likelihood tree (in theory, closest to the true tree)
Random starting tree (second search)
Stuck in local maximum
Maximum likelihood tree (global maximum) versus local maximum
Tree space searching
The problem is that if they only walk upwards they could get stuck in a local maximum
Computer programs implement different strategies to try to get around this
Multiple starting points
Multiple searches at once; can switch between searching chains
Allow large and small rearrangements
Allow some steps backwards to try to improve the score
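The local-maximum problem and the "multiple starting points" strategy can be sketched on a toy one-dimensional landscape. The scores below are hypothetical stand-ins for the likelihoods of ten neighbouring trees; real programs search a far larger space with tree rearrangements, so this is only an analogy in code.

```python
# Hypothetical likelihood scores for ten neighbouring "trees".
landscape = [1, 3, 5, 4, 2, 3, 6, 9, 7, 4]


def hill_climb(scores, start):
    """Walk to the better neighbour until no neighbour improves the score."""
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos  # every direction is flat or downhill
        pos = best


def multi_start(scores, starts):
    """Run one climb per starting point and keep the best end point."""
    ends = [hill_climb(scores, s) for s in starts]
    return max(ends, key=lambda i: scores[i])


print(hill_climb(landscape, 0))        # ends on the lower peak
print(hill_climb(landscape, 5))        # ends on the higher peak
print(multi_start(landscape, [0, 5]))  # keeps the better of the two
```

A single climb from position 0 stops on the lower peak, while combining several starting points recovers the higher one. In practice the starting points would be chosen at random.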
Maximum Likelihood (ML)
A tree topology is proposed
A likelihood score is calculated for each position, and the per-position scores are combined (log-likelihoods are added up) to get an overall likelihood score for the data for a given tree topology
Uses P(v) for the proposed branches (see models of evolution lecture)
Searches tree space and tries to find the tree that has the maximum likelihood of generating the given data
Compare topologies by optimising the parameters (e.g. branch lengths) of each to fit the data
Example programs: RAxML, IQ-TREE, PAUP*, PhyML
S1: TTCAG    S1: GACAC
S2: TTCGG    S2: GGCGC
S3: TGTTG    S3: GTTTT
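To show how a per-position likelihood is calculated and summed, here is a minimal sketch using the first alignment above. It assumes the Jukes-Cantor (JC69) model for P(v) rather than GTR, purely to keep the code short, and the branch lengths are hypothetical. With three taxa there is only one unrooted topology, so the sketch compares two sets of branch lengths instead of two topologies.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, v):
    """P(v): probability of base a changing to base b over branch length v (JC69)."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column):
    """Likelihood of the data below `node` for each possible base at `node`.

    A leaf is a taxon name; an internal node is a list of (child, branch_length).
    """
    if isinstance(node, str):
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, v in node:
        child_partial = partial(child, column)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, v) * child_partial[y] for y in BASES)
    return result


def log_likelihood(tree, alignment):
    """Sum the per-position log-likelihoods (positions treated as independent)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partial(tree, column)
        # Equal base frequencies (1/4 each) at the root under JC69.
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total


alignment = {"S1": "TTCAG", "S2": "TTCGG", "S3": "TGTTG"}

# Hypothetical branch lengths (expected substitutions per site).
tree_short = [("S1", 0.1), ("S2", 0.1), ("S3", 0.5)]
tree_long = [("S1", 2.0), ("S2", 2.0), ("S3", 2.0)]

print(log_likelihood(tree_short, alignment))
print(log_likelihood(tree_long, alignment))
```

The set of branch lengths with the higher (less negative) log-likelihood fits the data better; a full ML program would optimise these values and repeat the comparison across topologies.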
Phylogenetic tree building summary
Can be done on morphological or molecular data
Molecular data are less likely to be affected by convergent evolution
Many methods for building trees exist, each with its own criteria for
the best fit for the data
Parsimony
Distance
Maximum Likelihood/Bayesian
Most methods require a model of evolution to give information on
how the sequences evolved (see models of evolution learning package)
Complex methods such as ML or Bayesian inference require efficient searching of the tree space
Tasks 3
What are the main ways to avoid getting stuck in a local maximum in
tree searching?