Lec9 Distances

The document discusses various metrics and methods used in phylogenetic analysis, including consistency index (CI), retention index (RI), parsimony, long branch attraction, distance-based tree inference, and least squares branch lengths. It notes that parsimony is not always statistically consistent and can be misleading, especially for trees with long terminal branches. Distance-based approaches convert sequence data to a pairwise distance matrix and try to find a tree that explains the observed distances, accounting for issues like multiple substitutions saturating distances between more divergent sequences. Model-based distance corrections and least squares are described as ways to address these issues.

Consistency Index (CI)

• minimum number of changes divided by the number required on the tree.

• CI=1 if there is no homoplasy

• negatively correlated with the number of species sampled


Retention Index (RI)

RI = (MaxSteps − ObsSteps) / (MaxSteps − MinSteps)

• defined to be 0 for parsimony uninformative characters

• RI=1 if the character fits perfectly

• RI=0 if the tree fits the character as poorly as possible


Qualitative description of parsimony

• Enables estimation of ancestral sequences.

• Even though parsimony always seeks to minimize the number of changes, it can perform well even when changes are not rare.

• Does not “prefer” to put changes on one branch over another

• Hard to characterize statistically


– the set of conditions in which parsimony is guaranteed to work well is
very restrictive (low probability of change and not too much branch
length heterogeneity);
– Parsimony often performs well in simulation studies (even when outside
the zones in which it is guaranteed to work);
– Estimates of the tree can be extremely biased.
Long branch attraction

Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology 27: 401-410.

[Figure: a four-taxon tree with two long terminal branches (length 1.0), two short terminal branches (0.01), and a short internal branch (0.01).]
Long branch attraction

[Figure: the same tree shown with tip states. One parsimony-informative site pattern (A, G, A, G) arises by inheritance; another (A, A, G, G) arises from parallel changes on the two long branches.]

The probability of a parsimony-informative site due to inheritance is very low (roughly 0.0003).

The probability of a misleading parsimony-informative site due to parallelism is much higher (roughly 0.008).
Long branch attraction

Parsimony is almost guaranteed to get this tree wrong.

[Figure: the true four-taxon tree for taxa 1-4 alongside the different tree inferred by parsimony.]
Inconsistency

• Statistical consistency (roughly speaking) means converging to the true answer as the amount of data goes to ∞.

• Parsimony based tree inference is not consistent for some tree shapes. In
fact it can be “positively misleading”:
– “Felsenstein zone” tree
– Many clocklike trees with short internal branch lengths and long
terminal branches (Penny et al., 1989, Huelsenbeck and Lander,
2003).

• Methods for assessing confidence (e.g. bootstrapping) will indicate that you should be very confident in the wrong answer.
If the data are generated such that

Pr(site pattern A, A, G, G for taxa 1-4) ≈ 0.0003    and    Pr(site pattern A, G, G, A) ≈ 0.008,

then how can we hope to infer the tree ((1,2),3,4)?


Looking at the data in “bird’s eye” view (using Mesquite), we see that sequences 1 and 4 are clearly very different.

Perhaps we can estimate the tree if we use the branch length information from the sequences...
Distance-based approaches to inferring trees

• Convert the raw data (sequences) to a matrix of pairwise distances

• Try to find a tree that explains these distances.

• Not simply clustering the most similar sequences.


            1 2 3 4 5 6 7 8 9 10
Species 1   C G A C C A G G T A
Species 2   C G A C C A G G T A
Species 3   C G G T C C G G T A
Species 4   C G G C C A T G T A

Can be converted to a distance matrix:

            Species 1   Species 2   Species 3   Species 4
Species 1   0           0           0.3         0.2
Species 2   0           0           0.3         0.2
Species 3   0.3         0.3         0           0.3
Species 4   0.2         0.2         0.3         0

Note that the distance matrix is symmetric, so we can just use the lower triangle:

            Species 1   Species 2   Species 3
Species 2   0
Species 3   0.3         0.3
Species 4   0.2         0.2         0.3
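A minimal Python sketch (not from the slides) of this conversion: it computes the p-distance (proportion of differing sites) for every pair of the sequences above and reproduces the matrix just shown.

sequences = {
    "Species 1": "CGACCAGGTA",
    "Species 2": "CGACCAGGTA",
    "Species 3": "CGGTCCGGTA",
    "Species 4": "CGGCCATGTA",
}

def p_distance(seq_a, seq_b):
    # proportion of aligned sites at which the two sequences differ
    assert len(seq_a) == len(seq_b)
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

names = list(sequences)
for i, x in enumerate(names):
    for y in names[:i]:
        # e.g. Species 3 vs Species 1 prints 0.3, Species 4 vs Species 1 prints 0.2
        print(f"{x} vs {y}: {p_distance(sequences[x], sequences[y]):.1f}")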

Can we find a tree that would predict these observed character divergences?

[Figure: the tree ((Sp. 1, Sp. 2),(Sp. 3, Sp. 4)) with terminal branch lengths 0.0 (Sp. 1), 0.0 (Sp. 2), 0.2 (Sp. 3), 0.1 (Sp. 4) and internal branch 0.1, which reproduces the observed distances exactly.]

Consider an unrooted four-taxon tree with terminal branches a, b, c, d leading to taxa 1, 2, 3, 4, and internal branch i.

Parameters (path lengths predicted by the tree):

p12 = a + b
p13 = a + i + c
p14 = a + i + d
p23 = b + i + c
p24 = b + i + d
p34 = c + d

Data (observed distances):

     1     2     3
2   d12
3   d13   d23
4   d14   d24   d34
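A minimal Python sketch (not from the slides) using NumPy: because each predicted path length is a linear combination of branch lengths, the branch lengths for a fixed topology can be fit by least squares. The distances here are hypothetical additive values (a = b = c = d = 0.1, i = 0.05) chosen for illustration.

import numpy as np

# rows correspond to p12, p13, p14, p23, p24, p34; columns to a, b, c, d, i
A = np.array([[1, 1, 0, 0, 0],   # p12 = a + b
              [1, 0, 1, 0, 1],   # p13 = a + i + c
              [1, 0, 0, 1, 1],   # p14 = a + i + d
              [0, 1, 1, 0, 1],   # p23 = b + i + c
              [0, 1, 0, 1, 1],   # p24 = b + i + d
              [0, 0, 1, 1, 0]])  # p34 = c + d
d_obs = np.array([0.20, 0.25, 0.25, 0.25, 0.25, 0.20])   # hypothetical additive distances

branch_lengths, *_ = np.linalg.lstsq(A, d_obs, rcond=None)
print(branch_lengths)   # approximately [0.1, 0.1, 0.1, 0.1, 0.05]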
If our pairwise distance measurements were error-free estimates
of the evolutionary distance between the sequences, then we
could always infer the tree from the distances.

The evolutionary distance is the number of mutations that have occurred along the path that connects two tips.

We hope the distances that we measure can produce good estimates of the evolutionary distance, but we know that they cannot be perfect.
Intuition of sequence divergence vs evolutionary distance

This can’t be right!

[Figure: p-distance (y-axis, 0 to 1.0) plotted against evolutionary distance (x-axis, 0 to ∞).]
Sequence divergence vs evolutionary distance

[Figure: p-distance (y-axis, 0 to 1.0) plotted against evolutionary distance (x-axis, 0 to ∞); the p-dist “levels off” as the evolutionary distance grows.]
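For intuition, here is a minimal Python sketch (not from the slides). Under the Jukes-Cantor model the expected p-distance is p = (3/4)(1 − e^(−4d/3)), which levels off toward 0.75 as the evolutionary distance d grows; the distance correction shown on a later slide is the inverse of this relationship.

import math

for d in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    # expected p-distance under Jukes-Cantor; approaches 0.75 as d -> infinity
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    print(f"d = {d:5.1f}   expected p-dist = {p:.3f}")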
“Multiple hits” problem (also known as saturation)

• The levelling off of the sequence divergence vs. time plot is caused by multiple substitutions affecting the same site in the DNA.

• At large distances the “raw” sequence divergence (also known as the p-distance or Hamming distance) is a poor estimate of the true evolutionary distance.

• Large p-distances respond more to model-based correction – and there is a larger error associated with the correction.

[Figure: observed number of differences (0-15) plotted against the number of substitutions simulated onto a twenty-base sequence (1-20).]
Distance corrections

• applied to distances before tree estimation,


• converts raw distances to an estimate of the evolutionary distance
 
d = −(3/4) ln(1 − 4c/3)

“raw” p-distances:

     1     2     3
2   c12
3   c13   c23
4   c14   c24   c34

corrected distances:

     1     2     3
2   d12
3   d13   d23
4   d14   d24   d34
 
d = −(3/4) ln(1 − 4c/3)

“raw” p-distances:

     1     2     3
2   0.0
3   0.3   0.3
4   0.2   0.2   0.3

corrected distances:

     1       2       3
2   0
3   0.383   0.383
4   0.233   0.233   0.383
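A minimal Python sketch (not from the slides) of this correction applied to the p-distances from the example; it reproduces the corrected values 0.233 and 0.383.

import math

def jc_correct(p):
    # corrected distance: d = -(3/4) * ln(1 - 4p/3)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.0, 0.2, 0.3):
    print(f"p = {p:.1f}  ->  d = {jc_correct(p):.3f}")
# p = 0.2 -> d = 0.233 ; p = 0.3 -> d = 0.383, matching the corrected matrix above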
Least Squares Branch Lengths

Sum of Squares = Σi Σj (pij − dij)² / σij^k

• minimize the discrepancy between path lengths and observed distances

• σij^k is used to “downweight” distance estimates with high variance

• in unweighted least squares (Cavalli-Sforza & Edwards, 1967): k = 0

• in the method of Fitch and Margoliash (1967): k = 2 and σij = dij
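As an illustration, here is a minimal Python sketch (not from the slides) of this score. The distance and path-length values are a few pairs taken from the “arbitrary branch lengths” example on the next slide, and the exponent k switches between the two weighting schemes.

pairs = [
    # (d_ij, p_ij) for a few pairs from the example below
    (0.09267, 0.2),   # Hu-Ch
    (0.10928, 0.3),   # Hu-Go
    (0.17848, 0.4),   # Hu-Or
]

def sum_of_squares(pairs, k=0):
    # sum of (p - d)^2 / d**k; k = 0 is the unweighted criterion of
    # Cavalli-Sforza & Edwards (1967); k = 2 with sigma_ij = d_ij is Fitch-Margoliash (1967)
    return sum((p - d) ** 2 / d ** k for d, p in pairs)

print(sum_of_squares(pairs, k=0))  # unweighted
print(sum_of_squares(pairs, k=2))  # Fitch-Margoliash weighting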
Poor fit using arbitrary branch lengths

[Figure: the tree ((Hu,Ch),Go,(Or,Gi)) drawn with every branch length set to 0.1.]

Species    dij        pij    (p − d)²
Hu-Ch      0.09267    0.2    0.01152
Hu-Go      0.10928    0.3    0.03637
Hu-Or      0.17848    0.4    0.04907
Hu-Gi      0.20420    0.4    0.03834
Ch-Go      0.11440    0.3    0.03445
Ch-Or      0.19413    0.4    0.04238
Ch-Gi      0.21591    0.4    0.03389
Go-Or      0.18836    0.3    0.01246
Go-Gi      0.21592    0.3    0.00707
Or-Gi      0.21466    0.2    0.00021
S.S.                         0.26577
Optimizing branch lengths yields the least-squares score

[Figure: the same tree ((Hu,Ch),Go,(Or,Gi)) with least-squares branch lengths: Hu 0.04092, Ch 0.05175, Go 0.05790, Or 0.09482, Gi 0.11984; internal branch between the (Hu,Ch) node and Go 0.00761, and between Go and the (Or,Gi) node 0.03691.]

Species    dij        pij        (p − d)²
Hu-Ch      0.09267    0.09267    0.000000000
Hu-Go      0.10928    0.10643    0.000008123
Hu-Or      0.17848    0.18026    0.000003168
Hu-Gi      0.20420    0.20528    0.000001166
Ch-Go      0.11440    0.11726    0.000008180
Ch-Or      0.19413    0.19109    0.000009242
Ch-Gi      0.21591    0.21611    0.000000040
Go-Or      0.18836    0.18963    0.000001613
Go-Gi      0.21592    0.21465    0.000001613
Or-Gi      0.21466    0.21466    0.000000000
S.S.                              0.000033144
Least squares as an optimality criterion

[Figure: two alternative topologies, each with its own least-squares branch lengths. Left: ((Hu,Go),Ch,(Or,Gi)) with Hu 0.04742, Go 0.05175, Ch 0.05591, Or 0.09482, Gi 0.11984 and internal branches −0.00701 and 0.04178; SS = 0.00034. Right: ((Hu,Ch),Go,(Or,Gi)) with the branch lengths from the previous slide; SS = 0.000033144 (best tree).]
Minimum evolution optimality criterion

[Figure: the same two trees as above. Left: ((Hu,Go),Ch,(Or,Gi)), sum of branch lengths = 0.41152. Right: ((Hu,Ch),Go,(Or,Gi)), sum of branch lengths = 0.40975 (best tree).]

We still use least-squares branch lengths when we use minimum evolution.
Huson and Steel – distances that perfectly mislead

Huson and Steel (2004) point out problems when our pairwise distances have errors (do not reflect true evolutionary distances). Consider:

        Characters           Distances to taxa
Taxon                        A   B   C   D
A       A A C A A C C        -   6   6   5
B       A C A C C A A        6   -   5   6
C       C A G G G A A        6   5   -   6
D       C G A A A G G        5   6   6   -

The characters are homoplasy-free on tree AB|CD, but the distances are additive on tree AD|BC (and not additive on any other tree).
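A minimal Python sketch (not from the slides) that recomputes the Hamming distances from these characters and compares the three pairwise sums; the smallest sum corresponds to the split AD|BC rather than the homoplasy-free tree AB|CD, anticipating the four-point condition discussed later.

from itertools import combinations

chars = {
    "A": "AACAACC",
    "B": "ACACCAA",
    "C": "CAGGGAA",
    "D": "CGAAAGG",
}

def hamming(x, y):
    # number of sites at which the two sequences differ
    return sum(a != b for a, b in zip(chars[x], chars[y]))

d = {pair: hamming(*pair) for pair in combinations("ABCD", 2)}
print(d)   # matches the matrix above (A-D = 5, B-C = 5, all other pairs 6)

# sums for the three possible splits; the smallest supports AD|BC
print("AB|CD:", d["A", "B"] + d["C", "D"])   # 12
print("AC|BD:", d["A", "C"] + d["B", "D"])   # 12
print("AD|BC:", d["A", "D"] + d["B", "C"])   # 10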
Huson and Steel – distances that perfectly mislead

Clearly, the previous matrix was contrived and not typical of realistic data.

Would we ever expect to see additive distances on the wrong tree as the result of a reasonable evolutionary process?

Yes.

Huson and Steel (2004) show that under the equal-input model (more on this later), the uncorrected distances can be additive on the wrong tree, leading to long-branch attraction. The result holds even if the number of characters → ∞.
Failure to correct distances sufficiently leads to poor performance

“Under-correcting” will underestimate long evolutionary distances more than short distances.

The result is the classic “long-branch attraction” phenomenon.
Distance methods – summary

We can:

• summarize a dataset as a matrix of distances or dissimilarities.


• correct these distances for unseen character state changes.
• estimate a tree by finding the tree with path lengths that are
“closest” to the corrected distances.
[Figure: an unrooted four-taxon tree with terminal branches a (to A), b (to B), c (to C), d (to D) and internal branch i, alongside the table of observed distances dAB, dAC, dAD, dBC, dBD, dCD.]

If the tree above is correct then:

pAB = a + b
pAC = a + i + c
pAD = a + i + d
pBC = b + i + c
pBD = b + i + d
pCD = c + d
[Figure series: the same tree and distance table, with the paths contributing to each expression highlighted step by step.]

Because each distance is the sum of the branch lengths along the path between two tips (assuming additivity), we can combine distances to isolate the internal branch:

dAC + dBD                 covers every terminal branch once and the internal branch i twice
dAC + dBD − dAB           removes the terminal branches leading to A and B
dAC + dBD − dAB − dCD     removes the terminal branches leading to C and D, leaving 2i

so an estimate of the internal branch length is

i† = (dAC + dBD − dAB − dCD) / 2
Note that our estimate

i† = (dAC + dBD − dAB − dCD) / 2

does not use all of our data: dBC and dAD are ignored!

We could have used dBC + dAD instead of dAC + dBD (you can see this by going through the previous steps after rotating the internal branch):

i* = (dBC + dAD − dAB − dCD) / 2

A better estimate than either i† or i* would be the average of both of them:

i′ = (dBC + dAD + dAC + dBD) / 4 − (dAB + dCD) / 2
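As a quick check, here is a minimal Python sketch (not from the slides) that applies the three estimators just derived to additive distances from a hypothetical quartet with terminal branches a = b = c = d = 0.1 and internal branch i = 0.05; all three recover the internal branch length.

d = {
    ("A", "B"): 0.20,  # a + b
    ("C", "D"): 0.20,  # c + d
    ("A", "C"): 0.25,  # a + i + c
    ("B", "D"): 0.25,  # b + i + d
    ("B", "C"): 0.25,  # b + i + c
    ("A", "D"): 0.25,  # a + i + d
}

i_dagger = (d["A", "C"] + d["B", "D"] - d["A", "B"] - d["C", "D"]) / 2
i_star   = (d["B", "C"] + d["A", "D"] - d["A", "B"] - d["C", "D"]) / 2
i_prime  = (i_dagger + i_star) / 2
print(i_dagger, i_star, i_prime)   # all equal 0.05 when the distances are additive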
The four-point condition of Buneman (1971)

For a quartet with terminal branch lengths νa, νb, νc, νd and internal branch length νi, the three pairwise sums under each of the three possible topologies are (assuming additivity of distances):

              AB|CD                     AC|BD                     AD|BC
dAB + dCD     νa + νb + νc + νd         νa + νb + νc + νd + 2νi   νa + νb + νc + νd + 2νi
dAC + dBD     νa + νb + νc + νd + 2νi   νa + νb + νc + νd         νa + νb + νc + νd + 2νi
dAD + dBC     νa + νb + νc + νd + 2νi   νa + νb + νc + νd + 2νi   νa + νb + νc + νd

On the true topology the sum that pairs the sister taxa lacks the 2νi term, so it is the smallest of the three.
For a quartet whose true split pairs A with C and B with D (branch lengths νa, νb, νc, νd, internal branch νi), suppose each observed distance carries an error term εij:

dAB + dCD = νa + νb + νc + νd + 2νi + εAB + εCD
dAC + dBD = νa + νb + νc + νd + εAC + εBD
dAD + dBC = νa + νb + νc + νd + 2νi + εAD + εBC

If |εij| < νi/2 then dAC + dBD will still be the smallest sum, so Buneman’s method will get the tree correct.

Worst case: εAC = εBD = νi/2 and εAB = εCD = −νi/2, in which case

dAC + dBD = νa + νb + νc + νd + νi = dAB + dCD

Both Buneman’s four-point condition and Hennigian logic return the tree given perfectly clean data. But what does “perfectly clean data” mean?

1. Hennigian analysis → no homoplasy: the infinite-alleles model.

2. Buneman’s four-point test → no multiple hits to the same site: the infinite-sites model.
The guiding principle of distance-based methods

If our data are true measures of evolutionary distances (and the distance along each branch is always > 0) then:

1. The distances will be additive on the true tree.

2. The distances will not be additive on any other tree.

This is the basis of Buneman’s method and the motivation for minimizing the sum-of-squared error (least squares) to choose among trees.
Balanced minimum evolution

The logic behind Buneman’s four-point condition has been extended to trees of more than 4 taxa by Pauplin (2000) and Semple and Steel (2004).

Pauplin (2000) showed that you can calculate a tree length from the pairwise distances without calculating branch lengths. The key is weighting the distances:

l = Σ_{i<j} wij dij    (summing over all pairs of leaves)

where

wij = 1 / 2^n(i,j)

and n(i, j) is the number of interior nodes on the path from i to j.
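As a sanity check of the weighting, here is a minimal Python sketch (not from the slides) applying Pauplin's formula to the quartet ((A,B),(C,D)) with hypothetical branch lengths. Cherry pairs cross one interior node (weight 1/2), the other pairs cross two (weight 1/4), and the weighted sum of distances equals the true tree length.

a, b, c, d, i = 0.1, 0.2, 0.15, 0.05, 0.03   # hypothetical branch lengths

dist = {
    ("A", "B"): a + b,         # cherry pair: 1 interior node on the path
    ("C", "D"): c + d,         # cherry pair: 1 interior node
    ("A", "C"): a + i + c,     # 2 interior nodes
    ("A", "D"): a + i + d,
    ("B", "C"): b + i + c,
    ("B", "D"): b + i + d,
}
nodes = {("A", "B"): 1, ("C", "D"): 1,
         ("A", "C"): 2, ("A", "D"): 2, ("B", "C"): 2, ("B", "D"): 2}

balanced_length = sum(dist[p] / 2 ** nodes[p] for p in dist)
print(balanced_length, a + b + c + d + i)   # both equal the true tree length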
Balanced minimum evolution

“Balanced minimum evolution” (Desper and Gascuel, 2002, 2004) – fit the branch lengths using the estimators of Pauplin (2000) and prefer the tree with the smallest tree length.

BME is a form of weighted least squares in which distances are down-weighted by an exponential function of the topological distances between the leaves.

Desper and Gascuel (2005): neighbor-joining is star decomposition (more on this later) under BME. See Gascuel and Steel (2006).
FastME

Software by Desper and Gascuel (2004) which implements searching under the balanced minimum evolution criterion.

It is extremely fast and is more accurate than neighbor-joining (based on simulation studies).
Distance methods: pros

• Fast – the new FastTree method (Price et al., 2009) can calculate a tree in less time than it takes to calculate a full distance matrix!

• Can use models to correct for unobserved differences

• Works well for closely related sequences

• Works well for clock-like sequences


Distance methods: cons

• Do not use all of the information in sequences

• Do not reconstruct character histories, so they do not enforce all logical constraints

[Figure: a four-taxon tree with tip states A, A, G, G.]
Neighbor-joining

Saitou and Nei (1987). r is the number of leaves remaining; start with r = N.

1. Choose the pair of leaves x and y that minimize Q(x, y):

   Q(i, j) = (r − 2) dij − Σk dik − Σk djk    (sums over all r remaining leaves k)

2. Join x and y at a new node z. Take x and y out of the leaf set and distance matrix, and add the new node z as a leaf.
Neighbor-joining (continued)

3. Set the branch length from x to z using:

   dxz = dxy/2 + (Σk dxk − Σk dyk) / (2(r − 2))

   (the length of the branch from y to z is set with a similar formula).

4. Update the distance matrix by adding, for any other taxon k, the distance:

   dzk = (dxk + dyk − dxz − dyz) / 2

5. Return to step 1 until you are down to a trivial tree.
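The following is a minimal Python sketch (not from the slides) of a single neighbor-joining round, applied to the six-taxon distance matrix used in the worked example below; it reproduces the choice of B and C as the first pair to join and the reduced distances to the new node.

from itertools import combinations

d = {
    ("A", "B"): 0.258, ("A", "C"): 0.274, ("A", "D"): 0.302,
    ("A", "E"): 0.288, ("A", "F"): 0.250, ("B", "C"): 0.204,
    ("B", "D"): 0.248, ("B", "E"): 0.224, ("B", "F"): 0.160,
    ("C", "D"): 0.278, ("C", "E"): 0.252, ("C", "F"): 0.226,
    ("D", "E"): 0.268, ("D", "F"): 0.210, ("E", "F"): 0.194,
}

def dist(x, y):
    # look up a distance regardless of the order of the two labels
    return 0.0 if x == y else d[min(x, y), max(x, y)]

taxa = ["A", "B", "C", "D", "E", "F"]
r = len(taxa)
row_sum = {x: sum(dist(x, k) for k in taxa) for x in taxa}

# step 1: Q(i, j) = (r - 2) d_ij - sum_k d_ik - sum_k d_jk
q = {(x, y): (r - 2) * dist(x, y) - row_sum[x] - row_sum[y]
     for x, y in combinations(taxa, 2)}
x, y = min(q, key=q.get)
print(x, y, round(q[x, y], 3))          # B C -1.512 : B and C are joined first

# step 3: branch lengths from x and y to the new node z = (B,C)
d_xz = dist(x, y) / 2 + (row_sum[x] - row_sum[y]) / (2 * (r - 2))
d_yz = dist(x, y) - d_xz

# step 4: reduced distances from the new node z to every remaining taxon
for k in taxa:
    if k not in (x, y):
        print(k, round((dist(x, k) + dist(y, k) - d_xz - d_yz) / 2, 3))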
Neighbor-joining (example)

A B C D E F
A -
B 0.258 -
C 0.274 0.204 -
D 0.302 0.248 0.278 -
E 0.288 0.224 0.252 0.268 -
F 0.250 0.160 0.226 0.210 0.194 -
Neighbor-joining (example)

Σk dik        A       B       C       D       E       F
1.372 A 0.0 0.258 0.274 0.302 0.288 0.25
1.094 B 0.258 0.0 0.204 0.248 0.224 0.16
1.234 C 0.274 0.204 0.0 0.278 0.252 0.226
1.306 D 0.302 0.248 0.278 0.0 0.268 0.21
1.226 E 0.288 0.224 0.252 0.268 0.0 0.194
1.040 F 0.25 0.16 0.226 0.21 0.194 0.0
Q(A, B) -1.434
Q(A, C) -1.510
Q(A, D) -1.470
Q(A, E) -1.446
Q(A, F ) -1.412
Q(B, C) -1.512
Q(B, D) -1.408
Q(B, E) -1.424
Q(B, F ) -1.494
Q(C, D) -1.428
Q(C, E) -1.452
Q(C, F ) -1.370
Q(D, E) -1.460
Q(D, F ) -1.506
Q(E, F ) -1.490
Neighbor-joining (example)

A D E F (B,C)
A 0.0 0.302 0.288 0.25 0.164
D 0.302 0.0 0.268 0.21 0.161
E 0.288 0.268 0.0 0.194 0.136
F 0.25 0.21 0.194 0.0 0.091
(B,C) 0.164 0.161 0.136 0.091 0.0
Neighbor-joining (example)

Σk dik        A       D       E       F       (B,C)
1.004000 A 0.0 0.302 0.288 0.25 0.164
0.941000 D 0.302 0.0 0.268 0.21 0.161
0.886000 E 0.288 0.268 0.0 0.194 0.136
0.745000 F 0.25 0.21 0.194 0.0 0.091
0.552000 (B,C) 0.164 0.161 0.136 0.091 0.0
Neighbor-joining (example)

Q(A, D) -1.039000
Q(A, E) -1.026000
Q(A, F ) -0.999000
Q(A, (B, C)) -1.064000
Q(D, E) -1.023000
Q(D, F ) -1.056000
Q(D, (B, C)) -1.010000
Q(E, F ) -1.049000
Q(E, (B, C)) -1.030000
Q(F, (B, C)) -1.024000
Neighbor-joining (example)

D E F (A,(B,C))
D 0.0 0.268 0.21 0.1495
E 0.268 0.0 0.194 0.13
F 0.21 0.194 0.0 0.0885
(A,(B,C)) 0.1495 0.13 0.0885 0.0
Neighbor-joining (example)

Σk dik        D       E       F       (A,(B,C))
0.627500 D 0.0 0.268 0.21 0.1495
0.592000 E 0.268 0.0 0.194 0.13
0.492500 F 0.21 0.194 0.0 0.0885
0.368000 (A,(B,C)) 0.1495 0.13 0.0885 0.0
Neighbor-joining (example)

Q(D, E) -0.683500
Q(D, F ) -0.700000
Q(D, (A, (B, C))) -0.696500
Q(E, F ) -0.696500
Q(E, (A, (B, C))) -0.700000
Q(F, (A, (B, C))) -0.683500

((D, F ), E, (A, (B, C)))


Neighbor-joining is special

Bryant (2005) discusses neighbor-joining in the context of clustering methods that:

• Work on the distance (or dissimilarity) matrix as input.


• Repeatedly
– select a pair of taxa to agglomerate (step 1 above)
– make the pair into a new group (step 2 above)
– estimate branch lengths (step 3 above)
– reduce the distance matrix (step 4 above)
Neighbor-joining is special (cont)

Bryant (2005) shows that if you want your selection criterion to be:

• based solely on distances,

• invariant to the ordering of the leaves (no a priori special taxa),

• a linear combination of distances (simple coefficients for weights, no fancy weighting schemes), and

• statistically consistent,

then neighbor-joining’s Q-criterion as a selection rule is the only choice.
Neighbor-joining is not perfect

• BioNJ (Gascuel, 1997) does a better job by using the variances and covariances in the reduction step.
• Weighbor (Bruno et al., 2000) includes the variance
information in the selection step.
• FastME (Desper and Gascuel, 2002, 2004) does a better job
of finding the BME tree (and seems to get the true tree right
more often).
References

Bruno, W., Socci, N., and Halpern, A. (2000). Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Molecular Biology and Evolution, 17(1):189–197.

Bryant, D. (2005). On the uniqueness of the selection criterion in neighbor-joining. Journal of Classification, 22:3–15.

Buneman, P. (1971). The recovery of trees from measures of dissimilarity. In Hodson, F. R., Kendall, D. G., and Tautu, P., editors, Mathematics in the Archaeological and Historical Sciences, Edinburgh. The Royal Society of London and the Academy of the Socialist Republic of Romania, Edinburgh University Press.

Desper, R. and Gascuel, O. (2002). Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. Journal of Computational Biology, 9(5):687–705.

Desper, R. and Gascuel, O. (2004). Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting. Molecular Biology and Evolution.

Desper, R. and Gascuel, O. (2005). The minimum evolution distance-based approach to phylogenetic inference. In Gascuel, O., editor, Mathematics of Evolution and Phylogeny, pages 1–32. Oxford University Press.

Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14(7):685–695.

Gascuel, O. and Steel, M. (2006). Neighbor-joining revealed. Molecular Biology and Evolution, 23(11):1997–2000.

Huson, D. and Steel, M. (2004). Distances that perfectly mislead. Systematic Biology, 53(2):327–332.

Pauplin, Y. (2000). Direct calculation of a tree length using a distance matrix. Journal of Molecular Evolution, 51:41–47.

Price, M. N., Dehal, P., and Arkin, A. P. (2009). FastTree: Computing large minimum-evolution trees with profiles instead of a distance matrix. Molecular Biology and Evolution, 26(7):1641–1650.

Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.

Semple, C. and Steel, M. (2004). Cyclic permutations and evolutionary trees. Advances in Applied Mathematics, 32(4):669–680.
