Algorithms in Bioinformatics
Series Editors
Olivier Gascuel
LIRMM, 161 rue Ada
34392 Montpellier, France
E-mail: [email protected]
Bernard M.E. Moret
University of New Mexico
Department of Computer Science
Albuquerque, NM 87131, USA
E-mail: [email protected]
CR Subject Classification (1998): F.1, E.1, F.2.2, G.1, G.2.1, G.3, J.3
ISSN 0302-9743
ISBN 3-540-42516-0 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
We were also fortunate to attract Dr. Gene Myers, Vice-President for Infor-
matics Research at Celera Genomics, and Prof. Jotun Hein, Aarhus University,
to address the joint workshops, joining five other distinguished speakers (Profs. Herbert Edelsbrunner and Lars Arge from Duke University, Prof. Susanne Albers from Dortmund University, Prof. Uri Zwick from Tel Aviv University, and Dr. Andrei Broder from Alta Vista). The quality of the submissions and the interest expressed in the workshop are promising; plans for next year's workshop are under way.
We would like to thank all the authors for submitting their work to the
workshop and all the presenters and attendees for their participation. We were
particularly fortunate in enlisting the help of a very distinguished panel of re-
searchers for our program committee, which undoubtedly accounts for the large
number of submissions and the high quality of the presentations. Our heartfelt
thanks go to all:
In addition, the opinion of several other researchers was solicited. These subreferees include Tim Beissbarth, Vincent Berry, Benny Chor, Eivind Coward, Ingvar Eidhammer, Thomas Faraut, Nicolas Galtier, Michel Goulard, Jacques van Helden, Anja von Heydebreck, Ina Koch, Chaim Linhart, Hannes Luz, Vsevolod Yu., Michal Ozery, Itsik Pe'er, Sven Rahmann, Katja Rateitschak, Eric Rivals, Mikhail A. Roytberg, Roded Sharan, Jens Stoye, Dekel Tsur, and Jian Zhang.
We thank them all.
Lastly, we thank Prof. Erik Meineche-Schmidt, BRICS codirector, who
started the entire enterprise by calling on one of us (Bernard Moret) to set up the
workshop and who led the team of committee chairs and organizers through the
setup, development, and actual events of the three combined workshops, with
the assistance of Prof. Gerth Brodal.
We hope that you will consider contributing to WABI 2002, through a sub-
mission or by participating in the workshop.
1 Introduction
Traditional sequence analysis [1] needs proper evolutionary parameters. These parameters depend on the actual divergence time, which is usually unknown as well. Another major problem is that the evolutionary parameters cannot be estimated from a single alignment. Incorrectly determined parameters might cause an unrecognized bias in the sequence alignment.
One way to break this vicious circle is the maximum likelihood parameter estima-
tion. In the pioneering work of Bishop and Thompson [2], an approximate likelihood
calculation was introduced. Several years later, Thorne, Kishino, and Felsenstein wrote
a landmark paper [3], in which they presented an improved maximum likelihood algo-
rithm, which estimates the evolutionary distance between two sequences involving all
possible alignments in the likelihood calculation. Their 1991 model (frequently referred
to as the TKF91 model) considers only single insertions and deletions, but this con-
sideration is rather unrealistic [4,5]. Later the model was further improved by allowing longer insertions and deletions [4]; this version is usually referred to as the TKF92 model. However, this model assumes that sequences contain unbreakable fragments, and only whole fragments are inserted and deleted. As was shown [4], the fragment model has a flaw: with unbreakable fragments, there is no possible explanation for overlapping deletions with a scenario of just two events. This problem is solvable by assuming that the ancestral sequence was fragmented independently on both branches immediately after the split, and that the sequences evolved since then according to the fragment model [6]. However, this assumption does not solve the problem completely: fragments do not
have biological realism. This lack of biological realism is revealed when we want to generalize the split model to multiple sequence comparison. For example, consider that we have proteins from humans, gorillas and chimps. When we want to analyze the three sequences simultaneously, two pairs of fragmentations are needed: one pair at the gorilla-(human and chimp) split and one pair at the human-chimp split. When only sequences from gorillas and humans are compared, the fragmentation at the human-chimp split is omitted. Thus, the description of the evolution of two sequences depends on the number of the introduced splits, and there is no sensible interpretation of this dependence.
Since our model is related to the TKF91 model we describe it briefly. Most of the
definitions and notations are introduced here.
The TKF model is the fusion of two independent time-continuous Markov pro-
cesses, the substitution and the insertion-deletion process.
The Substitution Process: Each character can be substituted independently by another character, as dictated by one of the well-known substitution processes [7],[8]. The substitution process is described by a system of linear differential equations

\frac{dx(t)}{dt} = Q\,x(t)   (1)

where Q is the rate matrix. Since Q contains too many parameters, it is usually separated into two components, Q = sQ_0, where Q_0 is kept constant and is estimated with a less rigorous method than maximum likelihood [4]. The solution of (1) is

x(t) = e^{Qt}\,x(0).   (2)
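As an illustration of (1)-(2), the substitution probabilities over a branch can be obtained with a matrix exponential. The sketch below is not taken from the paper: the 4x4 Jukes-Cantor-style rate matrix, the scaling factor s and the time value are made-up placeholders; any standard rate matrix [7,8] could be substituted.

    import numpy as np
    from scipy.linalg import expm

    # Hypothetical rate matrix Q = s * Q0 with Q0 fixed in advance (assumption).
    s = 0.1
    Q0 = np.full((4, 4), 1.0 / 3.0)
    np.fill_diagonal(Q0, -1.0)
    Q = s * Q0

    t = 1.0                 # illustrative branch length
    F = expm(Q * t)         # F[i, j]: probability that character i has become j after time t
    print(F.sum(axis=1))    # each row sums to 1

Each row of F sums to one, and F plays the role of the substitution probabilities f_ij used below.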
Fig. 1. The Possible Fates of Links. The second column shows the fate of the immortal
link (o). After a time period t it has k descendants including itself. The third column
describes the fate of a surviving mortal link (*). It has k descendants including itself
after time t. The fourth column depicts the fate of a mortal link that died, but left k
descendants after time t.
Calculating the Joint Probability of Two Sequences: The joint probability of two
sequences A and B is calculated as the equilibrium probability of sequence A times the
probability that sequence B evolved from A under time 2t, where t is the divergence
time.
P(A, B) = P(A)\,P_{2t}(B \mid A)   (3)
A possible transition is described as an alignment. The upper sequence is the ancestor;
the lower sequence is the descendant. For example the following alignment describes
that the immortal link o has one descendant, the first mortal link * died out, and the
second mortal link has two descendants including itself.
o - A* U* -
o G* - C* A*
The probability of an alignment is the probability of the ancestor times the probability of the transition. For example, the probability of the above alignment is

\pi_2\,\pi(A)\,\pi(U)\; p_2(t)\,\pi(G)\; p_0^{(2)}(t)\; p_2^{(1)}(t)\, f_{UC}(2t)\,\pi(A)   (4)

where \pi_n is the probability that a sequence contains n mortal links, \pi(X) is the frequency of the character X, and f_{ij}(2t) is the probability that character i has changed into character j after time 2t. The joint probability of two sequences is the summation of the alignment probabilities.
2 The Model
Our model differs from the TKF models in the insertion-deletion process. The TKF91
model assumes only single insertions and deletions, as illustrated in Figure 2. Long
insertions and deletions are allowed in the TKF92 model, as illustrated in Figure 3.
However, these long indels are considered as unbreakable fragments as they have only
one common mortal link. The death of the mortal link causes the deletion of every char-
acter in the long insertion. The distinction from the previous model is that in our model
Fig. 2. The Flow-chart of the TKF91 Model (A* → A*C* → A*C*G*). Each link can give birth to a mortal link with birth rate λ > 0. Mortal links die with death rate μ > 0.
Fig. 3. The Flow-chart of the TKF92 Model (A* → A*C* → A*CG*): a long insertion such as CG is an unbreakable fragment attached to a single common mortal link.

Fig. 4. The Flowchart of Our Model (A* → A*C* → A*C*G*). Each link can give birth to k mortal links with birth rate λr(1−r)^{k−1}, with λ > 0 and 0 < r < 1. Each newborn link can die independently with death rate μ > 0.
every character has its own mortal link in the long insertions, as illustrated in Figure 4. Thus, this model allows long insertions without considering unbreakable fragments. It is possible that a long fragment is inserted into the sequence first and that some of the inserted links die and some of them survive afterwards. A link gives birth to a block of k mortal links with rate λ_k, where

\lambda_k = \lambda r(1-r)^{k-1}, \qquad k = 1, 2, \ldots, \quad \lambda > 0, \quad 0 < r < 1   (5)

Only mortal links can die, with rate μ > 0.
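As a quick numerical check of (5) (illustrative only; the parameter values below are arbitrary), the block birth rates sum to λ and the mean inserted block length is 1/r:

    import numpy as np

    lam, r = 0.03, 0.6                        # illustrative values: lambda > 0, 0 < r < 1
    k = np.arange(1, 200)
    rates = lam * r * (1.0 - r) ** (k - 1)    # lambda_k of Eq. (5)

    print(rates.sum())                        # ~lambda: total birth rate per link
    print((k * rates).sum() / rates.sum())    # ~1/r: mean inserted block length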
Using \sum_{j=1}^{\infty}\lambda_j = \lambda and \sum_{j=1}^{n-1}(n-j)\lambda_j p_{n-j} = \sum_{k=1}^{n-1}k\,\lambda_{n-k}\,p_k, we have:

\frac{dp_n}{dt} = \lambda r\sum_{k=1}^{n-1}k(1-r)^{n-k-1}p_k + \mu n\,p_{n+1} - \big(\lambda n + \mu(n-1)\big)\,p_n   (7)

Due to the immortal link, we have \forall t,\ p_0(t) = 0. For n = 1, the sum in (7) is void. The initial conditions are given by:

p_n(0) = \delta_{n,1}   (8)
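A direct way to obtain the p_n(t), independent of the generating-function machinery that follows, is to integrate the master equation (7) numerically after truncating it at a maximal n. The sketch below is not part of the paper's algorithm; the parameter values and the truncation level N are illustrative assumptions, and it is mainly useful for checking the analytical results.

    import numpy as np
    from scipy.integrate import solve_ivp

    lam, mu, r, N = 0.03, 0.033, 0.6, 80     # illustrative parameters, truncation at N

    def rhs(t, p):
        dp = np.zeros_like(p)
        for n in range(1, N):
            birth_in = sum(lam * r * k * (1 - r) ** (n - k - 1) * p[k]
                           for k in range(1, n))
            death_in = mu * n * p[n + 1]
            dp[n] = birth_in + death_in - (lam * n + mu * (n - 1)) * p[n]
        return dp

    p0 = np.zeros(N + 1)
    p0[1] = 1.0                               # initial condition (8): p_n(0) = delta_{n,1}
    sol = solve_ivp(rhs, (0.0, 1.0), p0, rtol=1e-8)
    print(sol.y[1:6, -1])                     # p_1(t), ..., p_5(t) at t = 1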
Next, we introduce the generating function [9]:

P(\xi;t) = \sum_{n=0}^{\infty}\xi^n p_n(t)   (9)

Multiplying (7) by \xi^n, then summing over n, we obtain a linear PDE for the generating function:

\frac{\partial P}{\partial t} - \left(\mu - \frac{\lambda\xi}{1-(1-r)\xi}\right)(1-\xi)\,\frac{\partial P}{\partial\xi} = -\frac{\mu(1-\xi)}{\xi}\,P   (10)

with initial condition P(\xi;0) = \xi.
Solution to the PDE for the Generating Function: We use the method of Lagrange:

\frac{dt}{1} = \frac{d\xi}{-\left(\mu - \dfrac{\lambda\xi}{1-(1-r)\xi}\right)(1-\xi)} = \frac{\xi\,dP}{-\mu(1-\xi)\,P}   (11)
The two equalities define two one-parameter families of surfaces, namely v(t;\xi;P) and w(t;\xi;P). After integrating the first and the second equalities in (11), the following families of surfaces are obtained:

v(\xi;t) = \frac{(1-\xi)^r}{(\mu-a\xi)^{\lambda/a}}\;e^{(\lambda-\mu r)t} = c_1   (12)

w(\xi;t;P) = \frac{(\mu-a\xi)^{\lambda/a}}{\xi}\;P = c_2   (13)

with a \equiv \lambda + \mu(1-r) > 0. The general form of the solution is an arbitrary function w = g(v). This means:

P(\xi;t) = \frac{\xi}{(\mu-a\xi)^{\lambda/a}}\;g\!\left(\frac{(1-\xi)^r}{(\mu-a\xi)^{\lambda/a}}\,e^{(\lambda-\mu r)t}\right)   (14)
The initial condition P(\xi;0) = \xi fixes g via

g(f(\xi)) = (\mu - a\xi)^{\lambda/a},   (15)

where

f(x) = \frac{(1-x)^r}{(\mu-ax)^{\lambda/a}}   (16)

Thus the exact form of the generating function becomes:

P(\xi;t) = \xi\left(\frac{\mu - a\,f^{-1}\!\big(f(\xi)\,e^{(\lambda-\mu r)t}\big)}{\mu - a\xi}\right)^{\lambda/a}   (17)
The Probabilities for the Fate of the Mortal Links: The Master Equations for the probabilities p_n^{(1)}(t) and p_n^{(2)}(t) are given by

\frac{dp_n^{(1)}}{dt} = \sum_{j=1}^{n-1}(n-j)\lambda_j\,p_{n-j}^{(1)} + \mu n\,p_{n+1}^{(1)} - n(\lambda+\mu)\,p_n^{(1)}   (18)

\frac{dp_n^{(2)}}{dt} = \sum_{j=1}^{n-1}(n-j)\lambda_j\,p_{n-j}^{(2)} + \mu(n+1)\,p_{n+1}^{(2)} + \mu\,p_{n+1}^{(1)} - n(\lambda+\mu)\,p_n^{(2)}   (19)
for n \ge 0, with generating functions

P^{(1)}(\xi;t) = \sum_{n=0}^{\infty}\xi^n p_n^{(1)}(t), \qquad P^{(2)}(\xi;t) = \sum_{n=0}^{\infty}\xi^n p_n^{(2)}(t)   (20)

and initial conditions

p_n^{(1)}(0) = \delta_{n,1}, \qquad p_n^{(2)}(0) = 0   (21)

Multiplying (18) and (19) by \xi^n and summing over n gives

\frac{\partial P^{(1)}}{\partial t} - \left(\mu - \frac{\lambda\xi}{1-(1-r)\xi}\right)(1-\xi)\,\frac{\partial P^{(1)}}{\partial\xi} = -\frac{\mu}{\xi}\,P^{(1)}   (22)

\frac{\partial P^{(2)}}{\partial t} - \left(\mu - \frac{\lambda\xi}{1-(1-r)\xi}\right)(1-\xi)\,\frac{\partial P^{(2)}}{\partial\xi} = \frac{\mu}{\xi}\,P^{(1)}   (23)
Solution to the PDEs for the Generating Functions of the Mortal Links: First, we solve (22) using the method of Lagrange:

\frac{dt}{1} = \frac{d\xi}{-\left(\mu - \dfrac{\lambda\xi}{1-(1-r)\xi}\right)(1-\xi)} = \frac{\xi\,dP^{(1)}}{-\mu\,P^{(1)}}   (24)
The two one-parameter families of surfaces are v(t;\xi;P^{(1)}) and w(t;\xi;P^{(1)}). Since v comes from the integration of the first equality in (24), it is the same as (12). Integrating the second equality yields:

w(\xi;t;P^{(1)}) = \frac{(\mu-a\xi)^{\lambda/(\lambda-\mu r)}}{\xi\,(1-\xi)^{\mu r/(\lambda-\mu r)}}\;P^{(1)} = c_2   (25)
Proceeding as in the previous section, we have:

P^{(1)}(\xi;t) = \xi\left(\frac{1-\xi}{1-f^{-1}\!\big(f(\xi)\,e^{(\lambda-\mu r)t}\big)}\right)^{\mu r/(\lambda-\mu r)}\left(\frac{\mu - a\,f^{-1}\!\big(f(\xi)\,e^{(\lambda-\mu r)t}\big)}{\mu - a\xi}\right)^{\lambda/(\lambda-\mu r)}   (26)
with f given by (16). To calculate P^{(2)}(\xi;t), we first define Q(\xi;t) = P^{(1)}(\xi;t) + P^{(2)}(\xi;t). Summing (22) and (23), the following equation is obtained for Q:

\frac{\partial Q}{\partial t} - \left(\mu - \frac{\lambda\xi}{1-(1-r)\xi}\right)(1-\xi)\,\frac{\partial Q}{\partial\xi} = 0   (27)
This is again easily solved with the method of characteristics. First, we integrate the characteristic equation, which is the first equation in (24), to obtain the family of characteristic curves, given by v(\xi;t) = c_1 as in (12). Thus, Q(\xi;t) = g(v) is the general solution, where g(x) is an arbitrary, differentiable function, to be set by the initial conditions. Using (20) and (21), we have Q(\xi;0) = \xi. This leads to:

Q(\xi;t) = f^{-1}\!\big(f(\xi)\,e^{(\lambda-\mu r)t}\big)   (28)

with f given by (16), and therefore:

P^{(2)}(\xi;t) = f^{-1}\!\big(f(\xi)\,e^{(\lambda-\mu r)t}\big) - P^{(1)}(\xi;t)   (29)

with P^{(1)}(\xi;t) given by (26).
3 The Algorithm
3.1 Calculating the Transition Probabilities
Unfortunately, the inverse of f given by (16) does not have a closed form. Thus a numerical approach is needed for calculating the transition probability functions p_n(t), p_n^{(1)}(t), and p_n^{(2)}(t). We calculate the generating functions P(\xi;t), P^{(1)}(\xi;t) and P^{(2)}(\xi;t) in l_1 + 1 points around \xi = 0, where l_1 is the length of the shorter sequence. For doing this, the following equation must be solved for x numerically, where \lambda, \mu, \xi, r, t, and a are given:

f(\xi)\,e^{(\lambda-\mu r)t} = \frac{(1-x)^r}{(\mu-ax)^{\lambda/a}}   (33)
Given the l_1 + 1 points, the functions are numerically differentiated l_1 times with respect to \xi. After this,

p_n(t) = \frac{1}{n!}\,\frac{\partial^n P(\xi,t)}{\partial\xi^n}\bigg|_{\xi=0}   (34)

and similarly for p_n^{(1)}(t) and for p_n^{(2)}(t). Thus, the transition probability functions can be calculated in O(l^2) time.
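The numerical scheme just described can be sketched as follows. The parameter values, the bracketing interval of the root finder, and the use of a low-order polynomial fit as a crude stand-in for the repeated numerical differentiation are all illustrative assumptions, not prescriptions from the paper; the formulas for f and P are those of (16) and (17).

    import numpy as np
    from scipy.optimize import brentq

    lam, mu, r, t = 0.02, 0.03, 0.5, 1.0      # illustrative parameters
    a = lam + mu * (1.0 - r)

    def f(x):                                  # Eq. (16); real-valued for 0 <= x < mu/a
        return (1.0 - x) ** r / (mu - a * x) ** (lam / a)

    def f_inv(y):                              # numerical inverse of f, Eq. (33)
        return brentq(lambda x: f(x) - y, 0.0, mu / a - 1e-9)

    def P(xi):                                 # generating function, Eq. (17)
        eta = f_inv(f(xi) * np.exp((lam - mu * r) * t))
        return xi * ((mu - a * eta) / (mu - a * xi)) ** (lam / a)

    # Taylor coefficients of P around xi = 0 approximate p_0(t), p_1(t), ...
    n_max = 4
    grid = np.linspace(0.0, 0.4, n_max + 1)
    coeffs = np.polynomial.polynomial.polyfit(grid, [P(x) for x in grid], n_max)
    print(coeffs[:4])                          # rough approximations of p_0(t), ..., p_3(t)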
where a_i is the i-th character of A and l(A) is the length of the sequence.
Let A_i denote the i-long prefix of A and let B_j denote the j-long prefix of B. There is a dynamic programming algorithm for calculating the transition probabilities P_t(A_i \mid B_j). The initial conditions are given by:

P_t(A_0 \mid B_j) = p_{j+1}(t)\prod_{k=1}^{j}\pi(b_k)   (36)

To save computation time, we calculate \prod_{k=l}^{j}\pi(b_k) for every l < j before the recursion.
Then the recursion follows:

P_t(A_i \mid B_j) = \sum_{l=0}^{j} P_t(A_{i-1}\mid B_l)\; p_{j-l}^{(2)}(t)\prod_{k=l+1}^{j}\pi(b_k) \;+\; \sum_{l=0}^{j-1} P_t(A_{i-1}\mid B_l)\; p_{j-l}^{(1)}(t)\, f_{a_i b_{l+1}}(t)\prod_{k=l+2}^{j}\pi(b_k)   (37)

The dynamic programming is the most time-consuming part of the algorithm; it takes O(l^3) running time.
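A direct transcription of the recursion (36)-(37) is sketched below. The function and argument names are hypothetical; it assumes that the transition probability vectors p_n(t), p_n^{(1)}(t), p_n^{(2)}(t), the substitution probabilities and the character frequencies have already been computed (for instance as above) and are long enough for the given sequences.

    import numpy as np

    def joint_transition(A, B, p_imm, p1, p2, f_sub, pi):
        """P_t(A_i | B_j) via the recursion (36)-(37).

        A, B      : sequences as integer numpy arrays (character codes)
        p_imm[n]  : p_n(t), fate of the immortal link
        p1[n]     : p_n^(1)(t), surviving mortal link with n descendants
        p2[n]     : p_n^(2)(t), dead mortal link with n descendants
        f_sub[x,y]: substitution probability x -> y over the branch
        pi[x]     : equilibrium frequency of character x
        """
        la, lb = len(A), len(B)
        P = np.zeros((la + 1, lb + 1))
        for j in range(lb + 1):
            # initial condition (36): everything in B_j descends from the immortal link
            P[0, j] = p_imm[j + 1] * np.prod(pi[B[:j]])
        for i in range(1, la + 1):
            for j in range(lb + 1):
                total = 0.0
                for l in range(j + 1):      # ancestral character a_i died, first sum in (37)
                    total += P[i - 1, l] * p2[j - l] * np.prod(pi[B[l:j]])
                for l in range(j):          # a_i survived as b_{l+1}, second sum in (37)
                    total += (P[i - 1, l] * p1[j - l]
                              * f_sub[A[i - 1], B[l]] * np.prod(pi[B[l + 1:j]]))
                P[i, j] = total
        return P[la, lb]

Precomputing the partial products of pi(b_k), as suggested in the text, removes the repeated np.prod calls; the cubic running time comes from the two sequence indices plus the inner sum over l.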
As mentioned earlier, the substitution process is described by only one parameter, st. (A general phenomenon is that the time and rate parameters cannot be estimated individually, only their product.) The insertion-deletion model is described by three parameters, \lambda t, \mu t, and r, which, however, can be reduced to two if the following equation is taken into consideration:

\frac{\lambda}{\mu r - \lambda} = \frac{l(A) + l(B)}{2}   (38)

namely, the mean of the sequence lengths is the maximum likelihood estimator for the expected value of the length distribution.
The maximum likelihood values of the three remaining parameters can be obtained using one of the well-known numerical methods (gradient method, etc.).
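One possible way to carry out this numerical optimization is sketched below; it is only an illustration. The log-likelihood function is left as a user-supplied callable (for instance built from the dynamic programming above), the starting point and bounds are arbitrary choices, and the length constraint (38) is used to eliminate \lambda t.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(params, seq_a, seq_b, loglik_fn, mean_len):
        """params = (s*t, mu*t, r); lambda*t follows from the length constraint (38):
        lambda/(mu*r - lambda) = L  =>  lambda = mu*r*L/(1+L)."""
        st, mut, r = params
        lamt = mut * r * mean_len / (1.0 + mean_len)
        return -loglik_fn(seq_a, seq_b, st, lamt, mut, r)

    def fit(seq_a, seq_b, loglik_fn):
        mean_len = 0.5 * (len(seq_a) + len(seq_b))
        x0 = np.array([0.1, 0.05, 0.5])                      # illustrative starting point
        bounds = [(1e-6, None), (1e-6, None), (1e-6, 1.0 - 1e-6)]
        return minimize(neg_log_likelihood, x0,
                        args=(seq_a, seq_b, loglik_fn, mean_len),
                        bounds=bounds, method="L-BFGS-B")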
There is an increasing desire for statistical methods of sequence analysis in the bioinformatics community. Statistical alignment provides sensitive homology testing [5], which is better than the traditional, similarity-based methods [10]. The summation over the possible alignments leads to good evolutionary parameter estimation [3], while parameter estimation from a single alignment is doubtful [3,11].
Methods based on evolutionary models integrate multiple alignment and evolutionary tree reconstruction. The generalization of the Thorne-Kishino-Felsenstein model to an arbitrary number of sequences is straightforward [12,13]. A novel approach is to treat the evolutionary models as HMMs. The TKF model fits into the concept of pair-HMMs [14]. Similarly, the generalization to n sequences can be handled as a multiple-HMM. Following this approach, one can sample alignments related to a tree, providing an objective approximation to the multiple alignment problem [15]. Sampling pairwise alignments and evolutionary parameters allows further investigation of the evolutionary process [16].
The weak point of the statistical approach is the lack of an appropriate evolutionary model. A new model and an associated algorithm for computing the joint probability were introduced here. This new model is superior to the Thorne-Kishino-Felsenstein model: it allows long insertions without considering unbreakable fragments. However, it is only a small inch toward reality, as it still has at least two unrealistic properties. It cannot deal with long deletions, and the rates for the long insertions form a geometric series. The elimination of both of these problems seems to be rather difficult but not impossible. Other rate functions for long insertions lead to more difficult PDEs whose characteristic equations may not be integrable without a rather involved computational overhead. The same situation appears when long deletions are allowed. Moreover, in this case calculating only the fates of the individual links is not sufficient. Thus, for achieving more appropriate models, numerical calculations are needed at an earlier stage of the procedure. Nevertheless, we hope that the generating function approach will open some novel avenues for further research.
Acknowledgments
We thank Carsten Wiuf and the anonymous referees for useful discussions and sugges-
tions. Z.T. was supported by the DOE under contract W-7405-ENG-36.
References
1. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48 (1970), 443–453.
2. Bishop, M.J., Thompson, E.A.: Maximum likelihood alignment of DNA sequences. J. Mol. Biol. 190 (1986), 159–165.
3. Thorne, J.L., Kishino, H., Felsenstein, J.: An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33 (1991), 114–124.
4. Thorne, J.L., Kishino, H., Felsenstein, J.: Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol. 34 (1992), 3–16.
5. Hein, J., Wiuf, C., Knudsen, B., Møller, M.B., Wibling, G.: Statistical alignment: computational properties, homology testing and goodness-of-fit. J. Mol. Biol. 302 (2000), 265–279.
6. Miklos, I.: Irreversible likelihood models. European Mathematical Genetics Meeting, 20–21 April 2001, Lille, France.
7. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model for evolutionary change in proteins, matrices for detecting distant relationships. In: Dayhoff, M.O. (ed.): Atlas of Protein Sequence and Structure, Vol. 5. Cambridge University Press, Washington DC (1978), 343–352.
8. Tavaré, S.: Some probabilistic and statistical problems in the analysis of DNA sequences. Lec. Math. Life Sci. 17 (1986), 57–86.
9. Feller, W.: An Introduction to Probability Theory and Its Applications, Vol. 1. McGraw-Hill, New York (1968), 264–269.
10. Altschul, S.F.: A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 36 (1993), 290–300.
11. Fleissner, R., Metzler, D., von Haeseler, A.: Can one estimate distances from pairwise sequence alignments? In: Bornberg-Bauer, E., Rost, U., Stoye, J., Vingron, M. (eds.): GCB 2000, Proceedings of the German Conference on Bioinformatics, Heidelberg (2000), Logos Verlag, Berlin, 89–95.
12. Hein, J.: Algorithm for statistical alignment of sequences related by a binary tree. In: Altman, R.B., Dunker, A.K., Hunter, L., Lauderdale, K., Klein, T.E. (eds.): Pacific Symposium on Biocomputing, World Scientific, Singapore (2001), 179–190.
13. Hein, J., Jensen, J.L., Pedersen, C.S.N.: Algorithm for statistical multiple alignment. Bioinformatics 2001, Skövde, Sweden.
14. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998).
15. Holmes, I., Bruno, W.J.: Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics (2001), accepted.
16. http://www.math.uni-frankfurt.de/stoch/software/mcmcalgn/
Improving Profile-Profile Alignments
via Log Average Scoring
Niklas von Ohsen and Ralf Zimmer
1 Introduction
The use of alignment algorithms for establishing protein homology relationships has a long tradition in the field of bioinformatics. When first developed, these algorithms
aimed at assessing the homology of two protein sequences and at constructing their best
mapping onto each other in terms of homology. By extending these algorithms to align
sequences of amino acids not only to their counterparts but to frequency profiles, which
was first proposed by Gribskov [10], it became feasible to analyse the relationship of a
single protein with a whole family of proteins described by the frequency profile. Based
on this idea the PSI-Blast program [2] was developed, which is among the best known and most heavily used tools in computational biology. Recently, a further abstraction has proven to be of considerable use in protein structure prediction. In the CAFASP2 contest of fully automated protein structure prediction the group of Rychlewski et al. reached the second rank using a profile-profile alignment method called FFAS [18].
The notion of alignment is thus extended to provide a mapping between two protein
families represented by their frequency profiles. Rychlewski et al. used the dot product
to calculate the alignment score for a pair of profile vectors. In this paper we present a
new approach which allows to choose an amino acid substitution model like the BLO-
SUM model [12] and leads to a score that not only increases the ability to judge the
relatedness of two proteins by the alignment score but also has a meaning in terms of
the underlying substitution model.
We start by introducing the definition of profiles and subsequently discuss the three
candidate methods for scoring profile vectors against each other. In the second part
the fold recognition experiments we performed are described and discussed. In the ap-
pendix further technical information on the benchmarks can be found.
2 Theory
Profiles are used to represent a set of related proteins by a statistical model that does not increase in size even when the set of proteins gets large. This is done by mak-
ing a multiple alignment of all the sequences in the set and then counting the relative
frequencies of occurrence of each amino acid in each position of the multiple align-
ment. Usually it is assumed that the underlying set of proteins is not known completely,
but that we have a small subset of representatives, for instance from a homology search
over a database. Extensive work has been done on the issue of estimating the real fre-
quencies of the full set from the sample retrieved by the homology search. Any of these
methods like pseudo counts [20], Dirichlet mixture models [6], minimal-risk estimation
[24], sequence weighting methods [13], [16], [19] may be used to preprocess the sample
to get the best estimation of the frequencies before one of the following scoring meth-
ods is applied. In any case, the construction will yield a vector of probability vectors, which in our setting are of dimension 20 (one for each amino acid). These probabilities are positive real numbers that sum up to one and stand for the probability of seeing a certain amino acid in this position of a multiple alignment of all family members. This sequence of vectors will be called a frequency profile, or profile, throughout the paper. The gaps occurring in the multiple alignments are not accounted for in our models; therefore the frequency vectors may have to be scaled up to reach a total probability sum of one. All of the profile-to-profile scoring methods introduced will be defined by a formula which gives the corresponding score depending on two probability vectors (or profile positions) named \alpha and \beta.
likelihood that the alignment occurs between unrelated sequences. The notion of re-
latedness is defined here by the employed substitution model, which incorporates two
probability distributions describing each case. The first distribution, called null model,
describes the average case in which the two positions are each distributed like the amino
acid background and are unrelated, yielding P(X = i, Y = j) = p_i\,p_j. Here p_k stands
for the probability of seeing an amino acid k when randomly picking one amino acid
from an amino acid sequence database. The probability of seeing a pair of amino acids
in a related pair of sequences in corresponding positions has been estimated by some
authors using different methods. M. Dayhoff derived the distribution from observations
of single point mutations resulting in the series of PAM Matrices [ 8]. In the case of the
BLOSUM matrix series [12] the distribution is derived from blocks of multiply aligned
sequences, which are clustered up to a certain amount of sequence identity. We intro-
duce an event called related for the case that the values of X and Y are related amino
acids, and call the probability distribution P(X = i, Y = j \mid \mathrm{related}) = p_{rel}(i,j). Using this, we obtain the formula for the log odds score (log always standing for the natural logarithm)

M(i,j) = \log\frac{p_{rel}(i,j)}{p_i\,p_j}   (1)

which gives the values stored in the substitution matrices, except for a constant factor, which is 10/\log 10 in the Dayhoff models and 2/\log 2 in the BLOSUM matrices.
Using Bayes' formula we get an interpretation of the likelihood ratio term defining the log odds alignment score:

P(\mathrm{related}\mid X=i, Y=j) = P(\mathrm{related})\,\frac{P(X=i, Y=j\mid \mathrm{related})}{P(X=i, Y=j)}   (2)

= P(\mathrm{related})\,\frac{p_{rel}(i,j)}{p_i\,p_j}   (3)
This means that except for the prior probability P (related), which is a constant, the
usual sequence-sequence alignment score is the log of the probability that the two amino
acids come from related positions, given the data.
If different positions are assumed to be independent of each other, the log odds
score summed up over all aligned amino acid pairs is the log of the probability that
the alignment occurs in case the sequences are related divided by the probability that
the alignment occurs between unrelated sequences. It is therefore in a certain statis-
tical sense the best means to decide whether the alignment maps related or unrelated
sequences onto each other (Neyman-Pearson lemma, e.g. [ 23]). This quantity will be
maximised by the dynamic programming approach yielding the alignment that max-
imises the likelihood ratio in favour of the related hypothesis. The gap parameters
add penalties to this log-likelihood-ratio score, which indicate that the more gaps an
alignment has, the more likely it is to occur between unrelated sequences rather than
between related sequences.
tion of amino acids is calculated by taking the expected value of the sequence-sequence
score under the profile distribution while keeping the amino acid from the sequence
fixed. This can be extended to profile-profile alignments in a straightforward fashion
and has been used in ClustalW [22]. There, two multiple alignments are aligned using
as score an average over all pairwise scores between residues, which is equivalent to
the average scoring approach as used here. The formula which we obtain this way is the
following:
\mathrm{score}_{\mathrm{average}}(\alpha,\beta) = \sum_{i=1}^{20}\sum_{j=1}^{20}\alpha_i\beta_j\,\log\frac{p_{rel}(i,j)}{p_i\,p_j}   (4)
It can easily be shown that this score has an interpretation: let N be a large integer and let us take a sample of size N from the two profile positions (each sample being a pair of amino acids, the distribution being (\alpha_i\beta_j)_{i,j\in\{1,\ldots,20\}}). Then this score divides the likelihood that the related distribution produced the sample by the likelihood that the unrelated distribution produced the sample, takes the log, and divides this by N. The average score summed up over all aligned profile positions thus has the following meaning: if we draw for each aligned pair of profile positions a sample of size N which happens to show N\alpha_i\beta_j times the amino acid pair (i, j), then the summed up average score is the best means to decide whether this happens under the related or the unrelated model.
The problem with this approach is that this is not the question we are asking. The
two distributions (related and unrelated) that are suggested as only options are both
known to be inappropriate since their marginal distributions (the distributions that are
obtained by fixing the first letter and allowing the second to take any value and vice
versa) are the background distribution of amino acids by the definition of the substitu-
tion model. The appropriate setting for this model describes a situation in which each
profile position would in fact be occupied by a completely random amino acid (de-
termined by the background probability distribution) meaning that, if we drew more
and more amino acids from a position, then the observed distribution would have to
converge to the background amino acid distribution. This is not compatible with the
meaning usually associated with a profile vector which is thought of being itself the
limiting distribution to which the distribution of such a sample should converge.
Another drawback to this method is the fact that the special case of this formula,
when one of the profiles degenerates to a single sequence (at each position a probability
distribution which has probability one for a single amino acid), does not have the expected behaviour of a good scoring system. This will be shown in the following section, where
we will extend the commonly used sequence-sequence score in a first step to the profile-
sequence setting such that a strict statistical interpretation of the score is at hand and
then further to the profile-profile setting which will be evaluated further on.
by

\mathrm{score}(\alpha, j) = \log\frac{\alpha_j}{p_j}   (5)
yields an optimal test statistic by which to decide whether the amino acid j is a sample from the distribution \alpha or rather from the background distribution p. These values
summed up over all aligned positions therefore give a direct measure of how likely it is
that the amino acid sequence is a sample from the profile rather than being random. If
for a protein family only the corresponding profile is known, calculating this score is an
optimal way to decide whether an amino acid sequence is from this family or not. This
is a rather limited question to ask if we want to explore distant relationships. Therefore,
in our setting it is of interest whether the sequence is somehow evolutionary related to
the family characterised by the profile or not.
Evolutionary Profile-Sequence Scoring. One method for evaluating this in the profile-sequence case is the evolutionary profile method [11,7], which only makes use of the evolutionary model underlying the amino acid similarity matrices. The values P(i \to j) = p_{rel}(i,j)/p_i can, due to the construction of the similarity matrix, be interpreted as the transition probabilities for a probabilistic transition (mutation) of the amino acid i to j. From this point of view the value M(i,j) from (1) can be written as M(i,j) = \log\frac{P(i\to j)}{p_j}, which can be read as the likelihood ratio of amino acid j having occurred by transition from amino acid i against j occurring just by chance. This can be extended to the profile-sequence case, where i is replaced with the profile vector \alpha, letting the same probabilistic transition take place on a random amino acid with distribution \alpha instead of on the fixed amino acid i. The resulting probability of j occurring by transition from an amino acid distributed like \alpha is given by

\sum_{i=1}^{20}\alpha_i\,P(i\to j) = \sum_{i=1}^{20}\alpha_i\,\frac{p_{rel}(i,j)}{p_i}   (6)
This score summed up over all aligned positions in an alignment of a profile against
a sequence is therefore an optimal means by which to decide whether the sequence is
more likely the result of sampling from the profile which has undergone the probabilis-
tic evolutionary transition or whether the sequence occurs just by chance (optimality in
a statistical sense).
It is apparent that formula (7) is not a special case of the earlier introduced average scoring (4). This is a drawback for the average scoring approach, since it fails to yield an intuitively correct result in a simple example: if the profile position is distributed like the amino acid background distribution, i.e. \alpha_i = p_i for all i, we would expect that we have no information available on which to decide whether an amino acid j is related with the profile position or not. Thus it is a desirable property of a scoring system that
any amino acid j should yield zero when scored against the background distribution.
This is the case for the evolutionary profile-sequence score but is not the case for the
average score where we receive (with p being the background distribution and e j being
the j-th unit vector)
score(, j) = scoreaverage (p, ej ) = pi M (i, j)
i=1,...,20
which is never positive due to Jensens inequality (see e.g. [ 4]) and will always be neg-
ative for the amino acid background distribution commonly observed. Thus the average
score would propose that we have evidence against the hypothesis that the profile po-
sition and the amino acid are related, which seems questionable. This is the motivation
to look for a generalisation of the evolutionary sequence-profile scoring scheme to the
profile-profile case. The results are explained in the following section which introduces
the new scoring function proposed in this paper.
Let again (X, Y) be a pair of random variables with values in \{1, \ldots, 20\} which represent positions in profiles for which the question whether they are related is to be answered. Since the goal here is to score profile positions against profile positions, we have to incorporate into our model the fact that the special X and Y we are observing have the amino acid distributions (\alpha_i)_{i=1,\ldots,20} and (\beta_j)_{j=1,\ldots,20}, respectively. This is done by introducing an event E which has the following property:

P(X = i, Y = j \mid E) = \alpha_i\beta_j   (8)

Since a substitution model that directly addresses the case E with its special distributions \alpha and \beta is not available for the calculation of the last factor, we use the standard model (see equation (3)) as an approximation instead and exploit the knowledge of the amino acid distributions (see (8)) at the current profile positions for the first factor:

\sum_{i=1}^{20}\sum_{j=1}^{20} P(X=i, Y=j \mid E)\,P(\mathrm{related}\mid X=i, Y=j)   (11)

= P(\mathrm{related})\sum_{i=1}^{20}\sum_{j=1}^{20}\alpha_i\beta_j\,\frac{p_{rel}(i,j)}{p_i\,p_j}   (12)
If the prior probability is set to 1 and the log is taken as in the usual sequence-sequence score, we obtain the following formula for the log average score:

\mathrm{score}_{\mathrm{log\,average}}(\alpha,\beta) = \log\sum_{i=1}^{20}\sum_{j=1}^{20}\alpha_i\beta_j\,\frac{p_{rel}(i,j)}{p_i\,p_j}   (13)
It is interesting to note that the only difference between this formula and the average
score is the exchanged order of the log and the sums. As can be seen this formula is
an extension of the evolutionary profile score for the profile-sequence case with the
advantages discussed above. If these scoring terms are summed up over all aligned
positions in a profile-profile alignment the resulting alignment score is thus the log of
the probability that the profiles are related under the substitution model given the data
they provide (except for the prior).
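For concreteness, the three profile-column scores discussed so far (dot product, average score (4) and log average score (13)) can be computed side by side as follows. The substitution model used here is a random placeholder, not BLOSUM; only the formulas themselves are taken from the text.

    import numpy as np

    def profile_scores(alpha, beta, p_rel, p_bg):
        """alpha, beta : length-20 profile columns (each sums to 1)
        p_rel[i, j]    : joint probability of (i, j) under the related model
        p_bg[i]        : background amino acid frequencies"""
        odds = p_rel / np.outer(p_bg, p_bg)               # p_rel(i,j) / (p_i * p_j)
        dot = float(alpha @ beta)                          # dot product score
        average = float(alpha @ np.log(odds) @ beta)       # Eq. (4)
        log_average = float(np.log(alpha @ odds @ beta))   # Eq. (13)
        return dot, average, log_average

    # tiny self-check with a made-up substitution model (illustrative only)
    rng = np.random.default_rng(0)
    p_bg = rng.dirichlet(np.ones(20))
    p_rel = np.outer(p_bg, p_bg) + 0.001 * np.diag(p_bg)   # slightly "related" diagonal
    p_rel /= p_rel.sum()
    a, b = rng.dirichlet(np.ones(20)), rng.dirichlet(np.ones(20))
    print(profile_scores(a, b, p_rel, p_bg))
    print(profile_scores(p_bg, np.eye(20)[3], p_rel, p_bg))  # background column vs. unit vector

The last line illustrates the point made above: when one profile column equals the background distribution, the average score is pushed below zero by Jensen's inequality, while the log average score stays close to zero.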
3 Evaluation
In order to evaluate whether the different scores are a good measure of the relatedness
of two profiles, we performed fold recognition and related pair recognition benchmarks.
Additionally, we investigated how a confidence measure for the protein fold prediction
depending on the underlying scoring system performed on the benchmark set of pro-
teins.
The experiments were carried out using a protein sequence set which consists of 1511
chains from a subset of the PDB with a maximum of 40% pairwise sequence identity
(see [5]). The composition of the test set in terms of relationships on different SCOP
levels is shown in figure 1. Throughout the experiments the SCOP version 1.50 is used
[17].
Note that there are 34 proteins in the set which are the only representatives of their
SCOP fold in the test set. They were deliberately left in the test set even though it is not
possible to recognise their correct fold class because this way the results resemble the
numbers in the application case of a query with unknown structure.
For all sequences a structure of the same SCOP class can be found in the benchmark set; there are 34 chains in the set without a corresponding fold representative (i.e. single members of their fold class in the benchmark), while SCOP superfamily and SCOP family representatives can be found for 1360 and 1113 sequences of the benchmark set, respectively.
Only chains contributing to a single domain according to the SCOP database were
used in order to allow for a one-to-one mapping of the chains to their SCOP classifi-
cation. For each chain a frequency profile representing a set of possibly homologous
sequences was constructed based on PSI-Blast searches on a non redundant sequence
database following a procedure described in the appendix.
Fig. 1. Composition of the Test Set. Left: Number of proteins for which the test set con-
tains a member of the indicated SCOP level. Right: Number of proteins whose closest
relative (in terms of SCOP level) in the test set belongs to the indicated SCOP level.
This is a partition of the test set in terms of fold recognition difficulty; ranging from
SCOP family being the easiest to SCOP class being impossible.
For each examined scoring approach we then used a Java implementation of the Gotoh global alignment algorithm [9] to align a query profile against each of the remaining 1510 profiles in the test set. For a query sequence of length 150, about 6 alignments per second can be computed on a 400 MHz Ultra Sparc 10 workstation.
It should be noted that for the case of fold recognition, where one profile is subsequently aligned against a whole database of profiles, a significant speedup can be achieved by preprocessing the query profile and calculating the vector

\left(\sum_{i=1}^{20}\alpha_i\,\frac{p_{rel}(i,j)}{p_i\,p_j}\right)_{j=1,\ldots,20}

for every query profile position \alpha in advance; the evaluation of each log average score then reduces to one scalar product and one logarithm. This can be done in a similar manner with the average scoring approach
average scoring approach where the complexity reduces to only the scalar product. The
running time of the algorithm could be reduced by a factor of more than 6 using this
technique.
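The preprocessing idea can be written down in a few lines; the array shapes and names below are assumptions for illustration only.

    import numpy as np

    def preprocess_query(query_profile, p_rel, p_bg):
        """query_profile: (L, 20) array of profile columns.
        Returns an (L, 20) array g with g[x, j] = sum_i alpha_x[i] * p_rel(i,j)/(p_i*p_j)."""
        odds = p_rel / np.outer(p_bg, p_bg)
        return query_profile @ odds

    def log_average_column_score(g_row, beta):
        # one scalar product and one logarithm per pair of profile positions
        return np.log(g_row @ beta)

Precomputing g once per query spreads the cost of the odds-ratio weighting over all database profiles, which is in the spirit of the factor-of-six speedup reported in the text.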
The appropriate gap penalties were determined separately for every scoring method
using a machine learning approach (see appendix, [ 25]) and are shown in table 1.
Throughout the experiments shown here we used the BLOSUM 62 substitution model [12]. The average scoring alignments were calculated using the values from the BLOSUM 62 scoring matrix and thus contain the above mentioned scaling factor of f = 2/\log 2. To keep the results comparable we also applied this factor to the log average score. Therefore, the gap penalties for the log average score in table 1 must be divided by f if the score is calculated exactly as in formula (13).
3.4 Results
For each of the three profile scoring systems discussed in section 2 the following tests were performed using the constructed frequency profiles. In order to assess the superiority of the profile methods over simple sequence methods we also performed the tests for plain sequence-sequence alignment on the original chains, using the BLOSUM 62 substitution matrix and the same gap penalties as for the log average scoring.
Fold Recognition. The goal here is to identify the SCOP fold to which the query pro-
tein belongs by examining the alignment scores of all 1510 alignments of the query
profile against the other profiles. The scores are sorted in a list together with the name
of the protein which produced the score and the fold prediction is the SCOP fold of the
highest scoring protein in the list. Since all the proteins in the list are aligned against
the same query and the scores are compared, a possible systematic bias of the score by
special features of the query sequence is not relevant for this test (e.g. length dependence). The test was performed following a leave-one-out procedure, i.e. for each of the 1511 proteins the fold was predicted using the list of alignments against the 1510
other profiles. The fold recognition rate is then defined as the percentage of all proteins
for which the fold prediction yielded a correct result.
Out of the 1511 test sequences log average scoring is able to assign correct folds for
1181 cases or 78.1%, whereas the usual average scoring correctly predicts 1097 (72.6%)
and dot product scoring 1024 (67.7%) sequences, both improving on simple sequence-
sequence alignment with 969 (64.1%) correct assignments. This improvement becomes
more distinctive for more difficult cases towards the twilight of detectable sequence
similarity. Figure 3 shows the fold recognition rates for family, superfamily, fold pre-
dictions separately. Here, all four methods perform well for the easiest case, family
recognition, with 81.2% for sequence alignment performing worst and log average pro-
file scoring with 91.5% performing best. For the hardest case of fold detection, log
average scoring (24.8%) significantly outperforms (at least 50% improvement) both
other profile methods (11.1% and 16.2%), whereas sequence alignment hardly is able
to make correct predictions (6.8%). However, the effect of performance improvement is
most marked for the superfamily level, where some remote evolutionary relationships
should, by definition, be detectable via sensitive sequence methods. Here, the new scor-
ing scheme again achieves a 50% improvement over the second best method (average profile scoring), thereby increasing the recognition rate from 36.8% to 54.3%. This
almost doubles the recognition rate of simple sequence alignment (23.0%).
A more detailed look at the fold recognition results can be achieved by using confidence measures which assess the quality of the fold prediction a priori. Here we use the z-score gap, which is defined as follows. First the mean and standard deviation of the scores in the list are calculated and the raw scores are transformed into z-scores with respect to the determined normal distribution, i.e. the following formula is applied:

z\text{-}score = \frac{score - mean}{standard\ deviation}
Then the difference of the z-score between the top scoring protein and the next best
belonging to a SCOP fold different from the predicted one is calculated yielding the z-
score gap. A list L which contains all 1511 fold predictions together with their z-score
gap is set up and sorted with respect to the z-score gap. Entries l L which represent
correct fold predictions are termed positives, others negatives. If i is an index in this
list, figure 4 shows the percentage of correct fold predictions if only the top i entries of
the list are predicted. It also demonstrates a clear improvement of fold prediction sensi-
1.0
0.4
0.2
0.0
Fig. 4. Fold Recognition Ranked with Respect to the z-score Gap (See Text).
tivity and specificity for the log average scoring as compared to the competing scoring
schemes. Again, all profile methods perform better than pure sequence alignment, but
dot product only shows a slight improvement.
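A sketch of the z-score gap computation for a single query, under the definition just given; the function and argument names are hypothetical.

    import numpy as np

    def z_score_gap(scores, folds):
        """scores: alignment scores of one query against the database;
        folds: SCOP fold label of each database protein."""
        scores = np.asarray(scores, dtype=float)
        z = (scores - scores.mean()) / scores.std()
        top = int(np.argmax(z))
        predicted_fold = folds[top]
        # next best hit belonging to a different SCOP fold than the predicted one
        other = [z[i] for i in range(len(z)) if folds[i] != predicted_fold]
        return z[top] - max(other), predicted_fold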
Related Pair Recognition. This protocol aims at a slightly different question. The goal
is to decide whether two proteins have the same SCOP fold by only looking at the score
of their profile alignment. Therefore, a good performance in this test means that the
scoring system is a good absolute measure of similarity between the sequences. Length
dependency and other systematic biases will decrease the performance of a scoring
system here.
The calculations done here also rely on the 1511 lists calculated in the fold recog-
nition setting. These are merged into one large list following two different procedures:
z-scores: Before merging, the mean and standard deviation for each of the lists are
calculated and the raw scores are transformed into z-scores as in (3.4). This setting is related to the fold recognition setting, since biases introduced by the query profile should be removed by the rescaling.
raw scores: No transformation is applied.
The resulting list L contains in each entry l \in L a score score(l) and the two proteins whose alignment produced the score. An entry l \in L will be called positive if the two proteins have the same SCOP fold and negative if not. The list of 1511 \times 1510 = 2\,281\,610 entries is then sorted with respect to the alignment score, and for all scores s in the list specificity and sensitivity are calculated from the following formulas:
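For illustration we assume the usual convention here, in which sensitivity at a score threshold is the fraction of all truly related pairs ranked above the threshold and specificity is the fraction of correct pairs among all pairs above the threshold; whether this matches the exact formulas used by the authors is an assumption. A threshold sweep can then be computed as:

    import numpy as np

    def sensitivity_specificity(scores, is_positive):
        """Sweep thresholds over the sorted scores; assumes (for illustration)
        sensitivity = TP / P and specificity = TP / (TP + FP) above the threshold."""
        order = np.argsort(scores)[::-1]
        pos = np.asarray(is_positive)[order]
        tp = np.cumsum(pos)
        predicted = np.arange(1, len(pos) + 1)
        sensitivity = tp / pos.sum()
        specificity = tp / predicted
        return sensitivity, specificity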
The plots of these quantities for the whole range of score values are shown in figure 5,
which clearly exhibits the recognition performance of the new scoring scheme over the
whole range of specificities. The ranking of the respective methods is again sequence
alignment, dot product, average scoring, and log average scoring best, almost doubling
the performance of average scoring. Using z-scores, sequence alignment and dot prod-
uct scoring improve somewhat, but still, log average scoring consistently shows doubled
performance over the second best method.
4 Discussion
Fig. 5. Related Pair Recognition. Top: Specificity-sensitivity plots for the raw scores.
Bottom: Specificity-sensitivity plots for the z-scores (see text).
The pair recognition test for the raw score provides a good measure of how well the
alignment score represents a quantitative measure for the relationship between two pro-
teins. The log average score outperforms all other methods here and the plain sequence-
sequence alignment score even outperforms the dot product method which indicates
that the latter approach is heavily dependent on some re-weighting procedure like the
z-score rescaling. When performing this z-score rescaling the average scoring becomes
significantly worse, which is an unexpected effect, since the objective is to make the scores comparable independently of the scoring method used. It is interesting that the log
average score shows only a slight improvement here over the raw score performance
suggesting that the raw score alone is already a good measure of similarity for the two
profiles.
In conclusion, we see that the proposed log average score leads to a superior performance of profile-profile alignment methods in the disciplines of fold recognition and related pair recognition, suggesting that it is a better measure for the similarity of two profiles than the previously described other methods tested here. This is the effect of
simply exchanging the log and the weighted average in the definition of the average
score. A more general fact might also be learned from this: When a scoring function
that maps a state to a score is to be extended to a more general setting where a score is
assigned to a distribution of states, it is not always best to simply take the expected value (i.e. average scoring). Following this, future developments might include
an incorporation of the log average scoring into a new scoring approach for protein
threading as well as an application of the technique in the context of progressive multi-
ple alignment tools.
Acknowledgements
This work was supported by the DFG priority programme "Informatikmethoden zur Analyse und Interpretation großer genomischer Datenmengen", grant ZI616/1-1. We
thank Alexander Zien and Ingolf Sommer for the construction of the profiles and many
helpful discussions.
A Appendix
Two distinct sets of proteins from the PDB [3] are used in the described experiments.
The first one is a set introduced by [1,21] of 251 single domain proteins with known
structure. It is derived from a non-redundant subset of the PDB introduced by [ 14]
where the sequences have no more than 25 % pairwise sequence identity. From this set
all single-domain proteins with all atom coordinates available are selected, yielding the training set S_train of 251 proteins (see also [25]).
training set TR of 81 proteins from the data set mentioned above, belonging to 11 fold classes each of which contains at least five of the sequences from TR. In every iteration
each of the members of TR is used as a query and aligned against all 251 protein pro-
files. If we call the alignments of the best scoring fold class member for each of the 81
proteins the 81 good alignments and all the alignments of each of the 81 proteins against
a member of a different fold class a bad alignment then the iteration tries to maximise
the difference of the alignment scores between the good and the bad alignments. The
iterations were stopped when a convergence could be observed which always happened
before 16 iterations were completed.
References
1. Nick Alexandrov, Ruth Nussinov, and Ralf Zimmer. Fast protein fold recognition via se-
quence to structure alignment and contact capacity potentials. In Lawrence Hunter and
Teri E. Klein, editors, Pacific Symposium on Biocomputing96, pages 5372. World Sci-
entific Publishing Co., 1996.
2. Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng
Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: a new gen-
eration of protein database search programs. Nucleic Acids Research, 25(17):33893402,
September 1997.
3. F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Jr. Meyer, M.D. Brice, J.R. Rodgers,
O. Kennard, T. Shimanouchi, and M. Tasumi. The protein data bank: a computer based
archival file for macromolecular structures. J.Mol.Biol., 112:535542, 1977.
4. Patrick Billingsley. Probability and Measure. Wiley, 1995.
5. S. E. Brenner, P. Koehl, and M. Levitt. The ASTRAL compendium for protein structure and
sequence analysis. Nucleic Acids Res, 28(1):2546., 2000.
6. Michael Brown, Richard Hughey, Anders Krogh, I. Saira Mian, Kimmen Sjolander, and
David Haussler. Using dirichlet mixture priors to derive hidden markov models for protein
families. In Proceedings of the Second Conference on Intelligent Systems for Molecular
Biology, volume 2, Washington, DC, July 1993. AAAI Press. preprint.
7. Jean-Michel Claverie. Some useful statistical properties of position-weight matrices. Com-
puters Chem., 18(3):287294, 1994.
8. Margaret O. Dayhoff, R.M. Schwartz, and B.C. Orcutt. A model of evolutionary change in
proteins. In Atlas of Protein Sequence and Structure, volume 5, Supplement 3, chapter 22,
pages 345352. National Biochemical Research Foundation, Washington DC, 1978.
9. Osamu Gotoh. An improved algorithm for matching biological sequences. Journal of Molec-
ular Biology, 162:705708, 1982.
10. Michael Gribskov, A. D. McLachlan, and David Eisenberg. Profile analysis: Detection of
distantly related proteins. Proceedings of the National Academy of Sciences of the United
States of America, 84(13):43554358, 1987.
11. Michael Gribskov and Stella Veretnik. Identification of sequence patterns with profile anal-
ysis. In Methods in Enzymology, volume 266, chapter 13, pages 198212. Academic Press,
Inc., 1996.
12. Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein
blocks. Proceedings of the National Academy of Sciences of the United States of America,
89(22):1091510919, 1992.
13. Steven Henikoff and Jorja G. Henikoff. Position-based sequence weights. Journal of Molecular Biology, 243(4):574–578, November 1994.
14. Uwe Hobohm and Chris Sander. Enlarged representative set of protein structures. Protein
Science, 3:522524, 1994.
15. Yvonne Kallberg and Bengt Persson. KIND - a non-redundant protein database. Bioinformatics, 15(3):260–261, March 1999.
16. Anders Krogh and Graeme Mitchison. Maximum entropy weighting of aligned sequences
of protein or DNA. In C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and
S. Wodak, editors, Proceedings of ISMB 95, pages 215221, Menlo Park, California 94025,
1995. AAAI Press.
17. L. Lo Conte, B. Ailey, T. J. Hubbard, S. E. Brenner, A. G. Murzin, and C. Chothia. SCOP: a
structural classification of proteins database. Nucleic Acids Res, 28(1):2579., 2000.
18. Leszek Rychlewski, Lukasz Jaroszewski, Weizhong Li, and Adam Godzik. Comparison of
sequence profiles. Strategies for structural predictions using sequence information. Protein
Science, 9:232241, 2000.
19. Shamil R. Sunyaev, Frank Eisenhaber, Igor V. Rodchenkov, Birgit Eisenhaber, Vladimir G.
Tumanyan, and Eugene N. Kuznetsov. PSIC: profile extraction from sequence alignments
with position-specific counts of independent observations. Protein Engineering, 12(5):387
394, 1999.
20. Roman L Tatusov, Stephen F. Altschul, and Eugene V. Koonin. Detection of conserved seg-
ments in proteins: Iterative scanning of sequence databases with alignment blocks. Proceed-
ings of the National Academy of Sciences of the United States of America, 91:1209112095,
December 1994.
21. Ralf Thiele, Ralf Zimmer, and Thomas Lengauer. Protein threading by recursive dynamic
programming. Journal of Molecular Biology, 290(3):757779, 1999.
22. Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson. CLUSTAL W: Improv-
ing the sensitivity of progressive multiple sequence alignment through sequence weight-
ing, position-specific gap penalties and weight matrix choice. Nucleic Acids Research,
22(22):46734680, Nov 1994.
23. Hermann Witting. Mathematische Statistik. Teubner, 1966.
24. Thomas D. Wu, Craig G. Nevill-Manning, and Douglas L. Brutlag. Minimal-risk scoring
matrices for sequence analysis. Journal of Computational Biology, 6(2):219235, 1999.
25. Alexander Zien, Ralf Zimmer, and Thomas Lengauer. A simple iterative approach to param-
eter optimization. Journal of Computational Biology, 7(3):483501, 2000.
False Positives in Genomic Map Assembly
and Sequence Validation
Abstract. This paper outlines an algorithm for whole genome ordered restriction optical map assembly. The algorithm can run very reliably in polynomial time by
exploiting a strict limit on the probability that two maps that appear to overlap are
in fact unrelated (false positives). The main result of this paper is a tight bound
on the false positive probability based on a careful model of the experimental
errors in the maps found in practice. Using this false positive probability bound,
we show that the probability of failure to compute the correct map can be limited
to acceptable levels if the input map error rates satisfy certain sharply delineated
conditions. Thus careful experimental design must be used to ensure that whole
genome map assembly can be done quickly and reliably.
1 Introduction
eliminating false positive matches and are included in many sequence alignment tools
such as BLAST (see for example chapter 2 in [5]).
A simple bound using Brun's sieve can be easily derived [2], but such a bound often
fails to exploit the full power of optical mapping. Here, we derive a much tighter but
more complex bound that characterizes the sharp transition from infeasible experiments
(requiring exponential computation time) to feasible experiments (polynomial compu-
tation time) much more accurately. Based on these bounds, a newer implementation of
the Gentig algorithm for assembling genome-wide shot-gun maps [ 2] has improved its
performance in practice.
A close examination shows that the false positive probability bound exhibits a com-
putational phase-transition: that is, for poor choice of experimental parameters the prob-
ability of obtaining a solution map is close to zero, but improves suddenly to probability
one as the experimental parameters are improved continuously. Thus careful optimized
choice of the experimental parameters analytically has strong implication to experiment
design in solving the problem accurately without incurring unnecessary laboratory or
computational cost. In this paper, we explicitly delineate the interdependencies among
these parameters and explore the trade-offs in parameter space: e.g., sizing error vs. di-
gestion rate vs. total coverage. There are many direct applications of these bounds apart
from the alignment and assembly of maps in Gentig: Comparing two related maps (e.g.
chromosomal aberrations), Validating a sequence (e.g. shot-gun assembly-sequence) or
a map (e.g., a clone map) against a map, etc. Specific usage of our bounds in these
applications will appear elsewhere [3].
The assembly algorithm scores a hypothesized placement of the maps by the probability of the observed data derived from that placement, while allowing for various modeled data errors: sizing errors, missing restriction cut sites, and false optical cut sites.
The posterior conditional probability density for a hypothesized placement H, given
the maps, consists of the product of a prior probability density for the hypothesized
placement and a conditional density of the errors in the component maps relative to
the hypothesized placement. Let the M input maps to be contiged be denoted by data vectors $D_j$ ($1 \le j \le M$), specifying the restriction site locations and enzymes. Then the Bayesian probability density for H, given the data, can be written using Bayes' rule as in [1]:
$$f(H \mid D_1 \ldots D_M) \;=\; f(H)\,\prod_{j=1}^{M} f(D_j \mid H)\Big/\prod_{j=1}^{M} f(D_j) \;\;\propto\;\; f(H)\,\prod_{j=1}^{M} f(D_j \mid H).$$
The conditional probability density function $f(D_j \mid H)$ depends on the error model used. We model the following errors in the input data:
1. Each orientation is equally likely to be correct.
2. Each fragment size in data $D_j$ is assumed to have an independent error distributed as a Gaussian with standard deviation $\sigma$. (It is also possible to model the standard deviation as some polynomial of the true fragment size.)
3. Missing restriction sites in input maps $D_j$ are modeled by a probability $p_c$ of an actual restriction site being present in the data.
4. False restriction sites in the input maps $D_j$ are modeled by a rate parameter $p_f$, which specifies the expected false cut density in the input maps and is assumed to be uniformly and randomly distributed over the input maps.
The Bayesian probability density components $f(H)$ and $f(D_j \mid H)$ are computed separately for each contig (island) of the proposed placement, and the overall probability density is equal to their product. For computational convenience, we actually compute a penalty function, $\Pi$, proportional to the logarithm of the probability density, as follows:
$$f(H)\,\prod_{j=1}^{M} f(D_j \mid H) \;=\; \Bigl[\prod_{j=1}^{M}\frac{1}{(\sigma\sqrt{2\pi})^{m_j}}\Bigr]\exp\bigl(-\Pi/(2\sigma^2)\bigr).$$
Now consider the presence of missing cuts (restriction sites) with $p_c < 1$. To model the multiplicative error of $p_c$ for each cut present in the contig we add a penalty $\Pi_c = 2\sigma^2\log[1/p_c]$, and to model the multiplicative error of $(1-p_c)$ for each missing cut in the contig we add the analogous penalty $2\sigma^2\log[1/(1-p_c)]$.
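To make the preceding error model concrete, here is a minimal sketch (not the Gentig implementation; the function and argument names are ours) of how such a penalty could be accumulated for a single hypothesized alignment, with sizing errors and present/missing cuts handled as above and the remaining error terms omitted:

```python
import math

def alignment_penalty(aligned_pairs, n_present_cuts, n_missing_cuts, sigma, pc):
    """Accumulate a penalty of the form Pi for one hypothesized alignment (schematic).

    aligned_pairs  : list of (observed_size, hypothesized_size) for matched fragments
    n_present_cuts : hypothesized cuts that are observed in the data
    n_missing_cuts : hypothesized cuts that are absent from the data
    sigma          : sizing-error standard deviation (Gaussian model, item 2)
    pc             : probability that a true cut is observed in the data (item 3)
    """
    penalty = 0.0
    # Gaussian sizing errors: each matched fragment contributes its squared size error.
    for observed, hypothesized in aligned_pairs:
        penalty += (observed - hypothesized) ** 2
    # Multiplicative p_c factor for each observed cut and (1 - p_c) for each missing
    # cut, expressed on the same 2*sigma^2*log scale as the sizing terms.
    penalty += n_present_cuts * 2.0 * sigma ** 2 * math.log(1.0 / pc)
    penalty += n_missing_cuts * 2.0 * sigma ** 2 * math.log(1.0 / (1.0 - pc))
    # False cuts (item 4) and the orientation term (item 1) are omitted in this sketch.
    return penalty

# Two matched fragments with small sizing errors, one cut missing from the data.
print(alignment_penalty([(24.1, 25.0), (30.8, 30.0)], n_present_cuts=2,
                        n_missing_cuts=1, sigma=1.5, pc=0.8))
```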
Before proceeding further with the technical details of our probabilistic analysis, we summarize the two main formulae that can be used directly in estimating the false positive probability for a particular map alignment, or in designing a biochemical experiment with the goal of bounding the false positive probability below some acceptably small value (typically $10^{-3}$),
$$\text{where } P_n \;=\; \frac{(R\,e\,\pi/8)^{\,n}}{n}.$$
To achieve an acceptable false positive rate, one needs to choose acceptable values for the experimental parameters: $P_d$, $\sigma$, $\theta$, $L_d$, and coverage. FPT exhibits a sharp phase transition in the space of experimental parameters. Thus the success of a mapping project depends critically on a prudent combination of experimental errors (digestion rate, sizing), sample size (molecule length and number of molecules), and problem size (genome length). Relative sizing error can be lowered simply by increasing L with the choice of a rarer-cutting enzyme, and the digestion rate can be improved by better chemistry [6].
As an example, for a human genome of size G = 3,300 Mb and a desired coverage of 6, consider the following experiment. Assume a typical value of molecule length $L_d$ = 2 Mb. If the enzyme of choice is PacI, the average true fragment length is about 25 kb. Assume a minimum overlap¹ of $\theta$ = 30%. Assume that the sizing error for a fragment of 30 kb is about 3.0 kb, and hence $\sigma^2$ = 0.3 kb. With a digestion rate of $P_d$ = 82% we get an unacceptable FPT ≈ 0.0362. However, just increasing $P_d$ to 86% results in an acceptable FPT ≈ 0.0009. Alternatively, reducing the average sizing error from 3.0 kb to 2.4 kb while keeping $P_d$ = 82% also produces an acceptable FPT ≈ 0.0007.
Obviously one should allow some margin in choosing experimental parameters, so that the actual experimental parameters stay a reasonable distance from the phase transition boundary. This is needed both to allow for some slippage in experimental errors and to allow for the possibility of additional small errors not captured by the error model.
The key to understanding the false positive bound is the following technical lemma, which forms the basis of further computation. Let X = x₁, . . ., xₙ and Y = y₁, . . ., yₙ be a pair of sequences of positive real numbers, each sequence representing the sizes of an ordered sequence of restriction fragments. We rely on a matching rule to decide whether X and Y represent the same restriction fragments in a genome, by comparing the weighted statistic $\sum_{i=1}^{n} w_i\bigl((x_i - y_i)/(x_i + y_i)\bigr)^2$ against a threshold $\epsilon$,
¹ This value should be selected to minimize FPT.
where the $w_i$'s are chosen to match the error model. For example, if the sizing error variance for a fragment with true size X is $\sigma^2 X^p$, where $p \in [0, 2]$, we can use $w_i \equiv \bigl(\frac{x_i + y_i}{2}\bigr)^{2-p}$.
Lemma 1. Let X = X₁, . . ., Xₙ and Y = Y₁, . . ., Yₙ be a pair of sequences of IID random variables $X_i$ and $Y_i$ with exponential distributions and pdfs $f(x) = \frac{1}{L}e^{-x/L}$. Then
1. $\Pr\bigl(|X_i - Y_i|/(X_i + Y_i) \le \epsilon\bigr) \le \epsilon$, for all $\epsilon \ge 0$, with equality holding if $\epsilon \le 1$.
2. $\Pr\Bigl(\sum_{i=1}^{n} w_i\bigl(\tfrac{X_i - Y_i}{X_i + Y_i}\bigr)^2 \le \epsilon\Bigr) \;\le\; \dfrac{(\pi\epsilon/4)^{n/2}}{(\tfrac{n}{2})!\,\prod_{i=1}^{n}\sqrt{w_i}}$, for all $\epsilon \ge 0$, with equality holding if $\epsilon \le \min_{1\le i\le n} w_i$.
Proof. The first identity can be shown by integrating the relevant portion of the joint distribution of $X_i$ and $Y_i$:
$$\Pr\bigl(|X_i - Y_i|/(X_i + Y_i) \le \epsilon\bigr) \;=\; \int_{X_i=0}^{\infty}\int_{Y_i = X_i\frac{1-\epsilon}{1+\epsilon}}^{X_i\frac{1+\epsilon}{1-\epsilon}} \frac{1}{L^2}\,e^{-\frac{X_i + Y_i}{L}}\,dY_i\,dX_i \;=\; \epsilon.$$
Note that this means that for each pair of random fragment sizes $X_i$, $Y_i$ the statistic $U_i \equiv |X_i - Y_i|/(X_i + Y_i)$ is uniformly distributed between 0 and 1.
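The uniformity of this statistic is easy to check numerically. Below is a small simulation (illustrative only, not part of the paper) that draws i.i.d. exponential pairs and verifies that U = |X − Y|/(X + Y) has approximately uniform quantiles, so that Pr(U ≤ ε) ≈ ε:

```python
import bisect
import random

def simulate_uniform_statistic(n_pairs=100_000, mean_length=25.0, seed=0):
    """Check that U = |X - Y|/(X + Y) is uniform on [0, 1] for i.i.d. exponential X, Y."""
    rng = random.Random(seed)
    us = []
    for _ in range(n_pairs):
        x = rng.expovariate(1.0 / mean_length)   # exponential with mean L
        y = rng.expovariate(1.0 / mean_length)
        us.append(abs(x - y) / (x + y))
    us.sort()
    # The empirical Pr(U <= eps) should be close to eps for every threshold.
    for eps in (0.1, 0.3, 0.5, 0.8):
        frac = bisect.bisect_right(us, eps) / n_pairs
        print(f"eps={eps:.1f}  empirical Pr(U <= eps)={frac:.3f}")

if __name__ == "__main__":
    simulate_uniform_statistic()
```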
We can now compute the overall probability $P_n$ for all n fragment pairs:
$$P_n \;\equiv\; \Pr\Bigl(\sum_{i=1}^{n} w_i\Bigl(\frac{X_i - Y_i}{X_i + Y_i}\Bigr)^{2} \le \epsilon\Bigr) \;=\; \Pr\Bigl(\sum_{i=1}^{n} w_i U_i^2 \le \epsilon\Bigr).$$
Note that $U_1, \ldots, U_n$ are IID uniform over [0, 1]; hence this probability is just that part of the volume of the n-dimensional unit cube that satisfies the condition $\sum_{i=1}^{n} w_i U_i^2 \le \epsilon$. For small sizing errors, such that $\epsilon \le \min(w_1, \ldots, w_n)$, this region is one orthant of an n-dimensional ellipsoid with radius $\sqrt{\epsilon/w_i}$ in the ith dimension. In general this volume is an upper bound and hence:
$$P_n \;\le\; \frac{(\pi\epsilon/4)^{n/2}}{(\tfrac{n}{2})!\,\prod_{i=1}^{n}\sqrt{w_i}}.$$
The bound generalizes to alignments in which the matched fragments may contain misaligned cuts. Write
$$\min(X_i, Y_i) \;=\; \frac{1}{2}\sum_{k=1}^{s_i} Z_{r_i + k}, \qquad \max(X_i, Y_i) \;=\; \sum_{k=1}^{r_i} Z_k \;+\; \frac{1}{2}\sum_{k=1}^{s_i} Z_{r_i + k}.$$
Then
1. $\Pr\bigl(|X_i - Y_i|/(X_i + Y_i) \le \epsilon\bigr) \;\le\; \binom{s_i + r_i - 1}{r_i}\,\epsilon^{\,r_i}$, for all $\epsilon \ge 0$.
2. $\Pr\Bigl(\sum_{i=1}^{n} w_i\bigl(\tfrac{X_i - Y_i}{X_i + Y_i}\bigr)^2 \le \epsilon\Bigr) \;\le\; \dfrac{\prod_{i=1}^{n} (\tfrac{r_i}{2})!\,\binom{s_i + r_i - 1}{r_i}\,(\epsilon/w_i)^{r_i/2}}{\bigl(\sum_{i=1}^{n} r_i/2\bigr)!}$, for all $\epsilon \ge 0$.
$$A_n \;\equiv\; \frac{1}{n}\sum_{i=1}^{n}\sqrt{w_i}. \tag{2}$$
$$A_{n+1}R_{n+1}^2 \;\approx\; \frac{n A_n R_n^2 + K/2}{n+1},$$
where K is a prior bias parameter, typically in the range $1 \le K \le 1.4$. Hence for $n + k < N_1$ we can write $FP_{n+k}$ as:
$$FP_{n+k} \;\lesssim\; 4P_n R_n^{\,k}\,\frac{n}{n+k}\left(\frac{A_n\,e^{K/(2A_nR_n^2)}}{2G_n}\right)^{k/2}.$$
This result applies to the case of two maps. The generalization to a population of many
maps is considered for the more general case of missing cuts in the next section.
and simplify it to
$$P_{A_{n,m,s,r}} \;\equiv\; E_{r_0,s_0}\bigl[P_n\bigr] \;\lesssim\; P_n\,(2eR_nA_n)^{S/2}\left(\frac{2}{n+S}\right)^{(n+S+1)/2}\prod_{i=1}^{n}\frac{(\tfrac{r_i}{2})!\,\binom{s_i + r_i - 1}{r_i}}{(\tfrac{1}{2})!\;w_i^{r_i/2}},$$
where $N \equiv \max_{1\le i\le n} r_i$ and $S \equiv \sum_{i=1}^{n}(r_i - 1)$.
The resulting expression diverges for large values of r, but the bound is quite tight for realistic values of m, n, $R_n$.
As a final step in computing the False Positive probability we need to combine the
Pn,m just computed over random alignments involving fewer misaligned cuts (smaller
values of r) or more aligned fragments (larger n), as well as consider the possible ways
the ends of two random maps could be aligned with each other. Using the same approach
as for the case without misaligned cuts to model the permissible change in sizing error
we can show that the result is:
$$FP_r \;\lesssim\; 4P_n\binom{2n+r+2}{r}\left[1 + \sum_{j=2}^{r+1}\frac{(\tfrac{j}{2})!}{(\tfrac{2n+r+1}{2})!}\left(\frac{2A_n}{G}\right)^{\!j-1} R_n^{\,r+1-j}\right]\left[\frac{1}{1-Z} + \frac{1}{2}\Bigl(N_2 - N_1\frac{1+Z}{1-Z}\Bigr)Z^{\,N_1 - n - r/2}\right],$$
where
$$Z \;\equiv\; R_n\,\frac{A_n\,e^{K/(2A_nR_n^2)}}{2G_n}\left(\frac{2n+r+3}{2n+3}\right)^{2}.$$
$$FPT_r \;\equiv\; \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} FP_r(N_i, N_j)$$
$$\lesssim\; P_n\binom{2n+r+2}{r}\left[1 + \sum_{j=2}^{r+1}\frac{(\tfrac{j}{2})!}{(\tfrac{2n+r+1}{2})!}\left(\frac{2A_n}{G}\right)^{\!j-1} R_n^{\,r+1-j}\right]\left[\frac{2M(M-1)}{1-Z} + 4\sum_{i=1}^{M-1}\sum_{j=i+1}^{M}\Bigl(N_j + N_i Z\,\frac{1+Z}{1-Z}\Bigr)Z^{\,N_i - n - r/2}\right] \tag{4}$$
where
$$Z \;\equiv\; R_n\,\frac{A_n\,e^{K/(2A_nR_n^2)}}{2G_n}\left(\frac{2n+r+3}{2n+3}\right)^{2}. \tag{5}$$
This is to be used together with the previous equations for $P_n$, $R_n$, $A_n$, $G_n$ and the error-model parameter K (and, implicitly, the matching threshold $\epsilon$).
6 Experiment Design
In designing a shot-gun genome wide mapping experiment, one needs to ensure that the
data allows correct map overlaps to be clearly distinguished from random map overlaps.
If this is done using a false positive threshold such as the FPT we have derived in this paper, the goal is to ensure that the expected FPT for correct map overlaps does not exceed some acceptable threshold (e.g., $10^{-3}$). In this section we estimate the expected value of FPT for a valid overlap based on the experimental error parameters. In principle we just need to estimate the values of n, r, $R_n$, M for a correct overlap based on the experimental errors. However, given the extreme sensitivity of FPT to n, the number of aligned fragments, we will compute FPT for correct map overlaps of a certain minimum size. By selecting a suitable value of $\theta$, the minimum overlap fraction, we can control the expected minimum value of n, at the cost of some reduction in effective coverage by the factor $1 - \theta$ [9].
In addition, we assume the following experimental parameters: G = expected genome size; $L_d$ = length of each map; C = desired coverage (before adjustment for $\theta$); L = average distance between restriction sites in the genome; $\sigma_X$ = sizing error (standard deviation) for a fragment of size X; $P_d$ = the digestion rate of the restriction enzyme used.
Assuming $R \ll 1$ and $A_n \ll G_n$, we can then write FPT in terms of the experimental parameters as:
$$FPT \;\approx\; 2M^2\left[1 + \frac{2nd+2}{2n(d-1)}\,(d-1)R\right]^{n}\frac{(R\,e\,\pi/8)^{n/2}}{\sqrt{\pi n}}, \tag{6}$$
where $d = \dfrac{1}{P_d}$, $n = \dfrac{\theta\,L_d\,P_d}{L}$, $R = \dfrac{2\sigma^2}{L/P_d}$, and $M = \dfrac{CG}{L_d}$.
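As a small illustrative aid (names and structure are ours, not the paper's), the derived quantities d, n, R and M can be computed directly from the experimental parameters of the human-genome example above; evaluating FPT itself should then be done with the full expressions of the preceding sections:

```python
def derived_parameters(G, coverage, Ld, L, theta, sigma2, Pd):
    """Derived quantities for whole-genome shotgun map design (definitions as above).

    G        : genome size
    coverage : desired coverage C
    Ld       : length of each map (molecule)
    L        : average distance between restriction sites in the genome
    theta    : minimum overlap fraction
    sigma2   : sizing-error variance parameter (variance grows with fragment size)
    Pd       : digestion rate of the restriction enzyme
    """
    d = 1.0 / Pd                   # d = 1 / P_d
    n = theta * Ld * Pd / L        # expected aligned fragments in a minimum overlap
    R = 2.0 * sigma2 / (L / Pd)    # relative sizing error of an average observed fragment
    M = coverage * G / Ld          # number of maps
    return d, n, R, M

# Human-genome example from the text (lengths in kb): G = 3,300 Mb, coverage 6,
# Ld = 2 Mb, L = 25 kb, theta = 0.30, sigma^2 = 0.3 kb, Pd = 0.82.
print(derived_parameters(G=3_300_000, coverage=6, Ld=2_000, L=25,
                         theta=0.30, sigma2=0.3, Pd=0.82))
```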
The contigs generated were checked and verified to be at the correct offset (though
minor local alignment errors may be present).
8 Conclusion
In this paper we derived a tight False Positive Probability bound for overlapping two
maps. This can be used in the assembly of genome wide maps to reduce the search space
from exponential time to sub-quadratic time with only a small increase in false nega-
tives. The False Positive Probability bound also can be used to determine if a sequence
derived map has a statistically significant match with a map.
We also showed how the False Positive Probability bound can be used to select ex-
perimental parameters for whole-genome shot-gun mapping that will allow the genome
wide map to be assembled rapidly and reliably and showed that the boundary between
feasible and infeasible experimental parameters is quite narrow, exhibiting a form of
computational phase transition.
Our approach has certain limitations due to the assumptions underlying our model
that unrelated parts of the genome will not align with each other except in a random
manner. This assumption is not true for haplotypes, for example, and hence our algo-
rithm is not sufficient to produce a haplotyped map. Similarly if some other biological
process results in strong homologies over large distances our algorithm may merge ho-
mologous regions of the genome. If this turns out to be a problem, explicit postprocessing of the resulting map contigs to look for merged homologous regions or haplotypes can be performed.
Acknowledgments
The research presented here was partly supported by NSF Career Grant IRI-9702071,
DOE Grant 25-74100-F1799, NYU Research Challenge Grant, and NYU Curriculum
Challenge Grant.
References
1. T. S. Anantharaman, B. Mishra, and D. C. Schwartz. Genomics via optical mapping II: Ordered restriction maps. Journal of Computational Biology, 4(2):91–118, 1997.
2. T. S. Anantharaman, B. Mishra, and D. C. Schwartz. Genomics via optical mapping III: Contiging genomic DNA. In Proceedings 7th Intl. Conf. on Intelligent Systems for Molecular Biology: ISMB 99, pages 18–27. AAAI Press, 1999.
3. M. Antoniotti, B. Mishra, T. Anantharaman, and T. Paxia. Genomics via optical mapping IV: Sequence validation via optical map matching. Preprint.
4. C. Aston, B. Mishra, and D. C. Schwartz. Optical mapping and its potential for large-scale sequencing projects. Trends in Biotechnology, 17:297–302, 1999.
5. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.
6. J. Reed et al. A quantitative study of optical mapping surfaces by atomic force microscopy and restriction endonuclease digestion assays. Analytical Biochemistry, 259:80–88, 1998.
7. L. Lin et al. Whole-genome shotgun optical mapping of Deinococcus radiodurans. Science, 285:1558–1562, 1999.
8. Z. Lai et al. A shotgun sequence-ready optical map of the whole Plasmodium falciparum genome. Nature Genetics, 23(3):309–313, 1999.
9. M. S. Waterman. Introduction to Computational Biology. Chapman and Hall, 1995.
Boosting EM for Radiation Hybrid and Genetic Mapping
Thomas Schiex et al.
Abstract. Radiation hybrid (RH) mapping is a somatic cell technique that is used for ordering markers along a chromosome and estimating physical distances between them. It nicely complements the genetic mapping technique, allowing for finer resolution. Like genetic mapping, RH mapping consists in finding a marker ordering that maximizes a given criterion. Several software packages have recently been proposed to solve RH mapping problems. Each package offers specific criteria and specific ordering techniques. The most general packages look for maximum likelihood maps and may cope with errors, unknowns and polyploid hybrids, at the cost of limited computational efficiency. More efficient packages look for minimum-breaks or two-point approximated maximum likelihood maps but ignore errors, unknowns and polyploid hybrids.
In this paper, we present a simple improvement of the EM algorithm [5] that makes maximum likelihood estimation much more efficient (in practice, and to some extent in theory too). The boosted EM algorithm can deal with unknowns in both error-free haploid data and error-free backcross data. Unknowns are usually quite limited in RH mapping but cannot be ignored when one deals with genetic data or with multiple population/panel consensus mapping (markers being not necessarily typed in all panels/populations). These improved EM algorithms have been implemented in the CarthaGène software. We conclude with a comparison with similar packages (RHMAP and MapMaker) using simulated data sets and present preliminary results on mixed simultaneous RH/genetic mapping on pig data.
1 Introduction
Radiation hybrid mapping [4] is a somatic cell technique that is used for ordering mark-
ers along a chromosome and estimating the physical distances between them. It nicely
complements alternative mapping techniques especially by providing intermediate reso-
lutions. This technique has been mainly applied to human cells but also used on animals,
e.g. [6].
The biological experiment in RH mapping can be rapidly sketched as follows: cells
from the organism under study are irradiated. The radiation breaks the chromosomes
at random locations into separate fragments. A random subset of the fragments is then
rescued by fusing the irradiated cells with normal rodent cells, a process that produces
a collection of hybrid cells. The resulting clone may contain none, one or many chro-
mosome fragments. This clone is then tested for the presence or absence of each of the
markers. This process is performed a large number of times producing a radiated hybrid
panel.
The algorithmic analysis which follows the biological experiment, based on the re-
tention patterns of the markers observed, aims at finding the most likely linear ordering
of the markers on the chromosome along with distances between markers. The under-
lying intuition is that the further apart two markers are, the most likely it is that the
radiation will create one or more breaks between them, placing the two markers on
separate chromosomal fragments. Therefore, close markers will tend to be more often
co-retained than distant ones. Given an order of the markers, the retention pattern can
therefore be used to estimate the pairwise physical distances between them.
Two fundamental types of approaches have been used to evaluate the quality of a possible marker permutation. The first, crudest, approach is a non-parametric one, called obligate chromosomal breaks (OCB), that aims at finding a permutation that minimizes the number of breaks needed to explain the retention pattern. This approach is not considered in this paper. The second one is a statistical parametric method of maximum likelihood estimation (MLE) using a probabilistic model of the RH biological mechanism. Several probabilistic models have been proposed for RH mapping [9], dealing with polyploidy, errors and unknowns. In this paper, we are only interested in the subset of models that are compatible with the use of the EM algorithm [5] for estimating distances between markers. According to our experience, the simplest equal-retention model is the most frequently used in practice and also the most widely available, because it is a good compromise between efficiency and realism. Such models are used in the RHMAP and RHMAPPER packages. More recently, more efficient approximated MLE versions based on two-point estimation have also been used [2], but they don't deal with unknowns and won't be considered in the sequel.
The older but still widely used genetic mapping technique [10] exploits the occurrence of cross-overs during meiosis. As for RH mapping, the underlying intuition is that the further apart two markers are, the more likely it is that a cross-over will occur in between. Because cross-overs cannot be directly observed, the indirect observation of allelic patterns in parents and children is used to estimate the genetic distance between them. There is a long tradition of using EM in genetic mapping [7]. This paper will focus on RH mapping, but we must mention that the improvements presented here have also been applied to genetic mapping with backcross pedigrees. Actually, genetic and RH data nicely complement each other for ordering markers. Genetic data leads to myopic ordering: sets of close markers cannot be reliably ordered because usually no recombination can be observed between them. On the contrary, RH data leads to hypermetropic ordering: sets of close markers can be reliably ordered, but distant groups are sometimes difficult to order because too many breaks occurred between them. Dealing with unknowns is unavoidable in genetic mapping since markers may be uninformative.
In either RH or genetic mapping, the most obvious computational barrier is the sheer number of possible orders. For n markers, there are n!/2 possible orders (as an order and its reverse are equivalent), which is too large to search exhaustively, even for moderate values of n. In the simplest case of error-free, unknown-free data, it has been observed by several authors that the MLE ordering problem is equivalent to the famous traveling salesman problem [13,1], an NP-hard problem. The ordering techniques used in existing packages go from branch and bound [2] to local search techniques and more or less greedy heuristics. In all cases, finding a good order requires a very large number of MLE calls. In practice, the cost of EM execution is still too heavy to make branch and bound or local search methods computationally usable on large data sets, and the greedy heuristic approach remains among the most widely used in practice.
In this paper, we show how the EM algorithm for RH/genetic mapping can be sped up when it is applied to data sets where each observation is either completely informative or completely uninformative. This is the case, e.g., for error-free haploid RH data with unknowns or error-free backcross data with unknowns. In practice, for RH mapping, it has been observed in [9] that analyzing polyploid radiation hybrids as if they were haploid does not compromise the ability to order markers, which makes this restriction to haploid data quite reasonable. For genetic mapping, most phase-known data can be reduced (although with some possible loss of information) to backcross data. In practice, we have applied it to diploid RH data and a complex animal (pig) pedigree with very limited discrepancies (see section 5) and a speed-up factor of two orders of magnitude.
Interestingly, this boosted EM algorithm is especially adapted to the unknown/known patterns that appear in multiple population/panel mapping when a single consensus map is built: when two or more data sets are available for the same organism (and with a similar resolution for RH), a possible way to build a consensus map is to merge the two data sets into one; markers that are not typed in one of the two data sets are marked as unknown. In this case, we show that each iteration of the EM algorithm may run in O(n) instead of O(n·k), where n is the number of markers and k the number of individuals typed.
These improved EM algorithms, along with several TSP-inspired ordering strategies for both framework and comprehensive mapping, have been implemented in the free software package CarthaGène [13], which allows for multiple population/panel mapping using either shared or separate distance estimation for each pair of data sets. This allows, among other things, for mixed genetic/RH mapping (with separate estimations of genetic and RH distances), which nicely exploits the complementarity of genetic and RH data.
In this section, we will explain how EM can be optimized to deal with haploid RH data sets with unknowns. This optimization also applies to backcross genetic data sets with missing data. It has been implemented in the genetic/RH mapping software CarthaGène [13] but has never been described in the literature before.
Suppose that genetic markers M₁, . . ., Mₙ are typed on k radiation hybrids. The observations for a given hybrid, given the marker order (M₁, . . ., Mₙ), can be written as a vector x = (x₁, . . ., xₙ) where xᵢ = 1 if the marker Mᵢ is typed and present, xᵢ = 0 if the marker is typed and absent, and xᵢ = ? if the marker could not be reliably typed. Such unknowns are relatively rare in RH mapping but are much more frequent in genetic mapping or in multiple population/panel consensus mapping.
The probabilistic HMM model for generating each sequence x in the case of error-free haploid equal-retention data is defined by one retention probability denoted r (probability for a fragment to be retained) and n − 1 breakage probabilities denoted b₁, . . ., b_{n−1} (bᵢ is the probability of breakage between markers Mᵢ and Mᵢ₊₁). Breakage and retention are considered independent processes.
The structure of the HMM model for four markers ordered as M₁, . . ., M₄ is sketched as a weighted digraph G = (V, E) in Figure 1. Vertices correspond to the possible states of each marker, namely retained, missing, or broken.
Fig. 1. The HMM for four ordered markers M₁–M₄, with Retained, Broken and Missing vertices per marker and edges weighted by the breakage probabilities bᵢ and the retention probability r.
An edge (a, b) ∈ E that connects two
vertices a and b is simply weighted by the conditional probability of reaching the state b from the state a, noted p(a, b). For example, if we assume that M₁ is on a retained fragment, there is a probability b₁ that a new fragment will start between M₁ and M₂, and a probability (1 − b₁) that the fragment remains unbroken. In the first case, the new fragment may either be retained (r) or not (1 − r). In the second case, we know that M₂ is on the same retained fragment. The EM algorithm [5] is the usual choice for parameter estimation in hidden Markov models [11], where it is also known as the Forward/Backward or Baum-Welch algorithm. This algorithm can be used to estimate the parameters r, b₁, b₂, . . . and to evaluate the likelihood of a map given the available evidence (a vector x of observations for each hybrid in the panel).
If we consider one observation x = (0, ?, 0, 1) on a given hybrid, the graph can be restricted to the partial graph of Figure 2 by removing vertices and edges which are incompatible with the observation (dotted in the figure). Every path in this graph corresponds to a possible reality. The path in bold corresponds to the case where there has been a breakage between each pair of markers, the successive fragments being respectively missing, retained, missing and retained. If we define the probability of such a source-sink path as the product of all the edge probabilities, then the sum of the probabilities of all the paths that are compatible with the observation is precisely its likelihood. Although there is a worst-case exponential number of such paths, dynamic programming, embodied in the so-called Forward algorithm [11], may be used to compute the likelihood of a single hybrid in O(n) time and space. For any vertex v, if we note $P_l(v)$ the sum of the probabilities of all the paths that exist in the graph from the source to the vertex v, we have the following recurrence equation:
Fig. 2. The graph of Fig. 1 restricted to the vertices and edges compatible with the observation x = (0, ?, 0, 1).
$$P_l(v) \;=\; \sum_{u\,:\,(u,v)\in E} P_l(u)\,p(u, v).$$
This simply says that in order to reach v from the source, we must first reach a vertex
u that is directly connected to v (with probability P l (u)) then go to v (with probability
p(u, v)). We can sum up all these probabilities that correspond to an exhaustive list of
distinct cases. This recurrence can simply be used by initializing the probability P l of
the source vertex to 1.0 and applying the equation Forward (from left to right, using
a topological ordering of the vertices). Obviously, P l for the sink vertex is nothing but
the likelihood of the hybrid. One should note that the same idea can be exploited to
compute for each vertex P r (v), the sum of the probabilities of all paths that connect v
to the sink (simply reverse all edges and apply the forward version).
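As a concrete illustration (this is a minimal sketch, not the CarthaGène implementation), the Forward computation for the error-free haploid model can be written by collapsing the Retained/Broken/Missing vertices of Fig. 2 into two fragment states per marker, which is marginally equivalent; names and the example parameter values below are purely illustrative:

```python
def hybrid_likelihood(obs, r, b):
    """Forward computation of the likelihood of one hybrid's observation vector.

    obs : sequence over {1, 0, None} -- marker present, absent, or untyped (unknown)
    r   : retention probability of a fragment
    b   : list of n-1 breakage probabilities b[i] between markers i and i+1
    States per marker: 0 = marker on a retained fragment, 1 = marker on a missing fragment.
    """
    def compatible(x, state):
        return x is None or (x == 1 and state == 0) or (x == 0 and state == 1)

    # The first fragment is retained with probability r, missing with probability 1 - r.
    fwd = [r if compatible(obs[0], 0) else 0.0,
           (1.0 - r) if compatible(obs[0], 1) else 0.0]
    for i in range(1, len(obs)):
        bi = b[i - 1]
        nxt = [0.0, 0.0]
        for prev_state, prev_p in enumerate(fwd):
            if prev_p == 0.0:
                continue
            nxt[prev_state] += prev_p * (1.0 - bi)   # no breakage: same fragment
            nxt[0] += prev_p * bi * r                # breakage: new fragment retained
            nxt[1] += prev_p * bi * (1.0 - r)        # breakage: new fragment missing
        fwd = [p if compatible(obs[i], s) else 0.0 for s, p in enumerate(nxt)]
    return sum(fwd)

# Example: the observation x = (0, ?, 0, 1) discussed in the text.
print(hybrid_likelihood([0, None, 0, 1], r=0.3, b=[0.2, 0.2, 0.2]))
```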
The EM algorithm is an iterative algorithm that starts from given initial values of the parameters $r^0, b_1^0, b_2^0, \ldots$ and that goes repeatedly through two phases:
1. Expectation: for each hybrid h ∈ H, given the current value of the parameters, the probability $P_h(u, v)$ that a path compatible with the observation x for hybrid h uses edge (u, v) can simply be computed by:
$$P_h(u, v) \;=\; P_l(u)\,p(u, v)\,P_r(v).$$
If, for a given parameter p and a given hybrid h, we note $S_h^{+}(p)$ the set of all edges weighted by p and $S_h^{-}(p)$ the set of all edges weighted by 1 − p, an expected number of occurrences of the corresponding event can be computed by:
$$E(p) \;=\; \sum_{h\in H}\frac{\sum_{(u,v)\in S_h^{+}(p)} P_h(u, v)}{\sum_{(u,v)\in S_h^{+}(p)\cup S_h^{-}(p)} P_h(u, v)}.$$
2. Maximization: each parameter is re-estimated as
$$p^{\,i+1} \;=\; \frac{E(p)}{k}.$$
It is known that EM will produce estimates of increasing likelihood until a local maximum is reached. The usual choice is to stop iterating when the increase in log-likelihood is lower than a given tolerance. Several iterations are usually needed to reach, e.g., tolerance $10^{-4}$, especially as the number of unknowns increases.
Each of the forward, backward and E(p) computations of the E phase successively treats each pair of adjacent markers in constant time (there are at most 6 edges between each pair of markers). These will be called steps in the sequel, with the aim of getting a better idea of complexity than Landau's notation can offer (and which can anyway be derived from the number of steps needed, since each step is constant time). From the previous simplified presentation of EM, we can observe that each EM iteration needs 3(n + 1)k steps, since n + 1 steps are required for each of the Forward phase, the Backward phase and the E(p) computation phase. The M phase is in O(n) only.
To make the E phase more efficient, the idea is to sum up the data for all hybrids in a more concise way, in order to avoid as far as possible the factor k in the complexity of the E phase. The crucial property that is exploited is that when a locus' status is known (0 or 1), the probabilities $P_l$ and $P_r$ for the corresponding Retained vertex are both equal to 0.0 or 1.0 respectively (1.0 or 0.0 respectively for the Missing vertex), and this independently of the surrounding markers.
Any given hybrid can therefore be decomposed into segments of successive loci of three different types, as illustrated in Figure 3:
Fig. 3. Decomposition of a hybrid's observation vector into dangling left, bounded, known-pair and dangling right segments.
dangling segments are either segments that start at the first locus and are all of unknown status except the rightmost one (dangling left), or segments that end at the last locus and are all of unknown status except the leftmost one (dangling right).
known pairs are segments composed from a pair of adjacent loci which are both
of known status.
bounded segments are segments that start and stop at loci of known status but
which are separated by loci of unknown status.
Given the hybrid data set H, it is possible to precompute, for all pairs of markers, the number of known pairs of each type. This is done only once, when the data set is loaded. For a given locus ordering, we can precompute in an O(n·k) phase the number of dangling and bounded segments of each type that occur in the data set. Then the EM algorithm may iterate and perform the E phase by computing expectations for each of the cases and multiplying the results by the number of occurrences. The maximization step remains unchanged.
For known pairs, the expectation computation can be done in one step and there are at most 4(n − 1) different types of pairs, which means 4(n − 1) steps are needed. For all other segments, dangling or bounded, the expectation computation needs a number of steps equal to the length of the fragment. So, if we note u the total number of unknowns in the data set, the total length of these fragments is less than 3u and the expectation computation can be done in at most 9u steps. We get an overall number of steps of 4(n − 1) + 9u, which is usually very small compared to the 3(n + 1)k needed before.
From an asymptotic point of view, this is still in O(nk) because u is in O(nk) but it does
improve things a lot in practice. Also, decomposing hybrids into fragments guarantees
that repeated patterns occurring in two or more hybrids are only processed once.
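As an illustration of the bookkeeping involved (not the actual CarthaGène code; function and variable names are ours), the following sketch decomposes one hybrid vector into the three segment types and tallies the known pairs by their (position, left status, right status), so that the E step can process each distinct pattern once and multiply by its count:

```python
from collections import Counter

def decompose(obs):
    """Split one hybrid observation vector (values 1, 0, or None) into segments.

    Returns (known_pair_counts, bounded_segments, dangling_segments), where known-pair
    counts are keyed by (position, left_status, right_status).
    """
    known = [i for i, x in enumerate(obs) if x is not None]
    pair_counts = Counter()
    bounded, dangling = [], []
    if not known:
        dangling.append(obs[:])            # fully untyped hybrid: one big segment
        return pair_counts, bounded, dangling
    if known[0] > 0:                       # dangling left: unknowns up to first typed locus
        dangling.append(obs[:known[0] + 1])
    if known[-1] < len(obs) - 1:           # dangling right: last typed locus to the end
        dangling.append(obs[known[-1]:])
    for a, b in zip(known, known[1:]):
        if b == a + 1:                     # adjacent typed loci: a "known pair"
            pair_counts[(a, obs[a], obs[b])] += 1
        else:                              # typed loci separated by unknowns: bounded segment
            bounded.append(obs[a:b + 1])
    return pair_counts, bounded, dangling

# Example hybrid from a merged data set: 1 0 ? ? 0 1 ? 0
print(decompose([1, 0, None, None, 0, 1, None, 0]))
```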
There is a specific important case where asymptotic complexity may improve. When
multiple population/panels consensus mapping is performed, a marker which is not
typed in a data set will be unknown for all the hybrids/individuals of the data-set. This
may induce dangling and bounded segments that are shared by all the hybrids/indivi-
duals of the data-set but that will be processed only once at each EM iteration. In this
case, known pairs and all dangling/bounded segments induced by data-set merging will
be handled in O(n) time instead of O(nk).
For radiated hybrid data, the speed-up exceeds one order of magnitude. More modest improvements are reached on genetic data. These improvements may reduce days of computation time to hours and enable the use of more sophisticated ordering techniques, without any approximation being made. These numbers still leave room for improvement, since CarthaGène does not exploit the strategy of precomputing known pairs once and for all but recomputes them at each EM call.
¹ Note that for RH data, this situation corresponds to the merging of two panels that have been irradiated using a similar level of radiation. If this is not the case, one should rather perform separate distance estimations per panel. More complex models, using proportional distances, are available in RHMAP but are not compatible with the use of the EM algorithm.
The same comparison was done using the CRIMAP outbred model and the CarthaGène backcross model on the derived double backcross data. The results obtained are again consistent with our assumption. Note that the important change in log-likelihood is not surprising: fixing phases brings in information, while double backcross projection removes some information. The important thing is that differences in log-likelihood are not affected.
We completed this test by a larger comparison, using more orders, and it appears that the differences in log-likelihood are well conserved: a difference of differences greater than 1.0 was observed only for orders whose log-likelihood was very far from the best one (more than 10 LOD). CarthaGène can thus be used to build framework and comprehensive maps, integrating genetic and RH maps in reasonable time. For better final distances between markers, one can simply re-estimate them with the RHMAP diploid model and CRIMAP.
6 Conclusion
Acknowledgements
We would like to thank Gary Rohrer (USDA) for letting us use the USDA porcine
reference genetic data and Martine Yerle (INRA) for the RH porcine data.
References
1. Amir Ben-Dor and Benny Chor. On constructing radiation hybrid maps. J. Comp. Biol., 4:517–533, 1997.
2. Amir Ben-Dor, Benny Chor, and Dan Pelleg. RHO: Radiation hybrid ordering. Genome Research, 10:365–378, 2000.
3. Michael Boehnke, Kathryn Lunetta, Elisabeth Hauser, Kenneth Lange, Justine Uro, and Jill VanderStoep. RHMAP: Statistical Package for Multipoint Radiation Hybrid Mapping, 3.0 edition, September 1996.
4. D.R. Cox, M. Burmeister, E.R. Price, S. Kim, and R.M. Myers. Radiation hybrid mapping: A somatic cell genetic method for constructing high-resolution maps of mammalian chromosomes. Science, 250:245–250, 1990.
5. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. Ser. B, 39:1–38, 1977.
6. R.J. Hawken, J. Murtaugh, G.H. Flickinger, M. Yerle, A. Robic, D. Milan, J. Gellin, C.W. Beattie, L.B. Schook, and L.J. Alexander. A first-generation porcine whole-genome radiation hybrid map. Mamm. Genome, 10:824–830, 1999.
7. E.S. Lander, P. Green, J. Abrahamson, A. Barlow, M.J. Daly, S.E. Lincoln, and L. Newburg. MAPMAKER: An interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics, 1:174–181, 1987.
8. V. Laurent, E. Wajnberg, B. Mangin, T. Schiex, C. Gaspin, and F. Vanlerberghe-Masutti. A composite genetic map of the parasitoid wasp Trichogramma brassicae based on RAPD markers. Genetics, 150(1):275–282, 1998.
9. Kathryn L. Lunetta, Michael Boehnke, Kenneth Lange, and David R. Cox. Experimental design and error detection for polyploid radiation hybrid mapping. Genome Research, 5:151–163, 1995.
10. Jurg Ott. Analysis of human genetic linkage. Johns Hopkins University Press, Baltimore, Maryland, 2nd edition, 1991.
11. Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.
12. G.A. Rohrer, L.J. Alexander, J.W. Keele, T.P. Smith, and C.W. Beattie. A microsatellite linkage map of the porcine genome. Genetics, 136:231–245, 1994.
13. T. Schiex and C. Gaspin. CarthaGène: Constructing and joining maximum likelihood genetic maps. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, Porto Carras, Halkidiki, Greece, 1997. Software available at http://www-bia.inra.fr/T/CartaGene.
14. M. Yerle, P. Pinton, A. Robic, A. Alfonso, Y. Palvadeau, C. Delcros, R. Hawken, L. Alexander, C. Beattie, L. Schook, D. Milan, and J. Gellin. Construction of a whole-genome radiation hybrid panel for high-resolution gene mapping in pigs. Cytogenet. Cell Genet., 82:182–188, 1998.
Placing Probes along the Genome Using Pairwise Distance Data
Will Casey, Bud Mishra, and Mike Wigler
1 Introduction
Genetics depends upon genomic maps. The ultimate maps are complete nucleotide se-
quences of the organism together with a description of the transcription units. Such
maps in various degrees of completion exist for many of the microbial organisms,
yeasts, worms, flies, and now humans. Short of this, genetically or physically mapped
collections of objects derived from the genome under study are still of immense utility,
and are often precursors to the development of complete sequence maps. These objects
may be markers of any sort, DNA probes, and genomic inserts in cloning vectors.
We have been exploring the use of microarrays to assist in the development of ge-
nomic maps. We report here one such mapping algorithm, and explore its foundation
using computer simulations and mathematical treatment. The algorithm uses unordered
probes that are microarrayed and hybridized to an organized sampling of arrayed but
unordered members of libraries of large insert genomic clones.
In the foregoing we assume some knowledge of genome organization, DNA hy-
bridization, repetitive DNA, gene duplication, and the common forms of microarray
experiments. In the proposed experimental setting, one sample at a time is hybridized
to microarrayed probes, and hybridization is measured as an absolute quantity. We as-
sume probes are of zero dimension, that is, of negligible length compared to the length
of the large genomic insert clones. Most importantly, we assume that hybridization sig-
nal of a probe reflects its inclusion in one or more large genomic insert clones present in
the sample, and negligible background hybridization. Our analysis is general enough to
include the effects of other sources of error. The novelty of the results reported here is
in their ability to deal with ambiguities, an inevitable consequence of the use of massive
parallelism in microarrays involving many probes and many clones. Similar algorithms
are reported in the literature [7], but assume only the knowledge of clone-probe inclu-
sion information for every such combination and suggest different algorithms that do
not exploit the underlying statistical structure.
One important application of our method is in measuring gene copy number in
genomic DNA [10]. Such techniques will eventually have direct application to the anal-
ysis of somatic mutations in tumors and inherited spontaneous germline mutations in
organisms when those mutations result in gene amplification or deletion. In contrast,
low signal-to-noise ratios, due to the high complexity of genomic DNA, make other
approaches such as the direct application of standard DNA microarray methods highly
problematic.
2 Related Literature
The problem of physical mapping and sequencing using hybridization is relatively well
studied. As shown in [6], the general problem of physical mapping is NP-complete. An
approach based on traveling salesman problem (TSP) in the absence of statistics is given
in [1]. The problem formalism used in this paper will be similar to the foundational work
in [1,2,3,5,7,11,12,14]. Our method extends the previous results by devising efficient
algorithms as well as biochemical experiments capable of achieving higher resolution of
probe placement within contigs. In [8] the MATRIX-TO-LINE problem is suggested as the model problem for determining Radiation Hybrid maps. Probe mapping using BACs
is slightly different in that pairwise distances for probes far away cannot be resolved
directly using BACs. Our design is general in that the inputs are modeled as random
variables with known statistics determined a priori by our experimental designs chosen
appropriately for a range of applications. Also we provide estimated probabilities of
correctness for the map we produce. In this sense, this paper invokes an experimental
optimization as recommended in [2].
3 Mathematical Definitions
Consider a genome represented by the interval [0, G]. Take P random short sub-strings
(about 200bps) which appear on the genome uniquely. Represent these strings as points
{x1 , . . . , xP }. Assume that the probes are i.i.d. with uniform random distribution over
the interval [0, G]. Let S be a collection of intervals of the genome, each of length L
(usually ranging from few 100kbs to Mbs). Suppose the left-hand points of the intervals
of S are i.i.d. uniform random variables over the interval [0, G]. Take a small, even in
number sized subset of intervals S S, chosen randomly from S. Divide S randomly
into two equal-size disjoint subsets S = SR
SG
, where R indicates a red color class
and G indicates a green color class. Now specify any point x in [0, G] and consider the
possible associations between x, and the intervals in S :
How the Distances Are Measured. With the resulting color sequences $s_j$ we can compute the pairwise Hamming distance. Let $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_P\} \subseteq [0, G]$ denote estimated positions for the probes. The maximum likelihood placement maximizes
$$\sum_{1\le i,j\le P} \ln f\bigl(|\hat{x}_i - \hat{x}_j| \,\big|\, d_{ij}\bigr),$$
which we approximate by the weighted least-squares problem:
$$\text{minimize } \sum_{1\le i<j\le P} W_{ij}\,\bigl(|\hat{x}_i - \hat{x}_j| - d_{ij}\bigr)^2,$$
Hence, a contig $C_q$ can be appended to a contig $C_p$ by placing its first probe $\hat{x}_{q_1}$ using the estimated distances $d_{i,q_1}$ to the probes already placed in $C_p$, where the $\hat{x}_i$ ($i \in \{p_1, \ldots, p_l\}$) are fixed by the locations assigned in the contig $C_p$. Thus, taking the derivative of the expression above with respect to $\hat{x}_{q_1}$ and equating it to zero, we see that the optimal location for $\hat{x}_{q_1}$ in $C_r$ is
$$\hat{d} \;=\; \max\!\left(\hat{x}_{p_l},\;\; \frac{\sum_{i\in\{p_1,\ldots,p_l\}:\,d_{i,q_1} < L'} (\hat{x}_i + d_{i,q_1})\big/\sigma^2_{d_{i,q_1}}}{\sum_{i\in\{p_1,\ldots,p_l\}:\,d_{i,q_1} < L'} 1\big/\sigma^2_{d_{i,q_1}}}\right).$$
The merged contig is
$$C_r \;=\; [\hat{x}_{r_1}, \ldots, \hat{x}_{r_l}, \hat{x}_{r_{l+1}}, \ldots, \hat{x}_{r_{l+m}}], \qquad 0 \le \hat{x}_{r_{l+1}} - \hat{x}_{r_l} \le L',$$
and the distances between all other consecutive pairs are exactly the same as what they
were in the original constituent contigs. Thus, in any contig, the distance between every
pair of consecutive probes takes a value between 0 and L . Note that one may further
simplify the distance computation by simply considering the k nearest neighbors of x q1
from the contig C p : namely, xplk+1 , . . ., x
pl .
$$\hat{d}_k \;=\; \max\!\left(\hat{x}_{p_l},\;\; \frac{\sum_{i\in\{p_{l-k+1},\ldots,p_l\}:\,d_{i,q_1} < L'} (\hat{x}_i + d_{i,q_1})\big/\sigma^2_{d_{i,q_1}}}{\sum_{i\in\{p_{l-k+1},\ldots,p_l\}:\,d_{i,q_1} < L'} 1\big/\sigma^2_{d_{i,q_1}}}\right),$$
and in the greediest version, using only the nearest neighbor, $\hat{d}_1 = \hat{x}_{p_l} + d_{p_l,q_1}$.
At any point we can also improve the distances in a contig by running an adjust operation on a contig $C_p$ with respect to a probe $\hat{x}_{p_j}$, where
$$C_p \;=\; [\hat{x}_{p_1}, \ldots, \hat{x}_{p_{j-1}}, \hat{x}_{p_j}, \hat{x}_{p_{j+1}}, \ldots, \hat{x}_{p_l}],$$
and $\hat{x}^{*}$ denotes the weighted least-squares position of $\hat{x}_{p_j}$ computed, as above, from the probes of $C_p$ on both sides of it. At this point, if $\hat{x}^{*} \ne \hat{x}_{p_j}$, then the new position of the probe $\hat{x}_{p_j}$ in the contig $C_p$ is $\hat{x}^{*}$. As before, one can use various approximate versions of the update rule, where only k probes from the left and k probes from the right are considered; in the greediest version only the two nearest neighbors are considered. Note that the adjust operation always improves the quadratic cost function of the contig locally, and since the cost is positive-valued and each improvement is bounded away from zero, the iterative improvement operations terminate.
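A small sketch (with hypothetical names, not the authors' implementation) of the weighted placement rule above: a probe is positioned by the precision-weighted average of the positions implied by its measured distances to probes already placed in the contig, and the same rule can serve as the adjust step:

```python
def weighted_position(anchors, distances, sigmas, floor=None):
    """Precision-weighted least-squares position of a probe relative to placed probes.

    anchors   : positions already assigned to probes in the contig
    distances : estimated distances from those probes to the new probe
    sigmas    : standard deviations of those distance estimates (weights are 1/sigma^2)
    floor     : optional lower bound (e.g. the position of the contig's last probe)
    """
    num = sum((a + d) / s ** 2 for a, d, s in zip(anchors, distances, sigmas))
    den = sum(1.0 / s ** 2 for s in sigmas)
    pos = num / den
    return max(floor, pos) if floor is not None else pos

# Place a new probe using its distances to the three right-most probes of a contig.
print(weighted_position(anchors=[120.0, 180.0, 230.0],
                        distances=[150.0, 95.0, 42.0],
                        sigmas=[12.0, 9.0, 5.0],
                        floor=230.0))
```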
The input domain is a probe set V and a symmetric positive real-valued distance weight matrix $D \in \mathbb{R}_{+}^{P\times P}$, where P = |V|.
PRE-PROCESS
Construct a graph $G' = \langle V, E' \rangle$, where $E' = \{e_k = (x_i, x_j) \mid d_{i,j} < L'\}$. The edge set of the graph $G'$ is sorted into increasing order as follows: $e_1, e_2, \ldots, e_Q$, with $Q = |E'|$, such that for any two edges $e_{k_1} = [x_{i_1}, x_{j_1}]$ and $e_{k_2} = [x_{i_2}, x_{j_2}]$, if $k_1 < k_2$ then $d_{i_1,j_1} \le d_{i_2,j_2}$. $G'$ can be constructed in $O(|V|^2)$ time, and its edges can be sorted in $O(|E'|\log|V|)$ time. In a simpler version of the algorithm it will suffice to sort the edges into an approximate increasing order by a parameter $H_{i,j}$ (related to $d_{i,j}$) that takes values between 0 and M. Such a simplification would result in an algorithm with $O(|E'|\log M)$ runtime.
MAIN ALGORITHM
Data structure: Contigs are maintained in a modified union-find structure designed to encode a collection of disjoint unordered sets of probes which may be merged at any time. Union-find supports two operations, union and find [13]: union merges two sets into one larger set; find identifies the set an element is in. At any instant, a contig is represented by the following:
- a doubly linked list of probes, giving left and right neighbors with estimated consecutive neighbor distances;
- boundary probes: each contig has a reference to its left-most and right-most probes.
In the kth step of the algorithm, consider edge $e_k = [x_i, x_j]$: if find($x_i$) and find($x_j$) are in distinct contigs $C_p$ and $C_q$, then join $C_p$ and $C_q$, and update a single distance-to-neighbor entry in one of the contigs.
At the termination of this phase of the algorithm, one may repeatedly choose a
random probe in a randomly chosen contig and apply an adjust operation.
OUTPUT
A collection of probe contigs with probe positions relative to the anchoring probe for
that contig.
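The following is a condensed sketch of the merge phase under the stated data structures (a plain union-find keyed by probe); it is illustrative only and omits the doubly linked list and the position bookkeeping of the full algorithm:

```python
def assemble_contigs(probes, edges):
    """Greedy contig assembly: process probe pairs in order of increasing distance.

    probes : list of probe identifiers
    edges  : list of (d_ij, i, j) with d_ij < L', assumed pre-sorted by d_ij
    Returns a mapping from contig representative to the set of its probes.
    """
    parent = {p: p for p in probes}

    def find(p):                      # find with path compression
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # distinct contigs: merge them (union)
            parent[rj] = ri           # a full version would also splice the linked lists
                                      # and update one neighbor-distance entry using d
    contigs = {}
    for p in probes:
        contigs.setdefault(find(p), set()).add(p)
    return contigs

edges = sorted([(4.0, "p1", "p2"), (6.5, "p2", "p3"), (30.0, "p5", "p6")])
print(assemble_contigs(["p1", "p2", "p3", "p4", "p5", "p6"], edges))
```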
In a slightly more robust version the contigs may be represented by a dynamic balanced binary search tree which admits find and implant operations. Each operation has worst-case time complexity $O(\log|V|)$. Thus, summing over all $|E'|$ operations, the worst-case runtime for the main algorithm is
$$O\bigl(|E'|\log|V| + k|V|\bigr).$$
[Figure: mean and variance of the estimated distance d(x) over 100 samples, plotted against x, for l(BAC) = 160.]
From the population we repeat a sampling experiment using a sample size of 32 BACs: 16 are colored red and 16 are colored green. Each sample is hybridized in silico to the probe set. Here we assume perfect hybridization, so there are no cross-hybridizations or failures of hybridization associated with the experiment. We repeat the sample experiment 130 times. This produces the observed distance matrix, whose distribution we modeled earlier. This is the input for the algorithm presented in this paper. In the distance vs. observed data plot we see that using a large M = 130 (suggested by the Chernoff bounds) has its benefits in cutting down the rate of false positives. The observed distance matrix is input into the (10-neighbor, 11/16) version of the algorithm without the use of the adjust operation; the result is 7 contigs. The order within contigs had five mistakes. We look at the 4th contig and plot the relative error in probe placement.
[Figure: inferred probe positions, inferred contig structure, inferred order given the contig order, and the difference in relative positions for the largest contig.]
6 Future Work
The more robust variation of the algorithm based on a dynamically balanced binary
search tree will be studied in more detail. A comparison with Traveling Salesman
heuristics and an investigation of an underlying relation to the heat equation will show
why this algorithm works well. We will work on a probabilistic analysis for the statis-
tics of contigs. A model incorporating failure in hybridization and cross hybridization
will be developed. We are able to prove that, if errors are not systematic, then a slight
modification of the Chernoff bounds presented here can be applied to ensure the same
results. We shall also consider the choice of probes to limit the cross-hybridization er-
ror and a choice of melting points to further add to the goal of decreasing experimental
noise. A set of experimental designs will be presented for the working biologists. More
extensive simulations, and results on real experiments shall report the progress of what
appears to be a promising algorithm.
Acknowledgments
The research presented here was supported in part by a DOE Grant, an NYU Research
Challenge Grant, and an NIH Grant.
Appendix
Proof. Since the M samples are drawn independently, the proof reduces to showing that when M = 1 the probabilities are Bernoulli with the respective parameters. Let us define the events $T = (i_B \cap j_B)$, $C = (i_R \cap j_R) \cup (i_G \cap j_G) \cup (i_Y \cap j_Y)$, and $H = (T \cup C)$.
Given a set of K BACs on a genome [0, G], the probability that none start in an interval of length l is $(1 - \lambda)^{l} \approx e^{-\lambda l}$, where $\lambda = K/G$.
Shown below is a diagram that is helpful in computing the probabilities for events
C, H, T when x < L. The heavy dark bar labeled a represents a set of BACs which
covers probe p i but not pj ; the bar labeled b represents a set of BACs that covers probe
pi and pj ; finally, the bar labeled c represents a set of BACs that covers p j but not pi .
[Diagram: BAC classes a (covering $p_i$ only), b (covering both $p_i$ and $p_j$), and c (covering $p_j$ only), for probes at separation x < L.]
Hence we derive:
Proof.
$$f(x \mid d) \;=\; \frac{f(d \mid x)\,f(x)}{\int_{0}^{G} f(d \mid x)\,f(x)\,dx}
\;=\; \frac{\frac{1}{G}\Bigl(I_{0\le x<L}\,\frac{e^{-(d-x)^2/2\sigma^2 x}}{\sqrt{2\pi\sigma^2 x}} \;+\; I_{L\le x\le G}\,\frac{e^{-(d-L)^2/2\sigma^2 L}}{\sqrt{2\pi\sigma^2 L}}\Bigr)}
{\frac{1}{G}\int_{0}^{G}\Bigl(I_{0\le x<L}\,\frac{e^{-(d-x)^2/2\sigma^2 x}}{\sqrt{2\pi\sigma^2 x}} \;+\; I_{L\le x\le G}\,\frac{e^{-(d-L)^2/2\sigma^2 L}}{\sqrt{2\pi\sigma^2 L}}\Bigr)dx}.$$
For small values of $\sigma^2$ the denominator in the above expression can be approximated as follows²:
$$f(d) \;=\; \frac{1}{G}\int_{0}^{L}\frac{e^{-(d-x)^2/2\sigma^2 x}}{\sqrt{2\pi\sigma^2 x}}\,dx \;+\; \frac{G-L}{G}\,\frac{e^{-(d-L)^2/2\sigma^2 L}}{\sqrt{2\pi\sigma^2 L}}
\;\approx\; \frac{1}{G}\,I_{d<L} \;+\; \Bigl(1 - \frac{L}{G}\Bigr)\,\delta_{d=L}.$$
Thus, we make further simplifying assumptions and choose the following likelihood function:
$$f(x \mid d) \;\approx\; \frac{e^{-(x-d)^2/2\sigma^2 d}}{\sqrt{2\pi\sigma^2 d}}\,I_{d<L} \;+\; I_{d\ge L}\,\frac{I_{L\le x\le G}}{G-L},$$
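For concreteness, the chosen likelihood can be evaluated numerically as in the small sketch below (illustrative only; the function name and the example parameter values are ours):

```python
import math

def likelihood(x, d, sigma2, L, G):
    """Approximate likelihood f(x | d) of a true separation x given observed distance d.

    sigma2 : sizing-error variance parameter (the variance grows linearly with d)
    L      : BAC (interval) length; separations beyond L are not resolved
    G      : genome length
    """
    if d < L:
        var = sigma2 * d
        return math.exp(-(x - d) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    # Observed distance saturated near L: x is only known to lie in [L, G].
    return 1.0 / (G - L) if L <= x <= G else 0.0

print(likelihood(x=95.0, d=100.0, sigma2=2.0, L=160.0, G=5000.0))
```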
We treat the problem of false positives and false negatives with Chernoff's tail bounds. We find upper bounds on the probability of getting a false positive or a false negative in terms of the parameters M, $c = KL/G$, a fraction $\beta$ with $0 \le \beta \le 1$, and $L' = \beta L$.
A false positive is a pair of probes that appear to be close by the Hamming distance but are actually far apart on the genome. A false negative is a pair of probes that appear to be far by the Hamming distance but are actually close on the genome.
In the following picture the volumes of data which are false positives and false negatives are indicated by the squares noted F.P. and F.N. respectively.
² The Dirac delta function is the distribution $\delta_{x=0}$ defined by the equations $\delta_{x=0} = 0$ if $x \ne 0$ and $\int_x \delta_{x=0}\,dx = 1$.
[Figure: partition of the (x, d) plane by the thresholds L' and L. The region marked F.N. (true separation x small, observed distance d large) contains false negatives; the region marked F.P. (x large, d small) contains false positives.]
We develop a Chernoff bound to bound the probability that the volume of false
positive data is greater than a specified size.
The Chernoff bounds for a binomial distribution with parameters (M, q) are given by:
$$\Pr\bigl(H > (1+v)Mq\bigr) \;<\; \left(\frac{e^{v}}{(1+v)^{(1+v)}}\right)^{Mq} \quad\text{with } v > 0,$$
$$\Pr\bigl(H < \gamma Mq\bigr) \;<\; e^{-Mq(1-\gamma)^2/2} \quad\text{with } 0 < \gamma \le 1.$$
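These tail bounds are easy to evaluate numerically when choosing the number of samples M; a minimal helper (illustrative only, with example values that are not from the paper) follows:

```python
import math

def chernoff_upper(M, q, v):
    """Bound on P(H > (1 + v) M q) for a Binomial(M, q) variable, v > 0."""
    return (math.exp(v) / (1.0 + v) ** (1.0 + v)) ** (M * q)

def chernoff_lower(M, q, gamma):
    """Bound on P(H < gamma M q) for a Binomial(M, q) variable, 0 < gamma <= 1."""
    return math.exp(-M * q * (1.0 - gamma) ** 2 / 2.0)

# With M = 130 samples and per-sample probability q = 0.25:
print(chernoff_upper(M=130, q=0.25, v=0.5), chernoff_lower(M=130, q=0.25, gamma=0.5))
```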
Let H(M) be the Hamming distance when M phases are complete, and let $q \equiv \Pr(H \mid x \ge L')$. We start by noting equivalent events:
$$\Pr(\mathrm{F.N.}) \;<\; \bigl(e^{(\gamma-1)}\,\gamma^{-\gamma}\bigr)^{Mq} \;\le\; e^{-Mq(1-\gamma)^2/2}.$$
[Figure: the false negative and false positive probability bounds plotted as functions of the threshold and of the number of samples M.]
References
1. F. Alizadeh, R.M. Karp, D.K. Weisser, and G. Zweig. Physical Mapping of Chromosomes Using Unique Probes. Journal of Computational Biology, 2(2):159–185, 1995.
2. E. Barillot, J. Dausset, and D. Cohen. Theoretical Analysis of a Physical Mapping Strategy Using Random Single-Copy Landmarks. Journal of Computational Biology, 2(2):159–185, 1995.
3. A. Ben-Dor and B. Chor. On constructing radiation hybrid maps. Proceedings of the First International Conference on Computational Molecular Biology, 17–26, 1997.
4. H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:483–509, 1952.
5. R. Drmanac et al. DNA sequence determination by hybridization: a strategy for efficient large-scale sequencing. Science, 163(5147):596, Feb 4, 1994.
6. P.W. Goldberg, M.C. Golumbic, H. Kaplan, and R. Shamir. Four Strikes Against Physical Mapping of DNA. Journal of Computational Biology, 2(1):139–152, 1995.
7. D. Greenberg and S. Istrail. Physical mapping by STS hybridization: Algorithmic strategies and the challenge of software evaluation. Journal of Computational Biology, 2(2):219–273, 1995.
8. J. Hastad, L. Ivansson, and J. Lagergren. Fitting Points on the Real Line and its Application to RH Mapping. Lecture Notes in Computer Science, 1461:465–467, 1998.
9. N. Lisitsyn and M. Wigler. Cloning the differences between two complex genomes. Science, 258:946–951, 1993.
10. R. Lucito, J. West, A. Reiner, J. Alexander, D. Esposito, B. Mishra, S. Powers, L. Norton, and M. Wigler. Detecting Gene Copy Number Fluctuations in Tumor Cells by Microarray Analysis of Genomic Representations. Genome Research, 10(11):1726–1736, 2000.
11. M.J. Palazzolo, S.A. Sawyer, C.H. Martin, D.A. Smoller, and D.L. Hartl. Optimized Strategies for Sequence-Tagged-Site Selection in Genome Mapping. Proc. Natl. Acad. Sci. USA, 88(18):8034–8038, 1991.
12. D. Slonim, L. Kruglyak, L. Stein, and E. Lander. Building human genome maps with radiation hybrids. Journal of Computational Biology, 4(4):487–504, 1997.
13. R.E. Tarjan. Data Structures and Network Algorithms. CBMS 44, SIAM, Philadelphia, 1983.
14. D.C. Torney. Mapping Using Unique Sequences. J Mol Biol, 217(2):259–264, 1991.
Comparing a Hidden Markov Model and a Stochastic Context-Free Grammar
Arun Jagota¹, Rune B. Lyngsø¹, and Christian N.S. Pedersen²
¹ Baskin Center for Computer Science and Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064, U.S.A.
{jagota,rlyngsoe}@cse.ucsc.edu
² Basic Research in Computer Science (BRICS), Department of Computer Science, University of Aarhus, Ny Munkegade, DK-8000 Århus C, Denmark
[email protected]
1 Introduction
The basic chain-like structure of the key biomolecules, DNA, RNA, and proteins, al-
lows an abstract view of these as strings, or sequences, over finite alphabets, obviously
of finite length. Furthermore, these sequences are not completely random, but exhibit
various kinds of structures in different contexts. E.g. a family of homologous proteins is
likely to have similar amino acid residues in equivalent positions; an RNA sequence
will have pairs of complementary subsequences to form base pairing helices. Hence, it
is natural to consider applying models from formal language theory to model different
classes of biological sequences.
Though not completely random, biological sequences can still possess inherent
stochastic traits, e.g., due to mutations in a family of homologous sequences or a lack
Supported by grants from Carlsbergfondet and the Program in Mathematics and Molecular
Biology
Partially supported by the IST Programme of the EU under contract number IST-1999-14186
(ALCOM-FT)
of knowledge (and computing power) to correctly model all aspects of RNA secondary
structure formation. Thus, it is often better to use stochastic models giving a probability
distribution over all sequences, where a high probability reflects a sequence likely to be-
long to the class of sequences being modeled, instead of formalisms only distinguishing
sequences as either belonging to the class being modeled or not. The two most widely
used grammatical models in bioinformatics are hidden Markov models [ 3, 1, 8, 16, 9]
and stochastic context-free grammars [15, 14, 7], though other models have also been
proposed [18, 13]. These two types of stochastic models were originally developed as
tools for speech recognition (see [12, 2]). One can identify hidden Markov models as
a stochastic version of regular languages and stochastic context-free grammars as a
stochastic version of context-free languages (see [17] for an introduction to formal languages). A more in-depth treatment of biological uses of hidden Markov models and stochastic context-free grammars can be found in [5, Chap. 3–6 and 9–10].
As stochastic models are commonly used to model families of biological sequences,
and as a common task in bioinformatics is that of comparing data, it is natural to ask
how to compare two stochastic models. In [10] we described how to compare two hidden Markov models, by computing the co-emission (or collision) probability of the probability distributions of the two models, i.e., the probability that the two models independently generate the same sequence. Having the co-emission probability for a pair of probability distributions, as well as for each of the distributions with itself, it is easy to compute the $L_2$- and the Hellinger-distance between the two distributions. In
this paper we study the problem of comparing a hidden Markov model and a stochastic
context-free grammar. We develop recursions for the co-emission probability of the dis-
tributions of the model and the grammar, recursions that lead to a set of quadratic equa-
tions. Though quadratic equations are generally hard to solve, we show how to find an
approximate solution by a simple iteration scheme. Furthermore, we show how to solve
the equivalent maximization problem, the problem of finding a run through the hidden
Markov model and derivation in the grammar that generate the same sequence and have
maximal joint probability. This is in essence parsing the hidden Markov model by the
grammar, so the algorithm can be viewed as a generalization of the CYK algorithm
for parsing a sequence by a stochastic context-free grammar. Indeed, in most cases the
complexity of our algorithm will be identical to the complexity of the CYK algorithm.
Finally we discuss the undecidability of some natural extensions of our results.
The structure of this paper is as follows. In Sect. 2 and 3 we briefly introduce hid-
den Markov models and stochastic context-free grammars and the terminology we use.
In Sect. 4 we consider the problem of computing the co-emission probability of the
probability distributions of a hidden Markov model and a stochastic context-free gram-
mar, and in Sect. 5 we develop the algorithm for parsing a hidden Markov model by a
stochastic context-free grammar. In Sect. 6 we present an illustrative experiment and in
Sect. 7 we discuss the problems occurring when trying to extend the methods presented
in Sect. 4 and 5.
P_M(π, s) = P_M(π) · P_M(s | π) = ∏_{i=1}^{k} a^M_{π_{i−1},π_i} · ∏_{j=1}^{|s|} e^M_{π_{i_j},s_j} ;   P_M(s) = Σ_π P_M(π, s) .
A partial run in M is a run that starts in a state p and ends in a state q, which are not
necessarily the special start- and end-states. To ease the presentation of the methods in
Sect. 4, we introduce the concept of a partial run from p to q semi-including q meaning
that if q is a non-silent state then no symbol has yet been emitted from q. The probability
of a partial run from p to q semi-including q and generating s is thus the probability
of generating s on the path from p to the immediate predecessor of q and taking the
transition to q. Given a string s and a model M , efficient algorithms for determining
PM (s) and the run in M of maximal probability generating s are known, see [ 12, 5].
A context-free grammar G describes a set of finite strings over a finite alphabet Σ, also
called a language over Σ. It consists of a set V of non-terminals, a set T = Σ ∪ {ε} of
terminals, where ε is the empty string, and a set P of production rules α → β, where
α ∈ V and β ∈ (V ∪ T)+. A production rule α → β means that α can be rewritten
to β by applying the rule. A string s can be derived from a non-terminal U, written
U ⇒* s, if U can be rewritten to s by a sequence of production rules; s is in the language
described by G if it can be derived from a special start non-terminal S. A derivation D
of s in G is a sequence of production rules which rewrites S to s. A derivation of s in G
is also called a parsing of s in G.
which is similar to the definition of the co-emission probability of two hidden Markov
models in [10]. This quantity is also often referred to as the collision probability of
the probability distributions of G and M , as it is the probability that strings picked
at random according to the two probability distributions collide, i.e., are identical. We
initially assume that M is an acyclic hidden Markov model, i.e., a left-right hidden
Markov model where no states have self-loop transitions. By itself, this is not a very
interesting class of hidden Markov models, e.g., a model of this type cannot generate
strings of arbitrary length, but the ideas of our approach for computing the co-emission
probability for this class of models are also applicable to left-right and general hidden
Markov models.
For acyclic hidden Markov models we can use an approach closely mimicking the
inside algorithm for computing the probability that a stochastic context-free grammar
generates a given string. In the inside algorithm, when computing the probability that
a string s is derived in a stochastic context-free grammar, one keeps track of the
probability that a substring of s is derived from a non-terminal of the grammar. In our
algorithm for computing the co-emission probability of G and M we keep track of the
(a) The A array holds the probabilities of getting from one state p to another state q in M
without emitting any symbols.
(b) The B array holds the probabilities of getting from one state p to another state q in
M while emitting only one symbol and at the same time generating the same symbol by a
terminal production rule for the non-terminal U in G.
(c) The C array holds the probabilities of getting from one state p to another state q in
M while emitting any string and at the same time generating the same string from a non-
terminal U in G.
Fig. 1. Illustration of the individual purposes and recursions of the three arrays used.
Hollow circles denote silent states, solid circles denote non-silent states, and hatched
circles denote states of any type. Squiggle arrows indicate partial runs of arbitrary length
and straight arrows indicate single transitions between states.
probability of deriving the same string from a non-terminal that is generated on a partial
run from a state p to a state q semi-including q in M . In our dynamic programming
based algorithm we maintain three arrays, A, B, and C, for the following purposes.
– A(p, q) will be the sum of the probabilities of all partial runs from p to q semi-including
q that do not emit any symbols, illustrated in Fig. 1(a); i.e., all states on the partial runs,
except possibly for q, are silent states.
– B(U, p, q) is the probability of independently deriving a single-symbol string from
the non-terminal U and generating the same single-symbol string on a partial run
from p to q semi-including q, illustrated in Fig. 1(b).
– C(U, p, q) is the probability of independently deriving a string from the non-terminal
U and generating the same string on a partial run from p to q semi-including
q, illustrated in Fig. 1(c).
The purpose of the A and B arrays is to deal efficiently with partial runs consisting
of silent states. As all symbols of a string are in a sense non-silent, this is the main new
problem encountered when modifying the inside algorithm to parse acyclic hidden
Markov models.
The C array is similar to the array maintained in the inside algorithm. The array
of the inside algorithm tells us the probability of deriving any substring of the string
being parsed from any of the non-terminals of G. Similarly, the C array will tell us
the probability of independently generating a sequence on a partial run between any
pairs of states and at the same time deriving it from any of the non-terminals of G. It is
evident that C(G, M ) = C(S, start , end), where S is the start symbol of G and start
and end are the start- and end-states of M , as the co-emission probability of G and M
is the probability of deriving the same string from S that is generated on a (genuine)
run from start to end . This is assuming that the end-state of M is silent, so that any
partial run from start to end semi-including end in M is also a genuine run in M .
Having described the required arrays, we must specify recursions for computing
them, and argue that these recursions split the computations into ever smaller parts,
i.e., that the dependencies of the recursions are acyclic, in order to obtain a dynamic program-
ming algorithm computing C(G, M). The entries of the A array are the probabilities of getting from
the state p to the state q in M along a path that only consists of silent states. In general
such a path can be broken down into the last transition of the path and a preceding path
only going through silent states. Hence, we obtain the following recursion, where the
first case takes care of the initialization.
A(p, q) = 1 if p = q; otherwise A(p, q) = Σ_{p ≤ r < q, r silent} A(p, r) · a^M_{r,q} .   (1)
The ordering of states referred to in the summation is any ordering consistent with the
(partial) ordering of states by the acyclic transition structure. One immediately observes
that each entry of the A array requires at most time O(n) to compute. Thus the
entire A array can be computed in time O(n³). However, one can observe that we
actually only need to sum over states r with a transition to q in the summation of (1).
This observation reduces the time requirements to O(nm), as each transition is part of
O(n) of the above recursions.
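As an illustration, the recursion in (1) translates directly into a small dynamic program. The sketch below is our own, not the authors' code, and assumes a hypothetical encoding of the acyclic model: states 0..n−1 are numbered consistently with the ordering of states, silent[p] flags silent states, and trans[p] maps successor states to the transition probabilities a^M_{p,q}.

```python
def compute_A(n, silent, trans):
    """A[p][q]: probability of getting from p to q (semi-including q)
    without emitting any symbols, following recursion (1).
    States 0..n-1 are assumed to be topologically ordered."""
    A = [[0.0] * n for _ in range(n)]
    for p in range(n):
        A[p][p] = 1.0                      # first case of (1)
        for q in range(p + 1, n):
            total = 0.0
            for r in range(p, q):          # p <= r < q, r silent
                if silent[r]:
                    total += A[p][r] * trans[r].get(q, 0.0)
            A[p][q] = total
    return A
```

This naive version runs in O(n³); restricting the inner loop to the predecessors of q gives the O(nm) bound mentioned above.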
The B array holds the probabilities for getting from the state p to the state q in M by
a partial run generating exactly one symbol, and at the same time generating the same
symbol from U by a terminal production rule. In general we can partition the partial run
into an initial path consisting only of silent states and the remaining partial run starting
with the non-silent state emitting the symbol. Hence, we obtain the following recursion
where initialization is handled by paying special attention to the special case where
the initial path of silent states is empty.
B(U, p, q) = Σ_{(U→σ)∈G} P_G(U → σ) · e^M_{p,σ} · Σ_{p<r≤q} a^M_{p,r} A(r, q)  +  Σ_{p<r<q, r non-silent} A(p, r) · B(U, r, q)   (2)
To find the time requirements for computing the B array, one observes that each non-
terminal of G occurs in O(n²) entries. Hence, each terminal production rule of G is
part of O(n²) of the above recursions. Each of these recursions requires time O(n) to
compute. Thus, computing the B array requires time O(|P_t| · n³).
The C array should hold the probabilities of getting from the state p to the state q
in M by a partial run generating any string, and at the same time generating the same
string from a non-terminal U in G. The purpose of the A and B arrays is to handle the
special cases where at most a single symbol is emitted on a partial run from p to q. In
these cases we do not really recurse on a non-terminal U from G but either ignore G
completely or only consider terminal production rules. In the general case where a string
with more than one symbol is generated, we need to apply a non-terminal production
rule U → XY from G. Part of the string is then derived from X and part of the string
is derived from Y. This leads to the following recursion:
C(U, p, q) = P_G(U → ε) · A(p, q) + B(U, p, q) + Σ_{(U→XY)∈G} P_G(U → XY) · Σ_{p<r<q, r non-silent} C(X, p, r) · C(Y, r, q)   (3)
The reason for requiring r to be a non-silent state in the last sum is to ensure a unique
decomposition of the partial runs from p to q. If we allowed r to be silent we would
erroneously include some partial runs from p to q several times in the summation. As
we did with the computation of the B array, we can observe that each non-terminal pro-
duction rule is part of O(n²) of the above recursions, and that each recursion requires
time O(n) to compute. Hence, we obtain total time requirements of O(|P_n| · n³) for
computing the C array. Adding the time requirements for computing the three arrays
leads to overall time requirements of O(|P| · n³) for computing the co-emission proba-
bility of G and M. This is in correspondence with the time requirements of O(|P| · |s|³)
for parsing a string s by the grammar G, cf. Sect. 3.
As previously stated, in order for these recursions to be used for a dynamic pro-
gramming algorithm, we need to argue that the recursions only refer to elements that in
some sense are smaller, i.e., that no recursion for any of the entries of the three arrays
depends cyclically on itself. But this is an easy observation as all pairs of states indexing
an array on the right-hand side of the recursions are closer in the ordering of states than
the pair of states indexing the array on the left-hand side of the recursions.
The point of always recursing by smaller elements is exactly where we run into
problems when trying to extend the above method to allow cycles in M . Even when
only allowing the self-loops of left-right models this problem crops up. We can easily
modify (1), (2), and (3) to remain valid as equations for the entries of the A, B and C
arrays. All that is needed is to change some of the strict inequalities in the summation
boundaries to include equality. More specifically, if we assume that only non-silent
states can have self-loop transitions (self-loop transitions for silent states are rather
meaningless and can easily be eliminated), we need to change (2) and (3) to
B(U, p, q) = Σ_{(U→σ)∈G} P_G(U → σ) · e^M_{p,σ} · Σ_{p≤r≤q} a^M_{p,r} A(r, q)  +  Σ_{p<r≤q, r non-silent} A(p, r) · B(U, r, q) ,   (4)
and
C(U, p, q) = P_G(U → ε) · A(p, q) + B(U, p, q) + Σ_{(U→XY)∈G} P_G(U → XY) · Σ_{p≤r≤q, r non-silent} C(X, p, r) · C(Y, r, q) .   (5)
For the B array, (4) still refers only to smaller elements: each entry referred to is either
an entry of the A array or an entry of the B array where the two states are closer in the
ordering, since r has to be strictly larger than p in the last sum. But for the C array we
might choose r equal to either p or q in the last sum. Hence, to compute C(U, p, q) we
need to know C(X, p, q) and C(Y, p, q), which might in turn depend on (or even be)
C(U, p, q).
But, as stated above, (5) still holds as equations for the entries of the C array. For
each pair of states, p and q, in M we thus have a system of equations with one vari-
able and one equation (and the restriction that all variables have to attain non-negative
values) for each non-terminal of G. Assume that we solve these systems in an order
where the distance between the pair of states in the ordering is increasing, i.e., we first
consider systems with p = q, then systems with q being the successor to p, etc. Then
most of the systems will be systems of linear equations. In the equation for a given
entry C(U, p, q) the only unknown quantities are the occurrences of C(X, p, q) and
C(Y, p, q) corresponding to production rules U → XY of G. These occurrences have
coefficients P_G(U → XY) · C(Y, q, q) and P_G(U → XY) · C(X, p, p), respectively,
coefficients that have known values if p < q. But if p = q the last sum of (5) will lead
to a number of terms of the form C(X, p, p) · C(Y, p, p), i.e., the system of equations
is quadratic. Hence, for each state with a self-loop transition we need to solve a system
of quadratic equations with one variable and one equation for each non-terminal of G.
General systems of quadratic equations are hard to solve, see [6], but the construction
proving this requires equations with all terms having coefficients with the same sign.
One can immediately observe that in a system of equations based on (5) the left-hand
side terms will have coefficients of opposite sign to the right-hand side terms.
Hence, the hardness proof does not relate to the system of quadratic equations we obtain
from (5). We have not been able to find any literature on algorithms solving systems of
the type derivable from (5). But as all the dependencies are positive we can approximate
a solution simply by initializing all entries to the terms depending only on the A and
B arrays and then iteratively updating each entry in turn. This process will converge to
the true solution. We conjecture that this convergence will be very rapid for all realistic
grammars and models.
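A minimal sketch of such an iteration is given below; it is only meant to illustrate the idea, with a hypothetical function rhs that re-evaluates the right-hand sides of the equations derived from (5) for all entries, using the already computed A and B arrays.

```python
def iterate_C(init, rhs, tol=1e-12, max_rounds=10000):
    """Fixed-point iteration for the system C = rhs(C) derived from (5).

    init : dict {(U, p, q): value} holding the terms that depend only on
           the A and B arrays (the starting point of the iteration).
    rhs  : function taking the current dict and returning a dict with the
           re-evaluated right-hand side of every equation.
    Because all dependencies have non-negative coefficients, the values
    grow monotonically towards the true solution."""
    C = dict(init)
    for _ in range(max_rounds):
        new_C = rhs(C)
        if max(abs(new_C[k] - C[k]) for k in C) < tol:
            return new_C
        C = new_C
    return C
```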
For general hidden Markov models we do not have an ordering of the states of
M . Hence, to modify (1), (2), and (3) to hold for general hidden Markov models, the
ordering constraints on states should be removed from all the summations. When we
no longer have any ordering of the states, all entries might depend on each other.
Thus, we cannot separate the systems of equations for the entries of the C array into
independent blocks based on the pair of states indexing the entry of the array. This
means that we obtain just one aggregate system of quadratic equations with |V| · n²
variables and equations for the entries of the C array, where |V| is the number of non-
terminals in G and n is the number of states in M . However, for the entries of the A and
B arrays we still only get systems of linear equations. Actually, the B array can even
be computed by simple dynamic programming.
The description of our method will employ three arrays A_max, B_max, and C_max,
similar to the A, B, and C arrays used in the previous section. The A_max(p, q) entries hold
the maximum probability of any partial run from state p to state q semi-including q not
emitting any symbols. But this is just the path of maximum weight in the graph defined
by the transitions not leaving a non-silent state in M. We can thus compute the A_max
array by standard all-pairs shortest paths graph algorithms in time O(n² log n + nm),
cf. [4, p. 550].
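One standard way to realise this step with off-the-shelf shortest-path machinery is to switch to negative logarithms of the transition probabilities, so that maximising a product becomes minimising a sum of non-negative weights. The sketch below is our own illustration under that assumption, reusing the hypothetical silent/trans encoding from the earlier sketch and running Dijkstra's algorithm from every state.

```python
import heapq
import math

def compute_A_max(n, silent, trans):
    """A_max[p][q]: maximum probability of a partial run from p to q
    (semi-including q) that emits no symbols.  Maximising the product of
    transition probabilities equals minimising the sum of their negative
    logarithms, so Dijkstra's algorithm is run from every state."""
    A_max = [[0.0] * n for _ in range(n)]
    for p in range(n):
        dist = {p: 0.0}                    # -log probability of the best run so far
        heap = [(0.0, p)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, math.inf):
                continue                   # stale heap entry
            if not silent[u]:
                continue                   # only the final state q may be non-silent
            for v, prob in trans[u].items():
                if prob <= 0.0:
                    continue
                nd = d - math.log(prob)
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        for q, d in dist.items():
            A_max[p][q] = math.exp(-d)
    return A_max
```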
The B_max(U, p, q) entries hold the maximum probability of a combination of a
derivation consisting of only a terminal production rule U → σ and a partial run from
p to q generating a string consisting only of the symbol σ. This imposes a restriction on
the path from p to q preventing the use of standard graph algorithms. Still, we only need
to combine transitions from non-silent states with preceding and succeeding partial runs
not emitting any symbols, i.e., with entries of A_max. However, if we compute the B_max
entries directly from the equation
B_max(U, p, q) = max_{(U→σ)∈G, (r→s)∈M} P_G(U → σ) · e^M_{r,σ} · a^M_{r,s} · A_max(p, r) · A_max(s, q)   (7)
we use time O(|P_t| · m · n²). This could possibly be the dominating term in the overall
time requirements for computing the value of (6), as we will see later.
Hence, we will specify a more efficient way to compute the B_max(U, p, q) entries.
If p is a non-silent state, the optimal choice of a preceding partial run not emitting any
symbols must be the empty run. Thus
B_max(U, p, q) = max_{(U→σ)∈G, (p→r)∈M} P_G(U → σ) · e^M_{p,σ} · a^M_{p,r} · A_max(r, q)   (8)
for all entries of B_max with p non-silent. Having computed these entries, we can now
proceed to compute the entries for p silent by
B_max(U, p, q) = max_{r∈M, r non-silent} A_max(p, r) · B_max(U, r, q) .   (9)
Computing the B_max array this way reduces the time requirements to O(|P_t| · n³).
We are now ready to specify how to determine the value of (6), i.e., how to compute
the entries of the C_max array. An entry C_max(U, p, q) holds the maximum probability of
a combination of a derivation from U in G and a partial run from p to q in M yielding
the same string. The following equation for computing an entry of C_max closely follows
Fig. 1(c):
C_max(U, p, q) = max{ P_G(U → ε) · A_max(p, q),  B_max(U, p, q),  max{ P_G(U → XY) · C_max(X, p, r) · C_max(Y, r, q) : (U → XY) ∈ G, r ∈ M } }   (10)
The only exception is that there is no harm in considering the same combination of a
derivation and a partial run several times when working with maxima instead of sums.
Hence, there is no restriction on the type of the state (denoted by r) that we maximize
over in the general case. In an actual implementation one might want to retain the type
restriction to speed up the program, though.
Again, this is slightly beyond a shortest path problem. However, (10) gives us a
number of inequalities for each of the entries of C_max, similar to [4, Lemma 25.3]:
C_max(U, p, q) must be at least as large as any of the terms on the right-hand side of
(10). Thus, we can use a technique very similar to the relaxation technique discussed
in [4, pp. 520-521]: if at any time C_max(U, p, q) < P_G(U → XY) · C_max(X, p, r) ·
C_max(Y, r, q) for any (U → XY) ∈ G and r ∈ M we can increase C_max(U, p, q)
to this value. This means that we could start by initializing each C_max(U, p, q) to
max{P_G(U → ε) · A_max(p, q), B_max(U, p, q)} and then keep updating C_max by it-
erating over all possible relaxations until no more changes occur. This process will
eventually terminate as no entry can depend cyclically on itself, as discussed above.
The worst-case time requirements of this scheme are quite excessive, though. Thus,
the question of how to order the relaxations so as to compute C max most efficiently still
remains. We propose an approach very similar to Dijkstras algorithm for computing
single-source shortest paths in a graph, cf. [4, Sect. 25.2]. Assume that S is the set
of entries for which we have already determined the correct value, and that we have
performed all relaxations combining the correct values of these entries (and initialized
all entries using the Amax and Bmax arrays as mentioned in the preceding paragraph).
Let Cmax (U, p, q) be an entry of maximum value not in S. We claim that the current
value of Cmax (U, p, q) must be correct, i.e., that it cannot be increased. In fact, no entry
not in S can have a correct value larger than C_max(U, p, q). The reason for this is that
any relaxation not combining two entries both in S (which we assumed have already
been performed) will involve an entry not in S, and thus with current value at most
C_max(U, p, q). As the other entry used in the relaxation can have a value of at most
1, no future relaxations can lead to values above C_max(U, p, q). Hence, we can insert
C_max(U, p, q) in S and perform all relaxations combining C_max(U, p, q) with an entry
from S. This idea is formalized in Algorithm 1.
So what are the time requirements of Algorithm 1? To some extent this of course
depends on the choice of data structures implementing the priority queue PQ and the
set S, but a key observation is that each possible relaxation, i.e., combination of two
particular entries, is only performed once, namely when the entry with the smaller value
is inserted in S. Hence, Algorithm 1 performs O(|P_n| · n³) relaxations. One can observe
that for each relaxation we need to perform the operation increasekey on PQ, while all
other operations on PQ are performed at most O(|V| · n²) times. Thus, increasekey is
the most critical operation, which is why we will assume that PQ is implemented by a Fibonacci
heap. This limits the time requirements for all operations on PQ to O(|P_n| · n³ + |V| ·
n² log(|V| · n)).
For the set S we need to be able to insert an element efficiently, and to efficiently
iterate through all elements with a particular non-terminal and state. But having already
set aside time O(|P_n| · n³) for the priority queue operations, the operations on S do
not need to be that efficient. As it turns out, it is actually sufficient to maintain S sim-
ply as a three-dimensional boolean array indexed by (U, p, q). This makes insertion a
constant time operation. However, it does not allow for an efficient way to iterate over
/* Initialization */
PQ = ∅, S = ∅
for all non-terminals U of G and all states p, q of M do
  PQ.insert((U, p, q); max{P_G(U → ε) · A_max(p, q), B_max(U, p, q)})
/* Main loop where one entry is fixed at a time */
while PQ is not empty do
  /* Fix the entry with highest probability not yet fixed */
  ((U, p, q); x) = PQ.deletemax
  S.insert((U, p, q); x)
  /* Combine this entry with all feasible, fixed entries */
  for all X, Y with (X → UY) ∈ G and all r ∈ M with ((Y, q, r); y) ∈ S do
    PQ.increasekey((X, p, r); P_G(X → UY) · x · y)
  for all X, Y with (X → YU) ∈ G and all r ∈ M with ((Y, r, p); y) ∈ S do
    PQ.increasekey((X, r, q); P_G(X → YU) · y · x)
Algorithm 1: The algorithm for computing an optimal parse of a hidden Markov model
M by a stochastic context-free grammar G.
all elements in S with a particular non-terminal and state, short of iterating over all
elements with that non-terminal and state and testing the membership of each individual
element. However, this turns out to be sufficiently efficient. In some situations, espe-
cially in the beginning when there are only a few elements in S, we might test the
membership of numerous elements not in S. But each membership test can be asso-
ciated with a relaxation involving a particular pair of elements, namely the relaxation
that is performed if the test succeeds. Furthermore, for each relaxation we will only test
membership twice, once for each element that the relaxation combines. Hence, the total
time we spend iterating over elements in S is O(|P_n| · n³). Thus, the time requirements
for Algorithm 1 are O(|P_n| · n³ + |V| · n² log(|V| · n)). Combined with the time complex-
ity of computing the A_max and B_max arrays, this leads to an overall time complexity of
O(|P| · n³ + |V| · n² log(|V| · n)) for determining the optimal parse of a general hidden
Markov model by a stochastic context-free grammar (having computed A_max, B_max,
and C_max it is easy to find the optimal parse by standard backtracking techniques). This
should be compared with the time requirements of O(|P| · |s|³) for finding the optimal
parse of a string s by the CYK algorithm.
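Since Python's heapq module has no increasekey operation, the relaxation order of Algorithm 1 can be mimicked with lazy deletion: push a new (value, entry) pair whenever an entry improves and discard stale pairs when they are popped. The sketch below is our own illustration of that strategy, not the authors' implementation; the input encoding (initial values from the A_max/B_max arrays, and binary rules indexed by the body symbol they contain) is an assumption made for the example.

```python
import heapq
from collections import defaultdict

def parse_hmm_by_scfg(init, rules_left, rules_right):
    """Relaxation scheme of Algorithm 1 with a lazy-deletion max-heap.

    init        : dict {(U, p, q): value}, the initial values
                  max{P_G(U -> eps) * A_max(p, q), B_max(U, p, q)}.
    rules_left  : dict {U: [(X, Y, prob)]} for binary rules X -> U Y.
    rules_right : dict {U: [(X, Y, prob)]} for binary rules X -> Y U.
    Returns a dict {(U, p, q): C_max(U, p, q)}."""
    best = dict(init)
    heap = [(-v, key) for key, v in init.items()]
    heapq.heapify(heap)
    fixed = {}
    starting_at = defaultdict(list)   # (Y, q) -> [(r, y)] for fixed entries (Y, q, r)
    ending_at = defaultdict(list)     # (Y, p) -> [(r, y)] for fixed entries (Y, r, p)

    def relax(key, value):
        # mimic increasekey: remember the better value and push a new pair
        if key not in fixed and value > best.get(key, 0.0):
            best[key] = value
            heapq.heappush(heap, (-value, key))

    while heap:
        neg_v, (U, p, q) = heapq.heappop(heap)
        x = -neg_v
        if (U, p, q) in fixed or x < best[(U, p, q)]:
            continue                  # stale heap entry
        fixed[(U, p, q)] = x          # this value can no longer be improved
        starting_at[(U, p)].append((q, x))
        ending_at[(U, q)].append((p, x))
        for X, Y, prob in rules_left.get(U, []):    # X -> U Y
            for r, y in starting_at[(Y, q)]:
                relax((X, p, r), prob * x * y)
        for X, Y, prob in rules_right.get(U, []):   # X -> Y U
            for r, y in ending_at[(Y, p)]:
                relax((X, r, q), prob * y * x)
    return fixed
```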
In the above description we did not use any assumptions about the structure of the
hidden Markov model M. A natural question to ask is thus how much can be gained,
with respect to time requirements, by restricting our attention to left-right models. That
it is not a lot should not be surprising, considering that we are already close to the
complexity of the CYK algorithm. However, we can observe that C_max(U, p, q) can
only depend on entries C_max(U′, p′, q′) with p ≤ p′ ≤ q′ ≤ q (where the ordering is
with respect to the implicit partial ordering of states in a left-right model). Thus, we can
separate (10) into O(n²) systems, one for each choice of p and q, that can be solved
one at a time in a predefined order. Hence, the priority queue only needs to hold at
most |V| elements at any time, reducing the time complexity for finding the optimal
parse to O(|P| · n³ + |V| · n² log |V|). More importantly, though, as we only need a
priority queue with at most |V| elements and |V| will usually be very small, it might be
Fig. 2. Alignment and predicted secondary structure of the two sequences, seq1 and
seq2, used to construct the trna2 model, and the sequence and secondary structure
of the maximal parse of trna2 aligned according to the states of trna2 emitting the
symbols. In the two positions indicated with an asterisk the sequence of the maximal parse
does not match any of the two other sequences.
feasible to replace the Fibonacci heap implementation with implementations that have
worse asymptotic complexities but smaller overhead. Thus, if we just implement PQ
with an array, scanning the entire array each time a deletemax operation is performed,
the complexity of parsing a left-right model only increases to O(|P| · n³ + |V|² · n²),
with the involved constants being very small.
6 Results
parse to predict the secondary structure. In each position we can choose a symbol so as
to construct a sequence with a structure of very high probability, i.e., the maximal parse
seems to be a question of highly probable structures finding matching sequences instead
of what we are really looking for, highly probable sequences finding a matching struc-
ture. This is further supported by the two positions where the maximal parse disagrees
with seq1 and seq2, especially as seq1 and seq2 agree in these positions: the sequence
obtained by changing the symbols in these two positions to the symbols shared by seq1
and seq2 would be roughly 80 times as probable in the hidden Markov model.
Another problem is that the maximum is not very good for discriminating states
that exhibit complementarity from states that do not exhibit complementarity. For example, a
state that has probabilities 0.5 for emitting either a C or a G gets a lower probability
if paired with a state identical to itself than if paired with a state that has probability
0.51 for emitting a C and 0.49 for emitting an A (the best complementary choice gives
0.5 · 0.51 = 0.255 in the latter case versus 0.5 · 0.5 = 0.25 in the former). However, having the framework of
Algorithm 1 it is easy to modify the details to accommodate a scoring of combinations
of pairs of states and derivations introducing base pairs that captures complementarity
better. Furthermore, the co-emission probability will indeed capture that the two C/G-
emitting states exhibit better complementarity than the C/G and C/A pair. Thus, a better idea
of a common structure might be obtained by looking at the probability that two states
emit symbols that are base paired, for all pairs of non-silent states, similar to [19, 11].
Indeed, as the dependencies of the energy rules commonly used for RNA secondary
structure prediction can be captured by a context-free grammar, one can also combine
the computation of the co-emission probability as discussed in this paper with the com-
putation of the equilibrium partition function presented in [ 11] to obtain probabilities
for base pairing of positions including both the randomness of base pairing captured by
the partition functions as well as the variability of a family of sequences captured by a
(profile) hidden Markov model.
7 Discussion
In this paper we have considered the problem of comparing a hidden Markov model
with a stochastic context-free grammar. The methods presented can be viewed as natu-
ral generalizations of methods for analyzing strings by means of stochastic context-free
grammars, or of the idea of comparing two hidden Markov models in [ 10]. A natural
question is thus whether we can further extend the results to comparing two stochastic
context-free grammars. If we could determine the co-emission probability of (or just
the maximal joint probability of a pair of parses in) two stochastic context-free gram-
mars, we could also determine whether the languages of two context-free grammars
are disjoint, simply by assigning a uniform probability distribution to the derivations of
each of the variables and asking whether the computed probability is zero. However,
a well-known result in formal language theory states that it is undecidable whether
the languages of two context-free grammars are disjoint [ 17, Theorem 11.8.1]. Hence,
we cannot generalize the methods presented in this paper to methods comparing two
stochastic context-free grammars with a precision that allows us to determine whether
the true probability is zero or not.
In [10] we use the co-emission probability to compute the L2-distance between the
probability distributions of two hidden Markov models. Having demonstrated how to
compute the co-emission probability between a hidden Markov model and a stochas-
tic context-free grammar, the only thing further required to compute the L2-distance is
the co-emission probability of the grammar with itself. As stated above, the problem
the co-emission probability of the grammar with itself. As stated above, the problem
of computing the co-emission probability between two stochastic context-free gram-
mars is undecidable, but one could hope that computing the co-emission probability
of a stochastic context-free grammar with itself would be easier. However, given two
stochastic context-free grammars G1 and G2 we can construct an aggregate grammar G
where the start symbols of G1 and G2 are chosen with equal probability one half.
It is easy to see that the co-emission probability of G with itself is the sum of the
co-emission probabilities of G1 and G2 with themselves plus twice the co-emission
probability between G1 and G2. Hence, computing the co-emission probability of a
stochastic context-free grammar with itself, or the L2- or Hellinger-distances between
the probability distributions of a context-free grammar and a hidden Markov model, is
as hard as computing the co-emission probability between two stochastic context-free
grammars.
In this paper we have presented the co-emission probability as a measure for com-
paring two stochastic models. However, the co-emission probability has at least two
other interesting uses. First, it allows us to use one model as a prior for training the
other model, e.g., using the distribution over sequences of a hidden Markov model
as our prior belief about the distribution over sequences for a stochastic context-free
grammar we want to construct. Secondly, it allows us to compute the probability that
two stochastic models have independently generated a sequence s, given that the two models
generate the same sequence, i.e., we can combine two models under the assumption of
independence.
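One way to make the second use precise, writing C(G, M) for the co-emission probability as above and assuming the two models are independent, is

Pr(s | the two models generate the same sequence) = P_M(s) · P_G(s) / Σ_t P_M(t) · P_G(t) = P_M(s) · P_G(s) / C(G, M) ,

where the sum ranges over all sequences t.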
References
1. K. Asai, S. Hayamizu, and K. Handa. Prediction of protein secondary structure by the
hidden Markov model. Computer Applications in the Biosciences (CABIOS), 9:141-146,
1993.
2. J. K. Baker. Trainable grammars for speech recognition. In Speech Communications Papers
for the 97th Meeting of the Acoustical Society of America, pages 547-550, 1979.
3. G. A. Churchill. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathe-
matical Biology, 51:79-94, 1989.
4. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press,
1990.
5. R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabi-
listic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
6. J. Håstad, S. Phillips, and S. Safra. A well characterized approximation problem. Information
Processing Letters, 47(6):301-305, 1993.
7. B. Knudsen and J. Hein. RNA secondary structure prediction using stochastic context-free
grammars and evolutionary history. Bioinformatics, 15:446-454, 1999.
8. A. Krogh. Two methods for improving performance of an HMM and their application for
gene finding. In Proceedings of the 5th International Conference on Intelligent Systems for
Molecular Biology (ISMB), pages 179-186, 1997.
Assessing the Statistical Significance of Overrepresented Oligonucleotides
Alain Denise, Mireille Régnier, and Mathias Vandenbogaert
1 Introduction
Putative DNA recognition sites can be defined in terms of an idealized sequence that
represents the bases most often present at each position. Conservation of only very
short consensus sequences is a typical feature of regulatory sites (such as promoters)
in both prokaryotic and eukaryotic genomes. Structural genes are often organized into
clusters that include genes coding for proteins whose functions are related. Data from
the Arabidopsis genome project suggest that more than 5% of the genes of this plant
encode transcription factors. The necessity for the use of genomic analytical approaches
becomes clear when it is considered that less than 10% of these factors have been ge-
netically characterized. Transcription-factor genes comprise a substantial fraction of
all eukaryotic genomes, and the majority can be grouped into a handful of different,
often large, gene families according to the type of DNA-binding domain that they en-
code. Functional redundancy is not unusual within these families; therefore the proper
characterization of particular transcription-factor genes often requires their study in the
context of a whole family. The scope of genomic studies in this area is to find cis-
acting regulatory elements from a set of co-regulated DNA sequences (e.g. promoters).
The basic assumption is that a cluster of co-regulated genes is regulated by the same
transcription factors and the genes of a given cluster share common regulatory motifs.
In Section 2, we present and discuss the statistical criteria commonly used for eval-
uating over-representation of patterns in sequences. Section 3 is devoted to our mathe-
matical results. In Sections 4, 6 and 5, we validate our approach by a comparison with
published results derived by other methods that are computationally more expensive.
Finally, in Section 7, we discuss possible improvements and present further work.
In the present section, we present basic useful definitions for statistical criteria and we
briefly discuss their limits, e.g., the validity domains and the computational efficiency.
Below, we denote by O(H) the number of observations of a given pattern H in a given
sequence. Depending on the application, it may be either the number of occurrences
[14, 7] or the number of sequences where it appears [1, 23].
Z-Scores. Many definitions of this parameter can be found in the literature. Other
names can be used: see for instance the so-called contrast used in [14]. A common
feature is the comparison of the observation with the expectation, using the variance as
a normalization factor. A rather general definition is
Z(H) = (O(H) − E(H)) / √V(H) .   (1)
p-Values. For each word that occurs r times in a sequence or in a set of N (related)
sequences, one computes the probability that this event occurs just by chance:
pval(H) = P(O(H) ≥ r) .   (2)
When the expectation of a given word is much smaller than 1, a single occurrence is a
rare event. In this case, the p-value is defined as P(O(H) ≥ r | O(H) ≥ 1), i.e.:
pval(H) = P(O(H) ≥ r) / P(O(H) ≥ 1) .   (3)
The computation is performed in two steps. First, the probability that H occurs in a
given sequence, i.e., P(O(H) ≥ 1), is known. An exact formula is provided in [17]
and used in [7]. An approximate formula is often used, for instance in the software RSA-
tools (http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/) or
in [1]. Then two different cases occur.
Set of small sequences. The p-value in (2) is the probability that r sequences out of
N contain H; when they are independent, it is given by a binomial formula:
pval(H) = (N choose r) · P(O(H) ≥ 1)^r · (1 − P(O(H) ≥ 1))^(N−r)   (4)
Large sequences. One needs to compute P(O(H) ≥ r), through the exact formulae
in [17] or an approximation.
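As a small computational illustration of formula (4) above, the binomial expression can be evaluated directly; in the sketch below, p_single stands for P(O(H) ≥ 1) in a single sequence, which is assumed to be available from the exact or approximate formulas just mentioned.

```python
from math import comb

def binomial_pvalue(r, N, p_single):
    """Formula (4): probability that exactly r of the N independent
    sequences contain the pattern, given P(O(H) >= 1) = p_single for a
    single sequence."""
    return comb(N, r) * p_single ** r * (1.0 - p_single) ** (N - r)
```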
This p-value is evaluated in [23] through a significance coefficient. Given a motif H, the
significance coefficient is defined as
sig(H) = −log₁₀(pval(H) · D) ,
which takes into account the number D of different oligonucleotides. The number of
distinct oligonucleotides depends on whether one counts on a single strand or on both
strands.
3 Main Results
The model of random text that we work with is the Bernoulli model: one assumes
the text to be randomly generated by a memoryless source. Each letter s of the alphabet
has a given probability p_s of being generated at any step. Generally, the p_s are not equal.
Definition 1. Given a pattern H of length m on the alphabet S and a Bernoulli distri-
bution on the letters of S, the probability of H is defined as
P(H) = ∏_{i=1}^{m} p_{h_i} ,
where h_i denotes the i-th character of H. By convention, the empty string has probabil-
ity 1.
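As a small numeric illustration of Definition 1, the snippet below computes P(H) for a word under a memoryless model and, as an aside, a Z-score built from the common approximations E(H) ≈ (n − m + 1) P(H) for the expected number of occurrences in a text of length n and V(H) ≈ E(H) (the latter approximation is also used later for Table 3). The helper names and the approximations spelled out here are our own, not definitions taken from the text.

```python
from math import sqrt

def word_probability(word, probs):
    """P(H) as in Definition 1: product of the letter probabilities."""
    p = 1.0
    for letter in word:
        p *= probs[letter]
    return p

def z_score(word, probs, n, observed):
    """Z-score of a word in a text of length n, with E(H) ~ (n-m+1) P(H)
    and the variance approximated by the expectation."""
    expected = (n - len(word) + 1) * word_probability(word, probs)
    return (observed - expected) / sqrt(expected)

probs = {b: 0.25 for b in "ACGT"}
print(word_probability("ATT", probs))   # 0.015625
print(z_score("ATT", probs, 1000, 30))  # about 3.6
```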
Finding a pattern in a random text is, in some sense, correlated to the previous occur-
rences of the same or other patterns [13]. Hence for example, the probability of finding
H1 = ATT knowing that one has just found H 2 = TAT is - intuitively - rather good
since a T right after H2 is enough to give H 1 . Correlation polynomials and correlation
functions give a way to formalize this intuition.
Definition 2. The correlation set of two patterns H_i and H_j is the set of words w which
satisfy: there exists a non-empty suffix v of H_i such that vw = H_j. It is denoted A_{i,j}. If
H_i = H_j, then the correlation set is called the autocorrelation set of H_i.
Thus for example, the correlation set of H1 = ATT and H2 = TAT is A_{1,2} = {AT}; the
autocorrelation set of H1 is {ε}, while the autocorrelation set of H2 is {ε, AT}. The empty
string always belongs to the autocorrelation set of any pattern.
Definition 3. The correlation polynomial of two patterns H_i and H_j is A_{i,j}(z) =
Σ_{w ∈ A_{i,j}} P(w) z^{|w|}, where |w| denotes the length of the word w. If H_i = H_j, then this
polynomial is called the autocorrelation polynomial of H_i. The correlation function is
D_{i,j}(z) = (1 − z) A_{i,j}(z) + P(H_j) z^{m_j} .
When H_i = H_j, the correlation function can be written D_i.
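The correlation sets and polynomials of Definitions 2 and 3 can be computed directly from the definitions; the short sketch below is our own illustration, representing the polynomial A_{i,j}(z) by its list of coefficients, and it reproduces the ATT/TAT examples given above.

```python
def correlation_set(h_i, h_j):
    """Words w such that v + w = h_j for some non-empty suffix v of h_i."""
    result = []
    for k in range(len(h_i)):
        v = h_i[k:]                       # non-empty suffix of h_i
        if len(v) <= len(h_j) and h_j.startswith(v):
            result.append(h_j[len(v):])
    return result

def correlation_polynomial(h_i, h_j, probs):
    """Coefficients of A_{i,j}(z): index k holds the total probability
    of the length-k words in the correlation set."""
    coeffs = [0.0] * len(h_j)
    for w in correlation_set(h_i, h_j):
        p = 1.0
        for letter in w:
            p *= probs[letter]
        coeffs[len(w)] += p
    return coeffs

probs = {b: 0.25 for b in "ACGT"}
print(correlation_set("ATT", "TAT"))                 # ['AT']
print(correlation_set("ATT", "ATT"))                 # ['']
print(correlation_set("TAT", "TAT"))                 # ['', 'AT']
print(correlation_polynomial("TAT", "TAT", probs))   # [1.0, 0.0, 0.0625]
```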
The most common counting model is the overlapping model: overlapping occurrences
of patterns are taken into account. It is as follows. For example, consider two oligonu-
cleotides H1 = ATT, H2 = TAT and a sequence TTATTATATATT. This sequence contains
2 occurrences of H1 and 4 occurrences of H2: H1 = ATT occurs at positions 3-5 and
10-12, while H2 = TAT occurs at positions 2-4, 5-7, 7-9, and 9-11 (note that several
of the occurrences of H2 overlap).
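Counting occurrences in the overlapping model is straightforward with a sliding scan (or with a regular-expression lookahead); the sketch below, our own illustration, reproduces the counts 2 and 4 for the example sequence.

```python
def count_overlapping(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text[i:i + len(pattern)] == pattern)

seq = "TTATTATATATT"
print(count_overlapping(seq, "ATT"))   # 2
print(count_overlapping(seq, "TAT"))   # 4
```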
It turns out [3] that our main results rely on the computation of the (real) roots of a
polynomial equation:
Definition 4. Let a be a real number such that a > P(H1). Let (E_a) be the fundamental
equation:
D₁(z)² − (1 + (a − 1)z) D₁(z) − a z (1 − z) D₁′(z) = 0 .   (5)
Let z_a be the largest real positive solution of Equation (E_a) that satisfies 0 < z_a < 1.
The number z_a is called the fundamental root of (E_a).
and z_a is the fundamental root of (E_a). I(a) is called the rate function. Additionally:
Prob(O(H1) = k) ≈ (1 / (σ_a √(2πn))) · e^{−n·I(a) + δ_a} .   (10)
in A. thaliana where 5 approximate tandem repeats of a 40-uple were found. For all
patterns, the occurrence probability nP(H) ranges between 10⁻⁶ and 10⁻⁷. For each
oligonucleotide, the first value is the number of occurrences in the window and the sec-
ond one is the p-value computed by our large deviation formulae, where the correcting
term in (10) has been neglected. The third one is the p-value computed in [7] with a generat-
ing function method and the last one is the Z-score. We notice that, for any pattern, the
p-values computed with the two different methods are of the same order of magnitude (this is
illustrated in Figure 1, where the logarithms of the p-values are plotted). However, they can differ
by up to a factor of 1.72. This is due to the approximation done in our calculations. When a
increases, the difference z_a − 1 also increases, as does the contribution of the correcting term. Nev-
ertheless, it is worth noticing that the p-value order is almost the same. One inversion
occurs between patterns ACGGTTCAC and AAGACGGTT, which have similar p-values.
On the other hand, the last column of the table confirms that the Z-score is not adequate
for very rare events. Patterns AAGACGGTT and AATTGGCGG have the same Z-score of 48,
while their p-values have a ratio of 100. For patterns ACGACGCTT and ACGCTTGG, the two
parameters define a different order. The same inversion appears between AATTGGCGG
and TTTGTACCA.
In [23], van Helden et al. study the frequencies of oligonucleotides extracted from reg-
ulatory sites from the upstream regions of yeast genes. Statistical significance of the
oligonucleotides occurring in the 800 bp upstream sequences of regulatory regions
is assessed by evaluating the probability of observing r or more occurrences of the
oligonucleotide in the regulatory sequence, using the binomial formula. In [23], the
probabilities are not computed in the Bernoulli model. For a given oligonucleotide,
the authors count the number f of its occurrences in the non-coding sequences of yeast.
Then, an approximate formula for P(O(H) ≥ 1) is given and the p-value follows through a
computation of the binomial formula (4). It is observed in [23] that these binomial statistics
prove to be appropriate, except for self-overlapping patterns such as AAAAAA, ATATAT,
ATGATG. As a matter of fact, auto-correlation does not affect the expected occurrence
number, but increases the variance [8]. In other words, the probability to observe either
very high or very low occurrence values is increased for auto-correlated patterns.
Table 2 compares the results of several methods to compute the significance coef-
ficient defined above. Figure 2 presents a graphical view of this comparison.
In [1], Beaudoing et al. study polyadenylation signals in mRNAs of human genes. One
of their aims is to find several variants of the well known AAUAAA signal. For this pur-
pose, they select 5646 putative mRNA 3′ ends of length 50 nucleotides and search for
overrepresented hexamers. Pattern AAUAAA is clearly the most represented: it occurs
in 3286 sequences, for a total number of 3456 occurrences. Seeking for other (weaker)
signals involves searching for other overrepresented hexanucleotides. Nevertheless, it is
necessary to avoid artifacts, e.g. patterns that appear overrepresented because they are
similar to the first pattern. The algorithm designed by Beaudoing et al. consists in can-
celing all sequences where the overrepresented hexamer has been found. Hence, they
search for the most represented hexamer in the 2780 sequences which do not contain
the strong signal AAUAAA.
Here we show how Theorem 2 gives a procedure for dropping the artifacts of a
given pattern without canceling the sequences where it appears. Table 3 presents the 15
most represented hexamers in the sequences considered in [1]. Columns 2 and 3 respec-
tively give the observed number of occurrences and the rank according to this criterion.
Columns 4, 5 and 6 present the (non-conditioned) expected number of occurrences, the
corresponding Z-score and the rank of the hexamer according to this Z-score. Here, the
variance has been approximated by the expectation; this is possible as stated in [16].
Remark that rankings of columns 3 and 6 are quite similar: only patterns UAAAAA and
UAAAUA do not belong to both rankings. A number of motifs look like the canonical
one: they may be artifacts. This is confirmed by the three last columns which present,
respectively, the expected number of occurrences conditioned by the observed number
of occurrences of AAUAAA, the corresponding conditioned Z-score and the rank ac-
Table 3. Table of the most frequent hexanucleotides. Obs: number of observed oc-
currences. Rk: Rank. Exp.: (non-conditional) expectation. Cd.Exp.: Expectation condi-
tioned by number of occurrences of AAUAAA.
cording to this criterion. It is clear that artifacts are dropped, generally very far down
in the ranking. It is worth noticing that some patterns which seemed overrepresented
are actually avoided: this is the case for AUAAAU which goes down from 5th to last
place (among the 4096 possible hexamers, only 4078 are present in the sequences). As
AUAAAU is an artifact of the strong signal, this means that U is rather avoided right after
this signal.
The case of UUUUUU in rank 2 is particular: this pattern is effectively overrepre-
sented, but was not considered by Beaudoing et al. as a putative polyadenylation signal
because its position does not match the observed requirements (around -15/-16 nu-
cleotides upstream of the putative polyadenylation site). It should also be stated that the
approximation of the variance by the expectation that we use here for all patterns is not
as good for periodic patterns like UUUUUU as for others [16]. As a result, the variance of
UUUUUU is under-evaluated, so its actual Z-score is significantly lower than the one
given in the table.
Now over-representation of AUUAAA (rank 3) is obvious; this is the known first
variant of the canonical pattern. We remark that the following hexamer, UUAAAA, is an
artifact of AUUAAA. This suggests defining a conditional expectation, or, even better, a
p-value that takes into account the over-representation of two or more signals instead
of one: in this example, AAUAAA and AUUAAA. This extension of Theorem 2 is the
subject of future work.
As mentioned above, the Z-score is not precise enough, and this remark also
holds for conditioned Z-scores. In a second step, the authors of [1] computed a p-value
defined by formula (4). This formula is approximated by the incomplete Γ-function.
Nevertheless, any such computation is rather delicate and machine dependent, due to numer-
ous calls to exp and log functions. Numerical stability necessitates a very careful
use of floating-point precision. It is worth noticing that the large deviation principle applies for a
Bernoulli process, with explicit values for the rate function and the correcting term [3].
Acknowledgments
We thank Eivind Coward for making the EXCEP software available for us and Jacques
Van Helden for fruitful electronic discussions. This research was partially supported
by the IST Program of the EU under contract number 99-14186 (ALCOM-FT), the
REMAG Action from INRIA and IMPG French program.
References
10. L. Marsan and M.-F. Sagot. Extracting structured motifs using a suffix tree - algorithms and
application to promoter consensus identification. In RECOMB'00, pages 210-219. ACM,
2000. Proceedings RECOMB'00, Tokyo.
11. P. Nicodème. The symbolic package Regexpcount. In GCB'00, 2000. Presented at GCB'00,
Heidelberg, October 2000; available at
http://algo.inria.fr/libraries/software.html.
12. G. Nuel. Grandes déviations et chaînes de Markov pour l'étude des mots exceptionnels
dans les séquences biologiques. PhD thesis, Université René Descartes, Paris V, 2001. To be
defended in July 2001.
13. P. A. Pevzner, M. Borodovski, and A. Mironov. Linguistics of nucleotide sequences: The
significance of deviations from the mean, statistical characteristics and prediction of the
frequency of occurrences of words. J. Biomol. Struct. Dynam., 6:1013-1026, 1991.
14. E. M. Panina, A. A. Mironov, and M. S. Gelfand. Statistical analysis of complete bacterial
genomes: Avoidance of palindromes and restriction-modification systems. Genomics. Pro-
teomics. Bioinformatics, 34(2):215-221, 2000.
15. F. R. Roth, J. D. Hughes, P. E. Estep, and G. M. Church. Finding DNA regulatory motifs within
unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature
Biotechnol., 16:939-945, 1998.
16. M. Régnier, A. Lifanov, and V. Makeev. Three variations on word counting. In GCB'00,
pages 75-82. Logos-Verlag, 2000. Proc. German Conference on Bioinformatics, Heidelberg;
submitted to Bioinformatics.
17. M. Régnier and W. Szpankowski. On pattern frequency occurrences in a Markovian se-
quence. Algorithmica, 22(4):631-649, 1998. Preliminary draft at ISIT'97.
18. G. Reinert and S. Schbath. Compound Poisson approximation for occurrences of multiple
words in Markov chains. Journal of Computational Biology, 5(2):223-253, 1998.
19. G. Reinert, S. Schbath, and M. Waterman. Probabilistic and statistical properties of words:
An overview. Journal of Computational Biology, 7(1):1-46, 2000.
20. S. Robin and J. J. Daudin. Exact distribution of word occurrences in a random sequence of
letters. J. Appl. Prob., 36(1):179-193, 1999.
21. E. Rocha, A. Viari, and A. Danchin. Oligonucleotide bias in Bacillus subtilis: general trends
and taxonomic comparisons. Nucl. Acids Research, 26:2971-2980, 1998.
22. A. T. Vasconcelos, M. A. Grivet-Mattoso-Maia, and D. F. de Almeida. Short interrupted palin-
dromes on the extragenic DNA of Escherichia coli K-12, Haemophilus influenzae and Neis-
seria meningitidis. Bioinformatics, 16(11):968-977, 2000.
23. J. van Helden, B. André, and J. Collado-Vides. Extracting regulatory sites from the upstream
region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol.,
281:827-842, 1998.
24. M. Tompa. An exact method for finding short motifs in sequences, with application to the
ribosome binding site problem. In ISMB'99, pages 262-271. AAAI Press, 1999. Seventh In-
ternational Conference on Intelligent Systems for Molecular Biology, Heidelberg, Germany.
25. A. Vanet, L. Marsan, and M.-F. Sagot. Promoter sequences and algorithmical methods for
identifying them. Res. Microbiol., 150:779-799, 1999.
26. M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: con-
sensus and alignment. Bull. Math. Biol., 45:515-527, 1984.
27. M. Waterman. Introduction to Computational Biology. Chapman and Hall, London, 1995.
Pattern Matching and Pattern Discovery Algorithms for
Protein Topologies
Juris Vīksna and David Gilbert
1 Biological Motivation
Once the structure of a protein has been determined, the next task for the biologist is to
find hypotheses about its function. One possible approach is a pairwise comparison of
its structure with the structures of proteins whose functions are already known. There
are several tools that allow such comparisons, for example DALI [7] or CATH [11].
However there are two weaknesses with these approaches. Firstly, as the number of
proteins with a given structure is growing, the time needed to do such comparisons is
also growing. Currently there are about 15,000 protein structure descriptions deposited
in the Protein Data Bank [1], but in the future this number may grow significantly.
Secondly, even if a similarity with one or more proteins has been found, it may not be
apparent whether this may also imply functional similarity, especially if the similarity
is not very strong.
Another possibility is to try a similar approach at a structural level similar to that
used for sequences in the PROSITE database [6]. That is, precompute a database of
motifs for proteins with known structuresi.e., structural patterns which are associated
with some particular protein function. This effectively requires computing the maximal
common substructure for a set of structures. One such approach is that of CORA [10],
based on multiple structural alignments of protein sequences for given CATH families.
Both approaches have been successfully used for protein comparison at the se-
quence level. The main difficulty in adapting them to the structural level is the complex-
ity of the necessary algorithms: while exact sequence comparison algorithms work in
linear time, exact structure comparison algorithms may require exponential time, and
the situation only gets worse with algorithms for finding maximal common substruc-
tures. Another aspect of the problem is that it is far from clear which is the best way to
define structure similarity. There are many possible approaches, which require different
algorithmic methods and are likely to produce different results.
Our work is aimed at the development of efficient comparison and maximal com-
mon substructure algorithms using TOPS diagrams for structural topology descriptions,
at the definition of structure similarity in a natural way that arises from such formali-
sation, and at the evaluation of usefulness of such an approach. The drawback of our
approach is that TOPS diagrams are not very rich in information; however it has the ad-
vantage that it is still possible to design practical algorithms for this level of abstraction.
2 TOPS Diagrams
At a comparatively simple level, protein structures can be described using TOPS car-
toons (see [4,13,14]). A sample cartoon for 2bopA0 is shown in Figure 1(a); for compar-
ison a Rasmol-style picture is given in Figure 1(b). The cartoon shows the secondary
all topological information implied by such descriptions and there are no strict rules
governing the appearance of a TOPS cartoon for a given protein.
TOPS diagrams, developed by Gilbert et al. [5], are a more formal description of
protein structural topology and are based on TOPS cartoons. Instead of representing
spatial positions by element positions in a plane, a TOPS diagram contains information
about the grouping of β-strands in β-sheets (two adjacent elements in a β-sheet are
connected by an H-bond, which can be either parallel or anti-parallel) and some infor-
mation about relative orientation of elements (any two SSEs can be connected by either
left or right chirality). Note that, in the topological sense, we reduce the set of atomic
hydrogen bonds between a pair of strands to a single H-bond relationship between the
strands. In principle chiralities can be defined between any two SSEs; however only
a subset of the most important chiralities is included in TOPS diagramsthis subset
roughly corresponds to the implicit position information in TOPS cartoons. A TOPS di-
agram can be regarded as a graph with four different types of vertices (corresponding to
up- or down-oriented strands and up- or down-oriented helices) and four different types
of edges (corresponding to parallel or antiparallel H-bonds and left or right oriented
chiralities). Moreover, the corresponding graph is ordered: each vertex is assigned a
unique number from 1 to n, where n is the total number of vertices. In Figure 2 the
ordering is also indicated by placing the vertices in the order of increasing numbers
(looking from left to right).
if there is an edge between v and w, then there is an edge (of the same type) between
F (v) and F (w).
Figure 3 shows one of the possible patterns that matches the diagram for 2bopA0 by
mapping vertices with numbers 1, 2, 3, 4, 5, and 6, corresponding to vertices with
numbers 1, 2, 4, 6, 7, and 8. In practice, however, it might be useful to make the pattern
definition more complicated. There might be reasons to require that close vertices in
pattern (i.e., vertices with close numbers) are to be mapped to close vertices in the
target diagram (for some natural notion of close). Alternatively it might be useful to
require that the target graph does not contain extra edges between vertices to which
pattern graph vertices are mapped (in this case the pattern graph must be an induced
subgraph of the target graph).
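To make this notion concrete, the sketch below represents a TOPS diagram as a vertex-ordered, labelled graph and checks the basic matching condition by brute force. The class name, the dictionary-based representation and the exhaustive search are illustrative assumptions only, not the authors' implementation; their efficient algorithm is described in Section 6.

# A minimal sketch, assuming vertices are numbered 1..n and edges are stored
# as (smaller vertex, larger vertex) -> edge label.
from itertools import combinations

class TopsDiagram:
    def __init__(self, vertex_labels, edges):
        # vertex_labels[i] is the label ('e+', 'e-', 'h+', 'h-') of vertex i+1;
        # edges maps (v, w) with v < w to a label such as 'P', 'A', 'L' or 'R'
        self.vertex_labels = vertex_labels
        self.edges = dict(edges)

    def n(self):
        return len(self.vertex_labels)

def matches(pattern, target):
    """True if there is an order- and label-preserving injective mapping of
    pattern vertices to target vertices under which every pattern edge maps
    to a target edge of the same type."""
    p, t = pattern.n(), target.n()
    for chosen in combinations(range(1, t + 1), p):   # order-preserving choices
        mapping = dict(zip(range(1, p + 1), chosen))
        if any(pattern.vertex_labels[v - 1] != target.vertex_labels[mapping[v] - 1]
               for v in range(1, p + 1)):
            continue
        if all(target.edges.get((mapping[v], mapping[w])) == lab
               for (v, w), lab in pattern.edges.items()):
            return True
    return False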
If we want to compare a target TOPS diagram to a set of diagrams, we can do
this by pairwise comparisons between the target and each member of the comparison set; each
such comparison can be made by finding a largest common pattern for two diagrams
and assigning a similarity measure based on the size of the pattern and the sizes of the
two diagrams. Alternatively, if we want to use a motif-based approach, we can find the
largest common patterns for a given set of proteins, consider these patterns as motifs,
and check whether a pattern for some motif matches the diagram of a target protein.
In practice the definition of a motif may be more complicated; for example, it may
include several patterns or some additional information.
Several algorithms for protein comparison based on the notion of patterns have al-
ready been developed and implemented by David Gilbert. The system is available at
http://www3.ebi.ac.uk/tops/; it permits searching for proteins that match a
given pattern or performing pattern-based comparisons of TOPS descriptions of pro-
teins. Our current task is to implement the more efficient algorithms that we describe
here. These algorithms will permit the fast generation of motif databases, which we
plan to make available on the web.
4 Experimental Results
4.1 Methodology and Databases
In the experiments that we have performed to date we have tried to estimate the useful-
ness of pattern-based protein motifs, i.e., the probability that a protein matching a given
motif also has some real similarity with the other proteins characterised by the same
motif. To do this, we have compared our approach against the existing CATH protein
classification database. CATH [11] is a
hierarchical classification of protein domain structures, which clusters proteins at four
major levels: Class (C), Architecture (A), Topology (T) and Homologous superfam-
ily (H). There are four different C classes: mainly alpha (class 1), mainly beta (class
2), alpha-beta (class 3) and low secondary structure content (class 4). In most cases C
classes are assigned automatically. The architecture level describes the overall shape of
the domain structure according to orientations of the secondary structures; classes in
this level are assigned manually. Classes in the topology level depend on both the over-
all shape and connectivity of the secondary structures and are assigned automatically
by the SSAP algorithm. Classes in the homologous superfamily level group together
protein domains which are thought to share a common ancestor and can therefore be
described as homologous. They are assigned automatically from the results of sequence
comparisons and structure comparisons (using SSAP).
Our comparisons are based on the assumption that identical CATH numbers will
also imply some similarity of the TOPS diagrams for the corresponding proteins. The
TOPS Atlas database [13], containing 2853 domains and based on clustering structures
from the protein data bank [1] using the standard single linkage clustering algorithm at
95% sequence similarity, was selected as the data set for this investigation. Structures
with identical CATH numbers (to a given level) have been placed in one group and
a maximal common pattern for this group has been computed. Then the pattern was
matched against all structures in the selected subset and the quality q of the pattern,
corresponding to the positive predictive value, was computed as follows:
q = number of proteins in a given group / number of successful matches
Thus, q = 1 corresponds to a good pattern (no false positives) and the value of q is
lower for less good patterns.
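A minimal sketch of this quality measure, assuming a match predicate like the one sketched earlier; the function name and arguments are illustrative.

def pattern_quality(pattern, group, database, matches):
    """Positive predictive value q of `pattern` for `group` (a subset of
    `database` sharing a CATH number): q = |group| / (number of diagrams in
    the database that the pattern matches).  Every member of the group is
    assumed to match its own common pattern."""
    hits = sum(1 for diagram in database if matches(pattern, diagram))
    return len(group) / hits if hits else 0.0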
4.2 Results
The experiments were performed using the CATH number identity at levels A, T, and
H. The CATH number identity at the A level was clearly insufficient to guarantee any
similarity at the TOPS diagram level; somewhat more surprising was the fact that iden-
tity at the T (topological) level still produced noticeably weaker results than identity at
the H level. Results for the latter are shown in Figure 4. Here the values of q for all
domains from the data set (in lexicographical order by CATH numbers) are shown. The
first 527 structures correspond to CATH class 1 (mainly α), the next 1048 to class 2
(mainly β), the following 1151 to class 3 (α-β) and the last 124 to class 4 (weak sec-
ondary structure contents). As can be expected q values are small for class 4, since there
is very little secondary structure information and also for class 1, since in mainly alpha
domains there are few H-bonds and the corresponding TOPS diagrams contain little
information about topology. Better q values can be observed for classes 2 and 3. Fig-
ure 5 shows q values (in light-grey) for class 3. Here the proteins have been reordered
according to increasing q values. As can be seen, in about 36% of cases the q value is 1,
i.e., the CATH number is uniquely defined by a TOPS pattern. Also, there are not many
proteins with q values close to, but less than 1. Therefore, if a pattern has been shown
to be good for known proteins, it is likely that it will remain good for new, as yet un-
classified, proteins. For comparison the figure also contains values (in dark-grey) where
q values have been computed using only secondary-structure sequence patterns instead
of complete TOPS diagrams. This demonstrates that good sequence patterns only exist
for approximately 8% of structures. The superiority of sequence patterns for one group
is caused by different definitions of the largest pattern.
Fig. 6. Quality of TOPS Patterns for CATH Class 3 Ordered by the Size of Patterns.

Figure 6 contains the same data as Figure 5, but initially ordered by pattern size as
computed by the number of SSEs in the pattern, and then by q values. It can be seen that
we start to get good q values when the number of SSEs reaches 7 or 8 (proteins with
numbers from 459 or 531 on the horizontal axis), and that q values are good in most cases
when the number of SSEs reaches 11 (proteins with numbers from 800 on the horizontal
axis). Therefore, if a protein contains 7 or more SSEs, there is a good chance that it will
have a good pattern and, if it contains 11 or more SSEs, then in most cases it will have
a good pattern.
Thus, the results obtained so far suggest that a database of pattern motifs could
be quite useful for comparison of those proteins that have sufficiently rich secondary-
structure content and especially for proteins with a large number of strands. This is not
the largest subgroup of all proteins; however for this subgroup there are good chances
that comparison with TOPS motifs will give biologically useful information. Of course,
TOPS diagrams contain limited information about secondary structure; thus we can
expect that motifs based on richer secondary structure models may give better results.
At the same time the TOPS formalism has the advantage that all computations can be
performed comparatively quickly. The exact computation times are very dependent on
the given data, but in general we have observed that the comparison of a given protein
against a database of about 1000 motifs requires less than 0.1 second on an ordinary
600 MHz PC workstation. The discovery of motifs and associated evaluation via pattern
matching over the TOPS Atlas has been done in about 2 hours on the same equipment.
5.1 Definitions
We order the edges of a vertex-ordered graph lexicographically by the positions of their
endpoints: p((v1, w1)) < p((v2, w2)) if and only if p(v1) < p(v2), or p(v1) = p(v2) and p(w1) < p(w2),
and assign to edges numbers {1, 2, . . . , |E|} according to this order. We call these num-
bers edge positions and denote them by p(e).
A graph G = (V, E) is vertex- (edge-) labelled with a set of labels S if there is given
a function l_v : V → S (and, for edge labels, l_e : E → S). We denote the label of a vertex
v ∈ V by l(v) and the label of an edge e ∈ E by l(e). For given vertex-ordered and
vertex- and edge-labelled graphs G1 = (V1, E1) and G2 = (V2, E2), we say that G1 is
isomorphic to a subgraph of G2 if there is an injective mapping I from V1 to V2 such
that: I preserves vertex labels, i.e. l(v) = l(I(v)) for all v ∈ V1; I preserves vertex order,
i.e. p(v) < p(w) implies p(I(v)) < p(I(w)); and I maps edges to edges of the same label,
i.e. if (v, w) ∈ E1 then (I(v), I(w)) ∈ E2 and l((v, w)) = l((I(v), I(w))).
Since each edge is uniquely determined by two vertices we can extend the isomorphism
I to edges by defining I((v, w)) = (I(v), I(w)). Then I preserves edge order just like
it preserves vertex order, i.e., for e1, e2 ∈ E1, p(e1) < p(e2) implies p(I(e1)) < p(I(e2)).
We can consider a TOPS diagram as a vertex-ordered and vertex- and edge-labelled graph
with the set of vertex labels S_V = {e+, e-, h+, h-} (up- or down-oriented strand or up- or
down-oriented helix) and the set of edge labels S_E = {P, A, L, R, PL, PR, AL, AR}
(parallel or antiparallel H-bonds or left- or right-oriented chiralities or a combination
of an H-bond and a chirality). In practice, P edges are only permitted between e+ and e+
or e- and e- vertices, and A edges are allowed only between e+ and e- or e- and e+
vertices, but this is not essential here.
For practical purposes it is also worth noting the complexity of graphs that have to
be dealt with in the TOPS formalism: the maximal number of vertices is around 50 and
the number of edges is comparatively small and similar to the number of vertices.
Let P be a TOPS pattern, D 1 and D2 be TOPS diagrams, and G(P ), G(D 1 ) and
G(D2 ) be the graphs corresponding to these patterns or diagrams. Then the problem
of checking whether TOPS pattern P will match diagram D 1 is equivalent to checking
whether G(P ) is isomorphic to a subgraph of G(D 1 ). Similarly, the problem of finding
a largest common pattern P of D1 and D2 is equivalent to finding a largest common
subgraph G(P) of G(D1) and G(D2).
First, it is easy to see that the subgraph isomorphism problem for vertex-ordered graphs
remains NP-complete, since the maximal clique problem is NP-complete, and this is
not altered by vertex ordering. Also, the relatively small number of edges cannot be
exploited to obtain polynomial algorithms, since in [3] and [15] similar graph structures
are considered that are even simpler (the vertex degree is 0 or 1) and for such graphs
the subgraph isomorphism problem is proved to be NP-complete. In [3] an algorithm
is given that is polynomial with respect to the number of overlapping edges; however,
in TOPS this number tends to be quite large.
There are several good non-polynomial algorithms for subgraph isomorphism, the
two most popular being by Ullmann [12] and McGregor [9]. Although these are not eas-
ily adaptable to vertex-ordered graphs, the vertex ordering seems to be the property that
could considerably improve the algorithm efficiency. Our algorithm can be regarded as
a variant of a method based on constraint satisfaction [9]; however there is an additional
mechanism for periodically recomputing constraints. A very similar class of graphs has
also been considered by I. Koch, T. Lengauer and E. Wanke in [8] where the authors de-
scribe a maximal common subgraph algorithm based on searching for maximal cliques
in a vertex product graph. This method seems to be applicable also for TOPS; however
it is only practical for finding maximal common subgraphs for two graphs and is not
directly useful for finding motifs for larger sets of proteins.
We have developed a subgraph isomorphism algorithm that exploits the fact that the
graphs are vertex oriented. Initially, let us assume that we are dealing with graphs that
are connected and do not contain isolated vertices (this set is also the most important
in practice). Then an isomorphism mapping I is uniquely determined by defining the
mapping of edges.
The algorithm tries to match edges in the increasing order of edge positions and
backtracks if for some edge a match cannot be found. Since the graphs are ordered, the
positions in the target graph to which a given edge may be mapped and which have
to be checked can only increase. Two additional ideas are used to make this process
more efficient. Firstly, we assign a number of additional labels to vertices and edges.
Secondly, if an edge e can not be mapped according to the existing mapping for previous
edges, then the next place where this edge can be mapped according to the labels is
found, and the minimal match positions of previous edges are advanced in order to be
compatible with the minimal position of e.
6.1 Labelling
By definition vertices and edges are already assigned labels l_v and l_e, respectively,
that must be preserved by an isomorphism mapping. Additionally, we use another kind
of label for both vertices and edges, which we call Index. For a vertex v, Index(v) is a
16-tuple of integers (containing twice as many elements as there are edge labels). The
ith element of Index(v) is the number of edges (x, v) with l_e((x, v)) equal to the ith
possible value of l_e (according to some initially fixed order of labels). Similarly, the
(k + i)th element of Index(v) is the number of edges (v, x) with l_e((v, x)) equal to
the ith possible value of l_e. Thus, the value Index(v) encodes the numbers of incoming
and outgoing edges of all possible types for a given vertex v. For an edge e = (v, w),
Index(e) is a 4-tuple of integers (S1, S2, E1, E2), where S1 is the number of edges
(v, x) with p(x) < p(w), S2 is the number of edges (v, x) with p(x) > p(w), E1 is
the number of edges (y, w) with p(y) < p(w), and E2 is the number of edges (y, w)
with p(y) > p(w). The edge index describes how many shorter or longer other edges
are connected to the endpoints of a given edge. For both vertices and edges we define
Index(x) ≤ Index(y) if the inequality holds between all corresponding pairs of elements
of the 16-tuples (or 4-tuples). It is easy to see that for any vertex or edge x we must have
Index(x) ≤ Index(I(x)).
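The following sketch computes the vertex Index and the componentwise comparison used as a necessary matching condition; edge indices are built analogously. The fixed order of edge labels and the graph representation are assumptions carried over from the earlier sketch.

EDGE_LABELS = ['P', 'A', 'L', 'R', 'PL', 'PR', 'AL', 'AR']   # assumed fixed order

def vertex_index(graph, v):
    """2k-tuple: counts of edges (x, v) and edges (v, x) per edge label, where
    edges are keyed (smaller endpoint, larger endpoint)."""
    k = len(EDGE_LABELS)
    idx = [0] * (2 * k)
    for (x, y), lab in graph.edges.items():
        if y == v:
            idx[EDGE_LABELS.index(lab)] += 1        # edge (x, v)
        elif x == v:
            idx[k + EDGE_LABELS.index(lab)] += 1    # edge (v, x)
    return tuple(idx)

def index_dominates(ix, iy):
    """Necessary condition Index(x) <= Index(y), componentwise; a pattern
    vertex x may only be mapped to a target vertex y if this holds."""
    return all(a <= b for a, b in zip(ix, iy))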
6.2 Algorithm
We assume that graphs are given as arrays PV, PE, TV and TE, where PV is an array
of vertices in the pattern graph with PV[i] being the vertex v with p(v) = i, PE is an
array of edges in the pattern graph with PE[i] being the edge e with p(e) = i, and
TV and TE are similar arrays for the target graph. For an edge e of the pattern graph, the
list Matches(e) contains all possible positions (in increasing order) in the target graph to
which e can be matched according to vertex and edge labels and Index values. By
Matches(e)[i] we denote the ith element of this list. The number Next(e) is the first
position in the Matches(e) list at which it may still be worth trying to match the edge.
Initially, Next(e) = 1 for all edges. For a vertex v, the number Pos(v) is the
position in the target graph to which v is matched, or Pos(v) = 0 if v is not yet matched.
Algorithm 1 shows the main loop. Starting from the first edge the algorithm tries
to find matches for all edges in increasing order and returns an array Pos of vertex
mappings, if it succeeds. If for some edge a match consistent with the matches for previous
edges cannot be found, the procedure AdvanceEdgeMatchPositions is invoked, which
tries to increase the values Next(e) for some of the already matched edges; the matching
process then continues from the first edge for which the value Next(e) has been
changed.
Procedure AdvanceEdgeMatchPositions uses a variant of depth-first search to find
edges for which Next(e) can be increased. Alternative strategies are of course possible.
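The sketch below captures the overall edge-by-edge matching strategy, using plain chronological backtracking in place of the AdvanceEdgeMatchPositions mechanism. It assumes the graph representation of the earlier sketches and is meant only to illustrate the search order, not to reproduce the efficiency of the algorithm described here.

def find_isomorphism(pattern, target):
    """Backtracking search for a label- and order-preserving subgraph
    isomorphism of a connected pattern into a target; pattern edges are
    processed in increasing position order."""
    pe = sorted(pattern.edges.items())   # pattern edges, keys (v, w) with v < w
    te = sorted(target.edges.items())    # target edges in position order
    pos = {}                             # pattern vertex -> target vertex

    def consistent(v, tv):
        if pattern.vertex_labels[v - 1] != target.vertex_labels[tv - 1]:
            return False
        if v in pos:
            return pos[v] == tv
        if tv in pos.values():
            return False                 # injectivity
        # strict vertex-order preservation against every vertex mapped so far
        return all((u < v) == (tu < tv) for u, tu in pos.items())

    def extend(k):
        if k == len(pe):
            return dict(pos)
        (v, w), lab = pe[k]
        for (tv, tw), tlab in te:
            if tlab != lab or not consistent(v, tv):
                continue
            new_v = v not in pos
            if new_v:
                pos[v] = tv
            if consistent(w, tw):
                new_w = w not in pos
                if new_w:
                    pos[w] = tw
                result = extend(k + 1)
                if result:
                    return result
                if new_w:
                    del pos[w]
            if new_v:
                del pos[v]
        return None

    return extend(0)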
6.3 Correctness
The informal argument that the algorithm correctly finds an isomorphic subgraph (or
reports that no isomorphic subgraph exists) is the following. First, as already
noted above, for connected oriented graphs the isomorphism mapping is completely
defined by defining the mapping for edges. For an isomorphism mapping it is sufficient
to satisfy the labelling constraints on edge endpoints, preserve edge order and connec-
tivity. If the AdvanceEdgeMatchPositions procedure is not used, the algorithm simply
performs an exhaustive search of all mappings satisfying these constraints and either
procedure AdvanceEdgeMatchPositions(v, vt);
begin
  PatternVertexStack ← ∅; push(PatternVertexStack, v);
  TargetVertexStack ← ∅; push(TargetVertexStack, vt);
  while PatternVertexStack ≠ ∅ do
    pvert ← pop(PatternVertexStack); tvert ← pop(TargetVertexStack);
    foreach edge e with p(e) < k, Moved(e) = false, and with endpoint pvert do
      Moved(e) ← true;
      Find the smallest i ≥ Next(e) such that, for (vt2, wt2) = Matches(e)[i], we
      have wt2 ≥ tvert (or vt2 ≥ tvert, if pvert is the rightmost endpoint of e);
      if such an i is found then
        Next(e) ← i;
        Let newpvert be the other endpoint of e;
        newtvert ← vt2 (or newtvert ← wt2, if pvert is the rightmost endpoint of e);
        if Pos(newpvert) ≠ 0 then
          Pos(newpvert) ← 0;
          push(PatternVertexStack, newpvert);
          push(TargetVertexStack, newtvert);
        end
      else
        return Not Isomorphic
      end
    end
  end
end
Algorithm 2: The Depth-First Search for Edges.
finds one, or returns an answer that no such mapping exists. If the AdvanceEdgeMatch-
Positions procedure is included, then when invoked it receives a vertex v in the pattern
graph and the first vertex vt in the target to which v may be mapped according to the
search performed so far. The constraints on edge mappings are then narrowed down to be con-
sistent with the mapping requirement for vertex v.
To deal with graphs that may be disconnected (but do not have isolated vertices) we
additionally have to check that the vertex positions are preserved by isomorphism map-
ping, i.e., for vertices v and w in pattern graph with p(v) < p(w) we must have
p(I(v)) < p(I(w)). If we have isolated vertices, we additionally have to check that
the sequence of vertices between v and w is a substring of the sequence of vertices
between I(v) and I(w). This additional checking can be easily incorporated into the
algorithm.
Acknowledgments
Juris Vīksna was supported by a Wellcome Trust International Research Award.
References
1. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H.,
Shindyalov, I.N., Bourne, P.E.: The Protein Data Bank. Nucleic Acids Research 28 (2000)
235-242.
2. Bron, C., Kerbosch, J.: Algorithm 457: Finding all cliques of an undirected graph. Communi-
cations of ACM 16 (1973) 575-577.
3. Evans, P.A.: Finding common subsequences with arcs and pseudoknots. Proceedings of Com-
binatorial Pattern Matching 1999, LNCS 1645 (1999) 270-280.
4. Flores, T.P.J., Moss, D.M., Thornton, J.M.: An algorithm for automatically generating protein
topology cartoons. Protein Engineering 7 (1994) 31-37.
5. Gilbert, D., Westhead, D.R., Nagano, N., Thornton, J.M.: Motif-based searching in tops pro-
tein topology databases. Bioinformatics 15 (1999) 317-326.
6. Hofmann, K., Bucher, P., Falquet, L., Bairoch, A.: The PROSITE database, its status in 1999.
Nucleic Acids Research 27 (1999) 215-219.
7. Holm, L., Park, J.: DaliLite workbench for protein structure comparison. Bioinformatics 16
(2000) 566-567.
8. Koch, I., Lengauer, T., Wanke, E.: An algorithm for finding maximal common subtopologies
in a set of protein structures. Journal of Computational Biology 3 (1996) 289-306.
9. McGregor, J.J.: Relational consistency algorithms and their application in finding subgraph
and graph isomorphisms. Information Science 19 (1979) 229-250.
10. Orengo, C.A.: CORA - topological fingerprints for protein structural families. Protein Sci-
ence 8 (1999) 699-715.
11. Orengo, C.A., Michie, A.D., Jones, S., Swindells, M.B.: CATH - a hierarchic classification
of protein domain structures. Structure 5 (1997) 1093-1108.
12. Ullmann, J.R.: An algorithm for subgraph isomorphism. Journal of the ACM 23 (1976) 31-
42.
13. Westhead, D.R., Hatton, D.C., Thornton, J.M.: An atlas of protein topology cartoons avail-
able on the World Wide Web. Trends in Biochemical Sciences 23 (1998) 35-36.
14. Westhead, D.R., Slidel, T.W.F., Flores, T.P.J., Thornton, J.M.: Protein structural topology:
automated analysis and diagrammatic representation. Protein Science 8 (1999) 897-904.
15. Zhang, K., Wang, L., Ma, B.: Computing similarity between RNA structures. Proceedings of
Combinatorial Pattern Matching 1999, LNCS 1645 (1999) 281-293.
Computing Linking Numbers of a Filtration
1 Introduction
In this paper, we develop fast algorithms for computing the linking numbers of simpli-
cial complexes. Our work is within a framework of applying computational topology
methods to the fields of biology and chemistry. Our goal is to develop useful tools for
researchers in computational structural biology.
Motivation and Approach. In the 1980s, it was shown that DNA, the molecular
structure of the genetic code of all living organisms, can become knotted during repli-
cation [1]. This finding initiated interest in knot theory among biologists and chemists
for the detection, synthesis, and analysis of knotted molecules [8]. The impetus for
this research is that molecules with non-trivial topological attributes often display ex-
otic chemistry. Taylor recently discovered a figure-of-eight knot in the structure of a
plant protein by examining 3,440 proteins using a computer program [19]. Moreover,
chemical self-assembly units have been used to create catenanes, chains of interlocking
molecular rings, and rotaxanes, cyclic molecules threaded by linear molecules. Re-
searchers are building nanoscale chemical switches and logic gates with these struc-
tures [2,3]. Eventually, chemical computer memory systems could be built from these
building blocks.
Catenanes and rotaxanes are examples of non-trivial structural tanglings. Our work
is on detecting such interlocking structures in molecules through a combinatorial
method, based on algebraic topology. We model biomolecules as a sequence of alpha
complexes [7]. The basic assumption of this representation is that an alpha-complex
sequence captures the topological features of a molecule. This sequence is also a fil-
tration of the Delaunay triangulation, a well-studied combinatorial object, enabling the
development of fast algorithms.
The focus of this paper is the linking number. Intuitively, this invariant detects if
components of a complex are linked and cannot be separated. We hope to eventually in-
corporate our algorithm into publicly available software as a tool for detecting existence
of interlocked molecular rings.
Given a filtration, the main contributions of this paper are:
(i) the extension of the definition of the linking number to graphs, using a canonical
basis,
(ii) an algorithm for enumerating and generating all cycles and their spanning surfaces
within a filtration,
(iii) data structures for efficient enumeration of co-existing pairs of cycles in different
components,
(iv) an algorithm for computing the linking number of a pair of cycles,
(v) and the implementation of the algorithms and experimentation on real data sets.
Algorithm (iv) is based on spanning surfaces of cycles, giving us an approximation to
the linking number in the case of non-orientable or self-intersecting surfaces. Such cases
do not arise often in practice, as shown in Section 6. However, we note in Section 2 that
the linking number of a pair may be also computed by alternate algorithms. Regardless
of the approach taken, pairs of potentially linked cycles must be first detected and enu-
merated. We provide the algorithms and data structures of such enumeration in (i-iii).
Prior Work. Important knot problems were shown to be decidable by Haken in his
seminal work on normal surfaces [10]. This approach, as reformulated by Jaco and
others [13], forms the basis of many current knot detection algorithms. Hass et al.
recently showed that these algorithms take exponential time in the number of crossings
in a knot diagram [12]. They also placed both the UNKNOTTING PROBLEM and the
SPLITTING PROBLEM in NP, the latter being the focus of our paper. Generally, other
approaches to knot problems have unknown complexity bounds, and are assumed to
take at least exponential time. As such, the state of the art in knot detection only allows
for very small data sets. We refer to Adams [1] for background in knot theory.
Three-dimensional alpha shapes and complexes may be found in Edelsbrunner and
Mucke [7]. We modify the persistent homology algorithm to compute cycles and sur-
faces [6]. We refer to Munkres [15] for background in homology theory that is accessi-
ble to non-specialists.
Outline. The remainder of this paper is organized as follows. We review linking num-
bers for collections of closed curves, and extend this notion to graphs in R 3 in Section 2.
We describe our model for molecules in Section 3. Extending the persistence algorithm,
we design basic algorithms in Section 4 and use them to develop an algorithm for com-
puting linking numbers in Section 5. We show results of some initial experiments in
Section 6, concluding the paper in Section 7.
2 Linking Number
In this section, we define links and discuss two equivalent definitions of the linking
number. While the first definition provides intuition, the second definition is the basis
of our computational approach.
(a) A Link Diagram for the Whitehead Link. (b) Crossing Label Convention.
Fig. 1. The Whitehead link (a) is labeled according to the convention (b) that the cross-
ing label is +1 if the rotation of the overpass by 90 degrees counter-clockwise aligns its
direction with the underpass, and -1 otherwise.
Linking Number. A knot (link) invariant is a function that assigns equivalent objects
to equivalent knots (links). Seifert first defined an integer link invariant, the linking
number, in 1935 to detect link separability [18]. Given a link diagram for a link l, we
choose orientations for each knot in l. We then assign integer labels to each crossing
between any pair of knots k, k', following the convention in Figure 1(b). Let the linking
number λ(k, k') of the pair of knots be one half the sum of these labels. A standard argument
using Reidemeister moves shows that λ is an invariant for equivalent pairs of knots up to
sign [1]. The linking number λ(l) of a link l is

λ(l) = Σ_{k ≠ k' ⊆ l} |λ(k, k')|.
We note that λ(l) is independent of knot orientations. Also, the linking number does
not completely recognize linking. The Whitehead link in Figure 1(a), for example, has
linking number zero, but is not separable.
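As a small illustration of this first definition, the following sketch computes λ from lists of signed crossing labels; the crossing lists used in the example at the end are illustrative.

def linking_number_from_crossings(crossings):
    """lambda(k, k'): half the sum of the +1/-1 labels of the crossings between
    two oriented knots; the absolute value makes it independent of the chosen
    orientations."""
    return abs(sum(crossings)) / 2

def linking_number_of_link(pairwise_crossings):
    """lambda(l): sum of |lambda(k, k')| over unordered pairs of knots in l.
    `pairwise_crossings` maps a pair of knot identifiers to its crossing labels."""
    return sum(linking_number_from_crossings(c) for c in pairwise_crossings.values())

# Two crossings of equal sign (as in the Hopf link) give lambda = 1, while
# labels that cancel give linking number zero, as for the Whitehead link.
assert linking_number_from_crossings([+1, +1]) == 1
assert linking_number_from_crossings([+1, -1, +1, -1]) == 0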
Surfaces. The linking number may be equivalently defined by other methods, including
one based on surfaces [17]. A spanning surface for a knot k is an embedded surface with
boundary k. An orientable spanning surface is a Seifert surface. Because it is orientable,
Fig. 2. The Hopf link and Seifert surfaces of its two unknots are shown on the left.
Clearly, λ = 1. This link is the 200th complex for data set H in Section 6. The span-
ning surface produced for the cycle on the right is a Möbius strip and therefore non-
orientable.
we may label its two sides as positive and negative. We show examples of spanning
surfaces for the Hopf link and the Möbius strip in Figure 2. Given a pair of oriented knots
k, k', and a Seifert surface s for k, we label s by using the orientation of k. We then
adjust k' via a homotopy h until it meets s in a finite number of points. Following along
k' according to its orientation, we add +1 whenever k' passes from the negative to the
positive side, and -1 whenever k' passes from the positive to the negative side. The
following lemma asserts that this sum is independent of the choice of h and s, and
it is, in fact, the linking number.
SEIFERT SURFACE LEMMA. λ(k, k') is the sum of the signed intersections between
k' and any Seifert surface for k.
The proof is by a standard Seifert surface construction [17]. If the spanning surface is
non-orientable, we can still count how many times we pass through the surface, giving
us the following weaker result.
SPANNING SURFACE LEMMA. λ(k, k') (mod 2) is the parity of the number of times
k' passes through any spanning surface for k.
Graphs. We need to extend the linking number to graphs in order to use the above
lemmas for computing linking numbers for simplicial complexes. Let G = (V, E) be a
simple undirected graph in R^3 with c components G^1, ..., G^c. Let z_1, ..., z_m
be a fixed basis for the cycles in G, where m = |E| - |V| + c. We then define the linking
number between two components of G to be λ(G^i, G^j) = Σ |λ(z_p, z_q)|, summed over
all basis cycles z_p, z_q in G^i, G^j, respectively. The linking number of G is then defined
by combining the total interaction between pairs of components:

λ(G) = Σ_{i ≠ j} λ(G^i, G^j).
The linking number is computed only between pairs of components, following Seifert's
original definition. Linked cycles within the same component may be easily unlinked
by a homotopy. Figure 3 shows that the linking number for graphs is dependent on the
chosen basis. While it may seem that we want λ(G) = 1 in the figure, there is no clear
answer in general. We will define a canonical basis in Section 4 using the persistent
homology algorithm to compute λ(G) for simplicial complexes.

Fig. 3. We get different λ(G) for graph G (top) depending on our choice of basis for
G^2: two small cycles (left) or one large and one small cycle (right).
3 Alpha Complexes
Our approach to analyzing a topological space is to assume a filtration for such a space.
A filtration may be viewed as a history of a growing space that is undergoing geometric
and topological changes. While filtrations may be obtained by various methods, only
meaningful filtrations give meaningful linking numbers. As such, we use alpha com-
plex filtrations to model molecules. The alpha complex captures the connectivity of a
molecule that is represented by a union of spheres. This model may be viewed as the
dual of the space filling model for molecules [14].
The Voronoi region of a weighted point u ∈ S is

V_u = {x ∈ R^3 | π_u(x) ≤ π_v(x) for all v ∈ S},

where π_u(x) denotes the weighted (power) distance of x from u. The Voronoi regions decompose
the union of balls into convex cells, each the intersection of a ball with its Voronoi region,
as illustrated in Figure 4. Any two regions are either disjoint or they overlap along
a shared portion of their boundary. We assume general position, where at most four
Voronoi regions can have a non-empty common intersection. Let T ⊆ S have the prop-
erty that its Voronoi regions have a non-empty common intersection, and consider the
convex hull of the corresponding centers, conv {u | u ∈ T}. General position
Fig. 4. Union of nine disks, convex decomposition using Voronoi regions, and dual
complex.
Any two simplices in K are either disjoint or they intersect in a common face which is a
simplex of smaller dimension. Furthermore, if σ ∈ K, then all faces of σ are simplices
in K. A set of simplices with these two properties is a simplicial complex [15]. A
subcomplex is a subset L K that is itself a simplicial complex.
Fig. 5. Symmetric difference in dimensions one and two. We add two 1-cycles to get
a new 1-cycle. We add the surfaces the cycles bound to get a spanning surface for the
new 1-cycle.
We define a basis for the first homology group of the complex, which contains all 1-cycles,
and choose representatives for each homology class. We use these representatives to
compute linking numbers for the complex.
A simplex of dimension d in a filtration either creates a d-cycle or destroys a (d-1)-
cycle by turning it into a boundary. We mark simplices as positive or negative, accord-
ing to this action [5]. In particular, edges in a filtration which connect components
are marked as negative. The set of all negative edges gives us a spanning tree of the
complex, as shown in Figure 6. We use this spanning tree to define our canonical basis.
Fig. 6. Solid negative edges combine to form a spanning tree. The dashed positive edge
σ_i creates a canonical cycle.
Every time a positive edge σ_i is added to the complex, it creates a new cycle. We choose
the unique cycle that contains σ_i and no other positive edge as a new basis cycle. We
call this cycle a canonical cycle, and the collection of canonical cycles, the canonical
basis. We use this basis for computation.
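A minimal sketch of how a canonical basis cycle can be read off the spanning tree of negative edges: the cycle created by a positive edge is that edge plus the tree path between its endpoints. The function and its plain breadth-first search are illustrative assumptions; the paper obtains these cycles through the persistence algorithm described next.

from collections import defaultdict, deque

def canonical_cycle(negative_edges, positive_edge):
    """Return the edges of the canonical cycle created by `positive_edge`,
    given the spanning tree/forest formed by `negative_edges` (Figure 6).
    Assumes both endpoints of the positive edge lie in the same component."""
    tree = defaultdict(list)
    for a, b in negative_edges:
        tree[a].append(b)
        tree[b].append(a)
    u, v = positive_edge
    parent = {u: None}
    queue = deque([u])
    while queue:                       # BFS from u until v is reached
        x = queue.popleft()
        if x == v:
            break
        for y in tree[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)
    path = []                          # walk the tree path from v back to u
    x = v
    while x is not None:
        path.append(x)
        x = parent[x]
    return list(zip(path, path[1:])) + [positive_edge]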
Persistence. The persistence algorithm matches positive and negative simplices to find
life-times of homological cycles in a filtration. The algorithm does so by following a
representative cycle z for each class. Initially, z is the boundary of a negative simplex
σ_j, as z must lie in the homology class σ_j destroys. The algorithm then successively
adds class-preserving boundary cycles to z until it finds the matching positive simplex
σ_i, as shown in Figure 7. We call the half-open interval [i, j) the persistence interval of
both the homology class and its canonical representative.

Fig. 7. Starting from the boundary of the negative triangle σ_j, the persistence algorithm
finds a matching positive edge σ_i by finding the dashed 1-cycle. We modify this 1-cycle
further to find the solid canonical 1-cycle and a spanning surface.

During this interval, the ho-
mology class exists as a class of homologous non-bounding cycles in the filtration. As
such, the class may only affect the linking numbers of complexes K_i, ..., K_{j-1} in the
filtration. We use this insight in the next section to design an algorithm for computing
linking numbers.
Computing Canonical Cycles. The persistence algorithm halts when it finds the match-
ing positive simplex σ_i for a negative simplex σ_j, often generating a cycle z with mul-
tiple positive edges and multiple components. We need to convert z into a canonical
cycle by eliminating all positive edges in z except for σ_i. We call this process canon-
ization. To canonize a cycle, we add cycles associated with unnecessary positive edges
to z successively, until z is composed of σ_i and negative edges, as shown in Figure 7.
Canonization amounts to replacing one homology basis element with a linear combina-
tion of other elements in order to reach the unique canonical basis we defined earlier.
A cycle undergoing canonization changes homology classes, but the rank of the basis
never changes.
Computing Spanning Surfaces. For each canonical cycle, we need a spanning surface
in order to compute linking numbers. We may compute these by maintaining surfaces
while computing the cycles. Recall that initially, a cycle representative is the boundary
of a negative simplex σ_j. We use σ_j as the initial spanning surface for z. Every time
we add a cycle y to z in the persistence algorithm, we also add the surface y bounds to
z's surface. We continue this process through canonization to produce both canon-
ical cycles and their spanning surfaces. Here, we are using a crucial property of our
filtrations: the final complex is always the Delaunay complex of the set of weighted
points and does not contain any 1-cycles. Therefore, all 1-cycles are eventually turned
to boundaries and have spanning surfaces.
If the generated spanning surface is Seifert, we may apply the SEIFERT SURFACE
LEMMA to compute the linking numbers. In some cases, however, the spanning surface
is not Seifert, as in Figure 2(b). In these cases, we may either compute the linking num-
ber modulo 2 by applying the SPANNING SURFACE LEMMA, or compute the linking
number by alternative methods.
5 Algorithm
In this section, we use the basis and spanning surfaces computed for 1-cycles to find
linking numbers for all complexes in a filtration. Since we focus on 1-cycles only, we
will refer to them simply as cycles.
during some sub-interval [u, v) ⊆ [r, s). Let t_{p,q} be the minimum index in the filtration
when z_p and z_q are in the same component. Then [u, v) = [r, s) ∩ [0, t_{p,q}). If [u, v) ≠ ∅,
z_p, z_q are p-linked during that interval. In the remainder of this section, we will first
develop a data structure for computing t_{p,q} for any pair of cycles z_p, z_q. Then, we use
this data structure to efficiently enumerate all pairs of p-linked cycles. Finally, we give
an algorithm for computing λ(z_p, z_q) for a p-linked pair of cycles z_p, z_q.
Component History. To compute t_{p,q}, we need a history of the changes to the set
of components in a filtration. There are two types of simplices that can change this set.
Vertices create components and are therefore all positive. Negative edges connect com-
ponents. We construct a binary tree called the component tree, recording these changes
using a union-find data structure [4]. The leaves of the component tree are the vertices
of the filtration. When a negative edge connects two components, we create an internal node
and connect it to the nodes representing these components, as shown in Figure 9.

Fig. 9. The union-find data structure (left) has vertices as nodes and negative edges as
edges. The component tree (right) has vertices as leaves and negative edges as internal
nodes.

The component tree has size O(n) for n vertices, and we construct it in time O(nA^{-1}(n)),
where A^{-1}(n) is the inverse of Ackermann's function, which exhibits insanely slow
growth. Having constructed the component tree, we find the time two vertices w, x are
in the same component by finding their lowest common ancestor (lca) in this tree. We
utilize Harel and Tarjan's optimal method to find lcas with O(n) preprocessing time
and O(1) query time [11]. Their method uses bit operations. If such operations are not
allowed, we may use van Leeuwen's method with the same preprocessing time and
O(log log n) query time [20].
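The sketch below builds a component tree from the negative edges and answers merge-time queries with a naive ancestor walk; the class layout is an assumption, and the optimal lca structures cited above would replace the naive query in a faithful implementation.

class ComponentTree:
    """History of component merges in a filtration (Figure 9): leaves are the
    vertices, internal nodes correspond to negative edges and store the index
    in the filtration at which two components merge."""

    def __init__(self, vertices):
        self.parent = {v: None for v in vertices}   # tree parent pointers
        self.time = {}                              # internal node -> merge time
        self.root = {v: v for v in vertices}        # simple union-find pointers
        self.next_node = 0

    def find(self, v):
        while self.root[v] != v:
            v = self.root[v]
        return v

    def add_negative_edge(self, u, v, t):
        ru, rv = self.find(u), self.find(v)
        node = ('merge', self.next_node)
        self.next_node += 1
        self.parent[ru] = self.parent[rv] = node
        self.parent[node] = None
        self.time[node] = t
        self.root[ru] = self.root[rv] = self.root[node] = node

    def merge_time(self, w, x):
        """t_{p,q}: the first time w and x lie in the same component, i.e. the
        merge time of their lowest common ancestor (naive ancestor walk)."""
        ancestors = set()
        a = w
        while a is not None:
            ancestors.add(a)
            a = self.parent[a]
        a = x
        while a is not None:
            if a in ancestors:
                return self.time.get(a)   # None if w == x (lca is a leaf)
            a = self.parent[a]
        return None                       # never in the same component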
Fig. 10. The augmented union-find data structure places root nodes in the shaded circu-
lar doubly-linked list. Each root node stores all active canonical cycles in that compo-
nent in a doubly-linked list, as shown for the darker component.
We store z_p at the root of its component and keep a pointer to z_p with the simplex σ_j which
destroys z_p. This implies that we may delete z_p from the data structure at time j with
constant cost.
Our algorithm to enumerate p-linked cycles is incremental. We add and delete cycles
using the above operations on the union-find forest, as the cycles are created and
deleted in the filtration. When a cycle z_p is created at time i, we output all p-linked
pairs in which z_p participates. We start at the root which now stores z_p and walk around
the circular list of roots. At each root x, we query the component tree we constructed
in the last subsection to find the time t when the component of x merges with that of
z_p. Note that t = t_{p,q} for all cycles z_q stored at x. Consequently, we can compute the
p-linking interval for each pair z_p, z_q to determine if the pair is p-linked. If the filtration
contains P p-linked pairs, our algorithm takes time O(mA^{-1}(n) + P), as there are at
most m cycles in the filtration.
Orientation. In the previous section, we showed how one may compute spanning sur-
faces s_p, s_q for cycles z_p, z_q, respectively. To compute the linking number using our
lemma, we need to orient either the pair s_p, z_q or the pair z_p, s_q. Orienting a cycle is trivial:
we orient one edge and walk around to orient the cycle. If either surface has no self-
intersections, we may easily attempt to orient it by choosing an orientation for an arbi-
trary triangle on the surface, and spreading that orientation throughout. The procedure
either orients the surface or classifies it as non-orientable. We currently do not have an
algorithm for orienting surfaces with self-intersections. The main difficulty is distin-
guishing between two cases for a self-intersection: a surface touching itself and passing
through itself.
Fig. 11. On the left, starting at v, we walk on z_q according to its orientation. Segments
of z_q that intersect s_p are shown, along with their contribution to the sign string S_{p,q};
we get λ(z_p, z_q) = 1. On the right, the bold flip curve is the border of s_p^+ and s_p^-,
the portions of s_p that are oriented differently; counting all +'s in S_{p,q}, we get
λ(z_p, z_q) mod 2 = 3 mod 2 = 1.
If s_p is orientable, there are no flip curves on it. The contribution of cycle segments
to the string is the same as before: +- or -+ for segments that pass through s_p, and
++ and -- for segments that do not. By counting +'s, only segments that pass through
s_p change the parity of the sum for λ. Therefore, the algorithm computes λ mod 2
correctly for orientable surfaces. For the orientable surface on the right in Figure 11, for
instance, we get λ(z_p, z_q) mod 2 = 5 mod 2 = 1, which is equivalent to the parity of
the answer computed by the previous algorithm.
Remark. We are currently examining the question of orienting surfaces with self-
intersections. Using our current methods, we may obtain a lower bound signature for
λ by computing a mixed sum: we compute λ and λ mod 2 whenever we can to obtain
the approximation. We may also develop other methods, including those based on the
projection definition of the linking number in Section 2.
6 Experiments
In this section, we present some experimental timing results and statistics which we
used to guide our algorithm development. We also provide visualizations of basis cycles
in a filtration. All timings were done on a Micron PC with a 266 MHz Pentium II
processor and 128 MB RAM running Solaris 8.
Implementation. We have implemented all the algorithms in the paper, except for the
algorithm for computing mod 2. Our implementation differs from our exposition in
three ways. The implemented component tree is a standard union-find data structure
with the union-by-rank heuristic, but no path compression [4]. Although this structure
has an O(n log n) construction time and an O(log n) query time, it is simple to implement
and extremely fast in practice. We also use a heuristic to reduce the number of p-linked
cycles. We store bounding boxes at the roots of the augmented union-find data structure.
Before enumerating p-linked cycles, we check to see if the bounding box of the new
cycle intersects with that of the stored cycles. If not, the cycles cannot be linked, so we
obviate their enumeration. Finally, we only simulate the barycentric subdivision.
Data. We have experimented with a variety of data sets and show the results for six rep-
resentative sets in this section. The first data set contains points regularly sampled along
two linked circles. The resulting filtration contains a complex which is a Hopf link, as
shown in Figure 2. The other data sets represent molecular structures with weighted
points. In each case, we first compute the weighted Delaunay triangulation and the age
ordering of that triangulation. The data points become vertices or 0-simplices. Table 1
gives the sizes of the data sets, their Delaunay triangulations, and age orderings. We
show renderings of specific complexes in the filtration for data set K in Figure 12.
Basis. Table 2 summarizes the basis generation process. We distinguish the two steps
of our algorithm: initial basis generation and canonization. We give the number of basis
cycles for the entire filtration, which is equal to the number of positive edges. We also
show the effect of canonization on the size of the cycles and their spanning surfaces
in Table 2. Note that canonization increases the size of cycles by one or two orders of
magnitude. This is partially the reason we try to avoid performing the link detection if
possible.
Fig. 12. Complex K8168 of K has two components and seventeen cycles. The spanning
surfaces are rendered transparently.
Links. In Table 3, we show that our component tree and augmented trees are very fast
in practice to generate p-linked pairs. We also show that our bounding box heuristic
for reducing the number of p-linked pairs increases the computation time negligibly.
The heuristic is quite successful, moreover, in reducing the number of pairs we have to
check for linkage, eliminating 99.8% of the candidates for dataset Z. The differences
in total time of computation reflect the basic structure of the datasets. Dataset D has
a large computation time, for instance, as the average size of the p-linked surfaces is
approximately 264.16 triangles, compared to about 1.88 triangles for dataset K, and
about 1.73 triangles for dataset M.
Discussion. Our initial experiments demonstrate the feasibility of the algorithms for
fast computation of linking. The experiments fail to detect any links in the protein data,
however. This is to be expected, as a protein consists of a single component, the pri-
mary structure of a protein being a single polypeptide chain of amino acids. Links, on
the other hand, exist in different components by definition. We may relax this defini-
Table 2. On the left, we give the time to generate and canonize basis cycles, as well as
their number. On the right, we give the average length of cycles and size of surfaces,
before and after canonization.
                time in seconds
        generate  canonize  total    # cycles
   H      0.08      0.04     0.12      1,653
   G      0.08      0.03     0.11      2,005
   M      0.28      0.20     0.48      6,537
   Z      0.46      0.46     0.92     10,106
   K      0.72      1.01     1.73     15,607
   D      2.63      2.94     5.57     52,902

          avg cycle length    avg surface size
           before   after      before   after
   H        3.06    51.03       1.06    63.04
   G        3.26    13.02       1.38    52.28
   M        3.29    34.18       1.33    71.18
   Z        4.71    25.33       3.26   117.81
   K        3.48    67.87       1.62   166.70
   D        3.46    39.94       1.81   158.99
Table 3. Time to construct the component tree, and the computation time and number of
p-linked pairs (alg), p-linked pairs with intersecting bounding boxes (heur), and links.
tion easily, however, to allow for links occurring in the same component. We have im-
plementations of algorithms corresponding to this relaxed definition. Our future plans
include looking for links in proteins from the Protein Data Bank [16]. Such links could
occur naturally as a result of disulphide bonds between different residues in a protein.
7 Conclusion
In this paper, we develop algorithms for finding the linking numbers of a filtration.
We give algorithms for computing bases of 1-cycles and their spanning surfaces in
simplicial complexes, and enumerating co-existing cycles in different components. In
addition, we present an algorithm for computing the linking number of a pair of cycles
using the surface formulation. Our implementations show that the algorithms are fast
and feasible in practice. By modeling molecules as filtrations of alpha complexes, we
can detect potential non-trivial tangling within molecules. Our work is within a frame-
work for applying topological methods for understanding molecular structures.
Acknowledgments
We thank David Letscher for discussions during the early stages of this work. We also
thank Daniel Huson for the zeolite dataset Z, and Thomas La Bean for the DNA tile data
set D. This work was supported in part by the ARO under grant DAAG55-98-1-0177.
The first author is also supported by NSF under grants CCR-97-12088, EIA-99-72879,
and CCR-00-86013.
References
1. Colin C. Adams. The Knot Book: An Elementary Introduction to the Mathematical Theory
of Knots. W. H. Freeman and Company, New York, NY, 1994.
2. Richard A. Bissell, Emilio Cordova, Angel E. Kaifer, and J. Fraser Stoddart. A chemically
and electrochemically switchable molecular shuttle. Nature, 369:133-137, 1994.
3. C. P. Collier, E. W. Wong, Belohradsky, F. M. Raymo, J. F. Stoddart, P. J. Kuekes, R. S.
Williams, and J. R. Heath. Electronically configurable molecular-based logic gates. Science,
285:391-394, 1999.
4. Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms.
The MIT Press, Cambridge, MA, 1994.
5. C. J. A. Delfinado and H. Edelsbrunner. An incremental algorithm for Betti numbers of
simplicial complexes on the 3-sphere. Comput. Aided Geom. Design, 12:771-784, 1995.
6. H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplifica-
tion. In Proc. 41st Ann. IEEE Sympos. Found. Comput. Sci., pages 454-463, 2000.
7. H. Edelsbrunner and E. P. Mucke. Three-dimensional alpha shapes. ACM Trans. Graphics,
13:43-72, 1994.
8. Erica Flapan. When Topology Meets Chemistry: A Topological Look at Molecular Chirality.
Cambridge University Press, New York, NY, 2000.
9. P. J. Giblin. Graphs, Surfaces, and Homology. Chapman and Hall, New York, NY, second
edition, 1981.
10. Wolfgang Haken. Theorie der Normalflächen. Acta Math., 105:245-375, 1961.
11. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors.
SIAM J. Comput., 13:338-355, 1984.
12. Joel Hass, Jeffrey C. Lagarias, and Nicholas Pippenger. The computational complexity of
knot and link problems. J. ACM, 46:185-211, 1999.
13. William Jaco and Jeffrey L. Tollefson. Algorithms for the complete decomposition of a
closed 3-manifold. Illinois J. Math., 39:358-406, 1995.
14. Andrew R. Leach. Molecular Modeling, Principles and Applications. Pearson Education
Limited, Harlow, England, 1996.
15. J. R. Munkres. Elements of Algebraic Topology. Addison-Wesley, Redwood City, California,
1984.
16. RCSB. Protein data bank. http://www.rcsb.org/pdb/.
17. Dale Rolfsen. Knots and Links. Publish or Perish, Inc., Houston, Texas, 1990.
18. H. Seifert. Über das Geschlecht von Knoten. Math. Annalen, 110:571-592, 1935.
19. William R. Taylor. A deeply knotted protein structure and how it might fold. Nature,
406:916-919, 2000.
20. Jan van Leeuwen. Finding lowest common ancestors in less than logarithmic time. Unpub-
lished report.
Side Chain-Positioning as an
Integer Programming Problem
1 Introduction
Within the near future the approximate fold of most proteins will be known, thanks to
structural genomics projects. The approximate fold of a protein is not enough, though,
to obtain a full understanding of a molecular mechanism or to be able to utilize the
structure in drug design. For this a complete model of the protein is often needed. The
main procedure today to obtain a complete model is by homology modeling. The
process of homology modeling often includes the positioning of amino acid side chains
on a fixed backbone of a protein. Another area that has recently become important is
the area of automatic protein design. Here the goal is to obtain a sequence that folds
to a given structure. Mayo and coworkers have shown that it is possible to perform
automatic designs [1]. One crucial step in their procedure is to find the optimal side
chain conformation.
There are two common features to most algorithms that try to solve this problem.
The first one is to discretize the allowed conformational space into rotamers rep-
resenting the statistically dominant side chain orientations in naturally occurring pro-
teins [2,3]. The rotamer approximation reduces the conformation space and makes it
possible to use a discrete formulation of the problem. The second feature is the use of
an energy function that can be divided into terms depending only on pairwise interac-
tions between different parts of the protein. The total energy of a protein in a specific
conformation, E_C, can therefore be described as

E_C = E_backbone + Σ_i E(i_r) + Σ_i Σ_{j>i} E(i_r j_s)     (1)

E_backbone is the self-energy of the backbone, i.e. the interaction between all atoms in
the backbone. E(i_r) is the self-energy of side chain i in its rotamer conformation i_r,
including its interaction with the backbone. E(i_r j_s) is the interaction energy between
side chain i in the rotamer conformation i_r and the side chain j in the rotamer confor-
mation j_s. In this study we will keep the backbone of the protein fixed and only change
the side chain rotamers. The term E_backbone will therefore not contribute to any differ-
ence in energy between two protein conformations and can be ignored. The problem
we want to solve can thus be defined as; given the coordinates of the backbone and a
specific rotamer library and energy function, find the set of rotamers that minimizes the
energy function. The solution space of this problem obviously increases exponentially
with the number of residues included.
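As a concrete reading of equation (1) with the constant backbone term dropped, the following sketch evaluates the energy of one rotamer assignment; the lookup-table layout for the self and pairwise energies is an assumption.

def conformation_energy(rotamers, self_energy, pair_energy, e_backbone=0.0):
    """Energy of one conformation following equation (1):
    E_C = E_backbone + sum_i E(i_r) + sum_{i<j} E(i_r j_s).
    rotamers[i] is the rotamer chosen for side chain i; self_energy and
    pair_energy are assumed lookup tables keyed (i, r) and (i, r, j, s)."""
    residues = sorted(rotamers)
    total = e_backbone + sum(self_energy[(i, rotamers[i])] for i in residues)
    total += sum(pair_energy[(i, rotamers[i], j, rotamers[j])]
                 for a, i in enumerate(residues) for j in residues[a + 1:])
    return total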
Pruning Algorithms. The most frequently used algorithms in this category are based
on the dead end elimination theorem (DEE) [6,7,8,9]. DEE-based methods iteratively
apply rejection criteria. If a criterion is fulfilled, it guarantees that a certain rotamer or
combination of rotamers cannot be a part of the GMEC. This reduces the conformation
space significantly and hopefully at the end a single solution remains. If DEE-based
methods converge to a single solution, they are guaranteed to have found the global min-
imum energy. During the last few years methods based on this theorem have developed
considerably and can now solve most side-chain positioning problems [10]. However,
in protein design there is a limit beyond which this method fails to converge and other
inexact methods have to be used [10]. The reason for this is that protein design requires
a large rotamer library [9].
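For illustration, the sketch below applies one pass of the simplest (original) dead-end elimination criterion; the stronger, iterated criteria used by the methods cited above are not reproduced, and the data layout is the same assumed lookup-table form as before.

def dee_eliminate_once(rotamers, self_energy, pair_energy):
    """One pass of the basic dead-end elimination criterion: rotamer i_r is
    dead-ending if some competitor rotamer i_t at the same residue satisfies
        E(i_r) + sum_j min_s E(i_r, j_s)  >  E(i_t) + sum_j max_s E(i_t, j_s).
    rotamers[i] is the list of rotamers still allowed at residue i; pair_energy
    is assumed to be keyed (i, r, j, s) for both orderings of the residue pair."""
    eliminated = set()
    for i, candidates in rotamers.items():
        others = [j for j in rotamers if j != i]

        def best_case(r):   # lower bound on the contribution of rotamer r at i
            return self_energy[(i, r)] + sum(
                min(pair_energy[(i, r, j, s)] for s in rotamers[j]) for j in others)

        def worst_case(t):  # upper bound on the contribution of rotamer t at i
            return self_energy[(i, t)] + sum(
                max(pair_energy[(i, t, j, s)] for s in rotamers[j]) for j in others)

        for r in candidates:
            if any(t != r and best_case(r) > worst_case(t) for t in candidates):
                eliminated.add((i, r))
    return eliminated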
Mean-Field Algorithms. A third approach is to use mean field algorithms [11]. Here
all rotamers of all side chains can be thought of as existing at the same time, but with
different probabilities. Each residue is considered in turn and the probabilities for all ro-
tamers of that side chain are updated, based on the mean field generated by the multiple
side chains at neighbouring residues. The procedure is repeated until it converges. The
predicted conformation of a side chain is chosen to be the rotamer with the highest prob-
ability. An advantage of self-consistent mean field algorithms is that the computational
time scales linearly with the number of residues. Unfortunately, there is no guarantee
that the minimum of the mean-field landscape corresponds to the true GMEC.
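A hedged sketch of one self-consistent mean-field iteration of the kind described above; the Boltzmann-style update, the temperature parameter beta and the fixed number of sweeps are assumptions rather than the specific scheme of [11].

import math

def self_consistent_mean_field(rotamers, self_energy, pair_energy,
                               beta=1.0, sweeps=50):
    """Every rotamer of every side chain carries a probability; each residue's
    probabilities are repeatedly updated from the mean field of the other
    residues.  Returns the most probable rotamer per residue."""
    prob = {i: {r: 1.0 / len(rs) for r in rs} for i, rs in rotamers.items()}
    for _ in range(sweeps):
        for i, rs in rotamers.items():
            mean_field = {}
            for r in rs:
                e = self_energy[(i, r)]
                for j, ss in rotamers.items():
                    if j == i:
                        continue
                    e += sum(prob[j][s] * pair_energy[(i, r, j, s)] for s in ss)
                mean_field[r] = e
            z = sum(math.exp(-beta * e) for e in mean_field.values())
            prob[i] = {r: math.exp(-beta * e) / z for r, e in mean_field.items()}
    return {i: max(p, key=p.get) for i, p in prob.items()}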
A general mathematical programming problem can be written as

Minimize f(x)
subject to g_i(x) ≤ 0, i = 1, . . . , m

In the case that the decision variables can only take integer values, we have an integer pro-
gramming problem. The easiest programming problems have continuous variables and
linear constraints and objective functions. These are called linear programming (LP)
problems. Please refer to [12,13] for a thorough introduction to linear programming
and integer programming.
It is possible [8] to rewrite the energy function (1) so that it only contains terms
depending on pairs of side chains, i.e., the energy of a conformation can be written as

E_C = Σ_i Σ_{j>i} E′(i_r, j_s)     (2)

where i = 1 . . . n_sc, j = 2 . . . n_sc, i < j and n_sc is the number of side chains. The
total energy of the system (not just one conformation as before) can then be calcu-
lated as a sum consisting of all possible variables, one for each rotamer combination
i_r, j_s, times their respective energy contribution to the total energy, e_{i_r,j_s} = E′(i_r, j_s),
see equation (2). The energy e_{i_r,j_s} is calculated assuming both of these rotamers are
included in the conformation. The total energy, E_tot, is then:

E_tot = Σ_i Σ_r Σ_{j>i} Σ_s e_{i_r,j_s} x_{i_r,j_s}     (4)
where

Σ_r Σ_s x_{i_r,j_s} = 1   for all i and j, i < j     (5)

Σ_q x_{h_q,i_r} = Σ_p x_{g_p,i_r} = Σ_s x_{i_r,j_s} = Σ_t x_{i_r,k_t}     (6)
The first condition (5) states that, of all the rotamer combinations that two side chains
could possibly take, only one of these combinations can exist. This
could for example be i_r and j_s. The second condition (6) states that one side chain can
only exist in one rotamer state, independent of the rotamer states of other side chains.
This assures that it cannot exist in one state in relation to one side chain and another
state in relation to another side chain.
This allows us to formulate a minimization problem as:

Minimize E_tot     (8)
subject to (5), (6) and (7)
This is a linear integer programming problem. We can rewrite it in matrix form:

z_IP = min{cx | Ax = b, x ∈ {0, 1}^n}     (9)

where A is an m × n matrix and b and c are vectors of appropriate dimensions. In our case
all components of A are 0, 1 or -1, the components of b are 0 or 1, and c consists of real
values. The condition (6) can of course be implemented in different ways to obtain this
form. We refer to the Appendix for a description of our approach. A small example in
Table 1 shows what the integer program can look like for a small side chain positioning
problem and our implementation.
In general, integer problems are hard to solve. By relaxing the integrality constraints,
that is, allowing 0 ≤ x_{i_r,j_s} ≤ 1 instead of x_{i_r,j_s} ∈ {0, 1}, we can turn this problem
into an LP-problem for which several efficient algorithms exist. If the solution to
the LP-relaxed problem is integral, x_{i_r,j_s} ∈ {0, 1}, then it is an optimal solution to
the original problem [13]. This means that if we solve the relaxation of the side chain
positioning problem (8) and the solution turns out to be integral, it corresponds to the
GMEC. The linear programming relaxation is as follows:

z_LP = min{cx | Ax = b, 0 ≤ x ≤ 1}     (10)
The feasible solution set of the problem, F (LP ), defined by Ax = b and 0 x 1,
forms a convex polyhedron, as it is an intersection of a finite number of hyper-planes.
The LP-problem is well defined in that either it is infeasible, unbounded, or has an
optimal solution. An optimal solution to an LP problem can always be found in an ex-
treme point of the polyhedron (if there exists an optimal solution). For a more thorough
introduction to the ideas of the linear programming methods we refer to [ 12].
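A minimal sketch of solving the relaxation (10) and checking the solution for integrality, using SciPy's linear programming routine rather than the lp_solve/CPLEX codes used in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxation(c, A, b, tol=1e-6):
    """Solve the LP relaxation (10) and report whether the optimum is integral."""
    res = linprog(c, A_eq=A, b_eq=b, bounds=(0.0, 1.0), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    # Every component must be within tol of 0 or 1 for the solution to be integral.
    integral = bool(np.all(np.minimum(res.x, 1.0 - res.x) < tol))
    return res.x, res.fun, integral
```

Combined with a constraint construction such as the one sketched above, `solve_relaxation(c, A, b)` returns the relaxed optimum, its objective value, and whether it is integral (and hence, by the argument above, the GMEC).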
It can be shown that LP can be solved in polynomial time; the proof is normally based on the ellipsoid method [14]. This method is, however, not used in practice due to its slowness. Today, two families of methods are popular. The idea of the simplex method is to move from one extreme point to another in such a way that the value of the objective function always decreases, or at least does not increase. This is done until a minimum is reached or until the problem is found to be unbounded. Although there are examples showing that the simplex algorithm can take exponential time, it is very efficient if implemented well, and in practice it typically behaves polynomially [15]. Interior point methods are a family of algorithms that stay in the strict interior of the feasible region, for example by using a barrier function. The field grew from Karmarkar's algorithm for solving LP problems [16]; these methods run in polynomial time.
Table 1. The A, b and c matrices, see (10), of a trivial example protein with four residues a, b, c and d, where a has three rotamers a1, a2, a3, b has two, c three and d two rotamers. These matrices are used as the input to the simplex algorithm. The order of the decision variables x is also shown. In these matrices the row (ab=1) corresponds to Σ_r Σ_s x_{a_r,b_s} = 1, (a1b=a1c) corresponds to Σ_s x_{a_1,b_s} = Σ_t x_{a_1,c_t}, and (a1c=a1d) corresponds to Σ_t x_{a_1,c_t} = Σ_u x_{a_1,d_u}. So the rows (a1b=a1c) and (a1c=a1d) together say that rotamer a1 either exists or not, and cannot do both at the same time. The rest of the rows correspond to similar constraints.
x = (x_{a1,b1} x_{a1,b2} x_{a1,c1} x_{a1,c2} x_{a1,c3} x_{a1,d1} x_{a1,d2} x_{a2,b1} ... x_{a2,d2} x_{a3,b1} ... x_{a3,d2} x_{b1,c1} x_{b1,c2} x_{b1,c3} x_{b1,d1} x_{b1,d2} x_{b2,c1} ... x_{b2,d2} x_{c1,d1} x_{c1,d2} x_{c2,d1} x_{c2,d2} x_{c3,d1} x_{c3,d2})
c = (1.43 ... 0.54)
A b
1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 (ab=1)
1 1-1-1-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (a1b=a1c)
0 0 1 1 1-1-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (a1c=a1d)
0 0 0 0 0 0 0 1 1-1-1-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (a2b=a2c)
0 0 0 0 0 0 0 0 0 1 1 1-1-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (a2c=a2d)
1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0-1-1-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (b1a=b1c)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1-1-1 0 0 0 0 0 0 0 0 0 0 0 0 (b1c=b1d)
0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0-1-1-1 0 0 0 0 0 0 0 0 0 (b2a=b2c)
0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0-1 0 0 0 0 0 0 0 0 0 0 0 (c1a=c1b)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0-1-1 0 0 0 0 0 (c1b=c1c)
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0-1 0 0 0 0 0 0 0 0 0 0 (c2a=c2b)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0-1-1 0 0 0 (c2b=c2c)
0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0-1 0 0 0 0 0 0 0 0 0 (c3a=c3b)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0-1-1 0 (c3b=c3d)
0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0-1 0 0 0 0 0 0 0 0 (d1a=d1b)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0-1 0-1 0-1 0 0 (d1b=d1c)
0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0-1 0 0 0 0 0 0 0 (d2a=d2b)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0-1 0-1 0-1 0 (d2b=d2c)
To use LP methods alone when solving an integer problem we have to prove that the solution we find is always integral. We have not been able to do so, but our results indicate that this could be the case. If the solution is fractional, branch and bound methods can be used in combination with LP to get an integer solution.
Time Complexity. Let n_sc be the number of side chains and n_rot be the average number of rotamer states for each side chain. Then the integer programming formulation of the problem (8) gives approximately n_sc^2 · n_rot constraints. For the simplex algorithm it is usually assumed that the number of iterations grows linearly with the number of constraints of the problem and that the work for each iteration grows as the square of the number of constraints [15]. That is, the computational time of solving the relaxed problem by the simplex method should be on the order of O(n_sc^6). If we instead look at the average number of rotamers for each residue, the time complexity for the simplex algorithm ought to be O(n_rot^3). This is under the assumption that the solution to the relaxed problem is integral and therefore the final solution.
and the rotamer i_r cannot be a member of the GMEC. The inequality (11) is not easy to check computationally since {C} can be huge. Instead one can get a more practical, but weaker, condition by comparing the case where the left side of (11) is highest in energy with the case where the right side is lowest in energy. This can be done by the following inequality, known as the DEE theorem, introduced by Desmet et al. [6]:

E(i_r) + Σ_{j, j≠i} min_s E(i_r, j_s) > E(i_t) + Σ_{j, j≠i} max_s E(i_t, j_s)    (12)

The theorem states that the rotamer i_r is a dead end, thus precluding i_r from being a member of the GMEC, if the inequality (12) holds for some rotamer i_t ≠ i_r of the side chain i. For a proof of this theorem we refer to [6]. In words, inequality (12) says that rotamer i_r must be a dead end if the energy of the system taken at i_r's most favorable interactions with all the other residues is larger than the energy of another rotamer (i_t) at its worst interactions. The best and worst interactions are given by choosing, for each interaction term, the rotamer which gives the lowest, respectively the highest, value. Desmet also described how the criterion can be extended to the elimination of pairs of rotamers inconsistent with the GMEC.

A more powerful version of these criteria was introduced by Goldstein [7]. It subtracts the right-hand side from the left in equation (12) before applying the min operator:

E(i_r) − E(i_t) + Σ_{j, j≠i} min_s ( E(i_r, j_s) − E(i_t, j_s) ) > 0    (13)

This criterion says that i_r is a dead end if we can always lower the energy by taking rotamer i_t instead of i_r while keeping the other rotamers fixed. Goldstein's criterion can also be extended to pairs of rotamers inconsistent with the GMEC. These criteria are used iteratively and one after the other until no more rotamers can be eliminated.
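A small Python sketch of this iteration for the singles criterion (13); the array layout of the self and pair energies is an assumption of the sketch, and elimination of rotamer pairs is not included.

```python
import numpy as np

def goldstein_singles_dee(E_self, E_pair):
    """Iteratively eliminate rotamers using Goldstein's singles criterion (13).

    E_self[i]    : 1-D numpy array of self energies E(i_r) at residue i
    E_pair[i][j] : 2-D numpy array with E(i_r, j_s) for j != i
    Returns a boolean mask of surviving rotamers per residue.
    """
    n = len(E_self)
    alive = [np.ones(len(E_self[i]), dtype=bool) for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in np.flatnonzero(alive[i]):
                for t in np.flatnonzero(alive[i]):
                    if t == r:
                        continue
                    # E(i_r) - E(i_t) + sum over j of min over surviving s
                    total = E_self[i][r] - E_self[i][t]
                    for j in range(n):
                        if j == i:
                            continue
                        diff = E_pair[i][j][r, :] - E_pair[i][j][t, :]
                        total += diff[alive[j]].min()
                    if total > 0:            # i_r is a dead end
                        alive[i][r] = False
                        changed = True
                        break
    return alive
```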
4 Methods
The energy function used in this study consists of van der Waals forces between atoms that are more than three bonds away from each other. Van der Waals parameters were taken from the CHARMM22 parameter set [18] (par_all22_prot). In this study we have focused on the algorithmic part of the side chain positioning problem, i.e., finding the global minimum energy conformation of the model. The next step would be to use a more appropriate energy function. We used the backbone-independent rotamer library of R. Dunbrack [3] (bbind99.Aug.lib), which contains a maximum of 81 rotamers per residue. Further, all hydrogen atoms were ignored. All calculations have been performed using the backbone coordinates of the lambda repressor (PDB code: 1r69). The energy contributions of each rotamer pair to the total energy were calculated in advance and stored in a vector c, see (10). The two matrices A and b were constructed. An example of these matrices for a small problem can be seen in Table 1. These matrices
were then used as the input to the linear programming algorithms. First, a simplex algorithm, lp_solve 3.1 [19], was used. Secondly, the problem was solved using a mixed integer programming (mip) algorithm from the CPLEX package [20]. This algorithm is designed to solve problems with both integral and real variables. It makes use of the fact that we have an integer problem, and finds structure in the input that can be used to reduce the problem. After a relaxation of the integer constraints it solves the obtained LP program by the simplex method. If the solution is not integral, a branch and bound procedure takes place. We have also tried other algorithms from the CPLEX package, such as the primal and dual simplex, hybrid network, and hybrid barrier solvers. The mip solver gave the best results.
For comparison we have also performed an exhaustive search for up to 10 side chains and implemented a single-residue based DEE algorithm, using Goldstein's criterion (13). We also implemented a combination of the DEE theorem and linear programming (CPLEX-mip). Here we used condition (13) iteratively until no more rotamers could be eliminated and then applied linear programming to the rest of the solution space to find the GMEC.
All algorithms were tested on identical systems, and we assured that the correct
solution was found. In Table 2 the size of the conformational space of some of the
problems can be seen. As several different programs were used for the calculations, it is not trivial to compare the absolute computational times. It is, however, possible to compare their complexity.

Table 2. Some of the problems used in the study; n_sc is the number of free side chains and n_rot is the average number of rotamers per side chain. The fraction of rotamers left after the DEE search and the total number of conformations in each problem are also shown. The protein used is the lambda repressor (PDB code: 1r69).

n_sc   n_rot   Number of conformations   % remaining rot. after DEE
 4     15      10^3.8                     0
 5     27      10^6.2                     0
 6     24      10^7.1                     0
 8     15.7    10^8.1                     4.2
10     16.2    10^10.5                    3.3
12     21      10^13.4                    5.0
13     19.6    10^13.8                    5.0
15     27.8    10^17.6                    3.0
20     28.8    10^22.9                    9.9
27     25.9    10^30.5                   14.3
33     27.5    10^38.2                   17.0
35     28.5    10^41.0                   16.0
43     26.5    10^48.1                   15.4
Today there are mainly three types of methods used for the side chain positioning problem: stochastic methods, mean-field algorithms and DEE-theorem based methods. Stochastic methods and mean-field algorithms always find a solution; however, it is not guaranteed to be optimal. If a DEE method converges to one solution, it is guaranteed to be the global optimal solution. However, for larger problems DEE methods do not always converge. Therefore, the remaining solutions have to be tested by, e.g., exhaustive search. Here we introduce a novel method using linear programming. If the solution from the LP method is integral it corresponds to the GMEC.
Integral Solutions. In our experiments, see below, the solutions were checked for integrality, and so far we have never found a fractional solution. The mip method from the CPLEX package is made for integer programs. It uses a simplex solver, followed by branch-and-bound with simplex if the solution is not integral, see Section 4. In our experiments the branch-and-bound part of the algorithm was never used. We have also used the primal and dual simplex, hybrid network and hybrid barrier solvers from the CPLEX package. All these solvers returned integral results. To examine whether the energy function has any effect on the integrality of the solutions we have tried different nonsense energy functions on a test case with the simplex algorithm. All the solutions found were integral.
[Figure 2: log-log plot of computation time (s) versus the number of side chains; legend: o = lp_solve 3.1, x = CPLEX mip, + = exhaustive search.]
Fig. 2. Experimental studies of the complexity of the two linear programming algorithms versus the number of side chains. The circles represent the simplex method and the crosses the mip method. The complexity is approximately O(n_sc^5) and O(n_sc^3), respectively. For comparison, a curve representing the exhaustive search is also included.
Experimental Time Complexity. First, the number of free side chains, n_sc, was increased to examine the time behavior of the linear programming methods. The computational times for the mip method and the simplex method (from lp_solve 3.1) are shown in Figure 2. It can be seen that the time scales approximately as O(n_sc^3) for the mip method and as O(n_sc^5) for the simplex method. This agrees quite well with the estimation of the complexity (see Section 2), where the computational time was calculated to scale as O(n_sc^6) for the simplex method. Furthermore, it shows the superiority of the mip method. The reason that the mip method has a better complexity is the preprocessor, which finds structure in the input that can reduce the problem size and increase the speed. This also means that there probably exist better ways to implement the conditions of (8).

In protein design the average number of rotamers for each side chain, n_rot, is often very large. Therefore, it is interesting to study the complexity of the LP algorithms versus n_rot. Our estimation of the time complexity for the simplex method on the side chain positioning problem is O(n_rot^3), see Section 2. This agrees quite well with our experimental study, see Figure 3, where the complexity is approximately O(n_rot^2.2) for both the simplex and mip algorithms.
Comparison with the DEE Method. The DEE method is a pruning method that consecutively eliminates parts of the conformation space. First, simple criteria with a low complexity are used, and then slower methods are applied until convergence. For large design problems DEE methods do not converge in a reasonable amount of time [10]. This means that the time complexity for large systems could be exponential rather than polynomial. All this makes it difficult to calculate an overall time complexity of DEE methods.
[Figure 3: log-log plot of computation time (s) versus the average number of rotamers; legend: o = lp_solve 3.1, x = CPLEX mip.]
Fig. 3. Experimental studies of the complexity of the two linear programming algorithms used in this study versus the number of rotamers. The number of side chains is 8. The circles represent the simplex method, the crosses the mip method and the stars a DEE algorithm with a final exhaustive search, see Methods. For a problem of this size the DEE algorithm almost converges; therefore, the computational time of the exhaustive search is negligible. The complexity is approximately O(n_rot^2.2) (mip), O(n_rot^2.3) (simplex) and O(n_rot^2.0) (DEE).
In a study by Pierce et al. [17], the time complexity of DEE was estimated in terms of the cost of the nested loops required to implement each approach. For the different criteria their estimates were between O(n_sc^2 n_rot^2) and O(n_sc^3 n_rot^5).

To perform a comparison between DEE and LP, a part of the DEE method, Goldstein's criterion for single residues, was implemented. However, we did not implement other criteria. The comparison of the two methods can therefore only be seen as an indication of their relative performance.
In our implementation, when the DEE did not converge to one solution, the remaining conformational space was searched exhaustively. When the problem contained more than approximately 8 residues the DEE algorithm did not converge. In Figure 4, it can be seen that for larger systems our DEE implementation does not show a polynomial complexity. With a more complete DEE method it ought to be possible to limit the remaining conformational space so that a single solution is obtained for much larger systems. However, the complexity of the DEE algorithm would then be larger.

What might be more interesting is to compare the DEE implementation with the mip linear programming method. If one only considers the complexity of the DEE algorithm before the exhaustive search is started, the complexity is approximately equal to that of the LP method, see Figure 4. However, while the LP method has found the GMEC, the DEE algorithm has not converged to a single solution for most problems, Table 2. In Figure 4, the time complexity of a combination of this DEE algorithm and the LP algorithm (mip) is also shown.
[Figure 4: log-log plot of computation time versus the number of side chains; legend: o = only CPLEX mip, x = DEE and CPLEX mip, * = DEE and exhaustive search, + = only DEE.]
Fig. 4. Experimental studies of the complexity of the mip linear programming algorithm and the Dead End Elimination algorithm, versus the number of side chains. The circles represent the linear programming mip method alone, the crosses a combination of DEE and LP, the stars the DEE with a final exhaustive search and the +-signs the DEE algorithm alone. The complexity for mip and the combined method is approximately O(n_sc^3) and O(n_sc^3.5), respectively. The complexity of the DEE part of the computation is O(n_sc^3.2), the dashed line.
Here, the time for the combined method is less than for mip alone, but the complexity is slightly better for mip.

We have also made a study of the DEE algorithm (Goldstein's singles criterion) with an increasing number of rotamers, see Figure 3. Here the problem was small enough (8 residues) for the DEE method to eliminate almost all rotamers. The complexity of the DEE algorithm was O(n_rot^2), i.e., almost identical to that of the LP methods. However, the DEE algorithm was faster.
6 Conclusions
We have introduced a novel solution method for the problem of optimal side chain positioning on a fixed backbone. It has earlier been shown that finding the optimal solution to this problem is essential for the success of automatic protein design [1]. The state of the art methods include several rounds using different versions of the Dead End Elimination theorem, with increasing time complexity. These methods do not necessarily converge to a single solution, but the conformational space is reduced significantly and the remaining solutions can be searched.

By using linear programming (LP) we are guaranteed to find an optimal solution in polynomial time. If this solution is integral it corresponds to the global minimum energy conformation. So far in our studies the solutions have always been integral.
Linear programming is a well studied area of research and many algorithms for fast solutions are available. We obtained the best results using the mip method from the CPLEX package. The time complexity for the mip method to find the GMEC was O(n_sc^3 n_rot^2.2), while our DEE implementation (Goldstein's singles criterion) had a similar complexity of O(n_sc^3 n_rot^2). However, with the mip method the GMEC was found, while with the DEE method a fraction of the conformation space remained to be searched. More advanced DEE implementations converge to a single solution for larger problems, but they use more time consuming criteria, the worst with an estimated complexity of O(n_sc^3 n_rot^5) [17]. As the complexity of the mip method has a smaller dependency on the number of rotamers, the use of LP algorithms might be best for problems with many rotamers, as in protein design. One reason for the effectiveness of the mip algorithm is most likely the preprocessing of the problem. Therefore, a reformulation of the input matrices (10) to the simplex algorithm, perhaps as a network [21], might improve the complexity even further.
Acknowledgments
We thank Jens Lagergren for the idea of using linear programming and for helpful discussions. This project was supported in part by the Swedish Natural Science Research Council (NFR) and the Strategic Research Foundation (SSF).
References
1. Dahiyat, B.I. and Mayo, S.L.: De novo protein design: fully automated sequence selection. Science 278 (1997) 82–87
2. Ponder, J.W. and Richards, F.M.: Tertiary Templates for Proteins: Use of Packing Criteria in the Enumeration of Allowed Sequences for Different Structural Classes. J. Mol. Biol. 193 (1987) 775–791
3. Dunbrack, R. and Karplus, M.: Backbone-dependent rotamer library for proteins - application to sidechain prediction. J. Mol. Biol. 230 (1993) 543–574
4. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21 (1953) 1087–1092
5. LeGrand, S.M. and Merz Jr., K.M.: The genetic algorithm and the conformational search of polypeptides and proteins. J. Mol. Simulation 13 (1994) 299–320
6. Desmet, J., De Maeyer, M., Hazes, B. and Lasters, I.: The dead end elimination theorem and its use in side-chain positioning. Nature 356 (1992) 539–542
7. Goldstein, R.F.: Efficient rotamer elimination applied to protein side-chains and related spin glasses. Biophysical J. 66 (1994) 1335–1340
8. Lasters, I., De Maeyer, M. and Desmet, J.: Enhanced dead-end elimination in the search for the global minimum energy conformation of a collection of protein side chains. Protein Eng. 8 (1995) 815–822
9. Gordon, D.B. and Mayo, S.L.: Radical performance enhancements for combinatorial optimization algorithms based on the dead-end elimination theorem. J. Comp. Chem. 19 (1998) 1505–1514
10. Voigt, C.A., Gordon, D.B. and Mayo, S.L.: Trading accuracy for speed: A quantitative comparison of search algorithms in protein sequence design. J. Mol. Biol. 299 (2000) 789–803
11. Koehl, P. and Levitt, M.: De Novo Protein Design. I. In search of stability and specificity. J. Mol. Biol. 293 (1999) 1161–1181
12. Luenberger, D.G.: Linear and Nonlinear Programming. Addison-Wesley Publishing Company (1973)
13. Nemhauser, G.L. and Wolsey, L.A.: Integer and Combinatorial Optimization. Wiley-Interscience Series in Discrete Mathematics and Optimization (1988)
14. Korte, B. and Vygen, J.: Combinatorial Optimization: Theory and Algorithms. Springer-Verlag (1991)
15. Lindeberg, P.O.: Optimeringslära: en introduktion. Department of Optimization and Systems Theory, Royal Institute of Technology, Stockholm (1990)
16. Karmarkar, N.: A New Polynomial Time Algorithm for Linear Programming. Combinatorica 4 (1984) 375–395
17. Pierce, N.A., Spriet, J.A., Desmet, J. and Mayo, S.L.: Conformational Splitting: A More Powerful Criterion for Dead-End Elimination. J. Comp. Chem. 21 (2000) 999–1009
18. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S. and Karplus, M.: CHARMM: A Program for Macromolecular Energy, Minimization and Dynamics Calculations. Journal of Computational Chemistry 4 (1983) 187–217
19. Berkelaar, M.: lp_solve 3.1: Program with the Simplex algorithm. ftp://ftp.es.ele.tue.nl/pub/lp_solve (1996)
20. http://www.cplex.com
21. Rockafellar, R.T.: Network Flows and Monotropic Optimization. Wiley-Interscience (1984)
A Chemical-Distance-Based Test for Positive Darwinian Selection

Tal Pupko 1, Roded Sharan 2, Masami Hasegawa 1, Ron Shamir 2, and Dan Graur 3

1 The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo, Japan
  {tal,hasegawa}@ism.ac.jp
2 School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
  {roded,rshamir}@post.tau.ac.il
3 Department of Zoology, Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
  [email protected]
Abstract. There are very few instances in which positive Darwinian selection
has been convincingly demonstrated at the molecular level. In this study, we
present a novel test for detecting positive selection at the amino-acid level. In this
test, amino-acid replacements are characterized in terms of chemical distances,
i.e., degrees of dissimilarity between the exchanged residues in a protein. The test
identifies statistically significant deviations of the mean observed chemical dis-
tance from its expectation, either along a phylogenetic lineage or across a subtree.
The mean observed distance is calculated as the average chemical distance over
all possible ancestral sequence reconstructions, weighted by their likelihood. Our
method substantially improves over previous approaches by taking into account
the stochastic process, tree phylogeny, among site rate variation, and alternative
ancestral reconstructions. We provide a linear time algorithm for applying this
test to all branches and all subtrees of a given phylogenetic tree. We validate
this approach by applying it to two well-studied datasets, the MHC class I gly-
coproteins serving as a positive control, and the house-keeping gene carbonic
anhydrase I serving as a negative control.
1 Introduction
The neutral theory of molecular evolution maintains that the great majority of evolu-
tionary changes at the molecular level are caused not by Darwinian selection acting on
advantageous mutants, but by random fixation of selectively neutral or nearly neutral
mutants [12]. There are very few cases in which positive Darwinian selection was con-
vincingly demonstrated at the molecular level [ 10,22,34,30,23]. These cases are vital
to understanding the link between sequence variability and adaptive evolution. Indeed,
it has been estimated that positive selection has occurred in only 0.5% of all protein-
coding genes [2].
The most widely used method for detecting positive Darwinian selection is based
on comparing synonymous and nonsynonymous substitution rates between nucleotide
which might result from assuming a specific ancestral sequence reconstruction. The ra-
tionale underlying our proposed test is that the evolutionary acquisition of a new func-
tion requires a significant change of the biochemical properties of the amino-acid se-
quence [7]. To quantify this biochemical difference between two amino-acid sequences,
we define a chemical distance measure based on, e.g., Grantham's matrix [4]. Our test
identifies large deviations of the mean observed chemical distance from the expected
distance along a branch or across a subtree in a phylogenetic tree. If the observed chem-
ical distance between two sequences significantly exceeds the chance expectation, then
it is unlikely that this is the result of random genetic drift, and positive Darwinian se-
lection should be invoked.
Based on the assumed stochastic process, the tree topology and its branch lengths,
we calculate both the mean observed chemical distance and its underlying distribution
for the branch or subtree in question. The mean observed chemical distance is calculated
as the average chemical distance over all ancestral sequence reconstructions, weighted
by their likelihood, thus, eliminating possible bias in a calculation based on a particular
ancestral sequence reconstruction. The underlying distribution of this random variable
is calculated using the JTT stochastic model [11], the tree topology and branch lengths,
taking into account among site rate variation. We provide a linear time algorithm to
perform this test for all branches and subtrees of a phylogenetic tree with n leaves.
In order to validate our approach, we applied it to two control datasets: Class I
major-histocompatibility-complex (MHC) glycoproteins, and carbonic anhydrase I.
These datasets were chosen since they were already used as standard positive control
(MHC) and negative control (carbonic anhydrase) for positive selection [ 24]. For the
MHC class I dataset, as reported in [9], we observe positive selection which favors
charge replacements only when applying the test to the subsequences of the binding
cleft (P < 0.01). In addition we observe positive selection which favors polarity re-
placements when using Grantham's polarity indices [4] (P < 0.01). When applying
the test to the carbonic anhydrase dataset, no positive selection is observed.
The paper is organized as follows: Section 2 contains the notations and terminology
used in the paper. Section 3 presents the new test for positive Darwinian selection. Sec-
tion 4 describes the application of this test to the two control datasets. Finally, Section 5
contains a summary and a discussion of our approach.
2 Preliminaries
Let A be the set of 20 amino acids. We assume that sequence evolution follows the JTT probabilistic reversible model [11]. For amino-acid sequences this model is described by a 20 × 20 matrix M, indicating the relative replacement rates of amino acids, and a vector (P_A, ..., P_Y) of amino-acid frequencies. For each branch of length t and amino acids i and j, the i → j replacement probability, denoted by P_ij(t), can be calculated from the eigenvalue decomposition of M [13]. (In practice, an approximation to P_ij(t) is used to speed up the computation [19].) We denote by f_ij(t) = P_i P_ij(t) = P_j P_ji(t) the probability of observing i and j in the same position in two aligned sequences of evolutionary distance t.
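For illustration only (this is not the approximation of [19]), the replacement probabilities of a reversible model can be obtained from its rate matrix with a matrix exponential; the toy 2-state matrix below merely stands in for the 20 × 20 JTT rates.

```python
import numpy as np
from scipy.linalg import expm

def replacement_probabilities(Q, t):
    """P(t) = exp(Q t) for a rate matrix Q whose rows sum to zero."""
    return expm(Q * t)

# Toy 2-state example standing in for the JTT matrix.
Q = np.array([[-1.0, 1.0],
              [ 2.0, -2.0]])
P = replacement_probabilities(Q, 0.5)
assert np.allclose(P.sum(axis=1), 1.0)   # each row of P(t) is a probability vector
```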
Given a chemical distance d between amino acids, the mean observed chemical distance between two aligned sequences s1 and s2 of length N is

D(s1, s2) = (1/N) Σ_{i=1}^{N} d(s1_i, s2_i)
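A direct Python transcription of this definition; the distance matrix d is a placeholder (for example a symmetrised Grantham matrix), since the precise definition of d(a, b) appears on a page not reproduced here.

```python
import numpy as np

def mean_chemical_distance(s1, s2, d):
    """D(s1, s2): average chemical distance over aligned, gap-free positions.

    s1, s2 : equal-length amino-acid strings
    d      : dict mapping unordered residue pairs (frozensets) to a distance
    """
    assert len(s1) == len(s2)
    return float(np.mean([d[frozenset((a, b))] if a != b else 0.0
                          for a, b in zip(s1, s2)]))
```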
Let T be an unrooted phylogenetic tree. For a node v, we denote by N (v) the set
of nodes adjacent to v. For an edge (u, v) ∈ T we denote by t(u, v) the length of the branch connecting u and v.
In this section we describe a new test for detecting positive Darwinian selection. The
input to the test is a set of gap-free aligned sequences and a phylogenetic tree for them.
We first present a version of our test for a pair of known sequences. We then extend
this method to test positive selection on specific branches of a phylogenetic tree under
study. Finally we generalize the test to subtrees (clades) and incorporate among site rate
variation.
Assuming that the distribution of the chemical distance in each position is identical, we obtain

E(D(s1, s2)) = (1/N) Σ_{i=1}^{N} E(d(s1_i, s2_i)) = E(d(s1_1, s2_1))

V(D(s1, s2)) = V(d(s1_1, s2_1)) / N
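A small sketch of the resulting z-statistic for two known sequences, with the per-site moments computed from the pair probabilities f_ab(t) and a distance matrix d (both assumed to be supplied by the caller):

```python
import numpy as np

def site_moments(F, d):
    """E and V of the per-site chemical distance.

    F : matrix with F[a, b] = f_ab(t), the probability of observing amino
        acids a and b at one site of two sequences at distance t
    d : matrix of chemical distances d(a, b)
    """
    e = float(np.sum(F * d))
    v = float(np.sum(F * d ** 2) - e ** 2)
    return e, v

def z_score(D_obs, e, v, N):
    """z-score of the observed mean distance over N aligned sites."""
    return (D_obs - e) / np.sqrt(v / N)
```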
Here we first describe a general method to apply pairwise tests to a phylogenetic tree.
Suppose that we wish to test a statistical hypothesis on a specific branch of the phy-
logenetic tree. Also suppose that we have a procedure to test our hypothesis on a pair
of known sequences, like the procedure described above. In order to test our hypoth-
esis on a specific branch, we could first infer the corresponding ancestral sequences
(using, e.g., maximum likelihood estimation [ 20]) and then check our hypothesis. In-
ferring ancestral sequences and then using these sequences as observations was done in
e.g., [31]. This approach, which treats estimated reconstructions as observations, may lead to erroneous conclusions due to bias in the reconstruction. A more robust approach
is to average over all possible reconstructions, weighted by their likelihood. By aver-
aging over all possible ancestral assignments, we extend our test to hypothesis testing
on a phylogenetic tree, without possible bias that results from reconstructing particular
sequences at internal tree nodes.
We describe in the following how to apply our test to a specific branch connecting
nodes x and y in a tree T . Since we assume that different positions evolve independently
we restrict the subsequent description to a single site.
Each branch (u, v) ∈ T partitions the tree into two subtrees. Let L(u, v, a) denote the likelihood of the subtree which includes v, given that v is assigned the amino acid a. L(u, v, a) can be computed by the following recursion:

L(u, v, a) = Π_{w ∈ N(v)\{u}}  Σ_{b ∈ A} P_ab(t(v, w)) · L(v, w, b)

For a leaf v, at the base of the recursion, we have L(u, v, a) = 1 if v is assigned amino acid a, and L(u, v, a) = 0 otherwise.

The likelihood of T is thus:

P_T = Σ_{a,b ∈ A} f_ab(t(u, v)) · L(u, v, b) · L(v, u, a)
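A sketch of the bottom-up pass of this recursion for a rooted orientation of the tree (the top-down pass for L(v, u, a) is symmetric and omitted); the tree encoding and the function P(t) are assumptions of the sketch.

```python
import numpy as np

def down_likelihoods(children, branch_len, leaf_state, P, n_states):
    """L(parent(v), v, a) for every node v, computed bottom-up.

    children[v]   : list of children of node v (empty list for leaves)
    branch_len[v] : length of the branch from v to its parent
    leaf_state[v] : observed state index at leaf v
    P(t)          : n_states x n_states replacement-probability matrix
    """
    L = {}

    def recurse(v):
        if not children[v]:                       # leaf: indicator vector
            vec = np.zeros(n_states)
            vec[leaf_state[v]] = 1.0
        else:
            vec = np.ones(n_states)
            for w in children[v]:
                child = recurse(w)
                # sum over b of P_ab(t(v, w)) * L(v, w, b), for every state a of v
                vec *= P(branch_len[w]) @ child
        L[v] = vec
        return vec

    root = next(v for v in children
                if all(v not in cs for cs in children.values()))
    recurse(root)
    return L
```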
E(D(x, y)) = Σ_{z ∈ A^n} Pr(z) Σ_{a,b ∈ A} Pr(x = a, y = b | z) d(a, b)
           = Σ_{a,b ∈ A} d(a, b) Σ_{z ∈ A^n} Pr(z) Pr(x = a, y = b | z)
           = Σ_{a,b ∈ A} d(a, b) f_ab(t(x, y))
We conclude that E(D(x, y)) is the same as in the known-sequences case. For the
variance of D(x, y) we have no explicit formula. Instead, we evaluate V (D(x, y)) using
parametric bootstrap [25]. Specifically, we draw at random many assignments of amino-
acids to the leaves of T and compute D(x, y) for each of them, thereby evaluating its
variance. An assignment to the leaves of T is obtained as follows: We first root T
at an arbitrary node r. We then draw at random an amino-acid for r according to the
amino-acid frequencies. We next draw amino-acids for each child of r according to the
appropriate replacement probabilities of our model, and continue in this manner till we
reach the leaves.
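A sketch of one such draw; `children`, `branch_len`, the stationary frequencies and P(t) are assumed to be given in the same form as in the previous sketch.

```python
import numpy as np

def sample_leaf_assignment(children, branch_len, root, freqs, P, rng):
    """Draw one assignment of states to the leaves of a rooted tree.

    The root state is drawn from the stationary frequencies; each child's
    state is then drawn from the replacement probabilities of its branch,
    recursing down to the leaves.
    """
    n_states = len(freqs)
    states = {root: rng.choice(n_states, p=freqs)}
    leaves = {}

    def recurse(v):
        if not children[v]:
            leaves[v] = states[v]
            return
        for w in children[v]:
            row = P(branch_len[w])[states[v]]     # P_{a b}(t) with a = state of v
            states[w] = rng.choice(n_states, p=row)
            recurse(w)

    recurse(root)
    return leaves

# rng = np.random.default_rng(0); draw M = 100 assignments as in the paper.
```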
Finally, since D(x, y) is approximately normally distributed, we can compute a p-value for the test, which is simply Pr( Z ≥ (D(x, y) − E(D(x, y))) / √V(D(x, y)) ), where Z ~ Normal(0, 1). Note that if the test is applied to several (or all) branches of the tree,
then the significance level of the test should be corrected in accordance with the number
of tests performed, e.g., using Bonferroni's correction, which multiplies the p-value by
the number of branches tested.
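For instance, assuming the mean and variance of D(x, y) have already been obtained as described above, the corrected one-sided p-value could be computed as:

```python
from scipy.stats import norm

def branch_p_value(D, E_D, V_D, n_branches_tested=1):
    """One-sided p-value Pr(Z >= (D - E(D)) / sqrt(V(D))), Bonferroni-corrected."""
    z = (D - E_D) / V_D ** 0.5
    return min(1.0, norm.sf(z) * n_branches_tested)
```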
The algorithm for testing the branches of a phylogenetic tree T is summarized in
Figure 1. For each branch (x, y) ∈ T the algorithm outputs the p-value of the test for
that branch. In the actual implementation we used M = 100.
PositiveSelectionTest(T):
  Root T at an arbitrary node r.
  Draw M assignments to the leaves of T using parametric bootstrap.
  Traverse T bottom-up, computing along the way, for every (u, v) ∈ T and a ∈ A,
    the value of L(u, v, a), where u is the parent of v.
  Traverse T top-down, computing along the way, for every (u, v) ∈ T and a ∈ A,
    the value of L(v, u, a), where u is the parent of v.
  For every (x, y) ∈ T do:
    Calculate D(x, y) and E(D(x, y)).
    Evaluate V(D(x, y)).
    Output the p-value for the branch (x, y).
Theorem 1. For a given phylogenetic tree T with n leaves, the algorithm tests all
branches of T in O(n) time.
Proof. Given L(u, v, a) for every (u, v) ∈ T and every a ∈ A, it is straightforward to compute D(u, v) for all (u, v) ∈ T in linear time. The computation of E(D(u, v)) and V(D(u, v)) is clearly linear. The complexity follows.
The rate of evolution is not constant among amino-acid sites [ 28]. Consider two se-
quences of length N . Suppose that there are on average l replacements per site between
these sequences. This means that we expect lN replacements altogether. How many
replacements should we expect at each particular site? Naive models assume that the
variation of mutation rate among sites is zero, i.e., that all sites have the same replace-
ment probability. Models that take this Among Site Rate Variation (ASRV) into account
assume that at the j-th position the average number of replacements is l·r[j], where each r = r[j] is a rate parameter drawn from some probability distribution. Since the mean rate over all sites is l, the mean of r is equal to 1. Yang suggested the gamma distribution with parameters α and β as the distribution for r, and since the mean of the gamma distribution, α/β, must be equal to 1, β = α [28]; that is:

f(r; α, α) = (α^α / Γ(α)) e^{−αr} r^{α−1}
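The paper later uses a discrete gamma approximation with 4 rate categories; a common way to compute such category rates (not necessarily the exact discretisation used by the authors) is from quantile midpoints:

```python
import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    """Representative rates of k equal-probability categories of Gamma(alpha, alpha).

    Uses the median of each category (quantile midpoint) and renormalises so
    that the mean rate is 1, as required by the model.
    """
    quantiles = (np.arange(k) + 0.5) / k
    rates = gamma.ppf(quantiles, a=alpha, scale=1.0 / alpha)
    return rates / rates.mean()
```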
4 Biological Results
In order to validate our approach, we applied it to two control datasets: Class I major-
histocompatibility-complex (MHC) glycoproteins, and carbonic anhydrase I. We have
chosen to analyze these datasets since they were already used as standard positive con-
trol (MHC) and negative control (carbonic anhydrase) for positive selection tests [ 24].
The datasets contain aligned sequences (all sequences are of the same length, and
the best alignment is gapless). Phylogenetic trees were constructed using the MOLPHY
software [1], with the neighbor-joining method [ 21] for MHC class I, and with the
maximum likelihood method for carbonic anhydrase I. The reason for the use of two
tree construction methods is that in the MHC case we are dealing with 42 sequences and,
therefore, an exhaustive maximum likelihood approach is impractical. Branch lengths
for each topology were estimated using the maximum likelihood method [ 3] with the
JTT stochastic model [11], assuming that the rate is discrete gamma distributed among
sites with 4 rate categories.
[Figure 2: phylogenetic tree over the MHC class I allele sequences (A, B and C alleles); scale bar = 0.1.]
Fig. 2. A Phylogenetic Tree for MHC Class I Sequences. Species labels are as in [9].
The tree topology was estimated by using whole sequences. Branch lengths were esti-
mated for the cleft subsequences only. Each branch was subjected to the positive selec-
tion test on the cleft subsequences. Branches in bold-face indicate p-value< 0.01.
Table 1. A list of z-scores for each of the tests performed on the MHC class I dataset.
The first row contains scores with respect to whole sequences. The second row contains
results with respect to the binding cleft subsequences, with branch lengths as for the
whole sequences. The third row contains results with respect to the binding cleft subse-
quences, with branch lengths reestimated on this part of the sequence only. Significant
z-scores (p-value< 0.01) appear in bold-face.
When we applied our test to the binding site only, positive selection was found with
very high confidence (P < 0.001). The respective z-scores are shown in Table 1. How-
ever, it might be argued, that when only the binding site part of the sequence is analyzed,
the branch lengths estimated for the whole sequences are irrelevant. Since it is known
that the rate of evolution in the binding site is faster relative to the rest of the sequence,
the branch lengths estimated from the whole sequences are underestimated. This under-
estimation can result in a false positive conclusion of positive selection, since we expect
in this case an excess of radical replacements. To overcome this problem, branch lengths
were reestimated on the binding site part of the sequence only. Significant excess of po-
lar and charge replacements were found also with these new estimates (P < 0.01). The
corresponding z-scores are shown in Table 1. We note that, using the 0-1 polarity distance of [9], we found no evidence for positive selection. On the other hand, when we used Grantham's polarity indices [4], significant deviations from the random expecta-
tions were observed (see Table 1). The latter distance measure is clearly more accurate
since it is not restricted to 0-1 values. We conclude that there is a significant excess in
both charge and polar replacements, and not only in charge replacements, as reported
in [9].
Finally, we tested specific branches in the tree to find those branches which con-
tribute the most to the excess of charge replacements. Branches whose corresponding
p-value was found to be smaller than 0.01 appear in bold face in Figure 2. We note that, since we have no prior knowledge of which branches are expected to show an excess of charge replacements, these p-values should be scaled according to the number of branches tested. Nevertheless, these high scoring branches all lie in the subtrees corresponding to the A and B alleles, matching the findings of Hughes et al., who report positive selection for these alleles only [9].
5 Discussion
Natural selection may act to favor amino acid replacements that change certain prop-
erties of amino acids [7]. Here we propose a method to test for such selection. Our
method takes into account the stochastic model of amino-acid replacements, among
site rate variation and the phylogenetic relationship among the sequences under study.
The method is based on identifying large deviations of the mean observed chemical
distance between two proteins from the expected distance. Our test can be applied to
a specific branch of a phylogenetic tree, to a clade in the tree or, alternatively, over
all branches of the phylogenetic tree. The calculation of the mean observed chemical
distance is based on a novel procedure for averaging the chemical distance over all pos-
sible ancestral sequence reconstructions weighted by their likelihood. This results in an
unbiased estimate of the chemical distance along a branch of a phylogenetic tree. The
underlying distribution of this random variable is calculated using the JTT model, tak-
ing into account among site rate variation. We give a linear time algorithm to perform
this test for all branches and subtrees of a given phylogenetic tree.
Two variants of the test are presented: The first is a statistical test of a single branch
in a phylogenetic tree. Positive selection along a tree lineage can be the result of a spe-
cific adaptation of one taxon to some special environment. In this case, the branch in
question is known a priori, and the branch-specific test should be used. Alternatively, if
the selection constraints are continuous, as for example, the selection that promotes di-
versity among alleles of the MHC class I, the test should be applied to all the sequences
under the assumed selection pressure - a clade-based test.
We validated our method on two datasets: Carbonic anhydrase I sequences served
as a negative control, and the cleft of MHC class I sequences as a positive control. MHC
class I sequences were previously shown to be under positive selection pressure, acting
to favor amino-acid replacements that are radical with respect to charge.
There are, however, some limitations to our method. The method relies heavily on
an assumed stochastic model of evolution. If this model underestimates branch lengths,
one might get false positive results. It is for this reason that it is important to estimate
branch lengths under realistic models, taking into account among site rate variation.
Furthermore, if the test is applied to specific parts of the protein, such as an alpha helix,
a replacement matrix that is specific for this part might be preferable over the more gen-
eral JTT model used in this study (see [26]). One might claim that if an excess of, say, polar replacements is found, it should not be interpreted as indicative of positive selection, but
rather, as an indication that a more sequence-specific amino-acid replacement model is
required. In MHC class I glycoproteins, however, other lines of evidence [ 9,24] suggest
positive Darwinian selection.
In the future, we plan to make the test more robust by accommodating uncertainties
in branch lengths and topology. This can be achieved by Markov-Chain Monte-Carlo
methods [6]. The sensitivity of our test to different assumptions regarding the stochastic
process and the phylogenetic tree will be better understood when more datasets are
analyzed.
Acknowledgments
We thank Hirohisa Kishino and Yoav Benjamini for their suggestions regarding the
statistical analysis. The first author was supported by a JSPS fellowship. The second
author was supported by an Eshkol fellowship from the Ministry of Science, Israel. The
fourth author was supported in part by the Israel Science Foundation (grant number
565/99). This study was also supported by the Magnet Daat Consortium of the Israel
Ministry of Industry and Trade and a grant from Tel Aviv University (689/96).
References
1. J. Adachi and M. Hasegawa. MOLPHY: programs for molecular phylogenetics based on maximum likelihood, version 2.3. Technical report, Institute of Statistical Mathematics, Tokyo, Japan, 1996.
2. T. Endo, K. Ikeo, and T. Gojobori. Large-scale search for genes on which positive selection may operate. Mol. Biol. Evol., 13:685–690, 1996.
25. D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis. Phylogenetic inference. In D.M. Hillis, C. Moritz, and B.K. Mable, editors, Molecular Systematics, 2nd Ed., pages 407–514. Sinauer Associates, Sunderland, MA, 1995.
26. J.L. Thorne, N. Goldman, and D.T. Jones. Combining protein evolution and secondary structure. Mol. Biol. Evol., 13:666–673, 1996.
27. H.C. Wang, J. Dopazo, and J.M. Carazo. Self-organizing tree growing network for classifying amino acids. Bioinformatics, 14:376–377, 1998.
28. Z. Yang. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol., 10:1396–1401, 1993.
29. Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol., 39:306–314, 1994.
30. Z. Yang. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol., 51:423–432, 2000.
31. Z. Yang, S. Kumar, and M. Nei. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics, 141:1641–1650, 1995.
32. J. Zhang and S. Kumar. Detection of convergent and parallel evolution at the amino acid sequence level. Mol. Biol. Evol., 14:527–536, 1997.
33. J. Zhang, S. Kumar, and M. Nei. Small-sample tests of episodic adaptive evolution: a case study of primate lysozymes. Mol. Biol. Evol., 14:1335–1338, 1997.
34. J. Zhang, H.F. Rosenberg, and M. Nei. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA, 95:3708–3713, 1998.
Finding a Maximum Compatible Tree for a Bounded
Number of Trees with Bounded Degree Is Solvable in
Polynomial Time

Ganeshkumar Ganapathysaravanabavan and Tandy Warnow
1 Introduction
when returning a set of consensus trees, one for each phylogenetic island obtained dur-
ing a search for the maximum parsimony or maximum likelihood tree). In such cases,
it may return a larger subset of taxa than the maximum agreement subset approach, as
illustrated in Figure 1. This maximum compatible set (MCS) problem is NP-hard for
Fig. 1. The MCT of T1 and T2 has more leaves than the MAST.
6 or more trees [5]. We will denote by MCT the tree constructed on the MCS, and call
it the maximum compatible tree. Occasionally, when the context is clear, we will use
MAST to also denote the maximum agreement subtree.
Our result for the MCS problem is an algorithm for the two-tree MCS problem which runs in O(2^{4d} n^3) time, where n = |S| and d is the maximum degree of the two unrooted trees. (The algorithm we present is a dynamic programming algorithm and is an extension of the earlier dynamic programming algorithm for the two-tree MAST problem in [6].) Thus, we show that the two-tree MCS problem is fixed-parameter tractable. We extend this algorithm to the k-tree MCS problem in which all trees have bounded degree, obtaining an O(2^{2kd} n^{k+1}) algorithm.
The organization of the paper is as follows. In Section 2 we give some basic defi-
nitions (we will introduce and explain other terminology as needed). We then present
the dynamic programming algorithm for two-tree MAST, and discuss the challenges in
extending the algorithm to the two-tree MCS problem. In Section 3 we present the dy-
namic programming algorithm for the two-tree MCS problem, the proof of correctness,
and the running time analysis. We then show how to extend this algorithm to the k-tree case. In Section 4 we discuss further work in the area.
In this section, we first present some definitions and then we describe the dynamic pro-
gramming algorithm for computing the maximum agreement subtree of two trees, as
given in [6]. The algorithm actually computes the cardinality of the maximum agree-
ment subset but can be easily extended to give the maximum agreement subset (or
equivalently the maximum agreement subtree).
Definition 1. Given a leaf-labelled tree T on a set S, the restriction of T to a set X ⊆ S is obtained by removing all leaves in S \ X and then suppressing all internal nodes of degree two. This restriction will be denoted by T|X.
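A small sketch of this restriction operation for rooted trees encoded as nested tuples (the paper's trees are unrooted, so this is only an illustration of the leaf-removal and degree-two suppression steps):

```python
def restrict(tree, X):
    """Restriction T|X of a rooted leaf-labelled tree to a leaf subset X.

    tree : nested tuples for internal nodes, strings for leaf labels,
           e.g. (("1", "2"), ("3", ("4", "5")))
    Removes leaves outside X and suppresses nodes left with a single child.
    """
    if isinstance(tree, str):                       # leaf
        return tree if tree in X else None
    kept = [restrict(child, X) for child in tree]
    kept = [c for c in kept if c is not None]
    if not kept:
        return None
    if len(kept) == 1:                              # suppress degree-2 node
        return kept[0]
    return tuple(kept)

# Example:
assert restrict((("1", "2"), ("3", ("4", "5"))), {"1", "3", "4"}) == ("1", ("3", "4"))
```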
We will first assume that the two trees T and T′ are both rooted (we will later discuss how to extend the algorithm to unrooted trees). The trees can have any degree, but both are leaf-labelled by the same set S of taxa.

Let v be a node in T, and denote by T_v the subtree of T rooted at v. Similarly denote by T′_w the subtree of T′ rooted at a node w in T′. Denote by MAST(T_v, T′_w) the number of leaves in a maximum agreement subset of T_v and T′_w. The dynamic programming algorithm operates by computing MAST(T_v, T′_w) for all pairs of nodes (v, w) in V(T) × V(T′), bottom-up.

We now describe the basic idea of the dynamic programming algorithm. First, the value MAST(T_v, T′_w) is easy to compute when either v or w is a leaf. Now consider the computation of MAST(T_v, T′_w) where neither v nor w is a leaf, and let X be a maximum agreement subset of T_v and T′_w. The least common ancestor of X in T_v may be v, or it may be a node below v. Similarly, the least common ancestor of X in T′_w may be w or it may be a node below w. We have four cases to consider:
Proof. Since X is a compatible subset of taxa for the pair of trees T_1 and T_2, there is a common refinement T of T_1|X and T_2|X. Hence we can refine T_1, obtaining T′_1, and refine T_2, obtaining T′_2, so that T′_1 restricted to X yields T, and similarly T′_2 restricted to X also yields T. Then X is an agreement subset of T′_1 and T′_2.
[Figure: two trees T1 and T2 and their maximum compatible tree MCT(T1, T2).]
algorithm for computing an MCS of two trees: for each way of refining the two trees into
a binary tree, compute a MAST. However, this algorithm is computationally expensive,
since the number of binary refinements of an n leaf tree with maximum degree d can
be O(4nd ). Hence this brute force algorithm will not be acceptably fast.
[Figure: two trees T3 and T4 and their maximum agreement subtree MAST(T3, T4).]
We now describe the dynamic programming algorithm for the maximum compatible set
(MCS) of two rooted trees, both with degree bounded by d (by this we mean that every
node has at most d children). As for the MAST problem, here too the algorithm com-
putes the cardinality of an MCS, but can be easily extended to compute an MCS itself.
This algorithm can easily be extended to produce a dynamic programming algorithm
for computing an MCS of two unrooted trees, by computing an MCS of each of the n
pairs of rooted trees (obtained by rooting the unrooted trees at each of the n leaves).
Furthermore, we also show how to extend the algorithm to handle k rooted or unrooted
trees.
The basic set of problems we need to compute must include the computation of an MCS of subtrees T_v and T′_w, for every pair of nodes v and w. (T_v denotes the subtree of T rooted at v, and similarly T′_w denotes the subtree of T′ rooted at w.) We will also need to include the computation of MCSs of other pairs of trees, but we begin our discussion with these MCS calculations.
Let T and T′ be two rooted trees, and let v and w denote nodes in T and T′, respectively. Let the children of v be v_1, v_2, ..., v_p and the children of w be w_1, w_2, ..., w_q. Let X be the set of leaves involved in an MCS T* of T_v and T′_w. Note that T|X and T′|X will only include those children of v and w which have some element(s) of X below them. Let A be the children of v included in T|X and B the children of w included in T′|X. (Note that X defines the sets A and B.)

Note also that any MCS of T and T′ actually defines an agreement subset of some binary refinement of T and some binary refinement of T′ (Lemma 1). Hence, T* defines a binary refinement at the node v if |A| > 1, and a binary refinement at the node w if |B| > 1. In these cases, T* defines a partition of the nodes in A into two sets, and a partition of the nodes in B into two sets.
There are four cases to consider:
1. |A| = |B| = 1, i.e., A = {v_i} and B = {w_j} for some i and j. In this case, any MCS of T_v and T′_w is an MCS of T_{v_i} and T′_{w_j}.
2. |A| = 1 and |B| > 1, i.e., any MCS of T_v and T′_w is an MCS of T_{v_i} and T′_w for some i.
3. |A| > 1 and |B| = 1, i.e., any MCS of T_v and T′_w is an MCS of T_v and T′_{w_j} for some j.
4. |A| > 1 and |B| > 1.
The analysis of the fourth case is somewhat complicated, and is the reason that we need additional subproblems. Recall that T* defines a bipartition of A into (A′, A \ A′) and of B into (B′, B \ B′). Further, recall that T* is a binary tree with two subtrees off the root; we call these subtrees T*_1 and T*_2. It can then be observed that T*_1 is an MCS of the subtree of T_v obtained by restricting to the nodes below A \ A′ and the subtree of T′_w obtained by restricting to the nodes below B \ B′. Similarly, T*_2 is an MCS of the subtree of T_v obtained by restricting to the nodes below A′ and the subtree of T′_w obtained by restricting to the nodes below B′. Hence we need to define additional subproblems as follows. For each A′ ⊆ A define the tree T(v, A′) to be the subtree of T_v obtained by deleting all the children of v (and their descendants) not in A′. Similarly define the tree T′(w, B′) to be the subtree of T′_w obtained by deleting all the children of w (and their descendants) not in B′. The construction of the tree T(v, A′) is illustrated in Figure 4. Now define MCS(v, A′, w, B′) to be the size of an MCS of T(v, A′) and T′(w, B′).
Fig. 4. The Tree T, the Set A′ and the Tree T(v, A′).
Running time analysis. There are O(2^d n) trees T(v, A), and hence O(2^{2d} n^2) subproblems. The computation of MCS(v, A, w, B) involves computing the maximum of 2d + 2^{2d} values, and hence takes O(2^{2d}) time. Hence the running time is O(2^{4d} n^2).
3.3 Algorithm for the MCS Problem of k Rooted Trees with Bounded Degree

We now show how to extend the analysis to k rooted trees. In this case, the subproblems are 2k-tuples of the form MCS(v_1, A_1, v_2, A_2, ..., v_k, A_k), where v_i is a node in T_i and A_i ⊆ Children(v_i). Hence there are O(2^{kd} n^k) subproblems. Computing each subproblem involves taking the maximum of O(kd + 2^{kd}) values. Hence the running time for the algorithm is O(2^{2kd} n^k).
4 Future Work
To conclude we point out that many questions about the MCS problem remain unsolved.
We know that MCS is NP-hard for 6 trees with unbounded degree, but we do not know
the minimum number of trees for which MCS becomes hard. In particular, we do not
know if MCS is NP-hard or polynomial for two trees. It also remains to be seen if there
are any approximation algorithms for the problem, or exact algorithms when only some
of the trees have bounded degree.
Acknowledgments
Tandy Warnow's work was supported in part by the David and Lucile Packard Foundation.
References
1. A. Amir and D. Kesselman. Maximum agreement subtree in a set of evolutionary trees: metrics and efficient algorithms. Proceedings of the 35th IEEE FOCS, Santa Fe, 1994.
2. M. Farach-Colton, T.M. Przytycka, and M. Thorup. On the agreement of many trees. Information Processing Letters 55 (1995), 297–301.
3. C.R. Finden and A.D. Gordon. Obtaining common pruned trees. Journal of Classification 2, 255–276 (1985).
4. H.N. Gabow and R.E. Tarjan. Faster scaling algorithms for network problems. SIAM J. Comput. 18 (5), 1013–1036 (1989).
5. A.M. Hamel and M.A. Steel. Finding a maximum compatible tree is NP-hard for sequences and trees. Research Report No. 114, Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand, 1994.
6. M. Steel and T. Warnow. Kaikoura tree theorems: computing the maximum agreement subtree. Information Processing Letters 48, 77–82 (1993).
7. D. Swofford, personal communication.
7. D. Swofford, personal communication.
Experiments in Computing Sequences of Reversals
1 Introduction
For several good reasons, the problem of sorting signed permutations has received a lot
of attention in recent years. One of the attractive features of this problem is its simple
and precise formulation: Given a permutation π of the integers between 1 and n, some of which may have a minus sign,

π = (π_1 π_2 ... π_n),

find d(π), the minimum number of reversals that transform π into the identity permutation

(+1 +2 ... +n).

The reversal operation reverses the order of a block of consecutive elements of π, while changing their signs.
Another good reason to study this problem is comparative genomics. The genome
of a species can be thought of as a set of ordered sequences of genes, the ordering devices being the chromosomes, each gene having an orientation given by its location on the DNA double strand. Different species often share similar genes that were
inherited from common ancestors. However, these genes have been shuffled by muta-
tions that modified the content of chromosomes, the order of genes within a particular
chromosome, and/or the orientation of a gene. Comparing two sets of similar genes ap-
pearing along a chromosome in two different species yields two (signed) permutations.
It is widely accepted that the reversal distance between these two permutations, that is,
the minimum number of reversals that transform one into the other, faithfully reflects
the evolutionary distance between the two species.
The last, and probably the best feature of the sorting problem from an algorithmic
point of view, is the dramatic increase in efficiency its solution exhibited in a few years.
From a problem of unknown complexity in the early nineties, a time when approximate solutions were plentiful [6], polynomial solutions of constantly decreasing degree were
successively found: O(n^4) [5], O(n^2 α(n)) [3], O(n^2) [7], and finally O(n) [1], for the distance computation alone. A computer scientist's delight, knowing that the unsigned version was proved to be NP-hard [4].
This high level of scientific activity inevitably generated undesirable side-effects.
Complex constructions, useful in the initial investigations, were later proven unnec-
essary. Terminology is not yet standard. For example, the overlap graphs of [7] are
different from the ones in [1]. Most importantly, the complexity measures tend to mix
two different problems, as pointed out in [1]: the computation of the number of neces-
sary reversals, and the reconstruction of one possible sequence of reversals that realizes
this number.
The first problem has an efficient and simple linear solution [1] that can hardly be improved on. In this paper, we address the second problem with elementary tools, further developing the ideas of [2]. We show that, for any problem of biologically relevant size, it is possible to implement efficient algorithms using the simplest data structures and operations: in this case, bit-vectors and standard logical and arithmetic operations.
The next section presents a brief introduction to the current theory. It is followed by
a discussion of the implementation, and results on simulated biological data.
2 Sorting by Reversals
The basic construction used for computing d(π), the reversal distance of a signed permutation π, is the cycle graph¹ associated with π. Each positive element x in the permutation is replaced by the sequence 2x - 1, 2x, and each negative element -x by the sequence 2x, 2x - 1. Integers 0 and 2n + 1 are added as first and last elements. For example, the permutation
π = (-2 -1 +4 +3 +5 +8 +7 +6)
becomes
(0 4 3 2 1 7 8 5 6 9 10 15 16 13 14 11 12 17).
The elements of this extended sequence are the vertices of the cycle graph. Edges join every other pair of consecutive elements, starting with 0, and every other pair of consecutive integers, starting with (0, 1). The first group of edges, the horizontal ones, is often referred to as black edges, and the second as arcs or gray edges.
Every connected component of the cycle graph is a cycle, which is a consequence of the fact that each vertex has exactly two incident edges. The graph of Figure 1 has 4 cycles.
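The construction is easy to mechanize. The following Python sketch (our own illustration; extend and count_cycles are hypothetical names) builds the extended sequence and counts the cycles of the cycle graph; on the example permutation it reports 4 cycles, as stated above.

def extend(perm):
    # Replace +x by 2x-1, 2x and -x by 2x, 2x-1; add 0 and 2n+1 as first and last elements.
    seq = [0]
    for x in perm:
        seq += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    return seq + [2 * len(perm) + 1]

def count_cycles(perm):
    seq = extend(perm)
    black = {}
    for p in range(0, len(seq), 2):      # black edges join positions (0,1), (2,3), ...
        a, b = seq[p], seq[p + 1]
        black[a], black[b] = b, a
    seen, cycles = set(), 0
    for start in seq:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]                 # follow the black edge,
            seen.add(w)
            v = w ^ 1                    # then the gray edge (2i <-> 2i+1)
    return cycles

pi = [-2, -1, 4, 3, 5, 8, 7, 6]
print(extend(pi))        # [0, 4, 3, 2, 1, 7, 8, 5, 6, 9, 10, 15, 16, 13, 14, 11, 12, 17]
print(count_cycles(pi))  # 4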
The support of an arc is the interval of elements between, and including, its endpoints. Two arcs overlap if their supports intersect without proper containment. An arc is oriented if its support contains an odd number of elements; otherwise it is unoriented. Note that an arc is oriented if and only if its endpoints belong to elements with different signs in the original permutation.
¹ A cycle graph in which all cycles of length two have been removed is often called a breakpoint graph.
[Fig. 1: the cycle graph of the permutation (-2 -1 +4 +3 +5 +8 +7 +6); its vertices, in order, are 0 4 3 2 1 7 8 5 6 9 10 15 16 13 14 11 12 17.]
The arc overlap graph is the graph whose vertices are the arcs of the cycle graph, and whose edges join overlapping arcs². The overlap graph corresponding to the cycle graph of Figure 1 is illustrated in Figure 2, in which each vertex is labeled by an arc (2i, 2i+1). Oriented vertices, those for which the corresponding arc is oriented, are marked by black dots. Orientation extends to connected components in the sense that a connected component with at least one oriented vertex is oriented. It is easy to show that a vertex is oriented if and only if its degree is odd.
[Fig. 2: the overlap graph of the cycle graph of Figure 1; vertices are labeled (0, 1), (2, 3), . . . , (16, 17).]
The reversal distance is given by
d(π) = n + 1 - c + h + f,
where c is the number of cycles in the cycle graph, h is the number of hurdles, and f is a correction factor equal to 1 when there are at least 3 hurdles satisfying a particular condition. With the above formula, it is easy to compute the reversal distance of the permutation of Figure 1 and Figure 2 as 9 - 4 + 1 = 6.
² The cycle overlap graph is obtained in a similar way, by defining two cycles to overlap if they have at least two overlapping arcs.
Computing the distance d(π) can be done in linear time [1]. However, this number alone gives no clue to the reconstruction of a possible sequence of reversals that realizes d(π).
Reconstructing a possible sequence of reversals raises two different problems. The
first one is how to deal with unoriented components. This problem is solved by carefully
choosing a sequence of reversals that merges these components while creating at least
one oriented vertex in each component [5], [3], [7]. The selection of these reversals can be done efficiently while computing d(π).
The second step, called the oriented sort, requires choosing, among several candidates, a safe reversal, that is, a reversal that decreases the reversal distance. Such a reversal always exists, but can be hard to find. For example, choosing the obvious reversal of the first two elements in the permutation
(-2 -1 +4 +3)
yields
(+1 +2 +4 +3),
which still has d = 3, since one hurdle was created by the reversal. On the other hand, the original permutation can be sorted by the sequence:
(-2 -1 +4 +3)
(-4 +1 +2 +3)
(-4 -3 -2 -1)
(+1 +2 +3 +4)
In general, to any oriented vertex of the overlap graph, there is an associated reversal
that creates consecutive elements in the permutation, and the search for safe reversals
can be restricted to reversals associated to oriented vertices. Several criteria have been
proposed to select safe reversals. In [5], the selection of a safe reversal is the most
expensive iteration of the algorithm; [3] reduces the complexity of the search by con-
sidering only O(log n) candidates; and [7] gives a characterization of a safe reversal in
terms of cliques. In [2], we showed that there is a much simpler way to identify a safe
reversal:
Theorem 1. The reversal that maximizes the number of oriented vertices is safe.
In the following sections we discuss a bit-vector implementation of the oriented sort
that uses this property.
The reversal associated with an oriented vertex v complements the subgraph of vertices adjacent to v, and reverses their orientation. Therefore, the net change in the number of oriented vertices depends only on the orientation of vertices adjacent to v.
The score of a reversal associated to vertex v is defined by the difference between the number of its unoriented neighbors, U_v, and the number of its oriented neighbors, O_v. The score of a reversal is a local property of the graph, and this locality suggests the possibility of a parallel algorithm to keep the scores and to compute the effects of a reversal.
For a signed permutation of length n, we will denote by bold letters the characteristic bit-vectors of subsets of the n + 1 arcs (0, 1) to (2n, 2n + 1). We will use only three operations on these vectors: the exclusive-or operator ⊕; the conjunction ∧; and the negation ¬.
[Figure: an overlap graph on the seven arcs (0, 1), (2, 3), . . . , (12, 13).] The bit-matrix representation of this overlap graph is the following:
v0 v1 v2 v3 v4 v5 v6
v0 0 0 1 1 0 0 0
v1 0 0 1 0 1 0 1
v2 1 1 0 1 1 0 1
v3 1 0 1 0 1 0 1
v4 0 1 1 1 0 1 0
v5 0 0 0 0 1 0 1
v6 0 1 1 1 0 1 0
p 0 1 1 0 0 0 0
s 0 1 3 2 0 2 0
The last two lines contain, respectively, the parity, or orientation, of the vertex,
and the score of the associated reversal. We will discuss efficient ways to initialize the
structure and to adjust scores in Sections 3.2 and 3.3.
Given the vectors p and s, selecting the oriented reversal with maximal score is
elementary. In the above example, vertex 2 would be the selected candidate.
The interesting part is how a reversal affects the structure. These effects are summarized in the following algorithm, which recalculates the bit-matrix v, the parity vector p, and the score vector s, following the reversal associated to vertex i, whose set of adjacent vertices is denoted by v_i.
s ← s + v_i
v_i[i] ← 1
For each vertex j adjacent to i
    If j is oriented
        s ← s + v_j
        v_j[j] ← 1
        v_j ← v_j ⊕ v_i
        s ← s + v_j
    Else
        s ← s - v_j
        v_j[j] ← 1
        v_j ← v_j ⊕ v_i
        s ← s - v_j
p ← p ⊕ v_i
The logic behind the algorithm is the following. Since vertex i will become unori-
ented and isolated, each vertex adjacent to i will automatically gain a point of score.
Next, if j is a vertex adjacent to i, vertices adjacent to j after the reversal are either
existing vertices that were not adjacent to i, or vertices that were adjacent to i but not
to j. The exceptions to this rule are i and j themselves, and this problem is solved by
setting the diagonal bits to 1 before computing the direct sum.
If j is oriented, each of its former adjacent vertices will gain one point of score,
since j will become unoriented, and each of its new adjacent vertices will gain one
point of score. Note that a vertex that stays connected to j will gain a total of two points. For unoriented vertices, the gains are converted to losses.
The amount of work done to process a reversal corresponding to vertex i, in terms of vector operations, is thus proportional to the number of vertices adjacent to vertex i.
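The update above translates almost directly into code. The sketch below is our own rendering, with hypothetical names: Python integers serve as the bit-vectors, and the scores are kept in an ordinary integer array rather than in the log(n) bit-planes used by the paper's implementation.

def bits(x):
    # Indices of the set bits of a Python int used as a bit-vector.
    i = 0
    while x:
        if x & 1:
            yield i
        x >>= 1
        i += 1

def do_reversal(i, v, p, s):
    # v: list of ints, v[j] = characteristic vector of the neighbours of vertex j;
    # p: int, bit j set iff vertex j is oriented; s: list of integer scores.
    # Returns the updated parity vector (ints are immutable in Python).
    neighbours = v[i]
    for k in bits(neighbours):        # every neighbour of i gains one point
        s[k] += 1
    v[i] |= 1 << i                    # set the diagonal bit before the exclusive-or
    for j in bits(neighbours):
        sign = 1 if (p >> j) & 1 else -1
        for k in bits(v[j]):          # former neighbours of j
            s[k] += sign
        v[j] |= 1 << j                # diagonal bit of j
        v[j] ^= v[i]                  # new neighbourhood of j (excludes i and j)
        for k in bits(v[j]):          # new neighbours of j
            s[k] += sign
    return p ^ v[i]                   # i and its old neighbours change orientation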
The candidates are obtained by filtering the score matrix through the parity vector. The set c of candidates initially contains all the oriented vertices. Going from the highest bit of the scores to the lowest, if at least one of the candidates has bit i set to 1, we eliminate all candidates for which bit i is 0.
c ← p
i ← log(n)
While i ≥ 0 do
    While (c ∧ s_i) = 0
        i ← i - 1
    If i ≥ 0
        c ← c ∧ s_i
        i ← i - 1
At the end of the loop, c is the set of oriented vertices of maximal score.
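A direct Python rendering of this selection loop is given below (our own sketch, with hypothetical names); it rebuilds the score bit-planes from an ordinary score array and assumes, for simplicity, that the scores are non-negative.

def best_candidates(scores, p, n):
    # Bit-plane i has bit j set iff bit i of scores[j] is set.
    nbits = max(1, max(scores).bit_length())
    plane = [sum(((scores[j] >> i) & 1) << j for j in range(n)) for i in range(nbits)]
    c = p                                     # candidates: all oriented vertices
    i = nbits - 1
    while i >= 0:
        while i >= 0 and (c & plane[i]) == 0:
            i -= 1
        if i >= 0:
            c &= plane[i]
            i -= 1
    return [j for j in range(n) if (c >> j) & 1]

# On the example bit-matrix shown earlier: p marks v1 and v2, scores are (0, 1, 3, 2, 0, 2, 0).
print(best_candidates([0, 1, 3, 2, 0, 2, 0], 0b0000110, 7))   # [2]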
If a_j = 0
    Then a_j ← 1    (* first endpoint of arc (2j, 2j + 1) *)
    Else            (* second endpoint *)
        a_j ← 0
        v_j ← a     (* a is the set l_j *)
We then repeat the process in the reverse order, reading the permutation from right to left, initializing a to the set {n}, and changing the last instruction to v_j ← v_j ⊕ a.
The formal analysis of the algorithm of Section 3 raises interesting questions. For ex-
ample, what is an elementary operation? Except for a few control statements, the only
operations used by the algorithm are very efficient bit-wise logical operators on words of size w, typically 32 or 64, depending on the implementation. The most expensive instructions in the main loop are additions and subtractions, such as
s ← s + v_j,
where s is a bit matrix of size n × log(n), and v_j is a bit vector of size n. Such an operation requires a total of (2n log(n))/w elementary operations with the loop described in Section 3.2. Fortunately, log(n) is much smaller than w, and, in the range of biologically meaningful values, n is often a small multiple of w. In the actual implementation, the loop is controlled by the value of log(maximal score), which tends to be much less than log(n). We thus have a very generous O(n) estimate for the instructions in the main loop.
The overall work done by the algorithm depends on the total number v of vertices adjacent to vertices of maximal score. We can easily bound it by n^2, noting that the number d of reversals needed to sort the permutation is bounded by n, and the degree of a vertex is also bounded by n. We thus get an O(n^3) estimate for the algorithm, assuming that log n < w. However, experimental results suggest that v is better estimated by O(n), at least for values of n up to 4096, which seems largely sufficient for most biological applications.
The value of v is hard to control experimentally, but if we write the quantity governing the running time as d·v_m·n, in which v_m is the mean number of adjacent vertices, then both d and n can be fixed in an independent way.
The first observation is that the implementation is very fast for typical biological prob-
lems. With n = 128, and k = 32, we can compute 10,000 sequences of reversals in 11s.
On the other end of the spectrum, we applied the algorithm to values up to n = 4096.
In this case, the computation of reversal sequences for k = 512 took a mean time of
3.86s for 100 random permutations.
In order to study the effect of the variation of n on the running time, we chose four different evolution rates k = 32, 128, 256, and 512. For each value of k, we generated sets of 100 permutations of length ranging from 256 to 4,096. Figure 3 displays the results for the mean sorting time for k = 256 and k = 512; values for smaller k's were too low for a significant analysis.
[Fig. 3: mean sorting time (in seconds) as a function of the length of the permutation, for k = 256 and k = 512.]
In this range of values, the behavior of the algorithm is clearly linear. Recall from the analysis that the estimated running time is governed by the quantity d·v_m·n. With k constant and n sufficiently large, d is also constant. It seems that for large n, the shape of the overlap graph depends only on d, which would certainly be true in the limit case of a continuous interval.
In this series of experiments, we studied the effect of the variation of d, the number of
reversals, for a fixed n = 1,024. We generated sets of 500 permutations with evolution
rates varying from k = 64 to k = 1,024, with equal increments.
For each set, we computed the mean number of reversals. Figure 4 presents the pairs
of mean values (d, t). The fact that the points are closer together in the right part of the
graph is called saturation: when k grows close to n, the value of d tends to stabilize.
[Fig. 4: mean sorting time (in seconds) as a function of the mean number of reversals d, for n = 1,024.]
At least for the studied range of values, the dependence of the algorithm's performance on the value of d, for a fixed n, seems to be much less than O(d^2), which is a bit surprising, given that what is measured here is d·v_m. Factoring out d from the data yields the curve of Figure 5, which appears to be O(log^2(d)).
5 Conclusions
References
1. David Bader, Bernard Moret, Mi Yan, A Linear-Time Algorithm for Computing Inversion Distance Between Signed Permutations with an Experimental Study. Proc. 7th Workshop on Algs. and Data Structs. WADS'01, Providence (2001), to appear in Lecture Notes in Computer Science, Springer Verlag.
2. Anne Bergeron, A Very Elementary Presentation of the Hannenhalli-Pevzner Theory. CPM 2001, Jerusalem (2001), to appear in Lecture Notes in Computer Science, Springer Verlag.
3. Piotr Berman and Sridhar Hannenhalli, Fast Sorting by Reversal. CPM 1996, LNCS 1075: 168-185 (1996).
4. Alberto Caprara, Sorting by reversals is difficult. RECOMB 1997, ACM Press: 75-83 (1997).
5. Sridhar Hannenhalli and Pavel Pevzner, Transforming Cabbage into Turnip: Polynomial Algorithm for Sorting Signed Permutations by Reversals. JACM 46(1): 1-27 (1999).
6. John Kececioglu and David Sankoff, Efficient bounds for oriented chromosome-inversion distance. CPM 1994, LNCS 807: 307-325 (1994).
7. Haim Kaplan, Ron Shamir, Robert Tarjan, A Faster and Simpler Algorithm for Sorting Signed Permutations by Reversals. SIAM J. Comput. 29(3): 880-892 (1999).
8. Pavel Pevzner, Computational Molecular Biology, MIT Press, Cambridge, Mass., 314 pp. (2000).
9. David Sankoff, Edit Distances for Genome Comparisons Based on Non-Local Operations. CPM 1992, LNCS 644: 121-135 (1992).
Exact-IEBP: A New Technique for Estimating
Evolutionary Distances between Whole Genomes
Li-San Wang
1 Introduction
Genome Rearrangement Evolution. The genomes of some organisms have a single
chromosome or contain single chromosome organelles (such as mitochondria [4,14] or
chloroplasts [13,15]) whose evolution is largely independent of the evolution of the nu-
clear genome for these organisms. Many single-chromosome organisms and organelles
have circular chromosomes. Gene maps and whole genome sequencing projects can
provide us with information about the ordering and strandedness of the genes, so the
chromosome is represented by an ordering (linear or circular) of signed genes (where
the sign of the gene indicates which strand it is located on). The evolutionary process
on the chromosome can thus be seen as a transformation of signed orderings of genes.
The process includes inversions, transpositions, and inverted transpositions, which we
will define later.
True Evolutionary Distances. Let T be the true tree on which a set of genomes has
evolved. Every edge e in T is associated with a number k_e, the actual number of rearrangements along edge e. The true evolutionary distance (t.e.d.) between two leaves G_i and G_j in T is k_{ij} = Σ_{e ∈ P_{ij}} k_e, where P_{ij} is the simple path in T between G_i and G_j. If we can estimate all k_{ij} sufficiently accurately, we can reconstruct the tree T using
very simple methods, and in particular, using the neighbor joining method (NJ) [1,16].
Estimates of pairwise distances that are close to the true evolutionary distances will in
general be more useful for evolutionary tree reconstruction than edit distances, because
edit distances underestimate true evolutionary distances, and this underestimation can
be very significant as the number of rearrangements increases [7,20].
There are two criteria for evaluating a t.e.d. estimator: how close the estimated dis-
tances are to the true evolutionary distance between two genomes, and how accurate
the inferred trees are when a distance-based method (e.g. neighbor joining) is used in
conjunction with these distances. The importance of obtaining good t.e.d. estimates
when analyzing DNA sequences (under stochastic models of DNA sequence evolution)
is understood, and well-studied [20].
IEBP. The IEBP (Inverting the Expected BreakPoint distance) method [21] estimates
the true evolutionary distance by approximating the expected breakpoint distance (see
Section 2) under the GNT model with provable error bound. The method can be applied
to any dataset of genomes with equal gene content, and for any relative probabilities of
rearrangement event classes. Moreover, the method is robust when the assumptions
about the model parameters are wrong.
EDE. In the EDE (Empirically Derived Estimator) method [10] we estimate the true
evolutionary distance by inverting the expected inversion distance. We estimate the ex-
pected inversion distance by a nonlinear regression on simulation data. The evolution-
ary model in the simulation is inversion only, but NJ using EDE distance has very good
accuracy even when transpositions and inverted transpositions are present.
Our New t.e.d. Estimator. In this paper we improve the result in [21] by introducing the
Exact-IEBP method. The method replaces the approximation in the IEBP method by
computing the expected breakpoint distance exactly. In Section 3 we show the deriva-
tion for our new method. The technique is then checked by computer simulations in
Section 4. The simulation shows that the new method is the best t.e.d. estimator, and
the accuracy of the NJ tree using the new method is comparable to that of the NJ tree
using the EDE distances, and better than that of the NJ tree using other distances.
2 Definitions
We first define the breakpoint distance [3] between two genomes. Let genome G_0 = (g_1, . . . , g_n), and let G be a genome obtained by rearranging G_0. The two genes g_i and g_j are adjacent in genome G if g_i is immediately followed by g_j in G or, equivalently, if -g_j is immediately followed by -g_i. A breakpoint in G with respect to G_0 is defined as an ordered pair of genes (g_i, g_j) such that g_i and g_j are adjacent in G_0 but are not adjacent in G (neither (g_i, g_j) nor (-g_j, -g_i) appears consecutively in that order in G). The breakpoint distance between two genomes G and G_0 is the number of breakpoints in G with respect to G_0 (or vice versa, since the breakpoint distance is symmetric). For example, let G = (1, 2, 3, 4) and let G' = (1, -3, -2, 4); there is a breakpoint between genes 1 and 3 in G' (w.r.t. G), but genes 2 and 3 are adjacent in G' (w.r.t. G). The breakpoint distance between two genomes is the number of breakpoints in one genome with respect to the other.
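As a concrete illustration (ours, not the paper's), the following Python sketch computes the breakpoint distance between two signed linear genomes with equal gene content; sentinel genes and circular genomes are not handled here.

def breakpoint_distance(g, h):
    # (a, b) is an adjacency of g if a is immediately followed by b in g,
    # or -b is immediately followed by -a.
    adj = set()
    for a, b in zip(g, g[1:]):
        adj.add((a, b))
        adj.add((-b, -a))
    return sum((a, b) not in adj for a, b in zip(h, h[1:]))

print(breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4]))   # 2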
A rearrangement ρ is a permutation of the genes in the genome, followed by either negating or retaining the sign of each gene. For any genome G, let ρG be the genome obtained by applying ρ to G. Let R_I, R_T, R_V be the sets of all inversions, transpositions, and inverted transpositions, respectively. We assume the evolutionary model is the GNT model with parameters α and β. Within each of the three types of rearrangement events, all events have the same probability.
Let G_0 = (g_1, g_2, . . . , g_n) be the signed genome of n genes at the beginning of the evolutionary process. For linear genomes we add two sentinel genes 0 and n + 1 in the front and the end of G_0 that are never moved. For any k ≥ 1, let ρ_1, ρ_2, . . . , ρ_k be k random rearrangements and let G_k = ρ_k ρ_{k-1} · · · ρ_1 G_0 (i.e. G_k is the result of applying these k rearrangements to G_0). Given any linear genome G = (0, g'_1, g'_2, . . . , g'_n, n + 1), where 0 and n + 1 are sentinel genes, we define the functions B_i(G), 0 ≤ i ≤ n, by setting B_i(G) = 0 if genes g'_i and g'_{i+1} are adjacent, and B_i(G) = 1 if not; in other words, B_i(G) = 1 if and only if G has a breakpoint between g'_i and g'_{i+1}. When G is circular there are at most n breakpoints B_i(G), 1 ≤ i ≤ n. We denote the breakpoint distance between two genomes G and G' by BP(G, G'). Let P_{i|k} = Pr(B_i(G_k) = 1); then E[BP(G_0, G_k)] = Σ_{i=0}^{n} P_{i|k} for linear genomes and E[BP(G_0, G_k)] = Σ_{i=1}^{n} P_{i|k} for circular genomes.
Signed Linear Genomes. When the genomes are linear, we no longer have the luxury of placing gene g_1 at some fixed position with positive sign; different breakpoints may have different distributions. We need to solve for the distribution of each breakpoint individually by considering the positions and the signs of both genes involved at the same time. Let G_n^L be the set of all signed linear genomes, and let W_n^L = {(u, v) : u, v ∈ {±1, . . . , ±n}, |u| ≠ |v|}. We define the functions J_i : G_n^L → W_n^L, i = 1, . . . , n - 1, as follows: for any genome G ∈ G_n^L, J_i(G) = (x, y) if g_i is at position |x| having the same sign as x, and g_{i+1} is at position |y| having the same sign as y. Therefore
(x_{i,0})_{(i,i+1)} = 1,
(x_{i,0})_{(u,v)} = 0 for all (u, v) ∈ W_n^L with (u, v) ≠ (i, i + 1),
x_{i,k} = M^k x_{i,0},
P_{i|k} = 1 - e^T x_{i,k} = 1 - e^T M^k x_{i,0}.
Since the two sentinel genes 0 and n + 1 never change their positions and signs, their states are fixed. This means the distribution of the two breakpoints B_0 and B_n depends on the state of one gene each (g_1 and g_n, respectively); we can use the method for circular genomes. Under the GNT model they have the same distribution. Then the expected breakpoint distance after k events is
E[BP(G_0, G_k)] = Σ_{i=0}^{n} P_{i|k} = 2P_{0|k} + Σ_{i=1}^{n-1} P_{i|k} = 2P_{0|k} + Σ_{i=1}^{n-1} (1 - e^T M^k x_{i,0}) = 2P_{0|k} + (n - 1) - Σ_{i=1}^{n-1} e^T M^k x_{i,0}.
1. Inversions. By symmetry of the circular genomes and the model, each inversion has a corresponding inversion that inverts the complementary subsequence (the solid vs. the dotted arc in Figure 1(a)); thus we only need to consider the C(n, 2) inversions that do not invert gene g_1.
2. Transpositions. In Figure 1(b), given the three indices in a transposition, the genome
is divided into three subsequences, and the transposition swaps two subsequences
without changing the signs. Let the three subsequences be A, B, and C, where A
contains gene g1 . A takes the form (A1 , g1 , A2 ), where A1 and A2 may be empty.
In the canonical representation there are only two possible unsigned permutations:
(g1 , A2 , B, C, A1) and (g1 , A2 , C, B, A1). This means we only need to consider
transpositions that swap the two subsequences
not containing g1 .
3. Inverted Transpositions. There are 3·C(n, 3) inverted transpositions. In Figure 1(c),
given the three endpoints in an inverted transposition, exactly one of the three
subsequences changes signs. Using the canonical representation, we interchange
the two subsequences that do not contain g1 and invert one of them (the first two
genomes right of the arrow in Figure 1(c)), or we invert both subsequences without
swapping (the rightmost genome in Figure 1(c)).
For all u, v ∈ W_n^C, let α_n(u, v), β_n(u, v), and γ_n(u, v) be the numbers of inversions, transpositions, and inverted transpositions, respectively, that bring a gene in state u to state v (n is the number of genes in each genome). Then
The following lemma gives formulas for α_n(u, v), β_n(u, v), and γ_n(u, v).
Lemma 1.
(a) α_n(u, v) = min{|u| - 1, |v| - 1, n + 1 - |u|, n + 1 - |v|} if uv < 0;
    α_n(u, v) = 0 if u ≠ v and uv > 0;
    α_n(u, v) = C(|u| - 1, 2) + C(n + 1 - |u|, 2) if u = v.
(b) β_n(u, v) = 0 if uv < 0;
    β_n(u, v) = (min{|u|, |v|} - 1)(n + 1 - max{|u|, |v|}) if u ≠ v and uv > 0;
    β_n(u, v) = C(n + 1 - |u|, 3) + C(|u| - 1, 3) if u = v.
(c) γ_n(u, v) = (n - 2) α_n(u, v) if uv < 0;
    γ_n(u, v) = β_n(u, v) if u ≠ v and uv > 0;
    γ_n(u, v) = 3 β_n(u, v) if u = v.
Here C(m, r) denotes the binomial coefficient "m choose r".
Proof. The proof of (a) is omitted; this result was first shown in [17]. We now prove (b). Consider the gene with state u. Let v be the new state of that gene after the transposition with indices (a, b, c), 2 ≤ a < b < c ≤ n + 1. Since transpositions do not change the
Fig. 1. The three types of rearrangement events in the GNT model on a signed circular
genome. (a) We only need to consider inversions that do not invert g1 . (b) A trans-
position corresponds to swapping two subsequences. (c) The three types of inverted
transpositions. Starting from the left genome, the three distinct results are shown here;
the broken arc represents the subsequence being transposed and inverted.
sign, β_n(u, v) = β_n(-u, -v), and β_n(u, v) = 0 if uv < 0. Therefore we only need to analyze the case where u, v > 0.
We first analyze the case where u = v. Assume that either a ≤ u < b or b ≤ u < c. In the first case, from the definition in Section 1 we immediately have v = u + (c - b), therefore v - u = c - b > 0. In the second case, we have v = u + (a - b), therefore v - u = a - b < 0. Both cases contradict the assumption that u = v, and the only remaining possibilities that make u = v are 2 ≤ u = v < a or c ≤ u = v ≤ n. This leads to the third line in the β_n(u, v) formula. Next, the total number of solutions (a, b, c) for the following two problems is β_n(u, v) when u ≠ v and u, v > 0:
(i) u < v: b = c - (v - u), 2 ≤ a ≤ u < b < c ≤ n + 1, u < v ≤ c.
(ii) u > v: b = a + (u - v), 2 ≤ a < b ≤ u < c ≤ n + 1, a ≤ v < u.
In the first case β_n(u, v) = (u - 1)(n + 1 - v), and in the second case β_n(u, v) = (v - 1)(n + 1 - u). The second line in the β_n(u, v) formula follows by combining the two results.
For inverted transpositions there are three distinct subclasses of rearrangement
events. The result in (c) follows by applying the above method to the three cases.
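The formulas of Lemma 1, as reconstructed above, transcribe directly into code; the sketch below is our own (the names alpha, beta, and gamma follow the notation used above).

from math import comb

def alpha(u, v, n):
    # Lemma 1(a): number of inversions bringing a gene from state u to state v.
    if u * v < 0:
        return min(abs(u) - 1, abs(v) - 1, n + 1 - abs(u), n + 1 - abs(v))
    if u != v:
        return 0
    return comb(abs(u) - 1, 2) + comb(n + 1 - abs(u), 2)

def beta(u, v, n):
    # Lemma 1(b): number of transpositions bringing state u to state v.
    if u * v < 0:
        return 0
    if u != v:
        return (min(abs(u), abs(v)) - 1) * (n + 1 - max(abs(u), abs(v)))
    return comb(n + 1 - abs(u), 3) + comb(abs(u) - 1, 3)

def gamma(u, v, n):
    # Lemma 1(c): number of inverted transpositions bringing state u to state v.
    if u * v < 0:
        return (n - 2) * alpha(u, v, n)
    if u != v:
        return beta(u, v, n)
    return 3 * beta(u, v, n)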
formula that it is reasonable to set r = cn for some constant c larger than 1. We used c = 2.5 for 120 genes in our experiment.
Constructing the transition matrix M for circular genomes takes O(n^2) time by Lemma 1. We believe results similar to Lemma 1 can be obtained for linear genomes, though it is still an open problem. Instead, we use the construction in Section 3.1 for linear genomes. For each rearrangement ρ, constructing the corresponding Y matrix takes O(n^4) time. Since there are O(n^2) inversions and O(n^3) transpositions and inverted transpositions, constructing the transition matrix M takes O(n^7) time. The running time for computing x_k in Exact-IEBP for k = 1, . . . , r is O(rn^2) = O(n^3) for circular genomes and O(rn^4) = O(n^5) for linear genomes by r matrix-vector multiplications. Since the breakpoint distance is always an integer between 0 and n, we can construct the array that converts the breakpoint distance b to the corresponding Exact-IEBP distance k(b) in O(n^2) time. Transforming the breakpoint distance matrix into the Exact-IEBP distance matrix takes O(m^2) additional array lookups.
We summarize the discussion as follows:
Theorem 1. Given a set of m genomes on n genes, we can estimate the pairwise true evolutionary distances in
1. O(m^2 n + n^3) time using Exact-IEBP when the genomes are circular,
2. O(m^2 n + n^7) time using Exact-IEBP when the genomes are linear, and
3. O(m^2 n + min{n, m^2} log n) time using IEBP (see [21]).
4 Experiments
We now show the experimental study of different distance estimators. We compare
the following five distance estimators on circular genomes: (1) BP, the breakpoint dis-
tance between two genomes, (2) INV [2], the minimum number of inversions needed to
transform one genome into another, (3) IEBP [21], an approximation to the Exact-IEBP
method with fast running time, (4) EDE [10], an estimation of the true evolutionary
distance based on the INV distance, and (5) Exact-IEBP, our new method.
Software. We use PAUP* 4.0 [19] to compute the neighbor joining method and the false
negative rates between two trees (which will be defined later). We have implemented
a simulator [10,21] for the GNT model. The input is a rooted leaf-labeled tree and the
associated parameters (i.e. edge lengths, and the relative probabilities of inversions,
transpositions, and inverted transpositions). On each edge, the simulator applies ran-
dom rearrangement events to the circular genome at the ancestral node according to the
model with given parameters α and β. We use tgen [6] to generate random trees. These
trees have topologies drawn from the uniform distribution, and edge lengths drawn from
the discrete uniform distribution on intervals [a, b], where we specify a and b.
Fig. 2. Accuracy of the estimators (see Section 4.1). The number of genes is 120. Each plot is a comparison between some distance measure and the actual number of rearrangements. We show the result for the inversion-only evolutionary model only. The x-axis is divided into 30 bins; the length of the vertical bars indicates the standard deviation. The distance estimators are (a) BP, (b) INV, (c) EDE, (d) IEBP, and (e) Exact-IEBP. The figures (a), (b), (d) are from [21], and figure (c) is from [10].
genome with 120 genes, the typical number of genes in the plant chloroplast genomes
[8]. Starting with the unrearranged genome G0 , we apply k events to it to obtain the
genome Gk , where k = 1, . . . , 300. For each value of k we simulate 500 runs. We then
compute the five distances.
The simulation results under the inversion-only model are shown in Figure 2. Under
the other two model settings, the simulation results show similar behavior (e.g. shape
of curves and standard deviations). Note that both BP and INV distances underesti-
mate the actual number of events, and EDE slightly overestimates the actual number of
events when the number of events is high. The IEBP and Exact-IEBP distances are both
unbiased (the means of the computed distances are equal to the actual number of rearrangement events) and have similar standard deviations. We then compare different
distance estimators by the absolute difference in the measured distances and the actual
number of events. Using the same data in the previous experiment, we generate the plots
as follows. The x-axis is the actual number of events. For each distance estimator D we
(a) Inversions only   (b) Transpositions only   (c) Three types of events equally likely
Fig. 3. Accuracy of the estimators by absolute difference (See Section 4.1 for the de-
tails.). We simulate the evolution on 120 genes. The curves of BP, INV, IEBP, and EDE
are published previously in [10]; they are included for comparative purposes.
plot the curve f_D, where f_D(x) is the mean of the set {|(1/c)·D(G_0, G_k) - k| : 1 ≤ k ≤ x} over all observations G_k.¹
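The curve f_D can be computed directly from the recorded observations; the short Python sketch below is our own rendering of this definition (accuracy_curve is a hypothetical name).

def accuracy_curve(observations, c, xs):
    # observations: list of (k, D(G0, Gk)) pairs; returns f_D(x) for each x in xs,
    # i.e. the mean of |D/c - k| over all observations with k <= x.
    out = []
    for x in xs:
        errs = [abs(D / c - k) for k, D in observations if k <= x]
        out.append(sum(errs) / len(errs) if errs else 0.0)
    return out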
The result is in Figure 3. The relative performance is the same for most cases: BP
is the worst, followed by INV, IEBP, and EDE. Exact-IEBP has the best performance
except for inversion-only scenarios, where EDE is slightly better only when the num-
ber of events is small. In most cases, IEBP has similar behavior as Exact-IEBP when
the amount of evolution is small; the IEBP and Exact-IEBP curves are almost indistin-
guishable in (a). Yet, in (b) and (c) the IEBP curve is inferior to the Exact-IEBP curve
by a large margin when the number of events is above about 200.
In this section we explore the accuracy of the neighbor joining tree under different ways
of calculating genomic distances. See Table 1 for the settings for the experiment.
Given an inferred tree, we compare its topological accuracy by computing false
negatives with respect to the true tree [9,5]. We begin by defining the true tree.
During the evolutionary process, some edges of the model tree may have no changes
(i.e. evolutionary events) on them. Since reconstructing such edges is at best guesswork,
we are not interested in these edges. Hence, we define the true tree to be the result of
contracting those edges in the model tree on which there are no changes.
We now define how we score an inferred tree, by comparison to the true tree. For
every tree there is a natural association between every edge and the bipartition on the
leaf set induced by deleting the edge from the tree. Let T be the true tree and let T 0
¹ The constant c is to reduce the bias effect in different distances. For the IEBP and the Exact-IEBP distances c = 1, since they estimate the actual number of events. For the BP distance we let c = 2(1 - α - β) + 3(α + β) = 2 + α + β, since this is the expected number of breakpoints created by each event in the model when the number of events is very low. Similarly, for the INV and EDE distances we let c = (1 - α - β) + 3α + 2β = 1 + 2α + β, since each transposition can be replaced by 3 inversions, and each inverted transposition can be replaced by 2 inversions.
Table 1. Settings for the experiment.
1. Number of genes: 120
2. Number of leaves: 10, 20, 40, 80, and 160
3. Expected number of rearrangements on each edge: discrete uniform within the intervals [1,3], [1,5], [1,10], [3,5], [3,10], and [5,10]
4. Probability settings (α, β): (0, 0) (inversion only); (1, 0) (transposition only); (1/3, 1/3) (the three rearrangement classes are equally likely)
5. Datasets for each setting: 100
For each set of genomes, we compute the five distances. We then compute NJ trees
on each of the five distance matrices, and compare the resultant trees to the true tree.
The results of this experiment are in Figure 4. The x-axis is the maximum normalized
inversion distance (as computed by the linear time algorithm for minimum inversion
distances given in [2]) between any two genomes in the input. Distance matrices with
some normalized edit distances close to 1 are said to be saturated, and the recovery of
accurate trees from such datasets is considered to be very difficult [7]. The y-axis is the
false negative rate (i.e. the proportion of missing edges). False negative rates of less than
5% are excellent, but false negative rates of up to 10% can be tolerated. We use NJ(D)
to denote the tree returned by NJ using distance D. Note that except for NJ(EDE), the
relative orders of the NJ tree using different distances are very consistent with the orders
of the accuracy of the distances in the absolute difference plot. NJ(BP) has the worst
accuracy in all settings. NJ(INV) outperforms NJ(Exact-IEBP) and NJ(IEBP) by a small
margin only when the amount of evolution is low; in the transposition-only scenario
the accuracy of NJ(INV) degrades considerably. NJ(Exact-IEBP) has slightly better
accuracy than NJ(IEBP) until the amount of evolution is high; after that the accuracy
of NJ(IEBP) degrades, and NJ(Exact-IEBP) outperforms NJ(IEBP) by a larger margin.
Despite the inferior accuracy of EDE in the experiments in Section 4.1, NJ using EDE
(a) Inversions only   (b) Transpositions only   (c) Three types of events equally likely
Fig. 4. Neighbor Joining Performance under Several Distances (See Section 4.2). See
Table 1 for the settings in the experiment. For comparative purposes, we include curves
of NJ(BP), NJ(INV), NJ(IEBP) from [21], and the curve of NJ(EDE) from [10].
returns the most accurate tree on average² (especially in the inversion-only model), but
the accuracy of the NJ tree using Exact-IEBP is comparable in most cases.
In this section we demonstrate the robustness of the Exact-IEBP estimator when the model parameters are unknown. The settings are the same as in Table 1. The experiment is similar to the previous experiment, except here we use both the correct and the incorrect values of (α, β) for the Exact-IEBP distance. The results are in Figure 5. These results suggest that NJ(Exact-IEBP) is robust against errors in (α, β).
5 Conclusions
We have introduced Exact-IEBP, a new technique for estimating true evolutionary dis-
tances between whole genomes. This technique can be applied to signed circular and
linear genomes with arbitrary relative probabilities between the three types of events
in the GNT model. Our simulation study shows that the Exact-IEBP method improves
upon the previous technique, IEBP [21], both in the distance estimation and the accu-
racy of the inferred tree when used in neighbor joining. The accuracy of the NJ trees
using the new method is comparable with the best estimator so far, the EDE estimator
[10]. These different methods are simple yet powerful and can be generalized easily to
different models.
² We do not have a good explanation for the superior accuracy of NJ(EDE); the behavior of NJ is still not well understood.
[Fig. 5: false negative rates of NJ(Exact-IEBP(α, β)) for (α, β) = (0, 0), (1, 0), and (1/3, 1/3), plotted against the normalized maximum pairwise inversion distance. Panels: (a) inversions only, (b) transpositions only, (c) three types of events equally likely.]
Acknowledgments
I would like to thank Tandy Warnow for her advice and guidance on this paper and
Robert Jansen at the University of Texas at Austin for introducing the problem of
genome rearrangement phylogeny to us. I am grateful to the three anonymous refer-
ees for their helpful comments.
References
11. B.M.E. Moret, S.K. Wyman, D.A. Bader, T. Warnow, and M. Yan. A new implementation and detailed study of breakpoint analysis. In Proc. 6th Pacific Symp. Biocomputing (PSB 2001), pages 583-594, 2001.
12. J.H. Nadeau and B.A. Taylor. Lengths of chromosome segments conserved since divergence of man and mouse. Proc. Natl Acad. Sci. USA, 81:814-818, 1984.
13. R.G. Olmstead and J.D. Palmer. Chloroplast DNA systematics: a review of methods and data analysis. Amer. J. Bot., 81:1205-1224, 1994.
14. J.D. Palmer. Chloroplast and mitochondrial genome evolution in land plants. In R. Herrmann, editor, Cell Organelles, pages 99-133. Wein, 1992.
15. L.A. Raubeson and R.K. Jansen. Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science, 255:1697-1699, 1992.
16. N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. & Evol., 4:406-425, 1987.
17. D. Sankoff and M. Blanchette. Probability models for genome rearrangements and linear invariants for phylogenetic inference. Proc. 3rd Int'l Conf. on Comput. Mol. Bio. (RECOMB99), pages 302-309, 1999.
18. D. Sankoff and J.H. Nadeau, editors. Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families. Kluwer Academic Publishers, 2000.
19. D. Swofford. PAUP* 4.0. Sinauer Associates Inc, 2001.
20. D. Swofford, G. Olson, P. Waddell, and D. Hillis. Phylogenetic inference. In D. Hillis, C. Moritz, and B. Mable, editors, Molecular Systematics, chapter 11. Sinauer Associates Inc, 2nd edition, 1996.
21. L.-S. Wang and T. Warnow. Estimating true evolutionary distances between genomes. In Proc. 33rd Annual ACM Symp. on Theory of Comp. (STOC 2001). ACM Press, 2001. To appear.
Finding an Optimal Inversion Median:
Experimental Results
1 Introduction
Dobzhansky and Sturtevant [7] first proposed using the degree to which gene orders
differ between species as an indicator of evolutionary distance that could be useful for
phylogenetic inference, and Watterson et al. [23] first proposed the minimum number
of chromosomal inversions necessary to transform one ordering into another as an ap-
propriate distance metric. The 1992 study by Sankoff et al. [21] included a heuristic
algorithm for finding rearrangement distance (which considered transpositions, inser-
tions, and deletions, as well as inversions); it was the first large-scale application and
experimental validation of rearrangement-based techniques for phylogenetic purposes
and initiated what is now nearly a decade of intense interest in computational problems
relating to genome rearrangement (see summaries in [16,19,22]).
While much of the attention given to rearrangement problems may be due to their in-
triguing combinatorial properties, rearrangement-based approaches to phylogenetic in-
ference are of genuine biological interest in cases in which sequence-based approaches
perform poorly, such as when species diverged early or are rapidly evolving [16]. In ad-
dition, rearrangement-based phylogenetic methods can suggest probable gene orderings
of ancestral species [17,18], while other methods cannot. Furthermore, mathematical
models of genome rearrangement have applications beyond phylogeny (see [8,20]).
We consider the case where all genomes have identical sets of n genes and inversion is the single mechanism of rearrangement. We represent each genome G_i as a permutation π_i of size n, and we let all pairs of genomes G_i = (g_{i,1} . . . g_{i,n}) and G_j = (g_{j,1} . . . g_{j,n}), in a set of genomes G, be represented by π_i = (π_{i,1} . . . π_{i,n}) and π_j = (π_{j,1} . . . π_{j,n}) such that π_{i,k} = π_{j,l} iff g_{i,k} = g_{j,l}, and π_{i,k} = -1 · π_{j,l} iff g_{i,k} is the reverse complement of g_{j,l}.
We define an inversion acting on a permutation π from i to j, for i ≤ j, as the operation that transforms π into π' = (π_1, π_2, . . . , π_{i-1}, -π_j, -π_{j-1}, . . . , -π_i, π_{j+1}, . . . , π_n). The minimal number of inversions required to change one permutation π_i into another permutation π_j is the inversion distance, which we denote by d(π_i, π_j) (sometimes abbreviated as d_{i,j}).
Let the inversion median M of a set of N permutations Π = {π_1, π_2, . . . , π_N} be the signed permutation that minimizes the sum S(M, Π) = Σ_{i=1}^{N} d(M, π_i). Let this sum S(M, Π) = S(Π) be called the median score of M with respect to Π.
For a given number of genes n, we can construct an undirected graph G_n = (V, E) such that each vertex in V corresponds to a signed permutation of size n and two vertices are connected by an edge if and only if one of the corresponding permutations can be obtained from the other through a single inversion; formally, E = {{v_i, v_j} | v_i, v_j ∈ V and d(π_i, π_j) = 1}. We will call G_n the inversion graph of size n. In this graph, the distance between any two vertices, v_i and v_j, is the same as the inversion distance between the corresponding permutations, π_i and π_j. Furthermore, finding the median of a set of permutations is equivalent to finding the minimum unweighted Steiner tree of the corresponding vertices in G_n. Note that G_n is very large (|V| = n! · 2^n), so this representation does not immediately suggest a feasible graph-search algorithm, even for small n.
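To make the definitions concrete, the following Python sketch (ours; all names hypothetical) finds an inversion median by exhaustive search over the inversion graph. It is feasible only for toy sizes such as n = 3, which illustrates the point just made.

from itertools import permutations, product
from collections import deque

def neighbors(p):
    # All signed permutations one inversion away from p.
    out = []
    for i in range(len(p)):
        for j in range(i, len(p)):
            out.append(p[:i] + tuple(-x for x in reversed(p[i:j + 1])) + p[j + 1:])
    return out

def all_signed_perms(n):
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(s * x for s, x in zip(signs, perm))

def distances_from(src):
    # Breadth-first search: inversion distance from src to every signed permutation.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        p = queue.popleft()
        for q in neighbors(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist

def brute_force_median(perms):
    dists = [distances_from(p) for p in perms]
    best = min(all_signed_perms(len(perms[0])), key=lambda m: sum(d[m] for d in dists))
    return best, sum(d[best] for d in dists)

print(brute_force_median([(1, 2, 3), (-2, -1, 3), (3, 1, 2)]))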
Definition 1. A shortest path between two permutations of size n, π_1 and π_2, is a connected subgraph of the inversion graph G_n containing only the vertices v_1 and v_2 corresponding to π_1 and π_2, and the vertices and edges on a single shortest path between v_1 and v_2.
genomes is of particular interest. In this section we develop a general bound for the
median-of-three problem, one that relies only on the metric property of the distance
measure used.
Proof. The upper bound follows directly from the possibility of a trivial median, and the lower bound from properties of metric spaces (a median of lower score would necessarily violate the triangle inequality with respect to two of π_1, π_2, and π_3; see Figure 1).
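The two bounds used in this proof depend only on the pairwise distances; the sketch below records them in Python (our own illustration of the metric argument, not the paper's lemma statement, which is not shown here).

from math import ceil

def median_score_bounds(d12, d13, d23):
    # Upper bound: the best trivial median (one of the three input permutations).
    # Lower bound: summing the triangle inequalities gives 2*S(M) >= d12 + d13 + d23.
    upper = min(d12 + d13, d12 + d23, d13 + d23)
    lower = ceil((d12 + d13 + d23) / 2)
    return lower, upper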
Proof. Assume to the contrary that π_1, π_2, and π_3 have a trivial median path and have a median M that is not trivial. By Definition 4, M must be on a shortest path between two of π_1, π_2, and π_3. Without loss of generality, assume that the median path runs from π_1 to M to π_2 to π_3. Let d_{1,2}, d_{1,3}, and d_{2,3} be the pairwise distances between {π_1, π_2}, {π_1, π_3}, and {π_2, π_3}, respectively, and let d_{M,2} > 0 be the distance of M from π_2. Then the median score of M is (d_{1,2} - d_{M,2}) + d_{M,2} + (d_{M,2} + d_{2,3}) = d_{1,2} + d_{M,2} + d_{2,3}. But this score is greater by d_{M,2} than the score of a trivial median at π_2, so M cannot be a median.
Fig. 2. A median path including v′ can be constructed using a shortest path from v_1 to v′ and any median path of v′, v_2, and v_3.
4 The Algorithm
Suppose instead that the algorithm exited before reaching a median. The algorithm
can exit for one of three reasons:
1. The priority stack s becomes empty (step 6);
2. The next item returned from s has a best possible score greater than or equal to the
current global upper bound (step 7);
3. A vertex w is found with a worst possible score equal to the global lower bound
(step 12);
Case 1 can occur only if all vertices have been visited, or if all remaining neighbors
have been pruned (because except when the algorithm stops for another reason, each
new neighbor is either pruned or pushed onto s). If all vertices have been visited, then
a median must have been visited. We have shown above that all neighbors on paths to
a median cannot have been pruned. Because s always returns a vertex v such that no
other vertex in s has a lower best-possible score than v, and because all neighbors that
are not pruned are added to s, case 2 can only occur if a median has been visited or if
all paths to medians have been pruned. We have shown that all paths to medians cannot
have been pruned. Therefore, if case 2 occurs, a median must have been visited. In case
3, w must be a median, since the global lower bound is set directly according to Lemma
1 (step 1), which we have shown to be correct.
Thus, none of these three cases can arise before a median has been found, and
the algorithm must return a median. The worst-case running time of the algorithm is O(n^{3d}), with d = min{d_{1,2}, d_{2,3}, d_{1,3}}, but as would be expected with a branch-and-bound algorithm, the average running time appears to be much better.
5 Experimental Method
We implemented find inversion median in C, reusing the linear-time distance
routine (as well as some auxiliary code) from GRAPPA [1], and we evaluated its perfor-
mance on simulated data. All test data was generated by a simple program that creates
multiple sets of three permutations by applying random inversions to the identity per-
mutation, such that each set of three permutations represents three taxa derived from a
common ancestor under an inversions-only model of evolution. In addition to the num-
ber of genes n to model and the number of sets s to create, this program accepts a
parameter i that determines how many random inversions to apply in obtaining the per-
mutation for each taxon. Thus, if n = 100, i = 10, and s = 10, the program generates
10 sets of 3 signed permutations, each of size 100, and obtains each permutation by ap-
plying 10 random inversions to the permutation +1, +2, . . . , +100. A random inversion
is defined as an inversion between two random positions i and j such that 1 ≤ i, j ≤ n
(if i = j, a single gene simply changes its sign). When i is small compared to n, each
permutation in a set tends to be a distance of 2i from each other.
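A minimal Python version of this generator might look as follows (our own sketch; the function names are hypothetical).

import random

def random_inversion(perm):
    # Reverse, and negate, the block between two random positions (a single
    # position simply flips the sign of one gene).
    i, j = sorted((random.randrange(len(perm)), random.randrange(len(perm))))
    perm[i:j + 1] = [-x for x in reversed(perm[i:j + 1])]

def make_dataset(n, i, s, seed=0):
    # s sets of three signed permutations of size n, each obtained by applying
    # i random inversions to the identity permutation.
    random.seed(seed)
    sets = []
    for _ in range(s):
        triple = []
        for _ in range(3):
            perm = list(range(1, n + 1))
            for _ in range(i):
                random_inversion(perm)
            triple.append(perm)
        sets.append(triple)
    return sets

data = make_dataset(n=100, i=10, s=10)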
We used several algorithmic engineering techniques to improve the efficiency of
find inversion median. For example, we avoided dynamic memory allocation
and reused records representing graph vertices. We were able to gain a significant
speedup by optimizing the hash table used for marking vertices: a custom hash table of-
fered a fourfold increase in the overall speed of the program, as compared with UNIX's
6 Experimental Results
Being especially concerned with the effectiveness of the pruning strategy, we have cho-
sen as a measure of performance the number of vertices V of the inversion graph that
the algorithm visited. In particular, we have taken V to be the number of times the
program executed the loop at step 8 of the algorithm. Note that the number of calls to
distance is approximately 3V . We recorded the distribution of V over many exper-
iments, in which we used various values for the number of genes n and the number of
inversions per tree edge i. Figure 3 is typical of our results. It summarizes 500
Fig. 3. Distribution of the number of vertices visited in the course of 500 experiments
with n = 50 and i = 7.
experiments with n = 50 and i = 7 and shows a roughly exponential distribution, with high
relative frequencies in a few intervals having small V : in 87% of the experiments, fewer
than 10,000 vertices were visited, and in 95%, fewer than 20,000 were visited. This fig-
ure demonstrates that the algorithm generally finds a median rapidly, but occasionally
becomes mired in an unprofitable region of the search space. We have observed that
the tail of the exponential distribution becomes more substantial as i grows larger with
respect to n.
In order to characterize typical performance, we recorded the statistical medians
of V as n and i varied independently. The results are shown in Figures 4 and 5. For
[Fig. 4: statistical mean and median of V as a function of n, for i = 5.]
comparison, we have also plotted the mean values of V. Note that, at least for i = 5, the median and mean of V appear to grow quadratically over a significant range of values for n; a simple fit yields f(n) = 2.1n^2 for the median values. Note also that, for n = 50, the median of V grows approximately linearly with i, at least as long as i remains small (mean V grows somewhat faster than median V). To put the observed rate
of growth into perspective, note that in the theoretical worst case of O(n^{3d}), because d ≈ 2i and V = O(n^{3d}/n) = O(n^{6i-1}), one would see (given i = 5 and n = 50) growth of V on the order of n^{29} = 50^{29}.
[Fig. 5: statistical mean and median of V as a function of i, for n = 50.]
[Fig. 6: curves for a Sun E10000 with p = 1, 2, 4, and 6, and for a Pentium III.]
Fig. 6. Sequential and parallel running times for i = 5 and n ∈ {50, 75, 100, 125}. Each
data point represents an average taken over 10 experiments. Parallel configurations used
parallelism only in the minor loop of the algorithm.
a median is about 12 seconds or less. Observe that for n = 100 (a realistic size for
chloroplast or mitochondrial genomes) medians can generally be found in an average
of about 2 seconds using a reasonably fast computer. We should note that the memory
requirements for the program are considerable, and that the level of performance shown
here is partly a consequence of the large amount of RAM available on the Sun.
It is evident from Figure 6 that we achieve a good parallel speedup for small p, but
that the benefits of parallelization begin to erode between p = 4 and p = 6 (this ten-
dency becomes more pronounced at p = 8, which we have not plotted here for clarity of
presentation). Anecdotal evidence suggests that the cause of this trend is a combination
of the overhead of synchronization and uneven load balancing among the computing
threads. We also observed that parallelism in the minor loop of the algorithm was far
more effective than parallelism in the major loop, presumably because the heuristic for
prioritization is sufficiently effective that the latter strategy results in a large amount of
unnecessary work.
Fig. 7. Comparison of inversion medians with breakpoint medians, trivial medians, and
actual medians, for n = 25. Averages were taken over 50 experiments.
inversion medians achieve comparable scores to actual medians¹ and that breakpoint
medians, when scored in terms of inversion distance, perform significantly worse. A
comparison in terms of inversion median scores is clearly biased in favor of inversion
medians; however, if it is true that inversion distances are (in at least some cases) more
meaningful than breakpoint distances, then these results suggest that inversion medians
are worth obtaining.
We used a slight modification of the program find inversion median to find
all optimal medians and thus to characterize the extent to which inversion medians are
unique. An example of our results is shown in Figure 8, which describes the number
¹ Inversion medians are slightly better than actual medians when i becomes large with respect to n, because saturation begins to cause convergence between taxa.
7 Future Work
The strength and weakness of the current algorithm both lie in its generality. On the one
hand, our approach depends only on elementary properties of metric spaces and thus
² Recall that the distance between permutations is approximately 2i and that random permutations tend to be separated by a distance of approximately n. The effects of saturation are evident at i = 0.2n and are pronounced at i = 0.3n.
extends easily to the case of equally weighted inversions, translocations, fissions, and
fusions; furthermore, it could also be used with weighted rearrangement distances. (One
should note, however, that the running time is a direct function of the cost of evaluating
distances; we can compute exact breakpoint and inversion distances, but no efficient
algorithm is yet known for more complex distance computations.) On the other hand,
our approach does not exploit the unique structure of the inversion problem; as shown
elsewhere in this volume by A. Caprara, restricting the algorithm to inversion distances
only and using aspects of the Hannenhalli-Pevzner theory enables the derivation of
tighter bounds and thus also the solution of larger instances of the inversion median
problem.
Many simple changes to our current implementation will considerably reduce the
running time. For example, the current implementation does not condense genomes
before processing them, i.e., it does not convert subsequences of genes shared among
all three genomes to single supergenes. Preliminary experiments indicate that con-
densing genomes yields very significant improvements in performance when i is small
relative to n. Distance computations themselves, while already fast, can be further im-
proved by reusing previous computations, since a move by the algorithm makes only
minimal changes to the candidate permutation. Finally, we can use the Kaplan-Shamir-
Tarjan algorithm, in combination with metric properties, to prepare better initial solu-
tions (by walking halfway through shortest paths between chosen permutations), thus
considerably decreasing the search space to be explored.
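As an illustration of the condensing step, the following sketch (not the authors' code) merges maximal runs of genes whose signed adjacencies are shared by all three genomes into supergenes; it assumes linearized signed gene orders over the same gene set and ignores the circular wrap-around for simplicity.

# Condense subsequences shared by all genomes into single supergenes (sketch).
def shared_adjacencies(genomes):
    sets = []
    for g in genomes:
        adj = set()
        for x, y in zip(g, g[1:]):
            adj.add((x, y))
            adj.add((-y, -x))        # the same adjacency read in the other direction
        sets.append(adj)
    common = sets[0]
    for s in sets[1:]:
        common &= s
    return common

def condense(genomes):
    common = shared_adjacencies(genomes)
    # Build maximal blocks by walking the first genome.
    g0, blocks, i = genomes[0], [], 0
    while i < len(g0):
        j = i
        while j + 1 < len(g0) and (g0[j], g0[j + 1]) in common:
            j += 1
        blocks.append(tuple(g0[i:j + 1]))
        i = j + 1
    # One supergene label per block; a negative label encodes reversed orientation.
    label = {}
    for k, b in enumerate(blocks, start=1):
        label[b] = k
        label[tuple(-x for x in reversed(b))] = -k
    # Rewrite every genome over the supergene alphabet.
    condensed = []
    for g in genomes:
        out, i = [], 0
        while i < len(g):
            for b, lab in label.items():
                if tuple(g[i:i + len(b)]) == b:
                    out.append(lab)
                    i += len(b)
                    break
            else:
                out.append(g[i]); i += 1   # fallback; should not occur for shared gene sets
        condensed.append(out)
    return condensed, blocks

g1 = [1, 2, 3, -5, -4, 6]
g2 = [1, 2, 3, 4, 5, 6]
g3 = [4, 5, -3, -2, -1, 6]
print(condense([g1, g2, g3]))   # genes 1,2,3 and 4,5 collapse into two supergenes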
1 Introduction
Maximum likelihood (Felsenstein, 1981) is increasingly used as an optimality criterion
for selecting evolutionary trees, but finding the global optimum is difficult computa-
tionally, even on a single tree. Because no general analytical solution is available, it is
necessary to use numeric techniques, such as hill climbing or expectation maximiza-
tion (EM), in order to find optimal values. Two recent developments are relevant when
considering analytical solutions for simple substitution models with a small number of
taxa. Yang (2000) has reported an analytical solution for three taxa with two state char-
acters under a molecular clock. Thus in this special case the tree and the edge lengths
that yield maximum likelihood values can now be expressed analytically, allowing the
most likely tree to be positively identified. Yang calls this case the simplest phylogeny
estimation problem.
A second development is in Chor et al. (2000), who used the Hadamard conjugation
for unrooted trees on four taxa, again with two state characters. As part of that study
analytic solutions were found for some families of observed data. It was reported that
multiple optima on a single tree occurred more frequently with maximum likelihood
than had been expected. In one case, the best tree had a local (non-global) optimum that was less likely than the optimum value on a different, inferior tree. In such a case, a hill-climbing heuristic could misidentify the optimal tree. Such examples reinforce the desirability of analytical solutions that are guaranteed to find the global optimum for any tree.
Even though the three-taxon, two-state-character model under a molecular clock is the simplest phylogeny estimation problem, it is still a potentially important case to solve analytically. It can allow a rooted-triplet method for inferring larger rooted trees by building them up from the triplets. This would be analogous to the use of unrooted quartets for building up unrooted trees. Tree-from-quartets methods are already used extensively in various studies (Bandelt and Dress 1986, Strimmer and von Haeseler 1996, Wilson 1998, Ben-Dor et al. 1998, Erdos et al. 1999). The fact that general analytical solutions are not yet available for unrooted quartets only emphasizes the importance of analytical solutions to the rooted-triplets case.
In this work we provide analytic solutions for three-taxon ML MC trees under any distribution of variable rates across sites (provided the moment generating function of the distribution is strictly increasing over the negative real numbers). This class of distributions includes, as a special case, identical rates across sites. It also includes the Gamma, the uniform, and the inverse Gaussian distributions. Therefore, our work generalizes Yang's solution for identical rates across sites. In addition, our derivation of the analytic solution is substantially simpler. We employ the Hadamard conjugation (Hendy and Penny 1993; Hendy, Penny, and Steel 1994) and the convexity of an entropy-like function.
The remainder of this paper is organized as follows: In Section 2 we explain the Hadamard conjugation and its relation to maximum likelihood. In Section 3 we state and prove our main technical theorem. Section 4 applies the theorem to solve ML MC analytically on three-species trees. Finally, Section 5 presents some implications of this work and directions for further research.
The Hadamard conjugation (Hendy and Penny 1993, Hendy, Penny, and Steel 1994) is
an invertible transformation linking the probabilities of site substitutions on edges of
an evolutionary tree T to the probabilities of obtaining each possible combination of
characters. It is applicable to a number of simple models of site substitution: the Neyman 2-state model (Neyman 1971), the Jukes-Cantor model (Jukes and Cantor 1969), and the Kimura 2ST and 3ST models (Kimura 1983). For these models, the transformation yields a powerful tool which greatly simplifies and unifies the analysis of phylogenetic data. In this section we explain the Hadamard conjugation and its relationship to ML.
We now introduce a notation that we will use for labeling the edges of unrooted
binary trees. (For simplicity we use four taxa, but the definitions extend to any n.)
Suppose the four species, 1, 2, 3 and 4, are represented by the leaves of the tree T .
A split of the species is any partition of {1, 2, 3, 4} into two disjoint subsets. We will
identify each split by the subset which does not contain 4 (in general n), so that for
example the split {{1, 2}, {3, 4}} is identified by the subset {1, 2}. Each edge e of T
induces a split of the taxa, namely the two sets of leaves on the two components of T
resulting from the deletion of e. Hence the central edge of the tree T = (12)(34) in the
brackets notation induces the split identified by the subset {1, 2}. For brevity we will
label this edge by e 12 as a shorthand for e {1,2} . Thus E(T ) = {e1 , e2 , e12 , e3 , e123 }
(see Figure 1).
[Fig. 1: the tree T = (12)(34); pendant edges e_1, e_2, e_3, e_123 lead to leaves 1, 2, 3, 4, and e_12 is the central edge.]
We use a similar indexing scheme for splits at a site in the sequences: for a subset α ⊆ {1, . . . , n−1}, we say that a given site i is an α-split pattern if α is the set of sequences whose character state at position i differs from the i-th position in the n-th sequence. Given a tree T with n leaves and edge lengths q = [q_e]_{e∈E(T)} (0 ≤ q_e < ∞, where q_e is the expected number of substitutions per site across the edge e), the expected probability (averaged over all sites) of generating an α-split pattern (α ⊆ {1, . . . , n−1}) is well defined (this probability may vary across sites, depending on the distribution of rates). Denote this expected probability by s_α = Pr(α-split | T, q). We define the expected sequence spectrum s = [s_α]_{α⊆{1,...,n−1}}. Having this spectrum at hand greatly facilitates the calculation and analysis of the likelihood, since the likelihood of observing a sequence with splits described by the vector ŝ, given the sequence spectrum s, equals

    L(ŝ | s) = ∏_{α ⊆ {1,...,n−1}: ŝ_α > 0} Pr(α-split | s)^{ŝ_α} = ∏_{α: ŝ_α > 0} s_α^{ŝ_α} .
3 Technical Results
Under a molecular clock, a tree on n taxa has at least two sister taxa i and j whose pendant edges q_i and q_j are of equal length (q_i = q_j). Our first result states that if q_i = q_j, then the corresponding split probabilities are equal as well (s_i = s_j). Knowing that a pair of these variables attains the same value simplifies the analysis of the maximum likelihood tree in general, and in particular makes it possible to solve the case of n = 3 taxa. Furthermore, if q_i > q_j and the moment generating function M is strictly increasing in the range (−∞, 0], then the corresponding split probabilities satisfy s_i > s_j as well.
    s = H^{-1} M(Hq),
then:
    q_i = q_j  ⇒  s_i = s_j ;
and if the function M is strictly monotonic ascending in the range (−∞, 0] then:
    q_i > q_j  ⇒  s_i > s_j .
Proof. Let X = {1, 2, . . . , n} be the taxa set with reference element n, and let X' = X \ {n}. Without loss of generality i, j ≠ n. For α ⊆ X', let α' = α Δ {i, j} (where α Δ β = (α ∪ β) \ (α ∩ β) is the symmetric difference of α and β). The mapping α ↦ α' is a bijection between
    X'_i = {α ⊆ X' | i ∈ α, j ∉ α}
and
    X'_j = {α ⊆ X' | i ∉ α, j ∈ α}.
Note that the two sets X'_i and X'_j are disjoint. Writing h_{α,i} for h_{α,{i}}, we have, for α ∈ X'_i, h_{α,i} = −1, h_{α,j} = 1, h_{α',i} = 1, h_{α',j} = −1.
Thus the only contributions to the difference may come from the two edges pendant upon i and j.
We remark that the moment generating functions M in the four examples of Section 2 (equal rates across sites, the uniform distribution with parameter b, 0 < b ≤ 1, the Gamma distribution with parameter k, 0 < k, and the inverse Gaussian distribution with parameter d, 0 < d) are all strictly increasing in the range (−∞, 0].
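As a numerical illustration (ours, not from this paper) of the conjugation s = H^{-1} M(Hq) for the Neyman 2-state model, the sketch below builds the Sylvester-Hadamard matrix and applies an elementwise moment generating function; it assumes the usual Hendy-Penny convention that q is padded out to a vector indexed by all splits, with the entry for the empty split set to minus the sum of the edge lengths.

import numpy as np

def hadamard(m):
    # Sylvester construction: H[a, b] = (-1)^{|a & b|} for subset indices a, b.
    H = np.array([[1]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def expected_spectrum(n, edge_splits, lengths, M=np.exp):
    # Expected sequence spectrum s = H^{-1} M(Hq).
    # edge_splits: for each edge, the subset of {1,...,n-1} identifying its split.
    # lengths:     the corresponding expected numbers of substitutions q_e.
    # M:           elementwise moment generating function (np.exp = equal rates).
    N = 2 ** (n - 1)
    q = np.zeros(N)
    for split, qe in zip(edge_splits, lengths):
        q[sum(1 << (i - 1) for i in split)] = qe
    q[0] = -sum(lengths)                 # assumed convention for the empty split
    H = hadamard(n - 1)
    return np.linalg.solve(H, M(H @ q))

# Three taxa under a clock: pendant edges to taxa 1 and 2 have equal length,
# so the split probabilities s_1 and s_2 coincide, as the result above states.
s = expected_spectrum(3, [{1}, {2}, {1, 2}], [0.1, 0.1, 0.3])
print(s, abs(s[1] - s[2]) < 1e-12)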
We first note that for three taxa, the problem of finding analytically the ML trees without
the constraint of a molecular clock is trivial. This is a special case of unconstrained
likelihood for the multinomial distribution. On the other hand, adding a molecular clock
makes the problem interesting even for n = 3 taxa, which is the case we treat in this
section.
For n = 3, let s0 be the probability of observing the constant site pattern (xxx or
yyy). Let s1 be the probability of observing the site pattern which splits 1 from 2 and
3 (xyy or yxx). Similarly, let s 2 be the probability of observing the site pattern which
splits 2 from 1 and 3 (yxy or xyx), and let s 3 be the probability of observing the site
pattern which splits 3 from 1 and 2 (xxy or yyx).
Consider unrooted trees on the taxa set X = {1, 2, 3} that have two edges of the
same length. Let T1 denote the family of such trees with edges 2 and 3 of the same
length (q2 = q3 ), T2 denote the family of such trees with edges 1 and 3 of the same
length (q1 = q3 ), and T3 denote the family of such trees with edges 2 and 1 of the same
length (q2 = q1). Finally, let T_0 denote the family of trees with q1 = q2 = q3. We first
see how to determine the ML tree for each family.
[Figure: the three trees corresponding to the families T_1, T_2, and T_3 on the taxa {1, 2, 3}.]
Given an observed sequence of m sites, let m 0 be the number of sites where all three
nucleotides are equal, and let m i (i = 1, 2, 3) be the number of sites where the character
in sequence i differs from the state of the other sequences. Then m = m 0 + m1 + m2 +
m3 , and fi = mi /m is the frequency of sites with the corresponding character state
pattern.
Theorem 2. Let (m_0, m_1, m_2, m_3) be the observed data. The ML tree in each family is obtained at the following point:
For the family T_0, the likelihood is maximized at T̂_0 with s_0 = f_0, s_1 = s_2 = s_3 = (1 − f_0)/3.
For the family T_1, the likelihood is maximized at T̂_1 with s_0 = f_0, s_1 = f_1, s_2 = s_3 = (f_2 + f_3)/2.
For the family T_2, the likelihood is maximized at T̂_2 with s_0 = f_0, s_2 = f_2, s_1 = s_3 = (f_1 + f_3)/2.
For the family T_3, the likelihood is maximized at T̂_3 with s_0 = f_0, s_3 = f_3, s_1 = s_2 = (f_1 + f_2)/2.
Consider, without loss of generality, the case of the T_1 family. We are interested in maximizing the likelihood under the constraint q_2 − q_3 = 0. By Theorem 3.1, this implies s_2 − s_3 = 0. Therefore, using Lagrange multipliers, a maximum point of the likelihood must satisfy

    ∂(log-likelihood)/∂s_i = λ ∂(s_2 − s_3)/∂s_i   (i = 1, 2, 3),

implying

    f_1/s_1 = f_0/s_0 ,
    f_2/s_2 = f_0/s_0 + λ ,
    f_3/s_3 = f_0/s_0 − λ .

Denote d = f_0/s_0; then by adding the last two equations and substituting s_3 = s_2 we have f_2 + f_3 = 2 d s_2. Adding the right-hand sides and left-hand sides of this equality to those of f_1 = d s_1 and f_0 = d s_0, we get

    f_0 + f_1 + f_2 + f_3 = d(s_0 + s_1 + 2 s_2) .

Since both f_0 + f_1 + f_2 + f_3 = 1 and s_0 + s_1 + 2 s_2 = 1, this gives d = 1, and hence

    s_0 = f_0 ,  s_1 = f_1 ,  s_2 = s_3 = (f_2 + f_3)/2 .
We denote by T̂_2, T̂_3, T̂_0 the three corresponding trees that maximize the function for the families T_2, T_3, T_0. The weights of these three trees can be obtained in a similar fashion to T̂_1.
Somewhat abusing the notation, we get the following values for the function on the three trees: G(f_1) on T̂_1, G(f_2) on T̂_2, and G(f_3) on T̂_3.
The function G(p) behaves similarly to minus the binary entropy function (Gallager, 1968)

    H(p) = p log p + (1 − p) log(1 − p) .

The range where G(p) is defined is 0 ≤ p ≤ 1 − f_0. In this interval, G(p) is negative and ∪-convex, just like H(p). So G has a single minimum at the point p_0 where its derivative is zero, dG(p)/dp = 0. Solving for p we get p_0 = (1 − f_0)/3.
Now f_3 ≤ f_2 ≤ f_1 and G(p) is ∪-convex. Therefore, out of the three values G(f_1), G(f_2), G(f_3), the maximum is attained at either G(f_3) or at G(f_1), but not at G(f_2) (unless f_2 = f_1 or f_2 = f_3).
Since f_3 + f_2 + f_1 = 1 − f_0 and f_3 ≤ f_2 ≤ f_1, we have f_3 ≤ (1 − f_0)/3 ≤ f_1, namely the two candidates for ML points are on different sides of the minimum point. The point f_3 is strictly to the left and the point f_1 is strictly to the right (except the case where f_3 = f_1 and the two points coincide). If G(f_1) ≥ G(f_3), then the tree T̂_1 is the obvious candidate for the ML MC tree. Indeed, T̂_1 satisfies s_3 = s_2 < s_1, so by Theorem 2, q_3 = q_2 < q_1. Thus, a root can be placed on the edge q_1 so that the molecular clock assumption is satisfied.
We certainly could have a case where G(f_3) > G(f_1). However, the tree T̂_3 has s_3 < s_1 = s_2, implying (by Theorem 2) q_3 < q_1 = q_2. Therefore there is no way to place a root on an edge of T̂_3 so as to satisfy a molecular clock. In fact, any tree with edge lengths q_3 < q_1 = q_2 does not satisfy a molecular clock. So the remaining possibilities could be either the tree T̂_0 (where s_1 = s_2 = s_3 = (1 − f_0)/3) or the tree T̂_1. As T̂_0 attains the minimum of the function G, we are always better off taking the tree T̂_1 (except in the redundant case f_1 = f_3, where all these trees collapse to T̂_0). This completes the proof of Theorem 3.
The case m_2 < m_3 < m_1 and its other permutations can clearly be handled similarly. In the case where G(f_3) > G(f_1), T̂_1 is still the ML MC tree. However, if the difference between the two values is significant, it may give strong support for rejecting a molecular clock assumption for the given data m_0, m_1, m_2, m_3. This would be the case, for example, when 0 ≈ m_3 ≪ m_1 ≈ m_2.
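A small sketch (ours, not from the paper) of how Theorems 2 and 3 translate into a procedure: from the observed counts one computes the split frequencies, gives the long pendant edge to the taxon with the largest split frequency, and reads off the constrained ML spectrum of that family; the function G need not be evaluated explicitly, since the constrained likelihood can be computed directly.

import math

def mlmc_three_taxon(m0, m1, m2, m3):
    # Return (i, s, ll): the taxon i receiving the long pendant edge, the
    # constrained ML spectrum s, and the per-site log-likelihood ll.
    m = m0 + m1 + m2 + m3
    f = [c / m for c in (m0, m1, m2, m3)]
    i = max((1, 2, 3), key=lambda k: f[k])            # largest split frequency
    rest = [k for k in (1, 2, 3) if k != i]
    s = [0.0] * 4
    s[0], s[i] = f[0], f[i]
    s[rest[0]] = s[rest[1]] = (f[rest[0]] + f[rest[1]]) / 2.0
    ll = sum(fk * math.log(sk) for fk, sk in zip(f, s) if fk > 0)   # 0*log 0 := 0
    return i, s, ll

print(mlmc_three_taxon(60, 20, 12, 8))   # taxon 1 gets the long pendant edge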
Two natural directions for extending this work are to consider four-state characters and to extend the number of taxa to n = 4 and beyond. The question of constructing rooted trees from rooted triplets is an interesting algorithmic problem, analogous to that of constructing unrooted trees from unrooted quartets. The biological relevance of triplet-based reconstruction methods is also of interest.
Acknowledgments
Thanks to Sagi Snir for helpful comments on earlier versions of this manuscript. Benny
Chor was supported by ISF grant 418/00 and did part of this work while visiting Massey
University.
References
Bandelt, H.-J., and A. Dress, 1986. Reconstructing the shape of a tree from observed dissimilarity data. Advances in Applied Mathematics, 7:309–343.
Ben-Dor, A., B. Chor, D. Graur, R. Ophir, and D. Pelleg, 1998. Constructing phylogenies from quartets: Elucidation of eutherian superordinal relationships. Jour. of Comput. Biology, 5(3):377–390.
Chor, B., M. D. Hendy, B. R. Holland, and D. Penny, 2000. Multiple Maxima of Likelihood in Phylogenetic Trees: An Analytic Approach. Mol. Biol. Evol., 17(10):1529–1541.
Erdos, P., M. Steel, L. Szekely, and T. Warnow, 1999. A few logs suffice to build (almost) all trees (i). Random Structures and Algorithms, 14:153–184.
Felsenstein, J., 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol., 17:368–376.
Gallager, R. G., 1968. Information Theory and Reliable Communication. Wiley, New York.
Hendy, M. D., and D. Penny, 1993. Spectral analysis of phylogenetic data. J. Classif., 10:5–24.
Hendy, M. D., D. Penny, and M. A. Steel, 1994. Discrete Fourier analysis for evolutionary trees. Proc. Natl. Acad. Sci. USA, 91:3339–3343.
Neyman, J., 1971. Molecular studies of evolution: A source of novel statistical problems. In S. Gupta and Y. Jackel, editors, Statistical Decision Theory and Related Topics, pages 1–27. Academic Press, New York.
Strimmer, K., and A. von Haeseler, 1996. Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies. Molecular Biology and Evolution, 13(7):964–969.
Waddell, P., D. Penny, and T. Moore, 1997. Hadamard conjugations and Modeling Sequence Evolution with Unequal Rates across Sites. Molecular Phylogenetics and Evolution, 8(1):33–50.
Wilson, S. J., 1998. Measuring inconsistency in phylogenetic trees. Journal of Theoretical Biology, 190:15–36.
Yang, Z., 2000. Complexity of the simplest phylogenetic estimation problem. Proc. R. Soc. Lond. B, 267:109–119.
The Performance of Phylogenetic Methods on Trees of
Bounded Diameter
1 Introduction
Phylogenetic trees (that is, evolutionary trees) form an important part of biological re-
search. As such, there are many algorithms for inferring phylogenetic trees. The ma-
jority of these methods are designed to be used on biomolecular (i.e., DNA, RNA, or
amino-acid) sequences. Methods for inferring phylogenies from biomolecular sequence
data are studied (both theoretically and empirically) with respect to the topological ac-
curacy of the inferred trees. Such studies evaluate the effects of various model con-
ditions (such as the sequence length, the rates of evolution on the tree, and the tree
shape) on the performance of various methods.
The sequence length requirement of a method is the sequence length needed by
the method in order to obtain (with high probability) the true tree topology. Earlier
studies established analytical upper bounds on the sequence length requirements of
various methods (including the popular neighbor-joining [18] method). These studies
showed that standard methods, such as neighbor-joining, recover the true tree (with high
probability) from sequences of lengths that are exponential in the evolutionary diameter
of the true tree. Based upon these studies, in [5,6] we defined a parameterization of
model trees in which the longest and shortest edge lengths are fixed, so that the sequence
length requirement of a method can be expressed as a function of the number of taxa, n.
This parameterization leads to the definition of fast-converging methods, which are
methods that recover the true tree from sequences of lengths bounded by a polynomial
in n once f , the minimum edge length, and g, the maximum edge length, are bounded.
Several fast-converging methods were developed [3,4,8,21]. We and others analyzed
2 Basics
In this section, we present the basic definitions, models of evolution, methods, and
terms, upon which the rest of the paper is based.
In [1], the sequence length requirement for the neighbor-joining method under the
Cavender-Farris model was bounded from above, and extended to the General Markov
model in [5]. We state the result here:
Theorem 1. ([1,5]) Let (T, M) be a model tree in the General Markov model. Let λ(e) = −log |det(M_e)|, and set λ_ij = Σ_{e∈P_ij} λ(e). Assume that f is fixed with 0 < f ≤ λ(e) for all edges e ∈ T. Let ε > 0 be given. Then, there are constants C and C' (that do not depend upon f) such that, for

    k = (C/f²) · log n · e^{C'·(max λ_ij)} ,

with probability at least 1 − ε, neighbor-joining on S returns the true tree, where S is a set of sequences of length k generated on T. The same sequence length requirement applies to the Q* method of [2].
From Theorem 1 we can see that as the edge length gets smaller, the sequence length
has to be larger in order for neighbor-joining to return the true tree with high probability.
Note that the diameter of the tree and the sequence length are exponentially related.
Analysis when both f and g Are Fixed: In [8,21], the convergence rate of neighbor-joining was analyzed when both f and g are fixed (recall that f is the smallest edge length, and g is the largest edge length). In this setting, by Theorem 1 and because max λ_ij = O(gn), we see that neighbor-joining recovers the true tree, with probability 1 − ε, from sequences that grow exponentially in n. An average case analysis of tree topologies under various distributions shows that max λ_ij = Θ(g√n) for the uniform distribution and Θ(g log n) for the Yule-Harding distribution. Hence, neighbor-joining has an average case convergence rate which is polynomial in n under the Yule-Harding distribution, but not under the uniform distribution.
By definition, fast-converging methods are required to converge to the true tree
from polynomial length sequences, when f and g are fixed. The convergence rates of
fast-converging methods have a somewhat different form. We show the analysis for the
DCM*-NJ method (see [21]):
Theorem 2. ([21]) Let (T, M) be a model tree in the General Markov model. Let λ(e) = −log |det(M_e)|, and set λ_ij = Σ_{e∈P_ij} λ(e). Assume that f is fixed with 0 < f ≤ λ(e) for all edges e ∈ T. Let ε > 0 be given. Then, there are constants C and C' (that do not depend upon f) such that, for

    k = (C/f²) · log n · e^{C'·width(T)} ,

with probability at least 1 − ε, DCM*-NJ on S returns the true tree, where S is a set of sequences of length k generated on T, and width(T) is a topologically defined function which is bounded from above by max λ_ij and is also O(g log n).
Consequently, fast-converging methods recover the true tree from polynomial length
sequences when both f and g are fixed.
Analysis when max λ_ij Is Fixed: Suppose now that we fix max λ_ij but not f. In this case, neither neighbor-joining nor the fast-converging methods will recover the true tree from sequences whose lengths grow polynomially in n, because as f → 0, the sequence length requirement increases without bound. However, for random birth-death trees, the expected minimum edge length is Θ(1/n). Hence, suppose that in addition to fixing max λ_ij we also require that f = Θ(1/n). In this case, application of Theorem 1 and Theorem 2 shows that neighbor-joining and the fast-converging methods all recover the true tree with high probability from O(n² log n)-length sequences. The theoretically obtained convergence rates differ only in the leading constant, which in neighbor-joining's case depends exponentially on max λ_ij, while in the case of DCM*-NJ this rate depends exponentially on width(T). Thus, the performance advantage of a fast-converging method from a theoretical perspective depends upon the difference between these two values. We know that width(T) ≤ max λ_ij for all trees. Furthermore, the two values are essentially equal only when the strong molecular clock assumption holds. Note also that when the tree has a low evolutionary diameter (i.e., when max λ_ij is small), then the predicted performance of these methods suggests that they will be approximately identical. Only for large evolutionary diameters should we obtain a performance advantage by using the fast-converging methods instead of neighbor-joining.
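To make the role of the two exponents concrete, the toy computation below evaluates the two bounds with placeholder constants (C = C' = 1, which are not the constants of the theorems); only the relative growth as the diameter increases is meaningful.

import math

def seq_length_bound(f, exponent_term, n, C=1.0, Cprime=1.0):
    # k = (C / f^2) * log(n) * e^(C' * exponent_term), with placeholder constants
    return C / f**2 * math.log(n) * math.exp(Cprime * exponent_term)

n, f, g = 400, 1.0 / 400, 0.05
diam = g * n                    # max lambda_ij can grow like g*n in the worst case
width = g * math.log(n)         # width(T) is O(g log n)
print(seq_length_bound(f, diam, n))     # NJ-style bound, exponential in the diameter
print(seq_length_bound(f, width, n))    # DCM*-NJ-style bound, far smaller here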
In the next section we discuss the empirical performance of these methods.
tree topologies) with random branch lengths (also drawn from the uniform distribu-
tion within some specified range), the DCM-NJ+MP method was a clear improvement
upon the NJ method with respect to topological accuracy. The DCM-NJ+MP method
was also more accurate in many of our experiments than the other variants we tested,
leading us to conclude that the improved performance on random trees might extend to
other distributions on model trees.
Later in this paper we will present new experiments, testing this conclusion on ran-
dom birth-death trees with a moderate deviation from ultrametricity. Here we present a
small sample of our earlier experiments, which shows the improved performance and
indicates how DCM-NJ+MP obtains this improved performance.
Recall that the DCM-NJ+MP method has two phases. In the first phase, a collection
of trees is obtained, one for each setting of the parameter q. This inference is based upon
dividing the input set into overlapping subsets, each of diameter bounded from above by
q. The NJ method is then used on each subset to get a subtree for the subset, and these
subtrees are merged into a single supertree. These trees are constructed to be binary
trees, and hence do not need to be further resolved. This first phase is the DCM-NJ
portion of the method. In the second phase, we select a single tree from the collection
of trees {T_q : q ∈ {d_ij}}, by selecting the tree which has the optimal parsimony score
(i.e., the fewest changes on the tree).
The accuracy of this two-phase method depends upon two properties: first, the first
phase must produce a set of trees so that at least some of these trees are better than
the NJ tree, and second, the technique (in our case, maximum parsimony) used in the
second phase must be capable of selecting a better tree than the NJ tree. Thus, the
first property depends upon the DCM-NJ method providing an improvement, and the
second property depends upon the performance of the maximum parsimony criterion as
a technique for selecting from the set {T_q}. In the following figures we show that both
properties hold for random trees under the uniform distribution on tree topologies and
branch lengths.
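The two phases can be summarized in the following skeleton (a sketch, not the authors' code); the callables passed in — a decomposition routine, an NJ implementation, a supertree merger, and a parsimony scorer — are hypothetical stand-ins for the components described above.

def dcm_nj_mp(dist, seqs, decompose, nj, merge, mp_score):
    # Two-phase DCM-NJ+MP skeleton; all four callables are placeholders.
    taxa = range(len(dist))
    thresholds = sorted({dist[i][j] for i in taxa for j in taxa if i < j})
    candidates = []
    for q in thresholds:                       # Phase 1: one tree T_q per threshold q
        subsets = decompose(dist, q)           # overlapping subsets of diameter <= q
        subtrees = [nj(dist, s) for s in subsets]
        candidates.append(merge(subtrees))     # merged into a single binary supertree
    # Phase 2: return the candidate with the best (lowest) parsimony score.
    return min(candidates, key=lambda t: mp_score(t, seqs))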
In Figure 1, we show the results of an experiment in which we scored each of the
different trees T q for topological accuracy. This experiment is based upon random trees
from the uniform distribution. Note that the best trees are significantly better than the NJ
tree. Thus, the DCM-NJ method itself is providing an advantage over the NJ method.
In Figure 2 we show the result of a similar experiment in which we compared several
different techniques for the second phase (i.e., for selecting a tree from the set {T q }).
This figure shows that the Maximum Parsimony (MP) technique obtains better trees
than the Short Quartet Support Method, which is the technique used in the second phase
of the DCM*-NJ method. Furthermore, both DCM-NJ+MP and DCM*-NJ improve
upon NJ, and this improvement increases with the number of taxa.
Thus, for random trees from the uniform distribution on tree topologies and branch
lengths, DCM-NJ+MP improves upon NJ, and this improvement is due to both the
decomposition strategy used in Phase 1, and the selection criterion used in Phase 2.
Note however that DCM-NJ+MP is not statistically consistent, even under the sim-
plest models, since the maximum parsimony criterion can select the wrong tree with
probability going to 1 as the sequence length increases.
Fig. 1. The accuracy of the T q s for different values of q on a randomly generated tree
with 100 taxa, sequence length 1000, and an average branch length of 0.05.
Fig. 2. DCM-NJ+MP vs. DCM*-NJ vs. NJ on random trees (uniform distribution on
tree topologies and branch lengths) with sequence evolution under the K2P+Gamma
model. Sequence length is 1000. Average branch length is 0.05.
Software: We used Sanderson's r8s package for generating birth-death trees [17] and the program Seq-Gen [15] to randomly generate a DNA sequence for the root and evolve it through the tree under the K2P+Gamma model of evolution. We calculated evolutionary distances appropriately for the model (see [11]). In the presence of saturation (that is, datasets in which some distances could not be calculated because the formula did not apply), we used the fix-factor 1 technique, as defined in [9]. In this technique, the distances that cannot be set using the standard technique are all assigned the largest corrected distance in the matrix.
The software for DCM-NJ was written by Daniel Huson. To calculate the maximum
parsimony scores of the trees we used PAUP* 4.0 [19]. For job management across the
cluster and public laboratory machines, we used the Condor software package [20]. We
generated the rest of this software (a combination of C++ programs and Perl scripts)
explicitly for these experiments.
For each model tree we generated sequences of length 500 using seq-gen, computed
trees using NJ and DCM-NJ+MP. We then computed the Robinson-Foulds error rate
for each of the inferred trees, by comparing it to the model tree that generated the data.
In order to obtain statistically robust results, we followed the advice of McGeoch [12] and Moret [13]: we used a number of runs, each composed of a number of trials (a trial is a single comparison), and computed the mean and standard deviation of these outcomes over the runs. This approach is preferable to using the same total number of samples in a single run, because each of the runs is an independent pseudorandom stream. With
this method, one can obtain estimates of the mean that are closely clustered around the
true value, even if the pseudorandom generator is not perfect.
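For concreteness, one common way of computing the quantities just described (a sketch; trees are represented here simply by their sets of nontrivial bipartitions, and the normalization shown is one standard choice) is:

import statistics

def rf_rate(true_splits, inferred_splits):
    # Normalized Robinson-Foulds distance between two trees, each given as a
    # set of nontrivial bipartitions (hashable encodings, e.g. frozensets).
    diff = true_splits ^ inferred_splits
    return len(diff) / (len(true_splits) + len(inferred_splits))

def summarize_runs(rates_per_run):
    # rates_per_run: one list of trial RF rates per independent run.
    run_means = [sum(r) / len(r) for r in rates_per_run]
    return statistics.mean(run_means), statistics.stdev(run_means)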
The standard deviation of the mean outcomes in our studies varied depending on
the number of taxa. The standard deviation of the mean on 10-taxon trees is 0.2 (which
is 20 percent, since the possible values of the outcomes range from 0 to 1), on 25-taxon
trees is 0.1 (which is 10 percent), whereas on 200, 400 and 800-taxon trees the standard
deviation ranged from 0.02 to 0.04 (which is between 2 and 4 percent). We graph the
average of the mean outcomes for the runs, but omit the standard deviations from the
graphs.
In Figure 3, we show how neighbor-joining and DCM-NJ+MP are affected by in-
creasing the rate of evolution (i.e., the height). The x-axis is the maximum expected
number of changes of a random site across the tree, and the y-axis is the RF rate. We
provide a curve for each number of taxa we explored, from 10 up to 800. The sequence
length is fixed in this experiment to 500. Note that both neighbor-joining and DCM-
NJ+MP have high errors for the lowest rates of evolution, and that at these low rates
of evolution the error rates increase as n increases. This is because for these low rates
of evolution, increasing the number of taxa makes the smallest edge length (i.e., f )
decrease, and thus increases the sequence length needed to have enough changes on
the short edges for them to be recoverable. As the rate of evolution increases, the error
rates initially decrease for both methods, but eventually the error rates begin to increase
again. This increase in error occurs where the exponential portion of the convergence
rate (i.e., where the sequence length depends exponentially on max ij ) becomes sig-
nificant. Note that where this happens is essentially the same for both methods and
that they perform equally well until that point. However, after this point, neighbor-joining's performance is worse compared to DCM-NJ+MP; furthermore, the error rate increases for neighbor-joining at each of the large diameters as n increases, while DCM-NJ+MP's error rate does not reflect the number of taxa nearly as much.
In Figure 4, we present a different way of looking at the data. In this figure, the
x-axis is the number of taxa, the y-axis is the RF rate, and there is a curve for each
of the methods. We show thus how increasing n (the number of taxa) while fixing the
diameter of the tree affects the accuracy of the trees reconstructed. Note that at low rates
of evolution (the left figure), the error rates for both methods increase with the number
of taxa. At moderate rates of evolution (the middle figure), error rates increase for both
methods but more so for neighbor-joining than for DCM-NJ+MP. Finally, at the higher
rate of evolution (the right figure), this trend continues, but the gap is even larger; in fact, DCM-NJ+MP's error increase looks almost flat.
These experiments suggest strongly that except for low diameter situations, the
DCM-NJ+MP method (and probably the other fast-converging methods) will out-
perform the neighbor-joining method, especially for large numbers of taxa and high
evolutionary rates.
Fig. 3. NJ (left graph) and DCM-NJ+MP (right graph) error rates on random birth-death
trees as the diameter (x-axis) grows. Sequence length fixed at 500, and deviation factor
fixed at 4.
6 Conclusion
In an earlier study we presented the DCM-NJ+MP method and showed that it outper-
formed the NJ method for random trees drawn from the uniform distribution on tree
topologies and branch lengths. In this study we show that this improvement extends to
the case where the trees are drawn from a more biologically realistic distribution, in
which the trees are birth-death trees with a moderate deviation from ultrametricity. This
Fig. 4. NJ and DCM-NJ+MP: Error rates on random birth-death trees as the number
of taxa (x-axis) grows. Sequence length fixed at 500 and the deviation factor at 4. The
expected diameters of the resultant trees are 0.02 (for the left graph), 0.2 (for the middle
graph), and 1.0 (for the right graph).
Taxa NJ DCM-NJ+MP
10 0.01 1.94
25 0.02 9.12
50 0.06 24.99
100 0.35 132.46
200 2.5 653.27
400 20.08 4991.11
800 160.4 62279.3
study has consequences for large phylogenetic analyses, because it shows that the accu-
racy of the NJ method may suffer significantly on large datasets. Furthermore, since the
DCM-NJ+MP method has good accuracy, even on large datasets, our study suggests
that other polynomial time methods may be able to handle the large dataset problem
without significant error.
Acknowledgments
We would like to thank the David and Lucile Packard Foundation (for a fellowship to
Tandy Warnow), the National Science Foundation (for a POWRE grant to Katherine St.
John), the Texas Institute for Computational and Applied Mathematics and the Center
for Computational Biology at UT-Austin (for support of Katherine St. John), Doug
Burger and Steve Keckler for the use of the SCOUT cluster at UT-Austin, and Patti
Spencer and her staff for their help.
References
1. K. Atteson. The performance of the neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25:251–278, 1999.
2. V. Berry and O. Gascuel. Inferring evolutionary trees with strong combinatorial evidence. In Proc. 3rd Ann. Int'l Conf. Computing and Combinatorics (COCOON '97), pages 111–123. Springer Verlag, 1997. In LNCS 1276.
3. M. Csuros. Fast recovery of evolutionary trees with thousands of nodes. To appear in RECOMB '01, 2001.
4. M. Csuros and M. Y. Kao. Recovering evolutionary trees through harmonic greedy triplets. Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA '99), pages 261–270, 1999.
5. P. L. Erdos, M. Steel, L. Szekely, and T. Warnow. A few logs suffice to build almost all trees I. Random Structures and Algorithms, 14:153–184, 1997.
6. P. L. Erdos, M. Steel, L. Szekely, and T. Warnow. A few logs suffice to build almost all trees II. Theor. Comp. Sci., 221:77–118, 1999.
7. J. Huelsenbeck and D. Hillis. Success of phylogenetic methods in the four-taxon case. Syst. Biol., 42:247–264, 1993.
8. D. Huson, S. Nettles, and T. Warnow. Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol., 6:369–386, 1999.
9. D. Huson, K. A. Smith, and T. Warnow. Correcting large distances for phylogenetic reconstruction. In Proceedings of the 3rd Workshop on Algorithms Engineering (WAE), 1999. London, England.
10. M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16:111–120, 1980.
11. W. H. Li. Molecular Evolution. Sinauer, Massachusetts, 1997.
12. C. McGeoch. Analyzing algorithms by simulation: variance reduction techniques and simulation speedups. ACM Comp. Surveys, 24:195–212, 1992.
13. B. Moret. Towards a discipline of experimental algorithmics, 2001. To appear in Monograph in Discrete Mathematics and Theoretical Computer Science; also see http://www.cs.unm.edu/moret/dimacs.ps.
14. L. Nakhleh, U. Roshan, K. St. John, J. Sun, and T. Warnow. Designing fast converging phylogenetic methods. Oxford U. Press, 2001. To appear in Bioinformatics: Proc. 9th Int'l Conf. on Intelligent Systems for Mol. Biol. (ISMB '01).
15. A. Rambaut and N. C. Grassly. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosci., 13:235–238, 1997.
16. D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53:131–147, 1981.
17. M. Sanderson. r8s software package. Available from http://loco.ucdavis.edu/r8s/r8s.html.
18. N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406–425, 1987.
19. D. L. Swofford. PAUP*: Phylogenetic analysis using parsimony (and other methods), 1996. Sinauer Associates, Sunderland, Massachusetts, Version 4.0.
20. Condor Development Team. Condor high throughput computing program, Copyright 1990-2001. Developed at the Computer Sciences Department of the University of Wisconsin; http://www.cs.wisc.edu/condor.
21. T. Warnow, B. Moret, and K. St. John. Absolute convergence: true trees from short sequences. Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA '01), pages 186–195, 2001.
(1+ε)-Approximation of Sorting by Reversals and Transpositions
Niklas Eriksen
1 Introduction
This paper is concerned with the problem of sorting permutations using long range
operations like inversions (reversing a segment) and transpositions (moving a segment).
The problem comes from computational molecular biology, where the aim is to find a
parsimonious rearrangement scenario that explains the difference in gene order between
two genomes. In the late eighties, Palmer and Herbon [ 9] found that the number of such
operations needed to transform the gene order of one genome into the other could be
used as a measure of the evolutionary distance between two species.
The kinds of operations we consider are inversions, transpositions and inverted
transpositions. Hannenhalli and Pevzner [7] showed that the problem of finding the
minimal number of inversions needed to sort a signed permutation is solvable in poly-
nomial time, and an improved algorithm was subsequently given by Kaplan et al. [ 8].
Caprara, on the other hand, showed that the corresponding problem for unsigned per-
mutations is NP-hard [4]. For transpositions no such sharp results are known, but the
(3/2)-approximation algorithms of Bafna and Pevzner [ 2] and Christie [5] are worth
mentioning.
Moving on to the combined problem, Gu et al. [ 6] gave a 2-approximation algo-
rithm for the minimal number of operations needed to sort a signed permutation by
inversions, transpositions and inverted transpositions. However, an algorithm looking
for the minimal number of operations will produce a solution heavily biased towards
transpositions. Instead, we propose the following problem: find the π-sorting scenario s (i.e., transforming π to the identity) that minimizes inv(s) + 2·trp(s), where inv(s) and trp(s) are the numbers of inversions and transpositions in s, respectively.
We give a closed formula for this minimal weighted distance. Our formula is sim-
ilar to the exact formula for the inversion case, given by Hannenhalli and Pevzner [ 7].
We also show how to obtain a polynomial time algorithm for computing this formula
with an accuracy of (1 + ε), for any ε > 0. As an example, we explicitly state a 7/6-
approximation. We also argue that for most applications the algorithm performs much
better than guaranteed.
2 Preliminaries
Here we present some useful definitions from Bafna and Pevzner [ 2] and Hannenhalli
and Pevzner [7], as well as a couple of new ones.
In this paper, we work with signed, circular permutations. We adopt the con-
vention of reading the circular permutations counterclockwise. We will sometimes lin-
earize the permutation by inverting both signs and reading direction if it contains -1,
then making a cut in front of 1 and finally adding n + 1 last, where n is the length of the
permutation. An example is shown in Figure 1.
[Fig. 1: a signed circular permutation and its linearization 1 -6 4 -5 -3 -2 7.]
A breakpoint in a permutation is a pair of adjacent genes that are not adjacent in a given reference permutation. For instance, if we compare a genome to the identity permutation and consider the linearized version of the permutation, the pair (π_i, π_{i+1}) is a breakpoint if and only if π_{i+1} − π_i ≠ 1. For unsigned permutations, this would be written |π_{i+1} − π_i| ≠ 1.
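For example, counting breakpoints of a linearized signed permutation against the identity is immediate (a sketch using the convention just stated):

def breakpoints(perm):
    # perm is a linearized signed permutation, starting with 1 and ending with n+1;
    # (pi_i, pi_{i+1}) is a breakpoint iff pi_{i+1} - pi_i != 1.
    return sum(1 for a, b in zip(perm, perm[1:]) if b - a != 1)

print(breakpoints([1, -6, 4, -5, -3, -2, 7]))   # the linearized example above: 5 breakpoints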
The three operations we consider are inversions, transpositions and inverted transpositions; these are defined in Figure 2.
[Fig. 2: the three operations. Inversion: . . . π_i π_{i+1} . . . π_j π_{j+1} . . . becomes . . . π_i −π_j . . . −π_{i+1} π_{j+1} . . . ; transposition: . . . π_i π_{i+1} . . . π_j π_{j+1} . . . π_k π_{k+1} . . . becomes . . . π_i π_{j+1} . . . π_k π_{i+1} . . . π_j π_{k+1} . . . ; inverted transposition: . . . π_i π_{i+1} . . . π_j π_{j+1} . . . π_k π_{k+1} . . . becomes . . . π_i −π_k . . . −π_{j+1} π_{i+1} . . . π_j π_{k+1} . . . ]
Following [2,7,8], we transform a signed, circular
[Fig. 3: a signed circular permutation and its transformation into an unsigned circular permutation on twice as many elements.]
i and i+1
if there is a breakpoint between them and a grey edge between 2i and 2i+1,
unless these are adjacent (Figure 4). These edges will then form alternating cycles. The
length of a cycle is the number of black edges in it. Sometimes we will also draw black
and grey edges between 2i and 2i + 1, even though these are adjacent. We will then
get a cycle of length one at the places where we do not have a breakpoint in (these
cycles will be referred to as short cycles). A cycle is oriented if, when we traverse it,
at least one black edge is traversed clockwise and at least one black edge is traversed
counterclockwise. Otherwise, the cycle is unoriented.
Consider two cycles c_1 and c_2. If we cannot draw a straight line through the circle such that the elements of c_1 are on one side of the line and the elements of c_2 are on the other side, then these two cycles are inseparable. This relation is extended to an equivalence relation by saying that c_1 and c_j are in the same component if there is a sequence of cycles c_1, c_2, . . . , c_j such that, for all 1 ≤ i ≤ j − 1, c_i and c_{i+1} are inseparable.
A component is oriented if at least one of its cycles is oriented and unoriented
otherwise. If there is an interval on the circle, which contains an unoriented component,
but no other unoriented components, then this component is a hurdle. If we cannot
remove a hurdle without creating a second hurdle upon its removal (this is the case if
there is an unoriented component, which is not a hurdle, that stretches over an interval
that contains the previously mentioned hurdle, but no other hurdles), then the hurdle is
called a super hurdle. If we have an odd number of super hurdles and no other hurdles,
the permutation is known as a fortress.
Fig. 4. The breakpoint graph of a transformed permutation. It contains three cycles, two
of length 2 and one of length 3. The latter constitutes one component, and the first
two constitute another component. The first component is unoriented and the second is
oriented. Both components have size 2.
We should observe that for components in the breakpoint graph, the operations
needed to remove them do not depend on the actual numbers on the vertices. We could
therefore treat the components as separate objects, disregarding the particular permuta-
tion they are part of. If we wish, we can also regard the components as permutations,
by identifying them with (one of) the shortest permutations whose breakpoint graphs
consist of this component only. For example, the 2-cycle component in Figure 4 can be
identified with the permutation 1 -3 -4 2 5.
We say that an unoriented component is of odd length if all cycles in the component are of odd length. We let b(π), c(π) and h(π) denote the number of breakpoints, cycles (not counting cycles of length one) and hurdles in the breakpoint graph of a permutation π, respectively. For components t, b(t) and c(t) are defined similarly. The size s of a component t is given by b(t) − c(t). We also let c_s(π) denote the number of cycles in π, including the short ones, and f(π) is a function that is 1 if π is a fortress, and 0 otherwise.
Let S_I denote the set of all scenarios transforming π into id using inversions only and let inv(s) denote the number of inversions in a scenario s. The inversion distance is defined as d_Inv(π) = min_{s∈S_I} {inv(s)}. It has been shown in [7,8] that

    d_Inv(π) = b(π) − c(π) + h(π) + f(π),

where b(π), c(π), h(π) and f(π) have been defined in the previous paragraph. In this paper, we define the distance between π and id by

    d(π) = min_{s∈S} {inv(s) + 2 · trp(s)},

where S is the set of all scenarios transforming π into id, allowing both inversions and transpositions, and inv(s) and trp(s) are the numbers of inversions and transpositions in scenario s, respectively. Here, transpositions refer to both ordinary and inverted transpositions.
In order to give a formula for this distance, we need a few definitions.
Definition 1. Regard all components t as permutations and let d(t) be the distance between t and id as defined above for permutations. Consider the set S of components t such that d(t) > b(t) − c(t) (when using inversions only, this is the set of unoriented components). We call this set the set of strongly unoriented components. If there is an interval on the circle that contains the component t ∈ S, but no other member of S, then t is a strong hurdle. Strong super hurdles and strong fortresses are defined in the same way as super hurdles and fortresses (just replace hurdle with strong hurdle).
Lemma 1. Each strongly unoriented component is unoriented (in the inversion sense).
Proof. We know that for oriented components t, d_Inv(t) = b(t) − c(t) and, for any permutation π, we have d(π) ≤ d_Inv(π). Regarding the component t as a permutation gives d(t) ≤ d_Inv(t). Thus, for strongly unoriented components we have d_Inv(t) ≥ d(t) > b(t) − c(t), and we can conclude that a strongly unoriented component cannot be oriented.
    d(π) = n − c_s(π) + h_t(π) + f_t(π),

Proof. It is easy to see that d(π) ≤ n − c_s(π) + h_t(π) + f_t(π). If we treat the strong hurdles as in the inversion case, we need only h_t(π) + f_t(π) inversions to make all strongly unoriented components oriented. All oriented components can be removed efficiently using inversions, and the unoriented components which are not strongly unoriented can, by definition, be removed efficiently.
We now need to show that we cannot do better than the formula above. From Hannenhalli and Pevzner we know that we cannot decrease n − c_s(π) by more than 1 using an inversion. Similarly, a transposition will never decrease n − c_s(π) by more than 2, which is obtained by splitting a cycle into three cycles. The question is whether transpositions can help us to remove strong hurdles more efficiently than inversions.
Bafna and Pevzner have shown that applying a transposition can only change the number of cycles by 0 or ±2. There are thus three possible ways of applying a transposition. First, we can split a cycle into three parts (Δc_s = 2). If we do this to a strong hurdle, at least one of the components we get must by definition remain a strong hurdle, since otherwise the original component could be removed efficiently. This gives Δh_t = 0. Second, we can let the transposition cut two cycles (Δc_s = 0). To decrease the distance by three, we would have to decrease the number of strong hurdles by three, which is clearly out of reach (only two strong hurdles may be affected by a transposition on two cycles). Finally, if we merge three cycles (Δc_s = −2), we would need to remove five strong hurdles. This clearly is impossible.
It is conceivable that the fortress property could be removed by a transposition that reduces n − c_s(π) + h_t(π) by two and at the same time removes an odd number of strong super hurdles or adds a strong hurdle that is not a strong super hurdle. However, from the analysis above, we know the transpositions that decrease n − c_s(π) + h_t(π) by two must decrease h_t(π) by an even number. We also found that when this was achieved, no other hurdles apart from those removed were affected. Hence, there are no transpositions that reduce n − c_s(π) + h_t(π) + f_t(π) by three.
We find that d(π) ≥ n − c_s(π) + h_t(π) + f_t(π), and in combination with the first inequality, d(π) = n − c_s(π) + h_t(π) + f_t(π).
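To illustrate the quantity c_s(π), here is a sketch that builds the breakpoint graph of a signed circular permutation via the standard doubling (+x → 2x−1, 2x; −x → 2x, 2x−1) and counts its alternating cycles, short cycles included; the details of the doubling follow the usual Bafna-Pevzner construction rather than a quotation of this paper.

def count_cycles(perm):
    # c_s(pi): number of alternating cycles (short cycles included) in the
    # breakpoint graph of a signed circular permutation pi.  Sketch only.
    n = len(perm)
    doubled = []
    for x in perm:                                   # +x -> 2x-1, 2x ;  -x -> 2x, 2x-1
        doubled += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    m = 2 * n
    black, grey = {}, {}
    for i in range(n):                               # black edges follow the permutation
        a, b = doubled[(2 * i + 1) % m], doubled[(2 * i + 2) % m]
        black[a], black[b] = b, a
    for i in range(1, n + 1):                        # grey edges encode the identity
        a, b = 2 * i, 2 * i + 1 if 2 * i + 1 <= m else 1
        grey[a], grey[b] = b, a
    seen, cycles = set(), 0
    for start in range(1, m + 1):
        if start in seen:
            continue
        cycles += 1
        v, use_black = start, True
        while True:
            seen.add(v)
            v = black[v] if use_black else grey[v]
            use_black = not use_black
            if v == start and use_black:
                break
    return cycles

# d(pi) = n - c_s(pi) when there are no strong hurdles:
print(count_cycles([1, 2, 3]))        # identity: 3 short cycles, distance 0
print(count_cycles([1, -2, 3]))       # one inversion away: 2 cycles, distance 1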
Fig. 5. The breakpoint graph of a cycle of length three which can be removed by a single
transposition.
Proof. Since the component is unoriented, applying an inversion to it will not increase
the number of cycles. If we apply a transposition to it, it will remain unoriented. Thus,
the only way to remove it efficiently would be to apply a series of transpositions, all
increasing the number of cycles by two.
Consider what happens if we split a cycle of even length into three cycles. The sum
of the length of these three new cycles must equal the length of the original cycle, in
particular it must be even. Three odd numbers never add to an even number, so we must
still have at least one cycle of even length, which is shorter than the original cycle.
Eventually, the component must contain a cycle of length 2. There are no transposi-
tions reducing b(t) − c(t) by 2 that can be applied to this cycle, and hence the component
is strongly unoriented.
Concentrating on the unoriented components with cycles of odd lengths only, we find
that some of these are strongly unoriented and some are not. For instance, there are two
unoriented cycles of length three. One of them is the cycle in which we may remove
three breakpoints (Figure 5) and the other one can be seen in Figure 6 (a). Note that this
cycle can not be a component. This is, however, not true for the components in Figure 6
(b) and (c), which are the two smallest strongly unoriented components of odd length.
Fig. 6. A cycle of length three which can not be removed by a transposition (a), the
smallest strongly unoriented component of odd length (b) and the second smallest
strongly unoriented component of odd length (c).
5 The Algorithm
6 Discussion
The algorithm presented here relies on the creation of a table of components that can be
removed efficiently. Could this technique be used to find an algorithm for any similar
sorting problem such as sorting by transpositions? In general, the answer is no. In this
case, as for sorting with inversions, we know that if a component can not be removed
efficiently, we need only one extra inversion. We also know that for components that
can be removed efficiently, we can never improve on such a sorting by combining com-
ponents. For sorting by transpositions, no such results are known and until they are, the
table will need to include not only some of the components up to a certain size, but
every permutation of every size.
The next step is obviously to examine if there is an easy way to distinguish all
strongly unoriented components. For odd unoriented components, this property seems
very elusive. It also seems hard to discover a useful sequence of transpositions that
removes odd oriented components that are not strongly unoriented. However, investi-
gations on small components have given very promising results. For cycles of length 7,
we have the following result: If the cycle is not a strongly unoriented component, then
no transposition that increases the number of cycles by two will give a strongly unori-
ented component. This appears to be the case for cycles of length 9 as well, but no fully
exhaustive search has been conducted, due to limited computational resources.
If this pattern would hold, we could apply any sequence of breakpoint removing
transpositions to a component, until we either have removed the component, or are
unable to find any useful transpositions. In the first case, the component is clearly not
strongly unoriented, and in the second case it would be strongly unoriented.
Acknowledgment
I wish to thank my advisor Kimmo Eriksson for valuable comments during the prepa-
ration of this paper. Niklas Eriksen was supported by the Swedish Natural Science Re-
search Council and the Swedish Foundation for Strategic Research.
References
1. Bader, D. A., Moret, B. M. E., Yan, M.: A Linear-Time Algorithm for Computing Inversion Distance Between Signed Permutations with an Experimental Study. Preprint
2. Bafna, V., Pevzner, P.: Genome rearrangements and sorting by reversals. Proceedings of the 34th IEEE Symposium on Foundations of Computer Science (1994), 148–157
3. Bafna, V., Pevzner, P.: Sorting by Transpositions. SIAM Journal of Discrete Mathematics 11 (1998), 224–240
4. Caprara, A.: Sorting permutations by reversals and Eulerian cycle decompositions. SIAM Journal of Discrete Mathematics 12 (1999), 91–110
5. Christie, D. A.: Genome rearrangement problems. Ph.D. thesis (1998)
6. Gu, Q.-P., Peng, S., Sudborough, H.: A 2-approximation algorithm for genome rearrangements by reversals and transpositions. Theoretical Computer Science 210 (1999), 327–339
7. Hannenhalli, S., Pevzner, P.: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations with reversals). Proceedings of the 27th Annual ACM Symposium on the Theory of Computing (1995), 178–189
8. Kaplan, H., Shamir, R., Tarjan, R. E.: Faster and Simpler Algorithm for Sorting Signed Permutations by Reversals. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (1997), 344–351
9. Palmer, J. D., Herbon, L. A.: Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. Journal of Molecular Evolution 28 (1988), 87–97
On the Practical Solution of
the Reversal Median Problem
Alberto Caprara
Abstract. In this paper, we study the Reversal Median Problem (RMP), which
arises in computational biology and is a basic model for the reconstruction of
evolutionary trees. Given q genomes, RMP calls for another genome such that
the sum of the reversal distances between this genome and the given ones is min-
imized. So far, the problem was considered too complex to derive mathematical
models useful for its practical solution. We use the graph theoretic relaxation of
RMP that we developed in a previous paper [6], essentially calling for a perfect
matching in a graph that forms the maximum number of cycles jointly with q
given perfect matchings, to design effective algorithms for its exact and heuris-
tic solution. We report the solution of a few hundred instances associated with
real-world genomes.
1 Introduction
one would expect the reversal distance to be employed when evolutionary trees are re-
constructed (still under the simplifying assumptions of genomes with the same genes,
no gene duplication and elementary operations of a unique type). This is not the case,
probably because the mathematical modeling of the problem of finding the best tree
w.r.t. the reversal distance is considered to be too complex, giving as motivation the fact
that, though efficient, computation of the distance between two permutations is not at
all trivial (actually its complexity was open for a while until the result of [12]). The only
attempts so far to use the reversal distance within evolutionary tree reconstruction can
be found in [11,25,17].
For the reasons above, the use of a simpler (though less realistic) notion of distance,
called breakpoint distance, which is trivial to compute, has been proposed in [22] and
extensively studied afterwards [3,20,23,18,8,21,5]. In particular, much work has been
done on the so-called Breakpoint Median Problem (BMP), which is the problem of find-
ing a genome which is closest to a given set of genomes w.r.t. the breakpoint distance,
i.e., the sum of the breakpoint distances between the genome to be found and each given
genome is minimized. All the methods to reconstruct evolutionary trees [3,23,18] use
as a subroutine a procedure to solve BMP, either exactly or heuristically, in order to find
the best genome associated with a given tree node once the genomes associated with
the neighbors of the node are fixed. It is easy to show [22] that BMP is a special case of
the Traveling Salesman Problem (TSP).
This paper is aimed at considering the use of the reversal distance in the reconstruc-
tion of evolutionary trees, mainly focusing on the Reversal Median Problem (RMP),
the counterpart of BMP in which the reversal distance is used instead of the breakpoint
distance. In [6] we described a graph theoretic relaxation of RMP which allowed us
to prove that the problem is NP-hard (actually also APX-hard). Essentially all papers dealing with BMP mention RMP as a more realistic model, say that it is NP-hard citing [6], and motivate the use of BMP by sentences like "For RMP, there are no algorithms available, aside from rough heuristics, for handling even three relatively short genomes" [3,22], or "Even heuristic approaches for RMP work well only for small instances" [23,24]. Our ultimate goal is to show that, although RMP is NP-hard and
nontrivial to model, instances of the problem associated with real-world genomes can
be solved to (near-)optimality within short computing time.
In this paper we first recall the graph theoretic relaxation of RMP given in [6], in Section 2. In [6] we also presented an Integer Linear Programming (ILP) formulation for this relaxation, hoping that this ILP would have allowed us to solve to proven optimality real-world RMP instances. Actually, this was not the case, as we discuss in the present paper: the exact algorithm that we will propose, presented in Section 3,
is not based on this ILP. However, a careful use of the LP relaxation of the ILP is the
key part of the best heuristic algorithm that we have developed, illustrated in Section
4. Experimental results are given in Section 5, where we show that our methods can
solve to proven optimality many instances associated with real-world genomes, even of
relatively large size, and find provably good solutions for the remaining instances.
2 Preliminaries
A genome without gene duplications for which the orientation of the genes is known can
be represented by a signed permutation on N := {1, . . . , n}, obtained by signing a
[Figure 1: the MB graph G(π1, π2, π3) on the eight nodes 0–7, with the base matching H and the matchings M(π1), M(π2), M(π3). Figure 2: the contraction of an edge (i, j).]
find a permutation π such that Σ_{k∈Q} d(π, πk) is minimized. In terms of G(π1, . . . , πq), the problem calls for a permutation matching T such that qn − Σ_{k∈Q} c(T, M(πk)) is minimized. The correspondence between solutions π and T is given by M(π) = T. The immediate generalization of Theorem 2 for RMP is the following. Let δ* and qn − γ* denote the optimal solution values of RMP and CMP, respectively.
Proposition 3. ([6]) Given an RMP instance and the associated CMP instance, δ* ≥ qn − γ*.
For the MB graph in Figure 1, an optimal solution of both RMP and CMP is given by π1, for which δ(π1) = d(π1, π1) + d(π1, π2) + d(π1, π3) = 0 + 3 + 2 = 5 and γ(π1) = c(π1, π1) + c(π1, π2) + c(π1, π3) = 4 + 1 + 2 = 7. Hence, δ* = 5 and γ* = 7.
The strength of the cycle distance lower bound on the reversal distance suggests that
the solution of CMP should yield a strong lower bound on the optimal solution value
of RMP. CMP is indeed the key problem addressed in this paper to derive effective
algorithms for RMP.
We conclude this section with an important notion that will be used in the next
sections. Given a perfect matching M on node set V and an edge e = (i, j) ∈ {(i, j) : i, j ∈ V, i ≠ j}, we let M/e be defined as follows. If e ∈ M, M/e := M \ {e}. Otherwise, letting (i, a), (j, b) be the two edges in M incident to i and j, M/e := M \ {(i, a), (j, b)} ∪ {(a, b)}. The following obvious lemma will be used later in the
paper.
Lemma 1. Given two perfect matchings M, L of V and an edge e = (i, j) ∈ M, M ∪ L defines a Hamiltonian cycle of V if and only if (M/e) ∪ (L/e) defines a Hamiltonian cycle of V \ {i, j}.
Given an MB graph G(π1, . . . , πq), the contraction of an edge e = (i, j) ∈ {(i, j) : i, j ∈ V, i ≠ j} is the operation that modifies G(π1, . . . , πq) as follows. Edge (i, j) is removed along with nodes i and j. For k = 1, . . . , q, M(πk) is replaced by M(πk)/e,
and the base matching H is replaced by H/e. Figure 2 illustrates the contraction of
edge (i, j).
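As an illustration, the following Python sketch applies M/e to a matching stored as a set of unordered node pairs; the data-structure choices (frozenset edges, plain Python sets) are ours and are not taken from [6].

```python
def contract(matching, e):
    """Return M/e for a perfect matching M and an edge e = (i, j): if e
    belongs to M it is simply dropped, otherwise the mate a of i and the
    mate b of j are spliced together into the new edge (a, b)."""
    i, j = e
    M = set(matching)
    if frozenset(e) in M:
        M.discard(frozenset(e))
        return M
    (a,) = [x for edge in M if i in edge for x in edge if x != i]  # mate of i in M
    (b,) = [x for edge in M if j in edge for x in edge if x != j]  # mate of j in M
    M.discard(frozenset((i, a)))
    M.discard(frozenset((j, b)))
    M.add(frozenset((a, b)))
    return M

# M matches 0-1 and 2-3; contracting e = (1, 2) splices the mates 0 and 3 together
print(contract({frozenset((0, 1)), frozenset((2, 3))}, (1, 2)))  # {frozenset({0, 3})}
```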
Note that the graph obtained after an edge contraction is not necessarily an MB
graph, in particular there may be some k ∈ Q such that M(πk) ∪ H defines more than
one cycle after contraction. This is apparently a drawback, since our methods deal with
instances obtained by contracting edges. To overcome this, our algorithms are suited for
the following generalization of CMP. Consider a graph G on node set V, with |V| = 2n for some integer n > 1, along with a perfect matching H, called base matching, and let E := {(i, j) : i, j ∈ V, i ≠ j} \ H. Given q perfect matchings M1, . . . , Mq ⊆ E (which do not necessarily define Hamiltonian cycles with H), find a perfect matching T ⊆ E such that T ∪ H is a Hamiltonian cycle and qn − Σ_{k∈Q} c(T, Mk) is minimized.
In the rest of the paper, we will use the term CMP to denote this more general version.
qn/2 + ( Σ_{k=1}^{q−1} Σ_{l=k+1}^{q} c(Mk, Ml) ) / (q − 1).    (2)
The lower bound on the optimal CMP (and RMP) value, obtained as qn minus the upper bound (2) and called lbC, can be computed in O(nq²) time since the computation of c(Mk, Ml) for all k, l = 1, . . . , q, k ≠ l, takes O(n) time. In the next section we give
an LP interpretation of this bound.
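For concreteness, the following sketch counts the cycles c(M1, M2) formed by the union of two perfect matchings in O(n) time; this is the only primitive needed to evaluate the pairwise terms c(Mk, Ml) entering lbC. Representing a matching as a list of node pairs is our own choice.

```python
def count_cycles(M1, M2):
    """Count the cycles of the union of two perfect matchings on the same
    node set: starting from an unvisited node, follow edges of M1 and M2
    alternately until the walk returns to an already visited node."""
    mate1, mate2 = {}, {}
    for u, v in M1:
        mate1[u], mate1[v] = v, u
    for u, v in M2:
        mate2[u], mate2[v] = v, u
    seen, cycles = set(), 0
    for start in mate1:
        if start in seen:
            continue
        cycles += 1
        node, use_first = start, True
        while node not in seen:
            seen.add(node)
            node = mate1[node] if use_first else mate2[node]
            use_first = not use_first
    return cycles

# the matchings {0-1, 2-3} and {1-2, 3-0} together form a single 4-cycle
print(count_cycles([(0, 1), (2, 3)], [(1, 2), (3, 0)]))  # 1
```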
Recall the definition of edge contraction in Section 2. The branching scheme of our
algorithm is inspired by the following
Lemma 3. Given a CMP instance and an edge e ∈ E, the best CMP solution containing e is given by T ∪ {e}, where T is the optimal solution of the CMP instance obtained by contracting edge e.
Proof. Let T* := T ∪ {e}. First of all, note that T* is a feasible CMP solution by Lemma 1. Now suppose there is another solution T′ with e ∈ T′ and
Σ_{k∈Q} c(T′, Mk) > Σ_{k∈Q} c(T*, Mk).    (3)
For all k such that e ∈ Mk, a cycle of length 2 formed by two copies of e is defined by both T′ ∪ Mk and T* ∪ Mk. Furthermore, contraction of edge e ensures a one-to-one correspondence between cycles defined by T′ ∪ Mk (resp. T* ∪ Mk) which are not two
According to Lemma 3, if we fix edge e in the CMP solution, an upper bound on the number of cycles is given by |{k ∈ Q : e ∈ Mk}| (i.e., the number of cycles of length 2 defined by two copies of e) plus the upper bound (2) computed after the contraction
of e. This allows us to design a branch-and-bound algorithm where, starting from node
0 ∈ V, we enumerate all permutation matchings by fixing, in turn, either edge (0, 1),
or (0, 2), . . . , or (0, 2n) in the solution. Recursively, if the last edge fixed in the current
partial solution is (i, j) and the edge in H incident to j is (j, k), we proceed with the
enumeration by fixing in the solution, in turn, edge (k, l), for all l with no incident
edge fixed so far. We proceed in a depth first way. With this scheme we can perform
the lower bound test after each fixing in O(nq²) time (in fact, the recomputation of
the lower bound after each fixing is done parametrically and, in practice, turns out to
be faster than from scratch). In order to have a fast processing of the subproblems,
the only operations performed are edge contraction and the bound computation, and
the incumbent solution is updated only when the current partial solution is a complete
solution.
The main drawback with the above scheme is that good solutions are found after
considerable computing time. To overcome this, we start from the lower bound lbC
computed for the original problem and call the branch-and-bound first with a target
value t := lbC , searching for a CMP solution of value t and backtracking as soon as
the lower bound for the current partial solution is > t. If a solution of value t is found,
it is optimal and we stop, otherwise no solution of value better than lbC + 1 exists.
Accordingly, we call the branch-and-bound with target value t := lbC + 1, and so on,
stopping as soon as we find a solution with the target value. Even if this has the effect
of reconsidering some subproblem more than once, every call of branch-and-bound
with a new value of t takes typically much longer than the previous one, therefore
the increase in running time due to many calls is negligible. On the other hand, the
scheme allows for the fast derivation of good lower bounds, noting that t is a lower
bound on the optimal CMP value and is increased by 1 after each call, with the minimal
core memory requirements of a depth-first branch-and-bound (we often examine several
million subproblems so the explicit storage of the subproblems would be impossible).
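The control flow of this target-value scheme can be sketched as follows; branch_and_bound is a hypothetical callback standing for the depth-first enumeration described above, returning a solution of value exactly t or None.

```python
def solve_by_increasing_target(lb, branch_and_bound):
    """Call the bounded enumeration with targets t = lb, lb + 1, ...: each
    call either finds a solution of value t (then t is optimal, since t is
    also a valid lower bound) or proves that no solution of value t exists,
    in which case the lower bound can be raised to t + 1."""
    t = lb
    while True:
        solution = branch_and_bound(t)
        if solution is not None:
            return t, solution
        t += 1
```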
The above algorithm can easily be modified to find optimal RMP solutions. In particular, each time the current partial solution is a complete solution, of value (say) qn − γ w.r.t. the CMP objective function, the above algorithm tests if qn − γ = t. If this is the case, the algorithm stops as the current solution is optimal. In the modified version, we also compute the value of the current solution w.r.t. the RMP objective function. If this value equals t, again we stop. Otherwise, we possibly update the incumbent RMP solution value (initially set to +∞). In any case, the algorithm stops when the target value t (increased at each iteration) is at least the incumbent RMP solution value.
The branch-and-bound algorithm is effective for many instances, providing a prov-
ably optimal solution within short time, especially in the relevant special case of q = 3,
for which the lower bound is reasonably close to the optimal value. In particular, the
processing of each subproblem within the branch-and-bound enumeration is very fast
(for our instances, a few hundred thousand subproblems per second on a PC). For the
remaining instances, the method is good at finding lower bounds on the optimal CMP
(and RMP) value but does not provide good heuristic solutions (even if various heuris-
tics are applied within the enumeration scheme). The following section describes a
heuristic based on a natural LP relaxation that performs well in practice.
4 An LP-Based Heuristic
In [6], we proposed a natural ILP formulation of CMP with one binary variable xe for each edge e ∈ E, equal to 1 if e is in the CMP solution and 0 otherwise, and one binary variable yC for each cycle that the CMP solution may define with Mk, k ∈ Q. For
details, we refer to [6] or to the full paper.
The ILP formulation contains an exponential (in n) number of variables and con-
straints. Nevertheless, due to a fundamental result of Grötschel, Lovász and Schrijver
[9], the associated LP relaxation can be solved in polynomial time provided the sepa-
ration of the constraints and the generation of the y variables can be done efficiently.
It is shown in [6] that this is indeed the case. However, in practice, this LP relaxation
turns out to be very difficult to solve with the present state-of-the-art LP solvers. In the
full paper, we provide experimental results showing that, even for q = 3, the largest
LPs solvable in a few hours correspond to instances with n = 30, that can be solved
within seconds by the combinatorial branch-and-bound algorithm of the previous sec-
tion, whereas LPs for n ≥ 40 may take days or even weeks to be solved. Hence, an
exact algorithm based on the ILP formulation seems (at present) useless in practice. In
this section, we show how to use the LP relaxation within an effective heuristic.
The heuristic starts from the (in most cases, fractional) vectors x produced at each iteration within the iterative solution of the LP relaxation by column generation and separation. For a given x, we apply a nearest neighbor algorithm, which starts from some node i0 ∈ V, selects the edge (i0, j) ∈ E such that x(i0,j) is maximum, considers the node l such that (j, l) ∈ H, selects the edge (l, p) ∈ E such that p ≠ i0 and x(l,p) is maximum, considers the node r such that (p, r) ∈ H, and so on, until
the edges selected form a permutation matching. We then apply to the final solution a
2-exchange procedure, where we check if the removal of two edges in the current solu-
tion and their replacement with the two other edges that yield a permutation matching
yields an improvement in the CMP value. If this is the case, the replacement is per-
formed and the procedure is iterated. Otherwise, i.e., if no 2-exchange in the current
solution improves the CMP value, the procedure terminates. Each time a new solution
is considered, namely the initial solution produced by nearest neighbor and the solution
after each improvement, we compute the corresponding RMP value to possibly update
the best RMP solution so far.
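A compact sketch of the nearest neighbor rounding step is given below; representing the LP vector as a dictionary keyed by unordered node pairs and H as a node-to-mate map are our own choices, and the 2-exchange post-processing is omitted.

```python
def nearest_neighbor_matching(x_frac, H, start):
    """Round a (fractional) LP vector into a permutation matching T: from the
    current node, pick the not-yet-matched partner with the largest x value
    (edges of H themselves are excluded, since E contains no H-edge), then
    continue from the H-mate of the node just matched."""
    nodes = set(H)
    T, used = set(), set()
    i = start
    while len(used) < len(nodes):
        used.add(i)
        candidates = [j for j in nodes if j not in used and j != H[i]]
        j = max(candidates, key=lambda v: x_frac.get(frozenset((i, v)), 0.0))
        T.add(frozenset((i, j)))
        used.add(j)
        i = H[j]  # follow the base matching to the next node to be matched
    return T

# toy example: four nodes, H matches 0-1 and 2-3, the LP strongly suggests edge (0, 2)
H = {0: 1, 1: 0, 2: 3, 3: 2}
x = {frozenset((0, 2)): 0.9, frozenset((0, 3)): 0.1}
print(nearest_neighbor_matching(x, H, 0))  # {frozenset({0, 2}), frozenset({1, 3})}
```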
It is easy to see that the complexity of nearest neighbor is O(n²), plus O(qn) for the evaluation of the value of the CMP (and RMP) solution found. It is also easy to verify that checking if a 2-exchange yields an improvement in the CMP value can be done in O(q) time if one has a data structure that, for each k ∈ Q and i ∈ V, tells which is the cycle in T ∪ Mk that visits node i, where T is the current solution (cycles are simply identified by numbers). This data structure can be reconstructed in O(qn) time every time the current solution is improved. Hence, each iteration of the 2-exchange procedure, that either yields an improvement or terminates the procedure, takes O(qn²) time since O(n²) 2-exchanges are considered.
For every vector x, we try each node i ∈ V as starting node in nearest neighbor. This is because starting from different nodes may yield solutions of considerably different quality. Now the point is how to produce each x within reasonable time, as the solution of each LP within the iterative procedure is quite time consuming, as mentioned above. A natural choice is to work with a sparsified graph, where only a subset of the edges E′ ⊆ E is considered in the LP (the variables associated with the other edges being fixed to 0). To this aim, the algorithm is organized in rounds. In the first round, we let E′ := ∪_{k∈Q} Mk. For the other rounds, we consider the other edges by increasing value of lower bound lbC. In particular, for each edge e ∈ E, we compute lbC(e), the value of lower bound lbC if edge e is fixed in the solution. In the second round we let E′ := ∪_{k∈Q} Mk ∪ {e ∈ E : lbC(e) ≤ lbC}, in the third round E′ := ∪_{k∈Q} Mk ∪ {e ∈ E : lbC(e) ≤ lbC + 1}, and so on. We stress that in the nearest neighbor heuristic and in the 2-exchange procedure the whole set E of edges is considered (the definition of E′ is only meant to speed up the solution of the LPs).
To further drive the heuristic with the LP solutions, every time in a round the value of the current LP is such that qn minus this value is smaller than the best RMP solution value found so far, we fix to 1 the n/10 x variables whose value is highest, imposing the associated edges in the solution. We make sure that the partial solution is contained in a CMP solution, and perform the fixing only if at least 3 iterations (of column generation or separation followed by an LP) were performed since the last fixing. The round terminates when the optimal value of the LP (containing only the edges in E′ and with some edges fixed in the solution) is such that qn minus this value is not smaller than the best RMP solution value found so far.
The overall heuristic terminates either after a time limit or when the round corresponding to E′ = E terminates.
5 Experimental Results
All tables report the values of q and n, the time limit imposed on the branch-and-
bound algorithm and on the LP heuristic for all instances (within the caption), and the
following information:
# inst: the number of instances considered for each value of q and n (average and
maximum values refer to this number of instances);
# opt: the number of instances solved to optimality by the branch-and-bound algo-
rithm within the time limit;
best: the average value of the best RMP solution found (the optimal one if the branch-and-bound algorithm terminates before the time limit, otherwise the best solution produced by the LP heuristic);
lbC: the average value of lower bound lbC;
lbBB: the average value of the lower bound produced by the branch-and-bound algorithm (equal to the optimum if the algorithm terminates before the time limit);
B&B subpr.: the average number of subproblems considered within the branch-and-bound algorithm;
B&B time: the average (maximum) time required by the branch-and-bound algorithm (possibly equal to the time limit);
H: the average value of the best RMP solution produced by the LP heuristic;
H time: the average (maximum) time required to find the best solution by the LP heuristic;
gap: the average (maximum) difference between the best RMP solution found and the lower bound produced by the branch-and-bound algorithm.
2 units larger are typically found in fractions of a second. Our feeling is that the solu-
tions found by the LP heuristic are near optimal even for the cases in which the lower
bound cannot certify this. Table 2 refers to the randomly generated instances mentioned
Table 2. Results on Random Instances from set1 in [17], Time Limit of 10 Minutes.
r q n # inst. # opt. best lbC lbBB B&B subpr. B&B time H H time gap
2 3 20 10 10 14.0 13.5 14.0 364.7 0.0 ( 0.0) 14.2 0.0 ( 0.0) 0.0 (0)
2 3 40 10 10 14.7 14.0 14.7 873.4 0.0 ( 0.0) 14.9 0.0 ( 0.0) 0.0 (0)
2 3 80 10 10 15.1 14.2 15.1 5341.7 0.0 ( 0.0) 15.1 0.1 ( 0.2) 0.0 (0)
2 3 160 10 10 15.1 14.2 15.1 21075.8 0.0 ( 0.1) 15.1 0.7 ( 0.7) 0.0 (0)
2 3 320 10 10 15.0 14.2 15.0 56076.9 0.1 ( 0.2) 15.0 5.2 ( 5.2) 0.0 (0)
4 3 10 10 10 14.5 13.6 14.5 492.9 0.0 ( 0.0) 14.8 0.0 ( 0.0) 0.0 (0)
4 3 20 10 10 27.5 26.5 27.5 31017.6 0.0 ( 0.3) 27.9 0.0 ( 0.1) 0.0 (0)
4 3 40 10 9 47.1 45.0 46.9 39167574.4 63.0 ( 600.0) 47.4 0.3 ( 2.7) 0.2 (2)
4 3 80 10 10 56.5 55.1 56.5 4342064.8 7.0 ( 70.2) 56.8 0.4 ( 1.1) 0.0 (0)
4 3 160 10 10 57.5 56.5 57.5 16382.0 0.0 ( 0.1) 57.6 1.0 ( 1.6) 0.0 (0)
4 3 320 10 10 57.6 56.6 57.6 45550.9 0.1 ( 0.1) 57.6 6.0 ( 10.2) 0.0 (0)
8 3 10 10 10 14.3 13.8 14.3 267.7 0.0 ( 0.0) 14.6 0.0 ( 0.0) 0.0 (0)
8 3 20 10 10 29.5 27.9 29.5 42045.8 0.1 ( 0.2) 30.1 0.1 ( 0.2) 0.0 (0)
8 3 40 10 4 61.6 57.3 60.3 258996275.2 433.1 ( 600.0) 62.1 0.9 ( 6.1) 1.3 (3)
8 3 80 10 0 125.5 112.7 115.0 292400000.0 600.0 ( 600.0) 125.5 9.8 ( 18.3) 10.5 (13)
8 3 160 10 0 190.6 177.5 179.7 272000025.6 600.0 ( 600.0) 190.6 46.2 ( 276.5) 10.9 (24)
8 3 320 10 9 205.0 203.4 204.8 37536243.2 66.7 ( 600.0) 205.0 29.2 ( 73.1) 0.2 (2)
16 3 10 10 10 14.2 13.4 14.2 247.3 0.0 ( 0.0) 14.4 0.0 ( 0.0) 0.0 (0)
16 3 20 10 10 29.5 28.0 29.5 96419.2 0.1 ( 0.7) 29.9 0.0 ( 0.1) 0.0 (0)
16 3 40 10 3 62.0 57.8 60.6 299220787.2 501.5 ( 600.0) 62.5 0.4 ( 0.6) 1.4 (3)
16 3 80 10 0 130.8 116.0 118.3 288000000.0 600.0 ( 600.0) 130.8 10.8 ( 23.8) 12.5 (14)
16 3 160 10 0 276.6 236.2 237.3 204600000.0 600.0 ( 600.0) 276.6 248.8 ( 508.1) 39.3 (42)
16 3 320 10 0 537.1 451.4 451.8 146000000.0 600.0 ( 600.0) 537.1 244.5 ( 600.0) 85.3 (98)
q n # inst. # opt. best lbC lbBB B&B subpr. B&B time H H time gap
3 105 286 286 19.4 18.7 19.4 48697.0 0.1 ( 1.1) 19.4 ( ) 0.0 (0)
4 105 715 651 28.5 25.0 28.3 6257725.6 12.1 ( 60.0) 28.5 0.4 ( 5.1) 0.1 (4)
5 105 1287 1073 36.5 31.2 36.3 9265956.4 24.1 ( 60.0) 36.5 0.8 ( 7.3) 0.2 (4)
bound and the heuristic. Moreover, we applied the LP heuristic only to the instances
that were not solved to optimality by the branch-and-bound.
The behavior for the mitochondrial instances is analogous to the case of random
permutations, namely almost all instances can be solved to proven optimality for q = 3
whereas for q = 4 and 5 the gap is 0 only for very few instances, being (on average)
equal to 9% and 13%, respectively.
On the other hand, chloroplast instances turn out to be relatively easy to solve, the
solution value being much smaller than for random permutations. With very few excep-
tions, all instances are solved within one unit of the optimum in very short time. For this
reason, we tried also to solve instances for higher values of q. For q ∈ {6, . . . , 12} we
solved the 13 instances obtained by taking q consecutive (in a circular sense) genomes
starting from the first, the second, . . . , the thirteenth. For q = 13 we solved the instance
with the whole genome set. Table 5 gives the corresponding results, with a time limit
of one hour, motivated by the fact that the number of instances considered is small and
that, for n = 105, 1 minute or one hour often makes a big difference, which is not the
case for small n (say, n < 50), probably because of the exponential nature of the al-
gorithm. The table shows that, for these genomes, even an instance with q = 13 can be
solved optimally within reasonable time, and that the average gap over all the instances
is about 1 unit.
Acknowledgments
This work was partially supported by MURST and CNR, Italy. Moreover, I would like
to thank Dan Gusfield, David Sankoff and Nick Tran for helpful discussions on the
subject; Mathieu Blanchette, David Bryant, Nadia El-Mabrouk and Stacia Wyman for
having provided me with the real-world instances mentioned in the paper; and David
Bader, Bernard Moret, Tandy Warnow, Stacia Wyman and Mi Yan for having made their
code for SBR publicly available.
References
1. D.A. Bader, B.M.E. Moret, M. Yan, A Linear-Time Algorithm for Computing Inversion
Distance Between Signed Permutations with an Experimental Study, Proceedings of the
Seventh Workshop on Algorithms and Data Structgures (WADS01) (2001), to appear in Lec-
ture Notes in Computer Science; available at http://www.cs.unm.edu/.
2. V. Bafna and P.A. Pevzner, Genome Rearrangements and Sorting by Reversals, SIAM Jour-
nal on Computing 25 (1996) 272289.
3. M. Blanchette, G. Bourque and D. Sankoff, Breakpoint Phylogenies, in S. Miyano and T.
Takagi (eds.), Proceedings of Genome Informatics 1997, (1997) 2534, Universal Academy
Press.
4. M. Blanchette, personal communication.
5. D. Bryant, A Lower Bound for the Breakpoint Phylogeny Problem, to appear in Journal of
Discrete Algorithms (2001).
6. A. Caprara, Formulations and Hardness of Multiple Sorting by Reversals, Proceedings
of the Third Annual International Conference on Computational Molecular Biology (RE-
COMB99) (1999) 8493, ACM Press.
7. A. Caprara, G. Lancia and S.K. Ng, Sorting Permutations by Reversals through Branch-
and-Price, to appear in INFORMS Journal on Computing (2001).
8. M.E. Cosner, R.K. Jansen, B.M.E. Moret, L.A. Rauberson, L.-S. Wang, T. Warnow and S.
Wyman, An Empirical Comparison of Phylogenetic Methods on Chloroplast Gene Order
Data in Campanulaceae, in [19], 99-121.
9. M. Grotschel, L. Lovasz and A. Schrijver, The Ellipsoid Method and its Consequences in
Combinatorial Optimization, Combinatorica 1 (1981), 169197.
10. D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computa-
tional Biology, (1997), Cambridge University Press.
11. S. Hannenhalli, C. Chappey, E.V. Koonin and P.A. Pevzner, Genome Sequence Comparison
and Scenarios for Gene Rearrangements: A Test Case, Genomics 30 (1995) 299311.
12. S. Hannenhalli and P.A. Pevzner, Transforming Cabbage into Turnip (Polynomial Algo-
rithm for Sorting Signed Permutations by Reversals), Journal of the ACM 48 (1999) 127.
13. M. Jünger, G. Reinelt and G. Rinaldi, The traveling salesman problem, in M. Ball, T. Magnanti, C. Monma, G. Nemhauser (eds.), Network Models, Handbooks in Operations Research and Management Science, 7 (1995) 225-330, Elsevier.
14. H. Kaplan, R. Shamir and R.E. Tarjan, Faster and Simpler Algorithm for Sorting Signed Permutations by Reversals, SIAM Journal on Computing 29 (2000) 880-892.
15. J. Kececioglu and D. Sankoff, Efficient Bounds for Oriented Chromosome Inversion Distance, Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 807 (1994) 307-325, Springer Verlag.
16. J. Kececioglu and D. Sankoff, Exact and Approximation Algorithms for Sorting by Reversals, with Application to Genome Rearrangement, Algorithmica 13 (1995) 180-210.
17. B.M.E. Moret, L.-S. Wang, T. Warnow and S.K. Wyman, Highly accurate reconstruction of phylogenies from gene order data, Tech. Report TR-CS-2000-51, Dept. of Computer Science, University of New Mexico (2000), available at http://www.cs.unm.edu/.
18. B.M.E. Moret, S.K. Wyman, D.A. Bader, T. Warnow and M. Yan, A New Implementation and Detailed Study of Breakpoint Analysis, Proceedings of the Sixth Pacific Symposium on Biocomputing (PSB 2001) (2001) 583-594, World Scientific Pub.
19. D. Sankoff and J.H. Nadeau (eds.), Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, (2000) Kluwer Academic Publishers.
20. I. Pe'er and R. Shamir, The Median Problems for Breakpoints are NP-Complete, ECCC Report No. 71 (1998), University of Trier, 1998, available at http://www.eccc.uni-trier.de/.
21. I. Pe'er and R. Shamir, Approximation Algorithms for the Median Problem in the Breakpoint Model, in [19], 225-241.
22. D. Sankoff and M. Blanchette, Multiple Genome Rearrangement and Breakpoint Phylogenies, Journal of Computational Biology 5 (2000) 555-570.
23. D. Sankoff, D. Bryant, M. Denault, B.F. Lang, G. Burger, Early Eukaryote Evolution Based on Mitochondrial Gene Order Breakpoints, to appear in Journal of Computational Biology (2001).
24. D. Sankoff and N. El-Mabrouk, Duplication, Rearrangement and Reconciliation, in [19], 537-550.
25. D. Sankoff, G. Sundaram and J. Kececioglu, Steiner Points in the Space of Genome Rearrangements, International Journal of Foundations of Computer Science 7 (1996) 1-9.
26. J. Setubal and J. Meidanis, Introduction to Computational Molecular Biology, (1997), PWS Publishing.
Algorithms for Finding Gene Clusters
1 Introduction
The conservation of gene order has been extensively studied so far [25,19,16,12]. There
is strong evidence that genes clustering together in phylogenetically distant
genomes frequently encode functionally associated proteins [23,4,24] or indicate recent horizontal gene transfer [11,5]. Due to the increasing number of completely sequenced genomes, the comparison of gene orders to find conserved gene clusters is becoming a standard approach for protein function prediction [20,17,22,6].
In this paper we describe efficient algorithms for finding gene clusters for various
types of genomic data. We represent gene orders by permutations (re-orderings) of inte-
gers. Hence gene clusters correspond to intervals (contiguous subsets) in permutations,
and the problem of finding conserved gene clusters in different genomes translates to
the problem of finding common intervals in multiple permutations.
In addition to this bioinformatic application, common intervals also relate to the
consecutive arrangement problem [2,7,8] and to cross-over operators for genetic al-
gorithms solving sequencing problems such as the traveling salesman problem or the
single machine scheduling problem [3,15,18].
Recently, Uno and Yagiura [26] presented an optimal O(n + K) time and O(n) space algorithm for finding all K ≤ (n choose 2) common intervals of two permutations π1 and π2 of n elements. We generalized this algorithm to a family Π = (π1, . . . , πk) of k ≥ 2 permutations in optimal O(kn + K) time and O(n) space [10] by restricting the
C = {[1, 2], [1, 3], [1, 9], [2, 3], [4, 5], [4, 6], [4, 8], [5, 6], [5, 8], [6, 8], [7, 8]}.
In order to keep this paper self-contained, in the remainder of this section we recall the
algorithms of Uno and Yagiura [26] and of Heber and Stoye [10] that find all common
intervals of 2 (respectively k ≥ 2) permutations. We will restrict our description to
basic ideas and only give details where they are necessary for an understanding of the
new algorithms described in Sections 3-6 of this paper.
Since f(x, y) counts the number of elements in [l(x, y), u(x, y)] \ π2([x, y]), an interval π2([x, y]) is a common interval of Π if and only if f(x, y) = 0. A simple algorithm to find CΠ is to test for each pair of indices (x, y) with 1 ≤ x < y ≤ n if f(x, y) = 0, yielding a naive O(n³) time or, using running minima and maxima, a slightly more involved O(n²) time algorithm.
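A direct transcription of this quadratic approach (not of the RC algorithm discussed next) might look as follows; indices are 0-based and the second permutation is first re-expressed in the coordinates of the first, which is an implementation choice of ours.

```python
def common_intervals(p1, p2):
    """Find all common intervals of two permutations of the same n elements:
    after relabelling so that p1 becomes the identity, an index interval
    [x, y] of p2 is a common interval iff max - min == y - x, which is
    tested for every pair (x, y) using running minima and maxima."""
    pos_in_p1 = {v: i for i, v in enumerate(p1)}   # position of each value in p1
    q = [pos_in_p1[v] for v in p2]                 # p2 expressed in p1-coordinates
    n, intervals = len(q), []
    for x in range(n):
        lo = hi = q[x]
        for y in range(x + 1, n):
            lo, hi = min(lo, q[y]), max(hi, q[y])
            if hi - lo == y - x:                   # the values form a contiguous block
                intervals.append((x, y))
    return intervals

print(common_intervals([1, 2, 3, 4, 5], [3, 1, 2, 5, 4]))  # [(0, 2), (0, 4), (1, 2), (3, 4)]
```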
In order to save the time to test f (x, y) = 0 for some pairs (x, y), Uno and Yagiura
[26] introduce the notion of wasteful candidates for y.
Definition 1. For a fixed x, a right interval end y > x is called wasteful if it satisfies f(x′, y) > 0 for all x′ ≤ x.
Based on this notion, Uno and Yagiura give an algorithm called RC (short for Reduce
Candidate) that has as its essential part a data structure Y consisting of a doubly-linked
list ylist for the indices of non-wasteful right interval end candidates and, storing inter-
vals of ylist, two further doubly-linked lists llist and ulist that implement the functions
l and u in order to compute f efficiently. An outline of Algorithm RC is shown in Algo-
rithm 1 where L.succ(e) denotes the successor of element e in a doubly linked list L.
After initializing the lists of Y , a counter x (corresponding to the currently investigated
left interval end) runs from n − 1 down to 1. In each iteration step, during the update of Y, ylist is trimmed such that afterwards the function f(x, y) is monotonically increasing for the elements y remaining in ylist. In lines 5-7, this allows us to efficiently find
all common intervals with left end x by evaluating f (x, y) running left-to-right through
the elements y > x of ylist until an index y is encountered with f (x, y) > 0 when the
reporting procedure stops.
For details of the data structure Y and the update procedure in line 3, see [26,10].
The analysis shows that the update of data structure Y in line 3 can be performed in
amortized O(1) time, such that the complete algorithm takes O(n + K) time to find the
K common intervals of π1 and π2.
We cite the following two results from [10] (without proofs) which indicate the great
value of the concept of irreducible intervals.
Lemma 1. Given a family Π = (π1, . . . , πk) of permutations of N = {1, 2, . . . , n}, the set of irreducible intervals IΠ allows us to reconstruct the set of all common intervals CΠ in optimal O(|CΠ|) time.
Now we can describe the algorithm from [10] that finds all K common intervals of a
family of k ≥ 2 permutations of N in O(kn + K) time.
For 1 ≤ i ≤ k, set Πi := (π1, . . . , πi). Starting with I1 = {[j, j + 1] | 1 ≤ j ≤ n − 1}, the algorithm successively computes Ii from Ii−1 for i = 2, . . . , k (see Algorithm 2). The algorithm employs a mapping φi : Ii−1 → Ii that maps each element c ∈ Ii−1 to the smallest common interval in Ci that contains c. It is shown in [10] that this mapping exists and is surjective, i.e., φi(Ii−1) := {φi(c) | c ∈ Ii−1} = Ii. Furthermore, it is shown that φi(Ii−1) can be effi-
In this section we consider the problem of finding all common intervals in a family of
signed permutations. It is common practice when considering genome rearrangement
problems, to denote the direction of a gene in the genome by a plus (+) or minus (−) sign depending on the DNA strand it is located on [21]. In the context of sorting signed
permutations by reversals [1,9,13,14], the sign of a gene tells the direction of the gene in
the final (sorted) permutation and changes with each reversal. In our context, it has been
observed that for prokaryotes, functionally coupled genes, e.g. in operons, virtually al-
ways lie on the same DNA strand [20,12]. Hence, when given signed permutations, we
require that the sign does not change within an interval. Between the different permuta-
tions, the sign of the intervals might vary, though. This restricts the (original) set of all
common intervals to the biologically more meaningful candidates.
Example 3. Let N = {1, . . . , 6} and Π = (π1, π2, π3) with π1 = (+1, +2, +3, +4, +5, +6), π2 = (−3, −1, −2, +5, +4, +6), and π3 = (−4, +5, +6, −2, −3, −1). With respect to π1, the interval [1, 3] is a common interval, but [4, 5] and [4, 6] are not.
Obviously, the number of common intervals in signed permutations can be considerably
smaller than the number of common intervals in unsigned permutations. Hence, apply-
ing Algorithm 2 followed by a filtering step will not yield our desired time-optimal
result.
However, the problem can be solved easily by applying the algorithm for multichro-
mosomal permutations from the previous section. Since a common interval in signed
permutations can never contain two genes with different sign, we break the signed per-
mutations into pieces (chromosomes) wherever the sign changes. This is clearly pos-
sible in linear time. Then we apply the algorithm from the previous section to the ob-
tained family of multichromosomal permutations, the result being exactly the common
intervals of the original signed permutations. Hence, we have the following
Theorem 2. Given k signed permutations of N = {1, . . . , n}, all K common intervals
can be found in optimal O(kn + K) time using O(n) additional space.
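A minimal sketch of the preprocessing step described above, i.e. breaking a signed permutation into unsigned pieces at every sign change, could look as follows; the list-of-lists output format is our own choice.

```python
def split_at_sign_changes(signed_perm):
    """Break a signed permutation into 'chromosomes': maximal runs of elements
    lying on the same strand (same sign). Signs are then dropped, since inside
    a run they carry no further information."""
    chromosomes = []
    current = [abs(signed_perm[0])]
    for prev, cur in zip(signed_perm, signed_perm[1:]):
        if (prev > 0) == (cur > 0):   # same sign: extend the current run
            current.append(abs(cur))
        else:                         # sign change: start a new chromosome
            chromosomes.append(current)
            current = [abs(cur)]
    chromosomes.append(current)
    return chromosomes

# pi_3 from Example 3: (-4, +5, +6, -2, -3, -1)
print(split_at_sign_changes([-4, 5, 6, -2, -3, -1]))  # [[4], [5, 6], [2, 3, 1]]
```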
placed by a variant, denoted φ′i, that works on circular permutations and only generates irreducible intervals of size ≤ n/2. This function is implemented by multiple calls to the original function φi. The two circular permutations π1 and πi are therefore linearized in two different ways each, namely by once cutting them between positions n and 1, and once cutting between positions n/2 and n/2 + 1. Then φi is applied to each of the four resulting pairs of linearized permutations. For convenience, the output of common intervals of length > n/2 is suppressed. Finally, the resulting intervals of the four runs of φi are merged, sorted according to their start and end positions using counting sort, and duplicates are removed. Clearly, φ′i generates all irreducible intervals of size ≤ n/2 in O(n) time. Hence, we have the following
7 Conclusion
In this paper we have presented time and space optimal algorithms for variants of the
common intervals problem for k permutations. The variants we considered, multichro-
mosomal permutations, signed permutations, circular permutations, and their combina-
tions, were motivated by the requirements imposed by real data we were confronted
with in our experiments. While in preliminary testing we have applied our algorithms
to bacterial genomes, it is obvious that in a realistic setting, one should further relax
the problem definition. In particular, one should allow for missing or additional genes
in a common interval while imposing a penalty whenever this occurs. Such relaxations
seem to make the problem much harder, though.
Acknowledgments
We thank Richard Desper and Zufar Mulyukov for carefully reading this manuscript as
well as Pavel Pevzner for a helpful discussion on the ideas contained in the manuscript.
References
1. V. Bafna and P. Pevzner. Genome rearrangements and sorting by reversals. SIAM J. Computing, 25(2):272-289, 1996.
2. K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs and graph planarity using PQ-tree algorithms. J. Comput. Syst. Sci., 13(3):335-379, 1976.
3. R. M. Brady. Optimization strategies gleaned from biological evolution. Nature, 317:804-806, 1985.
4. T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem. Sci., 23(9):324-328, 1998.
5. J. A. Eisen. Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr. Opin. Genet. Dev., 10(6):606-611, 2000.
6. W. Fujibuchi, H. Ogata, H. Matsuda, and M. Kanehisa. Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping. Nucleic Acids Res., 28(20):4029-4036, 2000.
7. D. Fulkerson and O. Gross. Incidence matrices with the consecutive 1's property. Bull. Am. Math. Soc., 70:681-684, 1964.
8. M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York, 1980.
9. S. Hannenhalli and P. A. Pevzner. Transforming cabbage into turnip: Polynomial algorithm for sorting signed permutations by reversals. J. ACM, 46(1):1-27, 1999.
10. S. Heber and J. Stoye. Finding all common intervals of k permutations. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, CPM 2001, volume 2089 of Lecture Notes in Computer Science, pages 207-219. Springer Verlag, 2001. To appear.
11. M. A. Huynen and P. Bork. Measuring genome evolution. Proc. Natl. Acad. Sci. USA, 95(11):5849-5856, 1998.
12. M. A. Huynen, B. Snel, and P. Bork. Inversions and the dynamics of eukaryotic gene order. Trends Genet., 17(6):304-306, 2001.
13. H. Kaplan, R. Shamir, and R. E. Tarjan. A faster and simpler algorithm for sorting signed permutations by reversals. SIAM J. Computing, 29(3):880-892, 1999.
14. J. D. Kececioglu and D. Sankoff. Efficient bounds for oriented chromosome inversion distance. In M. Crochemore and D. Gusfield, editors, Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, CPM 94, volume 807 of Lecture Notes in Computer Science, pages 307-325. Springer Verlag, 1994.
15. S. Kobayashi, I. Ono, and M. Yamamura. An efficient genetic algorithm for job shop scheduling problems. In Proc. of the 6th International Conference on Genetic Algorithms, pages 506-511. Morgan Kaufmann, 1995.
16. W. C. III Lathe, B. Snel, and P. Bork. Gene context conservation of a higher order than operons. Trends Biochem. Sci., 25(10):474-479, 2000.
17. E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg. Detecting protein function and protein-protein interactions from genome sequences. Science, 285:751-753, 1999.
18. H. Mühlenbein, M. Gorges-Schleuter, and O. Krämer. Evolution algorithms in combinatorial optimization. Parallel Comput., 7:65-85, 1988.
19. A. R. Mushegian and E. V. Koonin. Gene order is not conserved in bacterial evolution. Trends Genet., 12(8):289-290, 1996.
20. R. Overbeek, M. Fonstein, M. D'Souza, G. D. Pusch, and N. Maltsev. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA, 96(6):2896-2901, 1999.
21. P. A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge, MA, 2000.
22. B. Snel, G. Lehmann, P. Bork, and M. A. Huynen. STRING: A web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res., 28(18):3443-3444, 2000.
23. J. Tamames, G. Casari, C. Ouzounis, and A. Valencia. Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol., 44(1):66-73, 1997.
24. J. Tamames, M. Gonzalez-Moreno, J. Mingorance, A. Valencia, and M. Vicente. Bringing gene order into bacterial shape. Trends Genet., 17(3):124-126, 2001.
25. R. L. Tatusov, A. R. Mushegian, P. Bork, N. P. Brown, W. S. Hayes, M. Borodovsky, K. E. Rudd, and E. V. Koonin. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr. Biol., 6:279-291, 1996.
26. T. Uno and M. Yagiura. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica, 26(2):290-309, 2000.
Determination of Binding Amino Acids Based
on Random Peptide Array Screening Data
Peter J. van der Veen1 , L.F.A. Wessels1,2 , J.W. Slootstra3 , R.H. Meloen3 ,
M.J.T. Reinders2 , and J. Hellendoorn1
1
Faculty of Information Technology and Systems
Control Engineering Laboratory
Delft University of Technology
Mekelweg 4, P.O. Box 5031, 2600 GA Delft, The Netherlands
{P.J.vanderVeen,L.Wessels,J.Hellendoorn}@ITS.TUDelft.NL
2
Faculty of Information Technology and Systems
Information and Communication Theory Group
Delft University of Technology
Mekelweg 4, P.O. Box 5031, 2600 GA Delft, The Netherlands
[email protected]
3
Pepscan Systems BV, Lelystad, The Netherlands
1 Introduction
Synthetic peptides can be designed to bind against the same antibody as com-
plete proteins do [5]. For this, the peptide should mimic the area of the protein
which is recognized by the antibody. Although proteins normally consist of a
large number of amino acids, the contact area between an antibody and a pro-
tein generally consists of 15-22 amino acids on each side [6]. Several experiments
suggest that only three to five of these amino acids are responsible for the ma-
jor binding contribution between a protein and an antibody [4,9]. These amino
acids define the so-called energetic epitope [6]. In this paper, we focus on the
determination of those amino acids that constitute this energetic epitope.
The most obvious way to construct a synthetic peptide is by determining the
amino acids comprising the epitope of the protein. However, these amino acids
can be hard to identify and if they are identified, the usage of these amino acids will not always yield a well binding peptide. This can for example be due to a conformational difference between the amino acids at the protein surface and the amino acids in the peptide. When no well binding peptide can be constructed from a part of the original protein, a random search is generally performed to find a well binding peptide. Such a search can e.g. be done by a phage display
analysis [1,7,10] or a peptide screening analysis [3]. In a peptide screening analy-
sis, thousands of randomly generated peptides can be measured against the same
antibody. This results in a relative binding strength for every measured peptide.
In general, the best binding peptide found during such an analysis is, however,
far worse than required. Therefore, several lead variants are usually screened to
improve the performance of the best binding peptides.
Although thousands of different measurements are performed during a single
run of a peptide screening analysis, the number of tested peptides is relatively
small compared to the number of possible different peptides. The peptides under study consist of 12 successive amino acids, implying that one can generate 20¹² unique peptides. In principle, one should test all these possibilities to find the best binding peptide. Naturally, this is not feasible and hence a feasible search
procedure is required.
One logical method to search for such a well binding peptide is by selecting
several of the best known binding peptides and testing a large number of mutations
of these peptides. But, again due to the large search space, only a limited number
of these mutations can be tested during one experiment. Still the question re-
mains which (combination of) amino acids of these peptides should be replaced.
In this paper we propose an algorithm that is meant to help in this decision by
identifying those amino acids which have a higher chance of being important for
binding. Effectively this reduces the search space and makes it possible to spend more effort in mutating the other amino acids of these peptides.
The method generates a rule containing a combination of amino acids which
is over-represented in the best binding and underrepresented in the worst binding
peptides. Because the amino acids in the rule are over-represented amongst the
best binding peptides, these amino acids will most probably improve the binding
strength and are most probably contained in the energetic epitope.
2 Data
The data used in this paper is obtained from four distinct peptide screening
experiments, where one experiment has been performed for every monoclonal
antibody described. In each experiment the binding strength of either 3640 or
4550 randomly generated peptides is measured against an antibody. Each of
these peptides consists of 12 amino acids, where each amino acid has an equal
3 Method
The actual binding process between a peptide and an antibody is very complex
and depends on the interplay between several amino acids. In this paper, we do
not consider the full complexity of the problem, rather, our approach is based on
the principle that some amino acids are typically needed in a peptide to ensure
a high affinity binding against an antibody.
To find these amino acids, we need a representation that does not include the positional information of the peptide's amino acids. To this end, each peptide is represented by a 20-dimensional feature vector. Each feature represents the occurrence of a specific amino acid in the peptide (only the 20 naturally occurring L-amino acids are used to generate the peptides). Thus, if an amino acid is included in the peptide (even if it occurs more than once) the corresponding feature is set to 1, otherwise to 0. An example of this representation is given in Table 1. Our aim is to design an algorithm that constructs, for a given
  Peptide          Feature
                   A C D E F G H I K L ...
1 QGWFFMQINTQY     0 0 0 0 1 1 0 1 0 0
2 LMWNPNIKTCER     0 1 0 1 0 0 0 1 1 1
3 CDSADVSGDHLI     1 1 1 0 0 1 1 1 0 1
4 MHGVIAQGQQDV     1 0 1 0 0 1 1 1 0 0
5 FHNQLYYSPDYV     0 0 1 0 1 0 1 0 0 1
6 HEEWFQLFYYMQ     0 0 0 1 1 0 1 0 0 1
antibody, based on the measured data, a logical rule which describes the amino
acids that are needed in a well binding peptide. This rule maps each of the
measured peptides to one of two classes: 1) well binding peptides that con-
tain the necessary amino acids and thus satisfy the rule and 2) bad binding
peptides which do not contain the required amino acids and do not satisfy the
rule. Since the resulting rule will be used as a decision support in the analysis
of the measurements, it is imperative that the generated rule is interpretable.
Therefore, we chose to construct rules using logical expressions of amino acids
(allowing only AND and OR constructions). An example is given in equation
(1).
Each X in equation (1) can be replaced by any of the 20 different amino acids.
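To make the rule format concrete, the sketch below encodes a peptide by the set of amino acids it contains and evaluates a rule written as an AND of OR-clauses; the data structures and helper names are our own, and the example rule is the one reported later in Equation (9).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural L-amino acids

def peptide_features(peptide):
    """20-dimensional 0/1 vector: 1 if the amino acid occurs in the peptide
    (position and multiplicity are deliberately ignored)."""
    present = set(peptide)
    return [1 if aa in present else 0 for aa in AMINO_ACIDS]

def satisfies(peptide, rule):
    """Evaluate a rule given as a list of OR-clauses that are ANDed together,
    e.g. [['A'], ['E', 'N'], ['I', 'V']] means A AND (E OR N) AND (I OR V)."""
    present = set(peptide)
    return all(any(aa in present for aa in clause) for clause in rule)

print(peptide_features("QGWFFMQINTQY")[:10])  # features A..L of row 1 in Table 1
print(satisfies("SRLPPNSDVVLG", [["A"], ["E", "N"], ["I", "V"]]))  # False: no A present
```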
A classification result of a possible rule is shown in Figure 1, where for every peptide a dot is drawn, indicating whether the peptide is classified as well binding (a dot at 100) or as bad binding (a dot at zero). On the left side of the figure, the number of well binding peptides is larger than the number
of bad binding ones. Clearly, we now have a situation where the rule associates
[Figure 1 plot: percentage of well binding peptides against peptide no.]
Fig. 1. Classification of Several Peptides. A dot at 100 means that the corresponding peptide is classified as well binding. The continuous line shows the estimated well binding distribution.
every peptide with a binary value (well binding/bad binding) while the data
set associates every peptide with a continuous valued binding strength. Since
the binding strength gradually transitions from the largest to the smallest value
it is undesirable to define a crisp boundary between the two categories.
Therefore, to evaluate the rule, we need an additional requirement. This re-
quirement is built into the algorithm as follows: It is assumed that the measured
binding strength of a given peptide is proportional to the similarity between
this peptide and the best n binding peptides. According to this assumption, we
may interpret the measured binding strength as a measure of the probability
density of the well binding peptides (over the sorted peptides). By requiring
that the distribution of the well binding peptides equals the binding energy
curve, the classification rule can be optimized. Clearly this will have the desired result that peptides having a high binding energy will be more often classified
as well binding and vice versa.
That leaves us with the question how to find the distribution of the well binding peptides given the (binary) output of the classification rule, as in Figure 1. Here we convolve the output of the classification rule with a sliding
window that calculates the percentage of well binding peptides in that window (resembling a Parzen estimation). Figure 1 shows such a resulting distribution when using an averaging window 25 peptides wide. At the border we clip the averaging window, such that the filtered line has the same size as the number of measured peptides. The average value for the first peptide is calculated using only half of the window size (13 peptides). The next one is calculated using 14 peptides, etc. Clearly, the number of peptides used in this figure is smaller than the number of peptides used in the experiment. On the real data sets a window size of 100 was employed.
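The clipped sliding-window estimate can be written down directly; this sketch assumes that the peptides are already sorted by decreasing measured binding strength and that the rule output is a 0/1 list.

```python
def estimate_distribution(labels, window=25):
    """Percentage of peptides classified as well binding within a window
    centred on each peptide; the window is clipped at the borders, so with
    window = 25 the first value averages 13 peptides, the next 14, and so
    on, and the output has the same length as the input."""
    n, half = len(labels), window // 2
    return [100.0 * sum(labels[max(0, i - half):min(n, i + half + 1)])
            / (min(n, i + half + 1) - max(0, i - half))
            for i in range(n)]
```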
The estimated distribution can now be compared with the measured binding
energy. Figure 2 shows an example of such a comparison. In this figure, the
Fig. 2. Comparison between the filtered measured binding energy and the estimated well binding distribution. The line separating Area 1 and Area 2 is automatically placed at the knee of the measured binding curve. The position of the knee is derived from the intersection point of two straight lines fitted to the binding curve.
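One simple way to place such a knee automatically, in the spirit of the caption of Figure 2, is to fit a straight line to each of the two parts of the binding curve for every candidate split and keep the split with the smallest total squared error; this sketch (using NumPy, an assumption of ours) returns the split index rather than the exact intersection point of the two fitted lines.

```python
import numpy as np

def find_knee(binding_curve):
    """Try every split point, fit one least-squares line to the left part and
    one to the right part, and return the split with minimal total error."""
    y = np.asarray(binding_curve, dtype=float)
    x = np.arange(len(y), dtype=float)
    best_split, best_err = None, np.inf
    for s in range(2, len(y) - 2):
        err = 0.0
        for xs, ys in ((x[:s], y[:s]), (x[s:], y[s:])):
            coef = np.polyfit(xs, ys, 1)
            err += float(np.sum((np.polyval(coef, xs) - ys) ** 2))
        if err < best_err:
            best_split, best_err = s, err
    return best_split
```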
4 Results
The proposed algorithm is applied to four different data sets. The results are discussed in this section. Except when stated otherwise, the distributions of well binding peptides have been derived with an averaging window of width 100. The percentage shown on the vertical axes of the figures represents the value of the distribution and can be interpreted as the percentage of peptides in the vicinity of a given peptide which are classified as well binding.
mAb A¹. Figure 3 shows the result for monoclonal antibody A. The dashed
[Figure 3 plot: percentage of well binding peptides against peptide no.]
Fig. 3. Results of Monoclonal Antibody A. The dashed line is the rescaled ver-
sion of the measured relative binding strength. The solid line gives the estimated
distribution of well binding peptides when applying the rule in Equation (3).
line in the figure is the rescaled measured binding curve (the values are rescaled to fit in the same window as the results). The solid line is the distribution of
well binding peptides when using the rule in Equation (3).
The figure shows that both curves resemble each other quite well, which gives
a good indication that it is possible to classify the data in this way. On the
left side of the gure, the estimated distribution starts at approximately 70%
implying that the algorithm could not identify a pattern in some of the best
binding peptides.
For this antibody it is known that the motif FT improves the relative binding
strength considerably. This kind of information is not known for the amino acids
G or V, but a large number of the well binding peptides containing F and T also
contain a G or a V. Including these elements in the rule makes it possible to reduce
the number of false positives. This can be understood as follows: increasing
¹ mAb A and mAb B are fictitious names.
the size of the rule without increasing the number of misclassified well binding peptides reduces the number of false positives. The larger the number of amino acids in a peptide that need to fulfill the rule, the smaller the number of peptides that will be classified as well binding.
The genetic algorithm is also applied to this data. According to our implementation of this algorithm, the number of OR terms has to be predefined. The best results found using two, three and four terms are shown in Equations (4)-(6).
The result of (6) has a lower cost than the one proposed by the greedy algorithm, but it does not deviate from the greedy result in stating that F and T are the two most important amino acids. So, the information gained from the greedy algorithm is comparable to the result of the genetic algorithm.
The greedy algorithm and the genetic algorithm also performed comparably on the other data sets. We prefer the greedy algorithm, as it is much faster than the genetic algorithm. Therefore, only results obtained with
the greedy algorithm will be presented in the rest of this paper.
Using the rule generated by the greedy algorithm (3), peptide no. 3 in Table 2 is classified as a bad binding peptide. This is partly due to the lack of the amino acid F in the peptide. In this case the quite similar amino acid Y probably makes the same contribution to the binding as F normally does. This substitution effect is obviously not included in the algorithm.
Equation (3) is created using all the available peptides. To check the stability
of the algorithm the results are also calculated after repeatedly removing 10%
of the measured peptides from the training set. The resulting 10 rules are shown
in equation (7):
mAb 32F81. The number of well binding peptides for monoclonal antibody
32F81 is much larger than for mAb A, which can be seen in Figure 4. Again the
[Figure 4 plot: percentage of well binding peptides against peptide no.]
Fig. 4. Results of Monoclonal Antibody 32F81. The dashed line is the rescaled version of the measured relative binding strength. The solid line gives the percentage of elements which fulfill Equation (8).
binding energy shows a rapid decay. Some of the peptides which occur in this
area show an exceptionally low binding strength in several different peptide
screening experiments. The reason for this exceptionally bad binding behavior
of these peptides is not found by the algorithm.
The best five binding peptides of monoclonal antibody 32F81 are shown in Table 3. All peptides shown fulfill the rule and are classified as well binding peptides. For this antibody it should be noted that the result using only the amino acid K is already quite discriminative: 96% of the 400 best binding peptides contain this amino acid. The comparison between the distribution found by the algorithm and the distribution using a rule containing only K is shown in Figure 5.
[Figure 5 plot: percentage of well binding peptides against peptide no., comparing the rule K with the rule (E OR T) AND K AND (D OR F OR H).]
Fig. 5. Comparison of the resulting rule of the algorithm (8) and the rule: K, for
monoclonal antibody 32F81.
The result using the rule K has almost a 100% accuracy for the best binding
peptides. However the number of false positives is also quite high. The cost
function we employ prefers the answer with a lower number of false positives.
[Figure 6 plot: percentage of well binding peptides against peptide no.]
Fig. 6. Part of the Result for Monoclonal Antibody B. The dashed line is the
rescaled version of the measured relative binding strength. The solid line gives
the percentage of elements which fulfill the rule in Equation (9) (calculated using
an averaging window of 16).
This makes it necessary to reduce the size of the averaging window when esti-
mating the distribution. If a window size of 100 were used here, the maximal
value of the estimated distribution would be 15%, effectively removing the peak
of the distribution. To avoid this, the averaging window has been reduced to a
length of 16. A disadvantage of such a short window is the larger fluctuation of
the predicted result (as can be seen in the tail of the curve in Figure 6).
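As an illustration of how such a windowed estimate can be obtained, the sketch below (an assumed setup, not the authors' code) averages the binary rule outcome over a window of peptides that have been sorted from best to worst measured binder.

```python
import numpy as np

def windowed_fraction(satisfies: np.ndarray, window: int = 16) -> np.ndarray:
    """satisfies: 0/1 array over peptides sorted from best to worst binder.
    Returns, for each position, the percentage of peptides in the centred
    window that fulfill the rule."""
    kernel = np.ones(window) / window
    return 100.0 * np.convolve(satisfies, kernel, mode="same")

# A window of 16 preserves a narrow peak of well binding peptides that a
# window of 100 would flatten, at the cost of a noisier tail.
```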
The rule used to generate Figure 6 is given in Equation (9). A known
well binding motif for this antibody is TEDSAVE. The amino acids A, V and E of
this motif are included in the rule found by the algorithm.
Well binding A AND (E OR N) AND (I OR V) (9)
The five best binding peptides for this antibody are given in Table 4.
mAb 6A.A6. The last result is for monoclonal antibody 6A.A6 (Figure 7).

Fig. 7. Results for Monoclonal Antibody 6A.A6. The dashed line is the rescaled
version of the measured relative binding strength. The solid line gives the per-
centage of elements which fulfill the rule in Equation (10).

The distribution of well binding peptides is estimated quite well for the
best binding peptides. However, for the bad binding peptides, the distribution
does not fit the actual measured binding energy well, resulting in a large number
of false positives. An interesting element of rule (10) is the OR part (D OR E).
These two amino acids are very similar (see e.g. [2]) and can often be exchanged.
Well binding S AND (D OR E) (10)
The best known binding peptide against this antibody has the following se-
quence: SRLPPNSDVVLG. The amino acids S and D are included in the peptide,
thus the best known peptide will be classified as a well binding peptide. Re-
placement studies of this peptide show that the motif PNSD is very important
for a well binding result [8]. This pattern is much longer than the result found
by the algorithm. This might be caused by the lack of very well binding pep-
tides that have this motif in the data set. It should be noted that the binding
strength of even the best measured peptides of the random data set is much
smaller than this best known result. The five best binding peptides are shown
in Table 5. The known important motif PNSD is not present among these
peptides.
5 Discussion
We have introduced an algorithm that identifies those amino acids in a peptide
which are most probably needed for binding to a monoclonal antibody. The
relevant amino acids are described by a rule that distinguishes between well
binding and bad binding peptides. The algorithm automatically generates
such a rule description from peptide screening experimental data.
This is based on the fact that if a certain motif or combination of amino acids
improves the binding strength between a peptide and a monoclonal antibody,
most of the peptides containing this motif will bind better than average against
this antibody. This results in an over-representation of these amino acids in the
better binding peptides. Such an over-representation of amino acids is utilized
by the proposed algorithm to generate a rule of amino acids that distinguishes
between well binding and bad binding peptides.
A greedy optimization algorithm is employed, which means that it does not
necessarily find the rule with the minimal cost. Other optimization procedures,
such as genetic algorithms, may perform better at finding the absolute opti-
mum.
In our experience, solutions that improved on the greedy approach were
only marginally better in terms of the cost function, and were char-
acterized by longer OR terms (like the term (A OR D OR K OR
R OR S) in the last rule in Equation (7)).
It is important, however, to keep in mind that a large number of peptides
will fulfill these long OR terms, i.e., such rules are non-specific. Since these rules
are employed to generate new peptides for subsequent analysis, we are more
interested in rules with short OR terms, which reduce the search space by their
specificity.
We could adapt the cost function to achieve this by adding a penalty term
for long rules. However, if we look more carefully at the greedy optimization,
we notice that it starts by including the most discriminating amino acids in the
rule. This results in a reasonable performance after only a few iteration steps.
Additional OR terms are only included when this improves the performance of
the rule (in fact, the performance should improve when one, or at most two,
additional amino acids are included in the OR term). This greedy property of the
algorithm reduces the chance that such unimportant long terms are included in
the final rule.
The rules as such cannot be directly employed to generate new well binding
motifs, since positional information and frequency of occurrence are lost in the
feature representation. However, within the group of well binding peptides that
satisfy the rule, the amino acids specified in the rule occur at specific positions
in these peptides. A logical approach is therefore to construct new candidate peptides
by preserving, for the N best binding peptides, those amino acids that appear in the
rule, while randomly mutating the rest.
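A minimal sketch of this construction, under the assumption that rule amino acids are simply kept wherever they occur in a well binding peptide and all other positions are replaced uniformly at random (the paper does not specify the mutation scheme further):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate_outside_rule(peptide: str, rule_aas: set[str]) -> str:
    """Keep positions whose amino acid appears in the rule; replace the
    remaining positions by random amino acids."""
    return "".join(c if c in rule_aas else random.choice(AMINO_ACIDS)
                   for c in peptide)

# Generate a few candidates from one well binding peptide:
candidates = [mutate_outside_rule("SRLPPNSDVVLG", {"S", "D", "E"})
              for _ in range(5)]
```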
Acknowledgments
This work was funded by the DIOC-5 program of the Delft Inter-Faculty Re-
search Center at the TU Delft and by Pepscan Systems BV located in Lelystad.
References
1. D.A. Daniels and D.P. Lane. Phage peptide libraries. Methods: A Companion to
Methods in Enzymology, 9:494–507, 1996.
2. M.O. Dayhoff, R.M. Schwartz, and B.C. Orcutt. A model of evolutionary change
in proteins. Atlas of Protein Sequence and Structure, 5, supplement 3:345–352,
1978.
3. H.M. Geysen, R.H. Meloen, and S.J. Barteling. Use of peptide synthesis to probe
viral antigens for epitopes to a resolution of a single amino acid. Proc Natl Acad
Sci USA, 81:3998–4002, 1984.
4. L. Jin and J.A. Wells. Dissecting the energetics of an antibody-antigen interface
by alanine shaving and molecular grafting. Protein Science, 3:2351–2357, 1994.
5. R.H. Meloen, J.I. Casal, K. Dalsgaard, and J.P.M. Langeveld. Synthetic peptide
vaccines: Success at last. Vaccine, 13(10):885–886, 1995.
6. M.H.V. Van Regenmortel and S. Muller. Synthetic peptides as antigens. In S. Pil-
lai and P.C. Van der Vliet, editors, Laboratory Techniques in Biochemistry and
Molecular Biology, Vol. 28. Elsevier, Amsterdam, The Netherlands, 1999.
7. J.K. Scott. Discovering peptide ligands using epitope libraries. Trends in Biotech-
nology, 17:241–245, 1992.
8. J.W. Slootstra, W.C. Puijk, G.J. Ligtvoet, J.P.M. Langeveld, and R.H. Meloen.
Structural aspects of antibody-antigen interaction revealed through small random
peptide libraries. Molecular Diversity, 1:87–96, 1995.
9. E. Trifilieff, M.C. Dubs, and M.H.V. Van Regenmortel. Antigenic cross-reactivity
potential of synthetic peptides immobilized on polyethylene rods. Molecular Im-
munology, 28(8):889–896, 1991.
10. Y.L. Yip and R.L. Ward. Epitope discovery using monoclonal antibodies and
phage peptide libraries. Combinatorial Chemistry & High Throughput Screening,
2:125–138, 1999.
A Simple Hyper-Geometric Approach for
Discovering Putative Transcription Factor Binding Sites
1 Introduction
A central issue in molecular biology is understanding the regulatory mechanisms that
control gene expression. The recent flood of genomic and post-genomic data, such as
microarray expression measurements, opens the way for computational methods eluci-
dating the key components that play a role in these mechanisms.
Much of the specificity in transcription regulation is achieved by transcription fac-
tors, which are largely responsible for the so-called combinatorial aspects of the regu-
latory process (the number of possible behaviors being much larger than the number of
factors). These are proteins that, when in a suitable state, can bind to specific DNA
sequences. By binding to the chromosome in a location near the gene, these factors can
either activate or repress the transcription of the gene. While there are many potential
sites where these factors can bind, it is clear that much of the regulation occurs by fac-
tors that bind in the promoter region, which is located upstream of the transcription start
site.
Unlike DNA-DNA hybridization, the dynamics of protein-DNA recognition are not
completely understood. Nonetheless, experimental results show that transcription fac-
tors have a specific preference for particular DNA sequences. Somewhat generalizing, the
affinity of most factors is determined to a large extent by one or more relatively short
regions of 6–10bp. (One must bear in mind that DNA strands span a complete turn
every 10 bases, thus geometric considerations make it unlikely that a single protein
binds to a longer region, although counterexamples are known.) A common situation is
the formation of dimers in which two DNA binding proteins form a complex. Each of
the two proteins binds to a short sequence, and together they bind to a sequence that
can be 12–18bp long, with a short spacer separating the two regions. Common protein
motifs such as the DNA binding Helix-Turn-Helix (HTH) motif also induce the same
preference on the regulatory site.
Recent advances in microarray experiments make it possible to monitor the expression
levels of genes in a genome-wide manner [8,9,14,15,22,23]. An important aspect of
these experiments is that they make it possible to find groups of genes that have similar
expression patterns across a wide range of conditions [12]. Arguably, the simplest biological
explanation of co-expression is co-regulation by the same transcription factors.¹
This observation sparked several works on in-silico identification of putative tran-
scription factor binding sites [4,17,19,20,21]. The general scheme that most of these pa-
pers take involves two phases. First, they perform, or assume, some clustering of genes
based on gene expression measurements. Second, they search for short DNA patterns
that appear in the promoter region of the genes in each particular cluster. These works
are based to a large extent on methods that were developed to find common motifs in
protein and DNA sequences. These include combinatorial methods [6,19,21,24,25], pa-
rameter optimization methods such as Expectation Maximization (EM) [1], and Markov
Chain Monte Carlo (MCMC) simulations [18,20]. See [19] for a review of these lines
of work.
The use of expression profiles helps to select relatively clean clusters of genes
(i.e., most of them are indeed co-regulated by the same factors). Our interest here lies
with the second phase, and is thus not limited to gene expression analysis. Given high-
quality clusters of genes, suspected for any reason to be co-regulated, we address the
computational problem of finding putative binding sites in these clusters.
In this paper we describe a fast, simple, yet powerful, approach for finding putative
binding sites with respect to a given cluster of genes. Like some of the other works we
divide this phase into two stages. In the first stage we scan, in an exhaustive manner,
for simple patterns from an enumerable class (such as all 7-mers). We use a straight-
forward, natural, and well understood statistical model for filtering significant patterns
out of this class. Using the hyper-geometric distribution, we compute the probability
that a subset of genes of the given size will have this many occurrences of the pat-
tern we examine, when chosen randomly from the group of all known genes. In the
second stage, we use the patterns that were chosen as seeds for training a more ex-
pressive position-specific scoring matrix (PSSM) to model the putative binding site.
These models are both a more accurate representation of the binding site and potentially
capture much longer conserved regions.
By assuming that most binding sites do contain highly conserved short subsequences
and by explicitly using our post-genomic knowledge of all known and putative genes
to contrast clusters of genes against the genome background, we acquire quality seeds
for the construction of PSSMs through a simplified hyper-geometric model.

¹ Clearly this is not always the case. Co-regulation can be achieved by other means, and similar
expression patterns can be a result of parallel pathways or a close serial relationship. Nonethe-
less, this is often the case, and a reasonable hypothesis to test.

The seeds
allow us to track down potential binding site locations through a specific, relatively con-
served region within them. We then use these short seeds to guide the construction of
potentially much longer PSSMs encompassing more of, or possibly the complete, bind-
ing site. In particular, they allow us to align multiple sequences without resorting to an
expensive search procedure (such as MCMC simulations).
Indeed, an important feature of our approach is its evaluation speed. Once we finish
a preprocessing stage, we can evaluate clusters very efficiently. The preprocessing is
genome-wide and not cluster specific. It needs to be done only once, and its results can be
stored for all future reference. This is important both for facilitating interactive analysis
and for providing computationally cheap, quality starting points for other, more complex
analysis tools (such as [2]) built on top of our method.
In the next three sections we outline our algorithmic approach, discussing signifi-
cance of events, seed finding, and seed expansion into PSSMs, respectively. In Section 5
we describe experimental and comparative results, and then conclude with a discussion.
k = #_E(G) of them include the event E. This is simply the hyper-geometric probabil-
ity of finding k red balls among n draws without replacement from an urn containing
K red balls and N - K black ones:

    P_{\mathrm{hyper}}(k \mid n, K, N) = \frac{\binom{K}{k}\,\binom{N-K}{n-k}}{\binom{N}{n}}

The p-value of the observation is the probability of drawing k or more genes that satisfy
E in n draws. This requires summing the tail of the hyper-geometric distribution:

    \text{p-value}(E, G) = \sum_{k'=k}^{n} P_{\mathrm{hyper}}(k' \mid n, K, N)
The main appeal of this approach lies in its simplicity, both computationally and sta-
tistically. This null hypothesis is particularly attractive in the post-genomic era, where
nearly all promoter sequences are known. Under this assumption, irrelevant clustering
selects genes in a manner that is independent of their promoter region.
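For illustration, this tail probability can be computed directly with a standard statistics library; the sketch below is our own and assumes SciPy's parameterization of the hyper-geometric distribution.

```python
from scipy.stats import hypergeom

def event_p_value(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes without replacement from N genes,
    K of which contain the event E."""
    # SciPy's convention: hypergeom(M=population size, n=successes, N=draws),
    # so sf(k-1, N, K, n) gives the upper tail P(X >= k).
    return hypergeom.sf(k - 1, N, K, n)

# e.g. 12 genes of a 30-gene cluster contain a 7-mer that occurs in 400
# of 6000 known promoters:
print(event_p_value(k=12, n=30, K=400, N=6000))
```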
We have just defined the significance of a single event E with respect to a group of
genes G. But when we try many different events E_1, ..., E_M over the same group of
genes long enough, we will eventually stumble upon a surprising event even in a group
of randomly selected sequences, chosen under the null hypothesis.
Judging the significance of findings in such repeated experiments is known as mul-
tiple hypotheses testing. More formally, in this situation we have computed a set of p-
values p_1, ..., p_M, the smallest corresponding to the most surprising event. We now ask
how significant our findings are, considering that we have performed M experiments.
One approach is to find a value q = q(M), such that the probability that any of the
events (or the smallest one) has a p-value less than q is small. Using the union bound
under the null hypothesis we get that

    P\left(\min_m p_m \le q\right) \le \sum_m P(p_m \le q) = Mq

Thus, if we want to ensure that this probability of a false recognition is less than 0.01
(i.e., 99% confidence), we need to set the Bonferroni threshold q = 0.01/M (see, for ex-
ample, [11]).
The Bonferroni threshold is strict, as it ensures that each and every validated scoring
event is not an artifact. Our aim, however, is a bit different. We want to retrieve a set of
events such that most of them are not artifacts. We are often willing to tolerate a certain
fraction of artifacts among the events we return. A statistical method that addresses this
kind of requirement is the False Discovery Rate (FDR) method of [3]. Roughly put, the
intuition here is as follows. Under the null hypothesis, there is some probability that the
best scoring event will have a small p-value. However, if the group was chosen under the
null hypothesis, it can be shown that the p-values we compute are distributed uniformly.
Thus, the p-value of the second best event is expected to be roughly twice as large as
the p-value of the best event. Given this intuition, we should be less strict in rejecting
the null hypothesis for the second best pattern and so on.
To carry out this idea, we sort the events by their observed p-values, so that p_1 ≤
p_2 ≤ ... ≤ p_M. We then return the events E_1, ..., E_k, where k ≤ M is the maximal
index such that p_k ≤ kq/M and q is the significance level we want to achieve in selecting.
We have thus replaced a strict validation test of single events with a more tolerant version
that validates a group of events. We may now detect significant patterns, weaker than the
most prominent one, that were previously below the threshold computed for the latter.
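A small sketch of this selection step (the standard Benjamini-Hochberg step-up rule; variable names are ours):

```python
def fdr_select(p_values: list[float], q: float = 0.05) -> list[int]:
    """Return the indices of events accepted at FDR level q.
    Step-up rule: keep the largest k with p_(k) <= k * q / M."""
    M = len(p_values)
    order = sorted(range(M), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / M:
            k_max = rank
    return order[:k_max]
```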
This genome-wide preprocessing needs to be done only once. Storing its results, we
can rapidly compute p-values of all ball events with respect to any proposed subset of
genes. We simply look up which events occurred in the genes in the cluster, and then
compute the hyper-geometric tail distribution. Furthermore, one may wish to enlarge,
shrink, or shift the regions under consideration (e.g., from 1000bp to 2000bp upstream),
or adjust the upstream regions of several genes (say, due to elucidation of the exact tran-
scription start site). While in general the preprocessing phase must then be repeated, in prac-
tice, since it is mainly made up of counting events, we may efficiently subtract and
add the counts in the symmetric difference between the old and new sets
of strings, avoiding repeating the complete process. With many completely
sequenced genomes and gene expression data of model organisms in various settings
just beginning to accumulate, our division of labour is especially useful.
is the IUPAC consensus sequence. This approach determines the consensus string of
the binding site using a 15-letter alphabet that describes which subset of {A, C, G, T} is
possible at each position.
A position specific scoring matrix (PSSM) (see, e.g., [10]) offers a more refined
representation. A PSSM of length ℓ is an object P = {p_1, ..., p_ℓ}, composed of ℓ
column distributions over the alphabet {A, C, G, T}. The distribution p_i specifies the
probability of seeing each nucleotide at the i-th position in the pattern.
Once we have a PSSM P, we can score each ℓ-mer by computing its combined
probability given P. A more common practice is to compute the log-odds between the
PSSM probability and a background probability of nucleotides. Thus, if p_0 is assumed
to be the nucleotide probability in promoter regions, then the score of an ℓ-mer σ is:

    \mathrm{Score}_P(\sigma) = \sum_i \log \frac{p_i(\sigma[i])}{p_0(\sigma[i])}
That is, we adjust the threshold so that the event defined by (P, t) has the smallest
p-value with respect to G. This discriminative choice of a threshold ensures that we
adjust it to take into account the number of spurious matches to the PSSM outside
of G. Thus, we strive for a threshold that maximizes the number of matches within G
and at the same time minimizes the number of matches outside G. The use of p-values
provides a principled way of balancing these two requirements.
We can find this threshold quite efficiently. We compute the best score of the PSSM
over each gene in G, and sort this list of scores. We then evaluate only thresholds which
are, say, halfway between any two adjacent values in our list of sorted scores (each
succeeding threshold admits another gene into the group of supposedly detected events).
Using, for example, radix sort, this procedure takes time O(N L).
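The following sketch is our own reading of the scoring and threshold-selection steps, assuming a uniform background distribution and a naive quadratic scan rather than the radix-sort version described above; all names are illustrative.

```python
import math
from scipy.stats import hypergeom

def score(pssm, word, p0=0.25):
    """Log-odds score of an l-mer under a PSSM (a list of dicts mapping
    nucleotides to probabilities), with a uniform background p0."""
    return sum(math.log(col[c] / p0) for col, c in zip(pssm, word))

def best_score(pssm, seq):
    """Best-scoring window of the PSSM over one promoter sequence."""
    l = len(pssm)
    return max(score(pssm, seq[i:i + l]) for i in range(len(seq) - l + 1))

def choose_threshold(pssm, cluster_seqs, all_seqs):
    """Pick the threshold whose induced event has the smallest
    hyper-geometric p-value for the cluster (naive O(N^2) sketch)."""
    cluster_scores = sorted(best_score(pssm, s) for s in cluster_seqs)
    all_scores = [best_score(pssm, s) for s in all_seqs]
    N, n = len(all_seqs), len(cluster_seqs)
    best_p, best_t = 1.1, None
    for lo, hi in zip(cluster_scores, cluster_scores[1:]):
        t = (lo + hi) / 2.0                      # halfway between adjacent scores
        K = sum(s > t for s in all_scores)       # genes matched anywhere
        k = sum(s > t for s in cluster_scores)   # matched genes in the cluster
        p = hypergeom.sf(k - 1, N, K, n)         # P(X >= k)
        if p < best_p:
            best_p, best_t = p, t
    return best_p, best_t
```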
Learning PSSMs is composed of two tasks: estimating the parameters of the PSSM
given a set of training sequences that are examples of the pattern we want to match, and
finding these sequences. The latter is clearly a harder problem and requires some care.
We start with the first task. Suppose we are given a collection σ_1, ..., σ_n of ℓ-mers
that correspond to aligned sites. We can easily estimate a PSSM P that corresponds
to these sequences. For each position i, we count the number of occurrences of each
nucleotide in that position. This results in a count N(i, c) = \sum_j \mathbf{1}\{\sigma_j[i] = c\}.
Given the counts, we estimate the probabilities. To avoid entries with zero probabil-
ity, we add pseudo-counts α to each position. Thus, we assign

    p_i(c) = \frac{N(i, c) + \alpha}{n + 4\alpha}    (1)
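A direct transcription of Eq. (1) (the pseudo-count is written as alpha here; the helper name is ours):

```python
from collections import Counter

def estimate_pssm(sites: list[str], alpha: float = 1.0):
    """Estimate a PSSM from n aligned l-mers with pseudo-count alpha,
    as in Eq. (1): p_i(c) = (N(i, c) + alpha) / (n + 4 * alpha)."""
    n, l = len(sites), len(sites[0])
    pssm = []
    for i in range(l):
        counts = Counter(site[i] for site in sites)
        pssm.append({c: (counts[c] + alpha) / (n + 4 * alpha)
                     for c in "ACGT"})
    return pssm
```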
The key question is how to select the training sequences and how to align them.
Our approach builds on our ability to find seeds of conserved sequences. Suppose that
we find a significant ball using the methods of the previous section. We can then use
this as a seed for learning a PSSM. The simplest approach takes the ℓ-mers that match
the ball within the promoter regions of G as the training sequences for the PSSM. The
learned PSSM then quantifies which differences are common among these sequences
and which ones are rare. This gives a more refined view of the pattern that was captured
by the ball.
This simple approach learns an ℓ-PSSM from the ball events found in the data.
However, using PSSMs we can extend the pattern to a much longer one. We start by
aligning not only the sequences that match the ball, but also their flanking regions.
These are aligned by virtue of the alignment of the core ℓ-mers. We can then learn a
PSSM over a much wider region (say 20bp). If there are conserved positions outside
the core positions, this approach will find them.⁵

⁵ This assumes that there are no variable length gaps inside the patterns. The structural con-
straints on transcription factors suggest that these are not common.
Consider, for example, an HTH DNA binding motif, or a binding factor dimer, where
each component matches 6–10bp with several unspecific gap positions between the two
specific sites. If we find one of the two sites using the methods of the previous sections,
then growing a PSSM on the flanking regions allows us to discover the other conserved
positions.
Once we construct such an initial PSSM, we can improve it using a standard EM-
like iterative procedure. This procedure consists of the following steps. Given a PSSM
P_0, we compute a threshold t_0 as described above. We then consider each position in
the training sequences and compute the probability that the pattern appears at that po-
sition. Formally, we compute the likelihood ratio that (P_0, t_0) assigns to the appearance of
the pattern at s[i, ..., i+ℓ-1]. We then convert this ratio to a probability by computing

    w_{s,i} = \mathrm{logit}\bigl(\mathrm{Score}_{P_0}(s[i, \ldots, i+\ell-1]) - t_0\bigr)

where logit(x) = 1/(1 + e^{-x}) is the logistic function. We then re-scale these prob-
abilities by dividing by a normalization factor Z_s so that the posterior probability of
observing the pattern in s and its reverse complement sums to 1. Once we have com-
puted these posterior probabilities, we can accumulate expected counts

    N(i, c) = \sum_g \sum_j \frac{w_{s_g, j}}{Z_{s_g}} \, \mathbf{1}\{s_g[j + i] = c\}.
These represent the expected number of times that the ith position in the PSSM takes
the value c, based on the posterior probabilities.
Once we have collected these expected counts, we re-estimate the weights of the PSSM
using Eq. 1 to get a new PSSM. We optimize the threshold of this PSSM, and repeat
the process. Although this process does not guarantee improvement in the p-value of the
learned PSSM, it is often the case that successive iterations do lead to significant
improvements. Note that our iterations are analogous to EM's hill-climbing behaviour,
and differ from Gibbs samplers, where one performs a stochastic random walk aimed at
a beneficial equilibrium distribution.
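A compact sketch of one such iteration (our own reading of the procedure; sequences are assumed to be over {A, C, G, T}, and the reverse-complement part of the normalization and the threshold re-optimization are omitted for brevity):

```python
import math

def logodds(pssm, word, p0=0.25):
    """Log-odds score of a window under the PSSM, uniform background p0."""
    return sum(math.log(col[c] / p0) for col, c in zip(pssm, word))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def em_iteration(pssm, t0, training_seqs, alpha=1.0):
    """One EM-like update: weight every window of every training sequence
    by logistic(score - t0), accumulate expected counts, and re-estimate
    the columns with Eq. (1)."""
    l = len(pssm)
    counts = [{c: 0.0 for c in "ACGT"} for _ in range(l)]
    n = len(training_seqs)
    for s in training_seqs:
        positions = range(len(s) - l + 1)
        w = [logistic(logodds(pssm, s[j:j + l]) - t0) for j in positions]
        Z = sum(w) or 1.0                       # simplified normalization Z_s
        for j, wj in zip(positions, w):
            for i in range(l):
                counts[i][s[j + i]] += wj / Z   # expected count of c at column i
    return [{c: (counts[i][c] + alpha) / (n + 4 * alpha) for c in "ACGT"}
            for i in range(l)]
```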
5 Experimental Results
We performed several experiments on data from the yeast genome to evaluate the util-
ity and limitations of the methods described above. Thus, we focused on several re-
cent examples from the literature that report binding sites found either using computa-
tional tools or by biological verification. To better calibrate the results, we also applied
MEME [1], one of the standard tools in this field, on the same examples.
In this first analysis we chose to use the simple Hamming distance measure and treat
the 1000bp sequence upstream of the ORF start position as the promoter region. We
note that the latter is a somewhat crude approximation, as this region also contains an
untranslated region of the transcript.
We ran our method in two stages. In the first stage, we searched for patterns of
length 6–8, with the number of allowed mismatches ranging between 0–2 and an allowed
ball overlap factor of 0–1. Generally speaking, in these runs the patterns found with no
mismatches or ball overlaps had better p-values. This happens because we search for
relatively short patterns, allowing for a non-trivial probability of a random match. For
this reason we report below only results with exact matches and no overlap. We believe
that higher values of both parameters will be useful for longer patterns (say of length
12 or 13). In the second stage we ran the EM-like procedure described above on all the
patterns that received significant scores. We chose to learn PSSMs of width 20, using
15 iterations of our procedure.
To compare the results of these two stages, we ran MEME (version 3.0.3) in two
configurations. The first restricted MEME to retrieve only short patterns of width 6–8,
corresponding to our ℓ-mer stage. The second configuration used MEME's own
defaults for pattern retrieval, resembling our end-product PSSMs.
We applied our procedure to several data sets from the recent literature. Selected
results are summarized in Table 1. In this table we rank the top results from the different
runs of each procedure by their p-values (or e-values) reported by the programs after
removing repeated patterns. We report the relative rank of the patterns singled out in
the literature and their significance scores. We discuss these results in order.

Table 1. Selected results on binding site regions of several yeast data sets, comparing
our findings with those of MEME.
The first data set is by Spellman et al. [22]. They report several cell-cycle related
clusters of genes. In a recent paper, Sinha and Tompa [21] report results of a systematic
search for binding sites, in the form of IUPAC consensus regions, in these clusters, using
a random sequence null hypothesis based on a Markov chain of order 3. The main technical
developments in [21] are methods for approximating the p-value computation with respect to
such a null hypothesis.
We examined two clusters reported on by Sinha and Tompa. In the first one, CLN2,
our method identifies the pattern ACGCGT and various expansions of it. This pattern was
found using patterns of length 6, 7, and 8 with significant p-values. The PSSMs learned
from these patterns were quite similar, all containing the above motif. Figure 1(a) shows
an example. In the second cluster, SIC1, the signal appears with a marginal p-value
(close to the Bonferroni cutoff) already at ℓ = 6. The trained PSSM recovers the longer
pattern with a significant p-value. In both cases, the top ranking patterns correspond to
the known binding site.
The second data set is by Tavazoie et al. [23]. That paper also examines cell-cycle
related expression levels that were grouped using k-means clustering. They examined
30 clusters, and applied an MCMC-based procedure for finding PSSM patterns in the
promoter regions of genes in each cluster. We examined the clusters they report as
statistically significant, and were able to reproduce binding sites that are very close to
the PSSMs they report; see Table 1.

Fig. 1. Examples of PSSMs Learned by Our Procedure. (a) CLN2 cluster. (b) SBF clus-
ter. (c) Gasch et al. Cluster M. (d) Gasch et al. Cluster I/J.
In a recent paper, Iyer et al. [16] identify, using experimental methods, two groups
of genes that are regulated by the MBF/SBF transcription factors. Here, again, we man-
aged to recover the binding sites they discuss with high confidence. For example, we
show one of our matching PSSMs in Figure 1(b).
Finally, we discuss the recent data set of yeast response to environmental stress by
Gasch et al. [14]. We report on two clusters of genes, M and I/J. In cluster M the
string CACGTGA is found in several of the highest scoring patterns. However, when we
turned to grow PSSMs out of our seeds, a matrix of a lower ranking seed, GATAAGA,
exceeded the rest, exemplifying that seed ordering is not necessarily maintained when
the patterns are extended. The latter, more prominent PSSM is shown in Figure 1(c). In
cluster I/J no significant short pattern rises above our threshold. However, when
we extended the topmost seed we obtained the PSSM of Figure 1(d), which both
nearly crosses our significance threshold and holds biological appeal, showing two
conserved short regions flanking a less conserved 2-mer.
In general, the scores of the learned PSSMs vary. In some cases, the best seeds
yield the best scoring PSSMs. More often, the best scoring PSSM corresponds to a seed
lower in the list (we took into account only seeds whose p-values pass the FDR
decision threshold). In most cases the PSSM learned to recognize regions flanking the
seed sequence. In some cases more conserved regions were discovered. In general, our
approach manages to identify short patterns that are close to the pattern in the data.
Moreover, using our PSSM learning procedure we are able to expand these into more
expressive patterns.
We note that in most analysed cases MEME also identified the shorter patterns.
However, there are two marked differences. First and foremost is run time. Compared
on a 733 MHz Pentium III Linux machine our seed discovery programs ran between
half a minute and an hour, exhaustively examining all possible patterns, while the EM-
like PSSM growing iterations added a couple of minutes. The shortest MEME run on
the same data sets took about an hour, while longer ones ran for days, when asked to
return only the top thirty patterns. Second, MEME often gave top scores to spurious pat-
terns that are clear artifacts of the sequence distributions in the promoter regions (such
as poly As). When using MEME one can try to avoid these problems by supplying a
more detailed background model. This has the effect of removing most low complex-
ity patterns from the top scoring ones. Our program avoids most of these pitfalls by
performing its significance tests with respect to the genome background to begin with.
6 Discussion
In this paper we examined the problem of finding putative transcription factor binding
sites with respect to a selected group of genes. We advocate significance calculations
with respect to the random selection null hypothesis. We claim that this hypothesis
is both simple and clear and is more suitable for gene expression experiments than
the random sequence null hypothesis. We then use a simple hyper-geometric test in
a framework for constructing models of binding sites. This framework starts by sys-
tematically scanning a family of simple seed patterns. These seeds are then used for
building PSSMs. We describe how to construct statistical tests to select the most surpris-
ing threshold value for a PSSM and combine this with an EM-like iterative procedure
to improve it. We thus combine a first phase of kernel identification based on a rigorous
statistical analysis of word over-representation, with a subsequent phase of optimiza-
tion, leading to a PSSM, which can be used to scan sequences for new matches of the
putative regulatory motif.
We showed that even before performing iterative optimization of the PSSMs, our
method recovers highly selective seed patterns very rapidly. We reconstructed results
from several recent papers that use more elaborate and computationally intensive tools
for finding binding sites, and also presented novel binding sites.
A potential weakness of our model is the fact that we disregard multiple copies of a
match in the same sequence (the restriction to binary events). Despite the fact that this
phenomenon is known to happen in eukaryotic genes, we recall that a mathematical
analysis of counting the number of occurrences in a single string is more elaborate
and computationally intensive. This may indeed lead in such cases to under-estimation,
which is problematic mainly for small clusters of co-regulated genes. The recognition
of two conserved patterns separated by a relatively long spacer (say of 10bp or more),
resulting from an HTH motif or a dimer complex, can however be attacked by looking
for proximity relationships between pairs of occurrences of different significant seeds.
As this field is showing an influx of interest, our work resembles several others in
different aspects. We highlight only the most relevant ones.
The hyper-geometric distribution is used in the context of finding binding sites
by Jensen and Knudsen [17] to find short conserved subsequences of length 4–6
bp. They demonstrate the ability to reconstruct sequences, but suffer statistical problems
when they consider longer ℓ-mers, due to the large number of competing hypotheses.
Already in Galas et al. [13], word statistics are used to detect over-represented mo-
tifs, and a definition of a general concept of word neighborhood is given, similar
to the ball definition we give here. However, the analysis there is restricted to over-
representations at specific positions with respect to a common point of reference across
all sequences, making it mostly appropriate for the elucidation of prokaryotic transcription
or translation promoter regions.
The general outline of our approach is similar to that of Wolferstetter et al. [27] and
Vilo et al. [26]. Both search for over-represented words and try to extend them. Vilo
et al. examine ℓ-mers of varying sizes that are identified by building a suffix tree for
the promoter regions. Then, they use a binomial formula for evaluating significance.
For the clustering they constructed, this resulted in a very large pool of sequences (over
1500). They use a multiple-alignment-like procedure for combining these ℓ-mers into
longer consensus regions. Thus, to learn longer binding sites with variable positions,
they require overlapping subsequences to be present in the data. This is in contrast to
our approach, which uses PSSMs to extend the observed patterns and so is more robust to
highly variable positions that flank the conserved region.
Van Helden et al. [24] also use a binomial approach. They try to take into consider-
ation the presence of multiple copies of a motif in the same sequence, but suffer from
resulting inaccuracies with respect to auto-correlating patterns. Our work can be seen as
generalizing this approach in several respects, including the use of a hyper-geometric
null model, the discussion of general distance functions and event space coarsening,
and the iterative PSSM improvement phase.
There are several directions in which we can extend our approach, some of them
embedding ideas from previous works into our context.
First, in order to estimate the sensitivity of our model it will be interesting to exam-
ine it on smaller, and known, gene families, as well as on synthetic data sets, such as those
advocated in [19]. Extending our empirical work beyond yeast should also provide new
insights and challenges.
Our method treats the complete promoter region as a uniform whole. However, bi-
ological evidence suggests that the occurrence of binding sites can depend on the posi-
tion within the promoter sequence [22]. We can easily augment our method by defining
events on sub-regions within the promoter sequence. This will facilitate the discovery
of subsequences specific to certain positions. Another biological insight already men-
tioned is the phenomenon of two conserved patterns separated by a relatively long spacer.
In the case of homodimers we can easily expand our scope to handle events that require
two appearances of the subsequence within the promoter region. Otherwise, we can try
to extend our PSSMs further to flank the seed while weighting each column such as to
allow for longer spacers between meaningful sub-patterns.
So far we have looked for contiguous conserved patterns within the binding site.
More complex extensions involve defining new distance measures that incorporate pref-
erences for stronger conservation at specific positions in the pattern, and random
projection techniques, akin to [5], which will allow us to easily handle longer ℓ-mers.
We can also further generalize our model by allowing ourselves to express our ℓ-mer
centroids over the IUPAC alphabet. This allows both for a reduction of the event space
and the natural incorporation of biological insight, as outlined above. Our current
method for diluting the set of covering balls is highly heuristic. Interesting theo-
retical issues include the formal criteria we should optimize in selecting this approxi-
mating set of balls and how to efficiently optimize with respect to such a criterion.
Finally, we intend to combine the putative sites we discover with learning methods that
learn dependencies between different sites and between sites and other attributes such
as expression levels and functional annotations [2].
Acknowledgments
The authors thank Zohar Yakhini for arousing our interest in this problem and for many
useful discussions relating to this work, and the referees for pointing out further relevant
literature. This work was supported in part by Israeli Ministry of Science grant 2008-
1-99 and an Israel Science Foundation infrastructure grant. GB is also supported by a
grant from the Ministry of Science, Israel. NF is also supported by an Alon Fellowship.
References
1. T.L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover
motifs in biopolymers. In Proc. Int. Conf. Intell. Syst. Mol. Biol., volume 2, pages 28–36,
1994.
2. Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data.
In Proc. Ann. Int. Conf. Comput. Mol. Biol., volume 5, pages 12–21, 2001.
3. Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: a practical and pow-
erful approach to multiple testing. J. Royal Statistical Society B, 57:289–300, 1995.
4. A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Predicting gene regulatory elements in
silico on a genomic scale. Genome Res., 8:1202–1215, 1998.
5. J. Buhler and M. Tompa. Finding motifs using random projections. In Proc. Ann. Int. Conf.
Comput. Mol. Biol., volume 5, pages 69–76, 2001.
6. H. J. Bussemaker, H. Li, and E. D. Siggia. Building a dictionary for genomes: identification
of presumptive regulatory sites by statistical analysis. PNAS, 97(18):10096–10100, 2000.
7. H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using a probabilistic
segmentation model. In Proc. Int. Conf. Intell. Syst. Mol. Biol., volume 8, pages 67–74, 2000.
8. S. Chu, J. DeRisi, M. Eisen, J. Mullholland, D. Botstein, P. Brown, and I. Herskowitz. The
transcriptional program of sporulation in budding yeast. Science, 282:699–705, 1998.
9. J. DeRisi, V. Iyer, and P. Brown. Exploring the metabolic and genetic control of gene
expression on a genomic scale. Science, 278:680–686, 1997.
10. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilis-
tic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
11. R. Durrett. Probability: Theory and Examples. Wadsworth & Brooks/Cole, California,
1991.
12. M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of
genome-wide expression patterns. PNAS, 95:14863–14868, 1998.
13. D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA
sequences: analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128,
1985.
14. A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein,
and P. O. Brown. Genomic expression programs in the response of yeast cells to environmen-
tal changes. Mol. Biol. Cell, 11:4241–4257, 2000.
1 Introduction
Although current technology for DNA sequencing is highly automated and can deter-
mine large numbers of base pairs very quickly, only about 550 consecutive base pairs
(bp), on average, can be reliably determined in a single read [6]. Thus, a large consecutive
stretch of source DNA can only be determined by assembling it from short fragments
obtained using a shotgun sequencing strategy [5]. In a modification of this approach
called double-barreled shotgun sequencing [1], larger clones of DNA are sequenced
from both ends, thus producing mate-pairs of sequenced fragments with known relative
orientation and approximate separation (typically, employing a mixture of 2kb, 5kb,
10kb, 50kb and 150kb clones). So a sequencing project usually produces a collection
of fragments that are randomly sampled from the source sequence. The average num-
ber x of fragments that cover any given position in the source sequence is known as the
fragment x-coverage.
Given two different assemblies of the same chromosome-sized source sequence,
possibly obtained from two different sequencing projects, how can one evaluate and
compare them? The aim of this paper is to present some fast and simple methods
addressing this problem that are based on fragment and mate-pair data obtained in
a sequencing project for the source sequence. Additional applications are in tracking
forward features from one version of an assembly to the next, comparison of differ-
ent chromosomes from the same genome and of similar chromosomes from different
species. Although each method on its own is just an implementation of a simple idea or
heuristic, our experience is that the integration of these methods gives rise to a powerful
tool. We originally developed this tool to compare different assemblies of the human
genome, see Figures 6 and 7 in [7].
In Section 2 we discuss assembly evaluation and comparison techniques based on
fragments. In particular, we introduce the concept of segment discrepancy that mea-
sures by how much the positioning of a segment of conserved sequence differs between
two assemblies. Then we present some mate-pair based methods in Section 3, includ-
ing a useful breakpoint detection heuristic. Finally, we demonstrate the utility of these
methods in Section 4.
[Figure appears here: a fragment-coverage plot (fragment coverage vs. assembly coordinate, 3.4–4.3 Mb).]

Fig. 2. Fragment based dot-plot comparison of two different assemblies of a 6Mb region
of chromosome 2 in human. Each point represents a fragment that hits both assemblies.
Fig. 3. Fragment Based Line-Plot Comparison. Each line segment represents a fragment
that hits both assemblies. Medium grey lines represent fragments contained in the heav-
iest common subsequence (HCS) of consistently ordered and oriented segments, light
grey lines represent consistently oriented segments that are not contained in the HCS,
and dark grey lines represent fragments (or segments) that have opposite orientation in
the two assemblies.
consensus sequence of assembly B directly against that of assembly A and then project
fragments from A onto B wherever compatible with the segments of local alignment
between A and B.
For any two fragments F, G ∈ F(A, B), define F <_A G if s(F, A) < s(G, A),
and define F <_B G if s(F, B) < s(G, B). Because we assume that all s values are
distinct, these are both total orderings, and we use pred_A(F) and succ_A(F) to denote
the <_A-predecessor and <_A-successor of F, respectively.
A sequence S = (F_1, F_2, ..., F_k) of fragments is called a matched segment in
either of the two following cases:
1. {F_1, F_2, ..., F_k} ⊆ F^+(A, B) and succ_A(F_i) = succ_B(F_i) for all i = 1, 2, ..., k-1, or
2. {F_1, F_2, ..., F_k} ⊆ F^-(A, B) and succ_A(F_i) = pred_B(F_i) for all i = 1, 2, ..., k-1.
A matched segment is called maximal if it cannot be extended.
Let S := S(F(A, B)) = {S_1, S_2, ..., S_n} denote the set of all maximal matched
segments of F(A, B), and let S^+ and S^- denote the subsets of such segments in cases
1 and 2, respectively. Both S^+ and S^- can be computed in a simple loop that consid-
ers each fragment in <_A order and decides whether it extends the current segment or
defines the start of a new one.
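A sketch of this loop (our own formulation; each fragment is reduced to its rank under <_B and a flag saying whether it lies in F^+(A, B)):

```python
def maximal_matched_segments(frags):
    """frags: list of (b_rank, same_orientation) pairs sorted by A-position,
    where b_rank is the fragment's rank under <_B and same_orientation is
    True for fragments in F+(A, B) and False for those in F-(A, B).
    Returns maximal runs in which consecutive fragments in A-order are also
    consecutive in B-order (ascending for case 1, descending for case 2)."""
    if not frags:
        return []
    segments, current = [], [0]
    for i in range(1, len(frags)):
        b_prev, plus_prev = frags[i - 1]
        b_cur, plus_cur = frags[i]
        step = 1 if plus_prev else -1
        if plus_cur == plus_prev and b_cur == b_prev + step:
            current.append(i)          # extends the current matched segment
        else:
            segments.append(current)   # close it and start a new one
            current = [i]
    segments.append(current)
    return segments                    # lists of fragment indices in A-order
```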
The A-support of a matched segment S = (F_1, F_2, ..., F_k) is defined as the in-
terval [s(S, A), t(S, A)], with s(S, A) := min_{F ∈ S}(s(F, A), t(F, A)) and t(S, A) :=
max_{F ∈ S}(s(F, A), t(F, A)). The B-support is defined similarly. Let len(S) denote the
minimum of the lengths of the A- and B-supports of S.
Consider two segments S = (F_1, F_2, ...) and T = (G_1, G_2, ...). We say that S and
T are parallel if either both s(F_1, A) < s(G_1, A) and s(F_1, B) < s(G_1, B), or both
s(F_1, A) > s(G_1, A) and s(F_1, B) > s(G_1, B) hold.
It seems reasonable to trust those portions of the two assemblies that are covered
by segments from the heaviest common subsequence H. Thus, we propose to measure
the amount by which the positioning of a segment S that is not in H differs in the two
assemblies as follows: we define the displacement D(S) associated with S as the sum
of the lengths of all segments in H that are not parallel to S. In Figure 4 we plot segment
length vs. segment displacement.
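A sketch of this displacement computation (helper names are ours; parallelism is tested on the leading fragments, as defined above):

```python
def displacement(S, hcs, start_A, start_B, length):
    """D(S): total length of segments in the heaviest common subsequence H
    that are not parallel to S.  start_A/start_B return the position of a
    segment's first fragment in assemblies A and B; length returns len(T)."""
    def parallel(S, T):
        return ((start_A(S) < start_A(T) and start_B(S) < start_B(T)) or
                (start_A(S) > start_A(T) and start_B(S) > start_B(T)))
    return sum(length(T) for T in hcs if not parallel(S, T))
```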
[Fig. 4 appears here: segment length (Kbp) plotted against segment displacement (Kbp).]
Let A and B be two assemblies of a chromosome and let F be a set of associated frag-
ments. Assume now that the fragments in F were generated using a double-barreled
shotgun protocol in which mate-pairs of fragments are obtained by reading both ends
of longer clones. For purposes of this paper, a mate-pair library M = (L, μ, σ) consists
of a list L of pairs of mated fragments, together with a mean estimate μ and standard
deviation σ for the length of the clones from which the mate-pairs were obtained; see
Figure 5.
Typical clone sizes used to produce the mate-pair libraries in Celera's human
genome sequencing were 2kb, 10kb, 50kb, and 150kb. The quality of shorter mate-pairs
can be very good, with a standard deviation of about 10% of the mean length, whereas
the standard deviation can reach 20% for long clones. Also, because both ends of clones
are read in separate sequencing reactions, there is a potential for mis-associating mates.
Fig. 5. Two fragments F and G that form a mate-pair with known mean distance μ and
standard deviation σ. Note their relative orientation in the source sequence.
However, a high level of automation and electronic sample tracking can reduce the oc-
currences of this problem to below 1%. By construction, any fragment will occur in at
most one mate-pair.
Given an assembly A with fragments F(A) and a collection of mate-pair libraries
M = {M_1, M_2, ...}, let m = {F, G} ⊆ F(A) be a mate-pair occurring in some
library M_i = (L, μ, σ). Then m is called happy if the positioning of F and G in A
is reasonable, i.e., if F and G are oriented towards each other (as in Figure 5) and
|μ - |s(F, A) - s(G, A)|| ≤ 3σ, say. An unhappy mate-pair m is called mis-oriented if
the former condition is not satisfied, and mis-separated if only the latter condition fails.
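A sketch of this classification (field names are our own; the 3σ bound is the one quoted above):

```python
from dataclasses import dataclass

@dataclass
class Placed:
    start: float        # s(F, A)
    forward: bool       # orientation of the fragment in assembly A

def classify_mate_pair(F: Placed, G: Placed, mu: float, sigma: float) -> str:
    """Classify a mate-pair as happy, mis-oriented, or mis-separated."""
    left, right = (F, G) if F.start <= G.start else (G, F)
    innie = left.forward and not right.forward   # oriented towards each other
    separation_ok = abs(mu - abs(F.start - G.start)) <= 3 * sigma
    if innie and separation_ok:
        return "happy"
    if not innie:
        return "mis-oriented"
    return "mis-separated"
```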
[Figure appears here: assemblies A and B with 2K, 10K, and 50K mate-pair tracks over 0–6 Mb.]
to the left and G is oriented to the right. (Happy and mis-separated mates are innie-
oriented).
We now describe a simple but effective heuristic for detecting breakpoints. Choose
a threshold T > 0, depending on details of the sequencing project. (All figures in this
paper were produced using T = 5.) An event is a three-tuple (x, t, a) consisting of a
coordinate x ∈ {1, ..., len(A)}, a type t ∈ {normal, anti, outtie, mis-separated}, and
an action a ∈ {+1, -1}, where +1 or -1 indicates the beginning or end of a clone-
middle, respectively. We maintain the number of currently alive mates V(t) of type
t. For each event e = (x, t, a) in ascending order of coordinate x: if a = +1, then
increment V(t) by 1. In the other case (a = -1), if V(t) ≥ T, then report a breakpoint
at position x and set V(t) = 0, else decrease V(t) by 1. (For a better estimation of the
true position of the breakpoint, report the interval [x', x], where x' is the coordinate of
the most recent alive +1-event of type t.) Breakpoints estimated in this way are shown
in Figure 7.
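A sketch of the scan (variable names are ours; events are assumed to be pre-sorted by coordinate):

```python
def detect_breakpoints(events, T=5):
    """events: (x, type, action) triples sorted by coordinate x, where action
    is +1 at the start of a clone-middle and -1 at its end, and type is the
    mate-pair class being tracked.  A breakpoint is reported when a
    clone-middle of some type ends while at least T clone-middles of that
    type are still alive."""
    alive = {}
    breakpoints = []
    for x, t, a in events:
        if a == +1:
            alive[t] = alive.get(t, 0) + 1
        elif alive.get(t, 0) >= T:
            breakpoints.append((x, t))   # report and reset this type
            alive[t] = 0
        else:
            alive[t] = alive.get(t, 0) - 1
    return breakpoints
```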
A useful variant of the breakpoint estimator is obtained by taking the current number
of alive happy mates into account: scanning from left to right, a breakpoint is said to
be present at position x if there exists an event e = (x, t, -1) such that the number of
alive unhappy mates of type t exceeds the number of alive happy mates of type t.
Fig. 7. A Localized Clone-Middle Diagram for Assemblies A and B. Here, each mis-
separated or mis-oriented mate-pair is represented by a line that indicates the expected
range of placement of the right mate with respect to the left one. Ticks along the axis
indicate putative breakpoints, as inferred from the mis-oriented mates.
Similar to the fragment-coverage plot discussed in Section 2, one can use the clone-
coverage events to compute a clone-coverage plot for each of the types of mate-pairs,
see Figure 8.
Note that the simultaneous occurrence of both high happy and high mis-separated
coverage may indicate the presence of a polymorphism in the fragment data.
3.4 Synthesis
Combining all the described methods into one view gives rise to a tool that is very
helpful in deciding by how much two different assemblies differ and, moreover, which one is
more compatible with the given fragment and mate-pair data; see Figure 9. This latter
capability is an especially powerful aspect of analysis in terms of fragments and mate-
pairs.
4 Some Applications
The techniques described in this paper have a number of different applications in com-
parative genomics. Originally, our goal was to design a tool for comparing the simi-
larities and differences of assemblies of human chromosomes produced at Celera with
those produced by the publicly funded Human Genome Project (PFP). A detailed com-
parison based on our methods is shown in Figures 6 and 7 of [7]. As an example, we
show the comparison for chromosome 2 in Figure 10. For clarity, only segments of
length 50kb or more are shown.

Fig. 8. Clone-coverage plot for assemblies A and B, showing the number of happy
mate-pairs (medium grey), mis-separated pairs (light grey) and mis-oriented ones (dark
grey).

[Figure appears here: assemblies A and B with 2K, 10K, and 50K mate-pair tracks over 0–6 Mb.]
Fig. 10. Line-plot and breakpoint comparison of two different assemblies of chromo-
some 2 of human. Assembly C was produced at Celera [7] and assembly H was pro-
duced in the context of the publicly funded Human Genome Project and was released
on September 5, 2000 [2]. The number of detected breakpoints (indicated as ticks along
the chromosome axes) is 73 for C and 3592 for H.
4.1 Feature-Tracking
Fig. 11. Line-plot, clone-middle and breakpoint comparison of the PFP assembly H_1
of chromosome 19 as of September 5, 2000, and a more recent PFP assembly H_2
dated January 9, 2001.
Additionally, our algorithms can be used to compare different chromosomes of the same
species, e.g. in search of duplication events, but also to compare different chromosomes
from different species, in the latter case using a lower stringency alignment method to
define fragment hits.
We illustrate this by a comparison of chromosomes X and Y of human, as described
in [7]. In this analysis we use only uniquely hitting fragments. In summary, we see
approximately 1.3Mb of sequence in conserved segments, of which 164kb are contained
in the heaviest common subsequence (relative to the standard orientation of X and Y),
82kb are contained in other segments of the same orientation, and 1.05Mb in oppositely
oriented segments; see Figure 12. We observe orientation-preserving similarity at both
ends of the chromosomes and a large inverted conserved segment in the interior of X.
References
1. A. Edwards and C.T. Caskey. Closure strategies for random DNA sequencing. METHODS:
A Companion to Methods in Enzymology, 3(1):41–47, 1991.
2. D. Haussler. Human Genome Project Working Draft. http://genome.cse.ucsc.edu.