Algorithms on Strings, Trees, and Sequences
COMPUTER SCIENCE AND COMPUTATIONAL BIOLOGY
Dan Gusfield
University of California, Davis
CAMBRIDGE UNIVERSITY PRESS
Contents

Preface
Introduction
The Boyer-Moore Algorithm
The Knuth-Morris-Pratt Algorithm
Real-time String Matching
Exercises
A Short History
Basic Definitions
A Motivating Example
A Naive Algorithm to Build a Suffix Tree
Preface
…science, although it was an active area for statisticians and mathematicians (notably Michael
Waterman and David Sankoff, who have largely framed the field). Early on, seminal papers
on computational issues in biology (such as the one by Buneman [83]) did not appear in
mainstream computer science venues but in obscure places such as conferences on computational archeology [226]. But seventeen years later, computational biology is hot, and
many computer scientists are now entering the (now more hectic, more competitive) field
[280]. What should they learn?
The problem is that the emerging field of computational molecular biology is not well
defined and its definition is made more difficult by rapid changes in molecular biology
itself. Still, algorithms that operate on molecular sequence data (strings) are at the heart
of computational molecular biology. The big-picture question in computational molecular biology is how to "do" as much "real biology" as possible by exploiting molecular
sequence data (DNA, RNA, and protein). Getting sequence data is relatively cheap and
fast (and getting more so) compared to more traditional laboratory investigations. The use
of sequence data is already central in several subareas of molecular biology and the full
impact of having extensive sequence data is yet to be seen. Hence, algorithms that operate on strings will continue to be the area of closest intersection and interaction between
computer science and molecular biology. Certainly then, computer scientists need to learn
the string techniques that have been most successfully applied. But that is not enough.
Computer scientists need to learn fundamental ideas and techniques that will endure
long after today's central motivating applications are forgotten. They need to study methods that prepare them to frame and tackle future problems and applications. Significant
contributions to computational biology might be made by extending or adapting algorithms from computer science, even when the original algorithm has no clear utility in
biology. This is illustrated by several recent sublinear-time approximate matching methods for database searching that rely on an interplay between exact matching methods
from computer science and dynamic programming methods already utilized in molecular
biology.
Therefore, the computer scientist who wants to enter the general field of computational
molecular biology, and who learns string algorithms with that end in mind, should receive a
training in string algorithms that is much broader than a tour through techniques of known
present application. Molecular biology and computer science are changing much too
rapidly for that kind of narrow approach. Moreover, theoretical computer scientists try to
develop effective algorithms somewhat differently than other algorithmists. We rely more
heavily on correctness proofs, worst-case analysis, lower bound arguments, randomized
algorithm analysis, and bounded approximation results (among other techniques) to guide
the development of practical, effective algorithms. Our "relative advantage" partly lies in
the mastery and use of those skills. So even if I were to write a book for computer scientists
who only want to do computational biology, I would still choose to include a broad range
of algorithmic techniques from pure computer science.
In this book, I cover a wide spectrum of string techniques - well beyond those of
established utility; however, I have selected, from the many possible illustrations, those
techniques that seem to have the greatest potential application in future molecular biology.
Potential application, particularly of ideas rather than of concrete methods, and to anticipated rather than to existing problems, is a matter of judgment and speculation. No doubt,
some of the material contained in this book will never find direct application in biology,
while other material will find uses in surprising ways. Certain string algorithms that were
generally deemed to be irrelevant to biology just a few years ago have become adopted…
Our problem
None of us was an expert on string algorithms. At that point I had a textbook knowledge of
Knuth-Morris-Pratt and a deep confusion about Boyer-Moore (under what circumstances
it was a linear time algorithm and how to do strong preprocessing in linear time). I
understood the use of dynamic programming to compute edit distance, but otherwise
had little exposure to specific string algorithms in biology. My general background was
in combinatorial optimization, although I had a prior interest in algorithms for building
evolutionary trees and had studied some genetics and molecular biology in order to pursue
that interest.
What we needed then, but didn't have, was a comprehensive cohesive text on string
algorithms to guide our education. There were at that time several computer science
texts containing a chapter or two on strings, usually devoted to a rigorous treatment of
Knuth-Morris-Pratt and a cursory treatment of Boyer-Moore, and possibly an elementary
discussion of matching with errors. There were also some good survey papers that had
a somewhat wider scope but didn't treat their topics in much depth. There were several
texts and edited volumes from the biological side on uses of computers and algorithms
for sequence analysis. Some of these were wonderful in exposing the potential benefits
and the pitfalls of using computers in biology, but they generally lacked algorithmic rigor
and covered a narrow range of techniques. Finally, there was the seminal text Time Warps,
String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison,
edited by D. Sankoff and J. Kruskal, which served as a bridge between algorithms and
biology and contained many applications of dynamic programming. However, it too was
much narrower than our focus and was a bit dated.
Moreover, most of the available sources from either community focused on string
matching, the problem of searching for an exact or "nearly exact" copy of a pattern in
a given text. Matching problems are central, but as detailed in this book, they constitute
only a part of the many important computational problems defined on strings. Thus, we
recognized that summer a need for a rigorous and fundamental treatment of the general
topic of algorithms that operate on strings, along with a rigorous treatment of specific
string algorithms of greatest current and potential import in computational biology. This
book is an attempt to provide such a dual, and integrated, treatment.
…algorithm will make those important methods more available and widely understood. I connect
theoretical results from computer science on sublinear-time algorithms with widely used
methods for biological database search. In the discussion of multiple sequence alignment
I bring together the three major objective functions that have been proposed for multiple alignment and show a continuity between approximation algorithms for those three
multiple alignment problems. Similarly, the chapter on evolutionary tree construction exposes the commonality of several distinct problems and solutions in a way that is not well
known. Throughout the book, I discuss many computational problems concerning repeated
substrings (a very widespread phenomenon in DNA). I consider several different ways
to define repeated substrings and use each specific definition to explore computational
problems and algorithms on repeated substrings.
In the book I try to explain in complete detail, and at a reasonable pace, many complex
methods that have previously been written exclusively for the specialist in string algorithms. I avoid detailed code, as I find it rarely serves to explain interesting ideas,³ and
I provide over 400 exercises both to reinforce the material of the book and to develop
additional topics.

³ However, many of the algorithms in the book have been coded in C and are available at …
In summary
This book is a general, rigorous text on deterministic algorithms that operate on strings,
trees, and sequences. It covers the full spectrum of string algorithms from classical computer science to modern molecular biology and, when appropriate, connects those two
fields. It is the book I wished I had available when I began learning about string algorithms.
Acknowledgments
I would like to thank The Department of Energy Human Genome Program, The Lawrence
Berkeley Laboratory, The National Science Foundation, The Program in Math and Molec-
ular Biology, and The DIMACS Center for Discrete Mathematics and Computer Science
special year on computational biology, for support of my work and the work of my students
and postdoctoral researchers.
Individually, I owe a great debt of appreciation to William Chang, John Kececioglu,
Jim Knight, Gene Lawler, Dalit Naor, Frank Olken, R. Ravi, Paul Stelling, and Lusheng
Wang.
I would also like to thank the following people for the help they have given me along
the way: Stephen Altschul, David Axelrod, Doug Brutlag, Archie Cobbs, Richard Cole,
Russ Doolittle, Martin Farach, Jane Gitschier, George Hartzell, Paul Horton, Robert Irving, Sorin Istrail, Tao Jiang, Dick Karp, Dina Kravets, Gad Landau, Udi Manber, Marci
McClure, Kevin Murphy, Gene Myers, John Nguyen, Mike Paterson, William Pearson,
Pavel Pevzner, Fred Roberts, Hershel Safer, Baruch Schieber, Ron Shamir, Jay Snoddy,
Elizabeth Sweedyk, Sylvia Spengler, Martin Tompa, Esko Ukkonen, Martin Vingron,
Tandy Warnow, and Mike Waterman.
PART I

…for other applications. Users of Melvyl, the on-line catalog of the University of California
library system, often experience long, frustrating delays even for fairly simple matching
requests. Even grepping through a large directory can demonstrate that exact matching is
not yet trivial. Recently we used GCG (a very popular interface to search DNA and protein
databanks) to search Genbank (the major U.S. DNA database) for a thirty-character string,
which is a small string in typical uses of Genbank. The search took over four hours (on
a local machine using a local copy of the database) to find that the string was not there.
And Genbank today is only a fraction of the size it will be when the various genome programs go into full production mode, cranking out massive quantities of sequenced DNA.
Certainly there are faster, common database searching programs (for example, BLAST),
and there are faster machines one can use (for example, an e-mail server is available for
exact and inexact database matching running on a 4,000 processor MasPar computer). But
the point is that the exact matching problem is not so effectively and universally solved
that it needs no further attention. It will remain a problem of interest as the size of the
databases grows and also because exact matching will continue to be a subtask needed for
more complex searches that will be devised. Many of these will be illustrated in this book.
But perhaps the most important reason to study exact matching in detail is to understand
the various ideas developed for it. Even assuming that the exact matching problem itself
is sufficiently solved, the entire field of string algorithms remains vital and open, and the
education one gets from studying exact matching may be crucial for solving less understood
problems. That education takes three forms: specific algorithms, general algorithmic styles,
and analysis and proof techniques. All three are covered in this book, but style and proof
technique get the major emphasis.
Overview of Part I
In Chapter 1 we present naive solutions to the exact matching problem and develop
the fundamental tools needed to obtain more efficient methods. Although the classical
solutions to the problem will not be presented until Chapter 2, we will show at the end of
Chapter 1 that the use of fundamental tools alone gives a simple linear-time algorithm for
exact matching. Chapter 2 develops several classical methods for exact matching, using the
fundamental tools developed in Chapter 1. Chapter 3 looks more deeply at those methods
and extensions of them. Chapter 4 moves in a very different direction, exploring methods
for exact matching based on arithmetic-like operations rather than character comparisons.
Although exact matching is the focus of Part I, some aspects of inexact matching and
the use of wild cards are also discussed. The exact matching problem will be discussed
again in Part II, where it (and extensions) will be solved using suffix trees.
Definition S[i..j] denotes the substring of S that begins at position i and ends at position j of S. In particular, S[1..i] is the prefix of string S that ends at position i, and S[i..|S|] is the suffix of string S that begins at position i, where |S| denotes the number of characters in string S.

Definition S[i..j] is the empty string if i > j.
For example, california is a string, lifo is a substring, cal is a prefix, and ornia is a
suffix.
Definition A proper prefix, suffix, or substring of S is, respectively, a prefix, suffix, or
substring that is not the entire string S, nor the empty string.
Definition For any string S, S(i) denotes the i th character of S.
We will usually use the symbol S to refer to an arbitrary fixed string that has no additional
assumed features or roles. However, when a string is known to play the role of a pattern
or the role of a text, we will refer to the string as P or T respectively. We will use lower
case Greek characters (α, β, γ) to refer to variable strings and use lower case roman
characters to refer to single variable characters.
Definition When comparing two characters, we say that the characters match if they
are equal; otherwise we say they mismatch.
Terminology confusion
The words "string" and "word" are often used synonymously in the computer science
literature, but for clarity in this book we will never use "word" when "string" is meant.
(However, we do use "word" when its colloquial English meaning is intended.)
More confusing, the words "string" and "sequence" are often used synonymously, particularly in the biological literature. This can be the source of much confusion because
"substrings" and "subsequences" are very different objects and because algorithms for substring problems are usually very different than algorithms for the analogous subsequence
problems. The characters in a substring of S must occur contiguously in S, whereas characters in a subsequence might be interspersed with characters not in the subsequence.
Worse, in the biological literature one often sees the word "sequence" used in place of
"subsequence". Therefore, for clarity, in this book we will always maintain a distinction
between "subsequence" and "substring" and never use "sequence" for "subsequence". We
will generally use "string" when pure computer science issues are discussed and use "sequence" or "string" interchangeably in the context of biological applications. Of course,
we will also use "sequence" when its standard mathematical meaning is intended.
The first two parts of this book primarily concern problems on strings and substrings.
Problems on subsequences are considered in Parts III and IV.
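To make the distinction concrete, here is a small C sketch (ours, not the book's) contrasting the two tests; the helper names are illustrative only.

#include <stdio.h>
#include <string.h>

/* substring: the characters of t must occur contiguously in s */
int is_substring(const char *s, const char *t) {
    return strstr(s, t) != NULL;
}

/* subsequence: the characters of t may be interspersed in s
   with characters not in t */
int is_subsequence(const char *s, const char *t) {
    while (*s && *t)
        if (*s++ == *t)
            t++;
    return *t == '\0';
}

int main(void) {
    /* lifo occurs contiguously in california; cfr does not,
       but it is a subsequence (c..f..r) */
    printf("%d %d %d\n",
           is_substring("california", "lifo"),    /* prints 1 */
           is_substring("california", "cfr"),     /* prints 0 */
           is_subsequence("california", "cfr"));  /* prints 1 */
    return 0;
}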
smarter method was assumed to know that character a did not occur again until position 5,
and the even smarter method was assumed to know that the pattern abx was repeated again
starting at position 5. This assumed knowledge is obtained in the preprocessing stage.
For the exact matching problem, all of the algorithms mentioned in the previous section preprocess pattern P. (The opposite approach of preprocessing text T is used in
other algorithms, such as those based on suffix trees. Those methods will be explained
later in the book.) These preprocessing methods, as originally developed, are similar in
spirit but often quite different in detail and conceptual difficulty. In this book we take
a different approach and do not initially explain the originally developed preprocessing
methods. Rather, we highlight the similarity of the preprocessing tasks needed for several
different matching algorithms, by first defining a fundamental preprocessing of P that
is independent of any particular matching algorithm. Then we show how each specific
matching algorithm uses the information computed by the fundamental preprocessing of
P. The result is a simpler, more uniform exposition of the preprocessing needed by several
classical matching methods and a simple linear time algorithm for exact matching based
only on this preprocessing (discussed in Section 1.5). This approach to linear-time pattern
matching was developed in [202].
Definition Given a string S and a position i > 1, let Z_i(S) be the length of the longest substring of S that starts at i and matches a prefix of S. In other words, Z_i(S) is the length of the longest prefix of S[i..|S|] that matches a prefix of S. For example, when S = aabcaabxaaz, then Z_5(S) = 3 (aabc...aabx...), Z_6(S) = 1 (aa...ab...), and Z_9(S) = 2 (aab...aaz).
Figure 1.2: Each solid box represents a substring of S that matches a prefix of S and that starts between
positions 2 and i. Each box is called a Z-box. We use r_i to denote the right-most end of any Z-box that
begins at or to the left of position i and α to denote the substring in the Z-box ending at r_i. Then l_i denotes
the left end of α. The copy of α that occurs as a prefix of S is also shown in the figure.
[Figure 1.1: T = xabxyabxyabxz and P = abxyabxz, with P shown at successive alignments against T.
The first scenario illustrates pure naive matching, and the next two illustrate smarter shifts. A
caret beneath a character indicates a match and a star indicates a mismatch made by the algorithm.]
comparisons of the naive algorithm will be mismatches. This smarter algorithm skips over
the next three shift/compares, immediately moving the left end of P to align with position
6 of T, thus saving three comparisons. How can a smarter algorithm do this? After the ninth
comparison, the algorithm knows that the first seven characters of P match characters 2
through 8 of T. If it also knows that the first character of P (namely a) does not occur again
in P until position 5 of P, it has enough information to conclude that character a does not
occur again in T until position 6 of T. Hence it has enough information to conclude that
there can be no matches between P and T until the left end of P is aligned with position 6
of T. Reasoning of this sort is the key to shifting by more than one character. In addition
to shifting by larger amounts, we will see that certain aligned characters do not need to be
compared.
An even smarter algorithm knows that the next occurrence in P of the first three characters
of P (namely abx) begins at position 5. Then since the first seven characters of P were
found to match characters 2 through 8 of T, this smarter algorithm has enough information to conclude that when the left end of P is aligned with position 6 of T, the next
three comparisons must be matches. This smarter algorithm avoids making those three
comparisons. Instead, after the left end of P is moved to align with position 6 of T, the
algorithm compares character 4 of P against character 9 of T. This smarter algorithm
therefore saves a total of six comparisons over the naive algorithm.
The above example illustrates the kinds of ideas that allow some comparisons to be
skipped, although it should still be unclear how an algorithm can efficiently implement
these ideas. Efficient implementations have been devised for a number of algorithms
such as the Knuth-Morris-Pratt algorithm, a real-time extension of it, the Boyer-Moore
algorithm, and the Apostolico-Giancarlo version of it. All of these algorithms have been
implemented to run in linear time (O(n + m) time). The details will be discussed in the
next two chapters.
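As a baseline for the comparisons discussed above, the following C sketch (ours, not the book's) implements the pure naive method: it tries every alignment of P against T and compares left to right, so it can make Θ(nm) comparisons in the worst case.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "xabxyabxyabxz";   /* the text of Figure 1.1 */
    const char *P = "abxyabxz";        /* the pattern of Figure 1.1 */
    int m = strlen(T), n = strlen(P);

    for (int s = 0; s + n <= m; s++) { /* left end of P over T position s+1 */
        int i = 0;
        while (i < n && P[i] == T[s + i])
            i++;
        if (i == n)
            printf("occurrence with left end at position %d\n", s + 1);
    }
    return 0;
}

On this pair it reports the single occurrence with its left end at position 6 of T, the alignment the smarter methods jump to directly.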
The Z algorithm

Given Z_i for all 1 < i ≤ k − 1 and the current values of r and l, Z_k and the updated r and
l are computed as follows:

Begin
1. If k > r, then find Z_k by explicitly comparing the characters starting at position k to the
characters starting at position 1 of S, until a mismatch is found. The length of the match
is Z_k. If Z_k > 0, then set r to k + Z_k − 1 and set l to k.
2. If k ≤ r, then position k is contained in a Z-box, and hence S(k) is contained in substring
S[l..r] (call it α) such that l > 1 and α matches a prefix of S. Therefore, character S(k)
also appears in position k' = k − l + 1 of S. By the same reasoning, substring S[k..r] (call
it β) must match substring S[k'..Z_l]. It follows that the substring beginning at position k
must match a prefix of S of length at least the minimum of Z_{k'} and |β| (which is r − k + 1).
See Figure 1.3.
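The following C sketch implements the two cases just described (our illustration, not the book's code). The string is stored 1-indexed so that the variables k, l, r, k', and β mirror the text; the two subcases of Case 2 correspond to the cases 2a and 2b analyzed in the correctness proof.

#include <stdio.h>
#include <string.h>

void z_algorithm(const char *s, int n, int *Z) {
    int l = 0, r = 0;                    /* right-most Z-box is s[l..r] */
    for (int k = 2; k <= n; k++) {
        if (k > r) {
            /* Case 1: compare explicitly against the prefix of s */
            int q = 0;
            while (k + q <= n && s[1 + q] == s[k + q]) q++;
            Z[k] = q;
            if (q > 0) { l = k; r = k + q - 1; }
        } else {
            int kp = k - l + 1;          /* k' = k - l + 1 */
            int beta = r - k + 1;        /* |β|, the length of s[k..r] */
            if (Z[kp] < beta) {
                Z[k] = Z[kp];            /* Case 2a: copy Z_{k'} */
            } else {
                /* Case 2b: β is a prefix of s; extend past r explicitly */
                int h = r + 1;
                while (h <= n && s[h] == s[h - k + 1]) h++;
                Z[k] = h - k;
                l = k; r = h - 1;
            }
        }
    }
}

int main(void) {
    const char *orig = "aabcaabxaaz";    /* example from the text */
    char s[64];
    int n = strlen(orig), Z[64];
    memcpy(s + 1, orig, n + 1);          /* shift to 1-indexed storage */
    z_algorithm(s, n, Z);
    for (int k = 2; k <= n; k++)
        printf("Z_%d = %d\n", k, Z[k]);
    return 0;
}

For S = aabcaabxaaz this prints Z_5 = 3, Z_6 = 1, and Z_9 = 2, matching the example given earlier.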
…substring of S that matches a prefix of S and that does not start at position one. Each such
box is called a Z-box. More formally, we have:

Definition For any position i > 1 where Z_i is greater than zero, the Z-box at i is
defined as the interval starting at i and ending at position i + Z_i − 1.

Definition For every i > 1, r_i is the right-most endpoint of the Z-boxes that begin at
or before position i. Another way to state this is: r_i is the largest value of j + Z_j − 1
over all 1 < j ≤ i such that Z_j > 0. (See Figure 1.2.)

We use the term l_i for the value of j specified in the above definition. That is, l_i is
the position of the left end of the Z-box that ends at r_i. In case there is more than one
Z-box ending at r_i, then l_i can be chosen to be the left end of any of those Z-boxes. As an
example, suppose S = aabaabcaxaabaabcy; then Z_10 = 7, r_15 = 16, and l_15 = 10.
The linear time computation of Z values from S is the fundamental preprocessing task
that we will use in all the classical linear-time matching algorithms that preprocess P.
But before detailing those uses, we show how to do the fundamental preprocessing in
linear time.
…for the n characters in P and also maintain the current l and r. Those values are sufficient
to compute (but not store) the Z value of each character in T and hence to identify and
output any position i where Z_i = n.
There is another characteristic of this method worth introducing here: The method is
considered an alphabet-independent linear-time method. That is, we never had to assume
that the alphabet size was finite or that we knew the alphabet ahead of time - a character
comparison only determines whether the two characters match or mismatch; it needs no
further information about the alphabet. We will see that this characteristic is also true of the
Knuth-Morris-Pratt and Boyer-Moore algorithms, but not of the Aho-Corasick algorithm
or methods based on suffix trees.
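The following hedged C sketch shows the resulting matcher (ours, not the book's code): build S = P$T, where $ is assumed to occur in neither P nor T, compute Z values with the algorithm of the previous section, and report every position whose Z value equals n. For simplicity it stores all O(n + m) Z values; as noted above, only the n values for P actually need to be kept.

#include <stdio.h>
#include <string.h>

/* Z values, 1-indexed; same routine as the earlier sketch */
static void z_algorithm(const char *s, int n, int *Z) {
    int l = 0, r = 0;
    for (int k = 2; k <= n; k++) {
        if (k > r) {
            int q = 0;
            while (k + q <= n && s[1 + q] == s[k + q]) q++;
            Z[k] = q;
            if (q > 0) { l = k; r = k + q - 1; }
        } else if (Z[k - l + 1] < r - k + 1) {
            Z[k] = Z[k - l + 1];
        } else {
            int h = r + 1;
            while (h <= n && s[h] == s[h - k + 1]) h++;
            Z[k] = h - k; l = k; r = h - 1;
        }
    }
}

int main(void) {
    const char *P = "abxyabxz", *T = "xabxyabxyabxz";
    int n = strlen(P), m = strlen(T);
    char S[128];
    int Z[128];
    sprintf(S + 1, "%s$%s", P, T);       /* S = P $ T, stored 1-indexed */
    z_algorithm(S, n + m + 1, Z);
    for (int i = n + 2; i <= n + m + 1; i++)
        if (Z[i] == n)                   /* a full copy of P begins here */
            printf("P occurs in T starting at position %d\n", i - n - 1);
    return 0;
}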
1.6. Exercises
The first four exercises use the fact that fundamental preprocessing can be done in linear
time and that all occurrences of P in T can be found in linear time.
1. Use the existence of a linear-time exact matching algorithm to solve the following problem
in linear time. Given two strings α and β, determine if α is a circular (or cyclic) rotation of β,
that is, if α and β have the same length and α consists of a suffix of β followed by a prefix
of β. For example, defabc is a circular rotation of abcdef. This is a classic problem with a
very elegant solution.
…between positions 2 and k − 1 that ends at or after position k. Therefore, when Z_k > 0
in Case 1, the algorithm does find a new Z-box ending at or after k, and it is correct to
change r to k + Z_k − 1. Hence the algorithm works correctly in Case 1.
In Case 2a, the substring beginning at position k can match a prefix of S only for
length Z_{k'} < |β|. If not, then the next character to the right, character k + Z_{k'}, must match
character 1 + Z_{k'}. But character k + Z_{k'} matches character k' + Z_{k'} (since Z_{k'} < |β|), so
character k' + Z_{k'} must match character 1 + Z_{k'}. However, that would be a contradiction
to the definition of Z_{k'}, for it would establish a substring longer than Z_{k'} that starts at k'
and matches a prefix of S. Hence Z_k = Z_{k'} in this case. Further, k + Z_k − 1 < r, so r and
l remain correctly unchanged.
In Case 2b, β must be a prefix of S (as argued in the body of the algorithm) and since
any extension of this match is explicitly verified by comparing characters beyond r to
characters beyond the prefix β, the full extent of the match is correctly computed. Hence
Z_k is correctly obtained in this case. Furthermore, since k + Z_k − 1 ≥ r, the algorithm
correctly changes r and l. ∎
Corollary 1.4.1. Repeating the Z algorithm for each position i > 2 correctly yields all the
Z_i values.

Theorem 1.4.2. All the Z_i(S) values are computed by the algorithm in O(|S|) time.

PROOF The time is proportional to the number of iterations, |S|, plus the number of
character comparisons. Each comparison results in either a match or a mismatch, so we
next bound the number of matches and mismatches that can occur.

Each iteration that performs any character comparisons at all ends the first time it finds
a mismatch; hence there are at most |S| mismatches during the entire algorithm. To bound
the number of matches, note first that r_k ≥ r_{k−1} for every iteration k. Now, let k be an
iteration where q > 0 matches occur. Then r_k is set to r_{k−1} + q at least. Finally,
r_k ≤ |S|, so the total number of matches that occur during any execution of the algorithm is at
most |S|. ∎
…of lengths n and m, and finds the longest suffix of one string that exactly matches a prefix
of the other. The algorithm should run in O(n + m) time.
Suppose S has length n. Give an example to show that two maximal tandem arrays of a
given base β can overlap.

Now give an O(n)-time algorithm that takes S and β as input, finds every maximal tandem
array of β, and outputs the pair (s, k) for each occurrence. Since maximal tandem arrays
of a given base can overlap, a naive algorithm would establish only an O(n²)-time bound.
5. If the Z algorithm finds that Z_2 = q > 0, all the values Z_3, ..., Z_{q+1}, Z_{q+2} can then be
obtained immediately without additional character comparisons and without executing the
main body of Algorithm Z. Flesh out and justify the details of this claim.
6. In Case 2b of the Z algorithm, when Z_{k'} ≥ |β| the algorithm does explicit comparisons
until it finds a mismatch. This is a reasonable way to organize the algorithm, but in fact
Case 2b can be refined so as to eliminate an unneeded character comparison. Argue that
when Z_{k'} > |β| then Z_k = |β| and hence no character comparisons are needed. Therefore,
explicit character comparisons are needed only in the case that Z_{k'} = |β|.
7. If Case 2b of the Z algorithm is split into two cases, one for Z_{k'} > |β| and one for Z_{k'} = |β|,
would this result in an overall speedup of the algorithm? You must consider all operations,
not just character comparisons.
8. Baker [43] introduced the following matching problem and applied it to a problem of software
maintenance: "The application is to track down duplication in a large software system. We
want to find not only exact matches between sections of code, but parameterized matches,
where a parameterized match between two sections of code means that one section can
be transformed into the other by replacing the parameter names (e.g., identifiers and constants) of one section by the parameter names of the other via a one-to-one function".
Now we present the formal definition. Let Σ and Π be two alphabets containing no symbols
in common. Each symbol in Σ is called a token and each symbol in Π is called a parameter.
A string can consist of any combinations of tokens and parameters from Σ and Π. For
example, if Σ is the upper case English alphabet and Π is the lower case alphabet then
XYabCaCXZddW is a legal string over Σ and Π. Two strings S1 and S2 are said to
p-match if and only if

a. Each token in S1 (or S2) is opposite a matching token in S2 (or S1).
b. Each parameter in S1 (or S2) is opposite a parameter in S2 (or S1).
Figure 1.6: A circular string β. The linear string derived from it is accatggc.

…problem is the following. Let β̂ be the linear string obtained from β starting at character 1
and ending at character n. Then α is a substring of circular string β if and only if α is a
substring of some circular rotation of β̂.
A digression on circular strings in DNA

The above two problems are mostly exercises in using the existence of a linear-time exact
matching algorithm, and we don't know any critical biological problems that they address.
However, we want to point out that circular DNA is common and important. Bacterial and
mitochondrial DNA is typically circular, both in its genomic DNA and in additional small
double-stranded circular DNA molecules called plasmids, and even some true eukaryotes
(higher organisms whose cells contain a nucleus) such as yeast contain plasmid DNA in
addition to their nuclear DNA. Consequently, tools for handling circular strings may someday
be of use in those organisms. Viral DNA is not always circular, but even when it is linear
some virus genomes exhibit circular properties. For example, in some viral populations the
linear order of the DNA in one individual will be a circular rotation of the order in another
individual [450]. Nucleotide mutations, in addition to rotations, occur rapidly in viruses, and
a plausible problem is to determine if the DNA of two individual viruses has mutated away
from each other only by a circular rotation, rather than additional mutations.
It is very interesting to note that the problems addressed in the exercises are actually
"solved" in nature. Consider the special case of Exercise 2 when string α has length n.
Then the problem becomes: Is α a circular rotation of β? This problem is solved in linear
time as in Exercise 1. Precisely this matching problem arises and is "solved" in E. coli
replication under the certain experimental conditions described in [475]. In that experiment,
an enzyme (RecA) and ATP molecules (for energy) are added to E. coli containing a single
strand of one of its plasmids, called string β, and a double-stranded linear DNA molecule,
one strand of which is called string α. If α is a circular rotation of β then the strand opposite
to α (which has the DNA sequence complementary to α) hybridizes with β creating a proper
double-stranded plasmid, leaving α as a single strand. This transfer of DNA may be a step
in the replication of the plasmid. Thus the problem of determining whether α is a circular
rotation of β is solved by this natural system.
…nations of the DNA string and the fewest number of indexing steps (when using the codons
to look up amino acids in a table holding the genetic code). Clearly, the three translations
can be done with 3n examinations of characters in the DNA and 3n indexing steps in the
genetic code table. Find a method that does the three translations in at most n character
examinations and n indexing steps.
Hint: If you are acquainted with this terminology, the notion of a finite-state transducer may be
helpful, although it is not necessary.
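For orientation, here is a small C sketch of the straightforward 3n-step translation that the exercise asks you to beat (our illustration; the codon table is deliberately partial, holding only the codons mentioned in the text, and prints '?' for any other triple).

#include <stdio.h>
#include <string.h>

/* partial genetic code: only the codons mentioned in the text */
static char lookup(const char *codon) {
    static const struct { const char *c; char aa; } table[] = {
        {"ttt", 'F'}, {"gtt", 'V'}, {"atg", 'M'}, {"gac", 'D'},
        {"gga", 'G'}, {"tgg", 'W'}, {"acg", 'T'}, {"cgg", 'R'},
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strncmp(codon, table[i].c, 3) == 0)
            return table[i].aa;
    return '?';                          /* not in this partial table */
}

int main(void) {
    const char *dna = "atggacgga";       /* example from the problem statement */
    int n = strlen(dna);
    for (int f = 0; f < 3; f++) {        /* reading frames 1, 2, 3 */
        printf("frame %d: ", f + 1);
        for (int i = f; i + 3 <= n; i += 3)
            putchar(lookup(dna + i));    /* each character is examined three times in total */
        putchar('\n');
    }
    return 0;
}

This prints MDG, WT, and GR for the three frames (Met-Asp-Gly, Trp-Thr, Gly-Arg), as worked out in the problem statement; the exercise asks for the same output using only n character examinations and n lookups.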
11. Let T be a text string of length m and let S be a multiset of n characters. The problem is
to find all substrings in T of length n that are formed by the characters of S. For example,
let S = {a, a, b, c} and T = abahgcabah. Then caba is a substring of T formed from the
characters of S.
Give a solution to this problem that runs in O(m) time. The method should also be able to
state, for each position i , the length of the longest substring in T starting at i that can be
formed from S.
Fantasy protein sequencing. The above problem may become useful in sequencing
protein from a particular organism after a large amount of the genome of that organism
has been sequenced. This is most easily explained in prokaryotes, where the DNA is
not interrupted by introns. In prokaryotes, the amino acid sequence for a given protein
is encoded in a contiguous segment of DNA - one DNA codon for each amino acid in
the protein. So assume we have the protein molecule but do not know its sequence or the
location of the gene that codes for the protein. Presently, chemically determining the amino
acid sequence of a protein is very slow, expensive, and somewhat unreliable. However,
finding the multiset of amino acids that make up the protein is relatively easy. Now suppose
that the whole DNA sequence for the genome of the organism is known. One can use that
long DNA sequence to determine the amino acid sequence of a protein of interest. First,
translate each codon in the DNA sequence into the amino acid alphabet (this may have to
be done three times to get the proper frame) to form the string T; then chemically determine
the multiset S of amino acids in the protein; then find all substrings in T of length |S| that are
formed from the amino acids in S. Any such substrings are candidates for the amino acid
sequence of the protein, although it is unlikely that there will be more than one candidate.
The match also locates the gene for the protein in the long DNA string.
12. Consider the two-dimensional variant of the preceding problem. The input consists of two-dimensional text (say a filled-in crossword puzzle) and a multiset of characters. The problem
is to find a connected two-dimensional substructure in the text that matches all the characters in the multiset. How can this be done? A simpler problem is to restrict the structure
to be rectangular.
13. As mentioned in Exercise 10, there are organisms (some viruses for example) containing
intervals of DNA encoding not just a single protein, but three viable proteins, each read in
a different reading frame. So, if each protein contains n amino acids, then the DNA string
encoding those three proteins is only n + 2 nucleotides (characters) long. That is a very
compact encoding.
(Challenging problem?) Give an algorithm for the following problem: The input is a protein
string S1 (over the amino acid alphabet) of length n and another protein string S2 of length
m > n. Determine if there is a string specifying a DNA encoding for S2 that contains a
substring specifying a DNA encoding of S1. Allow the encoding of S1 to begin at any point
in the DNA string for S2 (i.e., in any reading frame of that string). The problem is difficult
because of the degeneracy of the genetic code and the ability to use any reading frame.
…whereas a parameter represents a program's variable, which can be renamed as long as
all occurrences of the variable are renamed consistently. Thus if S1 and S2 p-match, then
the variable names in S1 could be changed to the corresponding variable names in S2,
making the two programs identical. If these two programs were part of a larger program,
then they could both be replaced by a call to a single subroutine.
The most basic p-match problem is: Given a text T and a pattern P, each a string over Σ
and Π, find all substrings of T that p-match P. Of course, one would like to find all those
occurrences in O(|P| + |T|) time. Let function Z_i^p for a string S be the length of the longest
string starting at position i in S that p-matches a prefix of S[1..i]. Show how to modify
algorithm Z to compute all the Z_i^p values in O(|S|) time (the implementation details are
slightly more involved than for function Z_i, but not too difficult). Then show how to use the
modified algorithm Z to find all substrings of T that p-match P, in O(|P| + |T|) time.

In [43] and [239], more involved versions of the p-match problem are solved by more
complex methods.
The following three problems can be solved without the Z algorithm or other
fancy tools. They only require thought.
9. You are given two strings of n characters each and an additional parameter k. In each
string there are n − k + 1 substrings of length k, and so there are Θ(n²) pairs of substrings,
where one substring is from one string and one is from the other. For a pair of substrings,
we define the match-count as the number of opposing characters that match when the two
substrings of length k are aligned. The problem is to compute the match-count for each
of the Θ(n²) pairs of substrings from the two strings. Clearly, the problem can be solved
with O(kn²) operations (character comparisons plus arithmetic operations). But by better
organizing the computations, the time can be reduced to O(n²) operations. (From Paul
Horton.)
10. A DNA molecule can be thought of as a string over an alphabet of four characters {a, t, c, g}
(nucleotides), while a protein can be thought of as a string over an alphabet of twenty characters (amino acids). A gene, which is physically embedded in a DNA molecule, typically
encodes the amino acid sequence for a particular protein. This is done as follows. Starting
at a particular point in the DNA string, every three consecutive DNA characters encode a
single amino acid character in the protein string. That is, three DNA nucleotides specify
one amino acid. Such a coding triple is called a codon, and the full association of codons
to amino acids is called the genetic code. For example, the codon ttt codes for the amino
acid Phenylalanine (abbreviated in the single character amino acid alphabet as F), and
the codon gtt codes for the amino acid Valine (abbreviated as V). Since there are 4³ = 64
possible triples but only twenty amino acids, there is a possibility that two or more triples
form codons for the same amino acid and that some triples do not form codons. In fact,
this is the case. For example, the amino acid Leucine is coded for by six different codons.
Problem: Suppose one is given a DNA string of n nucleotides, but you don't know the correct "reading frame". That is, you don't know if the correct decomposition of the string into
codons begins with the first, second, or third nucleotide of the string. Each such "frameshift"
potentially translates into a different amino acid string. (There are actually known genes
where each of the three reading frames not only specifies a string in the amino acid alphabet, but each specifies a functional, yet different, protein.) The task is to produce, for each
of the three reading frames, the associated amino acid string. For example, consider the
string atggacgga. The first reading frame has three complete codons, atg, gac, and gga,
which in the genetic code specify the amino acids Met, Asp, and Gly. The second reading
frame has two complete codons, tgg and acg, coding for amino acids Trp and Thr. The third
reading frame has two complete codons, gga and cgg, coding for amino acids Gly and Arg.
The goal is to produce the three translations, using the fewest number of character exami-
T: xpbctbxabpqxctbpq
P: tpabxab
Exact Matching:
Classical Comparison-Based Methods
2.1. Introduction
This chapter develops a number of classical comparison-based matching algorithms for
the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing
pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite
different in conceptual difficulty. Some of the original preprocessing methods are quite
difficult.¹ This chapter does not follow the original preprocessing methods but instead
exploits fundamental preprocessing, developed in the previous chapter, to implement the
needed preprocessing for each specific matching algorithm.
Also, in contrast to previous expositions, we emphasize the Boyer-Moore method over
the Knuth-Morris-Pratt method, since Boyer-Moore is the practical method of choice
for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for
historical reasons, but mostly because it generalizes to problems such as real-time string
matching and matching against a set of patterns more easily than Boyer-Moore does.
These two topics will be described in this chapter and the next.
¹ Sedgewick [401] writes "Both the Knuth-Morris-Pratt and the Boyer-Moore algorithms require some complicated
preprocessing on the pattern that is difficult to understand and has limited the extent to which they are used". In
agreement with Sedgewick, I still do not understand the original Boyer-Moore preprocessing method for the strong
good suffix rule.
[Figure 2.1: Good suffix shift rule, where character x of T mismatches with character y of P. Characters
y and z of P are guaranteed to be distinct by the good suffix rule, so z has a chance of matching x.]
…good suffix rule. The original preprocessing method [278] for the strong good suffix
rule is generally considered quite difficult and somewhat mysterious (although a weaker
version of it is easy to understand). In fact, the preprocessing for the strong rule was given
incorrectly in [278] and corrected, without much explanation, in [384]. Code based on
[384] is given without real explanation in the text by Baase [32], but there are no published
sources that try to fully explain the method. Pascal code for strong preprocessing, based
on an outline by Richard Cole [107], is shown in Exercise 24 at the end of this chapter.

In contrast, the fundamental preprocessing of P discussed in Chapter 1 makes the
needed preprocessing very simple. That is the approach we take here. The strong good
suffix rule is: …

P: qcabdabdab
   1234567890
Theorem 2.2.2. L(i) is the largest index j less than n such that N_j(P) ≥ |P[i..n]| (which
is n − i + 1). L'(i) is the largest index j less than n such that N_j(P) = |P[i..n]| = (n − i + 1).

Given Theorem 2.2.2, it follows immediately that all the L'(i) values can be accumulated
in linear time from the N values using the following algorithm:

Z-based Boyer-Moore

for i := 1 to n do L'(i) := 0;
for j := 1 to n − 1 do
begin
    i := n − N_j(P) + 1;
    L'(i) := j;
end;

The L(i) values (if desired) can be obtained by adding the following lines to the above
pseudocode:

L(2) := L'(2);
for i := 3 to n do L(i) := max[L(i − 1), L'(i)];

We leave the proof, as well as the problem of how to accumulate the l'(i) values in
linear time, as a simple exercise. (Exercise 9 of this chapter.)
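The following C sketch performs this accumulation (our illustration, not the book's code). For brevity it computes each N_j(P) by direct right-to-left comparison, which is O(n²) in total; linear time follows by instead running the Z algorithm on the reversed pattern, since N_j(P) = Z_{n−j+1}(P^r).

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *P0 = "qcabdabdab";       /* example pattern from the text */
    int n = strlen(P0);
    char P[32];
    int N[32], Lp[34];
    memcpy(P + 1, P0, n + 1);            /* 1-indexed: P[1..n] */

    /* N_j(P): longest suffix of P[1..j] that is also a suffix of P */
    for (int j = 1; j <= n; j++) {
        int t = 0;
        while (t < j && P[j - t] == P[n - t]) t++;
        N[j] = t;
    }

    /* Z-based Boyer-Moore accumulation, exactly as in the pseudocode */
    for (int i = 0; i <= n + 1; i++) Lp[i] = 0;
    for (int j = 1; j <= n - 1; j++) {
        int i = n - N[j] + 1;
        Lp[i] = j;                       /* L'(i) := j */
    }
    for (int i = 2; i <= n; i++)
        printf("L'(%d) = %d\n", i, Lp[i]);
    return 0;
}

For P = qcabdabdab it prints L'(6) = 7 and L'(9) = 4, with every other L'(i) equal to zero.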
P: qcabdabdab
Note that the extended bad character rule would have shifted P by only one place in
this example.
Theorem 2.2.1. The use of the good suffix rule never shifts P past an occurrence in T.

PROOF Suppose the right end of P is aligned with character k of T before the shift, and
suppose that the good suffix rule shifts P so its right end aligns with character k' > k.
Any occurrence of P ending at a position strictly between k and k' would immediately
violate the selection rule for k', since it would imply either that a closer copy of t occurs
in P or that a longer prefix of P matches a suffix of t. ∎
The original published Boyer-Moore algorithm [75] uses a simpler, weaker, version of
the good suffix rule. That version just requires that the shifted P agree with the t and does
not specify that the next characters to the left of those occurrences of t be different. An
explicit statement of the weaker rule can be obtained by deleting the italics phrase in the
first paragraph of the statement of the strong good suffix rule. In the previous example, the
weaker shift rule shifts P by three places rather than six. When we need to distinguish the
two rules, we will call the simpler rule the weak good suffix rule and the rule stated above
the strong good suffix rule. For the purpose of proving that the search part of Boyer-Moore
runs in linear worst-case time, the weak rule is not sufficient, and in this book the strong
version is assumed unless stated otherwise.
L(i) gives the right end-position of the right-most copy of P[i..n] that is not a suffix of
P, whereas L'(i) gives the right end-position of the right-most copy of P[i..n] that is not
a suffix of P, with the stronger, added condition that its preceding character is unequal
to P(i − 1). So, in the strong-shift version of the Boyer-Moore algorithm, if character
i − 1 of P is involved in a mismatch and L'(i) > 0, then P is shifted right by n − L'(i)
positions. The result is that if the right end of P was aligned with position k of T before
the shift, then position L'(i) is now aligned with position k.

During the preprocessing stage of the Boyer-Moore algorithm L'(i) (and L(i), if desired) will be computed for each position i in P. This is done in O(n) time via the following
definition and theorem.

Definition For string P, N_j(P) is the length of the longest suffix of the substring
P[1..j] that is also a suffix of the full string P.
…Boyer-Moore method has a worst-case running time of O(m) provided that the pattern
does not appear in the text. This was first proved by Knuth, Morris, and Pratt [278], and an
alternate proof was given by Guibas and Odlyzko [196]. Both of these proofs were quite
difficult and established worst-case time bounds no better than 5m comparisons. Later,
Richard Cole gave a much simpler proof [108] establishing a bound of 4m comparisons
and also gave a difficult proof establishing a tight bound of 3m comparisons. We will
present Cole's proof of 4m comparisons in Section 3.2.
When the pattern does appear in the text then the original Boyer-Moore method runs in
O(nm) worst-case time. However, several simple modifications to the method correct this
problem, yielding an O(m) time bound in all cases. The first of these modifications was
due to Galil [168]. After discussing Cole's proof, in Section 3.2, for the case that P doesn't
occur in T, we use a variant of Galil's idea to achieve the linear time bound in all cases.

At the other extreme, if we only use the bad character shift rule, then the worst-case
running time is O(nm), but assuming randomly generated strings, the expected running
time is sublinear. Moreover, in typical string matching applications involving natural
language text, a sublinear running time is almost always observed in practice. We won't
discuss random string analysis in this book but refer the reader to [184].
Although Cole's proof for the linear worst case is vastly simpler than earlier proofs,
and is important in order to complete the full story of Boyer-Moore, it is not trivial.
However, a fairly simple extension of the Boyer-Moore algorithm, due to Apostolico and
Giancarlo [26], gives a "Boyer-Moore-like" algorithm that allows a fairly direct proof of
a 2m worst-case bound on the number of comparisons. The Apostolico-Giancarlo variant
of Boyer-Moore is discussed in Section 3.1.
We will present several solutions to that set problem including the Aho-Corasick method in Section 3.4. For those
reasons, and for its historical role in the field, we fully develop the Knuth-Morris-Pratt method here.
k := n;
while k ≤ m do
begin
    i := n;
    h := k;
    while i > 0 and P(i) = T(h) do
    begin
        i := i − 1;
        h := h − 1;
    end;
    if i = 0 then
    begin
        report an occurrence of P in T ending at position k.
        k := k + n − l'(2);
    end
    else
        shift P (increase k) by the maximum amount determined by the
        (extended) bad character rule and the good suffix rule.
end;
Note that although we have always talked about "shifting P", and given rules to determine by how much P should be "shifted", there is no shifting in the actual implementation.
Rather, the index k is increased to the point where the right end of P would be "shifted".
Hence, each act of shifting P takes constant time.
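To make the bookkeeping concrete, here is a hedged C sketch of the search loop using only the simple bad character rule (our illustration; the good suffix shift is omitted, so this is the O(nm) worst-case variant discussed earlier). R[x] holds the right-most position of character x in P, or 0 if x does not occur in P.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T0 = "xpbctbxabpqxctbpq";   /* example pair from the text */
    const char *P0 = "tpabxab";
    int m = strlen(T0), n = strlen(P0);
    char T[64], P[64];
    memcpy(T + 1, T0, m + 1);               /* 1-indexed, as in the pseudocode */
    memcpy(P + 1, P0, n + 1);

    int R[256] = {0};                       /* bad character table */
    for (int i = 1; i <= n; i++)
        R[(unsigned char)P[i]] = i;

    int k = n;                              /* right end of P aligned with T(k) */
    while (k <= m) {
        int i = n, h = k;
        while (i > 0 && P[i] == T[h]) { i--; h--; }   /* right-to-left scan */
        if (i == 0) {
            printf("occurrence of P ending at position %d\n", k);
            k += 1;                         /* this sketch just shifts by one after a match */
        } else {
            int shift = i - R[(unsigned char)T[h]];   /* bad character rule */
            k += shift > 1 ? shift : 1;     /* always move at least one place */
        }
    }
    return 0;
}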
We will later show, in Section 3.2, that by using the strong good suffix rule alone, the…
[Figure 2.2: Assumed missed occurrence used in correctness proof for Knuth-Morris-Pratt.]
Theorem 2.3.2. For any alignment of P with T, if characters 1 through i of P match the
opposing characters of T but character i + 1 mismatches T(k), then P can be shifted by
i − sp'_i places to the right without passing any occurrence of P in T.

PROOF Suppose not, so that there is an occurrence of P starting strictly to the left of
the shifted P (see Figure 2.2), and let α and β be the substrings shown in the figure. In
particular, β is the prefix of P of length sp'_i, shown relative to the shifted position of P.
The unshifted P matches T up through position i of P and position k − 1 of T, and all
characters in the (assumed) missed occurrence of P match their counterparts in T. Both of
these matched regions contain the substrings α and β, so the unshifted P and the assumed
occurrence of P match on the entire substring αβ. Hence αβ is a suffix of P[1..i] that
matches a proper prefix of P. Now let l = |αβ| + 1 so that position l in the "missed
occurrence" of P is opposite position k in T. Character P(l) cannot be equal to P(i + 1)
since P(l) is assumed to match T(k) and P(i + 1) does not match T(k). Thus αβ is a
proper suffix of P[1..i] that matches a prefix of P, and the next character is unequal to
P(i + 1). But |α| > 0 due to the assumption that an occurrence of P starts strictly before
the shifted P, so |αβ| > |β| = sp'_i, contradicting the definition of sp'_i. Hence the theorem
is proved. ∎

Theorem 2.3.2 says that the Knuth-Morris-Pratt shift rule does not miss any occurrence
of P in T, and so the Knuth-Morris-Pratt algorithm will correctly find all occurrences of
P in T. The time analysis is equally simple.
For any alignment of P and T, if the first mismatch (comparing from left to right)
occurs in position i + 1 of P and position k of T, then shift P to the right (relative
to T) so that P[1..sp'_i] aligns with T[k − sp'_i..k − 1]. In other words, shift P exactly
i + 1 − (sp'_i + 1) = i − sp'_i places to the right, so that character sp'_i + 1 of P will
align with character k of T. In the case that an occurrence of P has been found (no
mismatch), shift P by n − sp'_n places.

The shift rule guarantees that the prefix P[1..sp'_i] of the shifted P matches its opposing
substring in T. The next comparison is then made between characters T(k) and P[sp'_i + 1].
The use of the stronger shift rule based on sp'_i guarantees that the same mismatch will not
occur again in the new alignment, but it does not guarantee that T(k) = P[sp'_i + 1].

In the above example, where P = abcxabcde and sp'_7 = 3, if character 8 of P
mismatches then P will be shifted by 7 − 3 = 4 places. This is true even without knowing
T or how P is positioned with T.

The advantage of the shift rule is twofold. First, it often shifts P by more than just a
single character. Second, after a shift, the left-most sp'_i characters of P are guaranteed to
match their counterparts in T. Thus, to determine whether the newly shifted P matches
its counterpart in T, the algorithm can start comparing P and T at position sp'_i + 1
of P (and position k of T). For example, suppose P = abcxabcde as above, T =
xyabcxabcxadcdqfeg, and the left end of P is aligned with character 3 of T. Then P
and T will match for 7 characters but mismatch on character 8 of P, and P will be shifted…

The reader should be alerted that traditionally the Knuth-Morris-Pratt algorithm has been described in terms of
failure functions, which are related to the sp'_i values. Failure functions will be explicitly defined in Section 2.3.3.
sp_n(P) := sp'_n(P);
for i := n − 1 downto 2 do
    sp_i(P) := max[sp_{i+1}(P) − 1, sp'_i(P)];
Definition For each position i from 1 to n + 1, define the failure function F'(i) to be
sp'_{i−1} + 1 (and define F(i) = sp_{i−1} + 1), where sp'_0 and sp_0 are defined to be zero.

We will only use the (stronger) failure function F'(i) in this discussion but will refer to
F(i) later.
After a mismatch in position i + 1 > 1 of P, the Knuth-Morris-Pratt algorithm "shifts"
P so that the next comparison is between the character in position c of T and the character
in position sp'_i + 1 of P. But sp'_i + 1 = F'(i + 1), so a general "shift" can be implemented in
constant time by just setting p to F'(i + 1). Two special cases remain. When the mismatch
occurs in position 1 of P, then p is set to F'(1) = 1 and c is incremented by one. When an
occurrence of P is found, then P is shifted right by n − sp'_n places. This is implemented
by setting F'(n + 1) to sp'_n + 1.
Putting all the pieces together gives the full Knuth-Morris-Pratt algorithm.
Knuth-Morris-Pratt algorithm

begin
    Preprocess P to find F'(k) = sp'_{k−1} + 1 for k from 1 to n + 1.
    c := 1;
    p := 1;
    while c + (n − p) ≤ m
    do begin
        while P(p) = T(c) and p ≤ n
        do begin
            p := p + 1;
            c := c + 1;
        end;
        if p = n + 1 then
            report an occurrence of P starting at position c − n of T.
        if p = 1 then c := c + 1;
        p := F'(p);
    end;
end.
Given Theorem 2.3.4, all the sp' and sp values can be computed in linear time using
the Z_i values as follows:
Z-based Knuth-Morris-Pratt

for i := 1 to n do
    sp'_i := 0;
for j := n downto 2 do
begin
    i := j + Z_j(P) − 1;
    sp'_i := Z_j;
end;
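Putting the pieces together, here is a hedged C sketch of the whole method (ours, not the book's code): Z values of P give the sp' values via the Z-based preprocessing above, F'(k) = sp'_{k−1} + 1, and the search loop follows the pseudocode of the previous section. Everything is 1-indexed to match the text.

#include <stdio.h>
#include <string.h>

/* Z values, 1-indexed, as in the sketch of Chapter 1 */
static void z_algorithm(const char *s, int n, int *Z) {
    int l = 0, r = 0;
    for (int k = 2; k <= n; k++) {
        if (k > r) {
            int q = 0;
            while (k + q <= n && s[1 + q] == s[k + q]) q++;
            Z[k] = q;
            if (q > 0) { l = k; r = k + q - 1; }
        } else if (Z[k - l + 1] < r - k + 1) {
            Z[k] = Z[k - l + 1];
        } else {
            int h = r + 1;
            while (h <= n && s[h] == s[h - k + 1]) h++;
            Z[k] = h - k; l = k; r = h - 1;
        }
    }
}

int main(void) {
    const char *P0 = "abxyabxz", *T0 = "xabxyabxyabxz";  /* Figure 1.1 pair */
    int n = strlen(P0), m = strlen(T0);
    char P[64], T[64];
    int Z[64], sp[66], Fp[66];
    memcpy(P + 1, P0, n + 1);
    memcpy(T + 1, T0, m + 1);

    z_algorithm(P, n, Z);
    for (int i = 0; i <= n; i++) sp[i] = 0;      /* sp'_i := 0, with sp'_0 = 0 */
    for (int j = n; j >= 2; j--)                 /* Z-based Knuth-Morris-Pratt */
        sp[j + Z[j] - 1] = Z[j];
    for (int k = 1; k <= n + 1; k++)
        Fp[k] = sp[k - 1] + 1;                   /* F'(k) = sp'_{k-1} + 1 */

    int c = 1, p = 1;                            /* positions in T and P */
    while (c + (n - p) <= m) {
        while (p <= n && P[p] == T[c]) { p++; c++; }
        if (p == n + 1)
            printf("occurrence of P starting at position %d of T\n", c - n);
        if (p == 1) c++;                         /* mismatch at position 1 of P */
        p = Fp[p];
    }
    return 0;
}

On the Figure 1.1 pair this reports the occurrence starting at position 6 of T; after the first mismatch at position 8 of P, the shift sets p to F'(8) = 4, so the next comparison is P(4) against T(9), exactly the scenario described for the "even smarter" method in Chapter 1.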
…shift rule, the method becomes real time because it still never reexamines a position in T
involved in a match (a feature inherited from the Knuth-Morris-Pratt algorithm), and it
now also never reexamines a position involved in a mismatch. So, the search stage of this
algorithm never examines a character in T more than once. It follows that the search is
done in real time. Below we show how to find all the sp'_{(i,x)} values in linear time. Together,
this gives an algorithm that does linear preprocessing of P and real-time search of T.

It is easy to establish that the algorithm finds all occurrences of P in T, and we leave
that as an exercise.

Note that the linear time (and space) bound for this method requires that the alphabet Σ
be finite. This allows us to do |Σ| comparisons in constant time. If the size of the alphabet
is explicitly included in the time and space bounds, then the preprocessing time and space
needed for the algorithm is O(|Σ|n).
2.5. Exercises
1. In "typical" applications of exact matching, such as when searching for an English word
in a book, the simple bad character rule seems to be as effective as the extended bad
character rule. Give a "hand-waving" explanation for this.
2. When searching for a single word or a small phrase in a large English text, brute force
(the naive algorithm) is reported [184] to run faster than most other methods. Give a hand-waving explanation for this. In general terms, how would you expect this observation to
hold up with smaller alphabets (say in DNA with an alphabet size of four), as the size
of the pattern grows, and when the text has many long sections of similar but not exact
substrings?
3. "Common sense" and the O(nm) worst-case time bound of the Boyer-Moore algorithm
(using only the bad character rule) both would suggest that empirical running times increase
with increasing pattern length (assuming a fixed text). But when searching in actual English
…case an occurrence of P in T has been found) or until a mismatch occurs at some position
i + 1 of P and k of T. In the latter case, if sp'_i > 0, then P is shifted right by i − sp'_i positions,
guaranteeing that the prefix P[1..sp'_i] of the shifted pattern matches its opposing substring
in T. No explicit comparison of those substrings is needed, and the next comparison is
between characters T(k) and P(sp'_i + 1). Although the shift based on sp'_i guarantees that
P(i + 1) differs from P(sp'_i + 1), it does not guarantee that T(k) = P(sp'_i + 1). Hence
T(k) might be compared several times (perhaps Ω(|P|) times) with differing characters
in P. For that reason, the Knuth-Morris-Pratt method is not a real-time method.
To be real time, a method must do at most a constant amount of work between the
time it first examines any position in T and the time it last examines that position. In the
Knuth-Morris-Pratt method, if a position of T is involved in a match, it is never examined
again (this is easy to verify) but, as indicated above, this is not true when the position is
involved in a mismatch. Note that the definition of real time only concerns the search stage
of the algorithm. Preprocessing of P need not be real time. Note also that if the search
stage is real time it certainly is also linear time.
The utility of a real-time matcher is twofold. First, in certain applications, such as when
the characters of the text are being sent to a small memory machine, one might need to
guarantee that each character can be fully processed before the next one is due to arrive.
If the processing time for each character is constant, independent of the length of the
string, then such a guarantee may be possible. Second, in this particular real-time matcher,
the shifts of P may be longer but never shorter than in the original Knuth-Morris-Pratt
algorithm. Hence, the real-time matcher may run faster in certain problem instances.
Admittedly, arguments in favor of real-time matching algorithms over linear-time methods are somewhat tortured, and the real-time matching is more a theoretical issue than a
practical one. Still, it seems worthwhile to spend a little time discussing real-time matching.
Definition Let x denote a character of the alphabet. For each position i in pattern P,
define sp'_(i,x) to be the length of the longest proper suffix of P[1..i] that matches a
prefix of P, with the added condition that character P(sp'_(i,x) + 1) is x.
Knowing the sp'_(i,x) values for each character x in the alphabet allows a shift rule
that converts the Knuth-Morris-Pratt method into a real-time algorithm. Suppose P is
compared against a substring of T and a mismatch occurs at characters T(k) = x and
P(i + 1). Then P should be shifted right by i − sp'_(i,x) places. This shift guarantees that the
prefix P[1..sp'_(i,x)] matches the opposing substring in T and that T(k) matches the next
character in P. Hence, the comparison between T(k) and P(sp'_(i,x) + 1) can be skipped.
The next needed comparison is between characters P(sp'_(i,x) + 2) and T(k + 1). With this
16. Suppose there is a position i such that sp_i maps to k, and let i be the largest such position.
Prove that Z_k = i − k + 1 = sp_i and that r_k = i.
17. Given the answer to the previous exercise, it is natural to conjecture that Z_k always equals
sp_i, where i is the largest position such that sp_i maps to k. Show that this is not true. Give
an example using at least three distinct characters.
Stated another way, give an example to show that Z_k can be greater than zero even when
there is no position i such that sp_i maps to k.
18. Recall that r_{k−1} is known at the start of iteration k of the Z algorithm (when Z_k is computed),
but r_k is known only at the end of iteration k. Suppose, however, that r_k is known (somehow)
at the start of iteration k. Show how the Z algorithm can then be modified to compute Z_k
using no character comparisons. Hence this modified algorithm need not even know the
string S.
19. Prove that if Z_k is greater than zero, then r_k equals the largest position i such that
k > i − sp_i. Conclude that r_k can be deduced from the sp values for every position k where Z_k is
not zero.
20. Combine the answers to the previous two exercises to create a linear-time algorithm that
computes all the Z values for a string S given only the sp values for S and not the string
S itself.
4. Evaluate empirically the utility of the extended bad character rule compared to the original
bad character rule. Perform the evaluation in combination with different choices for the two
good-suffix rules. How much more is the average shift using the extended rule? Does the
extra shift pay for the extra computation needed to implement it? (A small preprocessing
sketch that could seed such an experiment follows this exercise.)
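For instance, the extended bad character rule's preprocessing can be sketched as follows. This is an illustrative Python fragment, not the book's implementation: the names are this sketch's own, and the linear scan per lookup can be made O(1) amortized per shift with the saved-pointer trick the text describes.

    def build_ext_bad_char(P):
        # For each character, the positions (0-based) where it occurs in P,
        # in decreasing order; O(n) list entries in total.
        table = {}
        for j, c in enumerate(P):
            table.setdefault(c, []).append(j)
        for positions in table.values():
            positions.reverse()
        return table

    def ext_bad_char_shift(table, i, x):
        # Mismatch at pattern position i (0-based) against text character x:
        # shift so the closest occurrence of x to the left of i moves under x.
        for j in table.get(x, ()):
            if j < i:
                return i - j
        return i + 1   # no occurrence of x left of i: move P past the mismatch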
5. Evaluate empirically, using different assumptions about the sizes of P and T, the number
of occurrences of P in T, and the size of the alphabet, the following idea for speeding
up the Boyer-Moore method. Suppose that a phase ends with a mismatch and that the
good suffix rule shifts P farther than the extended bad character rule. Let x and y denote
the mismatching characters in T and P respectively, and let z denote the character in the
shifted P below x. By the suffix rule, z will not be y, but there is no guarantee that it will be
x. So rather than starting comparisons from the right of the shifted P, as the Boyer-Moore
method would do, why not first compare x and z? If they are equal then a right-to-left
comparison is begun from the right end of P, but if they are unequal then we apply the
extended bad character rule from z in P. This will shift P again. At that point we must begin
a right-to-left comparison of P against T.
6. The idea of the bad character rule in the Boyer-Moore algorithm can be generalized so that
instead of examining characters in P from right to left, the algorithm compares characters
in P in the order of how unlikely they are to be in T (most unlikely first). That is, it looks
first at those characters in P that are least likely to be in T. Upon mismatching, the bad
character rule or extended bad character rule is used as before. Evaluate the utility of this
approach, either empirically on real data or by analysis assuming random strings.
7. Construct an example where fewer comparisons are made when the bad character rule is
used alone, instead of combining it with the good suffix rule.
8. Evaluate empirically the effectiveness of the strong good suffix shift for Boyer-Moore versus
the weak shift rule.
9. Give a proof of Theorem 2.2.4. Then show how to accumulate all the L'(i) values in linear
time.
10. If we use the weak good suffix rule in Boyer-Moore that shifts the closest copy of t under
the matched suffix t, but doesn't require the next character to be different, then the
preprocessing for Boyer-Moore can be based directly on sp_i values rather than on Z values.
Explain this.
11. Prove that the Knuth-Morris-Pratt shift rules (either based on sp or sp') do not miss any
occurrences of P in T.
12. It is possible to incorporate the bad character shift rule from the Boyer-Moore method into
the Knuth-Morris-Pratt method or into the naive matching method itself. Show how to do that.
Then evaluate how effective that rule is and explain why it is more effective when used in
the Boyer-Moore algorithm.
13. Recall the definition of l_i on page 8. It is natural to conjecture that sp_i = i − l_i for any index
i, where i ≥ l_i. Show by example that this conjecture is incorrect.
14. Prove the claims in Theorem 2.3.4 concerning sp'_i(P).
15. Is it true that given only the sp values for a given string P, the sp' values are completely
determined? Are the sp values determined from the sp' values alone?
25. Prove that the shift rule used by the real-time string matcher does not miss any occurrences
of P in T.
26. Prove Theorem 2.4.1.
27. In this chapter, we showed how to use Z values to compute both the sp'_i and sp_i values
used in Knuth-Morris-Pratt and the sp'_(i,x) values needed for its real-time extension. Instead
of using Z values for the sp'_(i,x) values, show how to obtain these values from the sp_i and/or
sp'_i values in linear [O(n|Σ|)] time, where n is the length of P and |Σ| is the size of the
alphabet.
28. Although we don't know how to simply convert the Boyer-Moore algorithm to be a real-time
method the way Knuth-Morris-Pratt was converted, we can make similar changes to the
strong shift rule to make the Boyer-Moore shift more effective. That is, when a mismatch
occurs between P(i) and T(h) we can look for the right-most copy in P of P[i + 1..n] (other
than P[i + 1..n] itself) such that the preceding character is T(h). Show how to modify
m := length(p);
writeln('the length of the string is ', m);
end;

procedure gsshift(p: tstring; var gs_shift: indexarray; m: integer);
var
  i, j, j_old, k: integer;
  kmp_shift: indexarray;
  go_on: boolean;
begin {1}
  for j := 1 to m do
    gs_shift[j] := m;
  kmp_shift[m] := 1;
  {stage 1}
  j := m;
Figure 2.3: The pattern P = aqra labels two subpaths of paths starting at the root. Those paths start at
the root, but the subpaths containing aqra do not. There is also another subpath in the tree labeled aqra
(it starts above the character z), but it violates the requirement that it be a subpath of a path starting at the
root. Note that an edge label is displayed from the top of the edge down towards the bottom of the edge.
Thus in the figure, there is an edge labeled "qra", not "arq".
the Boyer-Moore preprocessing so that the needed information is collected in linear time,
assuming a fixed size alphabet.
29. Suppose we are given a tree where each edge is labeled with one or more characters, and
we are given a pattern P. The label of a subpath in the tree is the concatenation of the
labels on the edges in the subpath. The problem is to find all subpaths of paths starting at
the root that are labeled with pattern P. Note that although the subpath must be part of a
path directed from the root, the subpath itself need not start at the root (see Figure 2.3).
Give an algorithm for this problem that runs in time proportional to the total number of
characters on the edges of the tree plus the length of P.
4. If M(h) > N_i and N_i < i, then P matches T from the right end of P down to character
i − N_i + 1 of P, but the next pair of characters mismatch [i.e., P(i − N_i) ≠ T(h − N_i)].
Hence P matches T for j − h + N_i characters and mismatches at position i − N_i of P.
M(j) must be set to a value less than or equal to j − h + N_i. Set M(j) to j − h. Shift
P by the Boyer-Moore rules based on a mismatch at position i − N_i of P (this ends the
phase).
5. If M(h) = N_i and 0 < N_i < i, then P and T must match for at least M(h) characters to
the left, but the left end of P has not yet been reached, so set i to i − M(h) and set h to
h − M(h) and repeat the phase algorithm.
The following definitions and lemma will be helpful in bounding the work done by the
algorithm.
Definition If j is a position where M(j) is greater than zero then the interval
[j − M(j) + 1 .. j] is called a covered interval defined by j.
Definition Let j' < j and suppose covered intervals are defined for both j and j'. We
say that the covered intervals for j and j' cross if j − M(j) + 1 ≤ j' and
j' − M(j') + 1 < j − M(j) + 1 (see Figure 3.2).
Lemma 3.1.1. No covered intervals computed by the algorithm ever cross each other.
Moreover, if the algorithm examines a position h of T in a covered interval, then h is at
the right end of that interval.
Figure 3.1: Substring α has length N_i and substring β has length M(h) > N_i. The two strings must match
from their right ends for N_i characters, but mismatch at the next character.
at position j. As the algorithm proceeds, a value for M(j) is set for every position j in T
that is aligned with the right end of P; M(j) is undefined for all other positions in T.
The second modification exploits the vectors N and M to speed up the Boyer-Moore
algorithm by inferring certain matches and mismatches. To get the idea, suppose the
Boyer-Moore algorithm is about to compare characters P(i) and T(h), and suppose it
knows that M(h) > N_i (see Figure 3.1). That means that an N_i-length substring of P
ends at position i and matches a suffix of P, while an M(h)-length substring of T ends
at position h and matches a suffix of P. So the N_i-length suffixes of those two substrings
must match, and we can conclude that the next N_i comparisons (from P(i) and T(h)
moving leftward) in the Boyer-Moore algorithm would be matches. Further, if N_i = i,
then an occurrence of P in T has been found, and if N_i < i, then we can be sure that
the next comparison (after the N_i matches) would be a mismatch. Hence in simulating
Boyer-Moore, if M(h) ≥ N_i we can avoid at least N_i explicit comparisons. Of course, it
is not always the case that M(h) > N_i, but all the cases are similar and are detailed below.
Phase algorithm
1. If M(h) is undefined or M(h) = N_i = 0, then compare T(h) and P(i) as follows:
If T(h) = P(i) and i = 1, then report an occurrence of P ending at position j of T, set
M(j) = n, and shift as in the Boyer-Moore algorithm (ending this phase).
If T(h) = P(i) and i > 1, then set h to h − 1 and i to i − 1 and repeat the phase
algorithm.
If T(h) ≠ P(i), then set M(j) = j − h and shift P according to the Boyer-Moore rules
based on a mismatch occurring in position i of P (this ends the phase).
2. If M(h) < N_i, then P matches its counterparts in T from position n down to position
i − M(h) + 1 of P. By the definition of M(h), P might match more of T to the left, so
set i to i − M(h), set h to h − M(h), and repeat the phase algorithm.
3. If M(h) ≥ N_i and N_i = i > 0, then declare that an occurrence of P has been found in
T ending at position j. M(j) must be set to a value less than or equal to n. Set M(j) to
j − h, and shift according to the Boyer-Moore rules based on finding an occurrence of P
ending at j (this ends the phase).
if the comparison involving T(h) is a match then, at the end of the phase, M(j) is set at
least as large as j − h + 1. That means that all characters in T that matched a character
of P during that phase are contained in the covered interval [j − M(j) + 1 .. j]. Now the
algorithm only examines the right end of an interval, and if h is the right end of an interval
then M(h) is defined and greater than 0, so the algorithm never compares a character of T
in a covered interval. Consequently, no character of T will ever be compared again after
it is first in a match. Hence the algorithm finds at most m matches, and the total number
of character comparisons is bounded by 2m.
To bound the amount of additional work, we focus on the number of accesses of M
during execution of the five cases since the amount of additional work is proportional to
the number of such accesses. A character comparison is done whenever Case 1 applies.
Whenever Case 3 or 4 applies, P is immediately shifted. Hence Cases 1, 3, and 4 can
apply at most O(m) times since there are at most O(m) shifts and compares. However,
it is possible that Case 2 or Case 5 can apply without an immediate shift or immediate
character comparison. That is, Case 2 or 5 could apply repeatedly before a comparison or
shift is done. For example, Case 5 would apply twice in a row (without a shift or character
comparison) if N_i = M(h) > 0 and N_{i−N_i} = M(h − M(h)). But whenever Case 2 or 5
applies, then j > h and M(j) will certainly get set to j − h + 1 or more at the end of
that phase. So position h will be in the strict interior of the covered interval defined by j.
Therefore, h will never be examined again, and M(h) will never be accessed again. The
effect is that Cases 2 and 5 can apply at most once for any position in T, so the number of
accesses made when these cases apply is also O(m).
Figure 3.2: a. Diagram showing covered intervals that do not cross, although one interval can contain
another. b. Two covered intervals that do cross.
PROOF The proof is by induction on the number of intervals created. Certainly the claim
is true until the first interval is created, and that interval does not cross itself. Now assume
that no intervals cross and consider the phase where the right end of P is aligned with
position j of T .
Since h = j at the start of the phase, and j is to the right of any interval, h begins
outside any interval. We consider how h could first be set to a position inside an interval,
other than the right end of the interval. Rule 1 is never executed when h is at the right end
of an interval (since then M(h) is defined and greater than zero), and after any execution
of Rule 1, either the phase ends or h is decremented by one place. So an execution of Case
1 cannot cause h to move beyond the right-most character of a covered interval. This is
also true for Cases 3 and 4 since the phase ends after either of those cases. So if h is ever
moved into an interval in a position other than its right end, that move must follow an
execution of Case 2 or 5. An execution of Case 2 or 5 moves h from the right end of some
interval I = [k..h] to position k − 1, one place to the left of I. Now suppose that k − 1 is
in some interval I' but is not at its right end, and that this is the first time in the phase that
h (presently k − 1) is in an interval in a position other than its right end. That means that
the right end of I cannot be to the left of the right end of I' (for then position k − 1 would
have been strictly inside I'), and the right ends of I and I' cannot be equal (since M(h)
has at most one value for any h). But these conditions imply that I and I' cross, which
is assumed to be untrue. Hence, if no intervals cross at the start of the phase, then in that
phase only the right end of any covered interval is examined.
A new covered interval gets created in the phase only after the execution of Case 1, 3,
or 4. In any of these cases, the interval [h + 1 .. j] is created after the algorithm examines
position h. In Case 1, h is not in any interval, and in Cases 3 and 4, h is the right end of
an interval, so in all cases h + 1 is either not in a covered interval or is at the left end of
an interval. Since j is to the right of any interval, and h + 1 is either not in an interval
or is the left end of one, the new interval [h + 1 .. j] does not cross any existing interval.
The previously existing intervals have not changed, so there are no crossing intervals at
the end of the phase, and the induction is complete.
be periodic. For example, abababab is periodic with period abab and also with shorter
period ab. An alternate definition of a semiperiodic string is sometimes useful.
Lemma 3.2.2. A string α is semiperiodic with period β if and only if it is prefix semiperiodic
with a period of the same length as β.
For example, the string abaabaabaabaabaab is semiperiodic with period aab and is
prefix semiperiodic with period aba.
The following useful lemma is easy to verify, and its proof is typical of the style of
thinking used in dealing with overlapping matches.
Lemma 3.2.3. Suppose pattern P occurs in text T starting at positions p and p' > p,
where p' − p ≤ ⌊n/2⌋. Then P is semiperiodic with period p' − p.
The following lemma, called the GCD Lemma, is a very powerful statement about
periods of strings. We won't need the lemma in our discussion of Cole's proof, but it is
natural to state it here. We will prove it and use it in Section 16.17.5.
Lemma 3.2.4. Suppose string α is semiperiodic with both a period of length p and a
period of length q, and |α| ≥ p + q. Then α is semiperiodic with a period whose length
is the greatest common divisor of p and q.
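The definition of semiperiodicity can be checked mechanically. The following tiny Python fragment is an illustration only (the function name and decomposition test are this sketch's own); it verifies the example string given above.

    def is_semiperiodic(alpha, beta):
        # alpha is semiperiodic with period beta if alpha = beta' beta^i for
        # some i >= 1, where beta' is a (possibly empty) proper suffix of beta.
        la, lb = len(alpha), len(beta)
        if lb == 0 or la < lb:
            return False
        i, rem = divmod(la, lb)
        if alpha[rem:] != beta * i:
            return False
        return rem == 0 or alpha[:rem] == beta[lb - rem:]

    # the example from the text:
    assert is_semiperiodic("abaabaabaabaabaab", "aab")    # period aab
    assert is_semiperiodic("abababab", "abab") and is_semiperiodic("abababab", "ab")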
PROOF Starting from the right end of P, mark off substrings of length s_i, until less than
s_i characters remain on the left (see Figure 3.4). There will be at least three full substrings
since |t_i| + 1 > 3s_i. Phase i ends by shifting P right by s_i positions. Consider how
P aligns with T before and after that shift (see Figure 3.5). By definition of s_i and α, α is
the part of the shifted P to the right of the original P. By the good suffix rule, the portion
3.2.1. Cole's proof when the pattern does not occur in the text
Definition Let s_i denote the amount by which P is shifted right at the end of phase i.
Assume that P does not occur in T, so the compare part of every phase ends with a
mismatch. In each compare/shift phase, we divide the comparisons into those that compare
a character of T that has previously been compared (in a previous phase) and those
comparisons that compare a character of T for the first time in the execution of the
algorithm. Let g_i be the number of comparisons in phase i of the first type (comparisons
involving a previously examined character of T), and let g'_i be the number of comparisons
in phase i of the second type. Then, over the entire algorithm the number of comparisons
is Σ_{i=1}^{q} (g_i + g'_i), and our goal is to show that this sum is O(m).
Certainly, Σ_{i=1}^{q} g'_i ≤ m since a character can be compared for the first time only once.
We will show that for any phase i, s_i ≥ g_i/3. Then since Σ_{i=1}^{q} s_i ≤ m (because the total
length of all the shifts is at most m) it will follow that Σ_{i=1}^{q} g_i ≤ 3m. Hence the total
number of comparisons done by the algorithm is Σ_{i=1}^{q} (g_i + g'_i) ≤ 4m.
An initial lemma
We start with the following definition and a lemma that is valuable in its own right.
Definition For any string β, β^i denotes the string obtained by concatenating together
i copies of β.
Lemma 3.2.1. Let γ and δ be two nonempty strings such that γδ = δγ. Then δ = ρ^i and
γ = ρ^j for some string ρ and positive integers i and j.
This lemma says that if a string is the same before and after a circular shift (so that it
can be written both as γδ and δγ, for some strings γ and δ) then γ and δ can both be
written as concatenations of some single string ρ.
For example, let δ = abab and γ = ababab, so δγ = ababababab = γδ. Then ρ = ab,
δ = ρ², and γ = ρ³.
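The lemma is easy to test on concrete strings. The following Python fragment is an illustration, not a proof; it uses the standard consequence of the lemma that both strings are powers of their common prefix of gcd length, and checks the example above.

    from math import gcd

    def common_root(gamma, delta):
        # If gamma+delta == delta+gamma, both strings are powers of the
        # prefix of length gcd(|gamma|, |delta|).
        assert gamma + delta == delta + gamma
        g = gcd(len(gamma), len(delta))
        rho = gamma[:g]
        assert gamma == rho * (len(gamma) // g)
        assert delta == rho * (len(delta) // g)
        return rho

    print(common_root("ababab", "abab"))   # prints: ab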
Figure 3.7: The case when the right end of P is aligned with the right end of a full copy of β in t_i.
A mismatch must occur between T(k') and P(k).
concreteness, call that copy β̄ and say that its right end is q|β| places to the left of the right
end of t_i, where q ≥ 1 (see Figure 3.7). We will first deduce how phase h must have ended,
and then we'll use that to prove the lemma.
Let k' be the position in T just to the left of t_i (so T(k') is involved in the mismatch
ending phase i), and let k be the position in P opposite T(k') in phase h. We claim that,
in phase h, the comparison of P and T will find matches until the left end of t_i but then
mismatch when comparing T(k') and P(k). The reason is the following: Strings P and t_i are
semiperiodic with period β, and in phase h the right end of P is aligned with the right end
of some β. So in phase h, P and T will certainly match until the left end of string t_i. Now P
is semiperiodic with β, and in phase h, the right end of P is exactly q|β| places to the left
of the right end of t_i. Therefore, P(k − q|β|) = P(k − (q − 1)|β|) = ··· = P(k). But in
phase i the mismatch occurs when comparing T(k') with P(k − q|β|), so P(k) = P(k − q|β|) ≠ T(k').
Hence, if in phase h the right end of P is aligned with the right end of a β, then phase h
must have ended with a mismatch between T(k') and P(k). This fact will be used below
to prove the lemma.¹
Now we consider the possible shifts of P done in phase h. We will show that every
possible shift leads to a contradiction, so no shifts are possible and the assumed alignment
of P and T in phase h is not possible, proving the lemma.
Since h < i, the right end of P will not be shifted in phase h past the right end of
t_i; consequently, after the phase h shift a character of P is opposite character T(k') (the
character of T that will mismatch in phase i). Consider where the right end of P is after
the phase h shift. There are two cases to consider: 1. Either the right end of P is opposite
the right end of another full copy of β (in t_i) or 2. The right end of P is in the interior of
a full copy of β.
Case 1 If the phase h shift aligns the right end of P with the right end of a full copy
of β, then the character opposite T(k') would be P(k − r|β|) for some r. But since P is
¹ Later we will analyze the Boyer-Moore algorithm when P is in T. For that purpose we note here that when phase
h is assumed to end by finding an occurrence of P, then the proof of Lemma 3.2.6 is complete at this point, having
established a contradiction. That is, on the assumption that the right end of P is aligned with the right end of a β
in phase h, we proved that phase h ends with a mismatch, which would contradict the assumption that h ends by
finding an occurrence of P in T. So even if phase h ends by finding an occurrence of P, the right end of P could
not be aligned with the right end of a β block in phase h.
Figure 3.4: Starting from the right, substrings of length |α| = s_i are marked off in P.
Figure 3.5: The arrows show the string equalities described in the proof.
of the shifted P below t_i must match the portion of the unshifted P below t_i, so the second
marked-off substring from the right end of the shifted P must be the same as the first
substring of the unshifted P. Hence they must both be copies of string α. But the second
substring is the same in both copies of P, so continuing this reasoning we see that all the
s_i-length marked substrings are copies of α and the left-most substring is a suffix of α (if
it is not a complete copy of α). Hence P is semiperiodic with period α. The right-most |t_i|
characters of P match t_i, and so t_i is also semiperiodic with period α. Then since α = β^l,
P and t_i must also be semiperiodic with period β.
Recall that we want to bound g_i, the number of characters compared in the ith phase that
have been previously compared in earlier phases. All but one of the characters compared
in phase i are contained in t_i, and a character in t_i could have previously been examined
only during a phase where P overlaps t_i. So to bound g_i, we closely examine in what ways
P could have overlapped t_i during earlier phases.
Lemma 3.2.6. If |t_i| + 1 > 3s_i, then in any phase h < i, the right end of P could not
have been aligned opposite the right end of any full copy of β in substring t_i of T.
PROOF By Lemma 3.2.5, t_i is semiperiodic with period β. Figure 3.6 shows string t_i as
a concatenation of copies of string β. In phase h, the right end of P cannot be aligned
with the right end of t_i since that is the alignment of P and T in phase i > h, and P must
have moved right between phases h and i. So, suppose, for contradiction, that in phase
h the right end of P is aligned with the right end of some other full copy of β in t_i. For
so that the two characters of P aligned with T(k″) before and after the shift are unequal.
We claim these conditions hold when the right end of P is aligned with the right end of
β'. Consider that alignment. Since P is semiperiodic with period β, that alignment of P
and T would match at least until the left end of t_i and so would match at position k″ of T.
Therefore, the two characters of P aligned with T(k″) before and after the shift cannot be
equal. Thus if the end of P were aligned with the end of β' then all the characters of T
that matched in phase h would again match, and the characters of P aligned with T(k″)
before and after the shift would be different. Hence the good suffix rule would not shift
the right end of P past the right end of β'.
Therefore, if the right end of P is aligned in the interior of β' in phase h, it must also
be in the interior of β' in phase h + 1. But h was arbitrary, so the phase-(h + 1) shift would
also not move the right end of P past β'. So if the right end of P is in the interior of β' in
phase h, it remains there forever. This is impossible since in phase i > h the right end of
P is aligned with the right end of t_i, which is to the right of β'. Hence the right end of P
is not in the interior of β', and the lemma is proved. □
Note again that Lemma 3.2.8 holds even if phase h is assumed to end by finding an
occurrence of P in T. That is, the proof only needs the assumption that phase i ends with
a mismatch, not that phase h does. In fact, when phase h finds an occurrence of P in T,
then the proof of the lemma only needs the reasoning contained in the first two paragraphs
of the above proof.
Theorem 3.2.1. Assuming P does not occur in T, s_i ≥ g_i/3 in every phase i.
PROOF This is trivially true if s_i ≥ (|t_i| + 1)/3, so assume |t_i| + 1 > 3s_i. By Lemma
3.2.8, in any phase h < i, the right end of P is opposite either one of the left-most
|β| − 1 characters of t_i or one of the right-most |β| characters of t_i (excluding the extreme right
character). By Lemma 3.2.7, at most |β| comparisons are made in phase h < i. Hence
the only characters compared in phase i that could possibly have been compared before
phase i are the left-most |β| − 1 characters of t_i, the right-most 2|β| characters of t_i, or
the character just to the left of t_i. So g_i ≤ 3|β| ≤ 3s_i when |t_i| + 1 > 3s_i. In both cases
then, s_i ≥ g_i/3.
Theorem 3.2.2. [108] Assuming that P does not occur in T, the worst-case number of
comparisons made by the Boyer-Moore algorithm is at most 4m.
PROOF As noted before, Σ_{i=1}^{q} g'_i ≤ m and Σ_{i=1}^{q} s_i ≤ m, so the total number of
comparisons done by the algorithm is Σ_{i=1}^{q} (g_i + g'_i) ≤ Σ_{i=1}^{q} (3s_i + g'_i) ≤ 3m + m = 4m.
3.2.2. The case when the pattern does occur in the text
Consider P consisting of n copies of a single character and T consisting of m copies of
the same character. Then P occurs in T starting at every position in T except the last n - 1
positions, and the number of comparisons done by the Boyer-Moore algorithm is Θ(nm).
The O(m) time bound proved in the previous section breaks down because it was derived by
showing that g_i ≤ 3s_i, and that required the assumption that phase i ends with a mismatch.
So when P does occur in T (and phases do not necessarily end with mismatches), we must
modify the Boyer-Moore algorithm in order to recover the linear running time. Galil [168]
gave the first such modification. Below we present a version of his idea.
The approach comes from the following observation: Suppose in phase i that the right
end of P is aligned with character k of T, and that P is compared with T down
Figure 3.8: Case when the right end of P is aligned with a character in the interior of a full copy of β.
Then t_i would have a smaller period than β, contradicting the definition of β.
semiperiodic with period β, P(k) must be equal to P(k − r|β|), contradicting the good
suffix rule.
Case 2 Suppose the phase h shift aligns P so that its right end aligns with some character
in the interior of a full copy of β. That means that, in this alignment, the right end of some
β string in P is opposite a character in the interior of that copy of β. Moreover, by the good suffix
rule, the characters in the shifted P below that copy agree with it (see Figure 3.8). Let γδ be
the string in the shifted P positioned opposite that copy in t_i, where γ is the string through the
end of a β in P and δ is the remainder. Since the copy equals β, γ is a suffix of β, δ is a prefix of β, and
|γ| + |δ| = |β|; thus γδ = δγ. By Lemma 3.2.1, however, β = ρ^r for r > 1, which
contradicts the assumption that β is the smallest string such that α = β^l for some l.
Starting with the assumption that in phase h the right end of P is aligned with the right
end of a full copy of β, we reached the conclusion that no shift in phase h is possible.
Hence the assumption is wrong and the lemma is proved.
PROOF Since P is not aligned with the end of any β in phase h, if P matches t_i in T for |β|
or more characters then the right-most |β| characters of P would match a string consisting
of a suffix (γ) of β followed by a prefix (δ) of β. So we would again have β = γδ = δγ,
and by Lemma 3.2.1, this again would lead to a contradiction to the selection of β.
Note again that this lemma holds even if phase h is assumed to find an occurrence of
P. That is, nowhere in the proof is it assumed that phase h ends with a mismatch, only
that phase i does. This observation will be used later.
Lemma 3.2.8. If |t_i| + 1 > 3s_i, then in phase h < i if the right end of P is aligned with
a character in t_i, it can only be aligned with one of the left-most |β| − 1 characters of t_i
or one of the right-most |β| characters of t_i.
PROOF Suppose in phase h that the right end of P is aligned with a character of t_i
other than one of the left-most |β| − 1 characters or the right-most |β| characters. For
concreteness, say that the right end of P is aligned with a character in copy β' of string
β. Since β' is not the left-most copy of β, the right end of P is at least |β| characters to
the right of the left end of t_i, and so by Lemma 3.2.7 a mismatch would occur in phase
h before the left end of t_i is reached. Say that mismatch occurs at position k″ of T. After
that mismatch, P is shifted right by some amount determined by the good suffix rule. By
Lemma 3.2.6, the phase-h shift cannot move the right end of P to the right end of β', and
we will show that the shift will also not move the end of P past the right end of β'.
Recall that the good suffix rule shifts P (when possible) by the smallest amount so
that all the characters of T that matched in phase h again match with the shifted P and
all comparisons in phases that end with a mismatch have already been accounted for (in
the accounting for phases not in Q) and are ignored here.
Let k' > k > i be a phase in which an occurrence of P is found overlapping the earlier
run but is not part of that run. As an example of such an overlap, suppose P = axaaxa
and T contains the substring axaaxaaxaaxaxaaxaaxa. Then a run begins at the start of
the substring and ends with its twelfth character, and an overlapping occurrence of P (not
part of the run) begins with that character. Even with the Galil rule, characters in the run
will be examined again in phase k', and since phase k' does not end with a mismatch those
comparisons must still be counted.
In phase k', if the left end of the new occurrence of P in T starts at a left end of a copy
of β in the run, then contiguous copies of β continue past the right end of the run. But then
no mismatch would have been possible in phase k since the pattern in phase k is aligned
exactly |β| places to the right of its position in phase k − 1 (where an occurrence of P was
found). So in phase k', the left end of the new P in T must start with an interior character
of some copy of β. But then if P overlaps with the run by more than |β| characters, Lemma
3.2.1 implies that β is periodic, contradicting the selection of β. So P can overlap the run
only by part of the run's left-most copy of β. Further, since phase k' ends by finding an
occurrence of P, the pattern is shifted right by s_{k'} = |β| positions. Thus any phase that
finds an occurrence of P overlapping an earlier run next shifts P by a number of positions
larger than the length of the overlap (and hence the number of comparisons). It follows
then that over the entire algorithm the total number of such additional comparisons in
overlapping regions is O(m).
All comparisons are accounted for and hence Σ_{i∈Q} d_i = O(m), finishing the proof of
the theorem.
Theorem 3.2.4. When both shift rules are used together, the worst-case running time of
the modified Boyer-Moore algorithm remains O(m).
PROOF In the analysis using only the suffix rule we focused on the comparisons done
in an arbitrary phase i. In phase i the right end of P was aligned with some character of
T. However, we never made any assumptions about how P came to be positioned there.
Rather, given an arbitrary placement of P in a phase ending with a mismatch, we deduced
bounds on how many characters compared in that phase could have been compared in
earlier phases. Hence all of the lemmas and analyses remain correct if P is arbitrarily
picked up and moved some distance to the right at any time during the algorithm. The
(extended) bad character rule only moves P to the right, so all lemmas and analyses
showing the O(m) bound remain correct even with its use.
to character s of T. (We don't specify whether the phase ends by finding a mismatch
or by finding an occurrence of P in T.) If the phase-i shift moves P so that its left
end is to the right of character s of T, then in phase i + 1 a prefix of P definitely
matches the characters of T up to T(k). Thus, in phase i + 1, if the right-to-left
comparisons get down to position k of T, the algorithm can conclude that an occurrence of
P has been found even without explicitly comparing characters to the left of T(k + 1).
It is easy to implement this modification to the algorithm, and we assume in the rest
of this section that the Boyer-Moore algorithm includes this rule, which we call the
Galil rule.
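To make the rule concrete, here is a hedged Python sketch. It is not the book's full Boyer-Moore: shifts use only the simple bad character rule, plus a shift by the pattern's period after each occurrence (the situation in which the Galil rule matters), and the variable names are this sketch's own.

    def bm_with_galil(P, T):
        n, m = len(P), len(T)
        if n == 0 or n > m:
            return []
        right = {c: j for j, c in enumerate(P)}   # rightmost position of each char
        # period of P = n minus the length of the longest border of P
        border = [0] * (n + 1)
        for q in range(1, n):
            b = border[q]
            while b and P[q] != P[b]:
                b = border[b]
            border[q + 1] = b + 1 if P[q] == P[b] else 0
        period = n - border[n]
        occurrences = []
        s = 0        # current alignment: P[0] sits under T[s]
        known = 0    # Galil bound: T[..known-1] already matches the aligned prefix
        while s <= m - n:
            i = n - 1
            while i >= 0 and s + i >= known and P[i] == T[s + i]:
                i -= 1
            if i < 0 or s + i < known:
                # comparisons stopped at the Galil bound: occurrence declared
                # without re-examining characters matched in earlier phases
                occurrences.append(s)
                known = s + n
                s += period
            else:
                s += max(1, i - right.get(T[s + i], -1))
                known = 0    # conservative: forget matched prefix after a mismatch
        return occurrences

On a text like "aaaa" with pattern "aa", each occurrence after the first is confirmed by examining only one new character, which is the behavior the run analysis below depends on.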
Theorem 3.2.3. Using the Galil rule, the Boyer-Moore algorithm never does more than
O(m) comparisons, no matter how many occurrences of P there are in T.
PROOF Partition the phases into those that do find an occurrence of P and those that do
not. Let Q be the set of phases of the first type and let d_i be the number of comparisons
done in phase i if i ∈ Q. Then Σ_{i∈Q} d_i + Σ_{i∉Q} (|t_i| + 1) is a bound on the total number
of comparisons done in the algorithm.
The quantity Σ_{i∉Q} (|t_i| + 1) is again O(m). To see this, recall that the lemmas of the
previous section, which proved that g_i ≤ 3s_i, only needed the assumption that phase i
ends with a mismatch and that h < i. In particular, the analysis of how P of phase h < i is
aligned with P of phase i did not need the assumption that phase h ends with a mismatch.
Those proofs cover both the case that h ends with a mismatch and that h ends by finding
an occurrence of P. Hence it again holds that g_i ≤ 3s_i if phase i ends with a mismatch,
even though earlier phases might end with a match.
For phases in Q, we again ignore the case that s_i ≥ (n + 1)/3 ≥ (d_i + 1)/3, since
the total number of comparisons done in such phases must be bounded by Σ_i 3s_i ≤ 3m.
So suppose phase i ends by finding an occurrence of P in T and then shifts by less
than n/3. By a proof essentially the same as for Lemma 3.2.5 it follows that P is
semiperiodic; let β denote the shortest period of P. Hence the shift in phase i moves P
right by exactly |β| positions, and using the Galil rule in the Boyer-Moore algorithm,
no character of T compared in phase i + 1 will have ever been compared previously.
Repeating this reasoning, if phase i + 1 ends by finding an occurrence of P then P
will again shift by exactly |β| places and no comparisons in phase i + 2 will examine
a character of T compared in any earlier phase. This cycle of shifting P by exactly
|β| positions and then identifying another occurrence of P by examining only |β| new
characters of T may be repeated many times. Such a succession of overlapping occurrences
of P then consists of a concatenation of copies of β (each copy of P starts exactly |β|
places to the right of the previous occurrence) and is called a run. Using the Galil rule,
it follows immediately that in any single run the number of comparisons used to identify
the occurrences of P contained in that run is exactly the length of the run. Therefore,
over the entire algorithm the number of comparisons used to find those occurrences is
O(m). If no additional comparisons were possible with characters in a run, then the
analysis would be complete. However, additional examinations are possible and we have
to account for them.
A run ends in some phase k > i when a mismatch is found (or when the algorithm
terminates). It is possible that characters of T in the run could be examined again in phases
after k. A phase that reexamines characters of the run either ends with a mismatch or ends
by finding an occurrence of P that overlaps the earlier run but is not part of it. However,
Figure 3.11: β̄ must be a suffix of α.
sp_k + 1 = |α| + 1, then β̄ would be a prefix of P that is longer than α. But β̄ is also a
proper suffix of P[1..k] (because β̄x is a proper suffix of P[1..k + 1]). Those two facts
would contradict the definition of sp_k (and the selection of α). Hence sp_{k+1} ≤ sp_k + 1.
Now clearly, sp_{k+1} = sp_k + 1 if the character to the right of α is x, since αx would
then be a prefix of P that also occurs as a proper suffix of P[1..k + 1]. Conversely, if
sp_{k+1} = sp_k + 1 then the character after α must be x.
Lemma 3.3.1 identifies the largest "candidate" value for sp_{k+1} and suggests how to
initially look for that value (and for string β). We should first check the character P(sp_k + 1),
just to the right of α. If it equals P(k + 1) then we conclude that β̄ equals α, β is αx,
and sp_{k+1} equals sp_k + 1. But what do we do if the two characters are not equal?
When character P(k + 1) ≠ P(sp_k + 1), then sp_{k+1} < sp_k + 1 (by Lemma 3.3.1), so
sp_{k+1} ≤ sp_k. It follows that β must be a prefix of α, and β̄ must be a proper prefix of α.
Now substring β = β̄x ends at position k + 1 and is of length at most sp_k, whereas α' is
a substring ending at position k and is of length sp_k. So β̄ is a suffix of α', as shown in
Figure 3.11. But since α' is a copy of α, β̄ is also a suffix of α.
In summary, when P(k + 1) ≠ P(sp_k + 1), β̄ occurs as a suffix of α and also as a
proper prefix of α followed by character x. So when P(k + 1) ≠ P(sp_k + 1), β̄ is the
longest proper prefix of α that matches a suffix of α and that is followed by character x in
position |β̄| + 1 of P. See Figure 3.11.
However, since α = P[1..sp_k], we can state this as
(*) β̄ is the longest proper prefix of P[1..sp_k] that matches a suffix of P[1..k] and
that is followed by character x in position |β̄| + 1 of P.
and sp_{k+1}:
(**) β̄ is the longest proper prefix of P[1..k] that matches a suffix of P[1..k] and that
Lemma 3.3.1. For any k, sp_{k+1} ≤ sp_k + 1. Further, sp_{k+1} = sp_k + 1 if and only if the
character after α is x. That is, sp_{k+1} = sp_k + 1 if and only if P(sp_k + 1) = P(k + 1).
PROOF Let β = β̄x denote the prefix of P of length sp_{k+1}. That is, β = β̄x is the
longest proper suffix of P[1..k + 1] that is a prefix of P. If sp_{k+1} is strictly greater than
Figure 3.9: The situation after finding sp_k.
Figure 3.10: sp_{k+1} is found by finding β̄.
each time the for statement is reached; it is assigned a variable number of times inside
the while loop, each time this loop is reached. Hence the number of times v is assigned is
n - 1 plus the number of times it is assigned inside the while loop. How many times that
can be is the key question.
Each assignment of v inside the while loop must decrease the value of v, and each of
the n − 1 times v is assigned at the for statement, its value either increases by one or it
remains unchanged (at zero). The value of v is initially zero, so the total amount that the
value of v can increase (at the for statement) over the entire algorithm is at most n − 1. But
since the value of v starts at zero and is never negative, the total amount that the value of
v can decrease over the entire algorithm must also be bounded by n − 1, the total amount
it can increase. Hence v can be assigned in the while loop at most n − 1 times, and hence
the total number of times that the value of v can be assigned is at most 2(n − 1) = O(n),
and the theorem is proved.
Algorithm SP'(P)
  sp'_1 := 0;
  for i := 2 to n do
  begin
    v := sp_i;
    if P(v + 1) ≠ P(i + 1) then
      sp'_i := v
    else
      sp'_i := sp'_v;
  end;
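In 0-based Python, the loop above might be transcribed as follows. This is a sketch, not the book's code: it consumes the sp values produced by Algorithm SP (whose listing appears with Theorem 3.3.1 below), and the indexing convention is this transcription's own.

    def compute_sp_prime(P, sp):
        # sp[i] = sp_i for the prefix of length i (1-based lengths, as a
        # transcription of Algorithm SP returns them); result uses the same scheme.
        n = len(P)
        spp = [0] * (n + 1)
        for i in range(2, n + 1):
            v = sp[i]
            if i < n and P[v] == P[i]:   # P(v+1) = P(i+1) in the book's notation
                spp[i] = spp[v]          # same next character: inherit sp'_v
            else:
                spp[i] = v
        return spp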
Theorem 3.3.2. Algorithm SP'(P) correctly computes all the sp'_i values in O(n) time.
PROOF The proof is by induction on the value of i. Since sp_1 = 0 and sp'_i ≤ sp_i for all i,
then sp'_1 = 0, and the algorithm is correct for i = 1. Now suppose that the value of sp'_i set
by the algorithm is correct for all i < k and consider i = k. If P(sp_k + 1) ≠ P(k + 1) then
clearly sp'_k is equal to sp_k, since the sp_k-length prefix of P[1..k] satisfies all the needed
requirements. Hence in this case, the algorithm correctly sets sp'_k.
If P(sp_k + 1) = P(k + 1), then sp'_k < sp_k and, since P[1..sp_k] is a suffix of P[1..k],
sp'_k can be expressed as the length of the longest proper prefix of P[1..sp_k] that also
occurs as a suffix of P[1..sp_k] with the condition that P(k + 1) ≠ P(sp'_k + 1). But since
P(k + 1) = P(sp_k + 1), that condition can be rewritten as P(sp_k + 1) ≠ P(sp'_k + 1).
By the induction hypothesis, that value has already been correctly computed as sp'_{sp_k}. So
when P(sp_k + 1) = P(k + 1) the algorithm correctly sets sp'_k to sp'_{sp_k}.
Because the algorithm only does constant work per position, the total time for the
algorithm is O(n).
It is interesting to compare the classical method for computing sp and sp' and the
method based on fundamental preprocessing (i.e., on Z values). In the classical method
the (weaker) sp values are computed first and then the more desirable sp' values are derived
Figure 3.12: "Bouncing ball" cartoon of original Knuth-Morris-Pratt preprocessing. The arrows show the
successive assignments to the variable v.
Algorithm SP(P)
  sp_1 := 0;
  for k := 1 to n − 1 do
  begin
    x := P(k + 1);
    v := sp_k;
    while P(v + 1) ≠ x and v ≠ 0 do
      v := sp_v;
    end {while};
    if P(v + 1) = x then
      sp_{k+1} := v + 1
    else
      sp_{k+1} := 0;
  end;
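Algorithm SP transcribes directly into 0-based Python; the indexing convention and the usage example below are this sketch's own.

    def compute_sp(P):
        # sp[k] = length of the longest proper suffix of P[1..k] (1-based)
        # that is a prefix of P; sp[0] is unused and sp[1] = 0.
        n = len(P)
        sp = [0] * (n + 1)
        for k in range(1, n):
            x = P[k]             # character P(k+1) in the book's notation
            v = sp[k]
            while v != 0 and P[v] != x:
                v = sp[v]        # cascade down earlier sp values
            sp[k + 1] = v + 1 if P[v] == x else 0
        return sp

    # compute_sp("ababab") == [0, 0, 0, 1, 2, 3, 4]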
Theorem 3.3.1. Algorithm SP finds all the sp_i(P) values in O(n) time, where n is the
length of P.
PROOF Note first that the algorithm consists of two nested loops, a for loop and a while
loop. The for loop executes exactly n - 1 times, incrementing the value of k each time.
The while loop executes a variable number of times each time it is entered.
The work of the algorithm is proportional to the number of times the value of v is
assigned. We consider the places where the value of v is assigned and focus on how the
value of v changes over the execution of the algorithm. The value of v is assigned once
Figure 3.14: Pattern P_1 is the string pat. a. The insertion of pattern P_2 when P_2 is pa. b. The insertion
when P_2 is party.
Tree K_1 just consists of a single path of |P_1| edges out of root r. Each edge on this path
is labeled with a character of P_1 and when read from the root, these characters spell out
P_1. The number 1 is written at the node at the end of this path. To create K_2 from K_1,
first find the longest path from root r that matches the characters of P_2 in order. That is,
find the longest prefix of P_2 that matches the characters on some path from r. That path
either ends by exhausting P_2 or it ends at some node v in the tree where no further match
is possible. In the first case, P_2 already occurs in the tree, and so we write the number 2 at
the node where the path ends. In the second case, we create a new path out of v, labeled by
the remaining (unmatched) characters of P_2, and write number 2 at the end of that path.
An example of these two possibilities is shown in Figure 3.14.
In either of the above two cases, K_2 will have at most one branching node (a node with
more than one child), and the characters on the two edges out of the branching node will
be distinct. We will see that the latter property holds inductively for any tree K_i. That is,
at any branching node v in K_i, all edges out of v have distinct labels.
In general, to create K_{i+1} from K_i, start at the root of K_i and follow, as far as possible,
the (unique) path in K_i that matches the characters in P_{i+1} in order. This path is unique
because, at any branching node v of K_i, the characters on the edges out of v are distinct.
If pattern P_{i+1} is exhausted (fully matched), then number the node where the match ends
with the number i + 1. If a node v is reached where no further match is possible but P_{i+1} is
not fully matched, then create a new path out of v labeled with the remaining unmatched
part of P_{i+1} and number the endpoint of that path with the number i + 1.
During the insertion of P_{i+1}, the work done at any node is bounded by a constant, since
the alphabet is finite and no two edges out of a node are labeled with the same character.
Hence for any i, it takes O(|P_{i+1}|) time to insert pattern P_{i+1} into K_i, and so the time to
construct the entire keyword tree is O(n).
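In Python, the construction just described might look like this. It is a sketch: the class, its field names, and the dictionary standing in for fixed-alphabet edge lists are assumptions of this illustration.

    class KeywordTreeNode:
        def __init__(self):
            self.children = {}   # character -> child node (edge labels)
            self.patterns = []   # pattern numbers written at this node

    def build_keyword_tree(patterns):
        root = KeywordTreeNode()
        for number, P in enumerate(patterns, start=1):
            v = root
            for c in P:          # follow the unique matching path as far as possible
                if c not in v.children:
                    v.children[c] = KeywordTreeNode()   # start the new path here
                v = v.children[c]
            v.patterns.append(number)   # write the pattern's number at the end
        return root

    # root = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])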
Figure 3.13: Keyword tree for the set of patterns {potato, poetry, pottery, science, school}.
from them, whereas the order is just the opposite in the method based on fundamental
preprocessing.
Definition The keyword tree for set P is a rooted directed tree K satisfying three
conditions: 1. each edge is labeled with exactly one character; 2. any two edges out of
the same node have distinct labels; and 3. every pattern P_i in P maps to some node v of
K such that the characters on the path from the root of K to v exactly spell out P_i, and
every leaf of K is mapped to by some pattern in P.
For example, Figure 3.13 shows the keyword tree for the set of patterns {potato, poetry,
pottery, science, school}.
Clearly, every node in the keyword tree corresponds to a prefix of one of the patterns
in P , and every prefix of a pattern maps to a distinct node in the tree.
Assuming a fixed-size alphabet, it is easy to construct the keyword tree for P in O(n)
time. Define K_i to be the (partial) keyword tree that encodes patterns P_1, ..., P_i of P.
There is a more recent exposition of the Aho-Corasick method in [8], where the algorithm is used just as an
"acceptor", deciding whether or not there is an occurrence in T of at least one pattern from P. Because we will
want to explicitly find all occurrences, that version of the algorithm is too limited to use here.
For example, consider the set of patterns P = {potato, tattoo, theater, other} and its
keyword tree shown in Figure 3.16. Let v be the node labeled with the string potat. Since
tat is a prefix of tattoo, and it is the longest proper suffix of potat that is a prefix of any
pattern in P, lp(v) = 3.
Lemma 3.4.1. Let α be the lp(v)-length suffix of string L(v). Then there is a unique node
in the keyword tree that is labeled by string α.
PROOF K encodes all the patterns in P and, by definition, the lp(v)-length suffix of L(v)
is a prefix of some pattern in P. So there must be a path from the root in K that spells out
Numbered nodes along that path indicate patterns in P that start at position l. For a fixed
l, the traversal of a path of K takes time proportional to the minimum of m and n, so by
successively incrementing l from 1 to m and traversing K for each l, the exact set matching
problem can be solved in O(nm) time. We will reduce this to O(n + m + k) time below,
where k is the number of occurrences.
Definition For any node v of K, define lp(v) to be the length of the longest proper
suffix of string L(v) that is a prefix of some pattern in P.
to the node n_v labeled tat, and lp(v) = 3. So l is incremented to 5 = 8 − 3, and the next
comparison is between character T(8) and the character t on the edge below tat.
With this algorithm, when no further matches are possible, l may increase by more than
one, avoiding the reexamination of characters of T to the left of c, and yet we may be
sure that every occurrence of a pattern in P that begins at character c − lp(v) of T will be
correctly detected. Of course (just as in Knuth-Morris-Pratt), we have to argue that there
are no occurrences of patterns of P starting strictly between the old l and c − lp(v) in
T, and thus l can be incremented to c − lp(v) without missing any occurrences. With the
given assumption that no pattern in P is a proper substring of another one, that argument
is almost identical to the proof of Theorem 2.3.2 in the analysis of Knuth-Morris-Pratt,
and it is left as an exercise.
When lp(v) = 0, then l is increased to c and the comparisons begin at the root of K.
The only case remaining is when the mismatch occurs at the root. In this case, c must be
incremented by 1 and comparisons again begin at the root.
Therefore, the use of the function v ↦ n_v certainly accelerates the naive search for patterns
of P. But does it improve the worst-case running time? By the same sort of argument used
to analyze the search time (not the preprocessing time) of Knuth-Morris-Pratt (Theorem
2.3.3), it is easily established that the search time for Aho-Corasick is O(m). We leave
this as an exercise. However, we have yet to show how to precompute the function v ↦ n_v
in linear time.
Figure 3.17: Keyword tree used to compute the failure function for node v.
string α. By the construction of K no two paths spell out the same string, so this path is
unique and the lemma is proved.
Definition For a node v of K let n_v be the unique node in K labeled with the suffix of
L(v) of length lp(v). When lp(v) = 0 then n_v is the root of K.
Definition We call the ordered pair (v, n_v) a failure link.
Figure 3.16 shows the keyword tree for P = {potato, tattoo, theater, other}. Failure
links are shown as pointers from every node v to node n_v where lp(v) > 0. The other
failure links point to the root and are not shown.
To understand the use of the function v ↦ n_v, suppose we have traversed the tree to
node v but cannot continue (i.e., character T(c) does not occur on any edge out of v). We
know that string L(v) occurs in T starting at position l and ending at position c − 1. By
the definition of the function v ↦ n_v, it is guaranteed that string L(n_v) matches string
T[c − lp(v)..c − 1]. That is, the algorithm could traverse K from the root to node n_v and be
sure to match all the characters on this path with the characters in T starting from position
c − lp(v). So when lp(v) > 0, l can be increased to c − lp(v), c can be left unchanged,
and there is no need to actually make the comparisons on the path from the root to node
n_v. Instead, the comparisons should begin at node n_v, comparing character c of T against
the characters on the edges out of n_v.
For example, consider the text T = xxpotattooxx and the keyword tree shown in
Figure 3.16. When l = 3, the text matches the string potat but mismatches at the next
character. At this point c = 8, and the failure link from the node v labeled potat points
Figure 3.18: Keyword tree showing a directed path of failure links from potat to at through tat.
Repeating this analysis for every pattern in P yields the result that all the failure links
are established in time proportional to the sum of the pattern lengths in P (i.e., in O ( n )
total time).
Lemma 3.4.2. Suppose in a keyword tree K there is a directed path of failure links
(possibly empty) from a node v to a node that is numbered with pattern i . Then pattern Pi
must occur in T ending at position c (the current character) whenever node v is reached
during the search phase of the Aho-Corasick algorithm.
For example, Figure 3.18 shows the keyword tree for P = {potato, pot, tatter, at} along
with some of the failure links. Those links form a directed path from the node v labeled
potat to the numbered node labeled at. If the traversal of K reaches v then T certainly
contains the patterns tat and at ending at the current c.
Conversely,
Lemma 3.4.3. Suppose a node v has been reached during the algorithm. Then pattern
Algorithm n_v
  v' is the parent of v in K;
  x is the character on the edge (v', v);
  w := n_{v'};
  while there is no edge out of w labeled x and w ≠ r do
    w := n_w;
  end {while};
  if there is an edge (w, w') out of w labeled x then
    n_v := w'
  else
    n_v := r;
Note the importance of the assumption that n_u is already known for every node u that
is k or fewer characters from r.
To find n, for every node v, repeatedly apply the above algorithm to the nodes in K in
a breadth-first manner starting at the root.
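Applying Algorithm n_v to all nodes in breadth-first order can be sketched in Python as follows, reusing the KeywordTreeNode class from the earlier fragment and adding a fail field; both are this sketch's assumptions rather than the book's code.

    from collections import deque

    def add_failure_links(root):
        root.fail = root
        queue = deque()
        for child in root.children.values():
            child.fail = root          # nodes one edge from the root have lp(v) = 0
            queue.append(child)
        while queue:
            v_parent = queue.popleft()
            for x, v in v_parent.children.items():
                w = v_parent.fail      # start from n_{v'}, as in Algorithm n_v
                while w is not root and x not in w.children:
                    w = w.fail         # walk failure links toward the root
                if x in w.children and w.children[x] is not v:
                    v.fail = w.children[x]
                else:
                    v.fail = root
                queue.append(v)
        return root

Breadth-first order guarantees that when a node v is processed, the failure link of its parent (a shallower node) is already set, which is exactly the assumption noted above.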
Theorem 3.4.1. Let n be the total length of all the patterns in P. The total time used by
Algorithm n_v when applied to all nodes in K is O(n).
PROOF The argument is a direct generalization of the argument used to analyze time
in the classic preprocessing for Knuth-Morris-Pratt. Consider a single pattern P in P of
length t and its path in K for pattern P. We will analyze the time used in the algorithm to
find the failure links for the nodes on this path, as if the path shares no nodes with paths
for any other pattern in P. That analysis will overcount the actual amount of work done
by the algorithm, but it will still establish a linear time bound.
The key is to see how lp(v) varies as the algorithm is executed on each successive node
v down the path for P. When v is one edge from the root, then lp(v) is zero. Now let v be
an arbitrary node on the path for P and let v' be the parent of v. Clearly, lp(v) ≤ lp(v') + 1,
so over all executions of Algorithm n_v for nodes on the path for P, lp() is increased by a
total of at most t. Now consider how lp() can decrease. During the computation of n_v for
any node v, w starts at n_{v'} (and so has initial node depth equal to lp(v')). However, during
the computation of n_v, the node depth of w decreases every time an assignment to w is
made (inside the while loop). When n_v is finally set, lp(v) equals the current depth of w,
so if w is assigned k times, then lp(v) ≤ lp(v') − k and lp() decreases by at least k. Now
lp() is never negative, and during all the computations along path P, lp() can be increased
by a total of at most t. It follows that over all the computations done for nodes on the path
for P, the number of assignments made inside the while loop is at most t. The total time
used is proportional to the number of assignments inside the loop, and hence all failure
links on the path for P are set in O(t) time.
of an output link leads to the discovery of a pattern occurrence, so the total time for the
algorithm is O(n + m + k), where k is the total number of occurrences. In summary we have,
Theorem 3.4.2. If P is a set of patterns with total length n and T is a text of total length
m, then one can find all occurrences in T of patterns from P in O(n) preprocessing time
plus O(m + k) search time, where k is the number of occurrences. This is true even without
assuming that the patterns in P are substring free.
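Putting the pieces together, the search phase with output links can be sketched as follows. This Python fragment assumes the two fragments above; it precomputes the output lists in breadth-first order (so every occurrence is reported, as Theorem 3.4.2 requires), and the reporting format is this sketch's choice.

    from collections import deque

    def aho_corasick_search(root, T):
        # out(v) = patterns numbered at v plus out(n_v); fill in BFS order so
        # that out(v.fail) is ready before out(v) is needed.
        order, queue = [], deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            queue.extend(v.children.values())
        for v in order:
            v.out = list(v.patterns)
            if v is not root:
                v.out += v.fail.out
        hits, v = [], root
        for c_pos, c in enumerate(T):
            while v is not root and c not in v.children:
                v = v.fail                  # follow failure links on mismatch
            if c in v.children:
                v = v.children[c]
            for pat in v.out:               # every pattern ending at position c_pos
                hits.append((c_pos, pat))
        return hits

    # hits = aho_corasick_search(add_failure_links(build_keyword_tree(
    #            ["potato", "tattoo", "theater", "other"])), "xxpotattooxx")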
In a later chapter (Section 6.5) we will discuss further implementation issues that affect
the practical performance of both the Aho-Corasick method and suffix tree methods.
Sequence-tagged-sites
The concept of a Sequence-tagged-site (STS) is one of the most useful by-products that
has come out of the Human Genome Project [111, 234, 399]. Without going into full
biological detail, an STS is intuitively a DNA string of length 200-300 nucleotides whose
right and left ends, of length 20-30 nucleotides each, occur only once in the entire genome
[111, 317]. Thus each STS occurs uniquely in the DNA of interest. Although this definition
is not quite correct, it is adequate for our purposes. An early goal of the Human Genome
Project was to select and map (locate on the genome) a set of STSs such that any substring
in the genome of length 100,000 or more contains at least one of those STSs. A more
refined goal is to make a map containing ESTs (expressed sequence tags), which are STSs
that come from genes rather than parts of intergene DNA. ESTs are obtained from mRNA
and cDNA (see Section 11.8.3 for more detail on cDNA) and typically reflect the protein
coding parts of a gene sequence.
With an STS map, one can locate on the map any sufficiently long string of anonymous but sequenced DNA - the problem is just one of finding which STSs are contained in the anonymous DNA. Thus with STSs, map location of anonymous sequenced DNA becomes a string problem, an exact set matching problem. The STSs or the ESTs provide a computer-based set of indices to which new DNA sequences can be referenced. Presently, hundreds of thousands of STSs and tens of thousands of ESTs have been found and placed in computer databases [234]. Note that the total length of all the STSs and ESTs is very large compared to the typical size of an anonymous piece of DNA. Consequently, the keyword tree and the Aho-Corasick method (with a search time proportional to the length of the anonymous DNA) are of direct use in this problem, for they allow very rapid identification of STSs or ESTs that occur in newly sequenced DNA.
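As an illustration, the sketch above applies directly to this application; the sequences below are short placeholders, not real STSs.

    stss = ["acgtgcagg", "ttgcacgat", "ggcatgcat"]      # placeholder STS strings
    anonymous_dna = "catttgcacgatccggcatgcataa"
    root = build_keyword_tree(stss)
    set_failure_links(root)
    for pos, idx in search(root, anonymous_dna, stss):
        print("STS", idx, "found at position", pos)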
Of course, there may be some errors in either the STS map or in the newly sequenced DNA, causing trouble for this approach (see Section 16.5 for a discussion of STS maps). But in this application the number of errors should be a small percentage of the length of the STS, and that will allow more sophisticated exact (and inexact) matching methods to succeed. We will describe some of these in Sections 7.8.3, 9.4, and 12.2 of the book.
Complexity The time used by the Aho-Corasick algorithm to build the keyword tree for P is O(n). The time to search for occurrences in T of patterns from P is O(m + z), where |T| = m and z is the number of occurrences. We treat each pattern in P as being distinct even if there are multiple copies of it in P. Then whenever an occurrence of a pattern from P is found in T, exactly one cell in C is incremented; furthermore, a cell can be incremented to at most k. Hence z must be bounded by km, and the algorithm runs in O(km) time. Although the number of character comparisons used is just O(m), km need not be O(m), and hence the number of times C is incremented may grow faster than O(m), leading to a nonlinear O(km) time bound. But if k is assumed to be bounded (independent of |P|), then the method does run in linear time. In summary,
Theorem 3.5.1. If the number of wild cards in pattern P is bounded by a constant, then the exact matching problem with wild cards in the pattern can be solved in O(n + m) time.
Later, in Section 9.3, we will return to the problem of wild cards when they occur in the pattern, the text, or both.
A related application comes from the "BAC-PAC" proposal [442] for sequencing the human genome (see page 418). In that method, 600,000 strings (patterns) of length 500 would first be obtained and entered into the computer. Thousands of times thereafter, one would look for occurrences of any of these 600,000 patterns in text strings of length 150,000. Note that the total length of the patterns is 300 million characters, which is two thousand times as large as the typical text to be searched.
where CYS is the amino acid cysteine and HIS is the amino acid histidine. Another important transcription factor is the Leucine Zipper, which consists of four to seven leucines, each separated by six wild card amino acids.
If the number of permitted wild cards is unbounded, it is not known whether the problem can be solved in linear time. However, if the number of wild cards is bounded by a fixed constant (independent of the size of P), then the following method, based on exact set pattern matching, runs in linear time:
Exact matching with wild cards
0. Let C be a vector of length |T| initialized to all zeros.
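The remaining numbered steps fall on a page outside this excerpt, but the counting scheme in the complexity discussion above suggests the following sketch. It splits P at the wild cards into its k maximal substrings and increments one cell of C per subpattern occurrence; for brevity it locates occurrences by direct scanning rather than with the Aho-Corasick method the text intends, and the function name and wild card symbol are choices made here.

    def wildcard_match(P, T, wild='*'):
        # maximal substrings of P containing no wild card, with start positions
        pieces, i = [], 0
        while i < len(P):
            if P[i] == wild:
                i += 1
                continue
            j = i
            while j < len(P) and P[j] != wild:
                j += 1
            pieces.append((i, P[i:j]))
            i = j
        k = len(pieces)
        C = [0] * len(T)                     # step 0: the counter vector C
        for l, piece in pieces:
            start = T.find(piece)
            while start != -1:
                if start - l >= 0 and start - l + len(P) <= len(T):
                    C[start - l] += 1        # exactly one increment per occurrence
                start = T.find(piece, start + 1)
        # P starts at j exactly when all k subpatterns were found in place
        return [j for j in range(len(T))
                if C[j] == k and j + len(P) <= len(T)]

    print(wildcard_match("ab*a*c", "zabxaycq"))   # [1]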
Every string specified by this regular expression has ten positions, which are separated by a dash. Each capital letter specifies a single amino acid, and a group of amino acids enclosed in brackets indicates that exactly one of those amino acids must be chosen. A small x indicates that any one of the twenty amino acids from the protein alphabet can be chosen for that position. This regular expression describes 192,000 amino acid strings, but only a few of these actually appear in any known proteins. For example, ENLSSEDEEL is specified by the regular expression and is found in human granin proteins.
Definition A single character from Σ is a regular expression. The symbol ε is a regular expression. A regular expression followed by another regular expression is a regular expression. Two regular expressions separated by the symbol "+" form a regular expression. A regular expression enclosed in parentheses is a regular expression. A regular expression enclosed in parentheses and followed by the symbol "*" is a regular expression. The symbol * is called the Kleene closure.
These recursive rules are simple to follow, but may need some explanation. The symbol ε represents the empty string (i.e., the string of length zero). If R is a parenthesized regular expression, then R* means that the expression R can be repeated any number of times (including zero times). The inclusion of parentheses as part of a regular expression (outside of Σ) is not standard, but is closer to the way that regular expressions are actually specified in many applications. Note that the example given above in PROSITE format does not conform to the present definition but can easily be converted to do so.
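That conversion can be mechanized. The sketch below rewrites a simplified PROSITE-style pattern in a conventional regular-expression notation (Python's re module); the ten-position pattern used here is a stand-in consistent with the description above, not an actual PROSITE entry, and real PROSITE syntax has more features than are handled.

    import re

    def prosite_to_regex(pattern):
        out = []
        for element in pattern.split('-'):
            if element == 'x':
                out.append('[A-Z]')      # x: any of the twenty amino acids
            else:
                out.append(element)      # a literal, or a bracketed choice like [ST]
            # (range notations such as x(2,4) are not handled in this sketch)
        return ''.join(out)

    r = prosite_to_regex('E-N-L-S-[ST]-E-[DE]-x-E-L')
    print(r)                                      # ENLS[ST]E[DE][A-Z]EL
    print(bool(re.fullmatch(r, 'ENLSSEDEEL')))    # True: the granin string above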
As an example, let Σ be the alphabet of lower case English characters. Then R = (a + c + r)ykk(p + q)*vdt(l + z + u)(p + q) is a regular expression over Σ, and S =
Note that in the context of regular expressions, the meaning of the word "pattern" is different from its previous and
general meaning in this book.
more complex techniques of the type we will examine in Part III of the book. So for now, we view two-dimensional exact matching as an illustration of how exact set matching can be used in more complex settings and as an introduction to more realistic two-dimensional problems. The method presented follows the basic approach given in [44] and [66]. Since then, many additional methods have been presented that improve on those papers in various ways. However, because the problem as stated is somewhat unrealistic, we will not discuss the newer, more complex methods. For a sophisticated treatment of two-dimensional matching see [22] and [169].
Let m be the total number of points in T, let n be the number of points in P, and let n' be the number of rows in P. Just as in exact string matching, we want to find the smaller picture in the larger one in O(n + m) time, whereas O(nm) is the time for the obvious approach. Assume for now that each row of P is distinct; later we will relax this assumption.
The method is divided into two phases. In the first phase, search for all occurrences of
each of the rows of P among the rows of T. To do this, add an end of row marker (some
character not in the alphabet) to each row of T and concatenate these rows together to
form a single text string T' of length O(m). Then, treating each row of P as a separate
pattern, use the Aho-Corasick algorithm to search for all occurrences in T' of any row
of P. Since P is rectangular, all rows have the same width, and so no row is a proper
substring of another and we can use the simpler version of Aho-Corasick discussed in
Section 3.4.2. Hence the first phase identifies all occurrences of complete rows of P in
complete rows of T and takes O(n + m) time.
Whenever an occurrence of row i of P is found starting at position (p, q) of T, write the number i in position (p, q) of another array M with the same dimensions as T. Because each row of P is assumed to be distinct and because P is rectangular, at most one number will be written in any cell of M.
In the second phase, scan each column of M, looking for an occurrence of the string 1, 2, ..., n' in consecutive cells in a single column. For example, if this string is found in column 6, starting at row 12 and ending at row n' + 11, then P occurs in T with its upper left corner at row 12 of column 6. Phase two can be implemented in O(n' + m) = O(n + m) time by applying any linear-time exact matching algorithm to each column of M.
This gives an O(n + m) time solution to the two-dimensional exact set matching problem. Note the similarity between this solution and the solution to the exact matching problem with wild cards discussed in the previous section. A distinction will be discussed in the exercises.
Now suppose that the rows of P are not all distinct. Then, first find all identical rows and give them a common label (this is easily done during the construction of the keyword tree for the row patterns). For example, if rows 3, 6, and 10 are the same, then we might give them all the label of 3. We do a similar thing for any other rows that are identical. Then, in phase one, only look for occurrences of row 3, and not rows 6 and 10. This ensures that a cell of M will have at most one number written in it during phase one. In phase two, don't look for the string 1, 2, 3, ..., n' in the columns of M, but rather for a string in which 3 replaces 6 and 10, etc. It is easy to verify that this approach is correct and that it takes just O(n + m) time. In summary,
Theorem 3.5.2. If T and P are rectangular pictures with m and n cells, respectively, then all exact occurrences of P in T can be found in O(n + m) time, improving upon the naive method, which takes O(nm) time.
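A compact sketch of the two-phase method follows. To stay short it locates row occurrences with Python's built-in find() rather than with Aho-Corasick, so it does not achieve the O(n + m) bound of the theorem, but the row labeling and the column scan are as described above.

    def two_dimensional_match(P, T):
        rows_p, rows_t = len(P), len(T)
        # give identical rows of P a common label (the smallest row index)
        label, labels = {}, []
        for i, row in enumerate(P):
            labels.append(label.setdefault(row, i))
        # phase 1: mark row occurrences in the array M
        M = [[None] * len(T[0]) for _ in range(rows_t)]
        for row, lab in set(zip(P, labels)):
            for p in range(rows_t):
                q = T[p].find(row)
                while q != -1:
                    M[p][q] = lab
                    q = T[p].find(row, q + 1)
        # phase 2: scan each column of M for the label sequence
        corners = []
        for q in range(len(T[0])):
            for p in range(rows_t - rows_p + 1):
                if all(M[p + i][q] == labels[i] for i in range(rows_p)):
                    corners.append((p, q))     # upper left corner of P in T
        return corners

    T = ["xxabxx",
         "xxcdxx",
         "xxabxx"]
    P = ["ab",
         "cd"]
    print(two_dimensional_match(P, T))         # [(0, 2)]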
3.7. EXERCISES
3.7. Exercises
1. Evaluate empirically the speed of the Boyer-Moore method against the Apostolico-Giancarlo method under different assumptions about the text and the pattern. These assumptions should include the size of the alphabet, the "randomness" of the text or pattern, the level of periodicity of the text or pattern, etc.
2. In the Apostolico-Giancarlo method, array M is of size m, which may be large. Show how to modify the method so that it runs in the same time, but in place of M uses an array of size n.
the algorithm sets the value to be strictly less than the length of the match. Now, since the
algorithm learns the exact location of the mismatch in all cases, M( j ) could always be set
to the full length of the match, and this would seem to be a good thing to do. Argue that this
change would result in a correct simulation of Boyer-Moore. Then explain why this was not
done in the algorithm.
Hint: It's the time bound.
5. Prove Lemma 3.2.2 showing the equivalence of the two definitions of semiperiodic strings.
6. For each of the n prefixes of P, we want to know whether the prefix P[1..i] is a periodic string. That is, for each i we want to know the largest k > 1 (if there is one) such that P[1..i] can be written as α^k for some string α. Of course, we also want to know the period. Show how to determine this for all n prefixes in time linear in the length of P.
Hint: Z algorithm.
7. Solve the same problem as above but modified to determine whether each prefix is
semiperiodic and with what period. Again, the time should be linear.
8. By being more careful in the bookkeeping, establish the constant in the O(m) bound from
Cole's linear-time analysis of the Boyer-Moore algorithm.
9. Show where Cole's worst-case bound breaks down if only the weak Boyer-Moore shift
rule is used. Can the argument be fixed, or is the linear time bound simply untrue when
only the weak rule is used? Consider the example of T = abababababababababab and
P = xaaaaaaaaa without also using the bad character rule.
10. Similar to what was done in Section 1.5, show that applying the classical Knuth-Morris-Pratt preprocessing method to the string P$T gives a linear-time method to find all occurrences of P in T. In fact, the search part of the Knuth-Morris-Pratt algorithm (after the preprocessing of P is finished) can be viewed as a slightly optimized version of the Knuth-Morris-Pratt preprocessing algorithm applied to the T part of P$T. Make this precise, and quantify the utility of the optimization.
11. Using the assumption that P is substring free (i.e., that no pattern P_i in P is a substring of another pattern P_j in P), complete the correctness proof of the Aho-Corasick algorithm. That is, prove that if no further matches are possible at a node v, then l can be set to c − lp(v) and the comparisons resumed at node n_v, without missing any occurrences in T of patterns from P.
12. Prove that the search phase of the Aho-Corasick algorithm runs in O(m) time if no pattern in P is a proper substring of another, and otherwise in O(m + k) time, where k is the total number of occurrences.
13. The AhwCorasick algorithm can have the same problem that the Knuth-Morris-Pratt algorithm
(i', j − i + 1). Then declare that P occurs in T with upper left corner in any cell whose counter becomes n' (the number of rows of P). Does this work?
Hint: No.
Why not? Can you fix it and make it run in O(n + m) time?
27. Suppose we have q > 1 small (distinct) rectangular pictures and we want to find all occurrences of any of the q small pictures in a larger rectangular picture. Let n be the total number of points in all the small pictures and m be the number of points in the large picture. Discuss how to solve this problem efficiently. As a simplification, suppose all the small pictures have the same width. Then show that O(n + m) time suffices.
28. Show how to construct the required directed graph G(R) from a regular expression R. The
construction should have the property that if R contains n symbols, then G(R) contains at
most O(n) edges.
29. Since the directed graph G(R) contains O(n) edges when R contains n symbols, |N(i)| = O(n) for any i. This suggests that the set N(i) can be naively found from N(i − 1) and T(i) in O(ne) time. However, the time stated in the text for this task is O(e). Explain how this reduction of time is achieved. Explain why the improvement is trivial if G(R) contains no ε edges.
30. Explain the importance, or the utility, of ε edges in the graph G(R). If R does not contain the closure symbol "*", can ε edges always be avoided? Biological strings are always finite, hence "*" can always be avoided. Explain how this simplifies the searching algorithm.
31. Wild cards can clearly be encoded into a regular expression, as defined in the text. However,
it may be more efficient to modify the definition of a regular expression to explicitly include
the wild card symbol. Develop that idea and explain how wild cards can be efficiently
handled by an extension of the regular expression pattern matching algorithm.
32. PROSITE patterns often specify the number of times that a substring can repeat as a finite range of numbers. For example, CD(2,4) indicates that CD can repeat either two, three, or four times. The formal definition of a regular expression does not include such concise range specifications, but finite range specifications can be expressed in a regular expression. Explain how. How much do those specifications increase the length of the expression over the length of the more concise PROSITE expression? Show how such range specifications are reflected in the directed graph for the regular expression (ε edges are permitted). Show that one can still search for a substring of T that matches the regular expression in O(me) time, where m is the length of T and e is the number of edges in the graph.
33. Theorem 3.6.1 states the time bound for determining if T contains a substring that matches
a regular expression R. Extend the discussion and the theorem to cover the task of explicitly
finding and outputting all such matches. State the time bound as the sum of a term that is
independent of the number of matches plus a term that depends on that number.
14. Give an example showing that k, the number of occurrences in T of patterns in set P, can grow faster than O(n + m). Be sure you account for the input size n. Try to make the growth as large as possible.
15. Prove Lemmas 3.4.2 and 3.4.3 that relate to the case of patterns that are not substring
free.
16. The time analysis in the proof of Theorem 3.4.1 separately considers the path in K for each pattern P in P. This results in an overcount of the time actually used by the algorithm.
Perform the analysis more carefully to relate the running time of the algorithm to the number
of nodes in K.
17. Discuss the problem (and solution if you see one) of using the Aho-Corasick algorithm when: a. wild cards are permitted in the text but not in the pattern, and b. wild cards are permitted in both the text and pattern.
18. Since the nonlinear time behavior of the wild card algorithm is due to duplicate copies of strings in P, and such duplicates can be found and removed in linear time, it is tempting to "fix up" the method by first removing duplicates from P. That approach is similar to what is done in the two-dimensional string matching problem, where identical rows were first found and given a single label. Consider this approach and try to use it to obtain a linear-time method for the wild card problem. Does it work, and if not, what are the problems?
19. Show how to modify the wild card method by replacing array C (which is of length m > n)
by a list of length n, while keeping the same running time.
20. In the wild card problem we first assumed that no pattern in P is a substring of another
one, and then we extended the algorithm to the case when that assumption does not hold.
Could we instead simply reduce the case when substrings of patterns are allowed to the
case when they are not? For example, perhaps we just add a new symbol to the end of
each string in P that appears nowhere else in the patterns. Does it work? Consider both
correctness and complexity issues.
21. Suppose that the wild card can match any length substring, rather than just a single character. What can you say about exact matching with these kinds of wild cards in the pattern, in the text, or in both?
22. Another approach to handling wild cards in the pattern is to modify the Knuth-Morris-Pratt
or Boyer-Moore algorithms, that is, to develop shift rules and preprocessing methods that
can handle wild cards in the pattern. Does this approach seem promising? Try it, and
discuss the problems (and solutions if you see them).
23. Give a complete proof of the correctness and O(n + m) time bound for the two-dimensional matching method described in the text (Section 3.5.3).
24. Suppose in the two-dimensional matching problem that Knuth-Morris-Pratt is used once for each pattern in P, rather than Aho-Corasick being used. What time bound would result?
25. Show how to extend the two-dimensional matching method to the case when the bottom of
the rectangular pattern is not parallel to the bottom of the large picture, but the orientation
of the two bottoms is known. What happens if the pattern is not rectangular?
26. Perhaps we can omit phase two of the two-dimensional matching method as follows: Keep a counter at each cell of the large picture. When we find that row i of the small picture occurs in row j of the large picture starting at position (i', j), increment the counter for cell
because prefixes of P of lengths one and three end at position seven of T. The eighth character of T is character a, which has a U vector of
4
Seminumerical String Matching
For example, Figure 4.1 shows a column j - 1 before and after the bit-shift.
zero column of each array is again initialized to all zeros. Then the jth column of M^l is computed by:
M^l(j) = M^{l−1}(j) OR [Bit-Shift(M^l(j − 1)) AND U(T(j))] OR Bit-Shift(M^{l−1}(j − 1)).
Intuitively, this just says that the first i characters of P will match a substring of T ending at position j, with at most l mismatches, if and only if one of the following three conditions holds:
The first i characters of P match a substring of T ending at j, with at most l − 1 mismatches.
The first i − 1 characters of P match a substring of T ending at j − 1, with at most l mismatches, and the next pair of characters in P and T are equal.
The first i − 1 characters of P match a substring of T ending at j − 1, with at most l − 1 mismatches.
It is simple to establish that these recurrences are correct, and over the entire algorithm the number of bit operations is O(knm). As in the Shift-And method, the practical efficiency comes from the fact that the vectors are bit vectors (again of length n) and the operations are very simple - shifting by one position and ANDing bit vectors. Thus when the pattern is relatively small, so that a column of any M^l fits into a few words, and k is also small, agrep is extremely fast.
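A sketch of this recurrence follows, using Python integers as the bit-vector columns (bit i − 1 of a column set means the first i characters of P match, with the allowed number of mismatches, a substring of T ending at the current position). Here Bit-Shift(x) is rendered as (x << 1) | 1; the function name and the reporting format are choices made in this sketch.

    def agrep_style_search(P, T, k):
        n = len(P)
        U = {}                                   # U[c] has bit i set iff P[i] == c
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        full, mask = 1 << (n - 1), (1 << n) - 1
        M = [0] * (k + 1)                        # current columns of M^0 .. M^k
        hits = []
        for j, c in enumerate(T):
            u = U.get(c, 0)
            prev = M[:]                          # the columns for position j - 1
            M[0] = ((prev[0] << 1) | 1) & u      # the exact Shift-And recurrence
            for l in range(1, k + 1):
                M[l] = (M[l - 1]                         # <= l-1 mismatches already
                        | (((prev[l] << 1) | 1) & u)     # extend on an equal pair
                        | ((prev[l - 1] << 1) | 1)) & mask   # extend over a mismatch
            for l in range(k + 1):
                if M[l] & full:
                    hits.append((j, l))          # P ends at j with at most l mismatches
                    break
        return hits

    print(agrep_style_search("abc", "xabd", 1))  # [(3, 1)]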
That is, M^k(i, j) is the natural extension of the definition of M(i, j) to allow up to k mismatches. Therefore, M^0 is the array M used in the Shift-And method. If M^k(n, j) = 1, then there is an occurrence of P in T ending at position j that contains at most k mismatches. We let M^k(j) denote the jth column of M^k.
In agrep, the user chooses a value of k and then the arrays M^0, M^1, M^2, ..., M^k are computed. The efficiency of the method depends on the size of k - the larger k is, the slower the method. For many applications, a value of k as small as 3 or 4 is sufficient, and the method is extremely fast.
The problem then becomes how to compute V_x(α, β, i) for each i. Convert the two strings α and β into binary strings ᾱ_x and β̄_x, respectively, where every occurrence of character x becomes a 1 and every other character becomes a 0. For example, let α be acaacggaggtat and β be accacgaag. Then the binary strings ᾱ_a and β̄_a are 1011000100010 and 100100110. To compute V_a(α, β, i), position β̄_a to start at position i of ᾱ_a, and count the number of columns where both bits are equal to 1. For example, if i = 3 then we get

1011000100010
  100100110

and V_a(α, β, 3) = 1.
Another way to view this is to consider each space opposite a bit to be a 0 (so both binary strings are the same length), do a bitwise AND operation on the strings, and then add the resulting bits.
To formalize this idea, pad the right end of ᾱ_a (the larger string) with n additional zeros and pad the right end of β̄_a with m additional zeros. The two resulting strings then each have length n + m. Also, for convenience, renumber the indices of both strings to run from 0 to n + m − 1. Then
V_a(α, β, i) = Σ_{j=0}^{n+m−1} ᾱ_a(i + j) × β̄_a(j),
where the indices in the expression are taken modulo n + m. The extra zeros are there to handle the cases when the left end of α is to the left of the left end of β and, conversely, when the right end of α is to the right of the right end of β. Enough zeros were padded so that whenever one string extends past the right end of the other, the overhanging bits sit opposite padded zeros. Hence no "illegitimate wraparound" of α and β is possible, and V_a(α, β, i) is correctly computed.
So far, all we have done is to recode the match-count problem, and this recoding doesn't suggest a way to compute V_x(α, β) more efficiently than before the binary coding and padding. This is where correlation and the FFT come in.
Clearly, when α = P and β = T the vector V(α, β) contains the information needed for the last row of MC. But it contains more information, because we allow the left end of α to be to the left of the left end of β, and we also allow the right end of α to be to the right of the right end of β. Negative numbers specify positions to the left of the left end of β, and positive numbers specify the other positions. For example, when α is aligned with β as follows,
B as follows,
21123456789
B:
A:
accctgtcc
aac t g c c g
where for each character x, V_x(α, β, i) is computed by replacing each wild card with character x. In summary,
Theorem 4.3.1. The match-count problem can be solved in O(m log m) time even if an unbounded number of wild cards are allowed in either P or T.
Later, after discussing suffix trees and common ancestors, we will present in Section 9.3
a different, more comparison-based approach to handling wild cards that appear in both
strings.
Similarly, let
H(T_r) = Σ_{i=1}^{n} 2^{n−i} × T(r + i − 1).
Then P occurs in T starting at position r if and only if H(P) = H(T_r).
Cyclic correlation
Definition Let X and Y be two z-length vectors with real-number components, indexed from 0 to z − 1. The cyclic correlation of X and Y is a z-length real vector W(i) = Σ_{j=0}^{z−1} X(j) × Y(i + j), where the indices in the expression are taken modulo z.
Clearly, the problem of computing the vector V_a(α, β) is exactly the problem of computing the cyclic correlation of the padded strings ᾱ_a and β̄_a. In detail, X = β̄_a, Y = ᾱ_a, z = n + m, and W = V_a(α, β).
Now an algorithm based only on the definition of cyclic correlation would require O(z²) operations, so again no progress is apparent. But cyclic correlation is a classic problem known to be solvable in O(z log z) time using the Fast Fourier Transform. (The FFT is more often associated with the convolution problem for two vectors, but cyclic correlation and convolution are very similar. In fact, cyclic correlation is solved by reversing one of the input vectors and then computing the convolution of the two resulting vectors.)
The FFT method, and its use in the solution of the cyclic correlation problem, is beyond the scope of this book, but the key is that it solves the cyclic correlation problem in O(z log z) arithmetic operations, for two vectors each of length z. Hence it solves the match-count problem using only O(m log m) arithmetic operations. This is surprisingly efficient and a definite improvement over the O(nm) bound given by the generalized Shift-And approach. However, the FFT requires operations over complex numbers, and so each arithmetic step is more involved (and perhaps more costly) than in the more direct Shift-And method.
A related approach [58] attempts to solve the match-count problem in O(m log m) integer (noncomplex) operations by implementing the FFT over a finite field. In practice, this approach is probably superior to the approach based on complex numbers, although in terms of pure complexity theory the claimed O(m log m) bound is not completely kosher, because it uses a precomputed table of numbers that is only adequate for values of m up to a certain size.
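The whole reduction is short to render with numpy (an assumption of this sketch, not something the book uses): one indicator pair per character, padded to length n + m, cyclically correlated via the FFT, and summed over the alphabet. Results are rounded to integers to absorb floating-point error.

    import numpy as np

    def match_counts(alpha, beta):
        m, n = len(alpha), len(beta)        # alpha is the larger string here
        z = n + m
        V = np.zeros(z)
        for ch in set(alpha) | set(beta):
            a = np.zeros(z)
            b = np.zeros(z)
            for i, c in enumerate(alpha):
                a[i] = (c == ch)            # padded indicator vector for alpha
            for i, c in enumerate(beta):
                b[i] = (c == ch)            # padded indicator vector for beta
            # cyclic correlation W(i) = sum_j b(j) * a(i + j), via the FFT
            V += np.fft.ifft(np.conj(np.fft.fft(b)) * np.fft.fft(a)).real
        return np.rint(V).astype(int)       # V[i]: matches when beta starts at offset i

    # offset 2 places beta at position 3 of alpha, as in the example above;
    # over all characters there are 2 matches there (an a and a c)
    print(match_counts("acaacggaggtat", "accacgaag")[2])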
1 × 2 mod 7 + 0 = 2
2 × 2 mod 7 + 1 = 5
5 × 2 mod 7 + 1 = 4
4 × 2 mod 7 + 1 = 2
2 × 2 mod 7 + 1 = 5
5 mod 7 = 5.
(These are the steps for P = 101111 with p = 7: H(P) = 47, and H_p(P) = 47 mod 7 = 5.)
The point of Horner's rule is not only that the number of multiplications and additions required is linear, but that the intermediate numbers are always kept small.
Intermediate numbers are also kept small when computing H_p(T_r) for any r, since that computation can be organized the way that H_p(P) was. However, even greater efficiency is possible: For r > 1, H_p(T_r) can be computed from H_p(T_{r−1}) with only a small constant number of operations. Since
H(T_r) = 2 × H(T_{r−1}) − 2^n × T(r − 1) + T(r + n − 1),
it follows that
H_p(T_r) = [(2 × H_p(T_{r−1}) mod p) − (2^n mod p) × T(r − 1) + T(r + n − 1)] mod p.
Further,
2^n mod p = 2 × (2^{n−1} mod p) mod p.
Therefore, each successive power of two taken mod p and each successive value H_p(T_r) can be computed in constant time.
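The arithmetic above renders directly into code. A minimal sketch for the binary case follows; the function names are ours, and the input is a list of bits.

    def horner_mod(bits, p):
        h = 0
        for b in bits:
            h = (2 * h + b) % p          # reduce at every step; values stay below 2p
        return h

    def all_fingerprints(bits, n, p):
        pow2n = pow(2, n, p)             # 2^n mod p by modular exponentiation
        h = horner_mod(bits[:n], p)
        out = [h]                        # H_p(T_1)
        for r in range(1, len(bits) - n + 1):
            # H_p(T_{r+1}) = (2*H_p(T_r) - (2^n mod p)*T(r) + T(r+n)) mod p
            h = (2 * h - pow2n * bits[r - 1] + bits[r + n - 1]) % p
            out.append(h)
        return out

    print(horner_mod([1, 0, 1, 1, 1, 1], 7))   # 5, matching the computation above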
The goal will be to choose a modulus p small enough that the arithmetic is kept efficient, yet large enough that the probability of a false match between P and T is kept small. The key comes from choosing p to be a prime number in the proper range and exploiting properties of prime numbers. We will state the needed properties of prime numbers without proof.
Definition For a positive integer u, π(u) is the number of primes that are less than or equal to u.
The following theorem is a variant of the famous prime number theorem.
Theorem 4.4.2. u/ln(u) ≤ π(u) ≤ 1.26 u/ln(u).
The proof, which we leave to the reader, is an immediate consequence of the fact that every integer can be written in a unique way as the sum of distinct powers of two.
Theorem 4.4.1 converts the exact match problem into a numerical problem, comparing the two numbers H(P) and H(T_r) rather than directly comparing characters. But unless the pattern is fairly small, the computation of H(P) and H(T_r) will not be efficient. The problem is that the required powers of two used in the definition of H(P) and H(T_r) grow large too rapidly. (From the standpoint of complexity theory, the use of such large numbers violates the unit-time random access machine (RAM) model. In that model, the largest allowed numbers must be represented in O[log(n + m)] bits, but the number 2^n requires n bits. Thus the required numbers are exponentially too large.) Even worse, when the alphabet is not binary but, say, has t characters, then numbers as large as t^n are needed.
In 1987 R. Karp and M. Rabin [266] published a method (devised almost ten years earlier), called the randomized fingerprint method, that preserves the spirit of the above numerical approach, but that is extremely efficient as well, using numbers that satisfy the RAM model. It is a randomized method where the only if part of Theorem 4.4.1 continues to hold, but the if part does not. Instead, the if part will hold with high probability. This is explained in detail in the next section.
Definition For a positive integer p, H_p(P) is defined as H(P) mod p. That is, H_p(P) is the remainder of H(P) after division by p. Similarly, H_p(T_r) is defined as H(T_r) mod p. The numbers H_p(P) and H_p(T_r) are called fingerprints of P and T_r.
Already, the utility of using fingerprints should be apparent. By reducing H(P) and H(T_r) modulo a number p, every fingerprint remains in the range 0 to p − 1, so the size of a fingerprint does not violate the RAM model. But if H(P) and H(T_r) must be computed before they can be reduced modulo p, then we have the same problem of intermediate numbers that are too large. Fortunately, modular arithmetic allows one to reduce at any time (i.e., one can never reduce too much), so the following generalization of Horner's rule holds: H_p(P) can be computed by processing the bits of P left to right, multiplying by two, adding the next bit, and reducing modulo p at every step, exactly as in the worked example above.
One can more efficiently compute H(T_{r+1}) from H(T_r) than by following the definition directly (and we will need that later on), but the time to do the updates is not the issue here.
Given the fact that each H_p(T_r) can be computed in constant time from H_p(T_{r−1}), the fingerprint algorithm runs in O(m) time, excluding any time used to explicitly check a declared match. It may, however, be reasonable not to bother explicitly checking declared matches, depending on the probability of an error. We will return to the issue of checking later. For now, to fully analyze the probability of error, we have to answer the question of what I should be.
How to choose I
The utility of the fingerprint method depends on finding a good value for I. As I increases, the probability of a false match between P and T decreases, but the allowed size of p increases, increasing the effort needed to compute H_p(P) and H_p(T_r). Is there a good balance? There are several good ways to choose I, depending on n and m. One choice is to take I = nm². With that choice the largest number used in the algorithm requires at most 4(log n + log m) bits, satisfying the RAM model requirement that the numbers be kept small as a function of the size of the input. But what of the probability of a false match?
Corollary 4.4.2. When I = nm², the probability of a false match is at most 2.53/m.
PROOF By Theorem 4.4.3 and the prime number theorem (Theorem 4.4.2), the probability of a false match is bounded by π(nm)/π(nm²) ≤ 1.26 ln(nm²)/(m ln(nm)) ≤ 2.53/m.
A small example from [266] illustrates this bound. Take n = 250, m = 4000, and hence I = nm² = 4 × 10^9 < 2^32. Then the probability of a false match is at most 2.53/4000 < 10^-3.
Thus, with just a 32-bit fingerprint, for any P and T the probability that even a single one of the algorithm's declarations is wrong is bounded by 0.001.
Alternatively, if I = n²m then the probability of a false match is O(1/n), and since it takes O(n) time to determine whether a match is false or real, the expected verification time would be constant. The result would be an O(m) expected time method that never has a false match.
Extensions
If one prime is good, why not use several? Why not pick k primes p_1, p_2, ..., p_k randomly and compute k fingerprints? For any position r, there can be an occurrence of P starting at r only if H_{p_i}(P) = H_{p_i}(T_r) for every one of the k selected primes. We now define a false match between P and T to mean that there is an r such that P does not occur in T starting at r, but H_{p_i}(P) = H_{p_i}(T_r) for each of the k primes. What now is the probability of a false match between P and T? One bound is fairly immediate and intuitive.
Lemma 4.4.2. If u ≥ 29, then the product of all the primes that are less than or equal to u is greater than 2^u [383].
For example, for u = 29 the prime numbers less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Their product is 6,469,693,230, whereas 2^29 is 536,870,912.
Corollary 4.4.1. If u ≥ 29 and x is any number less than or equal to 2^u, then x has fewer than π(u) (distinct) prime divisors.
PROOF Suppose x does have k ≥ π(u) distinct prime divisors q_1, q_2, ..., q_k. Then 2^u ≥ x ≥ q_1 q_2 ... q_k (the first inequality is from the statement of the corollary, and the second from the fact that some primes in the factorization of x may be repeated). But q_1 q_2 ... q_k is at least as large as the product of the smallest k primes, which in turn is at least as large as the product of the first π(u) primes (by the assumption that k ≥ π(u)), that is, the product of all the primes less than or equal to u. That product is greater than 2^u (by Lemma 4.4.2). So the assumption that k ≥ π(u) leads to the contradiction that 2^u > 2^u, and the corollary is proved.
Notice that Theorem 4.4.3 holds for any choice of pattern P and text T such that nm ≥ 29. The probability in the theorem is not taken over choices of P and T but rather over choices of prime p. Thus, this theorem does not make any (questionable) assumptions about P or T being random or generated by a Markov process, etc. It works for any P and T! Moreover, the theorem doesn't just bound the probability that a false match occurs at a fixed position r; it bounds the probability that there is even a single such position r in T. It is also notable that the analysis in the proof of the theorem feels "weak". That is, it only develops a very weak property of a prime p that allows a false match, namely, being one of at most π(nm) numbers that divide ∏_{r∈R} |H(P) − H(T_r)|, where R is the set of positions where P does not occur in T. This suggests that the true probability of a false match occurring between P and T is much less than the bound established in the theorem.
Theorem 4.4.3 leads to the following random fingerprint algorithm for finding all occurrences of P in T.
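The algorithm itself is stated on a page outside this excerpt; the following sketch is consistent with the surrounding discussion. It chooses one random prime p ≤ I = nm², compares fingerprints, and (optionally) verifies each declared match. Two liberties are taken here: trial division stands in for a real primality test, and base 256 replaces the text's base 2 so that arbitrary characters can be hashed.

    import random

    def is_prime(x):
        if x < 2:
            return False
        d = 2
        while d * d <= x:
            if x % d == 0:
                return False
            d += 1
        return True

    def karp_rabin(P, T, check=True, base=256):
        n, m = len(P), len(T)
        I = n * m * m
        p = random.randint(2, I)
        while not is_prime(p):
            p = random.randint(2, I)     # a random prime between 1 and I
        hp = ht = 0
        for c in P:
            hp = (base * hp + ord(c)) % p
        for c in T[:n]:
            ht = (base * ht + ord(c)) % p
        bn = pow(base, n, p)             # base^n mod p
        matches = []
        for r in range(m - n + 1):
            if ht == hp and (not check or T[r:r + n] == P):
                matches.append(r)        # a declared (and here verified) occurrence
            if r + n < m:
                ht = (base * ht - bn * ord(T[r]) + ord(T[r + n])) % p
        return matches

    print(karp_rabin("aba", "cababa"))   # [1, 3]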
allows numerous false matches (a demon seed). Theorem 4.4.3 says nothing about how bad a particular prime can be. But by picking a new prime after each error is detected, we can apply Corollary 4.4.2 to each prime, establishing
Theorem 4.4.6. If a new prime is randomly chosen after the detection of an error, then for any pattern and text the probability of t errors is at most (2.53/m)^t.
This probability falls so rapidly that one is effectively protected against a long series
of errors on any particular problem instance. For additional probabilistic analysis of the
Karp-Rabin method, see [182].
Theorem 4.4.4. When k primes are chosen randomly between 1 and I and k fingerprints are used, the probability of a false match between P and T is at most [π(nm)/π(I)]^k.
PROOF We saw in the proof of Theorem 4.4.3 that if p is a prime that allows H_p(P) = H_p(T_r) at some position r where P does not occur, then p is in a set of at most π(nm) integers. When k fingerprints are used, a false match can occur only if each of the k primes is in that set, and since the primes are chosen randomly (independently), the bound from Theorem 4.4.3 holds for each of the primes. So the probability that all k primes are in the set is bounded by [π(nm)/π(I)]^k, and the theorem is proved.
Theorem 4.4.5. When k primes are chosen randomly between 1 and I and k fingerprints are used, the probability of a false match between P and T is at most m[π(n)/π(I)]^k.
PROOF Suppose that a false match occurs at some fixed position r. That means that each of the k primes divides |H(P) − H(T_r)|, a number less than 2^n, so by Corollary 4.4.1 each prime lies in a set of fewer than π(n) integers. The probability of that event is at most [π(n)/π(I)]^k, and summing over the m possible positions gives the bound.
Corollary 4.4.3. When k primes are chosen randomly and used in the fingerprint algorithm, the probability of a false match between P and T is at most (1.26)^k m^{-(2k-1)} (1 + 0.6 ln m)^k.
Applying this to the running example of n = 250, m = 4000, and k = 4 reduces the probability of a false match to at most 2 × 10^{-22}.
We mention one further refinement discussed in [266]. Returning to the case where only a single prime is used, suppose the algorithm explicitly checks that P occurs in T when H_p(P) = H_p(T_r), and it finds that P does not occur there. Then one may be better off picking a new prime to use for the continuation of the computation. This makes intuitive sense. Theorem 4.4.3 randomizes over the choice of primes and bounds the probability that a randomly picked prime will allow a false match anywhere in T. But once the prime has been shown to allow a false match, it is no longer random. It may well be a prime that
specifications be efficiently handled with the Shift-And method or agrep? The answer partly depends on the number of such specifications that appear in the expression.
8. (Open problem) Devise a purely comparison-based method to compute match-counts in O(m log m) time. Perhaps one can examine the FFT method in detail to see if complex arithmetic can be replaced with character comparisons in the case of computing match-counts.
11. Complete the details and analysis needed to convert the Karp-Rabin method from a Monte Carlo-style randomized algorithm to a Las Vegas-style randomized algorithm.
12. There are improvements possible in the method to check for false matches in the Karp-Rabin method. For example, the method can find in O(m) time all those runs containing no false matches. Explain how. Also, at some point, the method needs to explicitly check for P at only l_1, and not l_2. Explain when and why.
more than two consecutive runs. It follows that the total time for the method, over all runs, is O(m).
With the ability to check for false matches in O(m) time, the Karp-Rabin algorithm can be converted from a method with a small probability of error that runs in O(m) worst-case time to one that makes no error but runs in O(m) expected time (a conversion from a Monte Carlo algorithm to a Las Vegas algorithm). To achieve this, simply (re)run and (re)check the Karp-Rabin algorithm until no false matches are detected. We leave the details as an exercise.
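In outline, the conversion looks as follows (a sketch, not the book's exercise solution, reusing the karp_rabin sketch from earlier). For brevity it verifies each declared occurrence directly, which is O(m) overall only when declarations are sparse; the run-based check described above achieves O(m) always.

    def las_vegas_matches(P, T):
        while True:
            declared = karp_rabin(P, T, check=False)   # one Monte Carlo run
            if all(T[r:r + len(P)] == P for r in declared):
                return declared                        # no false match: done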
4.5. Exercises
1. Evaluate empirically the speed of the methods presented in this chapter for different sizes of P and T.
2. Extend the agrep method to solve the problem of finding an "occurrence" of a pattern P
inside a text T, when a small number of insertions and deletions of characters, as well as
mismatches, are allowed. That is, characters can be inserted into P and characters can
be deleted from P.
3. Adapt Shift-And and agrep to handle a set of patterns. Can you do better than just handling each pattern in the set independently?
4. Prove the correctness of the agrep method.
5. Show how to efficiently handle wild cards (both in the pattern and the text) in the Shift-And
approach. Do the same for agrep. Show that the efficiency of neither method is affected
by the number of wild cards in the strings.
6. Extend the Shift-And method to efficiently handle regular expressions that do not use the Kleene closure. Do the same for agrep. Explain the utility of these extensions to collections of biosequence patterns such as those in PROSITE.
7. We mentioned in Exercise 32 of Chapter 3 that PROSITE patterns often specify a range for
the number of times that a subpattern repeats. Ranges of this type can be easily handled
by the O(nm) regular expression pattern matching method of Section 3.6. Can such range
PART II
Suffix Trees and Their Uses
A suffix tree is a data structure that exposes the internal structure of a string in a deeper way
than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used
to solve the exact matching problem in linear time (achieving the same worst-case bound
that the Knuth-Morris-Pratt and the Boyer-Moore algorithms achieve), but their real virtue
comes from their use in linear-time solutions to many string problems more complex than
exact matching. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge
between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.
The classic application for suffix trees is the substring problem. One is first given a text
T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in
any unknown string S of length n and in O(n) time either find an occurrence of S in T
or determine that S is not contained in T. That is, the allowed preprocessing takes time
proportional to the length of the text, but thereafter, the search for S must be done in time
proportional to the length of S, independent of the length of T. These bounds are achieved
with the use of a suffix tree. The suffix tree for the text is built in O(m) time during a
preprocessing stage; thereafter, whenever a string of length O(n) is input, the algorithm
searches for it in O(n) time using that suffix tree.
The O(m) preprocessing and O(n) search result for the substring problem is very
surprising and extremely useful. In typical applications, a long sequence of requested
strings will be input after the suffix tree is built, so the linear time bound for each search
is important. That bound is not achievable by the Knuth-Morris-Pratt or Boyer-Moore methods - those methods would preprocess each requested string on input and then take O(m) (worst-case) time to search for the string in the text. Because m may be huge compared to n, those algorithms would be impractical on any but trivial-sized texts.
Often the text is a fixed set of strings, for example, a collection of STSs or ESTs (see Sections 3.5.1 and 7.10), so that the substring problem is to determine whether the input string is a substring of any of the fixed strings. Suffix trees work nicely to efficiently solve this problem as well. Superficially, this case of multiple text strings resembles the dictionary problem discussed in the context of the Aho-Corasick algorithm. Thus it is natural to expect that the Aho-Corasick algorithm could be applied. However, the Aho-Corasick method does not solve the substring problem in the desired time bounds, because it will only determine if the new string is a full string in the dictionary, not whether it is a substring of a string in the dictionary.
After presenting the algorithms, several applications and extensions will be discussed
in Chapter 7. Then a remarkable result, the constant-time least common ancestor method,
will be presented in Chapter 8. That method greatly amplifies the utility of suffix trees,
as will be illustrated by additional applications in Chapter 9. Some of those applications
provide a bridge to inexact matching; more applications of suffix trees will be discussed
in Part III, where the focus is on inexact matching.
Figure 5.1: Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used later.
of another suffix of S, then no suffix tree obeying the above definition is possible, since the path for the first suffix would not end at a leaf. For example, if the last character of xabxac is removed, creating string xabxa, then suffix xa is a prefix of suffix xabxa, so the path spelling out xa would not end at a leaf.
To avoid this problem, we assume (as was true in Figure 5.1) that the last character of
S appears nowhere else in S. Then, no suffix of the resulting string can be a prefix of any
other suffix. To achieve this in practice, we can add a character to the end of S that is not
in the alphabet that string S is taken from. In this book we use $ for the "termination"
character. When it is important to emphasize the fact that this termination character has
been added, we will write it explicitly as in S$. Much of the time, however, this reminder
will not be necessary and, unless explicitly stated otherwise, every string S is assumed to
be extended with the termination symbol $, even if the symbol is not explicitly shown.
A suffix tree is related to the keyword tree (without backpointers) considered in Section 3.4. Given string S, if set P is defined to be the m suffixes of S, then the suffix tree for S can be obtained from the keyword tree for P by merging any path of nonbranching nodes into a single edge. The simple algorithm given in Section 3.4 for building keyword trees could be used to construct a suffix tree for S in O(m²) time, rather than the O(m) bound we will establish.
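A sketch of that quadratic-time construction follows: insert the suffixes of S$ one at a time, splitting an edge at the first mismatch. Edge labels are stored as plain substrings for clarity (a space-economical version would store (start, end) index pairs instead), and the class and function names are choices made here.

    class STNode:
        def __init__(self):
            self.edges = {}              # first character -> (edge label, child)
            self.leaf = None             # suffix number if this node is a leaf

    def naive_suffix_tree(S):
        S = S + '$'
        root = STNode()
        for i in range(len(S)):          # insert suffix S[i:], numbered i + 1
            v, s = root, S[i:]
            while True:
                if s[0] not in v.edges:
                    leaf = STNode()
                    leaf.leaf = i + 1
                    v.edges[s[0]] = (s, leaf)
                    break
                label, child = v.edges[s[0]]
                j = 0                    # length of common prefix of s and label
                while j < len(label) and label[j] == s[j]:
                    j += 1
                if j == len(label):      # edge label exhausted: walk down
                    v, s = child, s[j:]
                    continue
                mid = STNode()           # split the edge at the mismatch point
                mid.edges[label[j]] = (label[j:], child)
                v.edges[label[0]] = (label[:j], mid)
                leaf = STNode()
                leaf.leaf = i + 1
                mid.edges[s[j]] = (s[j:], leaf)
                break
        return root

    tree = naive_suffix_tree("xabxac")   # the string of Figure 5.1, with $ appended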
Definition The label of a path from the root that ends at a node is the concatenation, in order, of the substrings labeling the edges of that path. The path-label of a node is the label of the path from the root of T to that node.
Definition For any node v in a suffix tree, the string-depth of v is the number of characters in v's label.
Definition A path that ends in the middle of an edge (u, v) splits the label on (u, v) at a designated point. Define the label of such a path as the label of u concatenated with the characters on edge (u, v) down to the designated split point.
For example, in Figure 5.1 string xa labels the internal node w (so node w has path-label xa), string a labels node u, and string xabx labels a path that ends inside edge (w, 1), that is, inside the leaf edge touching leaf 1.
Definition A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].
For example, the suffix tree for the string xabxac is shown in Figure 5.1. The path from the root to the leaf numbered 1 spells out the full string S = xabxac, while the path to the leaf numbered 5 spells out the suffix ac, which starts in position 5 of S.
As stated above, the definition of a suffix tree for S does not guarantee that a suffix tree for any string S actually exists. The problem is that if one suffix of S matches a prefix
is proportional to the number of edges traversed, so the time for the traversal is O(k), even
though the total string-depth of those O(k) edges may be arbitrarily larger than k.
If only a single occurrence of P is required, and the preprocessing is extended a bit, then the search time can be reduced from O(n + k) to O(n) time. The idea is to write at each node one number (say the smallest) of a leaf in its subtree. This can be achieved in O(m) time in the preprocessing stage by a depth-first traversal of T. The details are straightforward and are left to the reader. Then, in the search stage, the number written on the node at or below the end of the match gives one starting position of P in T.
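On the representation of the earlier construction sketch, that preprocessing is a few lines (min_leaf is a field name invented here):

    def annotate_min_leaf(v):
        # depth-first traversal writing the smallest leaf number in each subtree
        if v.leaf is not None:
            v.min_leaf = v.leaf
        else:
            v.min_leaf = min(annotate_min_leaf(child)
                             for _, child in v.edges.values())
        return v.min_leaf

    annotate_min_leaf(tree)   # afterwards, a matched node yields a start position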
In Section 7.2.1 we will again consider the relative advantages of methods that preprocess the text versus methods that preprocess the pattern(s). Later, in Section 7.8, we will also show how to use a suffix tree to solve the exact matching problem using O(n) preprocessing and O(m) search time, achieving the same bounds as in the algorithms presented in Part I.
Figure 5.2: Three occurrences of aw in awyawxawxz. Their starting positions number the leaves in the
subtree of the node with path-label aw.
Figure 6.1: Suffix tree for string xabxa$.
Figure 6.2: Implicit suffix tree for string xabxa.
The
terminal symbol $ was added to the end of S precisely to avoid this situation. However, if
S ends with a character that appears nowhere else in S, then the implicit suffix tree of S
will have a leaf for each suffix and will hence be a true suffix tree.
As an example, consider the suffix tree for string xabxa$ shown in Figure 6.1. Suffix xa is a prefix of suffix xabxa, and similarly the string a is a prefix of abxa. Therefore, in the suffix tree for xabxa$ the edges leading to leaves 4 and 5 are labeled only with $.
Removing these edges creates two nodes with only one child each, and these are then
removed as well. The resulting implicit suffix tree for xabxa is shown in Figure 6.2. As
another example, Figure 5.1 on page 91 shows a tree built for the string xabxac. Since
character c appears only at the end of the string, the tree in that figure is both a suffix tree
and an implicit suffix tree for the string.
Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all
the suffixes of S - each suffix is spelled out by the characters on some path from the root of
the implicit suffix tree. However, if the path does not end at a leaf, there will be no marker
to indicate the path's end. Thus implicit suffix trees, on their own, are somewhat less
informative than true suffix trees. We will use them just as a tool in Ukkonen's algorithm
to finally obtain the true suffix tree for S.
6
Linear-Time Construction of Suffix Trees
We will present two methods for constructing suffix trees in detail, Ukkonen's method
and Weiner's method. Weiner was the first to show that suffix trees can be built in linear
time, and his method is presented both for its historical importance and for some different
technical ideas that it contains. However, Ukkonen's method is equally fast and uses far
less space (i.e., memory) in practice than Weiner's method. Hence Ukkonen is the method
of choice for most problems requiring the construction of a suffix tree. We also believe
that Ukkonen's method is easier to understand. Therefore, it will be presented first. A
reader who wishes to study only one method is advised to concentrate on it. However, our
development of Weiner's method does not depend on understanding Ukkonen's algorithm,
and the two algorithms can be read independently (with one small shared section noted in
the description of Weiner's method).
Ukkonen's algorithm constructs an implicit suffix tree T_i for each prefix S[1..i] of S, for i from 1 to m.
The implicit suffix tree for any string S will have fewer leaves than the suffix tree for S$ if and only if at least one of the suffixes of S is a prefix of another suffix.
Figure 6.3: Implicit suffix tree for string axabx before the sixth character, b, is added.
"
Figure 6.4: Extended implicit suffix tree after the addition of character b.
As an example, consider the implicit suffix tree for S = axabx shown in Figure 6.3. The first four suffixes end at leaves, but the single-character suffix x ends inside an edge. When a sixth character b is added to the string, the first four suffixes get extended by applications of Rule 1, the fifth suffix gets extended by Rule 2, and the sixth by Rule 3. The result is shown in Figure 6.4.
We develop Ukkonen's algorithm by first presenting an O(m³)-time method to build all trees T_i and then optimizing its implementation to obtain the claimed time bound.
Rule 1 In the current tree, path β ends at a leaf. That is, the path from the root labeled β extends to the end of some leaf edge. To update the tree, character S(i + 1) is added to the end of the label on that leaf edge.
Rule 2 No path from the end of string β starts with character S(i + 1), but at least one labeled path continues from the end of β.
In this case, a new leaf edge starting from the end of β must be created and labeled with character S(i + 1). A new node will also have to be created there if β ends inside an edge. The leaf at the end of the new leaf edge is given the number j.
Rule 3 Some path from the end of string β starts with character S(i + 1). In this case the string βS(i + 1) is already in the current tree, so (remembering that in an implicit suffix tree the end of a suffix need not be explicitly marked) we do nothing.
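The three rules translate almost literally into code. The sketch below implements one extension of the naive O(m³) method on the string-labeled representation of the earlier sketch (reusing its STNode class); walking down from the root on every extension is exactly the naivety that suffix links will later remove.

    def extend(root, S, j, i):
        # extend beta = S[j..i] (0-based S[j:i+1]) by the character S[i+1]
        beta, c = S[j:i + 1], S[i + 1]
        v, s = root, beta
        parent, key = None, None
        while s:                                  # walk down the path labeled beta
            label, child = v.edges[s[0]]
            if len(s) < len(label):               # beta ends inside this edge
                if label[len(s)] == c:
                    return                        # Rule 3: do nothing
                mid = STNode()                    # Rule 2: split and add a leaf
                mid.edges[label[len(s)]] = (label[len(s):], child)
                v.edges[label[0]] = (label[:len(s)], mid)
                leaf = STNode()
                leaf.leaf = j + 1
                mid.edges[c] = (c, leaf)
                return
            parent, key = v, s[0]
            v, s = child, s[len(label):]
        if v.leaf is not None:                    # Rule 1: beta ends at a leaf
            label, _ = parent.edges[key]
            parent.edges[key] = (label + c, v)
        elif c not in v.edges:                    # Rule 2 at an existing node
            leaf = STNode()
            leaf.leaf = j + 1
            v.edges[c] = (c, leaf)
        # otherwise Rule 3: the path continues with c, so do nothing

    def naive_ukkonen(S):
        root = STNode()
        first = STNode()
        first.leaf = 1
        root.edges[S[0]] = (S[0], first)          # the implicit tree T_1
        for i in range(len(S) - 1):               # each pass adds character S[i+1]
            for j in range(i + 2):                # one extension per suffix
                extend(root, S, j, i)
        return root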
Following Corollary 6.1.1, all internal nodes in the changing tree will have suffix links
from them, except for the most recently added internal node, which will receive its suffix
link by the end of the next extension. We now show how suffix links are used to speed up
the implementation.
Recall that in phase i + 1 the algorithm locates suffix S[j..i] of S[1..i] in extension j, for j increasing from 1 to i + 1. Naively, this is accomplished by matching the string S[j..i] along a path from the root in the current tree. Suffix links can shortcut this walk in each extension. The first two extensions (for j = 1 and j = 2) in any phase i + 1 are the easiest to describe.
The end of the full string S[1..i] must be at a leaf of T_i, since S[1..i] is the longest string represented in that tree. That makes it easy to find the end of that suffix (as the trees are constructed, we can keep a pointer to the leaf corresponding to the current full string S[1..i]), and its suffix extension is handled by Rule 1 of the extension rules. So the first extension of any phase is special and only takes constant time, since the algorithm has a pointer to the end of the current full string.
Let string S[1..i] be xα, where x is a single character and α is a (possibly empty) substring, and let (v, 1) be the tree-edge that enters leaf 1. The algorithm next must find the end of string S[2..i] = α in the current tree derived from T_i. The key is that node v is either the root or an interior node of T_i. If it is the root, then to find the end of α the algorithm just walks down the tree following the path labeled α, as in the naive algorithm. But if v is an internal node, then by Corollary 6.1.2 (since v was in T_i) v has a suffix link out of it to node s(v). Further, since s(v) has a path-label that is a prefix of string α, the end of string α must be in the subtree of s(v). Consequently, in searching for the end of α in the current tree, the algorithm need not walk down the entire path from the root, but can instead begin the walk from node s(v). That is the main point of including suffix links in the algorithm.
To describe the second extension in more detail, let γ denote the edge-label on edge (v, 1). To find the end of α, walk up from leaf 1 to node v; follow the suffix link from v to s(v); and walk from s(v) down the path (which may be more than a single edge) labeled γ. The end of that path is the end of α (see Figure 6.5). At the end of path α, the tree is updated following the suffix extension rules. This completely describes the first two extensions of phase i + 1.
To extend any string S[j..i] to S[j..i + 1] for j > 2, repeat the same general idea: Starting at the end of string S[j − 1..i] in the current tree, walk up at most one node to either the root or to a node v that has a suffix link from it; let γ be the edge-label of that edge; assuming v is not the root, traverse the suffix link from v to s(v); then walk down the tree from s(v), following a path labeled γ to the end of S[j..i]; finally, extend the suffix to S[j..i + 1] according to the extension rules.
There is one minor difference between extensions for j > 2 and the first two extensions. In general, the end of S[j − 1..i] may be at a node that itself has a suffix link from it, in which case the algorithm traverses that suffix link. Note that even when extension Rule 2 applies in extension j − 1 (so that the end of S[j − 1..i] is at a newly created internal node w), if the parent of w is not the root, then the parent of w already has a suffix link out of it, as guaranteed by Lemma 6.1.1. Thus in extension j the algorithm never walks up more than one edge.
bound. However, taken together, they do achieve a linear worst-case time. The most important element of the acceleration is the use of suffix links.
We will sometimes refer to a suffix link from v to s(v) as the pair (v, s(v)). For example,
in Figure 6.1 (on page 95) let v be the node with path-label xa and let s(v) be the node
whose path-label is the single character a. Then there exists a suffix link from node v to node s(v). In this case, α is just a single character long.
As a special case, if α is empty, then the suffix link from an internal node with path-label xα goes to the root node. The root node itself is not considered internal and has no suffix
link from it.
Although the definition of suffix links does not imply that every internal node of an implicit
suffix tree has a suffix link from it, it will, in fact, have one. We actually establish something
stronger in the following lemmas and corollaries.
Lemma 6.1.1. If a new internal node v with path-label xα is added to the current tree in extension j of some phase i + 1, then either the path labeled α already ends at an internal node of the current tree or an internal node at the end of string α will be created (by the extension rules) in extension j + 1 in the same phase i + 1.
Corollary 6.1.1. In Ukkonen's algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
PROOF The proof is inductive and is true for tree T_1, since T_1 contains no internal nodes. Suppose the claim is true through the end of phase i, and consider a single phase i + 1. By Lemma 6.1.1, when a new node v is created in extension j, the correct node s(v) ending the suffix link from v will be found or created in extension j + 1. No new internal node gets created in the last extension of a phase (the extension handling the single character suffix S(i + 1)), so all suffix links from internal nodes created in phase i + 1 are known by the end of the phase and tree T_{i+1} has all its suffix links.
Corollary 6.1.1 is similar to Theorem 6.2.5, which will be discussed during the treatment
of Weiner's algorithm, and states an important fact about implicit suffix trees and ultimately
about suffix trees. For emphasis, we restate the corollary in slightly different language.
The skip/count trick reduces the worst-case time for the algorithm to O(m²). This trick will also be central in other algorithms to build and use
suffix trees.
Definition Define the node-depth of a node u to be the number of nodes on the path
from the root to u.
Lemma 6.1.2. Let (v, s(v)) be any suffix link traversed during Ukkonen's algorithm. At
that moment, the node-depth of v is at most one greater than the node-depth of s(v).
PROOF When edge (v, s(v)) is traversed, any internal ancestor of v, which has path-label
xβ say, has a suffix link to a node with path-label β. But xβ is a prefix of the path to v,
so β is a prefix of the path to s(v), and it follows that the suffix link from any internal
ancestor of v goes to an ancestor of s(v). Moreover, if β is nonempty then the node labeled
by β is an internal node. And, because the node-depths of any two ancestors of v must
differ, each ancestor of v has a suffix link to a distinct ancestor of s(v). It follows that the
node-depth of s(v) is at least one (for the root) plus the number of internal ancestors of
v who have path-labels more than one character long. The only extra ancestor that v can
have (without a corresponding ancestor for s(v)) is an internal ancestor whose path-label
has length one (it has label x). Therefore, v can have node-depth at most one more than
s(v). (See Figure 6.7.) □
Figure 6.5: Extension j > 1 in phase i + 1. Walk up at most one edge (labeled γ) from the end of the path
labeled S[j - 1..i] to node v; then follow the suffix link to s(v); then walk down the path specifying substring
γ; then apply the appropriate extension rule to insert suffix S[j..i + 1].
End.
Assuming the algorithm keeps a pointer to the current full string S[1..i], the first
extension of phase i + 1 need not do any up or down walking. Furthermore, the first
extension of phase i + 1 always applies suffix extension rule 1.
Figure 6.7: For every node v on the path xα, the corresponding node s(v) is on the path α. However, the
node-depth of s(v) can be one less than the node-depth of v, it can be equal, or it can be greater. For
example, the node labeled xab has node-depth two, whereas the node-depth of ab is one. The node-depth
of the node labeled xabcdefg is four, whereas the node-depth of abcdefg is five.
Corollary 6.1.3. Ukkonen's algorithm can be implemented with suffix links to run in
O(m²) time.
Note that the O(m²) time bound for the algorithm was obtained by multiplying the O(m)
time bound on a single phase by m (since there are m phases). This crude multiplication
was necessary because the time analysis was directed to only a single phase. What is
needed are some changes to the implementation allowing a time analysis that crosses
phase boundaries. That will be done shortly.
At this point the reader may be a bit weary because we seem to have made no progress,
since we started with a naive O(m²) method. Why all the work just to come back to the
same time bound? The answer is that although we have made no progress on the time
bound, we have made great conceptual progress so that with only a few more easy details,
the time will fall to O(m). In particular, we will need one simple implementation detail
and two more little tricks.
Figure 6.6: The skip/count trick. In phase i + 1, substring γ has length ten. There is a copy of substring
γ out of node s(v); it is found three characters down the last edge, after four node skips are executed.
Definition As the algorithm proceeds, the current node-depth of the algorithm is the
node-depth of the node most recently visited by the algorithm.
Theorem 6.1.1. Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m)
time.
PROOF There are i + 2 extensions in phase i + 1. In a single extension, the algorithm
walks up at most one edge, traverses at most one suffix link, walks down some number
of nodes using the skip/count trick, applies the extension rules, and possibly adds a suffix
link. All of this work except the down-walk takes constant time per extension, so only the
total time for the down-walks over the whole phase needs to be bounded. For that, consider
how the current node-depth changes over the phase. Each up-walk reduces the current
node-depth by at most one, and by Lemma 6.1.2 each suffix link traversal reduces it by at
most one more, so the current node-depth is decremented at most 2(i + 2) times during
the phase. Since the current node-depth starts the phase at most m, never falls below zero,
and each node skip during a down-walk increments it by exactly one, the total number of
node skips in the phase is bounded by the total number of decrements plus m, which is
O(m). Hence the phase takes O(m) time. □
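The skip/count down-walk itself can be sketched in a few lines: the algorithm compares only the first character of each edge and then skips the whole edge by arithmetic on the stored edge length, so the cost is one constant-time "node skip" per edge rather than one comparison per character. The sketch below reuses the hypothetical Node representation from the earlier fragment and is illustrative only:

    def skip_count_walk(S, node, g_start, g_end):
        # Find the end of the path labeled S[g_start..g_end] below `node`,
        # doing constant work per edge instead of per character.
        g = g_end - g_start + 1            # number of characters still to cover
        pos = g_start
        while g > 0:
            child = node.children[S[pos]]  # choose the edge by its first character only
            edge_len = child.edge_end - child.edge_start + 1
            if edge_len <= g:
                g -= edge_len              # skip the entire edge in O(1): one node skip
                pos += edge_len
                node = child
            else:
                return child, g            # the path ends g characters down this edge
        return node, 0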
Observation 1: Rule 3 is a show stopper In any phase, if suffix extension rule 3 applies
in extension j, it will also apply in all further extensions (j + 1 to i + 1) until the end of
the phase. The reason is that when rule 3 applies, the path labeled S[j..i] in the current
tree must continue with character S(i + 1), and so the path labeled S[j + 1..i] does also,
and rule 3 again applies in extensions j + 1, j + 2, ..., i + 1.
When extension rule 3 applies, no work needs to be done since the suffix of interest is
already in the tree. Moreover, a new suffix link needs to be added to the tree only after
an extension in which extension rule 2 applies. These facts and Observation 1 lead to the
following implementation trick.
Trick 2 End any phase i + 1 the first time that extension rule 3 applies. If this happens in
extension j, then there is no need to explicitly find the end of any string S[k..i] for k > j.
The extensions in phase i + 1 that are "done" after the first execution of rule 3 are said
to be done implicitly. This is in contrast to any extension j where the end of S[j..i] is
explicitly found. An extension of that kind is called an explicit extension.
Trick 2 is clearly a good heuristic to reduce work, but it's not clear if it leads to a better
worst-case time bound. For that we need one more observation and trick.
Observation 2: Once a leaf, always a leaf That is, if at some point in Ukkonen's
algorithm a leaf is created and labeled j (for the suffix starting at position j of S), then
that leaf will remain a leaf in all successive trees created during the algorithm. This is
true because the algorithm has no mechanism for extending a leaf edge beyond its current
leaf. In more detail, once there is a leaf labeled j, extension rule 1 will always apply to
extension j in any successive phase. So once a leaf, always a leaf.
Now leaf 1 is created in phase 1, so in any phase i there is an initial sequence of
consecutive extensions (starting with extension 1) where extension rule 1 or 2 applies.
Let j_i denote the last extension in this sequence. Since any application of rule 2 creates
a new leaf, it follows from Observation 2 that j_i ≤ j_{i+1}. That is, the initial sequence of
extensions where rule 1 or 2 applies cannot shrink in successive phases. This suggests an
implementation trick that in phase i + 1 avoids all explicit extensions 1 through j_i. Instead,
only constant time will be required to do those extensions implicitly.
To describe the trick, recall that the label on any edge in an implicit suffix tree (or a
suffix tree) can be represented by two indices p, q specifying the substring S[p..q]. Recall
also that for any leaf edge of I_i, index q is equal to i, and in phase i + 1 index q gets
incremented to i + 1, reflecting the addition of character S(i + 1) to the end of each suffix.
Trick 3 In phase i + 1, when a leaf edge is first created and would normally be labeled
with substring S[p..i + 1], instead of writing indices (p, i + 1) on the edge, write (p, e),
where e is a symbol denoting "the current end". Symbol e is a global index that is set to
i + 1 once in each phase. In phase i + 1, since the algorithm knows that rule 1 will apply
in extensions 1 through j_i at least, it need do no additional explicit work to implement
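One way to realize the symbol e is to let every leaf edge hold a reference to a single shared, mutable end index, so that the one assignment e = i + 1 implicitly extends all leaf edges at once. A small illustrative Python sketch (0-based indices; the class names are invented for the example):

    class GlobalEnd:
        # the shared symbol "e" of Trick 3
        def __init__(self, value=0):
            self.value = value

    class LeafEdge:
        def __init__(self, start, global_end):
            self.start = start       # p: the edge label is S[p..e]
            self.end = global_end    # e: one object shared by every leaf edge

        def label(self, S):
            return S[self.start:self.end.value + 1]

    S = "xabxac"
    e = GlobalEnd(0)
    leaf = LeafEdge(0, e)            # leaf edge created with label (0, e)
    e.value = 2                      # one increment per phase extends ALL leaf edges
    assert leaf.label(S) == "xab"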
Figure 6.8: The left tree is a fragment of the suffix tree for string S = abcdefabcuvw, with the edge-labels
written explicitly. The right tree shows the edge-labels compressed. Note that the edge with label 2, 3 could
also have been labeled 8, 9.
each is labeled with a complete suffix, requiring 26·27/2 characters in all. For strings
longer than the alphabet size, some characters will repeat, but still one can construct strings
of arbitrary length m so that the resulting edge-labels have more than O(m) characters
in total. Thus, an O(m)-time algorithm for building suffix trees requires some alternate
scheme to represent the edge-labels.
Edge-label compression
A simple, alternate scheme exists for edge labeling. Instead of explicitly writing a substring
on an edge of the tree, only write a pair of indices on the edge, specifying beginning and
end positions of that substring in S (see Figure 6.8). Since the algorithm has a copy of
string S, it can locate any particular character in S in constant time given its position in the
string. Therefore, we may describe any particular suffix tree algorithm as if edge-labels
were explicit, and yet implement that algorithm with only a constant number of symbols
written on any edge (the index pair indicating the beginning and ending positions of a
substring).
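As a small illustration, an edge object need only store the index pair and consult S on demand. A hedged sketch using 0-based Python indices, so the pair (1, 2) below plays the role of the pair 2, 3 in Figure 6.8:

    class Edge:
        # Edge label stored as an index pair into S, never as an explicit substring.
        def __init__(self, start, end):
            self.start, self.end = start, end

        def length(self):
            return self.end - self.start + 1

        def char_at(self, S, offset):
            # constant-time access to the offset-th character of the label
            return S[self.start + offset]

    S = "abcdefabcuvw"
    edge = Edge(1, 2)                # label "bc"; Edge(7, 8) would denote it equally well
    assert [edge.char_at(S, k) for k in range(edge.length())] == ["b", "c"]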
For example, in Ukkonen's algorithm when matching along an edge, the algorithm uses
the index pair written on an edge to retrieve the needed characters from S and then performs
the comparisons on those characters. The extension rules are also easily implemented with
this labeling scheme. When extension rule 2 applies in a phase i + 1, label the newly created
edge with the index pair (i + 1, i + 1), and when extension rule 1 applies (on a leaf edge),
change the index pair on that leaf edge from (p, q) to (p, q + 1). It is easy to see inductively
that q had to be i and hence the new label (p, i + 1) represents the correct new substring
for that leaf edge.
By using an index pair to specify an edge-label, only two numbers are written on
any edge, and since the number of edges is at most 2m - 1, the suffix tree uses only
O(m) symbols and requires only O(m) space. This makes it more plausible that the tree
can actually be built in O(m) time.¹ Although the fully implemented algorithm will not
explicitly write a substring on an edge, we will still find it convenient to talk about "the
substring or label on an edge or path" as if the explicit substring was written there.
¹ We make the standard RAM model assumption that a number with up to log m bits can be read, written, or compared
in constant time.
Since there are only m phases, and j* is bounded by m, the algorithm therefore executes
only 2m explicit extensions. As established earlier, the time for an explicit extension is
a constant plus some time proportional to the number of node skips it does during the
down-walk in that extension.
To bound the total number of node skips done during all the down-walks, we consider
(similar to the proof of Theorem 6.1.1) how the current node-depth changes during
successive extensions, even extensions in different phases. The key is that the first explicit
extension in any phase (after phase 1) begins with extension j*, which was the last explicit
extension in the previous phase. Therefore, the current node-depth does not change
between the end of one extension and the beginning of the next. But (as detailed in the
proof of Theorem 6.1.1), in each explicit extension the current node-depth is first reduced
by at most two (up-walking one edge and traversing one suffix link), and thereafter the
down-walk in that extension increases the current node-depth by one at each node skip.
Since the maximum node-depth is m, and there are only 2m explicit extensions, it follows
(as in the proof of Theorem 6.1.1) that the maximum number of node skips done during
all the down-walking (and not just in a single phase) is bounded by O(m). All work has
been accounted for, and the theorem is proved. □
Theorem 6.1.3. Ukkonen's algorithm builds a true suffix tree for S, along with all its
suffix links, in O(m) time.
For example, Suff_1 is the entire string S, and Suff_m is the single character S(m).
Definition Define T_i to be the tree that has m - i + 2 leaves numbered i through m + 1
such that the path from the root to any leaf j (i ≤ j ≤ m + 1) has label Suff_j$. That is,
T_i is a tree encoding all and only the suffixes of string S[i..m]$, so it is a suffix tree of
string S[i..m]$.
Figure 6.9: Cartoon of a possible execution of Ukkonen's algorithm. Each line represents a phase of the
algorithm, and each number represents an explicit extension executed by the algorithm. In this cartoon
there are four phases and seventeen explicit extensions. In any two consecutive phases, there is at most
one index where the same explicit extension is executed in both phases.
those j_i extensions. Instead, it only does constant work to increment variable e, and then
does explicit work for (some) extensions starting with extension j_i + 1.
End
Step 3 correctly sets j_{i+1} because the initial sequence of extensions where extension
rule 1 or 2 applies must end at the point where rule 3 first applies.
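Putting Tricks 2 and 3 together, a single phase has the following schematic shape. This is a Python sketch with 1-based indices matching the text; locate_end and apply_extension_rules are assumed helper names standing in for the suffix-link walk and the extension rules, and are not defined here:

    def single_phase(S, tree, i, j_star, e):
        # Schematic phase i+1.  j_star is the last explicit extension of the
        # previous phase; the phase re-runs it first.  Returns the new j_star.
        e.value = i + 1                       # Trick 3: extensions 1..j_star-1 done implicitly
        for j in range(j_star, i + 2):        # explicit extensions j_star .. i+1
            end = locate_end(S, tree, j, i)   # suffix links + skip/count (assumed helper)
            rule = apply_extension_rules(S, tree, end, i + 1)   # assumed helper
            if rule == 3:                     # Trick 2: the first rule 3 ends the phase
                return j                      # extension j was the last explicit one
        return i + 1                          # the whole phase was explicit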
The key feature of algorithm SPA is that phase i + 2 will begin computing explicit
extensions with extension j*, where j* was the last explicit extension computed in phase
i + 1. Therefore, two consecutive phases share at most one index (j*) where an explicit
extension is executed (see Figure 6.9). Moreover, phase i + 1 ends knowing where string
S[j*..i + 1] ends, so the repeated extension of j* in phase i + 2 can execute the suffix
extension rule for j* without any up-walking, suffix link traversals, or node skipping. That
means the first explicit extension in any phase only takes constant time. It is now easy to
prove the main result.
Theorem 6.1.2. Using suffix links and implementation tricks 1, 2, and 3, Ukkonen's
algorithm builds implicit suffix trees I_1 through I_m in O(m) total time.
PROOF The time for all the implicit extensions in any phase is constant and so is O(m)
over the entire algorithm.
As the algorithm executes explicit extensions, consider an index j* corresponding to the
explicit extension the algorithm is currently executing. Over the entire execution of the
algorithm, j* never decreases, but it does remain the same between two successive phases.
Since a copy of string Head(i) begins at some position between i + 1 and m, Head(i) is
also a prefix of Suff_k for some k > i. It follows that Head(i) is the longest prefix (possibly
empty) of Suff_i that is a label on some path from the root in tree T_{i+1}.
The above straightforward algorithm to build T_i from T_{i+1} can be described as follows:
2. If there is no node at the end of Head(i) then create one, and let w denote the node (created
or not) at the end of Head(i). If w is created at this point, splitting an existing edge, then
split its existing edge-label so that w has path-label Head(i). Then, create a new leaf
numbered i and a new edge (w, i) labeled with the remaining characters of Suff_i $. That is,
the new edge-label should be the last m - i + 1 - |Head(i)| characters of Suff_i, followed
by the termination symbol $.
Figure 6.10: A step in the naive Weiner algorithm. The full string tat is added to the suffix tree for at. The
edge labeled with the single character $ is omitted, since such an edge is part of every suffix tree.
serve to introduce and illustrate important definitions and facts. Then we will speed up the
straightforward construction to obtain Weiner's linear-time algorithm.
Definition For any position i, Head(i) denotes the longest prefix of S[i..m] that matches
a substring of S[i + 1..m]$.
Note that Head(i) could be the empty string. In fact, Head(m) is always the empty
string because S[i + 1..m] is the empty string when i + 1 is greater than m, and character
S(m) ≠ $.
Suff_i and Suff_k both begin with string Head(i) = S(i)β and differ after that. For
concreteness, say Suff_i begins S(i)βa and Suff_k begins S(i)βb. But then Suff_{i+1} begins
βa and Suff_{k+1} begins βb. Both i + 1 and k + 1 are greater than or equal to i + 1 and less
than or equal to m, so both suffixes are represented in tree T_{i+1}. Therefore, in tree T_{i+1}
there must be a path from the root labeled β (possibly the empty string) that extends in
two ways, one continuing with character a and the other with character b. Hence there is a
node u in T_{i+1} with path-label β, and I_u(S(i)) = 1 since there is a path (namely, an initial
part of the path to leaf k) labeled S(i)β in T_{i+1}. Further, node u must be on the path to leaf
i + 1 since β is a prefix of Suff_{i+1}.
Now I_v(S(i)) = 1 and v has path-label α, so Head(i) must begin with S(i)α. That
means that α is a prefix of β, and so node u, with path-label β, must either be v or below v
on the path to leaf i + 1. However, if u ≠ v then u would be a node below v on the path
to leaf i + 1 with I_u(S(i)) = 1. This contradicts the choice of node v, so v = u, α = β,
and the theorem is proved. That is, Head(i) is exactly the string S(i)α. □
Note that in Theorem 6.2.1 and its proof we only assume that node v exists. No assumption about v' was made. This will be useful in one of the degenerate cases examined
later.
Theorem 6.2.2. Assume both v and v' have been found and L_{v'}(S(i)) points to node v''.
If l_i = 0 then Head(i) ends at v''; otherwise it ends after exactly l_i characters on a single
edge out of v''.
PROOF Since v' is on the path to leaf i + 1 and L_{v'}(S(i)) points to node v'', the path from
the root labeled Head(i) must include v''. By Theorem 6.2.1, Head(i) = S(i)α, so Head(i)
must end exactly l_i characters below v''. Thus, when l_i = 0, Head(i) ends at v''. But when
l_i > 0, there must be an edge e = (v'', z) out of v'' whose label begins with character c
(the first of the l_i characters on the path from v' to v) in T_{i+1}.
Can Head(i) extend down to node z (i.e., to a node below v'')? Node z must be a
branching node, for if it were a leaf then some suffix Suff_k, for k > i, would be a prefix
of Suff_i, which is not possible. Let z have path-label S(i)γ. If Head(i) extends down to
branching node z, then there must be two substrings starting at or after position i + 1 of
S that both begin with string γ. Therefore, there would be a node z' with path-label γ in
T_{i+1}. Node z' would then be below v' on the path to leaf i + 1, contradicting the selection
of v'. So Head(i) must not reach z and must end in the interior of edge e. In particular, it
ends exactly l_i characters from v'' on edge e. □
Thus when l_i = 0, we know Head(i) ends at v'', and when l_i > 0, we find Head(i) from
v'' by examining the edges out of v'' to identify the unique edge e whose first character
is c. Then Head(i) ends exactly l_i characters down e from v''. Tree T_i is then constructed
by subdividing edge e, creating a node w at this point, and adding a new edge from w to
leaf i labeled with the remainder of Suff_i. The search for the correct edge out of v'' takes
only constant time since the alphabet is fixed.
In summary, when v and v' exist, the above method correctly creates T_i from T_{i+1},
although we must still discuss how to update the vectors. Also, it may not yet be clear at
this point why this method is more efficient than the naive algorithm for finding Head(i).
That will come later. Let us first examine how the algorithm handles the degenerate cases
when v and v' do not both exist.
For any (single) character x and any node u, I_u(x) = 1 in T_{i+1} if and only if there is a path
from the root of T_{i+1} labeled xα, where α is the path-label of node u. The path labeled
xα need not end at a node.
For any character x, L_u(x) in T_{i+1} points to the (internal) node ū in T_{i+1} if and only if ū has
path-label xα, where u has path-label α. Otherwise L_u(x) is null.
For example, in the tree in Figure 5.1 (page 91) consider the two internal nodes u and
w with path-labels a and xa respectively. Then I_u(x) = 1 for the specific character x, and
L_u(x) = w. Also, I_u(b) = 1, but L_u(b) is null.
Clearly, for any node u and any character x, L_u(x) is nonnull only if I_u(x) = 1, but
the converse is not true. It is also immediate that if I_u(x) = 1 then I_v(x) = 1 for every
ancestor node v of u.
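For a fixed alphabet, the two vectors can simply be stored at every node, and the walk that looks for v and v' then reads them directly. A hedged Python sketch of this bookkeeping (the class and field names are hypothetical, not Weiner's original layout):

    SIGMA = "abxy$"        # some fixed alphabet, assumed for the example

    class WNode:
        def __init__(self, parent=None):
            self.parent = parent
            self.children = {}
            self.I = {x: False for x in SIGMA}   # I_u(x): does path x.(path-label of u) exist?
            self.L = {x: None for x in SIGMA}    # L_u(x): node with path-label x.(path-label of u)

    def find_v_and_v_prime(leaf, x):
        # Walk from leaf i+1 toward the root: v is the first node with I(x) = 1,
        # and v' is the first node with L(x) nonnull (L nonnull implies I = 1).
        v = v_prime = None
        node = leaf
        while node is not None:
            if v is None and node.I[x]:
                v = node
            if node.L[x] is not None:
                v_prime = node
                break
            node = node.parent
        return v, v_prime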
Tree T_m has only one nonleaf node, namely the root r. In this tree we set I_r(S(m)) to
one, set I_r(x) to zero for every other character x, and set all the link entries for the root to
null. Hence the above properties hold for T_m. The algorithm will maintain the vectors as
the tree changes, and we will prove inductively that the above properties hold for each tree.
We assume that tree T_{i+1} has just been constructed and we now want to build T_i. The
algorithm starts at leaf i + 1 of T_{i+1} (the leaf for Suff_{i+1}) and walks toward the root looking
for the first node v, if it exists, such that I_v(S(i)) = 1. If found, it then continues from
v walking upward toward the root searching for the first node v' it encounters (possibly
v) where L_{v'}(S(i)) is nonnull. By definition, L_{v'}(S(i)) is nonnull only if I_{v'}(S(i)) = 1, so
if found, v' will also be the first node encountered on the walk from leaf i + 1 such that
L_{v'}(S(i)) is nonnull. In general, it may be that neither v nor v' exists or that v exists but v'
does not. Note, however, that v or v' may be the root.
The "good case" is that both v and v' do exist.
Let l_i be the number of characters on the path between v' and v, and if l_i > 0 then let
c denote the first of these l_i characters.
Assuming the good case, that both v and v' exist, we will prove below that if node v
has path-label α then Head(i) is precisely string S(i)α. Further, we will prove that when
L_{v'}(S(i)) points to node v'' in T_{i+1}, Head(i) either ends at v'', if l_i = 0, or else it ends
exactly l_i characters below v'' on an edge out of v''. So in either case, Head(i) can be found
in constant time after v' is found.
Theorem 6.2.1. Assume that node v has been found by the algorithm and that it has
path-label α. Then the string Head(i) is exactly S(i)α.
PROOF Head(i) is the longest prefix of Suff_i that is also a prefix of Suff_k for some k > i.
Since v was found with I_v(S(i)) = 1, there is a path in T_{i+1} that begins with S(i), so
Head(i) is at least one character long. Therefore, we can express Head(i) as S(i)β, for
some (possibly empty) string β.
on this path. If l_i = 0 then Head(i) ends at v''. Otherwise, search for the edge e out of v''
whose first character is c. Head(i) ends exactly l_i characters below v'' on edge e.
4. If a node already exists at the end of Head(i), then let w denote that node; otherwise,
create a node w at the end of Head(i). Create a new leaf numbered i; create a new edge
(w, i) labeled with the remaining substring of Suff_i (i.e., the last m - i + 1 - |Head(i)|
characters of Suff_i), followed by the termination character $. Tree T_i has now been
created.
Correctness
It should be clear from the proof of Theorems 6.2.1 and 6.2.2 and the discussion of the
degenerate cases that the algorithm correctly creates tree T_i from T_{i+1}, although before it
can create T_{i-1}, it must update the I and L vectors.
Theorem 6.2.3. When a new node w is created in the interior of an edge (v'', z), the
indicator vector for w should be copied from the indicator vector for z.
PROOF It is immediate that if I_z(x) = 1 then I_w(x) must also be 1 in T_i. But can it happen
that I_w(x) should be 1 and yet I_w(x) is set to 0 at the moment that w is created? We will
see that it cannot.
Let node z have path-label γ, and of course node w has path-label Head(i), a prefix of
γ. The fact that there are no nodes between v'' and z in T_{i+1} means that every suffix from
Suff_{i+1} down to Suff_m that begins with string Head(i) must actually begin with the longer
Case 1 I_r(S(i)) = 0.
In this case the walk ends at the root and no node v was found. It follows that character
S(i) does not appear in any position greater than i, for if it did appear, then some suffix
in that range would begin with S(i), some path from the root would begin with S(i), and
I_r(S(i)) would have been 1. So when I_r(S(i)) = 0, Head(i) is the empty string and ends
at the root.
Case 2 I_v(S(i)) = 1 for some v (possibly the root), but v' does not exist.
In this case the walk ends at the root with L_r(S(i)) null. Let t_i be the number of characters
from the root to v. From Theorem 6.2.1, Head(i) ends exactly t_i + 1 characters from the
root. Since v exists, there is some edge e = (r, z) whose edge-label begins with character
S(i). This is true whether t_i = 0 or t_i > 0.
If t_i = 0 then Head(i) ends after the first character, S(i), on edge e.
Similarly, if t_i > 0 then Head(i) ends exactly t_i + 1 characters from the root on edge
e. For suppose Head(i) extends all the way to some child z (or beyond). Then exactly as
in the proof of Theorem 6.2.2, z must be a branching node and there must be a node z'
below the root on the path to leaf i + 1 such that L_{z'}(S(i)) is nonnull, which would be a
contradiction. So when t_i > 0, Head(i) ends exactly t_i + 1 characters from the root on the
edge e out of the root. This edge can be found from the root in constant time since its first
character is S(i).
In either of these degenerate cases (as in the good case), Head(i) is found in constant
time after the walk reaches the root. After the end of Head(i) is found and w is created or
found, the algorithm proceeds exactly as in the good case.
Note that degenerate Case 2 is very similar to the "good" case when both v and v' were
found, but differs in a small detail because Head(i) is found t_i + 1 characters down on e
rather than t_i characters down (the natural analogue of the good case).
PROOF The current node-depth can increase by one each time a new node is created and
each time a link pointer is traversed. Hence the total number of increases in the current
node-depth is at most 2m. It follows that the current node-depth can also only decrease at
most 2m times, since the current node-depth starts at zero and is never negative. The current
node-depth decreases at each move up the walk, so the total number of nodes visited during
all the upward walks is at most 2m. The time for the algorithm is proportional to the total
number of nodes visited during upward walks, so the theorem is proved. □
The space requirements for Ukkonen's and McCreight's algorithms are determined by the need to represent and
move around the tree quickly. We will be much more precise about space and practical implementation issues in
Section 6.5.
string γ. Hence in T_{i+1} there can be a path labeled xHead(i) only if there is also a path
labeled xγ, and this holds for any character x. Therefore, if there is a path in T_i labeled
xHead(i) (the requirement for I_w(x) to be 1) but no path xγ, then the hypothesized string
xHead(i) must begin at character i of S. That means that Suff_{i+1} must begin with the
string Head(i). But since w has path-label Head(i), leaf i + 1 must be below w in T_i and
so must be below z in T_{i+1}. That is, z is on the root to i + 1 path. However, to construct
T_i from T_{i+1} the algorithm starts at leaf i + 1 and walks toward the root, and when it finds
node v or reaches the root, the indicator entry for x has been set to 1 at every node on
the path from leaf i + 1. The walk finishes before node w is created, and so it cannot
be that I_z(x) = 0 at the time when w is created. So if path xHead(i) exists in T_i, then
I_z(x) = 1 at the moment w is created, and the theorem is proved. □
Lemma 6.2.1. When the algorithm traverses a link pointer from a node v' to a node v''
in T_{i+1}, the node-depth of v'' is at most one more than the node-depth of v'.
PROOF Let u be a nonroot node in T_{i+1} on the path from the root to v'', and suppose u
has path-label S(i)α for some nonempty string α. All nodes on the root-to-v'' path are of
this type, except for the single node (if it exists) with path-label S(i). Now S(i)α is a
prefix of Suff_i and of Suff_k for some k > i, and this string extends differently in the two
cases. Since v' is on the path from the root to leaf i + 1, α is a prefix of Suff_{i+1}, and there
must be a node (possibly the root) with path-label α on the path to v' in T_{i+1}. Hence the
path to v' has a node corresponding to every node on the path to v'', except the node (if it
exists) with path-label S(i). Hence the depth of v'' is at most one more than the depth of
v', although it could be less. □
Theorem 6.2.4. Assuming a finite alphabet, Weiner's algorithm constructs the suffix tree
for a string of length m in O(m) time.
Figure 6.11: Generalized suffix tree for strings S1 = xabxa and S2 = babxba. The first number at a leaf
indicates the string; the second number indicates the starting position of the suffix in that string.
in theory by suffix trees, where the typical string size is in the hundreds of thousands, or
even millions, and/or where the alphabet size is in the hundreds. For those problems, a
"linear" time and space bound is not sufficient assurance of practicality. For large trees,
paging can also be a serious problem because the trees do not have nice locality properties.
Indeed, by design, suffix links allow an algorithm to move quickly from one part of the
tree to a distant part of the tree. This is great for worst-case time bounds, but it is horrible
for paging if the tree isn't entirely in memory. Consequently, implementing suffix trees to
reduce practical space use can be a serious concern.⁴ The comments made here for suffix
trees apply as well to keyword trees used in the Aho-Corasick method.
The main design issues in all three algorithms are how to represent and search the
branches out of the nodes of the tree and how to represent the indicator and link vectors
in Weiner's algorithm. A practical design must balance the constraints of space against
the need for speed, both in building the tree and in using it afterwards. We will discuss
representing tree edges, since the vector issues for Weiner's algorithm are identical.
There are four basic choices possible to represent branches. The simplest is to use an
array of size Θ(|Σ|) at each nonleaf node v. The array at v is indexed by single characters
of the alphabet; the cell indexed by character x has a pointer to a child of v if there is an
edge out of v whose edge-label begins with character x and is otherwise null. If there is
such an edge, then the cell should also hold the two indices representing its edge-label.
This array allows constant-time random accesses and updates and, although simple to
program, it can use an impractical amount of space as |Σ| and m get large.
An alternative to the array is to use a linked list at node v of characters that appear at
the beginning of edge-labels out of v. When a new edge from v is added to the tree, a new
character (the first character on the new edge-label) is added to the list. Traversals from
node v are implemented by sequentially searching the list for the appropriate character.
Since the list is searched sequentially it costs no more to keep it in sorted order. This
somewhat reduces the average time to search for a given character and thus speeds up (in
practice) the construction of the tree. The key point is that it allows a faster termination
of a search for a character that is not in the list. Keeping the list in sorted order will be
particularly useful in some of the applications of suffix trees to be discussed later.
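For instance, a sorted child list can also be searched with binary search, which gives the same early termination for absent characters. A small illustrative Python sketch (not a complete tree node):

    import bisect

    class ListNode:
        # Children kept in a list sorted by the first character of each edge
        # label: the low-space alternative to an alphabet-sized array per node.
        def __init__(self):
            self.first_chars = []    # sorted first characters of outgoing edges
            self.kids = []           # child nodes, parallel to first_chars

        def add_child(self, c, child):
            k = bisect.bisect_left(self.first_chars, c)
            self.first_chars.insert(k, c)
            self.kids.insert(k, child)

        def find_child(self, c):
            # an unsuccessful search stops as soon as c's position is known
            k = bisect.bisect_left(self.first_chars, c)
            if k < len(self.first_chars) and self.first_chars[k] == c:
                return self.kids[k]
            return None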
A very different approach to limiting space, based on changing the suffix tree into a different data structure called
a suffix array, will be discussed in Section 7.14.
many in molecular biology, space is more of a constraint than is time), the size of the
suffix tree for a string may dictate using the solution that builds the smaller suffix tree. So
despite the added conceptual burden, we will discuss such space-reducing alternatives in
some detail throughout the book.
6.5.1. Alphabet independence: all linears are equal, but some are
more equal than others
The key implementation problems discussed above are all related to multiple edges (or
links) at nodes. These are influenced by the size of the alphabet Σ - the larger the alphabet,
the larger the problem. For that reason, some people prefer to explicitly reflect the alphabet
size in the time and space bounds of keyword and suffix tree algorithms. Those people
usually refer to the construction time for keyword or suffix trees as O(m log |Σ|), where m
is the size of all the patterns in a keyword tree or the size of the string in a suffix tree. More
completely, the Aho-Corasick, Weiner, Ukkonen, and McCreight algorithms all either
require O(m|Σ|) space, or the O(m) time bound should be replaced with the minimum of
O(m log m) and O(m log |Σ|). Similarly, searching for a pattern P using a suffix tree can
be done with O(|P|) comparisons only if we use O(m|Σ|) space; otherwise we must allow
the minimum of O(|P| log m) and O(|P| log |Σ|) comparisons during a search for P.
In contrast, the exact matching method using Z values has worst-case space and comparison requirements that are alphabet independent - the worst-case number of comparisons
(either characters or numbers) used to compute Z values is uninfluenced by the size of the
alphabet. Moreover, when two characters are compared, the method only checks whether
the characters are equal or unequal, not whether one character precedes the other in some
ordering. Hence no prior knowledge about the alphabet need be assumed. These properties
are also true of the Knuth-Morris-Pratt and the Boyer-Moore algorithms. The alphabet
independence of these algorithms makes their linear time and space bounds superior, in
some people's view, to the linear time and space bounds of keyword and suffix tree algorithms: "All linears are equal but some are more equal than others". Alphabet-independent
algorithms have also been developed for a number of problems other than exact matching. Two-dimensional exact matching is one such example. The method presented in
Section 3.5.3 for two-dimensional matching is based on keyword trees and hence is not
alphabet independent. Nevertheless, alphabet-independent solutions for that problem have
been developed. Generally, alphabet-independent methods are more complex than their
coarser counterparts. In this book we will not consider alphabet-independence much further, although we will discuss other approaches to reducing space that can be employed if
large alphabets cause excessive space use.
6.6. Exercises
1. Construct an infinite family of strings over a fixed alphabet, where the total length of the
edge-labels on their suffix trees grows faster than O(m) (m is the length of the string). That
is, show that linear-time suffix tree algorithms would be impossible if edge-labels were
written explicitly on the edges.
2. In the text, we first introduced Ukkonen's algorithm at a high level and noted that it could
be implemented in O(m³) time. That time was then reduced to O(m²) with the use of
suffix links and the skip/count trick. An alternative way to reduce the O(m³) time to O(m²)
(without suffix links or skip/count) is to keep a pointer to the end of each suffix of S[1..i].
Keeping a linked list at node v works well if the number of children of v is small, but
in the worst case it adds time |Σ| to every node operation. The O(m) worst-case time bounds
are preserved since |Σ| is assumed to be fixed, but if the number of children of v is large
then little space is saved over the array while noticeably degrading performance.
A third choice, a compromise between space and speed, is to implement the list at node
v as some sort of balanced tree [10]. Additions and searches then take O(log k) time and
O(k) space, where k is the number of children of v. Due to the space and programming
overhead of these methods, this alternative makes sense only when k is fairly large.
The final choice is some sort of hashing scheme. Again, the challenge is to find a scheme
balancing space with speed, but for large trees and alphabets hashing is very attractive,
at least for some of the nodes. And, using perfect hashing techniques [167] the linear
worst-case time bound can even be preserved.
When m and Σ are large enough to make implementation difficult, the best design is
probably a mixture of the above choices. Nodes near the root of the tree tend to have
the most children (the root has a child for every distinct character appearing in S), and
so arrays are a sensible choice at those nodes. In addition, if the tree is dense for several
levels below the root, then those levels can be condensed and eliminated from the explicit
tree. For example, there are 20⁵ possible amino acid substrings of length five. Every
one of these substrings exists in some known protein sequence already in the databases.
Therefore, when implementing a suffix tree for the protein database, one can replace the
first five levels of the tree with a five-dimensional array (indexed by substrings of length
five), where an entry of the array points to the place in the remaining tree that extends the
five-tuple. The same idea has been applied [320] to depth seven for DNA data. Nodes in
the suffix tree toward the leaves tend to have few children, and lists there are attractive. At
the extreme, if w is a leaf and v is its parent, then information about w may be brought
up to v, removing the need for explicit representation of the edge (v, w) or the node
w. Depending on the other implementation choices, this can lead to a large savings in
space since roughly half the nodes in a suffix tree are leaves. A suffix tree whose leaves
are deleted in this way is called a position tree. In a position tree, there is a one-to-one
correspondence between leaves of the tree and substrings that are uniquely occurring in S.
For nodes in the middle of a suffix tree, hashing or balanced trees may be the best
choice. Fortunately, most large suffix trees are used in applications where S is fixed (a
dictionary or database) for some time and the suffix tree will be used repeatedly. In those
applications, one has the time and motivation to experiment with different implementation
choices. For a more in-depth look at suffix tree implementation issues, and other suggested
variants of suffix trees, see [23].
Whatever implementation is selected, it is clear that a suffix tree for a string will take
considerably more space than the representation of the string itself.⁵ Later in the book
we will discuss several problems involving two (or more) strings P and T, where two
O(|P| + |T|) time solutions exist, one using a suffix tree for P and one using a suffix tree
for T. We will also have examples where equally time-efficient solutions exist, but where
one uses a generalized suffix tree for two or more strings and the other uses just a suffix tree
for the smaller string. In asymptotic worst-case time and space, neither approach is superior
to the other, and usually the approach that builds the larger tree is conceptually simpler.
However, when space is a serious practical concern (and in many problems, including
⁵ Although we have built suffix trees for DNA and amino acid strings more than one million characters long that can
be completely contained in the main memory of a moderate-size workstation.
14. Suppose one must dynamically maintain a suffix tree for a string that is growing or contracting. Discuss how to do this efficiently if the string is growing (contracting) on the left
end, and how to do it if the string is growing (contracting) on the right end.
Can either Weiner's algorithm or Ukkonen's algorithm efficiently handle both changes to
the right and to the left ends of the string? What would be wrong in reversing the string so
that a change on the left end is "simulated by a change on the right end?
15. Consider the previous problem where the changes are in the interior of the string. If you
cannot find an efficient solution to updating the suffix tree, explain what the technical issues
are and why this seems like a difficult problem.
16. Consider a generalized suffix tree built for a set of k strings. Additional strings may be
added to the set, or entire strings may be deleted from the set. This is the common case
for maintaining a generalized suffix tree for biological sequence data [320]. Discuss the
problem of maintaining the generalized suffix tree in this dynamic setting. Explain why this
problem has a much easier solution than when arbitrary substrings represented in the suffix
tree are deleted.
3. The relationship between the suffix tree for a string S and for the reverse string S' is not
obvious. However, there is a significant relationship between the two trees. Find it, state it,
and prove it.
Hint: Suffix links help.
4. Can Ukkonen's algorithm be implemented in linear time without using suffix links? The idea
is to maintain, for each index i, a pointer to the node in the current implicit suffix tree that
is closest to the end of suffix i.
5. In Trick 3 of Ukkonen's algorithm, the symbol "e" is used as the second index on the label
of every leaf edge, and in phase i + 1 the global variable e is set to i + 1. An alternative
to using "e" is to set the second index on any leaf edge to m (the total length of S) at the
point that the leaf edge is created. In that way, no work is required to update that second
index. Explain in detail why this is correct, and discuss any disadvantages there may be in
this approach, compared to using the symbol "e".
6. Ukkonen's algorithm builds all the implicit suffix trees I_1 through I_m in order and on-line,
all in O(m) time. Thus it can be called a linear-time on-line algorithm to construct implicit
suffix trees.
(Open question) Find an on-line algorithm running in O(m) total time that creates all the
true suffix trees. Since the time taken to explicitly store these trees is O(m²), such an
algorithm would (like Ukkonen's algorithm) update each tree without saving it.
7. Ukkonen's algorithm builds all the implicit suffix trees in O(m) time. This sequence of implicit
suffix trees may expose more information about S than does the single final suffix tree for
S. Find a problem that can be solved more efficiently with the sequence of implicit suffix
trees than with the single suffix tree. Note that the algorithm cannot save the implicit suffix
trees and hence the problem will have to be solved in parallel with the construction of the
implicit suffix trees.
8. The naive Weiner algorithm for constructing the suffix tree of S (Section 6.2.1) can be
described in terms of the Aho-Corasick algorithm of Section 3.4: Given string S of length
m, append $ and let P be the set of patterns consisting of the m + 1 suffixes of string
S$. Then build a keyword tree for set P using the Aho-Corasick algorithm. Removing
the backlinks gives the suffix tree for S. The time for this construction is O(m²). Yet, in our
discussion of Aho-Corasick, that method was considered as a linear-time method. Resolve
this apparent contradiction.
9. Make explicit the relationship between link pointers in Weiner's algorithm and suffix links
in Ukkonen's algorithm.
10. The time analyses of Ukkonen's algorithm and of Weiner's algorithm both rely on watching
how the current node-depth changes, and the arguments are almost perfectly symmetric.
Examine these two algorithms and arguments closely to make explicit the similarities and
differences in the analysis. Is there some higher-level analysis that might establish the time
bounds of both the algorithms at once?
11. Empirically evaluate different implementation choices for representing the branches out
of the nodes and the vectors needed in Weiner's algorithm. Pay particular attention to
the effect of alphabet size and string length, and consider both time and space issues in
building the suffix tree and in using it afterwards.
12. By using implementation tricks similar to those used in Ukkonen's algorithm (particularly,
suffix links and skip/count) give a linear-time implementation for McCreight's algorithm.
13. Flesh out the relationship between McCreight's algorithm and Ukkonen's algorithm, when
they both are implemented in linear time.
search can be done in O(m) time whenever a text T is specified. Can suffix trees be used
in this scenario to achieve the same time bounds? Although it is not obvious, the answer is
"yes". This reverse use of suffix trees will be discussed along with a more general problem
in Section 7.8. Thus for the exact matching problem (single pattern), suffix trees can be
used to achieve the same time and space bounds as Knuth-Morris-Pratt and Boyer-Moore
when the pattern is known first or when the pattern and text are known together, but they
achieve vastly superior performance in the important case that the text is known first and
held fixed, while the patterns vary.
7.2. APL2: Suffix trees and the exact set matching problem
Section 3.4 discussed the exact set matching problem, the problem of finding all occurrences
from a set of strings P in a text T, where the set is input all at once. There we
developed a linear-time solution due to Aho and Corasick. Recall that set P is of total
length n and that text T is of length m. The Aho-Corasick method finds all occurrences
in T of any pattern from P in O(n + m + k) time, where k is the number of occurrences.
This same time bound is easily achieved using a suffix tree 𝒯 for T. In fact, we saw in
the previous section that when T is first known and fixed and the pattern P varies, all
occurrences of any specific P (of length n) in T can be found in O(n + k_P) time, where
k_P is the number of occurrences of P. Thus the exact set matching problem is actually a
simpler case because the set P is input at the same time the text is known. To solve it, we
build suffix tree 𝒯 for T in O(m) time and then use this tree to successively search for all
occurrences of each pattern in P. The total time needed in this approach is O(n + m + k).
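The search phase itself is just the path-following described earlier. A minimal Python sketch, assuming a suffix tree whose nodes carry children keyed by first character, index-pair edge labels, and leaf numbers (all hypothetical field names):

    def match_path(T, root, P):
        # Follow the unique path for P from the root; return the node at or
        # below which P ends, or None if P does not occur in T.
        node, d = root, 0
        while d < len(P):
            child = node.children.get(P[d])
            if child is None:
                return None
            for c in T[child.edge_start:child.edge_end + 1]:
                if d == len(P):
                    break
                if c != P[d]:
                    return None
                d += 1
            node = child
        return node

    def leaves_below(node):
        # leaf numbers in the subtree = starting positions of the occurrences
        if not node.children:
            return [node.leaf_number]
        out = []
        for child in node.children.values():
            out.extend(leaves_below(child))
        return out

    def occurrences(T, root, P):
        end = match_path(T, root, P)
        return [] if end is None else leaves_below(end)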
7
First Applications of Suffix Trees
We will see many applications of suffix trees throughout the book. Most of these
applications allow surprisingly efficient, linear-time solutions to complex string problems. Some of the most impressive applications need an additional tool, the constant-time
lowest common ancestor algorithm, and so are deferred until that algorithm has been discussed (in Chapter 8). Other applications arise in the context of specific problems that
will be discussed in detail later. But there are many applications we can now discuss that
illustrate the power and utility of suffix trees. In this chapter and in the exercises at its end,
several of these applications will be explored.
Perhaps the best way to appreciate the power of suffix trees is for the reader to spend
some time trying to solve the problems discussed below, without using suffix trees. Without
this effort or without some historical perspective, the availability of suffix trees may
make certain of the problems appear trivial, even though linear-time algorithms for those
problems were unknown before the advent of suffix trees. The longest common substring
problem discussed in Section 7.4 is one clear example, where Knuth had conjectured that
a linear-time algorithm would not be possible [24, 2781, but where such an algorithm is
immediate with the use of suffix trees. Another classic example is the longestprefi repeat
problem discussed in the exercises, where a linear-time solution using suffix trees is easy,
but where the best prior method ran in O ( n log n ) time.
can find all strings in the database containing S as a substring. This takes O(n + k) time,
where k is the number of occurrences of the substring. As expected, this is achieved by
traversing the subtree below the end of the matched path for S. If the full string S cannot
be matched against a path in 𝒯, then S is not in the database, and neither is it contained
in any string there. However, the matched path does specify the longest prefix of S that is
contained as a substring in the database.
The substring problem is one of the classic applications of suffix trees. The results
obtained using a suffix tree are dramatic and not achieved using the Knuth-Morris-Pratt,
Boyer-Moore, or even the Aho-Corasick algorithm.
Theorem 7.4.1. The longest common substring of two strings can be found in linear time
using a generalized suffix tree.
Although the longest common substring problem looks trivial now, given our knowledge
of suffix trees, it is very interesting to note that in 1970 Don Knuth conjectured that a
linear-time algorithm for this problem would be impossible [24, 278]. We will return to
this problem in Section 7.9, giving a more space-efficient solution.
Now recall the problem of identifying human remains mentioned in Section 7.3. That
problem reduced to finding the longest substring in one fixed string that is also in some
string in a database of strings. A solution to that problem is an immediate extension of the
longest common substring problem and is left to the reader.
and space bounds as for the Aho-Corasick method - O(n) for preprocessing and O(m) for
search. This is the reverse of the bounds shown above for suffix trees. The time/space trade-off
remains, but a suffix tree can be used for either of the chosen time/space combinations,
whereas no such choice is available for a keyword tree.
given length l. These substrings are candidates for unwanted pieces of S2 that have
contaminated the desired DNA string.
This problem can easily be solved in linear time by extending the approach discussed
above for the longest common substring of two strings. Build a generalized suffix tree
for S1 and S2. Then mark each internal node that has in its subtree a leaf representing a
suffix of S1 and also a leaf representing a suffix of S2. Finally, report all marked nodes that
have string-depth of l or greater. If v is such a marked node, then the path-label of v is a
suspicious string that may be contaminating the desired DNA string. If there are no marked
nodes with string-depth above the threshold l, then one can have greater confidence (but
not certainty) that the DNA has not been contaminated by the known contaminants.
More generally, one has an entire set of known DNA strings that might contaminate
a desired DNA string. The problem now is to determine if the DNA string in hand has
any sufficiently long substrings (say length l or more) from the known set of possible
contaminants. The approach in this case is to build a generalized suffix tree for the set
P of possible contaminants together with S1, and then mark every internal node that has
a leaf in its subtree representing a suffix from S1 and a leaf representing a suffix from a
pattern in P. All marked nodes of string-depth l or more identify suspicious substrings.
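A sketch of the marking step: one depth-first pass returns, for each node, the set of sources seen below it, and reports nodes that mix S1 with a contaminant and are deep enough. The field names (children, string_depth, string_id at leaves) are assumptions about the tree representation:

    def mark(node, l, suspicious):
        # Returns the set of string identifiers appearing in node's subtree;
        # appends to `suspicious` every node whose subtree contains suffixes
        # of at least two sources and whose string-depth is at least l.
        if not node.children:            # leaf (distinct terminators ensure one id)
            return {node.string_id}
        ids = set()
        for child in node.children.values():
            ids |= mark(child, l, suspicious)
        if len(ids) >= 2 and node.string_depth >= l:
            suspicious.append(node)      # path-label of node is a candidate contaminant
        return ids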
Generalized suffix trees can be built in time proportional to the total length of the strings
in the tree, and all the other marking and searching tasks described above can be performed
in linear time by standard tree traversal methods. Hence suffix trees can be used to solve
the contamination problem in linear time. In contrast, it is not clear if the Aho-Corasick
algorithm can solve the problem in linear time, since that algorithm is designed to search
for occurrences of full patterns from P in S1, rather than for substrings of patterns.
As in the longest common substring problem, there is a more space-efficient solution to
the contamination problem, based on the material in Section 7.8. We leave this to the reader.
by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA
in a host organism, or the contamination is from the DNA of the host itself (for example
bacteria or yeast). Contamination can also come from very small amounts of undesired
foreign DNA that gets physically mixed into the desired DNA and then amplified by
PCR (the polymerase chain reaction) used to make copies of the desired DNA. Without
going into these and other specific ways that contamination occurs, we refer to the general
phenomenon as DNA contamination.
DNA contamination is an extremely serious problem, and there have been embarrassing
occurrences of large-scale DNA sequencing efforts where the use of highly contaminated
clone libraries resulted in a huge amount of wasted sequencing. Similarly, the announcement
a few years ago that DNA had been successfully extracted from dinosaur bone is
now viewed as premature at best. The "extracted" DNA sequences were shown, through
DNA database searching, to be more similar to mammal DNA (particularly human) [2]
than to bird and crocodilian DNA, suggesting that much of the DNA in hand was from
human contamination and not from dinosaurs. Dr. S. Blair Hedges, one of the critics of
the dinosaur claims, stated: "In looking for dinosaur DNA we all sometimes find material
that at first looks like dinosaur genes but later turns out to be human contamination, so we
move on to other things. But this one was published." [80]
These embarrassments might have been avoided if the sequences were examined early
for signs of likely contaminants, before large-scale analysis was performed or results
published. Russell Doolittle [129] writes "... On a less happy note, more than a few
studies have been curtailed when a preliminary search of the sequence revealed it to be a
common contaminant ... used in purification. As a rule, then, the experimentalist should
search early and often".
Clearly, it is important to know whether the DNA of interest has been contaminated.
Besides the general issue of the accuracy of the sequence finally obtained, contamination
can greatly complicate the task of shotgun sequence assembly (discussed in Sections 16.14
and 16.15) in which short strings of sequenced DNA are assembled into long strings by
looking for overlapping substrings.
Often, the DNA sequences from many of the possible contaminants are known. These
include cloning vectors, PCR primers, the complete genomic sequence of the host organism
(yeast, for example), and other DNA sources being worked with in the laboratory. (The
dinosaur story doesn't quite fit here because there isn't yet a substantial transcript of human
DNA.) A good illustration comes from the study of the nematode C. elegans, one of the
key model organisms of molecular biology. In discussing the need to use YACs (Yeast
Artificial Chromosomes) to sequence the C. elegans genome, the contamination problem
and its potential solution is stated as follows:
The main difficulty is the unavoidable contamination of purified YACs by substantial amounts
of DNA from the yeast host, leading to much wasted time in sequencing and assembling irrelevant yeast sequences. However, this difficulty should be eliminated (using). . . the complete
(yeast) sequence. . . It will then become possible to discard instantly all sequencing reads that
are recognizable as yeast DNA and focus exclusively on C. elegans DNA. [225]
DNA contamination problem Given a string S1 (the newly isolated and sequenced
string of DNA) and a known string S2 (the combined sources of possible contamination),
find all substrings of S2 that occur in S1 and that are longer than some
Figure 7.1: Suffix tree for string xyxaxaxa with suffix links shown.
set of strings, find substrings "common" to a large number of those strings. The word
"common" here means "occurring with equality". A more difficult problem is to find
"similar" substrings in many given strings, where "similar" allows a small number of
differences. Problems of this type will be discussed in Part III.
Definition For each k between 2 and K, we define l(k) to be the length of the longest
substring common to at least k of the strings.
We want to compute a table of K - 1 entries, where entry k gives l(k) and also points
to one of the common substrings of that length. For example, consider the set of strings
{sandollar, sandlot, handler, grand, pantry}. Then the l(k) values (without pointers to the
strings) are:
k    l(k)    one substring
2    4       sand
3    3       and
4    3       and
5    2       an
Surprisingly, the problem can be solved in linear, O(n), time [236]. It really is amazing
that so much information about the contents and substructure of the strings can be extracted
in time proportional to the time needed just to read in the strings. The linear-time algorithm
will be fully discussed in Chapter 9, after the constant-time lowest common ancestor
method has been discussed.
To prepare for the O(n) result, we show here how to solve the problem in O(Kn)
time. That time bound is also nontrivial but is achieved by a generalization of the longest
common substring method for two strings. First, build a generalized suffix tree 𝒯 for the
K strings. Each leaf of the tree represents a suffix from one of the K strings and is marked
with one of K unique string identifiers, 1 to K, to indicate which string the suffix is from.
Each of the K strings is given a distinct termination symbol, so that identical suffixes
appearing in more than one string end at distinct leaves in the generalized suffix tree.
Hence, each leaf in 𝒯 has only one string identifier.
By the same reasoning, if there is a path of suffix links from p to q going through a node
v, then the number of leaves in the subtree of v must be at least as large as the number in
the subtree of p and no larger than the number in the subtree of q. It follows that if p and q
have the same number of leaves in their subtrees, then all the subtrees below nodes on the
path have the same number of leaves, and all these subtrees are isomorphic to each other.
For the converse side, suppose that the subtrees of p and q are isomorphic. Clearly then they have the same number of leaves. We will show that there is a directed path of suffix links between p and q. Let α be the path-label of p and β be the path-label of q, and assume that |β| ≤ |α|.
Since β ≠ α, if β is a suffix of α it must be a proper suffix. And, if β is a proper suffix of α, then by the properties of suffix links, there is a directed path of suffix links from p to q, and the theorem would be proved. So we will prove, by contradiction, that β must be a suffix of α.
Suppose β is not a suffix of α. Consider any occurrence of α in T and let γ be the suffix of T just to the right of that occurrence of α. That means that αγ is a suffix of T and there is a path labeled γ running from node p to a leaf in the suffix tree. Now since β is not a suffix of α, no suffix of T that starts just after an occurrence of β can have length |γ|, and therefore there is no path of length |γ| from q to a leaf. But that implies that the subtrees rooted at p and at q are not isomorphic, which is a contradiction.
Definition Let Q be the set of all pairs (p, q) such that a) there exists a suffix link from p to q in T, and b) p and q have the same number of leaves in their respective subtrees.
begin
Identify the set Q of pairs (p, q) such that there is a suffix link from p to q and the number of leaves in their respective subtrees is equal.
While there is a pair (p, q) in Q and both p and q are in the current DAG,
Merge node p into q.
end.
The "correctness" of the resulting DAG is stated formally in the following theorem.
Theorem 7.7.2. Let T be the suffix tree for an input string S, and let D be the DAG resulting from running the compaction algorithm on T. Any directed path in D from the root enumerates a substring of S, and every substring of S is enumerated by some such path. Therefore, the problem of determining whether a string is a substring of S can be solved in linear time using D instead of T.
DAG D can be used to determine whether a pattern occurs in a text, but the graph
seems to lose the location(s) where the pattern begins. It is possible, however, to add
simple (linear-space) information to the graph so that the locations of all the occurrences
can also be recovered when the graph is traversed. We address this issue in Exercise 10.
It may be surprising that, in the algorithm, pairs are merged in arbitrary order. We leave
the correctness of this, a necessary part of the proof of Theorem 7.7.2, as an exercise. As
a practical matter it makes sense to merge top-down, never merging two nodes that have
ancestors in the suffix tree that can be merged.
Figure 7.2: A directed acyclic graph used to recognize substrings of xyxaxaxa.
The DAG can be used to solve the exact matching problem in the same way a suffix tree is used. The algorithm matches characters of the pattern against a unique path from the root of the graph; the pattern occurs somewhere in the text if and only if all the characters of the pattern are matched along the path. However, the leaf numbers reachable from the end of the path may no longer give the exact starting positions of the occurrences. This issue will be addressed in Exercise 10.
Since the graph is a DAG after the first merge, the algorithm must know how to merge
nodes in a DAG as well as in a tree. The general merge operation for both trees and DAGs
is stated in the following way:
A merge of node p into node q means that all edges out of p are removed, that the
edges into p are directed to q but have their original respective edge-labels, and that
any part of the graph that is now unreachable from the root is removed.
Although the merges generally occur in a DAG, the criteria used to determine which nodes to merge remain tied to the original suffix tree - node p can be merged into q if the edge-labeled subtree of p is isomorphic to the edge-labeled subtree of q in the suffix tree. Moreover, p can be merged into q, or q into p, only if the two subtrees are isomorphic.
So the key algorithmic issue is how to find isomorphic subtrees in the suffix tree. There
are general algorithms for subtree isomorphism but suffix trees have additional structure
making isomorphism detection much simpler.
Theorem 7.7.1. In a suffix tree T the edge-labeled subtree below a node p is isomorphic to the subtree below a node q if and only if there is a directed path of suffix links from one node to the other node, and the number of leaves in the two subtrees is equal.
PROOF
First suppose p has a direct suffix link to q and those two nodes have the same number of leaves in their subtrees. Since there is a suffix link from p to q, node p has path-label xα while q has path-label α. For every leaf numbered i in the subtree of p there is a leaf numbered i + 1 in the subtree of q, since the suffix of T starting at i begins with xα only if the suffix of T starting at i + 1 begins with α. Therefore, for every (labeled) path from p to a leaf in its subtree, there is an identical path (with the same labeled edges) from q to a leaf in its subtree. Now the numbers of leaves in the subtrees of p and q are assumed to be equal, so every path out of q is identical to some path out of p, and hence the two subtrees are isomorphic.
Thus the problem of finding the matching statistics is a generalization of the exact matching
problem.
Matching statistics can be used to reduce the size of the suffix tree needed in solutions to
problems more complex than exact matching. This use of matching statistics will probably
be more important than their use to duplicate the preprocessing/search bounds of Knuth-Morris-Pratt and Aho-Corasick. The first example of space reduction using matching
statistics will be given in Section 7.9.
Matching statistics are also used in a variety of other applications described in the
book. One advertisement we give here is to say that matching statistics are central to a fast
approximate matching method designed for rapid database searching. This will be detailed
in Section 12.3.3. Thus matching statistics provide one bridge between exact matching
methods and problems of approximate string matching.
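Recall that ms(i) denotes the length of the longest substring of T starting at position i that matches a substring somewhere in P. A deliberately naive sketch makes the definition concrete (0-based indices, quadratic time and space; Theorem 7.8.1 below obtains all m values in O(m) time from a suffix tree for P):

def matching_statistics(T, P):
    # ms[i]: longest prefix of T[i:] occurring anywhere in P. Greedy
    # extension is valid because every prefix of a substring of P is
    # itself a substring of P.
    subs = {P[i:j] for i in range(len(P)) for j in range(i, len(P) + 1)}
    ms = []
    for i in range(len(T)):
        l = 0
        while i + l < len(T) and T[i:i + l + 1] in subs:
            l += 1
        ms.append(l)
    return ms

print(matching_statistics("abxabcab", "abc"))  # [2, 1, 0, 3, 2, 1, 2, 1]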
7.8. APL8: A reverse role for suffix trees, and major space reduction
We have previously shown how suffix trees can be used to solve the exact matching problem
with O(m) preprocessing time and space (building a suffix tree of size O(m) for the text
T ) and O(n + k) search time (where n is the length of the pattern and k is the number of
occurrences). We have also seen how suffix trees are used to solve the exact set matching
problem in the same time and space bounds (n is now the total size of all the patterns
in the set). In contrast, the Knuth-Morris-Pratt (or Boyer-Moore) method preprocesses
the pattern in O(n) time and space, and then searches in O(m) time. The Aho-Corasick
method achieves similar bounds for the set matching problem.
Asymptotically, the suffix tree methods that preprocess the text are as efficient as the methods that preprocess the pattern - both run in O(n + m) time and use O(n + m)
space (they have to represent the strings). However, the practical constants on the time
and space bounds for suffix trees often make their use unattractive compared to the other
methods. Moreover, the situation sometimes arises that the pattern(s) will be given first
and held fixed while the text varies. In those cases it is clearly superior to preprocess the
pattern(s). So the question arises of whether we can solve those problems by building a
suffix tree for the pattern(s), not the text. This is the reverse of the normal use of suffix
trees. In Sections 5.3 and 7.2.1 we mentioned that such a reverse role was possible, thereby
using suffix trees to achieve exactly the same time and space bounds (preprocessing versus
search time and space) as in the Knuth-Morris-Pratt or A h a o r a s i c k methods. To explain
this, we will develop a result due to Chang and Lawler [94], who solved a somewhat more
general problem, called the matching statistics problem.
node v with the leaf number of one of the leaves in its subtree. This takes time linear in the size of T. Then, when using T to find each ms(i), if the search stops at a node u, the desired p(i) is the suffix number written at u; otherwise (when the search stops on an edge (u, v)), p(i) is the suffix number written at node v.
Back to STSs
Recall the discussion of STSs in Section 3.5.1. There it was mentioned that, because of
errors, exact matching may not be an appropriate way to find STSs in new sequences.
But since the number of sequencing errors is generally small, we can expect long regions
of agreement between a new DNA sequence and any STS it (ideally) contains. Those
regions of agreement should allow the correct identification of the STSs it contains. Using
a (precomputed) generalized suffix tree for the STSs (which play the role of P), compute
matching statistics for the new DNA sequence (which is T) and the set of STSs. Generally,
the pointer p(i) will point to the appropriate STS in the suffix tree. We leave it to the reader
to flesh out the details. Note that when given a new sequence, the time for the computation
is just proportional to the length of the new sequence.
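A hedged sketch of that idea using the matching_statistics function above; the dictionary of named STSs and the agreement threshold are illustrative assumptions, and unlike the text's single generalized suffix tree (which makes the whole computation linear in the length of the new sequence), this loop recomputes statistics once per STS.

def find_sts(new_seq, sts_by_name, min_agree=30):
    # sts_by_name: hypothetical mapping from an STS name to its sequence.
    # Report each STS with a long region of agreement with new_seq; long
    # agreement tolerates the scattered sequencing errors that defeat
    # exact matching.
    hits = []
    for name, sts in sts_by_name.items():
        ms = matching_statistics(new_seq, sts)
        if ms and max(ms) >= min_agree:
            hits.append((name, max(ms)))
    return hits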
Definition Given two strings Si and Sj, any suffix of Si that matches a prefix of Sj is called a suffix-prefix match of Si, Sj.
Given a collection of strings S = S1, S2, ..., Sk of total length m, the all-pairs suffix-prefix problem is the problem of finding, for each ordered pair Si, Sj in S, the longest suffix-prefix match of Si, Sj.
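A direct quadratic check pins down the definition (the example strings are ours); the suffix-tree algorithm developed below finds all k² answers in O(m + k²) total time.

def longest_suffix_prefix(Si, Sj):
    # Longest suffix of Si that equals a prefix of Sj, by direct comparison.
    for l in range(min(len(Si), len(Sj)), 0, -1):
        if Si[-l:] == Sj[:l]:
            return l
    return 0

S = ["aab", "abb", "bba"]
print({(a, b): longest_suffix_prefix(a, b) for a in S for b in S if a != b})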
no further matches are possible. In either case, ms(i + 1) is the string-depth of the ending position. Note that the character comparisons done after reaching the end of the β path begin either with the same character in T that ended the search for ms(i) or with the next character in T, depending on whether that search ended with a mismatch or at a leaf.
There is one special case that can arise in computing ms(i + 1). If ms(i) = 1 or ms(i) = 0 (so that the algorithm is at the root), and T(i + 1) is not in P, then ms(i + 1) = 0.
Theorem 7.8.1. Using only a suffix tree for P and a copy of T, all the m matching statistics can be found in O(m) time.
PROOF
The search for any ms(i + 1) begins by backing up at most one edge from position b to a node v and traversing one suffix link to node s(v). From s(v) a β path is traversed in time proportional to the number of nodes on it, and then a certain number of additional character comparisons are done. The backup and link traversals take constant time per i and so take O(m) time over the entire algorithm. To bound the total time to traverse the various β paths, recall the notion of current node-depth from the time analysis of Ukkonen's algorithm (page 102). There it was proved that a link traversal reduces the current depth by at most one (Lemma 6.1.2), and since each backup reduces the current depth by one, the total decrements to current depth cannot exceed 2m. But since current depth cannot exceed m or become negative, the total increments to current depth are bounded by 3m. Therefore, the total time used for all the β traversals is at most 3m since the current depth is increased at each step of any β traversal. It only remains to consider the total time used in all the character comparisons done in the "after-β" traversals. The key there is that the after-β character comparisons needed to compute ms(i + 1), for i ≥ 1, begin with the character in T that ended the computation for ms(i) or with the next character in T. Hence the after-β comparisons performed when computing ms(i) and ms(i + 1) share at most one character in common. It follows that at most 2m comparisons in total are performed during all the after-β comparisons. That takes care of all the work done in finding all the matching statistics, and the theorem is proved.
Definition We call an edge a terminal edge if it is labeled only with a string termination symbol. Clearly, every terminal edge has a leaf at one end, but not all edges touching leaves are terminal edges.
The main data structure used to solve the all-pairs suffix-prefix problem is the generalized suffix tree T(S) for the k strings in set S. As T(S) is constructed, the algorithm also builds a list L(v) for each internal node v. List L(v) contains the index i if and only if v is incident with a terminal edge whose leaf is labeled by a suffix of string Si. That is, L(v) holds index i if and only if the path label to v is a complete suffix of string Si. For example, consider the generalized suffix tree shown in Figure 6.11 (page 117). The node with path-label ba has an L list consisting of the single index 2, the node with path-label a has a list consisting of indices 1 and 2, and the node with path-label xa has a list consisting of index 1. All the other lists in this example are empty. Clearly, the lists can be constructed in linear time during (or after) the construction of T(S).
Now consider a fixed string Sj, and focus on the path from the root of T(S) to the leaf j representing the entire string Sj. The key observation is the following: If v is a node on this path and i is in L(v), then the path-label of v is a suffix of Si that matches a prefix of Sj. So for each index i, the deepest node v on the path to leaf j such that i ∈ L(v) identifies the longest match between a suffix of Si and a prefix of Sj. The path-label of v is the longest suffix-prefix match of (Si, Sj). It is easy to see that by one traversal from the root to leaf j we can find the deepest nodes for all 1 ≤ i ≤ k (i ≠ j).
Following the above observation, the algorithm efficiently collects the needed suffix-prefix matches by traversing T(S) in a depth-first manner. As it does, it maintains k stacks, one for each string. During the depth-first traversal, when a node v is reached in a forward edge traversal, push v onto the ith stack, for each i ∈ L(v). When a leaf j (representing the entire string Sj) is reached, scan the k stacks and record for each index i the current top of the ith stack. It is not difficult to see that the top of stack i contains the node v that defines the suffix-prefix match of (Si, Sj). If the ith stack is empty, then there is no overlap between a suffix of string Si and a prefix of string Sj. When the depth-first traversal backs up past a node v, we pop the top of any stack whose index is in L(v).
Theorem 7.10.1. All the k² longest suffix-prefix matches are found in O(m + k²) time by the algorithm. Since m is the size of the input and k² is the size of the output, the algorithm is time optimal.
PROOF
The total number of indices in all the lists L(v) is O(m). The number of edges in T(S) is also O(m). Each push or pop of a stack is associated with a leaf of T(S), and each leaf is associated with at most one pop and one push; hence traversing T(S) and updating the stacks takes O(m) time. Recording of each of the O(k²) answers is done in constant time per answer.
Extensions
We note two extensions. Let k' ≤ k² be the number of ordered pairs of strings that have a nonzero length suffix-prefix match. By using double links, we can maintain a linked list of
In the following discussion of repetitive structures in DNA and protein, we divide the
structures into three types: local, small-scale repeated strings whose function or origin is
at least partially understood; simple repeats, both local and interspersed, whose function
is less clear; and more complex interspersed repeated strings whose function is even more
in doubt.
Definition A complemented palindrome is a DNA or RNA string that becomes a palindrome if each character in one half of the string is changed to its complement character (in DNA, A-T are complements and C-G are complements; in RNA, A-U and C-G are complements). For example, AGCTCGCGAGCT is a complemented palindrome.¹
Small-scale local repeats whose function or origin is partially understood include: complemented palindromes in both DNA and RNA, which act to regulate DNA transcription
(the two parts of the complemented palindrome fold and pair to form a "hairpin loop");
nested complemented palindromes in tRNA (transfer RNA) that allow the molecule to
fold up into a cloverleaf structure by complementary base pairing; tandem arrays of repeated RNA that flank retroviruses (viruses whose primary genetic material is RNA) and
facilitate the incorporation of viral DNA (produced from the RNA sequence by reverse
transcription) into the host's DNA; single copy inverted repeats that flank transposable
(movable) DNA in various organisms and that facilitate that movement or the inversion
of the DNA orientation; short repeated substrings (both palindromic and nonpalindromic) in DNA that may help the chromosome fold into a more compact structure; repeated substrings at the ends of viral DNA (in a linear state) that allow the concatenation of many copies of the viral DNA (a molecule of this type is called a concatemer); copies of genes
that code for important RNAs (rRNAs and tRNAs) that must be produced in large number;
clustered genes that code for important proteins (such as histone) that regulate chromosome structure and must be made in large number; families of genes that code for similar
proteins (hemoglobins and myoglobins for example); similar genes that probably arose
through duplication and subsequent mutation (including pseudogenes that have mutated
¹ The use of the word "palindrome" in molecular biology does not conform to the normal English dictionary definition of the word. The easiest translation of the molecular biologist's "palindrome" to normal English is: "complemented palindrome". A more molecular view is that a palindrome is a segment of double-stranded DNA or RNA such that both strands read the same when both are read in the same direction, say in the 5' to 3' direction. Alternately, a palindrome is a segment of double-stranded DNA that is symmetric (with respect to reflection) around both the horizontal axis and the midpoint of the segment. (See Figure 7.3.) Since the two strands are complementary, each strand defines a complemented palindrome in the sense defined above. The term "mirror repeat" is sometimes used in the molecular biology literature to refer to a "palindrome" as defined by the dictionary.
the nonempty stacks. Then when a leaf of the tree is reached during the traversal, only the stacks on this list need be examined. In that way, all nonzero length suffix-prefix matches can be found in O(m + k') time. Note that the position of the stacks in the linked list will vary, since a stack that goes from empty to nonempty must be linked at one of the ends of the list; hence we must also keep (in the stack) the name of the string associated with that stack.
At the other extreme, suppose we want to collect for every pair not just the longest suffix-prefix match, but all suffix-prefix matches no matter how long they are. We modify the above solution so that when the tops of the stacks are scanned, the entire contents of each scanned stack is read out. If the output size is k*, then the complexity for this solution is O(m + k*).
... reports of various kinds of repeats are too common even to list. [128]
In an analysis of 3.6 million bases of DNA from C. elegans, over 7,000 families of
repetitive sequences were identified [5]. In contrast, prokaryotes (organisms such as bacteria whose DNA is not enclosed in a nucleus) have in total little repetitive DNA, although
they still possess certain highly structured small-scale repeats.
In addition to its sheer quantity, repetitive DNA is striking for the variety of repeated
structures it contains, for the various proposed mechanisms explaining the origin and
maintenance of repeats, and for the biological functions that some of the repeats may play
(see [394] for one aspect of gene duplication). In many texts (for example, [317], [469], and
[315]) on genetics or molecular biology one can find extensive discussions of repetitive
strings and their hypothesized functional and evolutionary role. For an introduction to
repetitive elements in human DNA, see [253] and [255].
¹ It is reported in [192] that a search of the database MEDLINE using the key (repeat OR repetitive) AND (protein OR sequence) turned up over 6,000 papers published in the preceding twenty years.
and account for as much as 5% of the DNA of human and other mammalian genomes.
Alu repeats are substrings of length around 300 nucleotides and occur as nearly (but not
exactly) identical copies widely dispersed throughout the genome. Moreover, the interior
of an Alu string itself consists of repeated substrings of length around 40, and the Alu
sequence is often flanked on either side by tandem repeats of length 7-10. Those right and
left flanking sequences are usually complemented palindromic copies of each other. So
the Alu repeats wonderfully illustrate various kinds of phenomena that occur in repetitive
DNA. For an introduction to Alu repeats see [254].
One of the most fascinating discoveries in molecular genetics is a phenomenon called
genomic (or gametic) imprinting, whereby a particular allele of a gene is expressed only
when it is inherited from one specific parent [48,227,391]. Sometimes the required parent
is the mother and sometimes the father. The allele will be unexpressed, or expressed
differently, if inherited from the "incorrect" parent. This is in contradiction to the classic
Mendelian rule of equivalence - that chromosomes (other than the Y chromosome) have
no memory of the parent they originated from, and that the same allele inherited from either
parent will have the same effect on the child. In mice and humans, sixteen imprinted gene
alleles have been found to date [48]. Five of these require inheritance from the mother,
and the rest from the father. The DNA sequences of these sixteen imprinted genes all share
the common feature that
They contain, or are closely associated with, a region rich in direct repeats. These repeats
range in size from 25 to 120 bp,³ are unique to the respective imprinted regions, but have
no obvious homology to each other or to highly repetitive mammalian sequences. The direct
repeats may be an important feature of gametic imprinting, as they have been found in all
imprinted genes analyzed to date, and are also evolutionarily conserved. [48]
Thus, direct repeats seem to be important in genetic imprinting, but like many other
examples of repetitive DNA, the function and origin of these repeats remains a mystery.
³ A detail not contained in this quote is that the direct (tandem) repeats in the genes studied [48] have a total length of about 1,500 bases.
to the point that they no longer function); common exons of eukaryotic DNA that may
be basic building blocks of many genes; and common functional or structural subunits in
protein (motifs and domains).
Restriction enzyme cutting sites illustrate another type of small-scale, structured, repeating substring of great importance to molecular biology. A restriction enzyme is an
enzyme that recognizes a specific substring in the DNA of both prokaryotes and eukaryotes and cuts (or cleaves) the DNA every place where that pattern occurs (exactly where
it cuts inside the pattern varies with the pattern). There are hundreds of known restriction
enzymes and their use has been absolutely critical in almost all aspects of modern molecular biology and recombinant DNA technology. For example, the surprising discovery that eukaryotic DNA contains introns (DNA substrings that interrupt the DNA of protein
coding regions), for which Nobel prizes were awarded in 1993, was closely coupled with
the discovery and use of restriction enzymes in the late 1970s.
Restriction enzyme cutting sites are interesting examples of repeats because they tend
to be complemented palindromic substrings. For example, the restriction enzyme EcoRI
recognizes the complemented palindrome GAATTC and cuts between the G and the adjoining A (the substring TTC when reversed and complemented is GAA). Other restriction
enzymes recognize separated (or interrupted) complemented palindromes. For example,
restriction enzyme BglI recognizes GCCNNNNNGGC, where N stands for any nucleotide.
The enzyme cuts between the last two Ns. The complemented palindromic structure has
been postulated to allow the two halves of the complemented palindrome (separated or
not) to fold and form complementary pairs. This folding then apparently facilitates either
recognition or cutting by the enzyme. Because of the palindromic structure of restriction enzyme cutting sites, people have scanned DNA databases looking for common
repeats of this form in order to find additional candidates for unknown restriction enzyme
cutting sites.
Simple repeats that are less well understood often arise as tandem arrays (consecutive
repeated strings, also called "direct repeats") of repeated DNA. For example, the string TTAGGG appears at the ends of every human chromosome in arrays containing one to two thousand copies [332]. Some tandem arrays may originate and continue to grow by a postulated mechanism of unequal crossing over in meiosis, although there is serious opposition
to that theory. With unequal crossing over in meiosis, the likelihood that more copies will
be added in a single meiosis increases as the number of existing copies increases. A number of genetic diseases (Fragile X syndrome, Huntington's disease, Kennedy's disease,
myotonic dystrophy, ataxia) are now understood to be caused by increasing numbers of
tandem DNA repeats of a string three bases long. These triplet repeats somehow interfere
with the proper production of particular proteins. Moreover, the number of triples in the
repeat increases with successive generations, which appears to explain why the disease
increases in severity with each generation. Other long tandem arrays consisting of short
strings are very common and are widely distributed in the genomes of mammals. These
repeats are called satellite DNA (further subdivided into micro and mini-satellite DNA),
and their existence has been heavily exploited in genetic mapping and forensics. Highly
dispersed tandem arrays of length-two strings are common. In addition to tri-nucleotide
repeats, other mini-satellite repeats also play a role in human genetic diseases [286].
Repetitive DNA that is interspersed throughout mammalian genomes, and whose function and origin is less clear, is generally divided into SINES (short interspersed nuclear
sequences) and LINES (long interspersed nuclear sequences). The classic example of a
SINE is the Alu family. The Alu repeats occur about 300,000 times in the human genome
The existence of highly repetitive DNA, such as Alus, makes certain kinds of large-scale
DNA sequencing more difficult (see Sections 16.11 and 16.16), but their existence can
also facilitate certain cloning, mapping, and searching efforts. For example, one general
approach to low-resolution physical mapping (finding on a true physical scale where
features of interest are located in the genome) or to finding genes causing diseases involves
inserting pieces of human DNA that may contain a feature of interest into the hamster
genome. This technique is called somatic cell hybridization. Each resulting hybrid-hamster
cell incorporates different parts of the human DNA, and these hybrid cells can be tested
to identify a specific cell containing the human feature of interest. In this cell, one then
has to identify the parts of the hamster's hybrid genome that are human. But what is a
distinguishing feature between human and hamster DNA?
One approach exploits the Alu sequences. Alu sequences specific to human DNA are
so common in the human genome that most fragments of human DNA longer than 20,000
bases will contain an Alu sequence [317]. Therefore, the fragments of human DNA in
the hybrid can be identified by probing the hybrid for fragments of Alu. The same idea
is used to isolate human oncogenes (modified growth-promoting genes that facilitate
certain cancers) from human tumors. Fragments of human DNA from the tumor are first
transferred to mouse cells. Cells that receive the fragment of human DNA containing the
oncogene become transformed and replicate faster than cells that do not. This isolates
the human DNA fragment containing the oncogene from the other human fragments,
but then the human DNA has to be separated from the mouse DNA. The proximity of
the oncogene to an Alu sequence is again used to identify the human part of the hybrid genome [471]. A related technique, again using proximity to Alu sequences, is described in [403].
In a sense, the longest common substring problem and the k-common substring problem (Sections 7.6 and 9.7) also concern repetitive substrings. However, the repeats in those problems occur across distinct strings, rather than inside the same string. That distinction is critical, both in the definition of the problems and for the techniques used to solve them.
Note that being left diverse is a property that propagates upward. If a node v is left diverse, so are all of its ancestors in the tree.
Theorem 7.12.2. The string α labeling the path to a node v of T is a maximal repeat if and only if v is left diverse.
PROOF
Suppose first that v is left diverse. That means there are substrings xα and yα in S, where x and y represent different characters. Let the first substring be followed by character p. If the second substring is followed by any character but p, then α is a maximal repeat and the theorem is proved. So suppose that the two occurrences are xαp and yαp. But since v is a (branching) node there must also be a substring αq in S for some character q that is different from p. If this occurrence of αq is preceded by character x then it participates in a maximal pair with string yαp, and if it is preceded by y then it participates in a maximal pair with xαp. Either way, since this occurrence of αq cannot be preceded by both x and y, α must be part of a maximal pair and hence α must be a maximal repeat.
Conversely, if α is a maximal repeat then it participates in a maximal pair and there must be occurrences of α that have distinct left characters. Hence v must be left diverse.
The key point in Lemma 7.12.1 is that path α must end at a node of T. This leads immediately to the following surprising fact:
Theorem 7.12.1. There can be at most n maximal repeats in any string of length n.
PROOF
Since T has n leaves, and each internal node other than the root must have at least two children, T can have at most n internal nodes. Lemma 7.12.1 then implies the theorem.
Theorem 7.12.1 would be a trivial fact if at most one substring starting at any position i could be part of a maximal pair. But that is not true. For example, in the string S = xabcyiiizabcqabcyrxar considered earlier, both copies of substring abcy participate in maximal pairs, while each copy of abc also participates in maximal pairs.
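The definitions can be checked on small strings with a brute-force sketch (far from the O(n) suffix-tree method of Theorem 7.12.3, and written only to pin down the definitions; the ends of S are treated as unique boundary characters):

def maximal_repeats(S):
    n = len(S)
    occ = {}  # substring -> list of starting positions
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ.setdefault(S[i:j], []).append(i)
    reps = set()
    for alpha, starts in occ.items():
        L = len(alpha)
        for a in starts:
            for b in starts:
                if a < b:
                    # A maximal pair: left characters differ and right
                    # characters differ (a boundary differs from anything).
                    left_ok = a == 0 or S[a - 1] != S[b - 1]
                    right_ok = b + L == n or S[a + L] != S[b + L]
                    if left_ok and right_ok:
                        reps.add(alpha)
    return reps

print(sorted(maximal_repeats("xabcyiiizabcqabcyrxar"), key=len))
# 'abc' and 'abcy' both appear in the output, as claimed above.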
So now we know that to find maximal repeats we only need to consider strings that end at nodes in the suffix tree T. But which specific nodes correspond to maximal repeats?
Definition For each position i in string S, character S(i - 1) is called the left character of i. The left character of a leaf of T is the left character of the suffix position represented by that leaf.
Definition A node v of T is called left diverse if at least two leaves in v's subtree have different left characters. By definition, a leaf cannot be left diverse.
Theorem 7.12.3. All the maximal repeats in S can be found in O(n) time, and a tree representation for them can be constructed from suffix tree T in O(n) time as well.
Lemma 7.12.3. Suppose w is a leaf, and let i be the (single) occurrence of α represented by leaf w. Let x be the left character of leaf w. Then the occurrence of α at position i witnesses the near-supermaximality of α if and only if x is the left character of no other leaf below v.
PROOF
If there is another occurrence of α with a preceding character x, then xα occurs twice and so is either a maximal repeat or is contained in one. In that case, the occurrence of α at i is contained in a maximal repeat.
If there is no other occurrence of α with a preceding x, then xα occurs only once in S. Now let y be the first character on the edge from v to w. Since w is a leaf, αy occurs only once in S. Therefore, the occurrence of α starting at i, which is preceded
That is, the suffix starting at position Pos(1) of T is the lexically smallest suffix, and in general suffix Pos(i) of T is lexically smaller than suffix Pos(i + 1).
As usual, we will affix a terminal symbol $ to the end of S, but now we interpret it to be lexically less than any other character in the alphabet. This is in contrast to its interpretation in the previous section. As an example of a suffix array, if T is mississippi, then the suffix array Pos is 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3. Figure 7.4 lists the eleven suffixes in lexicographic order.
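A naive construction reproduces the example (sorting the suffix start positions by the suffixes themselves costs O(m² log m) in the worst case, whereas the text obtains Pos in O(m) time from the lexical depth-first traversal described below, Theorem 7.14.1):

def suffix_array(T):
    # 1-based starting positions, sorted by the suffixes they begin.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

print(suffix_array("mississippi"))  # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]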
maximal pairs, then the algorithm can be modified to run in O(n) time. If only maximal pairs of a certain minimum length are requested (this would be the typical case in many applications), then the algorithm can be modified to run in O(n + k_min) time, where k_min is the number of maximal pairs of length at least the required minimum. Simply stop the bottom-up traversal at any node whose string-depth falls below that minimum.
In summary, we have the following theorem:
Theorem 7.12.5. All the maximal pairs can be found in O(n + k) time, where k is the number of maximal pairs. If there are only k_min maximal pairs of length above a given threshold, then all those can be found in O(n + k_min) time.
Figure 7.5: The lexical depth-first traversal of the suffix tree visits the leaves in order 5, 2, 6, 3, 4, 1.
For example, the suffix tree for T = tartar is shown in Figure 7.5. The lexical depth-first traversal visits the nodes in the order 5, 2, 6, 3, 4, 1, defining the values of array Pos.
As an implementation detail, if the branches out of each node of the tree are organized
in a sorted linked list (as discussed in Section 6.5, page 116) then the overhead to do a
lexical depth-first search is the same as for any depth-first search. Every time the search
must choose an edge out of a node v to traverse, it simply picks the next edge on v's
linked list.
Theorem 7.14.2. By using binary search on array Pos, all the occurrences of P in T can be found in O(n log m) time.
Of course, the true behavior of the algorithm depends on how many long prefixes of P occur in T. If very few long prefixes of P occur in T then it will rarely happen that a specific lexical comparison actually takes O(n) time and generally the O(n log m) bound is quite pessimistic. In "random" strings (even on large alphabets) this method should run in O(n + log m) expected time. In cases where many long prefixes of P do occur in T, then the method can be improved with the two tricks described in the next two subsections.
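A sketch of the plain binary search (no Lcp acceleration), using the suffix_array function above; each comparison slices out a whole suffix, so this mirrors the O(n log m) behavior rather than improving on it, and the function name is ours.

def sa_search(P, T, pos):
    # Find the block of suffixes in pos that begin with P, then report
    # the (1-based) starting positions in left-to-right text order.
    suffix = lambda k: T[pos[k] - 1:]
    lo, hi = 0, len(pos)
    while lo < hi:  # lower bound: first suffix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(pos)
    while lo < hi:  # upper bound: first suffix beyond those starting with P
        mid = (lo + hi) // 2
        if suffix(mid).startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[k] for k in range(start, lo))

T = "mississippi"
print(sa_search("ssi", T, suffix_array(T)))  # [3, 6]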
i
ippi
issippi
ississippi
mississippi
pi
ppi
sippi
sissippi
ssippi
ssissippi
Figure 7.4: The eleven suffixes of mississippi listed in lexicographic order. The starting positions of those
suffixes define the suffix array Pos.
Notice that the suffix array holds only integers and hence contains no information about the alphabet used in string T. Therefore, the space required by suffix arrays is modest - for a string of length m, the array can be stored in exactly m computer words, assuming a word size of at least log m bits.
When augmented with an additional 2m values (called Lcp values and defined later), the suffix array can be used to find all the occurrences in T of a pattern P in O(n + log₂ m) single-character comparison and bookkeeping operations. Moreover, this bound is independent of the alphabet size. Since for most problems of interest log₂ m is O(n), the substring problem is solved by using suffix arrays as efficiently as by using suffix trees.
Since no two edges out of v have labels beginning with the same character, there is a strict lexical ordering of the edges out of v. This ordering implies that the path from the root of T following the lexically smallest edge out of each encountered node leads to a leaf of T representing the lexically smallest suffix of T. More generally, a depth-first traversal of T that traverses the edges out of each node v in their lexical order will encounter the leaves of T in the lexical order of the suffixes they represent. Suffix array Pos is therefore just the ordered list of suffix numbers encountered at the leaves of T during the lexical depth-first search. The suffix tree for T is constructed in linear time, and the traversal also takes only linear time, so we have the following:
Theorem 7.14.1. The suffix array Pos for a string T of length m can be constructed in O(m) time.
Figure 7.6: Subcase 1 of the super-accelerant. Pattern P is abcdemn, shown vertically running upwards from the first character. The suffixes Pos(L), Pos(M), and Pos(R) are also shown vertically. In this case, Lcp(L, M) > l and l > r. Any starting location of P in T must occur in Pos to the right of M, since P agrees with suffix Pos(M) only up to character l.
If Lcp(L, M) > l, then the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L). Therefore, P agrees with suffix Pos(M) up through character l. In other words, characters l + 1 of suffix Pos(L) and suffix Pos(M) are identical and lexically less than character l + 1 of P (the last fact follows since P is lexically greater than suffix Pos(L)). Hence all (if any) starting locations of P in T must occur to the right of position M in Pos. So in any iteration of the binary search where this case occurs, no examinations of P are needed; L just gets changed to M, and l and r remain unchanged. (See Figure 7.6.)
If Lcp(L, M) < l, then the common prefix of suffix Pos(L) and Pos(M) is smaller than the common prefix of suffix Pos(L) and P. Therefore, P agrees with suffix Pos(M) up through character Lcp(L, M). Character Lcp(L, M) + 1 of P and of suffix Pos(L) are identical and lexically less than character Lcp(L, M) + 1 of suffix Pos(M). Hence all (if any) starting locations of P in T must occur to the left of position M in Pos. So in any iteration of the binary search where this case occurs, no examinations of P are needed; r is changed to Lcp(L, M), l remains unchanged, and R is changed to M.
If Lcp(L, M) = l, then P agrees with suffix Pos(M) up to character l. The algorithm then lexically compares P to suffix Pos(M) starting from position l + 1. In the usual manner, the outcome of that lexical comparison determines which of L or R change, along with the corresponding change of l or r.
Theorem 7.14.3. Using the Lcp values, the search algorithm does at most O(n + log m) comparisons and runs in that time.
PROOF
First, by simple case analysis it is easy to verify that neither l nor r ever decrease during the binary search. Also, every iteration of the binary search either terminates the search, examines no characters of P, or ends after the first mismatch occurs in that iteration.
In the two cases (l = r or Lcp(L, M) = l > r) where the algorithm examines a character during the iteration, the comparisons start with character max(l, r) of P. Suppose there are k characters of P examined in that iteration. Then there are k - 1 matches during the iteration, and at the end of the iteration max(l, r) increases by k - 1 (either l or r is changed to that value). Hence at the start of any iteration, character max(l, r) of P may have already been examined, but the next character in P has not been. That means at most one redundant comparison per iteration is done. Thus no more than log₂ m redundant comparisons are done overall. There are at most n nonredundant comparisons of characters
7.14.4. A super-accelerant
Call an examination of a character in P redundant if that character has been examined before. The goal of the acceleration is to reduce the number of redundant character examinations to at most one per iteration of the binary search - hence O(log m) in all. The desired time bound, O(n + log m), follows immediately. The use of mlr alone does not achieve this goal. Since mlr is the minimum of l and r, whenever l ≠ r, all characters in P from mlr + 1 to the maximum of l and r will have already been examined. Thus any comparisons of those characters will be redundant. What is needed is a way to begin comparisons at the maximum of l and r.
Definition Lcp(i, j) is the length of the longest common prefix of the suffixes specified in positions i and j of Pos. That is, Lcp(i, j) is the length of the longest prefix common to suffix Pos(i) and suffix Pos(j). The term Lcp stands for longest common prefix.
For example, when T = mississippi, suffix Pos(3) is issippi, suffix Pos(4) is ississippi, and so Lcp(3,4) is four (see Figure 7.4).
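Computed directly from the definition (the scheme developed below precomputes the needed values so they can be read off in constant time; this sketch reuses suffix_array from above):

def lcp(T, pos, i, j):
    # Lcp(i, j): longest common prefix of suffix Pos(i) and suffix Pos(j),
    # where i and j are 1-based positions in the suffix array.
    a, b = T[pos[i - 1] - 1:], T[pos[j - 1] - 1:]
    l = 0
    while l < len(a) and l < len(b) and a[l] == b[l]:
        l += 1
    return l

T = "mississippi"
print(lcp(T, suffix_array(T), 3, 4))  # 4, the length of "issi"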
To speed up the search, the algorithm uses Lcp(L, M) and Lcp(M, R) for each triple (L, M, R) that arises during the execution of the binary search. For now, we assume that these values can be obtained in constant time when needed and show how they help the search. Later we will show how to compute the particular Lcp values needed by the binary search during the preprocessing of T.
General case When l ≠ r, let us assume without loss of generality that l > r. Then there are three subcases:
If we assume that the string-depths of the nodes are known (these can be accumulated in linear time), then by the Lemma, the values Lcp(i, i + 1) for i from 1 to m - 1 are easily accumulated in O(m) time. The rest of the Lcp values are easy to accumulate because of the following lemma:
Lemma 7.14.2. For any pair of positions i, j, where j is greater than i + 1, Lcp(i, j) is the smallest value of Lcp(k, k + 1), where k ranges from i to j - 1.
PROOF
Suffix Pos(i) and suffix Pos(j) of T have a common prefix of length Lcp(i, j). By the properties of lexical ordering, for every k between i and j, suffix Pos(k) must also have that common prefix. Therefore, Lcp(k, k + 1) ≥ Lcp(i, j) for every k between i and j - 1.
Now by transitivity, Lcp(i, i + 2) must be at least as large as the minimum of Lcp(i, i + 1) and Lcp(i + 1, i + 2). Extending this observation, Lcp(i, j) must be at least as large as the smallest Lcp(k, k + 1) for k from i to j - 1. Combined with the observation in the first paragraph, the lemma is proved.
Given Lemma 7.14.2, the remaining Lcp values for B can be found by working up from the leaves, setting the Lcp value at any node v to the minimum of the Lcp values of its two children. This clearly takes just O(m) time.
In summary, the O(n + log m)-time string and substring matching algorithm using a suffix array must precompute the 2m - 1 Lcp values associated with the nodes of binary tree B. The leaf values can be accumulated during the linear-time, lexical, depth-first traversal of T used to construct the suffix array. The remaining values are computed from the leaf values in linear time by a bottom-up traversal of B, resulting in the following:
Theorem 7.14.4. All the needed Lcp values can be accumulated in O(m) time, and all occurrences of P in T can be found using a suffix array in O(n + log m) time.
Figure 7.7: Binary tree B representing all the possible search intervals in any execution of binary search in a list of length m = 8.
of P, giving a total bound of n + log m comparisons. All the other work in the algorithm can clearly be done in time proportional to these comparisons.
Essentially, the node labels specify the endpoints (L, R) of all the possible search intervals that could arise in the binary search of an ordered list of length m. Since B is a binary tree with m leaves, B has 2m - 1 nodes in total. So there are only O(m) Lcp values that need be precomputed. It is therefore plausible that those values can be accumulated during the O(m)-time preprocessing of T; but how exactly? In the next lemma we show that the Lcp values at the leaves of B are easy to accumulate during the lexical depth-first traversal of T.
Lemma 7.14.1. In the depth-first traversal of T, consider the internal nodes visited between the visits to leaf Pos(i) and leaf Pos(i + 1), that is, between the ith leaf visited and the next leaf visited. From among those internal nodes, let v denote the one that is closest to the root. Then Lcp(i, i + 1) equals the string-depth of node v.
For example, consider again the suffix tree shown in Figure 7.5 (page 151). Lcp(5,6) is the string-depth of the parent of leaves 4 and 1. That string-depth is 3, since the parent of 4 and 1 is labeled with the string tar. The values of Lcp(i, i + 1) are 2, 0, 1, 0, 3 for i from 1 to 5.
The hardest part of Lemma 7.14.1 involves parsing it. Once done, the proof is immediate
from properties of suffix trees, and it is left to the reader.
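The leaf values for tartar can be reproduced with the suffix_array and lcp sketches above; by Lemma 7.14.2, every other Lcp value is the minimum of these over the corresponding interval.

def adjacent_lcps(T, pos):
    # The m - 1 values Lcp(i, i + 1), the leaves of binary tree B.
    return [lcp(T, pos, i, i + 1) for i in range(1, len(pos))]

T = "tartar"
print(adjacent_lcps(T, suffix_array(T)))  # [2, 0, 1, 0, 3]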
suffix trees to speed up regular expression pattern matching (with errors) is discussed in
Section 12.4.
Yeast Suffix trees are also the central data structure in genome-scale analysis of Saccharomyces cerevisiae (brewer's yeast), done at the Max-Planck Institute [320]. Suffix trees are "particularly suitable for finding substring patterns in sequence databases" [320]. So in that project, highly optimized suffix trees called hashed position trees are used to solve
problems of "clustering sequence data into evolutionary related protein families, structure
prediction, and fragment assembly" [320]. (See Section 16.15 for a discussion of fragment
assembly.)
Borrelia burgdorferi Borrelia burgdorferi is the bacterium causing Lyme disease.
Its genome is about one million bases long, and is currently being sequenced at the
Brookhaven National Laboratory using a directed sequencing approach to fill in gaps after
an initial shotgun sequencing phase (see Section 16.14). Chen and Skiena [100] developed
methods based on suffix trees and suffix arrays to solve the fragment assembly problem
for this project. In fragment assembly, one major bottleneck is overlap detection, which
requires solving a variant of the suffix-prefix matching problem (allowing some errors) for
all pairs of strings in a large set (see Section 16.15.1). The Borrelia work [100] consisted
of 4,612 fragments (strings) totaling 2,032,740 bases. Using suffix trees and suffix arrays,
the needed overlaps were computed in about fifteen minutes. To compare the speed and
accuracy of the suffix tree methods to pure dynamic programming methods for overlap
detection (discussed in Sections 11.6.4 and 16.15.1), Chen and Skiena closely examined cosmid-sized data. The test established that the suffix tree approach gives a 1,000-fold speedup over the (slightly) more accurate dynamic programming approach, finding 99% of the significant overlaps found by using dynamic programming.
Efficiency is critical
In all three projects, the efficiency of building, maintaining, and searching the suffix trees
is extremely important, and the implementation details of Section 6.5 are crucial. However,
because the suffix trees are very large (approaching 20 million characters in the case of the
Arabidopsis project) additional implementation effort is needed, particularly in organizing
the suffix tree on disk, so that the number of disk accesses is reduced. All three projects
have deeply explored that issue and have found somewhat different solutions. See [320],
[100] and [63] for details.
consists of characters from a finite alphabet (representing the known patterns of interest)
alternating with integers giving the distances between such sites. The alphabet is huge
because the range of integers is huge, and since distances are often known with high
precision, the numbers are not rounded off. Moreover, the variety of known patterns of
interest is itself large (see [435]).
It often happens that a DNA substring is obtained and studied without knowing where
that DNA is located in the genome or whether that substring has been previously researched. If both the new and the previously studied DNA are fully sequenced and put in
a database, then the issue of previous work or locations would be solved by exact string
matching. But most DNA substrings that are studied are not fully sequenced - maps are
easier and cheaper to obtain than sequences. Consequently, the following matching problem on maps arises and translates to a matching problem on strings with large alphabets:
Given an established (restriction enzyme) map for a large DNA string and a map
from a smaller string, determine if the smaller string is a substring of the larger one.
Since each map is represented as an alternating string of characters and integers, the
underlying alphabet is huge. This provides one motivation for using suffix arrays for
matching or substring searching in place of suffix trees. Of course, the problems become
more difficult in the presence of errors, when the integers in the strings may not be exact,
or when sites are missing or spuriously added. That problem, called map alignment, is
discussed in Section 16.10.
Arabidopsis thaliana is the "fruit fly" of plant genetics, i.e., the classic model organism in studying the molecular biology of plants. Its genome is about 100 million base pairs.
patterns and m is the size of the text. The more efficient algorithm will increase i by more
than one whenever possible, using rules that are analogous to the bad character and good
suffix rules of Boyer-Moore. Of course, no shift can be greater than the length of the shortest pattern P in P, for such a shift could miss occurrences of P in T.
Figure 7.9: Shift when the bad character rule is applied.
As usual, the algorithm preprocesses the set of patterns and then uses the result of the
preprocessing to accelerate the search. The following exposition interleaves the descriptions of the search method and the preprocessing that supports the search.
Figure 7.11: The shift when the weak good suffix rule is applied. In this figure, pattern P3 determines the
amount of the shift.
from the end of P, then i should be increased by exactly r positions, that is, i should be set to i + r. (See Figure 7.10 and Figure 7.11.)
We will solve the problem of finding i2, if it exists, using a suffix tree obtained by preprocessing set P^r. The key involves using the suffix tree to search for a pattern P^r in the set P^r containing a copy of α^r starting closest to its left end but not occurring as a prefix of P^r. If that occurrence of α^r starts at position z of pattern P^r, then an occurrence of α ends r = z - 1 positions from the end of P.
During the preprocessing phase, build a generalized suffix tree T^r for the set of patterns P^r. Recall that in a generalized suffix tree each leaf is associated with both a pattern P^r ∈ P^r and a number z specifying the starting position of a suffix of P^r.
Definition For each internal node v of T^r, z_v denotes the smallest number z greater than 1 (if any) such that z is a suffix position number written at a leaf in the subtree of v. If no such leaf exists, then z_v is undefined.
With this suffix tree T^r, determine the number z_v for each internal node v. These two preprocessing tasks are easily accomplished in linear time by standard methods and are left to the reader.
As an example of the preprocessing, consider the set P = {wxa, xaqq, qxax} and the generalized suffix tree for P^r shown in Figure 7.12. The first number on each leaf refers to a string in P^r, and the second number refers to a suffix starting position in that string. The number z_v is the first (or only) number written at every internal node (the second number will be introduced later).
We can now describe how T^r is used during the search to determine the value i2, if it exists. After matching α^r along a path in K^r, traverse the path labeled α^r from the root of T^r. That path exists because α is a suffix of some pattern in P (that is what the search in K^r
The preprocessing needed to implement the bad character rule is simple and is left to the reader.
The generalization of the bad character rule to set matching is easy but, unlike the case of a single pattern, use of the bad character rule alone may not be very effective. As the number of patterns grows, the typical size of i1 - i is likely to decrease, particularly if the alphabet is small. This is because some pattern is likely to have character T(j - 1) close to, but left of, the point where the previous matches end. As noted earlier, in some applications in molecular biology the total length of the patterns in P is larger than the size of T, making the bad character rule almost useless. A bad character rule analogous to the simpler, unextended bad character rule for a single pattern would be even less useful. Therefore, in the set matching case, a rule analogous to the good suffix rule is crucial in making a Boyer-Moore approach effective.
The proof is immediate and is left to the reader. Clearly, i3 can be found during the traversal of the α^r path in T^r used to search for i2. If neither i2 nor i3 exist, then i should be increased by the length of the smallest pattern in P.
5. For each node v in T^r, set d'_v equal to the smallest value of d_u over all ancestors u (including v itself) of v.
6. Remove the subtree rooted at any unmarked node (including leaves) of T^r. (Nodes were marked in step 2.)
End.
End.
The above preprocessing tasks are easily accomplished in linear time by standard tree
traversal methods.
determined), so α^r is a prefix of some pattern in P^r. Let v be the first node at or below the end of that path in T^r. If z_v is defined, then i2 can be obtained from it: The leaf defining z_v (i.e., the leaf where z = z_v) is associated with a string P^r ∈ P^r that contains a copy of α^r starting to the right of position one. Over all such occurrences of α^r in the strings of P^r, P^r contains the copy of α^r starting closest to its left end. That means that P contains a copy of α that is not a suffix of P, and over all such occurrences of α, P contains the copy of α ending closest to its right end. P is then the string in P that should be used to set i2. Moreover, α ends in P exactly z_v - 1 characters from the end of P. Hence, as argued above, i should be increased by z_v - 1 positions. In summary, we have
Note that when l_i > 0, the copy of Prior_i starting at s_i is totally contained in S[1..i - 1].
The Ziv-Lempel method uses some of the l_i and s_i values to construct a compressed representation of string S. The basic insight is that if the text S[1..i - 1] has been represented (perhaps in compressed form) and l_i is greater than zero, then the next l_i characters of S (substring Prior_i) need not be explicitly described. Rather, that substring can be described by the pair (s_i, l_i), pointing to an earlier occurrence of the substring. Following this insight, a compression method could process S left to right, outputting the pair (s_i, l_i) in place of the explicit substring S[i..i + l_i - 1] when possible, and outputting the character S(i) when needed. Full details are given in the algorithm below.
Compression algorithm 1
begin
    i := 1
    Repeat
        compute l_i and s_i
        if l_i > 0 then
            begin
                output (s_i, l_i)
                i := i + l_i
            end
        else
            begin
                output S(i)
                i := i + 1
            end
    Until i > n
end.
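A runnable Python rendering of Compression algorithm 1, together with the left-to-right decompression described below. Computing l_i and s_i by naive substring search makes this quadratic; the suffix-tree method described in the text finds all of these values in linear total time. Positions are 1-based, as in the text.

def compress(S):
    out, i, n = [], 1, len(S)
    while i <= n:
        # Prior_i: longest prefix of S[i..n] occurring inside S[1..i-1].
        l_i, s_i = 0, 0
        for l in range(1, n - i + 2):
            j = S[:i - 1].find(S[i - 1:i - 1 + l])
            if j == -1:
                break
            l_i, s_i = l, j + 1
        if l_i > 0:
            out.append((s_i, l_i))
            i += l_i
        else:
            out.append(S[i - 1])
            i += 1
    return out

def decompress(terms):
    # Each pair points to text that has already been reconstructed.
    S = ""
    for t in terms:
        if isinstance(t, tuple):
            s, l = t
            S += S[s - 1:s - 1 + l]
        else:
            S += t
    return S

print(compress("abacabaxabz"))  # ['a', 'b', (1, 1), 'c', (1, 3), 'x', (1, 2), 'z']
assert decompress(compress("abacabaxabz")) == "abacabaxabz"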
For example, S = abacabaxabz can be described as ab(1,1)c(1,3)x(1,2)z. Of course, in this example the number of symbols used to represent S did not decrease, but rather increased! That's typical of small examples. But as the string length increases, providing more opportunity for repeating substrings, the compression improves. Moreover, the algorithm could choose to output character S(i) explicitly whenever l_i is "small" (the actual rule depends on bit-level considerations determined by the size of the alphabet, etc.). For a small example where positive compression is observed, consider the contrived string S = abababababababababababababababab, represented as ab(1,2)(1,4)(1,8)(1,16). That representation uses 24 symbols in place of the original 32 symbols. If we extend this example to contain k repeated copies of ab, then the compressed representation contains approximately 5 log₂ k symbols - a dramatic reduction in space.
To decompress a compressed string, process the compressed string left to right, so
that any pair (s_i, l_i) in the representation points to a substring that has already been fully
decompressed. That is, assume inductively that the first j terms (single characters or
(s, l) pairs) of the compressed string have been processed, yielding characters 1 through
i − 1 of the original string S. The next term in the compressed string is either character
S(i), or it is a pair (s_i, l_i) pointing to a substring of S strictly before i. In either case,
the algorithm has the information needed to decompress the jth term, and since the first
term in the compressed string is the first character of S, we conclude by induction that the
decompression algorithm can obtain the original string S.
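The induction above translates directly into code. Below is a minimal Python sketch of the decompression step; the function name and the list-of-terms input format are ours, chosen to match the output of the compression sketch earlier.

def decompress(terms):
    # terms is a list of single characters and 1-based (s, l) pairs, e.g.
    # ['a', 'b', (1, 1), 'c', (1, 3), 'x', (1, 2), 'z'].
    S = []
    for term in terms:
        if isinstance(term, tuple):
            s, l = term
            # The pair points to a substring that is already fully decompressed.
            S.extend(S[s - 1:s - 1 + l])
        else:
            S.append(term)
    return "".join(S)

assert decompress(['a', 'b', (1, 1), 'c', (1, 3), 'x', (1, 2), 'z']) == "abacabaxabz"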
Figure 7.13: Tree L′ corresponding to tree T′ for the set P = {wxa, xaqq, qxax}.
[Figure 7.14 (not reproduced here) shows the text qxaxtqqpst, with position indices 1 through 10, and the patterns wxa, xaqq, and qxax aligned below it at four successive shift positions.]
Figure 7.14: The first comparisons start at position 3 of T and match ax. The value of z_v is equal to two,
so a shift of one position occurs. String qxax matches; z_v is undefined, but d′_v is defined and equals 4, so
a shift of three is made. The string qq matches, followed by a mismatch; z_v is undefined, but d′_v is defined
to be four, so a shift of three is made, after which no further matches are found and the algorithm halts.
To see how L′ is used during the search, let T be qxaxtqqpst. The shifts of the patterns are
shown in Figure 7.14.
Definition For any position i in a string S of length m, define the substring Prior_i to
be the longest prefix of S[i..m] that also occurs as a substring of S[1..i − 1].
For example, if S = abaxcabaxabz then Prior_7 is bax.
Definition For any position i in S, define l_i as the length of Prior_i. For l_i > 0, define
s_i as the starting position of the left-most copy of Prior_i.
In the above example, l_7 = 3 and s_7 = 2.
¹ Other, more common ways to study the relatedness or similarity of two strings are extensively discussed
in Part III.
6. Discuss the relative advantages of the Aho-Corasick method versus the use of suffix trees
for the exact set matching problem, where the text is fixed and the set of patterns is varied
over time. Consider preprocessing, search time, and space use. Consider both the cases
when the text is larger than the set of patterns and vice versa.
7. In what way does the suffix tree more deeply expose the structure of a string compared
to the Aho-Corasick keyword tree or the preprocessing done for the Knuth-Morris-Pratt or
Boyer-Moore methods? That is, the sp values give some information about a string, but
the suffix tree gives much more information about the structure of the string. Make this
precise. Answer the same question about suffix trees and Z values.
8. Give an algorithm to take in a set of k strings and to find the longest common substring
of each of the (k choose 2) pairs of strings. Assume each string is of length n. Since the longest
common substring of any pair can be found in O(n) time, O(k²n) time clearly suffices. Now
suppose that the string lengths are different but sum to m. Show how to find all the longest
common substrings in time O(km). Now try for O(m + k²) (I don't know how to achieve this
last bound).
9. The problem of finding substrings common to a set of distinct strings was discussed separately from the problem of finding substrings common to a single string, and the first
problem seems much harder to solve than the second. Why can't the first problem just be
reduced to the second by concatenating the strings in the set to form one large string?
10. By modifying the compaction algorithm and adding a little extra (linear space) information
to the resulting DAG, it is possible to use the DAG to determine not only whether a pattern
occurs in the text, but to find all the occurrences of the pattern. We illustrate the idea when
there is only a single merge of nodes p and q. Assume that p has larger string depth than
q and that u is the parent of p before the merge. During the merge, remove the subtree of
p and put a displacement number of −1 on the new u to pq edge. Now suppose we search
for a pattern P in the text and determine that P is in the text. Let i be a leaf below the path
labeled P (i.e., below the termination point of the search). If the search traversed the u to
pq edge, then P occurs starting at position i − 1; otherwise it occurs starting at position i.
Generalize this idea and work out the details for any number of node merges.
11. In some applications it is desirable to know the number of times an input string P occurs
in a larger string S. After the obvious linear-time preprocessing, queries of this sort can
be answered in O(|P|) time using a suffix tree. Show how to preprocess the DAG in linear
time so that these queries can be answered in O(|P|) time using a DAG.
12. Prove the correctness of the compaction algorithm for suffix trees.
13. Let S^r be the reverse of the string S. Is there a relationship between the number of nodes
in the DAG for S and the DAG for S^r? Prove it. Find the relationship between the DAG for
S and the DAG for S^r (this relationship is a bit more direct than for suffix trees).
14. In Theorem 7.7.1 we gave an easily computed condition to determine when two subtrees
of a suffix tree for string S are isomorphic. An alternative condition that is less useful for
efficient computation is as follows: Let α be the substring labeling a node p and β be the
substring labeling a node q in the suffix tree for S. The subtrees of p and q are isomorphic
if and only if the set of positions in S where occurrences of α end equals the set of positions
in S where occurrences of β end.
Prove the correctness of this alternative condition for subtree isomorphism.
15. Does Theorem 7.7.1 still hold for a generalized suffix tree (for more than a single string)?
If not, can it be easily extended to hold?
16. The DAG D for a string S can be converted to a finite-state machine by expanding each edge
with more than one character on its label into a series of edges labeled by one character
Definition For any position i in string S, let ZL(i) denote the length of the longest
substring beginning at i that appears somewhere in the string S[1..i].
Definition Given a DNA string S partitioned into exons and introns, the exon-average
ZL value is the average ZL(i) taken over every position i in the exons of S. Similarly,
the intron-average ZL value is the average ZL(i) taken over positions in introns of S.
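As a concrete (if slow) reading of this definition, the following naive Python sketch computes all ZL(i) values in roughly quadratic time; the suffix-tree technique mentioned next achieves O(n). The function name and 0-based indexing are ours, and we follow the definition as printed, requiring the other occurrence to lie entirely within S[1..i].

def zl_values(S):
    # zl[j] (0-based j) is ZL(j + 1) in the text's 1-based indexing: the length
    # of the longest substring beginning at j that also occurs entirely within
    # the prefix S[1..j+1], i.e., the 0-based slice S[:j + 1].
    n = len(S)
    zl = []
    for j in range(n):
        l = 0
        while j + l < n and S[j:j + l + 1] in S[:j + 1]:
            l += 1
        zl.append(l)
    return zl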
It should be intuitive at this point that the exon-average ZL value and the intron-average
ZL value can be computed in O(n) time, by using suffix trees to compute all the ZL(i)
values. The technique resembles the way matching statistics are computed, but is more
involved since the substring starting at i must also appear to the left of position i.
The main empirical result of [146] is that the exon-average ZL value is lower than the
intron-average ZL value by a statistically significant amount. That result is contrary to the
expectation stated above that biologically significant substrings (exons in this case) should
be more compressible than more random substrings (which introns are believed to be).
Hence, the full biological significance of string compressibility remains an open question.
7.20. Exercises
1. Given a set S of k strings, we want to find every string in S that is a substring of some other
string in S. Assuming that the total length of all the strings is n, give an O(n)-time algorithm
to solve this problem. This result will be needed in algorithms for the shortest superstring
problem (Section 16.17).
2. For a string S of length n, show how to compute the N(i), L(i), L′(i), and sp_i values (discussed in Sections 2.2.4 and 2.3.2) in O(n) time directly from a suffix tree for S.
3. We can define the suffix tree in terms of the keyword tree used in the Aho-Corasick (AC)
algorithm. The input to the AC algorithm is a set of patterns P, and the AC tree is a compact
representation of those patterns. For a single string S we can think of the n suffixes of S
as a set of patterns. Then one can build a suffix tree for S by first constructing the AC tree
for those n patterns, and then compressing, into a single edge, any maximal path through
nodes with only a single child. If we take this approach, what is the relationship between
the failure links used in the keyword tree and the suffix links used in Ukkonen's algorithm?
Why aren't suffix trees built in this way?
4. A suffix tree for a string S can be viewed as a keyword tree, where the strings in the
keyword tree are the suffixes of S. In this way, a suffix tree is useful in efficiently building
a keyword tree when the strings for the tree are only implicitly specified. Now consider the
following implicitly specified set of strings: Given two strings S_1 and S_2, let D be the set of
all substrings of S_1 that are not contained in S_2. Assuming the two strings are of length n,
show how to construct a keyword tree for set D in O(n) time. Next, build a keyword tree for
D together with the set of substrings of S_2 that are not in S_1.
5. Suppose one has built a generalized suffix tree for a string S along with its suffix links (or
link pointers). Show how to efficiently convert the suffix tree into an Aho-Corasick keyword
tree.
23. In Section 7.8 we discussed the reverse use of a suffix tree to solve the exact pattern
matching problem: Find all occurrences of pattern P in text T. The solution there computed the matching statistic ms(i) for each position i in the text. Here is a modification of
that method that solves the exact matching problem but does not compute the matching
statistics: Follow the details of the matching statistic algorithm but never examine new characters in the text unless you are on the path from the root to the leaf labeled 1. That is, in
each iteration, do not proceed below the string αγ in the suffix tree, until you are on the
path that leads to leaf 1. When not on this path, the algorithm just follows suffix links and
performs skip/count operations until it gets back on the desired path.
Prove that this modification correctly solves the exact matching problem in linear time.
What advantages or disadvantages are there to this modified method compared to computing the matching statistics?
24. There is a simple practical improvement to the previous method. Let v be a point on the
path to leaf 1 where some search ended, and let v' be the node on that path that was next
entered by the algorithm (after some number of iterations that visit nodes off that path).
Then, create a direct shortcut link from v to v'. The point is that if any future iteration ends
at v, then the shortcut link can be taken to avoid the longer indirect route back to v'.
Prove that this improvement works (i.e., that the exact matching problem is correctly solved
in this way).
What is the relationship of these shortcut links to the failure function used in the Knuth-Morris-Pratt method? When the suffix tree encodes more than a single pattern, what is the
relationship of these shortcut links to the backpointers used by the Aho-Corasick method?
25. We might modify the previous method even further: In each iteration, only follow the suffix
link (to the end of α) and do not do any skip/count operations or character comparisons
unless you are on the path to leaf 1. At that point, do all the needed skip/count computations
to skip past any part of the text that has already been examined.
Fill in the details of this idea and establish whether it correctly solves the exact matching
problem in linear time.
26. Recall the discussion of STSs in Section 7.8.3, page 135. Show in more detail how matching
statistics can be used to identify any STSs that a string contains, assuming there is a
"modest" number of errors in either the STS strings or the new string.
27. Given a set of k strings of length n each, find the longest common prefix for each pair
of strings. The total time should be O(kn + p), where p is the number of pairs of strings
having a common prefix of length greater than zero. (This can be solved using the lowest
common ancestor algorithm discussed later, but a simpler method is possible.)
28. For any pair of strings, we can compute the length of the longest prefix common to the
pair in time linear in their total length. This is a simple use of a suffix tree. Now suppose
we are given k strings of total length n and want to compute the minimum length of all the
pairwise longest common prefixes over all of the (k choose 2) pairs of strings, that is, the smallest
each. This finite-state machine will recognize substrings of S, but it will not necessarily be
the smallest such finite-state machine. Give an example of this.
We now consider how to build the smallest finite-state machine to recognize substrings of
S. Again start with a suffix tree for S, merge isomorphic subtrees, and then expand each
edge that is labeled with more than a single character. However, the merge operation must
be done more carefully than before. Moreover, we imagine there is a suffix link from each
leaf i to each leaf i + 1, for i < n. Then, there is a path of suffix links connecting all the
leaves, and each leaf has zero leaves beneath it. Hence, all the leaves will get merged.
Recall that Q is the set of all pairs (p, q) such that there exists a suffix link from p to q in
T, where p and q have the same number of leaves in their respective subtrees. Suppose
(p, q) is in Q. Let v be the parent of p, let γ be the label of the edge (v, p) into p, and
let δ be the label of the edge into q. Explain why |γ| ≥ |δ|. Since every edge of the DAG
will ultimately be expanded into a number of edges equal to the length of its edge-label,
we want to make each edge-label as small as possible. Clearly, δ is a suffix of γ, and we
will exploit this fact to better merge edge-labels. During a merge of p into q, remove all
out edges from p as before, but the edge from v is not necessarily directed to q. Rather,
if |δ| > 1, then the δ edge is split into two edges by the introduction of a new node u. The
first of these edges is labeled with the first character of δ and the second one, edge (u, q),
is labeled with the remaining characters of δ. Then the edge from v is directed to u rather
than to q. Edge (v, u) is labeled with the first |γ| − |δ| + 1 characters of γ.
Using this modified merge, clean up the description of the entire compaction process and
prove that the resulting DAG recognizes substrings of S. The finite-state machine for S
is created by expanding each edge of this DAG labeled by more than a single character.
Each node in the DAG is now a state in the finite-state machine.
17. Show that the finite-state machine created above has the fewest number of states of any
finite-state machine that recognizes substrings of S. The key to this proof is that a deterministic finite-state machine has the fewest number of states if no state in it is equivalent
to any other state. Two states are equivalent if, starting from them, exactly the same set of
strings are accepted. See [228].
18. Suppose you already have the Aho-Corasick keyword tree (with backlinks). Can you use it
to compute matching statistics in linear time, or if not, in some "reasonable" nonlinear time
bound? Can it be used to solve the longest common substring problem in a reasonable
time bound? If not, what is the difficulty?
19. In Section 7.16 we discussed how to use a suffix tree to search for all occurrences of a
set of patterns in a given text. If the length of all the patterns is n and the length of the
text is m, that method takes O(n + m) time and O(m) space. Another view of this is that
the solution takes O(m) preprocessing time and O(n) search time. In contrast, the Aho-Corasick method solves the problem in the same total time bound but in O(n) space. Also,
it needs O(n) preprocessing time and O(m) search time.
Because there is no definite relationship between n and m, sometimes one method will
use less space or preprocessing time than the other. By using a generalized suffix tree
for the set of patterns and the reverse role for suffix trees discussed in Section 7.8, it is
possible to solve the problem with a suffix tree, obtaining exactly the same time and space
bounds obtained by the Aho-Corasick method. Show in detail how this is done.
20. Using the reverse role for suffix trees discussed in Section 7.8, show how to solve the
general DNA contamination problem of Section 7.5 using only a suffix tree for S_1, rather
than a generalized suffix tree for S_1 together with all the possible contaminants.
21. In Section 7.8.1 we used a suffix tree for the small string P to compute the matching
statistics ms(i) for each position i in the long text string T. Now suppose we also want to
35. (Smallest k-repeat) Given a string S and a number k, we want to find the smallest substring
of S that occurs in S exactly k times. Show how to solve this problem in linear time.
36. Theorem 7.12.1, which states that there can be at most n maximal repeats in a string of
length n, was established by connecting maximal repeats with suffix trees. It seems there
should be a direct, simple argument to establish this bound. Try to give such an argument.
Recall that it is not true that at most one maximal repeat begins at any position in S.
37. Given two strings S_1 and S_2, we want to find all maximal common pairs of S_1 and S_2. A
common substring C is maximal if the addition to C of any character on either the right or
left of C results in a string that is not common to both S_1 and S_2. For example, if S_1 = aayxpt
and S_2 = aqyxpw then the string yxp is a maximal common substring, whereas yx is not.
A maximal common pair is a triple (p_1, p_2, n′), where p_1 and p_2 are positions in S_1 and
S_2, respectively, and n′ is the length of a maximal common substring starting at those
positions. This is a generalization of the maximal pair in a single string.
Letting m denote the total length of S_1 and S_2, give an O(m + k)-time solution to this
problem, where k is the number of triples output. Give an O(m)-time method just to count
the number of maximal common pairs and an O(m + l)-time algorithm to find one copy
of each maximal common substring, where l is the total length of those strings. This is a
generalization of the maximal repeat problem for a single string.
38. Another, equally efficient, but less concise way to identify supermaximal repeats is as
follows: A maximal repeat in S represented by the left-diverse node v in the suffix tree for
S is a supermaximal repeat if and only if no proper descendant of v is left diverse and no
node in v's subtree (including v) is reachable via a path of suffix links from a left diverse
node other than v. Prove this.
Show how to use the above claim to find all supermaximal repeats in linear time.
39. In biological applications, we are often not only interested in repeated substrings but in
occurrences of substrings where one substring is an inverted copy of the other, a complemented copy, or (almost always) both. Show how to adapt all the definitions and techniques developed for repeats (maximal repeats, maximal pairs, supermaximal repeats,
near-supermaximal repeats, common substrings) to handle inversion and complementation, in the same time bounds.
40. Give a linear-time algorithm that takes in a string S and finds the longest maximal pair in
which the two copies do not overlap. That is, if the two copies begin at positions p_1 < p_2
and are of length n′, then p_1 + n′ < p_2.
41. Techniques for handling repeats in DNA are not only motivated by repetitive structures
that occur in the DNA itself but also by repeats that occur in data collected from the DNA.
The paper by Leung et al. [298] gives one example. In that paper they discuss a problem
of analyzing DNA sequences from E. coli, where the data come from more than 1,000
independently sequenced fragments stored in an E. coli database. Since the sequences
were contributed by independent sequencing efforts, some fragments contained others,
some of the fragments overlapped others, and many intervals of the E. coli genome were yet
unsequenced. Consequently, before the desired analysis was begun, the authors wanted
to "clean up" the data at hand, finding redundantly sequenced regions of the E. coli genome
and packaging all the available sequences into a few contigs, i.e., strings that contain all
the substrings in the database (these contigs may or may not be the shortest possible).
Using the techniques discussed for finding repeats, suffix-prefix overlaps, and so on, how
29. Verify that the all-pairs suffix-prefix matching problem discussed in Section 7.10 can be
solved in O(km) time using any linear-time string matching method. That is, the O(km)
time bound does not require a suffix tree. Explain why the bound does not involve a term
for k².
30. Consider again the all-pairs suffix-prefix matching problem. It is possible to solve the problem in the same time bound without an explicit tree traversal. First, build a generalized
suffix tree T(S) for the set of k strings S (as before), and set up a vector V of length k.
Then successively initialize vector V to contain all zeros, and match each string in the set
through the tree. The match using any string S_j ends at the leaf labeled with suffix 1 of
string S_j. During this walk for S_j, if a node v is encountered containing index i in its list L(v),
then write the string-depth of node v into position i of vector V. When the walk reaches
the leaf for suffix 1 of S_j, V(i), for each i, specifies the length of the longest suffix of S_i that
matches a prefix of S_j.
Establish the worst-case time analysis of this method. Compare any advantages or disadvantages (in practical space and/or time) of this method compared to the tree traversal
method discussed in Section 7.10. Then propose modifications to the tree traversal method
that maintain all of its advantages and also correct for its disadvantages.
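For intuition about what vector V ends up holding, here is a tiny brute-force Python reference for the all-pairs suffix-prefix problem. It is not the suffix-tree method of the exercise (it is quadratic per pair and meant only for checking small cases); the function name is ours.

def suffix_prefix_overlaps(strings):
    # overlaps[(i, j)] = length of the longest suffix of strings[i] that
    # matches a prefix of strings[j], computed by brute force.
    overlaps = {}
    k = len(strings)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            si, sj = strings[i], strings[j]
            best = 0
            for L in range(1, min(len(si), len(sj)) + 1):
                if si[-L:] == sj[:L]:
                    best = L
            overlaps[(i, j)] = best
    return overlaps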
31. A substring α is called a prefix repeat of string S if α is a prefix of S and has the form
ββ for some string β. Give a linear-time algorithm to find the longest prefix repeat of
an input string S. This problem was one of Weiner's motivations for developing suffix
trees.
Very frequently in the sequence analysis literature, methods aimed at finding interesting features in a biological sequence begin by cataloging certain substrings of
a long string. These methods almost always pick a fixed-length window, and then
find all the distinct strings of that fixed length. The result of this window or q-gram
approach is of course influenced by the choice of the window length. In the following three exercises, we show how suffix trees avoid this problem, providing a
natural and more effective extension of the window approach. See also Exercise 26
of Chapter 14.
32. There are m²/2 substrings of a string T whose length is m. Some of those substrings are
identical and so occur more than once in the string. Since there are Θ(m²) substrings, we
cannot count the number of times each appears in T in O(m) time. However, using a suffix
tree we can get an implicit representation of these numbers in O(m) time. In particular,
when any string P of length n is specified, the implicit representation should allow us to
compute the frequency of P in T in O(n) time. Show how to construct the implicit frequency
representation and how to use it.
33. Show how to count the number of distinct substrings of a string T in O(m) time, where
the length of T is m. Show how to enumerate one copy of each distinct substring in time
proportional to the length of all those strings.
34. One way to hunt for "interesting" sequences in a DNA sequence database is to look for
substrings in the database that appear much more often than they would be predicted to
appear by chance alone. This is done today and will become even more attractive when
huge amounts of anonymous DNA sequences are available.
Assuming one has a statistical model to determine how likely any particular substring would
occur by chance, and a threshold above which a substring is "interesting", show how to
efficiently find all interesting substrings in the database. If the database has total length m,
then the method should take time O(m) plus time proportional to the number of interesting
substrings.
45. Prove the correctness of the method presented in Section 7.13 for the circular string linearization problem.
46. Consider in detail whether a suffix array can be used to efficiently solve the more complex
string problems considered in this chapter. The goal is to maintain the space-efficient
properties of the suffix array while achieving the time-efficient properties of the suffix tree.
Therefore, it would be cheating to first use the suffix array for a string to construct a suffix
tree for that string.
47. Give the details of the preprocessing needed to implement the bad character rule in the
Boyer-Moore approach to exact set matching.
48. In Section 7.16.3, we used a suffix tree to implement a weak good suffix rule for a Boyer-Moore set matching algorithm. With that implementation, the increment of index i was
determined in constant time after any test, independent even of the alphabet size. Extend
the suffix tree approach to implement a strong good suffix rule, where again the increment
to i can be found in constant time. Can you remove the dependence on the alphabet in this
case?
49. Prove Theorem 7.16.2.
50. In the Ziv-Lempel algorithm, when computing (s_i, l_i) for some position i, why should the
traversal end at point p if the string-depth of p plus c_p equals i? What would be the problem
with letting the match extend past character i?
51. Try to give some explanation for why the Ziv-Lempel algorithm outputs the extra character
compared to compression algorithm 1.
52. Show how to compute all the n values ZL(i), defined in Section 7.18, in O(n) time. One
solution is related to the computation of matching statistics (Section 7.8.1).
53. Successive refinement methods
Successive refinement is a general algorithmic technique that has been used for a number
of string problems [114, 199, 265]. In the next several exercises, we introduce the ideas,
connect successive refinement to suffix trees, and apply successive refinement to particular
string problems.
Let S be a string of length n. The relation E_k is defined on pairs of suffixes of S. We say
i E_k j if and only if suffix i and suffix j of S agree for at least their first k characters. Note that
E_k is an equivalence relation and so it partitions the elements into equivalence classes.
Also, since S has n characters, every class in E_n is a singleton. Verify the following two
facts:
Fact 1 For any i ≠ j, i E_{k+1} j if and only if i E_k j and i+1 E_k j+1.
Fact 2 Every E_{k+1} class is a subset of an E_k class, and so the E_{k+1} partition is a refinement
of the E_k partition.
We use a labeled tree T, called the refinement tree, to represent the successive refinements
of the classes of E_k as k increases from 0 to n. The root of T represents class E_0 and
contains all the n suffixes of S. Each child of the root represents a class of E_1 and contains
the elements in that class. In general, each node at level l represents a class of E_l and its
children represent all the E_{l+1} classes that refine it.
What is the relationship of T to the keyword tree (Section 3.4) constructed from the set of
n suffixes of S?
Now modify T as follows. If node v represents the same set of suffixes as its parent node
v′, contract v and v′ to a single node. In the new refinement tree, T′, each nonleaf node
has at least two children. What is the relationship of T′ to the suffix tree for string S? Show
how to convert a suffix tree for S into tree T′ in O(n²) time.
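To experiment with these definitions, the following Python sketch computes the E_k partitions directly from the definition (grouping suffixes by their first k characters) rather than by the refinement step of Fact 1; it takes O(n²) time overall and is meant only for small examples. The names and conventions (1-based suffix numbers) are ours.

from collections import defaultdict

def ek_partitions(S):
    # partitions[k] is the list of E_k classes; suffixes are named by their
    # 1-based starting positions, as in the text. Two suffixes are in the same
    # E_k class iff they agree on their first k characters (a suffix shorter
    # than k characters therefore ends up in a singleton class).
    n = len(S)
    partitions = [[list(range(1, n + 1))]]  # E_0: one class with every suffix
    for k in range(1, n + 1):
        classes = defaultdict(list)
        for i in range(1, n + 1):
            classes[S[i - 1:i - 1 + k]].append(i)
        partitions.append(sorted(classes.values()))
    return partitions

# For S = "mississippi", partitions[11] consists of 11 singleton classes,
# illustrating that every class in E_n is a singleton.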
42. k-cover problem. Given two input strings S_1 and S_2 and a parameter k, a k-cover C is a
set of substrings of S_1, each of length k or greater, such that S_2 can be expressed as the
concatenation of the substrings of C in some order. Note that the substrings contained in
C may overlap in S_1, but not in S_2. That is, S_2 is a permutation of substrings of S_1 that are
each of length k or greater. Give a linear-time algorithm to find a k-cover from two strings
S_1 and S_2, or determine that no such cover exists.
If there is no k-cover, then find a set of substrings of S_1, each of length k or greater, that
cover the most characters of S_2. Or, find the largest k′ < k (if any) such that there is a
k′-cover. Give linear-time algorithms for these problems.
Consider now the problem of finding nonoverlapping substrings in S_1, each of length k or
greater, to cover S_2, or cover it as much as possible. This is a harder problem. Grapple
with it as best you can.
43. Exon shuffling. In eukaryotic organisms, a gene is composed of alternating exons, whose
concatenation specifies a single protein, and introns, whose function is unclear. Similar
exons are often seen in a variety of genes. Proteins are often built in a modular form,
being composed of distinct domains (units that have distinct functions or distinct folds
that are independent of the rest of the protein), and the same domains are seen in many
different proteins, although in different orders and combinations. It is natural to wonder if
exons correspond to individual protein domains, and there is some evidence to support
this view. Hence modular protein construction may be reflected in the DNA by modular
gene construction based on the reuse and reordering of stock exons. It is estimated that
all proteins sequenced to date are made up of just a few thousand exons [468]. This
phenomenon of reusing exons is called exon shuffling, and proteins created via exon
shuffling are called mosaic proteins. These facts suggest the following general search
problem.
The problem: Given anonymous, but sequenced, strings of DNA from protein-coding regions where the exons and introns are not known, try to identify the exons by finding
common regions (ideally, identical substrings) in two or more DNA strings. Clearly, many
of the techniques discussed in this chapter concerning common or repeated substrings
could be applied, although they would have to be tried out on real data to test their utility
or limitations. No elegant analytical result should be expected. In addition to methods for
repeats and common substrings, does the k-cover problem seem of use in studying exon
shuffling? That question will surely require an empirical, rather than theoretical, answer.
Although it may not give an elegant worst-case result, it may be helpful to first find all the
maximal common substrings of length k or more.
44. Prove Lemma 7.14.1.
of the (singleton) classes describes a permutation of the integers 1 to n. Prove that this
permutation is the suffix array for string S. Conclude that the reverse refinement method
creates a suffix array in O(n log n) time. What is the space advantage of this method over
the O(n)-time method detailed in Section 7.14.1?
56. Primitive tandem arrays
Recall that a string α is called a tandem array if α is periodic (see Section 3.2.1), i.e., it
can be written as β^l for some l ≥ 2. When l = 2, the tandem array can also be called a
tandem repeat. A tandem array α = β^l contained in a string S is called maximal if there
are no additional copies of β before or after α.
Maximal tandem arrays were initially defined in Exercise 4 in Chapter 1 (page 13) and
the importance of tandem arrays and repeats was discussed in Section 7.11.1. We are
interested in identifying the maximal tandem arrays contained in a string. As discussed
before, it is often best to focus on a structured subset of the strings of interest in order to
limit the size of the output and to identify the most informative members. We focus here on
a subset of the maximal tandem arrays that succinctly and implicitly encode all the maximal
tandem arrays. (In Sections 9.5, 9.6, and 9.6.1 we will discuss efficient methods to find all
the tandem repeats in a string, and we allow the repeats to contain some errors.)
We use the pair (β, l) to describe the tandem array β^l. Now consider the tandem array
α = abababababababab. It can be described by the pair (abababab, 2), or by (abab, 4),
or by (ab, 8). Which description is best? Since the first two pairs can be deduced from the
last, we choose the last pair. This "choice" will now be precisely defined.
A string β is said to be primitive if β is not periodic. For example, the string ab is primitive,
whereas abab is not. The pair (ab, 8) is the preferred description of abababababababab
because string ab is primitive. The preference for primitive strings extends naturally to the
description of maximal tandem arrays that occur as substrings in larger strings. Given a
string S, we use the triple (i, β, l) to mean that a tandem array (β, l) occurs in S starting at
position i. A triple (i, β, l) is called a pm-triple if β is primitive and β^l is a maximal tandem
array.
For example, the maximal tandem arrays in mississippi described by the pm-triples are
(2, iss, 2), (3, s, 2), (3, ssi, 2), (6, s, 2), and (9, p, 2). Note that two or more pm-triples can have
the same first number, since two different maximal tandem arrays can begin at the same
position. For example, the two maximal tandem arrays ss and ssissi both begin at position
three of mississippi.
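As a sanity check on these definitions, here is a naive Python sketch that enumerates pm-triples by brute force (roughly cubic time; Crochemore's method discussed below achieves O(n log n)). The function names and the 1-based output positions are ours.

def is_primitive(beta):
    # beta is primitive iff it is not a power of any shorter string
    return not any(len(beta) % d == 0 and beta == beta[:d] * (len(beta) // d)
                   for d in range(1, len(beta)))

def pm_triples(S):
    n = len(S)
    out = []
    for i in range(n):                       # 0-based start of the array
        for plen in range(1, (n - i) // 2 + 1):
            beta = S[i:i + plen]
            if not is_primitive(beta):
                continue
            l, j = 0, i
            while S[j:j + plen] == beta:     # count consecutive copies of beta
                l, j = l + 1, j + plen
            # need at least two copies, and no copy of beta just before i
            if l >= 2 and not (i >= plen and S[i - plen:i] == beta):
                out.append((i + 1, beta, l)) # report 1-based position
    return out

# pm_triples("mississippi") yields
# [(2, 'iss', 2), (3, 's', 2), (3, 'ssi', 2), (6, 's', 2), (9, 'p', 2)].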
The pm-triples succinctly encode all the tandem arrays in a given string S. Crochemore
[114] (with different terminology) used a successive refinement method to find all the pm-triples in O(n log n) time. This implies the very nontrivial fact that in any string of length n
there can be only O(n log n) pm-triples. The method in [114] finds the E_k partition for each
k. The following lemma is central:
Lemma 7.20.1. There is a tandem repeat of a k-length substring β starting at position i
of S if and only if the numbers i and i + k are both contained in a single class of E_k and
no numbers between i and i + k are in that class.
Prove Lemma 7.20.1. One direction is easy. The other direction is harder and it may be
useful to use Lemma 3.2.1 (page 40).
57. Lemma 7.20.1 makes it easy to identify pm-triples. Assume that the indices in each class
of E_k are sorted in increasing order. Lemma 7.20.1 implies that (i, β, l) is a pm-triple, where
β is a k-length substring, if and only if some single class of E_k contains a maximal series
of numbers i, i + k, i + 2k, ..., i + jk, such that each consecutive pair of numbers differs by
k. Explain this in detail.
54. Several string algorithms use successive refinement without explicitly finding or representing all the classes in the refinement tree. Instead, they construct only some of the classes
or only compute the tree implicitly. The advantage is reduced use of space in practice or
an algorithm that is better suited for parallel computation [116]. The original suffix array
construction method [308] is such an algorithm. In that algorithm, the suffix array is obtained as a byproduct of a successive refinement computation where the E_k partitions are
computed only for values of k that are a power of two. We develop that method here. First
we need an extension of Fact 1:
The above problems can be generalized in many different directions and solved in essentially the same way. One particular generalization is the exact matching version of the
primer selection problem. (In Section 12.2.5 we will consider a version of this problem that
allows errors.)
The primer selection problem arises frequently in molecular biology. One such situation is
in "chromosome walking", a technique used in some DNA sequencing methods or gene
location problems. Chromosome walking was used extensively in the location of the Cystic Fibrosis gene on human chromosome 7. We discuss here only the DNA sequencing
application.
In DNA sequencing, the goal is to determine the complete nucleotide sequence of a long
string of DNA. To understand the application you have to know two things about existing
sequencing technology. First, current common laboratory methods can only accurately sequence a small number of nucleotides, from 300 to 500, from one end of a longer string.
Second, it is possible to replicate substrings of a DNA string starting at almost any point
as long as you know a small number of the nucleotides, say nine, to the left of that point.
This replication is done using a technology called polymerase chain reaction (PCR), which
has had a tremendous impact on experimental molecular biology. Knowing as few as nine
nucleotides allows one to synthesize a string that is complementary to those nine nucleotides. This complementary string can be used to create a "primer", which finds its way
to the point in the long string containing the complement of the primer. It then hybridizes
with the longer string at that point. This creates the conditions that allow the replication
of part of the original string to the right of the primer site. (Usually PCR is done with two
primers, one for each end, but here only one "variable" primer is used. The other primer is
fixed and can be ignored in this discussion.)
The above two facts suggest a method to sequence a long string of DNA, assuming we
know the first nine nucleotides at the very start of the string. After sequencing the first
300 (say) nucleotides, synthesize a primer complementary to the last nine nucleotides just
sequenced. Then replicate a string containing the next 300 nucleotides, sequence that
substring and continue. Hence the longer string gets sequenced by successively sequencing 300 nucleotides at a time, using the end of each sequenced substring to create the
primer that initiates sequencing of the next substring. Compared to the shotgun sequencing method (to be discussed in Section 16.14), this directed method requires much less
sequencing overall, but because it is an inherently sequential process it takes longer to
sequence a long DNA string. (In the Cystic Fibrosis case another idea, called gene jumping, was used to partially parallelize this sequential process, but chromosome walking is
generally laboriously sequential.)
There is a common problem with the above chromosome walking approach. What happens if the string consisting of the last nine nucleotides appears in another place in the
larger string? Then the primer may not hybridize in the correct position and any sequence
determined from that point would be incorrect. Since we know the sequence to the left of
our current point, we can check the known sequence to see if a string complementary to
the primer exists to the left. If it does, then we want to find a nine-length substring near the
end of the last determined sequence that does not appear anywhere earlier. That substring
can then be used to form the primer. The result will be that the next substring sequenced
will resequence some known nucleotides and so sequence somewhat fewer than 300 new
nucleotides.
Problem: Formalize this primer selection problem and show how to solve it efficiently using
suffix trees. More generally, for each position i in string α, find the shortest substring that
begins at i and that appears nowhere else in α or S.
By using Fact 1 in place of Fact 3, and by modifying the reverse refinement method developed in Exercises 54 and 55, show how to compute all the E_k partitions for all k (not
just the powers of two) in O(n²) time. Give implementation details to maintain the indices
of each class sorted in increasing order. Next, extend that method, using Lemma 7.20.1,
to obtain an O(n²)-time algorithm to find all the pm-triples in a string S.
58. To find all the pm-triples in O(n log n) time, Crochemore [114] used one additional idea. To
introduce the idea, suppose all E_k classes except one, C say, have been used as refiners
to create E_{k+1} from E_k. Let p and q be two indices that are together in some E_k class. We
claim that if p and q are not together in the same E_{k+1} class, then one of them (at least)
has already been placed in its proper E_{k+1} class. The reason is that by Fact 1, p + 1 and
q + 1 cannot both be in the same E_k class. So by the time C is used as a refiner, either p
or q has been marked and moved by an E_k class already used as a refiner.
Now suppose that each E_k class is held in a linked list and that when a refiner identifies
a number, p say, then p is removed from its current linked list and placed in the linked
list for the appropriate E_{k+1} class. With that detail, if the algorithm has used all the E_k
classes except C as refiners, then all the E_{k+1} classes are correctly represented by the
newly created linked lists plus what remains of the original linked lists for E_k. Explain this
in detail. Conclude that one E_k class need not be used as a refiner.
Being able to skip one class while refining E_k is certainly desirable, but it isn't enough to
produce the stated bound. To do that we have to repeat the idea on a larger scale.
Theorem 7.20.1. When refining E_k to create E_{k+1}, suppose that for every k ≥ 1, exactly
one (arbitrary) child of each E_{k−1} class is skipped (i.e., not used as a refiner). Then the
resulting linked lists correctly identify the E_{k+1} classes.
Prove Theorem 7.20.1. Note that Theorem 7.20.1 allows complete freedom in choosing
which child of an E_{k−1} class to skip. This leads to the following:
Theorem 7.20.2. If, for every k > 1, the largest child of each E_{k−1} class is skipped, then
the total size of all the classes used as refiners is at most n log₂ n.
Prove Theorem 7.20.2. Now provide all the implementation details to find all the pm-triples
in S in O(n log n) time.
59. Above, we established the bound of O(n log n) pm-triples as a byproduct of the algorithm
to find them. But a direct, nonalgorithmic proof is possible, still using the idea of successive
refinement and Lemma 7.20.1. In fact, the bound of 3n log₂ n is fairly easy to obtain in this
way. Do it.
60. Folklore has it that for any position i in S, if there are two pm-triples, (i, β, l) and (i, β′, l′),
and if |β′| > |β|, then |β′| ≥ 2|β|. That would limit the number of pm-triples with the same
first number to log₂ n, and the O(n log n) bound would be immediate.
Show by example that the folklore belief is false.
61. Primer selection problem
Let S be a set of strings over some finite alphabet Σ. Give an algorithm (using a generalized
suffix tree) to find a shortest string over Σ that is a substring in none of the strings of S.
The algorithm should run in time proportional to the sum of the lengths of the strings in S.
A more useful version of the problem is to find the shortest string that is longer than a
certain minimum length and is not a substring of any string of S. Often, a string α is given
along with the set S. Now the problem becomes one of finding a shortest substring of α
(if any) that does not appear as a substring of any string in S. More generally, for every i,
compute the shortest substring (if any) that begins at position i of α and does not appear
as a substring of any string in S.
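For concreteness, here is a brute-force Python sketch of the last question (quadratic in |α| times the total length of S; the exercise asks for a generalized-suffix-tree solution that is far faster). The function name is ours.

def shortest_absent_substring_at(alpha, S):
    # For each 0-based position i of alpha, return the shortest substring
    # beginning at i that is a substring of no string in the set S
    # (None if every substring beginning at i occurs somewhere in S).
    result = []
    for i in range(len(alpha)):
        found = None
        for j in range(i + 1, len(alpha) + 1):
            candidate = alpha[i:j]
            if not any(candidate in s for s in S):
                found = candidate
                break
        result.append(found)
    return result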
8
Constant-Time Lowest Common Ancestor Retrieval
8.1. Introduction
We now begin the discussion of an amazing result that greatly extends the usefulness of
suffix trees (in addition to many other applications).
Definition In a rooted tree T, a node u is an ancestor of a node v if u is on the unique
path from the root to v. With this definition a node is an ancestor of itself. A proper
ancestor of v refers to an ancestor that is not v.
Definition In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y
is the deepest node in T that is an ancestor of both x and y.
For example, in Figure 8.1 the lca of nodes 6 and 10 is node 5, while the lca of 6 and 3
is 1.
The amazing result is that after a linear amount of preprocessing of a rooted tree, any
two nodes can then be specified and their lowest common ancestor found in constant
time. That is, a rooted tree with n nodes is first preprocessed in O(n) time, and thereafter
any lowest common ancestor query takes only constant time to solve, independent of n.
Without preprocessing, the best worst-case time bound for a single query is O(n), so this
is a most surprising and useful result. The lca result was first obtained by Harel and Tarjan
[214] and later simplified by Schieber and Vishkin [393]. The exposition here is based on
the latter approach.
62. In the primer selection problem, the goal of avoiding incorrect hybridizations to the right of
the sequenced part of the string is more difficult since we don't yet know the sequence. Still,
there are some known sequences that should be avoided. As discussed in Section 7.11.1,
eukaryotic DNA frequently contains regions of repeated substrings, and the most commonly occurring substrings are known. On the problem that repeated substrings cause for
chromosome walking, R. Weinberg writes:
They were like quicksand; anyone treading on them would be sucked in and then propelled,
like Alice in Wonderland, through some vast subterranean tunnel system, only to resurface
somewhere else in the genome, miles away from the starting site. The genome was riddled with these sinkholes, called "repeated sequences." They were guaranteed to slow any
chromosomal walk to a crawl.
So a more general primer problem is the following: Given a substring α of 300 nucleotides
(the last substring sequenced), a string β of known sequence (the part of the long string
to the left of α whose sequence is known), and a set S of strings (the common parts of
known repetitive DNA strings), find the furthest right substring in α of length nine that is
not a substring of β or any string in set S. If there is no such string, then we might seek a
string of length larger than nine that does not appear in β or S. However, a primer much
larger than nine nucleotides long may falsely hybridize for other reasons. So one must
balance the constraints of keeping the primer length in a certain range, making it unique,
and placing it as far right as possible.
Problem: Formalize this version of the primer selection problem and show how to apply
suffix trees to it.
Probe selection
A variant of the primer selection problem is the hybridization probe selection problem. In
DNA fingerprinting and mapping (discussed in Chapter 16) there is frequent need to see
which oligomers (short pieces of DNA) hybridize to some target piece of DNA. The purpose
of the hybridization is not to create a primer for PCR but to extract some information about
the target DNA. In such mapping and fingerprinting efforts, contamination of the target DNA
by vector DNA is common, in which case the oligo probe may hybridize with the vector DNA
instead of the target DNA. One approach to this problem is to use specifically designed
oligomers whose sequences are rarely in the genome of the vector, but are frequently found
in the cloned DNA of interest. This is precisely the primer (or probe) selection problem.
In some ways, the probe selection problem is a better fit than the primer problem is to
the exact matching techniques discussed in this chapter. This is because when designing
probes for mapping, it is desirable and feasible to design probes so that even a single
mismatch will destroy the hybridization. Such stringent probes can be created under certain
conditions [134, 177].
R. Weinberg, Racing to the Beginning of the Road: The Search for the Origin of Cancer. Harmony Books, 1996.
Figure 8.2: A binary tree with four leaves. The path numbers are written both in binary and in base ten.
that encode paths to them. The notation B will refer to this complete binary tree, and T
will refer to an arbitrary tree.
Suppose that B is a rooted complete binary tree with p leaves (n = 2p − 1 nodes in
total), so that every internal node has exactly two children and the number of edges on
the path from the root to any leaf in B is d = log₂ p. That is, the tree is complete and
all leaves are at the same depth from the root. Each node v of B is assigned a d + 1 bit
number, called its path number, that encodes the unique path from the root to v. Counting
from the left-most bit, the ith bit of the path number for v corresponds to the ith edge on
the path from the root to v: A 0 for the ith bit from the left indicates that the ith edge on
the path goes to a left child, and a 1 indicates a right child.¹ For example, a path that goes
left twice, right once, and then left again ends at a node whose path number begins (on the
left) with 0010. The bits that describe the path are called path bits. Each path number is
then padded out to d + 1 bits by adding a 1 to the right of the path bits followed by as many
additional 0s as needed to make d + 1 bits. Thus for example, if d = 6, the node with
path bits 0010 is named by the 7-bit number 0010100. The root node for d = 6 would be
1000000. In fact, the root node always has a number with left bit 1 followed by d 0s. (See
Figure 8.2 for an additional example.) We will refer to nodes in B by their path numbers.
As the tree in Figure 8.2 suggests, path numbers have another well-known description
- that of inorder numbers. That is, when the nodes of B are numbered by an inorder
traversal (recursively number the left child, number the root, and then recursively number
the right child), the resulting node numbers are exactly the path numbers discussed above.
We leave the proof of this for the reader (it has little significance in our exposition). The
path number concept is preferred since it explicitly relates the number of a node to the
description of the path to it from the root.
Given two nodes i and j, we want to find lca(i, j) in B (remembering that both i and j
are path numbers). First, when lca(i, j) is either i or j (i.e., one of these two nodes is an
ancestor of the other), then this can be detected by a very simple constant-time algorithm,
discussed in Exercise 3. So assume that lca(i, j) is neither i nor j. The algorithm begins
by taking the exclusive or (XOR) of the binary number for i and the binary number for j,
denoting the result by x_ij. The XOR of two bits is 1 if and only if the two bits are different,
and the XOR of two d + 1 bit numbers is obtained by independently taking the XOR of
¹ Note that normally when discussing binary numbers, the bits are numbered from right (least significant) to left
(most significant). This is opposite the left-to-right ordering used for strings and for path numbers.
However, although the method is easy to program, it is not trivial to understand at first and
has been described as based on "bit magic". Nonetheless, the result has been so heavily
applied in many diverse string methods, and its use is so critical in those methods, that
a detailed discussion of the result is worthwhile. We hope the following exposition is a
significant step toward making the method more widely understood.
standard depth-first numbering (preorder numbering) of nodes (see Figure 8.1). With this
numbering scheme, the nodes in the subtree of any node v in T have consecutive depth-first numbers, beginning with the number for v. That is, if there are q nodes in the subtree
rooted at v, and v gets numbered k, then the numbers given to the other nodes in the
subtree are k + 1 through k + q − 1.
For convenience, from this point on the nodes in T will be referred to by their depth-first
numbers. That is, when we refer to node v, v is both a node and a number. Be careful not
to confuse depth-first numbers used for the general tree T with path numbers used only
for the binary tree B.
Definition For any number k, h(k) denotes the position (counting from the right) of
the least-significant 1-bit in the binary representation of k.
For example, h(8) = 4 since 8 in binary is 1000, and h(5) = 1 since 5 in binary is 101.
Another way to think of this is that h(k) is one plus the number of consecutive zeros at the
right end of k.
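In code, h(k) is a one-liner. Here is a Python sketch (the k & -k trick isolates the least-significant 1-bit; the function name follows the text's notation):

def h(k):
    # Position, counting from 1 at the right, of the least-significant 1-bit of k.
    return (k & -k).bit_length()

assert h(8) == 4  # 8 is 1000 in binary
assert h(5) == 1  # 5 is 101 in binary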
Definition In a complete binary tree the height of a node is the number of nodes on
the path from it to a leaf. The height of a leaf is one.
The following lemma states a crucial fact that is easy to prove by induction on the
height of the nodes.
Lemma 8.5.1. For any node k (node with path number k) in B, h(k) equals the height of
node k in B.
For example, node 8 (binary 1000) is at height 4, and the path from it to a leaf has four
nodes (three edges).
Definition For a node v of T, let I(v) be a node w in the subtree of v such that h(w)
is maximum over all nodes in the subtree of v (including v itself).
That is, over all the nodes in the subtree of v, I(v) is a node (depth-first number) whose
binary representation has the largest number of consecutive zeros at its right end. Figure 8.4
shows the node numbers from Figure 8.1 in binary and base 10. Then I(1), I(5), and I(8)
are all 8, I(2) and I(4) are both 4, and I(v) = v for every other node in the figure.
Figure 8.4: Node numbers given in four-bit binary, to illustrate the definition of I(v).
Figure 8.3: A binary tree with four leaves. The path numbers are in binary, and the position of the least-significant 1-bit is given in base ten.
each bit of the two numbers. For example, the XOR of 00101 and 10011 is 10110. Since i
and j are O(log n) bits long, XOR is a constant-time operation in our model.
The algorithm next finds the most significant (left-most) 1-bit in x_ij. If the left-most
1-bit in the XOR of i and j is in position k (counting from the left), then the left-most
k − 1 bits of i and j are the same, and the paths to i and j must agree for the first k − 1
edges and then diverge. It follows that the path number for lca(i, j) consists of the left-most
k − 1 bits of i (or j) followed by a 1-bit followed by d + 1 − k zeros. For example, in
Figure 8.2, the XOR of 101 and 111 (nodes 5 and 7) is 010, so their respective paths share
one edge - the right edge out of the root. The XOR of 010 and 101 (nodes 2 and 5) is
111, so the paths to 2 and 5 have no agreement, and hence 100, the root, is their lowest
common ancestor.
Therefore, to find lca(i, j), the algorithm must XOR two numbers, find the left-most
1-bit in the result (say at position k), shift i right by d + 1 − k places, set the right-most bit
to a 1, and shift it back left by d + 1 − k places. By assumption, each of these operations
can be done in constant time, and hence the lowest common ancestor of i and j can be
found in constant time in B.
In summary, we have
Theorem 8.4.1. In a complete binary tree, after linear-time preprocessing to name nodes
by their path numbers, any lowest common ancestor query can be answered in constant
time.
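Here is a minimal Python sketch of the query step of Theorem 8.4.1 (assuming, as in the text, that neither node is an ancestor of the other; the function name is ours):

def lca_in_B(i, j):
    # i and j are path numbers in the complete binary tree B; assumes neither
    # node is an ancestor of the other (that case is handled separately, as in
    # Exercise 3).
    x = i ^ j                        # XOR the two path numbers
    shift = x.bit_length() - 1       # equals d + 1 - k for the left-most 1-bit k
    return ((i >> shift) | 1) << shift  # keep agreed bits, set a 1, pad with 0s

# From Figure 8.2 (d = 2): nodes 5 (101) and 7 (111) have lca 6 (110),
# and nodes 2 (010) and 5 (101) have lca 4 (100), the root.
assert lca_in_B(0b101, 0b111) == 0b110
assert lca_in_B(0b010, 0b101) == 0b100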
This simple case of a complete binary tree is very special, but it is presented both
to develop intuition and because complete binary trees are used in the description of
the general case. Moreover, by actually using complete binary trees, a very elegant and
relatively simple algorithm can answer lca queries in constant time, if O(n log n) time
is allowed for preprocessing T and O(n log n) space is available after the preprocessing.
That method is explored in Exercise 12.
The lca algorithm we will present for general trees builds on the case of a complete
binary tree. The idea (conceptually) is to map the nodes of a general tree T to the nodes of
a complete binary tree B in such a way that lca retrievals on B will help to quickly solve
lca queries on T. We first describe the general lca algorithm assuming that the T to B
mapping is explicitly used, and then we explain how explicit mapping can be avoided.
Figure 8.7: A node v in B is numbered if there is a node in T that maps to v.
That is, two nodes u and v are in the same run if and only if I(u) = I(v). Figure 8.6
shows a partition of the nodes of T into runs.
Algorithmically we can set I(v), for all nodes, using a linear-time bottom-up traversal
of T as follows: For every leaf v, I(v) = v. For every internal node v, I(v) = v if h(v)
is greater than h(I(v′)) for every child v′ of v. Otherwise, I(v) is set to the I(v′) value of
the child v′ whose h(I(v′)) value is the maximum over all children of v. The result is that
each run forms an upward path of nodes in T. And, since the h(I()) values never decrease
along any upward path in T, it follows that
Lemma 8.6.1. For any node v, node I(v) is the deepest node in the run containing node v.
These facts are illustrated in Figure 8.6.
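The bottom-up rule just described is easy to express in code. Below is a Python sketch (the tree representation as a children map keyed by depth-first node numbers is ours, and recursion is used for brevity):

def compute_I(children, root):
    # children maps each node (its depth-first number) to a list of its
    # children; returns the map I following the bottom-up rule: I(v) is v or
    # the I(v') of a child v', whichever maximizes h(.).
    def h(k):
        return (k & -k).bit_length()
    I = {}
    def visit(v):
        best = v
        for c in children.get(v, []):  # a leaf simply gets I(v) = v
            visit(c)
            if h(I[c]) > h(best):
                best = I[c]
        I[v] = best
    visit(root)
    return I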
Definition Define the head of a run to be the node of the run closest to the root.
For example, in Figure 8.6 node 1 (0001) is the head of a run of length three, node 2
(0010) is the head of a run of length two, and every remaining node (not in either of those
two runs) is the head of a run consisting only of itself.
Lemma 8.5.2. For any node v in T, there is a unique node w in the subtree of v such that h(w) is maximum over all nodes in v's subtree.
PROOF Suppose not, and let u and w be two nodes in the subtree of v such that h(u) = h(w) ≥ h(q) for every node q in that subtree. Assume h(u) = i. By adding zeros to the left ends if needed, we can consider the two numbers u and w to have the same number of bits, say l. Since u ≠ w, those two numbers must differ in some bit to the left of i (since by assumption bit i is 1 in both u and w, and all bits to the right of i are zero in both). Assume u > w, and let k be the left-most position where such a difference between u and w occurs. Consider the number N composed of the left-most l − k bits of u followed by a 1 in bit k followed by k − 1 zeros (see Figure 8.5). Then N is strictly less than u and greater than w. Hence N must be the depth-first number given to some node in the subtree of v, because the depth-first numbers given to nodes below v form a consecutive interval. But h(N) = k > i = h(u), contradicting the fact that h(u) ≥ h(q) for all nodes in the subtree of v. Hence the assumption that h(u) = h(w) leads to a contradiction, and the lemma is proved.
Lemma 8.7.1. If z is an ancestor of x in T, then I(z) is an ancestor of I(x) in B.
PROOF OF LEMMA 8.7.1 The proof is trivial if I(z) = I(x), so assume that they are unequal. Since z is an ancestor of x in T, h(I(z)) ≥ h(I(x)) by the definition of I, but equality is only possible if I(z) = I(x). So h(I(z)) > h(I(x)). Now h(I(z)) and h(I(x)) are the respective heights of nodes I(z) and I(x) in B, so I(z) is at a height greater than the height of I(x) in B.
Let h(I(z)) be i. We claim that I(z) and I(x) are identical in all bits to the left of i (recall that bits of a binary number are numbered from the right). If not, then let k > i be the left-most bit where I(z) and I(x) differ. Without loss of generality, assume that I(z) has bit 1 and I(x) has bit 0 in position k. Since k is the point of left-most difference, the bits to the left of position k are equal in the two numbers, implying that I(z) > I(x). Now z is an ancestor of x in T, so nodes I(z) and I(x) are both in the subtree of z in T. Furthermore, since I(z) and I(x) are depth-first numbers of nodes in the subtree of z in T, every number between I(x) and I(z) occurs as a depth-first number of some node in the subtree of z. In particular, let N be the number consisting of the bits to the left of position k in I(z) (or I(x)) followed by 1 followed by all 0s. (Figure 8.5 helps illustrate the situation, although z plays the role of u and x plays the role of w, and bit i in I(z) is unknown.) Then I(x) < N < I(z); therefore N is also a node in the subtree of z. But k > i, so h(N) > h(I(z)), contradicting the definition of I. It follows that I(z) and I(x) must be identical in the bits to the left of bit i.
Now bit i is the right-most 1-bit in I(z), so the bits to the left of bit i describe the complete path in B to node I(z). Those identical bits to the left of bit i also form the initial part of the description of the path in B to node I(x), since I(x) has a 1-bit to the right of bit i. So those bits are in the path descriptions of both I(z) and I(x), meaning that the path to node I(x) in B must go through node I(z). Therefore, node I(z) is an ancestor of node I(x) in B, and the lemma is proved.
Having described the preprocessing of T and developed some of the properties of the tree map, we can now describe the way that lca queries are answered.
Definition The tree map is the mapping of nodes of T to nodes of a complete binary tree B with depth d = ⌈log n⌉ − 1. In particular, node v of T maps to node I(v) of B (recall that nodes of B are named by their path numbers).
The tree map is well defined because I(v) is a d + 1 bit number, and each node of B is named by a distinct d + 1 bit number. Every node in a run of T maps to the same node in B, but not all nodes in B generally have nodes in T mapping to them. Figure 8.7 shows tree B for tree T from Figure 8.6. A node v in B is numbered if there is a node in T that maps to v.
Theorem 8.8.2. Let j be the smallest position greater than or equal to i such that both A_x and A_y have 1-bits in position j. Then node I(z) is at height j in B; in other words, h(I(z)) = j.
PROOF Suppose I(z) is at height k in B. We will show that k = j. Since z is an ancestor of both x and y, both A_x and A_y have a 1-bit in position k. Furthermore, since I(z) is an ancestor of both I(x) and I(y) in B (by Lemma 8.7.1), k ≥ i, and it follows (by the selection of j) that k ≥ j. This also establishes that a position j ≥ i exists where both A_x and A_y have 1-bits.
A_x has a 1-bit in position j and j ≥ i, so x has an ancestor x' in T such that I(x') is an ancestor of I(x) in B and I(x') is at height j ≥ i, the height of b in B. It follows that I(x') is an ancestor of b. Similarly, there is an ancestor y' of y in T such that I(y') is at height j and is an ancestor of b in B. But if I(x') and I(y') are at the same height (j) and both are ancestors of the single node b, then it must be that I(x') = I(y'), meaning that x' and y' are on the same run. Being on the same run, either x' is an ancestor in T of y' or vice versa. Say, without loss of generality, that x' is an ancestor of y' in T. Then x' is a common ancestor of x and y, and x' is an ancestor of z in T. Hence x' must map to the same height or higher than z in B. That is, j ≥ k. But k ≥ j was already established, so k = j as claimed, and the theorem is proved.
All the pieces for lca retrieval in T have now been described, and each takes only constant time. In summary, the lowest common ancestor z of any two nodes x and y in T (assuming z is neither x nor y) can be found in constant time by the following method:
1. Find node b, the lowest common ancestor of I(x) and I(y) in the complete binary tree B.
2. Find i = h(b), the height of node b in B.
3. Find j = h(I(z)), the smallest position greater than or equal to i such that both A_x and A_y have 1-bits in position j.
4. Find node x̄, the closest ancestor of x in T that is on the same run as z, and similarly find ȳ (as detailed in the proof of Theorem 8.8.1).
5. z is x̄ or ȳ, whichever has the smaller depth-first number.
Theorem 8.8.1. Let z denote the lca of x and y in T. If we know h(I(z)), then we can find z in T in constant time.
PROOF Consider the run containing z in T. The path up T from x to z enters that run at some node x̄ (possibly z) and then continues along that run until it reaches z. Similarly, the path up from y to z enters the run at some node ȳ and continues along that run until z. It follows that z is either x̄ or ȳ. In fact, z is the higher of those two nodes, and so by the numbering scheme, z = x̄ if and only if x̄ < ȳ. For example, in Figure 8.6 when x = 9 (1001) and y = 6 (0110), then x̄ = 8 (1000) and ȳ = z = 5 (0101).
Given the above discussion, the approach to finding z from h(I(z)) is to use h(I(z)) to find x̄ and ȳ, since those nodes determine z. We will explain how to find x̄. Let h(I(z)) = j, so the height in B of I(z) is j. By Lemma 8.7.1, node x (which is in the subtree of z in T) maps to a node I(x) in the subtree of node I(z) in B, so if h(I(x)) = j then x must be on the same run as z (i.e., x = x̄), and we are finished. Conversely, if x = x̄, then h(I(x)) must be j. So assume from here on that x ≠ x̄.
Let w (which is possibly x) denote the node in T on the z-to-x path just below (off) the run containing z. Since x is not x̄, x is not on the same run as z, and w exists. From h(I(z)) (which is assumed to be known) and A_x (which was computed during the preprocessing), we will deduce h(I(w)) and then I(w), w, and x̄.
Since w is in the subtree of z in T and is not on the same run as z, w maps to a node in B with height strictly less than the height of I(z) (this follows from Lemma 8.7.1). In fact, by Lemma 8.7.1, among all nodes on the path from x to z that are not on z's run, w maps to a node of greatest height in B. Thus, h(I(w)) (which is the height in B that w maps to) must be the largest position less than j such that A_x has a 1-bit in that position. That is, we can find h(I(w)) (even though we don't know w) by finding the most significant 1-bit of A_x in a position less than j. This can be done in constant time on the assumed machine (starting with all bits set to 1, shift right by d − j + 2 positions so that only the positions below j remain set, AND this number together with A_x, and then find the left-most 1-bit in the resulting number).
Let h(I(w)) = k. We will now find I(w). Either w is x or w is a proper ancestor of x in T, so either I(w) = I(x) or node I(w) is a proper ancestor of node I(x) in B. Moreover, by the path-encoding nature of the path numbers in B, numbers I(x) and I(w) are identical in bits to the left of k, and I(w) has a 1 in bit k and all 0s to the right. So I(w) can be obtained from I(x) (which we know) and k (which we obtained as above from h(I(z)) and A_x). Moreover, I(w) can be found from I(x) and h(I(w)) using constant-time bit operations.
Given I(w) we can find w because w = L(I(w)). That is, w was just off the z run, so it must be the head of the run that it is on, and the map L points from each mapped-to node of B to the head of the corresponding run in T. From w we find its parent x̄ in constant time.
In summary, assuming we know h(I(z)), we can find node x̄, which is the closest ancestor of x in T that is on the same run as z. Similarly, we find ȳ. Then z is either x̄ or ȳ; in fact, z is the node among those two with minimum depth-first number in T. Of course, we must now explain how to find j = h(I(z)).
8.11. Exercises
1. Using depth-first traversal, show how to construct the path numbers for the nodes of B in time proportional to n, the number of nodes in B. Be careful to observe the constraints of the RAM model.
2. Prove that the path numbers in
3. The lca algorithm for a complete binary tree was detailed in the case that lca(i, j) was neither i nor j. In the case that lca(i, j) is one of i or j, a very simple constant-time algorithm can determine lca(i, j). The idea is first to number the nodes of the binary tree B by a depth-first numbering, and to note for each node v the number of nodes in the subtree of v (including v). Let I(v) be the dfs number given to node v, and let s(v) be the number of nodes in the subtree of v. Then node i is an ancestor of node j if and only if I(i) ≤ I(j) and I(j) < I(i) + s(i).
Prove that this is correct, and fill in the details to show that the needed preprocessing can be done in O(n) time.
Show that the method extends to any tree, not just complete binary trees.
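A minimal sketch of the preprocessing and ancestor test this exercise asks for (our own illustration; the tree is assumed to be given as a children dict):

    def dfs_preprocess(root, children):
        # I[v]: dfs number of v; s[v]: number of nodes in v's subtree (v included)
        I, s, counter = {}, {}, [0]
        def visit(v):
            counter[0] += 1
            I[v] = counter[0]
            s[v] = 1
            for c in children.get(v, []):
                visit(c)
                s[v] += s[c]
        visit(root)
        return I, s

    def is_ancestor(i, j, I, s):
        # node i is an ancestor of node j iff I(i) <= I(j) < I(i) + s(i)
        return I[i] <= I[j] < I[i] + s[i]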
4. In the special case of a complete binary tree B, there is an alternative way to handle the situation when lca(i, j) is i or j. Using h(i) and h(j) we can determine which of the nodes i and j is higher in the tree (say i) and how many edges are on the path from the root to node i. Then we take the XOR of the binary representations of i and j and find the left-most 1-bit as before, say in position k (counting from the left). Node i is an ancestor of j if and only if k is larger than the number of edges on the path to node i. Fill in the details of this argument and prove it is correct.
5. Explain why in the lca algorithm for B it was necessary to assume that lca(i, j) was neither i nor j. What would go wrong in that algorithm if the issue were ignored and that case was not checked explicitly?
6. Prove that the height of any node k in B is h(k).
7. Write a C program for both the preprocessing and the lca retrieval. Test the program on large trees and time the results.
8. Give an explicit O(n)-time RAM algorithm for building the table containing the right-most 1-bit in every log₂ n bit number. Remember that the entry for binary number i must be in the ith position in the table. Give details for building tables for AND, OR, and NOT for log₂ n bit numbers in O(n) time.
9. It may be more reasonable to assume that the RAM can shift a word left and right in constant time than to assume that it can multiply and divide in constant time. Show how to solve the lca problem in constant time with linear preprocessing under those assumptions.
10. In the proof of Theorem 8.8.1 we showed how to deduce I(w) from h(I(w)) in constant time. Can we use the same technique to deduce I(z) from h(I(z))? If so, why doesn't the method do that rather than involving nodes w, x̄, and ȳ?
11. The constant-time lca algorithm is somewhat difficult to understand, and the reader might wonder whether a simpler idea works. We know how to find the lca in constant time in a complete binary tree after O(n) preprocessing time. Now suppose we drop the assumption that the binary tree is complete. So T is now a binary tree, but not necessarily complete. Letting d again denote the depth of T, we can again compute d + 1 length path numbers that encode the paths to the nodes, and again these path numbers allow easy construction of the lowest common ancestor. Thus it might seem that even in incomplete binary trees, one can easily find the lca in this simple way without the need for the full lca algorithm. Either give the details for this or explain why it fails to find the lca in constant time.
The T-to-B mapping is not used in steps 3, 4, or 5 to obtain z from h(I(z)). However, it is used in step 1 to find node b from I(x) and I(y). But all we really need from b is h(b) (step 2), and that can be gotten from the right-most common 1-bit of I(x) and I(y). So the mapping from T to B is only conceptual, merely used for purposes of exposition.
In summary, after the preprocessing on T, when given nodes x and y, the algorithm finds i = h(b) (without first finding b) from the right-most common 1-bit in I(x) and I(y). Then it finds j = h(I(z)) from i and A_x and A_y, and from j it finds z = lca(x, y). Although the logic behind this method has been difficult to convey, a program for these operations is very easy to write.
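To illustrate, here is one way the retrieval might look in Python (our own reconstruction, not the book's program). It assumes the preprocessing has renamed nodes by dfs number and produced dicts I, A, head (mapping each mapped-to node of B to the head of the corresponding run), and parent; Python integers stand in for the O(log n)-bit words, so the shifts and masks are the constant-time operations assumed in the text. Our formula for i = h(b) combines the left-most differing bit of I(x) and I(y) with their own heights; any constant-time computation of h(b) could be substituted.

    def lsb(v):  # h(v): position of the right-most 1-bit, counting from 1
        return (v & -v).bit_length()

    def msb(v):  # position of the left-most 1-bit, counting from 1
        return v.bit_length()

    def lca(x, y, I, A, head, parent):
        i = max(msb(I[x] ^ I[y]), lsb(I[x]), lsb(I[y]))  # i = h(b), without building B
        common = (A[x] & A[y]) >> (i - 1) << (i - 1)     # common 1-bits at positions >= i
        j = lsb(common)                                  # j = h(I(z)), by Theorem 8.8.2
        xbar = closest_on_run(x, j, I, A, head, parent)
        ybar = closest_on_run(y, j, I, A, head, parent)
        return min(xbar, ybar)                           # z has the smaller dfs number

    def closest_on_run(x, j, I, A, head, parent):
        # x-bar: the closest ancestor of x (possibly x itself) on the run containing z
        if lsb(I[x]) == j:
            return x
        low = A[x] & ((1 << (j - 1)) - 1)                # 1-bits of A_x strictly below j
        k = msb(low)                                     # k = h(I(w)), w just off z's run
        Iw = (I[x] >> k << k) | (1 << (k - 1))           # bits of I(x) left of k, then 1, then 0s
        return parent[head[Iw]]                          # x-bar is the parent of w's run head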
Step 2 For an arbitrary internal node v in B, let B_v denote the subtree of B rooted at v, and let L_v = n1, n2, ..., nt be an ordered list containing the elements of L written at the leaves of B_v, in the same left-to-right order as they appear in B. Create two lists, Pmin(v) and Smin(v), for each internal node v. Each list will have size equal to the number of leaves in v's subtree. The kth entry of list Pmin(v) is the smallest number among {n1, n2, ..., nk}. That is, the kth entry of Pmin(v) is the smallest number in the prefix of list L_v ending at position k. Similarly, the kth entry of list Smin(v) is the smallest number in the suffix of L_v starting at position k.
This is the end of the preprocessing; the remaining parts of the exercise follow.
b. Prove that the total size of all the Pmin and Smin lists is O(m log m), and show how they can
be constructed in that time bound.
After the O(m log m) preprocessing, the smallest number in any interval I can be found in constant time. Here's how. Let interval I in L have endpoints l and r, and recall that these correspond to leaves of B. To find the smallest number in I, first find lca(l, r), say node v. Let v' and v'' be the left and right children of v in B, respectively. The smallest number in I can be found using one lookup in list Smin(v'), one lookup in Pmin(v''), and one additional comparison.
c. Give complete details for how the smallest number in I is found, and fully explain why only constant time is used.
13. By refining the method developed in Exercise 12, the O(m log m) preprocessing bound (time and space) can be reduced to O(m log log m) while still maintaining constant retrieval time for any lca query. (It takes a pretty big value of m before the difference between O(m) and O(m log log m) is appreciable!) The idea is to divide list L into m/log m blocks, each of size log m, and then separately preprocess each block as in Exercise 12. Also, compute the minimum number in each block, put these m/log m numbers in an ordered list Lmin, and preprocess Lmin as in Exercise 12.
a. Show that the above preprocessing takes O(m log log m) time and space.
Now we sketch how the retrieval is done in this faster method. Given an interval I with starting and ending positions l and r, one finds the smallest number in I as follows: If l and r are in the same block, then proceed as in Exercise 12. If they are in adjacent blocks, then find the minimum number from l to the end of l's block, find the minimum number from the start of r's block to r, and take the minimum of those two numbers. If l and r are in nonadjacent blocks, then do the above and also use Lmin to find the minimum number in all the blocks strictly between the block containing l and the block containing r. The smallest number in I is the minimum of those three numbers.
b. Give a detailed description of the retrieval method and justify that it takes only constant time.
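As a toy illustration of the block decomposition (our own sketch, with 0-based list positions; the min() calls over slices stand in for the constant-time lookups that the Exercise 12 structures would provide):

    import math

    def preprocess_blocks(L):
        # Split L into blocks of size about log2(m) and record each block's minimum.
        b = max(1, int(math.log2(len(L))))
        Lmin = [min(L[i:i + b]) for i in range(0, len(L), b)]
        return b, Lmin

    def interval_min(L, b, Lmin, l, r):
        # Minimum of L[l..r] inclusive, combining at most three pieces.
        bl, br = l // b, r // b
        if bl == br:
            return min(L[l:r + 1])                  # both endpoints in the same block
        best = min(min(L[l:(bl + 1) * b]),          # from l to the end of its block
                   min(L[br * b:r + 1]))            # from the start of r's block to r
        if br - bl > 1:
            best = min(best, min(Lmin[bl + 1:br]))  # blocks strictly between l's and r's
        return best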
12. A simpler (but slower) lca algorithm. In Section 8.4.1 we mentioned that if O(n log n) preprocessing time is allowed, and O(n log n) space can be allocated during both the preprocessing and the retrieval phases, then a (conceptually) simpler constant-time lca retrieval method is possible. In many applications, O(n log n) is an acceptable bound, which is not much worse than the O(n) bound we obtained in the text. Here we sketch the idea of the O(n log n) method. Your problem is to flesh out the details and prove correctness. First we reduce the general lca problem to a problem of finding the smallest number in an interval of a fixed list of numbers.
The reduction of lca to a list problem
Step 1 Execute a depth-first traversal of tree T to label the nodes in depth-first order and to build a multilist L of the nodes in the order that they are visited. (For any node v other than the root, the number of times v is in L equals the degree of v.) The only property of the depth-first numbering we need is that the number given to any node is smaller than the number given to any of its proper descendants. From this point on, we refer to a node only by its dfs number.
For example, the list for the tree in Figure 8.1 (page 182) is
Figure 9.1: The longest common extension for pair (i, j) has length eight. The matching substring is abcdefgh.
With the ability to solve lowest common ancestor queries in constant time, suffix trees
can be used to solve many additional string problems. Many of those applications move
from the domain of exact matching to the domain of inexact, or approximate, matching
(matching with some errors permitted). This chapter illustrates that point with several
examples.
Longest common extension problem Two strings S1 and S2 of total length n are first specified in a preprocessing phase. Later, a long sequence of index pairs is specified. For each specified index pair (i, j), we must find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j. That is, we must find the length of the longest prefix of suffix i of S1 that matches a prefix of suffix j of S2 (see Figure 9.1).
Of course, any time an index pair is specified, the longest common extension can
be found by direct search in time proportional to the length of the match. But the goal
is to compute each extension in constant time, independent of the length of the match.
Moreover, it would be cheating to allow more than linear time to preprocess S1 and S2.
To appreciate the power of suffix trees combined with constant-time lca queries, the
reader should again try first to devise a solution to the longest common extension problem
without those two tools.
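For reference, the direct search just mentioned is trivial to write (a sketch of ours, with 1-based positions as in the text); the point of this chapter is to replace its cost, proportional to the match length, with constant time after linear preprocessing:

    def lce_naive(S1, S2, i, j):
        # Length of the longest prefix of suffix i of S1 that matches
        # a prefix of suffix j of S2 (1-based positions).
        k = 0
        while i + k <= len(S1) and j + k <= len(S2) and S1[i + k - 1] == S2[j + k - 1]:
            k += 1
        return k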
although in the biological literature the distinction between separated and nonseparated
palindromes is sometimes blurred. The problem of finding all separated palindromes is
really one of finding all inverted repeats (see Section 7.12) and hence is more complex
than finding palindromes. However, if there is a fixed bound on the permitted distance
of the separation, then all the separated palindromes can again be found in linear time.
This is an immediate application of the longest common extension problem, the details of
which are left to the reader.
Another variant of the palindrome problem, called the k-mismatch palindrome problem,
will be considered below, after we discuss matching with a fixed number of mismatches.
End.
The space needed by this method is O(n + m), since it uses a suffix tree for the two strings. However, as detailed in Theorem 9.1.1, only a suffix tree for P plus the matching statistics for T are needed (although we must still store the original strings). Since m > n we have
Theorem 9.3.1. The exact matching problem with k wild cards distributed in the two strings can be solved in O(km) time and O(m) space.
1. In linear time, create the reverse string Sr from S and preprocess the two strings so that any longest common extension query can be solved in constant time.
2. For each q from 1 to n − 1, solve the longest common extension query for the index pair (q + 1, n − q + 1) in S and Sr, respectively. If the extension has nonzero length k, then there is a maximal palindrome of radius k centered at q.
The method takes O(n) time since the suffix tree can be built and preprocessed in that time, and each of the O(n) extension queries is solved in constant time.
In summary, we have
Theorem 9.2.1. All the maximal even-length palindromes in a string can be identified in
linear time.
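Here is the two-step method in Python, with lce_naive from earlier standing in for the constant-time queries (so this sketch runs in time proportional to n plus the total radius, rather than the O(n) of the theorem):

    def maximal_even_palindromes(S):
        n = len(S)
        Sr = S[::-1]   # the reverse string
        radii = []
        for q in range(1, n):
            k = lce_naive(S, Sr, q + 1, n - q + 1)
            if k > 0:
                radii.append((q, k))   # maximal palindrome S[q-k+1 .. q+k]
        return radii

    # maximal_even_palindromes("abba") reports radius 2 at center 2, i.e., abba itself.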
Note that the space required for this solution is just O(n + m), and that the method can be implemented using a suffix tree for the small string P alone.
We should note a different practical approach to the k-mismatch problem, based on suffix trees, that is in use in biological database search [320]. The idea is to generate every string P' that can be derived from P by changing up to k characters of P, and then to search for P' in a suffix tree for T. Using a suffix tree, the search for P' takes time just proportional to the length of P' (and can be implemented to be extremely fast), so this approach can be a winner when k and the size of the alphabet are relatively small.
Definition A tandem repeat is a string of the form ββ, where β is a substring.
Each tandem repeat is specified by a starting position of the repeat and the length of the substring β. This definition does not require that ββ be of maximal length. For example, in the string xababababy there are a total of six tandem repeats. Two of these begin at position two: abab and abababab. In the first case, β is ab, and in the second case, β is abab.
Using longest common extension queries, it is immediate that all tandem repeats can be found in O(n²) time: just guess a start position i and a middle position j for the tandem and do a longest common extension query from i and j. If the extension from i reaches j or beyond, then there is a tandem repeat of length 2(j − i) starting at position i. There are O(n²) choices for i and j, yielding the O(n²) time bound.
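In code, with lce_naive again standing in for the constant-time extension queries:

    def all_tandem_repeats(S):
        # Guess the start i of the first copy and the start j of the second copy.
        out = []
        n = len(S)
        for i in range(1, n):
            for j in range(i + 1, n + 1):
                if lce_naive(S, S, i, j) >= j - i:   # extension from i reaches j or beyond
                    out.append((i, 2 * (j - i)))     # tandem S[i .. i + 2(j-i) - 1]
        return out

    # all_tandem_repeats("xababababy") reports exactly the six tandem repeats noted above.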
Theorem 9.5.1. All the tandem repeats in S in which the two copies differ by at most k mismatches can be found in O(kn²) time. Typically, k is a fixed number, and the time bound is reported as O(n²).
k-mismatch check
Begin
1. Set j to 1 and i' to i, and count to 0.
2. Compute the length l of the longest common extension starting at positions j of P and i' of T.
3. If j + l = n + 1, then a k-mismatch occurrence of P begins in T at position i (in fact, only count mismatches occur); stop.
4. If count < k, then increment count by one, set j to j + l + 1 and i' to i' + l + 1, and go to step 2.
5. Otherwise (count = k), stop; P does not occur in T starting at position i with at most k mismatches.
End
Figure 9.2: Any position between A and B inclusive is a starting point of a tandem repeat of length 2l. As detailed in Step 4, if l1 and l2 are both at least one, then a subinterval of these starting points specifies tandem repeats whose first copy spans h.
3. Compute the longest common extension in the reverse direction from positions h − 1 and q − 1. Let l2 denote the length of that extension.
4. There is a tandem repeat of length 2l whose first copy spans position h if and only if l1 + l2 ≥ l and both l1 and l2 are at least one. Moreover, if there is such a tandem repeat of length 2l, then it can begin at any position from max(h − l2, h − l + 1) to min(h + l1 − l, h) inclusive. The second copy of the repeat begins l places to the right. Output each of these starting positions along with the length 2l. (See Figure 9.2.)
End
To solve an instance of subproblem 3 (finding all tandem repeats whose first copy spans position h), just run the above algorithm for each l from 1 to h.
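A sketch of that loop follows (our reconstruction: the forward extension of the steps lost from this copy is taken from positions h and q = h + l, as in the proof of Lemma 9.6.1 below, and naive extensions stand in for the constant-time queries):

    def lce_back_naive(S, a, b):
        # Longest common extension in the reverse direction, ending at positions a and b.
        k = 0
        while a - k >= 1 and b - k >= 1 and S[a - k - 1] == S[b - k - 1]:
            k += 1
        return k

    def tandems_spanning(S, h):
        out = []
        for l in range(1, len(S) - h + 1):
            q = h + l
            l1 = lce_naive(S, S, h, q)            # forward extension from h and q
            l2 = lce_back_naive(S, h - 1, q - 1)  # reverse extension from h-1 and q-1
            if l1 + l2 >= l and l1 >= 1 and l2 >= 1:
                for s in range(max(h - l2, h - l + 1), min(h + l1 - l, h) + 1):
                    out.append((s, 2 * l))        # second copy begins l places to the right
        return out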
Lemma 9.6.1. The above method correctly solves subproblem 3 for a fixed h. That is, it finds all tandem repeats whose first copy spans position h. Further, for fixed h, its running time is O(n/2) + z_h, where z_h is the number of such tandem repeats.
PROOF Assume first that there is a tandem repeat whose first copy spans position h, and it has some length, say 2l. That means that position q = h + l in the second copy corresponds to position h in the first copy. Hence some substring starting at h must match a substring starting at q, in order to provide the suffix of each copy. This substring can have length at most l1. Similarly, there must be a substring ending at h − 1 that matches a substring ending at q − 1, providing the prefix of each copy. That substring can have length at most l2. Since all characters between h and q are contained in one of the two copies, l1 + l2 must be at least l. Conversely, by essentially the same reasoning, if l1 + l2 ≥ l and both l1 and l2 are at least one, then one can specify a tandem repeat of length 2l whose first copy spans h. The necessary and sufficient condition for the existence of such a tandem is therefore proved. The proof that all starting positions fall in the stated range involves similar reasoning and is left to the reader.
For the time analysis, note first that for a fixed choice of h, the method takes constant time per choice of l to execute the common extension queries, and so it takes O(n/2) time for all those queries. For any fixed l, the method takes constant time per tandem that it reports, and it never reports the same tandem twice since it reports a different starting point for each repeat of length 2l. Since each repeat is reported as a starting point and a length, it follows that over all choices of l, the algorithm never reports any tandem repeat twice. Hence the time spent to report tandem repeats is proportional to z_h, the number of tandem repeats whose first copy spans position h.
1. Find all tandem repeats contained entirely in the first half of S (up to position h).
2. Find all tandem repeats contained entirely in the second half of S (after position h).
3. Find all tandem repeats where the first copy spans (contains) position h of S.
4. Find all tandem repeats where the second copy spans position h of S.
Clearly, no tandem repeat will be found in more than one of these four subproblems. The
first two subproblems are solved by recursively applying the Landau-Schmidt solution.
The second two problems are symmetric to each other, so we consider only the third
subproblem. An algorithm for that subproblem therefore determines the algorithm for
finding all tandem repeats.
We show below that all the correction factors for all internal nodes can be computed
in O(n) total time. That then gives an O(n)-time solution to the k-common substring
problem.
PROOF That all tandem repeats are found is immediate from the fact that every tandem is of a form considered by one of the subproblems 1 through 4. To show that no tandem repeat is reported twice, recall that for h = n/2, no tandem is of the form considered by more than one of the four subproblems. This holds recursively for subproblems 1 and 2. Further, in the proof of Lemma 9.6.1 we established that no execution of subproblem 3 (and also 4) reports the same tandem twice. Hence, over the entire execution of the four subproblems, no tandem repeat is reported twice. It also follows that the total time used to output the tandem repeats is O(z).
To finish the analysis, we consider the time taken by the extension queries. This time is proportional to the number of extension queries executed. Let T(n) denote the number of extension queries executed for a string of length n. Then T(n) = 2T(n/2) + 2n, and T(n) = O(n log n) as claimed.
Theorem 9.6.2. All k-mismatch tandem repeats in a string of length n can be found in O(kn log n + z) time.
The bound can be sharpened to O(kn log(n/k) + z) by the observation that any l ≤ k need not be tested in subproblems 3 and 4. We leave the details as an exercise.
We also leave it to the reader to adapt the solutions for the k-mismatch palindrome and tandem repeat problems to allow for string complementation and bounded-distance separation between copies.
5. For each identifier i, compute the lca of each consecutive pair of leaves in L_i, and increment h(w) by one each time that w is the computed lca.
6. With a bottom-up traversal of T, compute, for each node v, S(v) and U(v) = Σ_i [n_i(v) − 1] = Σ[h(w) : w is in the subtree of v].
7. Set C(v) = S(v) − U(v) for each node v.
8. Accumulate the table of l(k) values as detailed in Section 7.6.
End.
Theorem 9.7.1. Let S be a set of K strings of total length n, and let l(k) denote the length of the longest substring that appears in at least k distinct strings of S. A table of all l(k) values, for k from 2 to K, can be built in O(n) time.
That so much information about the substrings of S can be obtained in time proportional
to the time needed just to read the strings is very impressive. It would be a good challenge
to try to obtain this result without the use of suffix trees (or a similar data structure).
9.8. Exercises
1. Prove Theorem 9.1.1.
2. Fill in all the details and prove the correctness of the space-efficient method solving the
Figure 9.3: The boxed leaves have identifier i. The circled internal nodes are the lowest common ancestors of the four adjacent pairs of leaves from list L_i.
If x and y are any two leaves in L_i(v), then the lca of x and y is a node in the subtree of v. So if we compute the lca for each consecutive pair of leaves in L_i(v), then all of the n_i(v) − 1 computed lcas will be found in the subtree of v. Further, if x and y are not both in the subtree of v, then the lca of x and y will not be a node in v's subtree. This leads to the following lemma and method.
Lemma 9.7.2. If we compute the lca for each consecutive pair of leaves in L_i, then for any node v, exactly n_i(v) − 1 of the computed lcas will lie in the subtree of v.
Lemma 9.7.2 is illustrated in Figure 9.3.
Given the lemma, we can compute n_i(v) − 1 for each node v as follows: Compute the lca of each consecutive pair of leaves in L_i, and accumulate for each node w a count of the number of times that w is the computed lca. Let h(w) denote that count for node w. Then for any node v, n_i(v) − 1 is exactly Σ[h(w) : w is in the subtree of v]. A standard O(n)-time bottom-up traversal of T can therefore be used to find n_i(v) − 1 for each node v.
To find U(v), we don't want n_i(v) − 1 for a single identifier, but rather Σ_i [n_i(v) − 1]. However, the algorithm must not do a separate bottom-up traversal for each identifier, since then the time bound would be O(Kn). Instead, the algorithm should defer the bottom-up traversal until each list L_i has been processed, and it should let h(w) count the total number of times that w is the computed lca over all of the lists. Only then is a single bottom-up traversal of T done. At that point, U(v) = Σ_i [n_i(v) − 1] = Σ[h(w) : w is in the subtree of v].
We can now summarize the entire O ( n ) method for solving the k-common substring
problem.
6. Recall that a plasmid is a circular DNA molecule common in bacteria (and elsewhere).
Some bacterial plasmids contain relatively long complemented palindromes (whose function is somewhat in question). Give a linear-time algorithm to find all maximal complemented
palindromes in a circular string.
7. Show how to find all the k-mismatch palindromes in a string of length n in O(kn) time.
8. Tandem repeats. In the recursive method discussed in Section 9.6 (page 202) for finding the tandem repeats (no mismatches), problem 3 is solved with a linear number of constant-time common extension queries, exploiting suffix trees and lowest common ancestor computations. An earlier, equally efficient, solution to problem 3 was developed by Main and Lorenz [307], without using suffix trees.
The idea is that the problem can be solved in an amortized linear-time bound without suffix trees. In an instance of problem 3, h is held fixed while q = h + l − 1 varies over all appropriate values of l. Each forward common extension query is a problem of finding the length of the longest substring beginning at position q that matches a prefix of S[h..n]. All those lengths must be found in linear time. But that objective can be achieved by computing Z values (again, from Chapter 1) for the appropriate substring of S. Flesh out the details of this approach and prove the linear amortized time bound.
Now show how the backward common extensions can also be solved in linear time by computing Z values on an appropriately constructed substring of S. This substring is a bit less direct than the one used for forward extensions.
9. Complete the details for the O(kn log n + z)-time algorithm for the k-mismatch tandem repeat problem. Consider both correctness and time.
11. Try to modify the Main and Lorenz method for finding all the tandem repeats (without errors) to solve the k-mismatch tandem repeat problem in O(kn log n + z) time. If you are not successful, explain what the difficulties are and how the use of suffix trees and common ancestors solves these problems.
12. The tandem repeat method detailed in Section 9.6 finds all tandem repeats even if they are not maximal. For example, it finds six tandem repeats in the string xababababy, even though the left-most tandem repeat abab is contained in the longer tandem repeat abababab. Depending on the application, that output may not be desirable. Give a definition of maximality that would reduce the size of the output, and try to give efficient algorithms for the different definitions.
13. Consider the following situation: A long string S is given and remains fixed. Then a sequence of shorter strings S1, S2, ..., Sk is given. After each string Si is given (but before S_{i+1} is known), a number of longest common extension queries will be asked about Si and S. Let r denote the total number of queries and n denote the total length of all the short strings. How can these on-line queries be answered efficiently? The most direct approach is to build a generalized suffix tree for both S and Si when Si is presented, preprocess it (do a depth-first traversal assigning dfs numbers, setting I() values, etc.) for the constant-time lca algorithm, and then answer the queries for Si. But that would take O(k|S| + n + r) time. The k|S| term comes from two sources: the time to build the k generalized suffix trees and the time to preprocess each of them for lca queries.
Reduce that k|S| term from both sources to |S|, obtaining an overall bound of O(|S| + n + r). Reducing the time for building all the generalized suffix trees is easy. Reducing the time for the lca preprocessing takes a bit more thought.
PART III
At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact, approximate matching, and sequence alignment.
"Approximate" means that some errors, of various types detailed later, are acceptable in
valid matches. "Alignment" will be given a precise meaning later, but generally means
lining up characters of strings, allowing mismatches as well as matches, and allowing
characters of one string to be placed opposite spaces made in opposing strings.
We also shift from problems primarily concerning substrings to problems concerning
subsequences. A subsequence differs from a substring in that the characters in a substring
must be contiguous, whereas the characters in a subsequence embedded in a string need
not be.¹ For example, the string xyr is a subsequence, but not a substring, of axayar. The
shift from substrings to subsequences is a natural corollary of the shift from exact to
inexact matching. This shift of focus to inexact matching and subsequence comparison is
accompanied by a shift in technique. Most of the methods we will discuss in Part III, and many of the methods in Part IV, rely on the tool of dynamic programming, a tool that was not needed in Parts I and II.
¹ It is a common and confusing practice in the biological literature to refer to a substring as a subsequence. But
techniques and results for substring problems can be very different from techniques and results for the analogous
subsequence problems, so it is important to maintain a clear distinction. In this book we will never use the term
"subsequence" when "substring" is intended.
And fruit flies aren't special. The following is from a book review on DNA repair [424]:
Throughout the present work we see the insights gained through our ability to look for
sequence homologies by comparison of the DNA of different species. Studies on yeast are
remarkable predictors of the human system!
So "redundancy", and "similarity" are central phenomena in biology. But similarity has
its limits - humans and flies do differ in some respects. These differences make conserved
similarities even more significant, which in turn makes comparison and analogy very
powerful tools in biology. Lesk [297] writes:
It is characteristic of biological systems that objects that we observe to have a certain form
arose by evolution from related objects with similar but not identical form. They must,
therefore, be robust, in that they retain the freedom to tolerate some variation. We can take
advantage of this robustness in our analysis: By identifying and comparing related objects,
we can distinguish variable and conserved features, and thereby determine what is crucial to
structure and function.
The important "related objects" to compare include much more than sequence data,
because biological universality occurs at many levels of detail. However, it is usually easier
to acquire and examine sequences than it is to examine fine details of genetics or cellular
biochemistry or morphology. For example, there are vastly more protein sequences known
(deduced from underlying DNA sequences) than there are known three-dimensional protein structures. And it isn't just a matter of convenience that makes sequences important.
Rather, the biological sequences encode and reflect the more complex common molecular
structures and mechanisms that appear as features at the cellular or biochemical levels.
Moreover, "nowhere in the biological world is the Darwinian notion of 'descent with modification' more apparent than in the sequences of genes and gene products" [130]. Hence
a tractable, though partly heuristic, way to search for functional or structural universality
in biological systems is to search for similarity and conservation at the sequence level.
The power of this approach is made clear in the following quotes:
Today, the most powerful method for inferring the biological function of a gene (or the protein
that it encodes) is by sequence similarity searching on protein and DNA sequence databases.
With the development of rapid methods for sequence comparison, both with heuristic algorithms and powerful parallel computers, discoveries based solely on sequence homology
have become routine. [360]
Determining function for a sequence is a matter of tremendous complexity, requiring biological experiments of the highest order of creativity. Nevertheless, with only DNA sequence it
is possible to execute a computer-based algorithm comparing the sequence to a database of
previously characterized genes. In about 50% of the cases, such a mechanical comparison
will indicate a sufficient degree of similarity to suggest a putative enzymatic or structural
function that might be possessed by the unknown gene. [9 11
Sequence comparison, particularly when combined with the systematic collection, curation, and search of databases containing biomolecular sequences, has become essential in modern molecular biology. Commenting on the (then) near-completion of the effort to
sequence the entire yeast genome (now finished), Stephen Oliver says
In a short time it will be hard to realize how we managed without the sequence data. Biology
will never be the same again. [478]
One fact explains the importance of molecular sequence data and sequence comparison in biology.
The first fact of biological sequence analysis
The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
Evolution reuses, builds on, duplicates, and modifies "successful" structures (proteins,
exons, DNA regulatory sequences, morphological features, enzymatic pathways, etc.).
Life is based on a repertoire of structured and interrelated molecular building blocks that
are shared and passed around. The same and related molecular structures and mechanisms
show up repeatedly in the genome of a single species and across a very wide spectrum
of divergent species. "Duplication with modification" [127, 128, 129, 130] is the central
paradigm of protein evolution, wherein new proteins and/or new biological functions are
fashioned from earlier ones. Doolittle emphasizes this point as follows:
The vast majority of extant proteins are the result of a continuous series of genetic duplications
and subsequent modifications. As a result, redundancy is a built-in characteristic of protein
sequences, and we should not be surprised that so many new sequences resemble already
known sequences. [129]
He adds that
... all of biology is based on an enormous redundancy ... [130]
The following quotes reinforce this view and suggest the utility of the "enormous
redundancy" in the practice of molecular biology. The first quote is from Eric Wieschaus,
cowinner of the 1995 Nobel prize in medicine for work on the genetics of Drosophila
development. The quote is taken from an Associated Press article of October 9, 1995.
Describing the work done years earlier, Wieschaus says
We didn't know it at the time, but we found out everything in life is so similar, that the same
genes that work in flies are the ones that work in humans.
11.1. Introduction
In this chapter we consider the inexact matching and alignment problems that form the core of the field, along with others that illustrate the most general techniques.
Some of those problems and techniques will be further refined and extended in the next
chapters. We start with a detailed examination of the most classic inexact matching problem
solved by dynamic programming, the edit distance problem. The motivation for inexact
matching (and, more generally, sequence comparison) in molecular biology will be a
recurring theme explored throughout the rest of the book. We will discuss many specific
examples of how string comparison and inexact matching are used in current molecular
biology. However, to begin, we concentrate on the purely formal and technical aspects of
defining and computing inexact matching.
RIMDMDMMI
v intner
wri t ers
That is, v is replaced by w, r is inserted, i matches and is unchanged since it occurs in
both strings, n is deleted, t is unchanged, n is deleted, er match and are unchanged, and finally s is inserted. We now more formally define edit transcripts and string transformations.
Definition A string over the alphabet I, D,R, M that describes a transformation of one
string to another is called an edit transcript, or transcript for short, of the two strings.
In general, given the two input strings S1 and S2, and given an edit transcript for S1 and S2, the transformation is accomplished by successively applying the specified operation in the transcript to the next character(s) in the appropriate string(s). In particular, let next1 and
The final quote reflects the potential total impact on biology of the first fact and its exploitation in the form of sequence database searching. It is from an article [179] by Walter Gilbert, Nobel prize winner for the coinvention of a practical DNA sequencing method. Gilbert writes:
The new paradigm now emerging is that all the 'genes' will be known (in the sense of being
resident in databases available electronically), and that the starting point of biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture,
only then turning to experiment to follow or test that hypothesis.
Already, hundreds (if not thousands) of journal publications appear each year that report
biological research where sequence comparison and/or database search is an integral part
of the work. Many such examples that support and illustrate the first fact are distributed
throughout the book. In particular, several in-depth examples are concentrated in Chapters 14 and 15 where multiple string comparison and database search are discussed. But
before discussing those examples, we must first develop, in the next several chapters, the
techniques used for approximate matching and (sub)sequence comparison.
Caveat
The first fact of biological sequence analysis is extremely powerful, and its importance
will be further illustrated throughout the book. However, there is not a one-to-one correspondence between sequence and structure or sequence and function, because the converse
of the first fact is not true. That is, high sequence similarity usually implies significant
structural or functional similarity (the first fact), but structural or functional similarity
does not necessarily imply sequence similarity. On the topic of protein structure, F. Cohen [106] writes "... similar sequences yield similar structures, but quite distinct sequences
can produce remarkably similar structures". This converse issue is discussed in greater
depth in Chapter 14, which focuses on multiple sequence comparison.
217
Another example of an alignment is shown on page 215 where vintner and writers are
aligned with each other below their edit transcript. That example also suggests a duality
between alignment and edit transcript that will be developed below.
That is, D(i, j) denotes the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2. Using this notation, if S1 has n letters and S2 has m letters, then the edit distance of S1 and S2 is precisely the value D(n, m).
We will compute D(n, m) by solving the more general problem of computing D(i, j) for all combinations of i and j, where i ranges from zero to n and j ranges from zero to m. This is the standard dynamic programming approach used in a vast number of computational problems. The dynamic programming approach has three essential components: the recurrence relation, the tabular computation, and the traceback. We will explain each one in turn.
next2 be pointers into S1 and S2. Both pointers begin with value one. The edit transcript is read and applied left to right. When symbol "I" is encountered, character next2 is inserted before character next1 in S1, and pointer next2 is incremented one character. When "D" is encountered, character next1 is deleted from S1 and next1 is incremented by one character. When either symbol "R" or "M" is encountered, character next1 in S1 is replaced or matched by character next2 from S2, and then both pointers are incremented by one.
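These pointer rules are easy to express in code. A small Python sketch of ours, returning the transformed string:

    def apply_transcript(transcript, S1, S2):
        # Transform S1 into S2 by applying the transcript symbols left to right.
        out = []
        n1 = n2 = 0                    # 0-based counterparts of the pointers next1 and next2
        for op in transcript:
            if op == 'I':
                out.append(S2[n2]); n2 += 1              # insert character next2 of S2
            elif op == 'D':
                n1 += 1                                  # delete character next1 of S1
            else:                                        # 'R' or 'M': replace or match
                out.append(S2[n2]); n1 += 1; n2 += 1
        return ''.join(out)

    # apply_transcript("RIMDMDMMI", "vintner", "writers") returns "writers".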
Definition The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
Edit distance is sometimes referred to as Levenshtein distance in recognition of the
paper [299] by V. Levenshtein where edit distance was probably first discussed.
We will sometimes refer to an edit transcript that uses the minimum number of edit
operations as an optimal transcript. Note that there may be more than one optimal transcript. These will be called "cooptimal" transcripts when we want to emphasize the fact
that there is more than one optimal.
The edit distance problem is to compute the edit distance between two given strings,
along with an optimal edit transcript that describes the transformation.
The definition of edit distance implies that all operations are done to one string only.
But edit distance is sometimes thought of as the minimum number of operations done on
either of the two strings to transform both of them into a common third string. This view
is equivalent to the above definition, since an insertion in one string can be viewed as a
deletion in the other and vice versa.
q a c - d b d
q a w x - b -
In this alignment, character c is mismatched with w, both the d's and the x are opposite spaces, and all other characters match their counterparts in the opposite string.
Since the last transcript symbol must either be I, D, R, or M, we have covered all cases and established the lemma.
Now we look at the other side.
Lemma 11.3.2. D(i, j) ≤ min[D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + t(i, j)].
PROOF The reasoning is very similar to that used in the previous lemma, but it achieves a somewhat different goal. The objective here is to demonstrate constructively the existence of transformations achieving each of the three values specified in the inequality. Then since all three values are feasible, their minimum is certainly feasible.
First, it is possible to transform S1[1..i] into S2[1..j] with exactly D(i, j − 1) + 1 edit operations. Simply transform S1[1..i] to S2[1..j − 1] with the minimum number of edit operations, and then use one more to insert character S2(j) at the end. By definition, the number of edit operations in that particular way to transform S1 to S2 is exactly D(i, j − 1) + 1. Second, it is possible to transform S1[1..i] to S2[1..j] with exactly D(i − 1, j) + 1 edit operations. Transform S1[1..i − 1] to S2[1..j] with the fewest operations, and then delete character S1(i). The number of edit operations in that particular transformation is exactly D(i − 1, j) + 1. Third, it is possible to do the transformation with exactly D(i − 1, j − 1) + t(i, j) edit operations, using the same kind of argument.
Lemmas 11.3.1 and 11.3.2 immediately imply the correctness of the general recurrence relation for D(i, j).
Theorem 11.3.1. When both i and j are strictly positive, D(i, j) = min[D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + t(i, j)].
PROOF Lemma 11.3.1 says that D(i, j) must be equal to one of the three values D(i − 1, j) + 1, D(i, j − 1) + 1, or D(i − 1, j − 1) + t(i, j). Lemma 11.3.2 says that D(i, j) must be less than or equal to the smallest of those three values. It follows that D(i, j) must therefore be equal to the smallest of those three values, and we have proven the theorem.
This completes the first component of the dynamic programming method for edit distance, the recurrence relation.
D(i, 0) = i
and
D(0, j) = j.
The base condition D(i, 0) = i is clearly correct (that is, it gives the number required by the definition of D(i, 0)) because the only way to transform the first i characters of S1 to zero characters of S2 is to delete all the i characters of S1. Similarly, the condition D(0, j) = j is correct because j characters must be inserted to convert zero characters of S1 to j characters of S2.
The recurrence relation for D(i, j) when both i and j are strictly positive is
D(i, j) = min[D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + t(i, j)],
where t(i, j) is defined to have value 1 if S1(i) ≠ S2(j), and t(i, j) has value 0 if S1(i) = S2(j).
Lemma 11.3.1. The value of D(i, j) must be D(i, j − 1) + 1, D(i − 1, j) + 1, or D(i − 1, j − 1) + t(i, j). There are no other possibilities.
PROOF Consider an edit transcript for the transformation of S1[1..i] to S2[1..j] using the minimum number of edit operations, and focus on the last symbol in that transcript. That last symbol must either be I, D, R, or M. If the last symbol is an I then the last edit operation is the insertion of character S2(j) onto the end of the (transformed) first string. It follows that the symbols in the transcript before that I must specify the minimum number of edit operations to transform S1[1..i] to S2[1..j − 1] (if they didn't, then the specified transformation of S1[1..i] to S2[1..j] would use more than the minimum number of operations). By definition, that latter transformation takes D(i, j − 1) edit operations. Hence if the last symbol in the transcript is I, then D(i, j) = D(i, j − 1) + 1.
Similarly, if the last symbol in the transcript is a D, then the last edit operation is the deletion of S1(i), and the symbols in the transcript to the left of that D must specify the minimum number of edit operations to transform S1[1..i − 1] to S2[1..j]. By definition, that latter transformation takes D(i − 1, j) edit operations. So if the last symbol in the transcript is D, then D(i, j) = D(i − 1, j) + 1.
If the last symbol in the transcript is an R, then the last edit operation replaces S1(i) with S2(j), and the symbols to the left of R specify the minimum number of edit operations to transform S1[1..i − 1] to S2[1..j − 1]. In that case D(i, j) = D(i − 1, j − 1) + 1. Finally, and by similar reasoning, if the last symbol in the transcript is an M, then S1(i) = S2(j) and D(i, j) = D(i − 1, j − 1). Using the variable t(i, j) introduced earlier [i.e., that t(i, j) = 0 if S1(i) = S2(j); otherwise t(i, j) = 1] we can combine these last two cases as one: If the last transcript symbol is R or M, then D(i, j) = D(i − 1, j − 1) + t(i, j).
Figure 11.2: Edit distances are filled in one row at a time, and in each row they are filled in from left to right. The example shows the edit distances D(i, j) through column 3 of row 4. The next value to be computed is D(4, 4), where an asterisk appears. The value for cell (4, 4) is 3, since S1(4) = S2(4) = t and D(3, 3) = 3.
The reader should be able to establish that the table could also be filled in columnwise instead of rowwise, after row zero and column zero have been computed. That is, column one could be filled in first, followed by column two, etc. Similarly, it is possible to fill in the table by filling in successive anti-diagonals. We leave the details as an exercise.
Figure 11.1: Table to be used to compute the edit distance between vintner and writers. The values in row zero and column zero are already included. They are given directly by the base conditions.
Bottom-up computation
In the bottom-up approach, we first compute D(i, j) for the smallest possible values of i and j, and then compute values of D(i, j) for increasing values of i and j. Typically, this bottom-up computation is organized with a dynamic programming table of size (n + 1) × (m + 1). The table holds the values of D(i, j) for all the choices of i and j (see Figure 11.1). Note that string S1 corresponds to the vertical axis of the table, while string S2 corresponds to the horizontal axis. Because the ranges of i and j begin at zero, the table has a zero row and a zero column. The values in row zero and column zero are filled in directly from the base conditions for D(i, j). After that, the remaining n × m subtable is filled in one row at a time, in order of increasing i. Within each row, the cells are filled in order of increasing j.
To see how to fill in the subtable, note that by the general recurrence relation for D(i, j), all the values needed for the computation of D(1, 1) are known once D(0, 0), D(1, 0), and D(0, 1) have been computed. Hence D(1, 1) can be computed after the zero row and zero column have been filled in. Then, again by the recurrence relations, after D(1, 1) has been computed, all the values needed for the computation of D(1, 2) are known. Following this idea, we see that the values for row one can be computed in order of increasing index j. After that, all the values needed to compute the values in row two are known, and that row can be filled in, in order of increasing j. By extension, the entire table can be filled in one row at a time, in order of increasing i, and in each row the values can be computed in order of increasing j (see Figure 11.2).
Time analysis
How much work is done by this approach? When computing the value for a specific cell (i, j), only cells (i − 1, j − 1), (i, j − 1), and (i − 1, j) are examined, along with the two characters S1(i) and S2(j). Hence, to fill in one cell takes a constant number of cell examinations, arithmetic operations, and comparisons. There are O(nm) cells in the table, so we obtain the following theorem.
Theorem 11.3.2. The dynamic programming table for computing the edit distance between a string of length n and a string of length m can be filled in with O(nm) work. Hence, using dynamic programming, the edit distance D(n, m) can be computed in O(nm) time.
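The tabular computation is short to implement. A Python sketch of ours, following the base conditions and the recurrence of Theorem 11.3.1:

    def edit_distance(S1, S2):
        n, m = len(S1), len(S2)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i                              # base condition D(i, 0) = i
        for j in range(m + 1):
            D[0][j] = j                              # base condition D(0, j) = j
        for i in range(1, n + 1):                    # fill row by row, left to right
            for j in range(1, m + 1):
                t = 0 if S1[i - 1] == S2[j - 1] else 1   # t(i, j)
                D[i][j] = min(D[i - 1][j] + 1,           # delete S1(i)
                              D[i][j - 1] + 1,           # insert S2(j)
                              D[i - 1][j - 1] + t)       # replace or match
        return D

    # edit_distance("vintner", "writers")[7][7] == 5.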
Figure 11.4: Edit graph for the strings CAN and ANN. The weight on each edge is one, except for the three
zero-weight edges marked in the figure.
operations. Conversely, any optimal edit transcript is specified by such a path. Moreover,
since a path describes only one transcript, the correspondence between paths and optimal
transcripts is one-to-one.
The theorem can be proven by essentially the same reasoning that established the correctness of the recurrence relations for D(i, j), and this is left to the reader. An alternative
way to find the optimal edit transcript(s), without using pointers, is discussed in Exercise 9. Once the pointers have been established, all the cooptimal edit transcripts can be
enumerated in O(n + m) time per transcript. That is the focus of Exercise 12.
In the case of the edit distance problem, the edit graph contains a directed edge from
each node (i, j) to each of the nodes (i, j + 1), (i + 1, j), and (i + 1, j + 1), provided
those nodes exist. The weight on the first two of these edges is one; the weight on the
third (diagonal) edge is t(i + 1, j + 1). Figure 11.4 shows the edit graph for strings CAN
and ANN.
The central property of an edit graph is that any shortest path (one whose total weight is
minimum) from start node (0, 0) to destination node (n, m) specifies an edit transcript with
the minimum number of edit operations. Equivalently, any shortest path specifies a global
alignment of minimum total weight. Moreover, the following theorem and corollary can
be stated.
Theorem 11.4.1. An edit transcript for S1, S2 has the minimum number of edit operations
if and only if it corresponds to a shortest path from (0, 0) to (n, m) in the edit graph.
Corollary 11.4.1. The set of all shortest paths from (0, 0) to (n, m) in the edit graph
exactly specifies the set of all optimal edit transcripts of S1 to S2. Equivalently, it specifies
all the optimal (minimum weight) alignments of S1 and S2.
Viewing dynamic programming as a shortest path problem is often useful because there
Figure 11.3: The complete dynamic programming table with pointers included. The arrow ← in cell (i, j)
points to cell (i, j - 1), the arrow ↑ points to cell (i - 1, j), and the arrow ↖ points to cell (i - 1, j - 1).
If there is more than one pointer from cell (n, m), then a path from (n, m) to (0, 0) can
start with either of those pointers. Each of them is on a path from (n, m) to (0, 0). This
property is repeated from any cell encountered. Hence a traceback path from (n, m) to
(0, 0) can start simply by following any pointer out of (n, m); it can then be extended by
following any pointer out of any cell encountered. Moreover, every cell except (0, 0) has
a pointer out of it, so no path from (n, m) can get stuck. Since any path of pointers from
(n, m) to (0, 0) specifies an optimal edit transcript or alignment, we have the following:
Theorem 11.3.3. Once the dynamic programming table with pointers has been computed,
an optimal edit transcript can be found in O(n + m) time.
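As an illustration of this traceback (a sketch of ours, not the book's code), the procedure below re-derives the pointers from the stored values rather than keeping explicit pointer arrays, in the spirit of Exercise 9, and recovers one optimal transcript in O(n + m) time; D is a table produced by the edit-distance sketch given earlier:

    def traceback(S1, S2, D):
        # Walk from cell (n, m) back to (0, 0); at each cell take any move
        # consistent with the recurrence, i.e., follow any valid pointer.
        i, j = len(S1), len(S2)
        ops = []
        while i > 0 or j > 0:
            t = 0 if (i > 0 and j > 0 and S1[i - 1] == S2[j - 1]) else 1
            if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
                ops.append('M' if t == 0 else 'R')   # diagonal pointer
                i, j = i - 1, j - 1
            elif i > 0 and D[i][j] == D[i - 1][j] + 1:
                ops.append('D')                      # vertical pointer
                i -= 1
            else:
                ops.append('I')                      # horizontal pointer
                j -= 1
        return ''.join(reversed(ops))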
We have now completely described the three crucial components of the general dynamic
programming paradigm, as illustrated by the edit distance problem. We will later consider
ways to increase the speed of the solution and decrease its needed space.
Theorem 11.3.4. Any path from (n, m) to (0, 0) following pointers established during
the computation of D(i, j) specifies an edit transcript with the minimum number of edit
operations.
The operation-weight edit distance problem can also be represented and solved as a
shortest path problem on a weighted edit graph, where the edge weights correspond in the
natural way to the weights of the edit operations. The details are straightforward and are
thus left to the reader.
In a pure computer science or mathematical discussion of alphabet-weight edit distance, we would prefer to use the
general term "weight matrix" for the matrix holding the alphabet-dependent substitution scores. However, molecular
biologists use the terms "amino acid substitution matrix" or "nucleotide substitution matrix" for those matrices, and
they use the term "weight matrix" for a very different object (see Section 14.3.1). Therefore, to maintain generality,
and yet to keep in some harmony with the molecular biology literature, we will use the general term "scoring matrix".
are many tools for investigating and compactly representing shortest paths in graphs. This
view will be exploited in Section 13.2 when suboptimal solutions are discussed.
Definition With arbitrary operation weights, the operation-weight edit distance problem is to find an edit transcript that transforms string S1 into S2 with the minimum total
operation weight.
In these terms, the edit distance problem we have considered so far is just the problem
of finding the minimum operation-weight edit transcript when d = 1, r = 1, and e = 0.
But, for example, if each mismatch has a weight of 2, each space has a weight of 4, and
each match a weight of 1, then the alignment
[An example alignment of writers and vintner appeared here; its column layout was not recoverable from the scan.]
The terms "weight" or "cost" are heavily used in the computer science literature, while the term "score" is used
in the biological literature. We will use these terms more or less interchangeably in discussing algorithms, but the
term "score" will be used when talking about specific biological applications,
Definition V(i, j) is defined as the value of the optimal alignment of prefixes S1[1..i]
and S2[1..j].
Recall that a dash ("-") is used to represent a space inserted into a string. The base
conditions are
V(i, 0) = Σ_{1≤k≤i} s(S1(k), -)
and
V(0, j) = Σ_{1≤k≤j} s(-, S2(k)).
For i and j both strictly positive, the general recurrence is
V(i, j) = max[V(i - 1, j - 1) + s(S1(i), S2(j)), V(i - 1, j) + s(S1(i), -), V(i, j - 1) + s(-, S2(j))].
The correctness of this recurrence is established by arguments similar to those used for
edit distance. In particular, in any alignment A, there are three possibilities: characters
S1(i) and S2(j) are in the same position (opposite each other), S1(i) is in a position after
S2(j), or S1(i) is in a position before S2(j). The correctness of the recurrence is based on
that case analysis. Details are left to the reader.
If S1 and S2 are of length n and m, respectively, then the value of their optimal alignment
is given by V(n, m). That value, and the entire dynamic programming table, can be obtained
in O(nm) time, since only three comparisons and arithmetic operations are needed per cell.
By leaving pointers while filling in the table, as was done with edit distance, an optimal
alignment can be constructed by following any path of pointers from cell (n, m) to cell
(0, 0). So the optimal (global) alignment problem can be solved in O(nm) time, the same
time as for edit distance.
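The preceding recurrence translates directly into code. In the sketch below (ours; Python), the scoring matrix is represented by a function s(x, y) on characters of the extended alphabet, with '-' standing for a space:

    def similarity(S1, S2, s):
        # V(i, j): value of the optimal alignment of S1[1..i] and S2[1..j].
        n, m = len(S1), len(S2)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):       # base conditions
            V[i][0] = V[i - 1][0] + s(S1[i - 1], '-')
        for j in range(1, m + 1):
            V[0][j] = V[0][j - 1] + s('-', S2[j - 1])
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + s(S1[i - 1], S2[j - 1]),
                              V[i - 1][j] + s(S1[i - 1], '-'),
                              V[i][j - 1] + s('-', S2[j - 1]))
        return V

The optimal (global) alignment value is then V[n][m], and pointers or a value-based traceback recover an optimal alignment, exactly as for edit distance.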
The distinction between subsequence and substring is often lost in the biological literature. But algorithms for
substrings are usually quite different in spirit and efficiency than algorithms for subsequences, so the distinction is
an important one.
similarity, the language of alignment is usually more convenient than the language of edit
transcript. We now begin to develop a precise definition of similarity.
Definition Let Σ be the alphabet used for strings S1 and S2, and let Σ' be Σ with the
added character "-" denoting a space. Then, for any two characters x, y in Σ', s(x, y)
denotes the value (or score) obtained by aligning character x against character y.
Definition For a given alignment A of S1 and S2, let S1' and S2' denote the strings after
the chosen insertion of spaces, and let l denote the (equal) length of the two strings S1'
and S2' in A. The value of alignment A is defined as
Σ_{i=1}^{l} s(S1'(i), S2'(i)).
That is, every position i in A specifies a pair of opposing characters in the alphabet Σ',
and the value of A is obtained by summing the value contributed by each pair.
For example, let Σ = {a, b, c, d} and let the pairwise scores be defined in the following
matrix: [the example scoring matrix was not recoverable from the scan]
Definition Given a pairwise scoring matrix over the alphabet Σ', the similarity of two
strings S1 and S2 is defined as the value of the alignment A of S1 and S2 that maximizes
total alignment value. This is also called the optimal alignment value of S1 and S2.
String similarity is clearly related to alphabet-weight edit distance, and depending on
the specific scoring matrix involved, one can often transform one problem into the other.
An important difference between similarity and weighted edit distance will become clear
in Section 11.7, after we discuss local alignment.
One example where end-spaces should be free is in shotgun sequence assembly (see
Sections 16.14 and 16.15). In this problem, one has a large set of partially overlapping
substrings that come from many copies of one original but unknown string; the problem is
to use comparisons of pairs of substrings to infer the correct original string. Two random
substrings from the set are unlikely to be neighbors in the original string, and this is reflected
by a low end-space free alignment score for those two substrings. But if two substrings do
overlap in the original string, then a "good-sized" suffix of one should align to a "good-sized" prefix of the other with only a small number of spaces and mismatches (reflecting
a small percentage of sequencing errors). This overlap is detected by an end-space free
weighted alignment with high score. Similarly, the case when one substring contains
another can be detected in this way. The procedure for deducing candidate neighbor pairs
is thus to compute the end-space free alignment between every pair of substrings; those
pairs with high scores are then the best candidates. We will return to shotgun sequencing
and extend this discussion in Part IV, Section 16.14.
To implement free end spaces in computing similarity, use the recurrences for global
alignment (where all spaces count) detailed on page 227, but change the base conditions
to V(i, 0) = V(0, j) = 0, for every i and j. That takes care of any spaces on the left
end of the alignment. Then fill in the table as in the case of global alignment. However,
unlike global alignment, the value of the optimal alignment is not necessarily found in cell
(n, m). Rather, the value of the optimal alignment with free ends is the maximum value
over all cells in row n or column m. Cells in row n correspond to alignments where the last
character of string S1 contributes to the value of the alignment, but characters of S2 to its
right do not. Those characters are opposite end spaces, which are free. Cells in column m
have a similar characterization. Clearly, optimal alignment with free end spaces is solved
in O(nm) time, the same time as for global alignment.
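The two changes just described are small. A sketch (ours, reusing the scoring-function convention from the global alignment sketch given earlier):

    def end_space_free_value(S1, S2, s):
        # Base conditions V(i, 0) = V(0, j) = 0 make left end spaces free;
        # the maximum over row n and column m makes right end spaces free.
        n, m = len(S1), len(S2)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + s(S1[i - 1], S2[j - 1]),
                              V[i - 1][j] + s(S1[i - 1], '-'),
                              V[i][j - 1] + s('-', S2[j - 1]))
        return max(max(V[n]), max(row[m] for row in V))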
The lcs problem is important in its own right, and we will discuss some of its uses and
some ideas for improving its computation in Section 12.5. For now we show that it can
be modeled and solved as an optimal alignment problem.
Theorem 11.6.1. With a scoring scheme that scores a one for each match and a zero for
each mismatch or space, the matched characters in an alignment of maximum value form
a longest common subsequence.
The proof is immediate and is left to the reader. It follows that the longest common
subsequence of strings of lengths n and m, respectively, can be computed in O(nm) time.
At this point we see the first of many differences between substring and subsequence
problems and why it is important to clearly distinguish between them. In Section 7.4 we
established that the longest common substring could be found in O(n + m) time, whereas
here the bound established for finding the longest common subsequence is O(n x m) (although this bound can be reduced somewhat). This is typical - substring and subsequence
problems are generally solved by different methods and have different time and space
complexities.
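To make Theorem 11.6.1 concrete, the following fragment (ours) obtains the lcs length from the global alignment sketch given earlier, using the stated scoring scheme:

    def lcs_length(S1, S2):
        # Match scores one; mismatches and spaces score zero, so the optimal
        # alignment value equals the number of matched characters, i.e., the
        # length of a longest common subsequence.
        score = lambda x, y: 1 if x == y and x != '-' else 0
        V = similarity(S1, S2, score)
        return V[len(S1)][len(S2)]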
the two spaces at the left end of the alignment are free, as is the single space at the right
end.
Making end spaces free in the objective function encourages one string to align in the
interior of the other, or the suffix of one string to align with a prefix of the other. This is
desirable when one believes that those kinds of alignments reflect the "true" relationship of
the two strings. Without a mechanism to encourage such alignments, the optimal alignment
might have quite a different shape and not capture the desired relationship.
strings may be related. When comparing protein sequences, local alignment is also critical
because proteins from very different families are often made up of the same structural or
functional subunits (motifs or domains), and local alignment is appropriate in searching
for these (unknown) subunits. Similarly, different proteins are often made from related
motifs that form the inner core of the protein, but the motifs are separated by outside
surface looping regions that can be quite different in different proteins.
A very interesting example of conserved domains comes from the proteins encoded by
homeobox genes. Homeobox genes [319, 381] show up in a wide variety of species, from
fruit flies to frogs to humans. These genes regulate high-level embryonic development,
and a single mutation in these genes can transform one body part into another (one of the
original mutation experiments causes fruit fly antennae to develop as legs, but it doesn't
seem to bother the fly very much). The protein sequences that these genes encode are
very different in each species, except in one region called the homeodomain. The homeodomain consists of about sixty amino acids that form the part of the regulatory protein
that binds to DNA. Oddly, homeodomains made by certain insect and mammalian genes
are particularly similar, showing about 50 to 95% identity in alignments without spaces.
Protein-to-DNA binding is central in how those proteins regulate embryo development
and cell differentiation. So the amino acid sequence in the most biologically critical part of
those proteins is highly conserved, whereas the other parts of the protein sequences show
very little similarity. In cases such as these, local alignment is certainly a more appropriate
way to compare protein sequences than is global alignment.
Local alignment in protein is additionally important because particular isolated characters of related proteins may be more highly conserved than the rest of the protein (for
example, the amino acids at the active site of an enzyme or the amino acids in the hydrophobic core of a globular protein are the most highly conserved). Local alignment will
more likely detect these conserved characters than will global alignment. A good example
is the family of serine proteases, where a few isolated, conserved amino acids characterize
the family. Another example comes from the Helix-Turn-Helix motif, which occurs frequently in proteins that regulate DNA transcription by binding to DNA. The tenth position
of the Helix-Turn-Helix motif is very frequently occupied by the amino acid glycine, but
the rest of the motif is more variable.
The following quote from C. Chothia [101] further emphasizes the biological importance of protein domains and hence of local string comparison.
Extant proteins have been produced from the original set not just by point mutations, insertions
and deletions but also by combinations of genes to give chimeric proteins. This is particularly
true of the very large proteins produced in the recent stages of evolution. Many of these are
built of different combinations of protein domains that have been selected from a relatively
small repertoire.
Doolittle [129] summarizes the point: "The underlying message is that one must be
alert to regions of similarity even when they occur embedded in an overall background of
dissimilarity."
Thus, the dominant viewpoint today is that local alignment is the most appropriate
type of alignment for comparing proteins from different protein families. However, it has
also been pointed out [359,360] that one often sees extensive global similarity in pairs of
protein strings that are first recognized as being related by strong local similarity. There
are also suggestions [316] that in some situations global alignment is more effective than
local alignment in exposing important biological commonalities.
cell in row zero is reached, breaking ties by choosing a vertical pointer over a diagonal
one and a diagonal one over a horizontal one.
Local alignment problem Given two strings S1 and S2, find substrings α and
β of S1 and S2, respectively, whose similarity (optimal global alignment value) is
maximum over all pairs of substrings from S1 and S2. We use v* to denote the value
of an optimal solution to the local alignment problem.
For example, consider the strings S1 = pqraxabcsrvq and S2 = xyaxbacsll. If we
give each match a value of 2, each mismatch a value of -2, and each space a value of -1,
then the two substrings α = axabcs and β = axbacs of S1 and S2, respectively, have the
following optimal (global) alignment

a x a b - c s
a x - b a c s
which has a value of 8. Furthermore, over all choices of pairs of substrings, one from each
of the two strings, those two substrings have maximum similarity (for the chosen scoring
scheme). Hence, for that scoring scheme, the optimal local alignment of S1 and S2 has
value 8 and is defined by substrings axabcs and axbacs.
It should be clear why local alignment is defined in terms of similarity, which maximizes
an objective function, rather than in terms of edit distance, which minimizes an objective. When one seeks a pair of substrings to minimize distance, the optimal pairs would
be exactly matching substrings under most natural scoring schemes. But the matching
substrings might be just a single character long and would not identify a region of high
similarity. A formulation such as local alignment, where matches contribute positively
and mismatches and spaces contribute negatively, is more likely to find more meaningful
regions of high similarity.
Theorem 11.7.2. If i', j' is an index pair maximizing v(i, j) over all i, j pairs, then a
pair of substrings solving the local suffix alignment problem for i', j' also solves the local
alignment problem.
Thus a solution to the local suffix alignment problem solves the local alignment problem.
We now turn our attention to the problem of finding max[v(i, j) : i ≤ n, j ≤ m] and a
pair of strings whose alignment has maximum value.
Theorem 11.7.3. For i > 0 and j > 0, the proper recurrence for v(i, j) is
v(i, j) = max[0, v(i - 1, j - 1) + s(S1(i), S2(j)), v(i - 1, j) + s(S1(i), -), v(i, j - 1) + s(-, S2(j))].
The recurrences for local suffix alignment are almost identical to those for global
alignment. The only difference is the inclusion of zero in the case of local suffix alignment.
This makes intuitive sense. In both global alignment and local suffix alignment of prefixes
S1[1..i] and S2[1..j] the end characters of any alignment are specified, but in the case of
local suffix alignment, any number of initial characters can be ignored. The zero in the
recurrence implements this, acting to "restart" the recurrence.
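Written out, the local suffix alignment table differs from the global one only in the extra zero and in where the answer is read off. A sketch (ours; this is the method usually credited to Smith and Waterman):

    def local_alignment_value(S1, S2, s):
        # v(i, 0) = v(0, j) = 0; the zero term lets an alignment restart anywhere.
        n, m = len(S1), len(S2)
        v = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                v[i][j] = max(0,
                              v[i - 1][j - 1] + s(S1[i - 1], S2[j - 1]),
                              v[i - 1][j] + s(S1[i - 1], '-'),
                              v[i][j - 1] + s('-', S2[j - 1]))
                best = max(best, v[i][j])
        return best, v       # best is v*; the table supports traceback

With match 2, mismatch -2, and space -1 as in the example above, the first returned value for S1 = pqraxabcsrvq and S2 = xyaxbacsll should be 8.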
Given Theorem 11.7.2, the method to compute v* is to compute the dynamic programming table for v(i, j) and then find the largest value in any cell in the table, say in cell
(i*, j*). As usual, pointers are created while filling in the values of the table. After cell
For example, suppose the objective function counts 2 for each match and -1 for each
mismatch or space. If S1 = abcxdex and S2 = xxxcde, then v(3, 4) = 2 (the two cs
match), v(4, 5) = 1 (cx aligns with cd), v(5, 5) = 3 (x-d aligns with xcd), and v(6, 6) = 5
(x-de aligns with xcde).
Since the definition allows either or both of the suffixes to be empty, v(i, j) is always
greater than or equal to zero.
The following theorem shows the relationship between the local alignment problem
and the local suffix alignment problem. Recall that v* is the value of the optimal local
alignment for two strings of length n and m.
Theorem 11.7.1 only specifies the value v*, but its proof makes clear how to find
substrings whose alignment has that value. In particular,
Figure 11.5: An alignment with seven spaces distributed into four gaps.
with similarity (global alignment value) of v(i, j). Thus, an easy way to look for a set
of highly similar substrings is to find a set of cells in the table with a value above some
set threshold. Not all similar substrings will be identified in this way, but this approach is
common in practice.
11.8. Gaps
11.8.1. Introduction to Gaps
Until now the central constructs used to measure the value of an alignment (and to define
similarity) have been matches, mismatches, and spaces. Now we introduce another important construct, gaps. Gaps help create alignments that better conform to underlying biological models and more closely fit patterns that one expects to find in meaningful alignments.
Definition A gap is any maximal, consecutive run of spaces in a single string of a given
alignment.
A gap may begin before the start of S, in which case it is bordered on the right by the
first character of S, or it may begin after the end of S, in which case it is bordered on
the left by the last character of S. Otherwise, a gap must be bordered on both sides by
characters of S. A gap may be as small as a single space. As an example of gaps, consider
the alignment in Figure 11.5, which has four gaps containing a total of seven spaces. That
alignment would be described as having five matches, one mismatch, four gaps, and seven
spaces. Notice that the last space in the first string is followed by a space in the second
string, but those two spaces are in two gaps and do not form a single gap.
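Because a gap is defined by maximality, counting gaps is a simple scan. The sketch below (ours) counts the gaps and spaces in one row of an alignment; applied to both rows of the Figure 11.5 alignment and summed, it would report the four gaps and seven spaces described above:

    def count_gaps_and_spaces(row):
        # A gap is a maximal consecutive run of '-' characters in one row.
        gaps = spaces = 0
        in_gap = False
        for ch in row:
            if ch == '-':
                spaces += 1
                if not in_gap:       # first space of a new maximal run
                    gaps += 1
                    in_gap = True
            else:
                in_gap = False
        return gaps, spaces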
By including a term in the objective function that reflects the gaps in the alignment
one has some influence on the distribution of spaces in an alignment and hence on the
overall shape of the alignment. In the simplest objective function that includes gaps,
Sometimes in the biology literature the term "space" (as we use it) is not used. Rather, the term "gap" is used both
for "space" and for "gap" (as we have defined it here). This can cause much confusion, and in this book the terms
"gap" and "space" have distinct meanings.
(i*, j*) is found, the substrings α and β giving the optimal local alignment of S1 and S2
are found by tracing back the pointers from cell (i*, j*) until an entry (i', j') is reached
that has value zero. Then the optimal local alignment substrings are α = S1[i'..i*] and
β = S2[j'..j*].
Time analysis
Since it takes only four comparisons and three arithmetic operations per cell to compute
v(i, j), it takes only O(nm) time to fill in the entire table. The search for v* and the
traceback clearly require only O(nm) time as well, so we have established the following
desired theorem:
Theorem 11.7.4. For two strings S1 and S2 of lengths n and m, the local alignment
problem can be solved in O(nm) time, the same time as for global alignment.
Recall that the pointers in the dynamic programming table for edit distance, global
alignment, and similarity encode all the optimal alignments. Similarly, the pointers in the
dynamic programming table for local alignment encode the optimal local alignments as
follows.
Theorem 11.7.5. All optimal local alignments of two strings are represented in the dynamic programming table for v(i, j) and can be found by tracing any pointers back from
any cell with value v*.
We leave the proof as an exercise.
Figure 11.6: Each of the four rows represents part of the RNA sequence of one strain of the HIV-1 virus.
The HIV virus mutates rapidly, so that mutations can be observed and traced. The bottom three rows are
from virus strains that have each mutated from an ancestral strain represented in the top row. Each of the
bottom sequences is shown aligned to the top sequence. A dark box represents a substring that matches
the corresponding substring in the top sequence, while each white space represents a gap resulting from
a known sequence deletion. This figure is adapted from one in [123].
A deletion or insertion of an entire substring of one long string
shows up as a gap when two proteins are aligned. In some contexts, many biologists
consider the proper identification of the major (long) gaps as the essential problem of
protein alignment. If the long (major) gaps have been selected correctly, the rest of the
alignment - reflecting point mutations - is then relatively easy to obtain.
An alignment of two strings is intended to reflect the cost (or likelihood) of mutational
events needed to transform one string to another. Since a gap of more than one space
can be created by a single mutational event, the alignment model should reflect the true
distribution of spaces into gaps, not merely the number of spaces in the alignment. It
follows that the model must specify how to weight gaps so as to reflect their biological
meaning. In this chapter we will discuss different proposed schemes for weighting gaps,
and in later chapters we will discuss additional issues in scoring gaps. First we consider a
concrete example illustrating the utility of the gap concept.
each gap contributes a constant weight Wg, independent of how long the gap is. That is,
each individual space is free, so that s(x, -) = s(-, x) = 0 for every character x. Using
the notation established in Section 11.6 (page 226), we write the value of an alignment
containing k gaps as
Σ_{i=1}^{l} s(S1'(i), S2'(i)) - k·Wg.
Changing the value of Wg relative to the other weights in the objective function can
change how spaces are distributed in the optimal alignment. A large Wg encourages the
alignment to have few gaps, and the aligned portions of the two strings will fall into a few
substrings. A smaller Wg allows more fragmented alignments. The influence of Wg on the
alignment will be discussed more deeply in Section 13.1.
Certainly, you don't want to set a large penalty for spaces, since that would force the
cDNA string to align close together, rather than allowing gaps in the alignment corresponding
to the long introns. You would also want a rather high penalty for mismatches. Although
there may be a few sequencing errors in the data, so that some mismatches will occur
even when the cDNA is properly cut up to match the exons, there should not be a large
percentage of mismatches. In summary, you want small penalties for spaces, relatively
large penalties for mismatches, and positive values for matches.
What kind of alignment would likely result using an objective function that has low
space penalty, high mismatch penalty, positive match value of course, and no term for
gaps? Remember that the long string contains more than one gene, that the exons are
separated by long introns, and that DNA has an alphabet of only four letters present in
roughly equal amounts. Under these conditions, the optimal alignment would probably be
the longest common subsequence between the short cDNA string and the long anonymous
DNA string. And because the introns are long and DNA has only four characters, that
common subsequence would likely match all of the characters in the cDNA. Moreover,
because of small but real sequencing errors, the true alignment of the cDNA to its exons
would not match all the characters. Hence the longest common subsequence would likely
have a higher score than the correct alignment of the cDNA to exons. But the longest
common subsequence would fragment the cDNA string over the longer DNA and not give
an alignment of the desired form - it would not pick out its exons.
Putting a term for gaps in the objective function rectifies the problem. By adding a
constant gap weight Wg for each gap in the alignment, and setting Wg appropriately (by
experimenting with different values of Wg), the optimal alignment can be induced to cut
up the cDNA to match its exons in the longer string. As before, the space penalty is set
to zero, the match value is positive, and the mismatch penalty is set high.
Processed pseudogenes
A more difficult version of cDNA matching arises in searching anonymous DNA for
processed pseudogenes. A pseudogene is a near copy of a working gene that has mutated
sufficiently from the original copy so that it can no longer function. Pseudogenes are very
common in eukaryotic organisms and may play an important evolutionary role, providing
a ready pool of diverse "near genes". Following the view that new genes are created by the
process of duplication with modification of existing genes [127, 128, 130], pseudogenes
either represent trial genes that failed or future genes that will function after additional
mutations.
A pseudogene may be located very far from the gene it corresponds to, even on a different
chromosome entirely, but it will usually contain both the introns and the exons derived
from its working relative. The problem of finding pseudogenes in anonymous sequenced
DNA is therefore related to that of finding repeated substrings in a very long string.
A more interesting type of pseudogene, the processed pseudogene, contains only the
exon substrings from its originating gene. Like cDNA, the introns have been removed and
the exons concatenated. It is thought that a processed pseudogene originates as an mRNA
that is retranscribed back into DNA (by the enzyme Reverse Transcriptase) and inserted
into the genome at a random location.
Now, given a long string of anonymous DNA that might contain both a processed
pseudogene and its working ancestor, how could the processed pseudogenes be located?
"his
238
11.8. GAPS
241
The alphabet-weight version of the affine gap weight model again sets s(x, -) =
s(-, x) = 0 and has the objective of finding an alignment to maximize
[Σ_{i=1}^{l} s(S1'(i), S2'(i)) - Wg(# gaps) - Ws(# spaces)].
The affine gap weight model is probably the most commonly used gap model in the
molecular biology literature, although there is considerable disagreement about what Wg
and Ws should be [161] (in addition to questions about Wm and Wms). For aligning amino
acid strings, the widely used search program FASTA [359] has chosen the default settings
of Wg = 10 and Ws = 2. We will return to the question of the choice of these settings in
Section 13.1.
It has been suggested [57, 183, 466] that some biological phenomena are better modeled
by a gap weight function where each additional space in a gap contributes less to the gap
weight than the preceding space (a function with negative second derivative). In other
words, a gap weight that is a convex, but not affine, function of its length. An example is
the function Wg + log_e q, where q is the length of the gap. Some biologists have suggested
that a gap function that initially increases to a maximum value and then decreases to near
zero would reflect a combination of different biological phenomena that insert or delete
DNA.
Finally, the most general gap weight we will consider is the arbitrary gap weight, where
the weight of a gap is an arbitrary function w(q) of its length q. The constant, affine, and
convex weight models are of course subcases of the arbitrary weight model.
The problem is similar to cDNA matching but more difficult because one does not have
the cDNA in hand. We leave it to the reader to explore the use of repeat finding methods,
local alignment, and gap weight selection in tackling this problem.
Caveat
The problems of cDNA and pseudogene matching illustrate the utility of including gaps in
the alignment objective function and the importance of weighting the gaps appropriately.
It should be noted, however, that in practice one can approach these matching problems
by a judicious use of local alignment without gaps. The idea is that in computing local
alignment, one can find not only the most similar pair of substrings but many other highly
similar pairs of substrings (see Sections 13.2.4 and 11.7.3). In the context of cDNA or
pseudogene matching, these pairs will likely be the exons, and so the needed match of
cDNA to exons can be pieced together from a number of nonoverlapping local alignments.
This is the more typical approach in practice.
where s(x, -) = s(-, x) = 0 for every character x, and S1' and S2' represent the strings S1
and S2 after insertion of spaces.
A generalization of the constant gap weight model is to add a weight Ws for each space
in the gap. In this case, Wg is called the gap initiation weight because it can represent the
cost of starting a gap, and Ws is called the gap extension weight because it can represent the
cost of extending the gap by one space. Then the operation-weight version of the problem is:
Find an alignment to maximize [Wm(# matches) - Wms(# mismatches) - Wg(# gaps)
- Ws(# spaces)].
This is called the affine gap weight model because the weight contributed by a single
gap of length q is given by the affine function Wg + q·Ws. The constant gap weight model
is simply the affine model with Ws = 0.
The affine gap model is sometimes called the linear weight model, and I prefer that term. However, "affine" has
become the dominant term in the biological literature, and "linear" there usually refers to an affine function with
Wg = 0.
where G(0, 0) = 0, but G(i, j) is undefined when exactly one of i or j is zero. Note that
V(0, 0) = w(0), which will most naturally be assigned to be zero.
When end spaces, and hence end gaps, are free, then the optimal alignment value is the
maximum value over any cell in row n or column m, and the base cases are V(i, 0) =
V(0, j) = 0 for all i and j.
Time analysis
Theorem 11.8.1. Assuming that |S1| = n and |S2| = m, the recurrences can be evaluated
in O(nm² + n²m) time.
PROOF
The increase in running time over the previous case (O(nm) time when gaps are not in
the model) is caused by the need to look j cells to the left and i cells above to determine
V(i, j). Before gaps were included in the model, V(i, j) depended only on the three cells
adjacent to (i, j), and so each V(i, j) value was computed in constant time. We will show
next how to reduce the number of cell examinations for the case of affine gap weights;
later we will show a more complex reduction for the case of convex gap weights.
Figure 11.8: The recurrences for alignment with gaps are divided into three types of alignments: 1. those
that align S1(i) to the left of S2(j), 2. those that align S1(i) to the right of S2(j), and 3. those that align them
opposite each other.
1. Alignments of S1[1..i] and S2[1..j] where character S1(i) is aligned to a character strictly
to the left of character S2(j). Therefore, the alignment ends with a gap in S1.
2. Alignments of the two prefixes where S1(i) is aligned strictly to the right of S2(j). Therefore, the alignment ends with a gap in S2.
3. Alignments of the two prefixes where characters S1(i) and S2(j) are aligned opposite each
other. This includes both the case that S1(i) = S2(j) and that S1(i) ≠ S2(j).
Clearly, these three types of alignments cover all the possibilities.
E(i, j) = max_{0≤k≤j-1} [V(i, k) - w(j - k)],
F(i, j) = max_{0≤l≤i-1} [V(l, j) - w(i - l)].
To complete the recurrences, we need to specify the base cases and where the optimal
alignment value is found. If all spaces are included in the objective function, even spaces
that begin or end an alignment, then the optimal value for the alignment is found in cell
(n, m), and the base case is V(i, 0) = -w(i) and V(0, j) = -w(j).
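Evaluated directly, the recurrences give the following sketch (ours; the aligned-pair term G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)) is the standard third case). It runs in O(nm² + n²m) time, matching Theorem 11.8.1:

    def arbitrary_gap_value(S1, S2, s, w):
        # s scores aligned character pairs; w(q) is the weight of a gap of
        # length q, subtracted from the alignment value; end gaps are charged.
        n, m = len(S1), len(S2)
        V = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):       # base cases: one leading gap
            V[i][0] = -w(i)
        for j in range(1, m + 1):
            V[0][j] = -w(j)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                E = max(V[i][k] - w(j - k) for k in range(j))    # gap in S1
                F = max(V[l][j] - w(i - l) for l in range(i))    # gap in S2
                G = V[i - 1][j - 1] + s(S1[i - 1], S2[j - 1])    # aligned pair
                V[i][j] = max(E, F, G)
        return V[n][m]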
11.9. Exercises
1. Write down the edit transcript for the alignment example on page 226.
2. The definition given in this book for string transformation and edit distance allows at most
one operation per position in each string. But part of the motivation for string transformation
and edit distance comes from an attempt to model evolution, where there is no restriction
on the number of mutations that could occur at the same position. A deletion followed
by an insertion and then a replacement could all happen at the same position. However,
even though multiple operations at the same position are allowed, they will not occur in the
transformation that uses the fewest number of operations. Prove this.
3. In the discussion of edit distance, all transforming operations were assumed to be done to
one string only, and a "hand-waving" argument was given to show that no greater generality
is gained by allowing operations on both strings. Explain in detail why there is no loss in
generality in restricting operations to one string only.
4. Give the details for how the dynamic programming table for edit distance or alignment can
6. In Part I we discussed the exact matching problem when don't-care symbols are allowed.
Formalize the edit distance problem when don't-care symbols are allowed in both strings,
and show how to handle them in the dynamic programming solution.
7. Prove Theorem 11.3.4 showing that the pointers in the dynamic programming table completely capture all the optimal alignments.
8. Show how to use the optimal (global) alignment value to compute the edit distance of two
strings and vice versa. Discuss in general the formal relationship between edit distance
and string similarity. Under what circumstances are these concepts essentially equivalent,
and when are they different?
9. The method discussed in this chapter to construct an optimal alignment left back-pointers
while filling in the dynamic programming (DP) table, and then used those pointers to trace
back a path from cell (n, m) to cell (0, 0). However, there is an alternate approach that
works even if no pointers are available. If given the full DP table without pointers, one can
construct an alignment with an algorithm that "works through" the table in a single pass
from cell (n, m) to cell (0, 0). Make this precise and show it can be done as fast as the
algorithm that fills in the table.
10. For most kinds of alignments (for example, global alignment without arbitrary gap weights),
the traceback using pointers (as detailed in Section 11.3.3) runs in O(n + m) time, which
is less than the time needed to fill in the table. Determine which kinds of alignments allow
this speedup.
11. Since the traceback paths in a dynamic programming table correspond one-to-one with the
optimal alignments, the number of distinct cooptimal alignments can be obtained by computing the number of distinct traceback paths. Give an algorithm to compute this number
in O(nm) time.
Hint: Use dynamic programming.
12. As discussed in the previous problem, the cooptimal alignments can be found by enumerating all the traceback paths in the dynamic programming table. Give a backtracking method
to find each path, and each cooptimal alignment, in O(n + m) time per path.
13. In a dynamic programming table for edit distance, must the entries along a row be
has already begun or whether a new gap is being started (either opposite character i of S1
or opposite character j of S2). This insight, as usual, is formalized in a set of recurrences.
The recurrences
For the case where end gaps are included in the alignment value, the base case is easily
seen to be
V(i, 0) = -Wg - i·Ws and V(0, j) = -Wg - j·Ws,
so that the zero row and column of the table for V can be filled in easily. When end gaps
are free, then V(i, 0) = V(0, j) = 0.
The general recurrences are
E(i, j) = max[E(i, j - 1), V(i, j - 1) - Wg] - Ws,
F(i, j) = max[F(i - 1, j), V(i - 1, j) - Wg] - Ws,
G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)),
V(i, j) = max[E(i, j), F(i, j), G(i, j)].
Time analysis
Theorem 11.8.2. The optimal alignment with affine gap weights can be computed in
O(nm) time, the same time as for optimal alignment without a gap term.
PROOF Examination of the recurrences shows that for any pair (i, j), each of the terms
V(i, j), E(i, j), F(i, j), and G(i, j) is evaluated by a constant number of references to
previously computed values, arithmetic operations, and comparisons. Hence O(nm) time
suffices to fill in all the (n + 1) x (m + 1) cells in the dynamic programming table.
do not contribute to the cost of the alignment. Show how to use the affine gap recurrences
developed in the text to solve the end-gap free version of the affine gap model of alignment.
Then consider using the alternate recurrences developed in the previous exercise. Both
should run in O(nm) time. Is there any advantage to using one over the other of these
recurrences?
29. Show how to extend the agrep method of Section 4.2.3 to allow character insertions and
deletions.
30. Give a simple algorithm to solve the local alignment problem in O(nm) time if no spaces
are allowed in the local alignment.
31. Repeated substrings. Local alignment between two different strings finds pairs of substrings from the two strings that have high similarity. It is also important to find substrings
of a single string that have high similarity. Those substrings represent inexact repeated
substrings. This suggests that to find inexact repeats in a single string one should locally
align a string against itself. But there is a problem with this approach. If we do local
alignment of a string against itself, the best substring will be the entire string. Even using
all the values in the table, the best path to a cell (i, j) for i ≠ j may be strongly influenced
by the main diagonal. There is a simple fix to this problem. Find it. Can your method produce two substrings that overlap? Is that desirable? Later in Exercise 17 of Chapter 13, we
will examine the problem of finding the most similar nonoverlapping substrings in a single
string.
32. Tandem repeats. Let P be a pattern of length n and T a text of length m. Let P^m be
the concatenation of P with itself m times, so P^m has length mn. We want to compute a
local alignment between P^m and T. That will find an interval in T that has the best global
alignment (according to standard alignment criteria) with some tandem repeat of P. This
problem differs from the problem considered in Exercise 4 of Chapter 1, because errors
(mismatches and insertions and deletions) are now allowed. The particular problem arises
in studying the secondary structure of proteins that form what is called a coiled-coil [158].
In that context, P represents a motif or domain (a pattern for our purposes) that can repeat
in the protein an unknown number of times, and T represents the protein. Local alignment
between P^m and T picks out an interval of T that "optimally" consists of tandem repeats
of the motif (with errors allowed). If P^m is explicitly created, then standard local alignment
will solve the problem in O(nm²) time. But because P^m consists of identical copies of P,
an O(nm)-time solution is possible. The method essentially simulates what the dynamic
programming algorithm for local alignment would do if it were executed with P^m and T
explicitly. Below we outline the method.
14. Give a complete argument that the formula in Theorem 11.6.1 is correct. Then provide
the details for how to find the longest common subsequence, not just its length, using the
algorithm for weighted edit distance.
15. As shown in the text, the longest common subsequence problem can be solved as an
optimal alignment or similarity problem. It can also be solved as an operation-weight edit
distance problem.
Let u represent the length of the longest common subsequence of two strings of lengths
n and m. Using the operation weights of d = 1, r = 2, and e = 0, we claim that D(n, m) =
m + n - 2u, or u = (m + n - D(n, m))/2. So, D(n, m) is minimized by maximizing u. Prove this
claim and explain in detail how to find a longest common subsequence using a program
for operation-weight edit distance.
16. Write recurrences for the longest common subsequence problem that do not use weights.
That is, solve the lcs problem more directly, rather than expressing it as a special case of
similarity or operation-weighted edit distance.
17. Explain the correctness of the recurrences for similarity given in Section 11.6.1.
18. Explain how to compute edit distance (as opposed to similarity) when end spaces are free.
19. Prove the one-to-one correspondence between shortest paths in the edit graph and mini-
21. Prove Theorem 11.6.2, and show in detail the correctness of the method presented for
finding the shortest approximate occurrence of P in T ending at position j.
22. Explain how to use the dynamic programming table and traceback to find all the optimal
solutions (pairs of substrings) to the local alignment problem for two strings S1 and S2.
23. In Section 11.7.3, we mentioned that the dynamic programming table is often used to
identify pairs of substrings of high similarity, which may not be optimal solutions to the
local alignment problem. Given similarity threshold t, that method seeks to find pairs of
substrings with similarity value t or greater. Give an example showing that the method
might miss some qualifying pairs of substrings.
24. Show how to solve the alphabet-weight alignment problem with affine gap weights in O(nm)
time.
25. The discussions for alignment with gap weights focused on how to compute the values in
the dynamic programming table and did not detail how to construct an optimal alignment.
Show how to augment the algorithm so that it constructs an optimal alignment. Try to limit
the amount of additional space required.
26. Explain in detail why the recurrence E(i, j) = max[E(i, j - 1), G(i, j - 1) - Wg, V(i, j - 1)
- Wg] - Ws is correct for the affine gap model, but is redundant, and why the middle term
(G(i, j - 1) - Wg) can be removed.
27. The recurrence relations we developed for the affine gap model follow the logic of paying
Wg + Ws when a gap is "initiated" and then paying Ws for each additional space used in
that gap. An alternative logic is to pay Wg + Ws at the point when the gap is "completed."
Write recurrence relations for the affine gap model that follow that logic. The recurrences
should compute the alignment in O(nm) time. Recurrences of this type are developed in
[166].
28. In the end-gap free version of alignment, spaces and gaps at either end of the alignment
Usually a scoring matrix is used to score matches and mismatches, and an affine (or linear)
gap penalty model is also used. Experiments [51, 447] have shown that the success of this
approach is very sensitive to the exact choice of the scoring matrix and penalties. Moreover,
it has been suggested that the gap penalty must be made higher in the substrings forming
the α and β regions than in the rest of the string (for example, see [51] and [296]). That
is, no fixed choice for gap penalty and space penalty (gap initiation and gap extension
penalties in the vernacular of computational biology) will work. Or at least, having a higher
gap penalty in the secondary regions will more likely result in a better alignment. High
gap penalties tend to keep the α and β regions unbroken. However, since insertions and
deletions do definitely occur in the loops, gaps in the alignment of regions outside the core
should be allowed.
This leads to the following alignment problem: How do you modify the alignment model
and penalty structure to achieve the requirements outlined above? And, how do you find
the optimal alignment within those new constraints?
Technically, this problem is not very hard. However, the application to deducing secondary
structure is very important. Orders of magnitude more protein sequence data are available
than are protein structure data. Much of what is "known" about protein structure is actually
obtained by deductions from protein sequence data. Consequently, deducing structure
from sequence is a central goal.
A multiple alignment version of this structure prediction problem is discussed in the first
part of Section 14.10.2.
37. Given two strings S1 and S2 and a text T, you want to find whether there is an occurrence
of S1 and S2 interwoven (without spaces) in T. For example, the strings abac and bbc
occur interwoven in cabbabccdw. Give an efficient algorithm for this problem. (It may have
a relationship to the longest common subsequence problem.)
38. As discussed earlier in the exercises of Chapter 1, bacterial DNA is often organized into
circular molecules. This motivates the following problem: Given two linear strings of lengths
n and m, there are n circular shifts of the first string and m circular shifts of the second
string, and so there are nm pairs of circular shifts. We want to compute the global alignment
for each of these nm pairs of strings. Can that be done more efficiently than by solving
the alignment problem from scratch for each pair? Consider both worst-case analysis and
"typical" running time for "naturally occurring" input.
Examine the same problem for local alignment.
39. The stuttering subsequence problem [328]. Let P and T be strings of n and m characters each. Give an O(m)-time algorithm to determine if P occurs as a subsequence
of T.
Now let P^i denote the string P where each character is repeated i times. For example,
if P = abc then P^3 is aaabbbccc. Certainly, for any fixed i, one can test in O(m) time
whether P^i occurs as a subsequence of T. Give an algorithm that runs in O(m log m) time
to determine the largest i such that P^i is a subsequence of T. Let Maxi(P, T) denote the
value of that largest i.
Now we will outline an approach to this problem that reduces the running time from
O(m log m) to O(m). You will fill in the details.
For a string T, let d be the number of distinct characters that occur in T. For string T and
character x in T, define odd(x) to be the positions of the odd occurrences of x in T, that
is, the positions of the first, third, fifth, etc. occurrence of x in T. Since there are d distinct
characters in T, there are d such odd sets. For example, if T = 0120002112022220110001
then odd(1) is 2, 9, 18. Now define half(T) as the subsequence of T that remains after
removing all the characters in positions specified by the d odd sets. For example, half(T)
matrix be computed? The motivation for this matrix is essentially the same as for the matrix
described in the preceding problem and is used in [443] and [445].
34. Implement the dynamic programming solution for alignment with a gap term in the objective
function, and then experiment with the program to find the right weights to solve the cDNA
matching problem.
35. The process by which intron-exon boundaries (called splice sites) are found in mRNA is not
well understood. The simplest hope - that splice sites are marked by patterns that always
occur there and never occur elsewhere - is false. However, it is true that certain short
patterns very frequently occur at the splice sites of introns. In particular, most introns start
with the dinucleotide GT and end with AG. Modify the dynamic programming recurrences
used in the cDNA matching problem to enforce this fact.
There are additional pattern features that are known about introns. Search a library to find
information about those conserved features - you'll find a lot of interesting things while
doing the search.
Figure 11.10: A rough drawing of a cloverleaf structure. Each of the small horizontal or vertical lines inside
a stem represents a base pairing of a-u or c-g.
42. Transfer RNA (tRNA) molecules have a distinctive planar secondary structure called the
cloverleaf structure. In a cloverleaf, the string is divided into alternating stems and loops
(see Figure 11.10). Each stem consists of two parallel substrings that have the property
that any pair of opposing characters in the stem must be complements (a with u; c with
g). Chemically, each complementary stem pair forms a bond that contributes to the overall
stability of the molecule. A c-g bond is stronger than an a-u bond.
Relate this (very superficial) description of tRNA secondary structure to the weighted
nested pairing problem discussed above.
43. The true bonding pattern of complementary bases (in the stems) of tRNA molecules mostly
conforms to the noncrossing condition in the definition of a nested pairing. However, there
are exceptions, so that when the secondary structure of known tRNA molecules is represented by lines through the circle, a few lines may cross. These violations of the noncrossing
condition are called pseudoknots.
Consider the problem of finding a maximum cardinality proper pairing where a fixed number of pseudoknots are allowed. Give an efficient algorithm for this problem, where the
complexity is a function of the permitted number of crossings.
44. RNA sequence and structure alignment. Because of the nested pairing structure of
RNA, it is easy to incorporate some structural considerations when aligning RNA strings.
Here we examine alignments of this kind.
Let P be an RNA pattern string with a known pairing structure, and let T be a larger RNA
text string with a known pairing structure. To represent pairing structure in P, let Op(i) be the
offset (positive or negative) of the mate of the character at position i, if any. For example, if
the character at position 17 is mated to the character at position 46, then Op(17) = 29 and
Op(46) = -29. If the character at position i has no mate, then Op(i) is zero. The structure
of T is similarly represented by an offset vector OT. Then P exactly occurs in T starting at
position j if and only if P(i) = T(j + i - 1) and Op(i) = OT(j + i - 1), for each position i in P.
a. Assuming the lengths of P and T are n and m, respectively, give an O(n + m)-time algorithm
to find every place that P exactly occurs in T.
b. Now consider a more liberal criterion for deciding that P occurs in T starting at position j.
We again require that P(i) = T(j + i - 1) for each position i in P, but now only require that
Op(i) = OT(j + i - 1) when Op(i) is not zero.
above is 0021220101. Assuming that the number of distinct symbols, d, is fixed ahead of
time, give an O(m)-time algorithm to find half(T). Now argue that the length of half(T) is
at most m/2. This will be used later in the time analysis.
Now prove that |Maxi(P, T) - 2 Maxi(P, half(T))| ≤ 1.
This fact is the critical one in the method.
The above facts allow us to find Maxi(P, T) in O(m) time by a divide-and-conquer recursion.
Give the details of the method: Specify the termination conditions of the divide and conquer,
prove correctness of the method, set up a recurrence relation to analyze the running time,
and then solve the relation to obtain an O(m) time bound.
Harder problem: What is a realistic application for the stuttering subsequence problem?
40. As seen in the previous problem, it is easy to determine if a single pattern P occurs as a
subsequence in a text T. This takes O(m) time. Now consider the problem of determining
if any pattern in a set of patterns occurs in a text. If n is the length of all the patterns in the
set, then O(nm) time is obtained by solving the problem for each pattern separately. Try for
a time bound that is significantly better than O(nm). Recall that the analogous substring
set problem can be solved in O(n + m) time by Aho-Corasick or suffix tree methods.
41. The tRNA folding problem. The following is an extremely crude version of a problem
that arises in predicting the secondary (planar) structure of transfer RNA molecules. Let
S be a string of n characters over the RNA alphabet a, c, u, g. We define a pairing as a
set of disjoint pairs of characters in S. A pairing is called proper if it only contains (a, u)
pairs or (c, g) pairs. This constraint arises because in RNA a and u are complementary
nucleotides, as are c and g. If we draw S as a circular string, we define a nested pairing as
a proper pairing where each pair in the pairing is connected by a line inside the circle, and
where the lines do not cross each other (see Figure 11.9). The problem is to find a nested
pairing of largest cardinality. Often one has the additional constraint that a character may
not be in a pair with either of its two immediate neighbors. Show how to solve this version
of the tRNA folding problem in O(n³) time using dynamic programming.
Now modify the problem by adding weights to the objective function so that the weight of
an a-u pair is different than the weight of a c-g pair. The goal now is to find a nested
pairing of maximum total weight. Give an efficient algorithm for this weighted problem.
two adjacent gaps where each is in a different string. For example, the alignment

x x a b - c y y
x x i - d e y y
c. Discuss when the more liberal definition is reasonable and when it may not be.
Figure 12.1: The similarity of the first i characters of S1^r and the first j characters of S2^r equals the similarity
of the last i characters of S1 and the last j characters of S2. (The dotted lines denote the substrings being
aligned.)
single row of the full table can be found and stored in those same time and space bounds.
This ability will be critical in the method to come.
As a further refinement of this idea, the space needed can be reduced to one row plus
one additional cell (in addition to the space for the strings). Thus m + 1 space is all that
is needed. And, if n < m, then space use can be further reduced to n + 1. We leave the
details as an exercise.
Clearly, the table of V^r(i, j) values can be computed in O(nm) time, and any single
preselected row of that table can be computed and stored in O(nm) time using only O(m)
space.
The initial piece of the full alignment is computed in linear space by computing V(n, m)
in two parts. The first part uses the original strings; the second part uses the reverse strings.
The details of this two-part computation are suggested in the following lemma.
Lemma 12.1.1. V(n, m) = max_{0≤k≤m} [V(n/2, k) + V^r(n/2, m - k)].
In this chapter we look at a number of important refinements that have been developed
for certain core string edit and alignment problems. These refinements either speed up a
dynamic programming solution, reduce its space requirements, or extend its utility.
Figure 12.2: After finding k*, the alignment problem reduces to finding an optimal alignment in section A
of the table and another optimal alignment in section B of the table. The total area of subtables A and B is
at most cnm/2. The subpath L_{n/2} through cell (n/2, k*) is represented by a dashed path.
path from cell (n/2, k*) to a cell k2 in row n/2 + 1. That path identifies a subpath of an
optimal path from (n/2, k*) to (n, m). These two subpaths taken together form the subpath
L_{n/2} that is part of an optimal path L from (0, 0) to (n, m). Moreover, that optimal path
goes through cell (n/2, k*). Overall, O(nm) time and O(m) space is used to find k*, k1, k2,
and L_{n/2}.
To analyze the full method to come, we will express the time needed to fill in the
dynamic programming table of size p by q as cpq, for some unspecified constant c, rather
than as O(pq). In that view, the n/2 row of the first dynamic programming computation is
found in cnm/2 time, as is the n/2 row of the second computation. Thus, a total of cnm
time is needed to obtain and store both rows.
The key point to note is that with a cnm-time and O(m)-space computation, the algorithm learns k*, k1, k2, and L_{n/2}. This specifies part of an optimal alignment of S1 and
S2, and not just the value V(n, m). By Lemma 12.1.1 it learns that there is an optimal
alignment of S1 and S2 consisting of an optimal alignment of the first n/2 characters of
S1 with the first k* characters of S2, followed by an optimal alignment of the last n/2
characters of S1 with the last m - k* characters of S2. In fact, since the algorithm has also
learned the subpath (subalignment) L_{n/2}, the problem of aligning S1 and S2 reduces to
two smaller alignment problems, one for the strings S1[1..n/2 - 1] and S2[1..k1], and one
for the strings S1[n/2 + 1..n] and S2[k2..m]. We call the first of the two problems the top
problem and the second the bottom problem. Note that the top problem is an alignment
problem on strings of lengths at most n/2 and k*, while the bottom problem is on strings
of lengths at most n/2 and m - k*.
In terms of the dynamic programming table, the top problem is computed in section A
of the original n by m table shown in Figure 12.2, and the bottom problem is computed
in section B of the table. The rest of the table can be ignored. Again, we can determine
the values in the middle row of A (or B) in time proportional to the total size of A (or B).
Hence the middle row of the top problem can be determined in at most ck*n/2 time, and the
middle row in the bottom problem can be determined in at most c(m - k*)n/2 time. These
two times add to cnm/2. This leads to the full idea for computing the optimal alignment
of S1 and S2.
PROOF This result is almost obvious, and yet it requires a proof. Recall that S1[1..i] is
the prefix of string S1 consisting of the first i characters and that S1^r[1..i] is the reverse
of the suffix of S1 consisting of the last i characters of S1. Similar definitions hold for S2
and S2^r.
For any fixed position k' in S2, there is an alignment of S1 and S2 consisting of an
alignment of S1[1..n/2] and S2[1..k'] followed by a disjoint alignment of S1[n/2 + 1..n]
and S2[k' + 1..m]. By definition of V and V^r, the best alignment of the first type has
value V(n/2, k') and the best alignment of the second type has value V^r(n/2, m - k'), so
the combined alignment has value V(n/2, k') + V^r(n/2, m - k') ≤ max_k [V(n/2, k) +
V^r(n/2, m - k)] ≤ V(n, m).
Conversely, consider an optimal alignment of S1 and S2. Let k' be the right-most position
in S2 that is aligned with a character at or before position n/2 in S1. Then the optimal
alignment of S1 and S2 consists of an alignment of S1[1..n/2] and S2[1..k'] followed by
an alignment of S1[n/2 + 1..n] and S2[k' + 1..m]. Let the value of the first alignment be
denoted p and the value of the second alignment be denoted q. Then p must be equal
to V(n/2, k'), for if p < V(n/2, k') we could replace the alignment of S1[1..n/2] and
S2[1..k'] with an alignment of S1[1..n/2] and S2[1..k'] that has value V(n/2, k'). That
would create an alignment of S1 and S2 whose value is larger than the claimed optimal.
Hence p = V(n/2, k'). By similar reasoning, q = V^r(n/2, m - k'). So V(n, m) =
V(n/2, k') + V^r(n/2, m - k') ≤ max_k [V(n/2, k) + V^r(n/2, m - k)].
Having shown both sides of the inequality, we conclude that V(n, m) = max_k [V(n/2, k)
+ V^r(n/2, m - k)].
By Lemma 12.1.1, there is an optimal alignment whose traceback path in the full
dynamic programming table (if one had filled in the full n by m table) goes through cell
(n/2, k*). Another way to say this is that there is an optimal (longest) path L from node
(0, 0) to node (n, m) in the alignment graph that goes through node (n/2, k*). That is the
key feature of k*.
Definition Let L_{n/2} be the subpath of L that starts with the last node of L in row n/2 - 1
and ends with the first node of L in row n/2 + 1.
Lemma 12.1.2. A position k* in row n/2 can be found in O(nm) time and O(m) space.
Moreover, a subpath L_{n/2} can be found and stored in those time and space bounds.
PROOF
most cnm/2^{h-1}. The final dynamic programming pass to describe the optimal alignment
takes cnm time. Therefore, we have the following theorem:
The call that begins the computation is to OPTA(1, n, 1, m). Note that the subpath L_h
is output between the two OPTA calls and that the top problem is called before the bottom
problem. The effect is that the subpaths are output in order of increasing h value, so that
their concatenation describes an optimal path L from (0, 0) to (n, m), and hence an optimal
alignment of S1 and S2.
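Putting the pieces together, the following Python sketch implements the recursion in the spirit of OPTA, under illustrative assumptions of ours: unit match/mismatch/space scores, '_' marking spaces, and a simplified variant that recurses on the two halves at row n/2 rather than separately extracting the subpath L_h:

    def optA(s1, s2, match=1, mismatch=-1, space=-1):
        def score_row(a, b):                  # V(|a|, j) for all j, O(|b|) space
            row = [j * space for j in range(len(b) + 1)]
            for i in range(1, len(a) + 1):
                diag, row[0] = row[0], i * space
                for j in range(1, len(b) + 1):
                    m = match if a[i - 1] == b[j - 1] else mismatch
                    diag, row[j] = row[j], max(diag + m, row[j] + space,
                                               row[j - 1] + space)
            return row

        n, m = len(s1), len(s2)
        if n == 0:
            return "_" * m, s2
        if n == 1:                            # termination condition
            if m == 0:
                return s1, "_"
            k = s2.find(s1)
            if k == -1:
                k = 0                         # no match: one mismatch is optimal here
            return "_" * k + s1 + "_" * (m - k - 1), s2

        mid = n // 2
        left = score_row(s1[:mid], s2)                    # V(n/2, k)
        right = score_row(s1[mid:][::-1], s2[::-1])       # V^r(n/2, m - k)
        k_star = max(range(m + 1), key=lambda k: left[k] + right[m - k])
        top = optA(s1[:mid], s2[:k_star], match, mismatch, space)
        bottom = optA(s1[mid:], s2[k_star:], match, mismatch, space)
        return top[0] + bottom[0], top[1] + bottom[1]

The first pass costs cnm, the two subproblems together at most cnm/2, and so on, for a total of roughly 2cnm time in linear space, matching the analysis above.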
I recently attended a meeting concerning the Human Genome Project, where numerous examples were presented
in talks. I stopped taking notes after the tenth one.
Definition Given strings S1 and S2 and a fixed number k, the k-difference global
alignment problem is to find the best global alignment of S1 and S2 containing at most
k mismatches and spaces (if one exists).
The k-difference global alignment problem is a special case of edit distance and is
useful when S1 and S2 are believed to be fairly similar. It also arises as a subproblem in
more complex string processing problems, such as the approximate PCR primer problem
considered in Section 12.2.5. The solution to the k-difference global alignment problem
will also be used to speed up global alignment when no bound k is specified.
Figure 12.3: The main diagonal and a strip that is k = 2 spaces off the main diagonal on each side.
comparisons with all sequences in SwissProt . . . Sequences belonging to the same species
and having more than 98 percent similarity over 33 amino acids were combined.
A similar example is discussed in [399] where roughly 170,000 DNA sequences "were
subjected to an optimal alignment procedure to identify sequence pairs with at least 97%
identity". In these alignment problems, one can impose a bound on the number of allowed
differences. Alignments that exceed that bound are not of interest - the computation only
needs to determine whether two sequences are "sufficiently similar" or not. Moreover,
because these applications involve a large number of alignments (all database entries
against themselves), efficiency of the method is important.
Admittedly, not every bounded-difference alignment problem in biology requires a sophisticated algorithm. But applications are so common, the sizes of some of the applications
are so large, and the speedups so great, that it seems unproductive to completely dismiss
the potential utility to molecular biology of bounded-difference and bounded-mismatch
methods. With this motivation, we now discuss specific techniques that efficiently solve
bounded-difference alignment problems.
their similarities and differences is a first step in sorting out their history and the constraints
on how they can mutate. The history of their mutations is then represented in the form
of an evolutionary tree (see Chapter 17). Collections of HIV viruses have been studied
in this way. Another good example of molecular epidemiology [348] arises in tracing the
history of Hantavirus infections in the southwest United States that appeared during the
early 1990s.
The final two examples come from the milestone paper [162] reporting the first complete DNA sequencing of a free-living organism, the bacterium Haemophilus influenzae Rd.
The genome of this bacterium consists of 1,830,137 base pairs and its full sequence was determined by pure shotgun sequencing without initial mapping (see Section 16.14). Before
the large-scale sequencing project, many small, disparate pieces of the bacterial genome
had been sequenced by different groups, and these sequences were in the DNA databases.
One of the ways the sequencers checked the quality of their large-scale sequencing was
to compare, when possible, their newly obtained sequence to the previously determined
sequence. If they could not match the appropriate new sequences to the old ones with only
a small number of differences, then additional steps were taken to assure that the new
sequences were correct. Quoting from [162], "The results of such a comparison show that
our sequence is 99.67 percent identical overall to those GenBank sequences annotated as
H. influenzae Rd".
From the standpoint of alignment, the problem discussed above is to determine whether
or not the new sequences match the old ones with few differences. This application illustrates both kinds of bounded difference alignment problems introduced earlier. When the
location in the genome of the database sequence is known, the corresponding string in
the full sequence can be extracted for comparison. The resulting comparison problem is
then an instance of the k-difference global alignment problem that will be discussed next,
in Section 12.2.3. When the genome location of the database sequence P is not known
(and this is common), the comparison problem is to find all the places in the full sequence
where P occurs with a very small number of allowed differences. That is then an instance
of the k-difference inexact matching problem, which will be considered in Section 12.2.4.
The above story of H. influenzae sequencing will be repeated frequently as systematic
large-scale DNA sequencing of various organisms becomes more common. Each full
sequence will be checked against the shorter sequences for that organism already in the
databases. This will be done not only for quality control of the large-scale sequencing,
but also to correct entries in the databases, since it is generally believed that large-scale
sequencing is more accurate.
The second application from [162] concerns building a nonredundant database of bacterial proteins (NRBP). For a number of reasons (for example, to speed up the search or to
better evaluate the statistical significance of matches that are found), it is helpful to reduce
the number of entries in a sequence database (in this case, bacterial protein sequences)
by culling out, or combining in some way, highly similar, "redundant" sequences. This
was done in the work presented in [162], and a "nonredundant" version of GenBank is
regularly compiled at The National Center for Biotechnology Information. Fleischmann
et al. [162] write:
Redundancy was removed from NRBP at two stages. All DNA coding sequences were extracted from GenBank . . . and sequences from the same species were searched against each
other. Sequences having more than 97 percent identity over regions longer than 100 nucleotides were combined. In addition, the sequences were translated and used in protein
Since end spaces in the text T are free, row zero of the dynamic programming table is
initialized with all zero entries. That allows a left end of T to be opposite a gap without
incurring any penalty.
Definition A d-path in the dynamic programming table is a path that starts in row zero
and specifies a total of exactly d mismatches and spaces.
Definition A d-path is farthest-reaching in diagonal i if it is a d-path that ends in
diagonal i, and the index of its ending column c (along diagonal i) is greater than or
equal to the ending column of any other d-path ending in diagonal i.
that this implies that m - n ≤ k is a necessary condition for there to be any solution.)
Therefore, to find any k-difference global alignment, it suffices to fill in the dynamic
programming table in a strip consisting of 2k + 1 cells in each row, centered on the main
diagonal. When assigning values to cells in that strip, the algorithm follows the established
recurrence relations for edit distance except for cells on the upper and lower border of the
strip. Any cell on the upper border of the strip ignores the term in the recurrence relation
for the cell above it (since it is out of the strip); similarly, any cell on the lower border
ignores the term in the recurrence relation for the cell to its left. If m = n, the size of the
strip can be reduced by half (Exercise 4).
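A Python sketch of the strip computation follows (the names are ours). Cells outside the strip are simply absent from the row dictionaries, and the dictionary default of infinity implements the border rule automatically:

    def k_difference_global(s1, s2, k):
        # Banded edit distance: fill only 2k+1 cells per row, centered on the
        # main diagonal.  Returns the edit distance if it is <= k, else None.
        n, m = len(s1), len(s2)
        if abs(m - n) > k:
            return None                       # necessary condition from the text
        INF = float("inf")
        prev = {j: j for j in range(min(m, k) + 1)}    # row 0 of the strip
        for i in range(1, n + 1):
            cur = {}
            for j in range(max(0, i - k), min(m, i + k) + 1):
                sub = prev.get(j - 1, INF)
                if j == 0 or s1[i - 1] != s2[j - 1]:
                    sub += 1                  # mismatch (or out of range)
                cur[j] = min(sub,
                             prev.get(j, INF) + 1,     # space in s2
                             cur.get(j - 1, INF) + 1)  # space in s1
            prev = cur
        d = prev.get(m, INF)
        return d if d <= k else None

    # k_difference_global("abcd", "abd", 1) -> 1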
If there is no global alignment of S1 and S2 with k or fewer differences, then the value
obtained for cell (n, m) will be greater than k. That value, greater than k, is not necessarily
the correct edit distance of S1 and S2, but it will indicate that the correct value for (n, m)
is greater than k. Conversely, if there is a global alignment with d ≤ k differences, then
the corresponding path is contained inside the strip and so the value in cell (n, m) will be
correctly set to d. The total area of the strip is O(kn), which is O(km), because n and m
can differ by at most k. In summary, we have
PROOF
Figure 12.6: The dashed line shows path R', the farthest-reaching (d - 1)-path ending on diagonal i.
The edge M on diagonal i just past the end of R' must correspond to a mismatch between P and T (the
characters involved are denoted P(k) and T(k') in the figure).
Theorem 12.2.3. Each of the three paths R1, R2, and R3 is a d-path ending on diagonal
i. The farthest-reaching d-path on diagonal i is the path R1, R2, or R3 that extends the
furthest along diagonal i.
PROOF Each of the three paths is an extension of a (d - 1)-path, and each extension adds
either one more space or one more mismatch. Hence each is a d-path, and each ends on
diagonal i by definition. So the farthest-reaching d-path on diagonal i must either be the
farthest-reaching of R1, R2, and R3, or it must reach farther on diagonal i than any of those
three paths.
Let R' be the farthest-reaching (d - 1)-path on diagonal i. The edge of the alignment
graph along diagonal i that immediately follows R' must correspond to a mismatch,
otherwise R' would not be the farthest-reaching (d - 1)-path on i. Let M denote that edge
(see Figure 12.6).
Let R* denote the farthest-reaching d-path on diagonal i. Since R* ends on diagonal i,
there is a point where R* enters diagonal i for the last time and then never leaves diagonal
i. If R* enters diagonal i for the last time above edge M, then R* must traverse edge M,
otherwise R* would not reach as far as R3. When R* reaches M (which marks the end of
R'), it must also have (d - 1) differences; if that portion of R* had less than a total of (d - 1)
differences, then it could traverse M, creating a (d - 1)-path on diagonal i that reached
farther on diagonal i than R', contradicting the definition of R'. It follows that if R* enters
diagonal i above M, then it will have d differences after it traverses M, and so it will end
exactly where R3 ends. So if R* is not R3, then R* must enter diagonal i below edge M.
Suppose R* enters diagonal i for the last time below edge M. Then R* must have d
differences at that point of entry; if it had fewer differences then R' would again fail to
be the farthest-reaching (d - 1)-path on diagonal i. Now R* enters diagonal i for the last
time either from diagonal i - 1 or diagonal i + 1, say i + 1 (the case of i - 1 is symmetric).
So R* traverses a vertical edge from diagonal i + 1 to diagonal i, which adds a space to
R*. That means that the point where R* ends on diagonal i + 1 defines a (d - 1)-path on
diagonal i + 1. Hence R* leaves diagonal i + 1 at or above the point where the path R1
does. Then R1 and R* each have d spaces or mismatches at the points where they enter
diagonal i for the last time, and then they each run along diagonal i until reaching an edge
corresponding to a mismatch. It follows that R* cannot reach farther along diagonal i than
R1 does. So in this case, R* ends exactly where R1 ends.
Figure 12.5: Path R1 consists of a farthest-reaching (d - 1)-path on diagonal i + 1 (shown with dashes),
followed by a vertical edge (dots), which adds the dth difference to the alignment, followed by a maximal
path (solid line) on diagonal i that corresponds to (maximal) identical substrings in P and T.
+ m) time, yielding the desired O(km)-time bound. Space will be similarly bounded.
Details
Each of the paths R1, R2, and R3 ends with a maximal extension corresponding to
identical substrings of P and T. In the case of R1 (or R2), the starting positions of the two
substrings are given by the last entry point of R1 (or R2) into diagonal i. In the case of R3,
the starting position is the position just past the last mismatch on R3.
Theorem 12.2.4. All locations in T where pattern P occurs with at most k differences
can be found in O(km) time and O(km) space. Moreover, the actual alignment of P and
T for each of these locations can be reconstructed in O(km) total time.
Sometimes this k-differences result is reported in a somewhat simpler but less useful
form, requiring less space. If one is only interested in the end locations in T where P
inexactly matches in T with at most k differences, then the O(km) space bound can be
reduced to O(n + m). The idea is that the ends of the farthest-reaching (d - 1)-paths in
each diagonal would then not be needed after iteration d and could be discarded. Thus
only O(n + m) space is needed to solve the simpler problem.
Theorem 12.2.5. In O(km) time and O(n + m) space, the algorithm can find all the end
locations in T where P matches T with at most k differences.
The case that R* enters diagonal i for the last time from diagonal i - 1 is symmetric, and
R* ends exactly where R2 ends. In each case we have shown that R*, the assumed farthest-reaching d-path on diagonal i, ends at the ending point of either R1, R2, or R3. Hence the
farthest-reaching d-path on diagonal i is the farthest-reaching of R1, R2, and R3.
Theorem 12.2.3 is the key to the O(km)-time method.
end.
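A Python sketch of the iteration over d is given below (the names are ours). It finds, for each diagonal i = column - row, the farthest-reaching d-path by the R1/R2/R3 rule of Theorem 12.2.3. The longest-common-extension step is done by direct character comparison here; substituting O(1) suffix-tree LCE queries for it yields the O(km) bound:

    def k_differences(P, T, k):
        # Returns the end positions in T (as prefix lengths) of all
        # occurrences of P with at most k differences.
        n, m = len(P), len(T)
        NEG = float("-inf")

        def extend(row, col):          # run down the diagonal over equal characters
            while row < n and col < m and P[row] == T[col]:
                row, col = row + 1, col + 1
            return row

        L = {i: extend(0, i) for i in range(m + 1)}   # 0-paths start in row zero
        ends = {i + n for i, r in L.items() if r == n and i + n <= m}
        for d in range(1, k + 1):
            Lnew = {}
            for i in range(-d, m + 1):
                r = max(L.get(i + 1, NEG) + 1,   # R1: vertical edge from diagonal i+1
                        L.get(i - 1, NEG),       # R2: horizontal edge from diagonal i-1
                        L.get(i, NEG) + 1)       # R3: one mismatch along diagonal i
                if r == NEG:
                    continue
                r = min(r, n, m - i if i >= 0 else n)  # clip to the table
                r = extend(r, r + i)
                Lnew[i] = r
                if r == n and 0 <= i + n <= m:
                    ends.add(i + n)
            L = Lnew
        return sorted(ends)

    # k_differences("ab", "ab", 1) -> [1, 2]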
explained and analyzed in full detail. Two other methods (Wu-Manber [482] and Pevzner-Waterman [373]) will also be mentioned. These methods do not completely achieve the
goal of provable linear and sublinear expected running times for all practical ranges of
errors (and this remains a superb open problem), but they do achieve the goal when the
error rate k/n is "modest".
Let σ be the size of the alphabet used in P and T. As usual, n is the length of P and
m is the length of T. For the general discussion, an occurrence of P in T with at most
k errors (mismatches or differences depending on the particular problem) will be called
an approximate occurrence of P. The high-level outline of most of the methods is the
following:
Lemma 12.3.1. Suppose P matches a substring T' of T with at most k differences. Then
T' must contain at least one interval of length r that exactly matches one of the r-length
regions of the partition of P.
PROOF In the alignment of P to T', each region of P aligns to some part of T' (see Figure
12.7), defining k + 1 subalignments. If each of those k + 1 subalignments were to contain
at least one error (mismatch or space), then there would be more than k differences in
total, a contradiction. Therefore, one of the first k + 1 regions of P must be aligned to an
interval of T' without any errors.
Note that the lemma also holds even for the k-mismatch problem (i.e., when no space
a specified p, the k-difference primer problem can be solved for a small range of choices
for k and still be expected to pick out useful primer candidates.
How to solve the k-difference primer problem
We follow the approach introduced in [243]. The method examines each position j in α
separately. For any position j, the k-difference primer problem becomes:
Find the shortest prefix of string α[j..n] (if it exists) that has edit distance at least k
from every substring in β.
The problem for a fixed j is essentially the "reverse" of the k-differences inexact
matching problem. In the k-difference inexact matching problem we want to find the
substrings of T that P matches, with at most k differences. But now, we want to reject any
prefix of α[j..n] that matches a substring of β with less than k differences. The viewpoint
is reversed, but the same machinery works.
The solution is to run the k-differences algorithm with string α[j..n] playing the role
of P and β playing the role of T. The algorithm computes the farthest-reaching d-paths,
for d up to k - 1, in each diagonal. If row n is reached by any d-path for d ≤ k - 1, then the
entire string α[j..n] matches a substring of β with less than k differences, so no acceptable
primer can start at j. But, if none of the farthest-reaching (k - 1)-paths reach row n, then
there is an acceptable primer starting at position j. In detail, if none of the farthest-reaching
d-paths for d = k - 1 reach row r < n, then the substring γ = α[j..r] has edit
distance at least k from every substring in β. Moreover, if r is the smallest row with that
property, then α[j..r] is the shortest substring starting at j that has edit distance at least k
from every substring in β.
The above algorithm is applied to each potential starting position j in α, yielding the
following theorem:
Theorem 12.2.6. If α has length n and β has length m, then the k-difference primer
selection problem can be solved in O(knm) total time.
takes only O(kn) worst-case time. If no spaces are allowed in the alignment of P to T'
(only matches and mismatches) then the simpler O(kn)-time approach based on longest
common extension (Section 9.1) can be used, or if attention is paid to exactly where in P
any match is found, then O(n) time suffices for each check.
The expected number of indices in Z is at most m(k + 1)/σ^r, so the expected time for all
the checks is at most

    mn²(k + 1)/σ^r < cm,

which holds when σ^r ≥ n²(k + 1)/c. Since k + 1 ≤ n, it suffices that σ^r = n³/c, so
r = log_σ n³ - log_σ c. But r = ⌊n/(k + 1)⌋, so the bound holds when the error rate k/n
is small enough that n/(k + 1) ≥ log_σ n³ - log_σ c.
Figure 12.7: The first k + 1 regions of the partition of P, aligned to T'.
insertions are allowed). Lemma 12.3.1 leads to the following approximate matching algorithm:
Algorithm BYP
a. Let 𝒫 be the set of k + 1 substrings of P taken from the first k + 1 regions of P's partition.
b. Build a keyword tree (Section 3.4) for the set of "patterns" 𝒫.
c. Using the Aho-Corasick algorithm (Section 3.4), find Z, the set of all starting locations in
T where any pattern in 𝒫 occurs exactly.
d. For each index i ∈ Z use an approximate matching algorithm (usually based on dynamic
programming) to locate the end points of all approximate occurrences of P in the substring
T[i - n - k..i + n + k] (i.e., in an appropriate-length interval around i); a sketch of the whole procedure follows.
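A compact Python sketch of BYP (the names are ours). Steps b and c are implemented here with repeated str.find rather than a keyword tree, which changes the constants but not the output, and check stands for whatever approximate matcher is used in step d:

    def byp(P, T, k, check):
        # check(lo, hi) is any approximate matcher applied to T[lo:hi];
        # it should return the end points of approximate occurrences found there.
        n = len(P)
        r = n // (k + 1)                          # region length
        regions = [P[j * r:(j + 1) * r] for j in range(k + 1)]
        Z = set()
        for pat in regions:                       # exact starting locations in T
            start = T.find(pat)
            while start != -1:
                Z.add(start)
                start = T.find(pat, start + 1)
        hits = []
        for i in sorted(Z):                       # step d: check around each i
            lo, hi = max(0, i - n - k), min(len(T), i + n + k)
            hits.extend(check(lo, hi))
        return hits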
By Lemma 12.3.1, it is easy to establish that the algorithm correctly finds all approximate occurrences of P in T. The point is that the interval around each i is "large enough" to
align with any approximate occurrence of P that spans i, and there can be no approximate
occurrence of P outside such an interval. A formal proof is left as an exercise. Now we
focus on specific implementation details and time analysis.
Building the keyword tree takes O(n) time, and the Aho-Corasick algorithm takes O(m)
(worst-case) time (Section 3.4). So steps b and c take O(n + m) time. There are a number
of alternate implementations for steps b and c. One is to build a suffix tree for T, and
then use it to find every occurrence in T of a pattern in 𝒫 (see Section 7.1). However, that
would be very space intensive. A space-efficient version of this approach is to construct a
generalized suffix tree for only 𝒫, and then match T to it (in the way that matching statistics
are computed in Section 7.8.1). Both approaches take O(n + m) worst-case time, but are
no faster in expected time because every character in T is examined. A faster approach in
practice is to use the Boyer-Moore set matching method based on suffix trees, which was
developed in Section 7.16. That algorithm will skip over parts of T, and hence it breaks
the O(m) bottleneck. A different variation was developed by Wu and Manber [482], who
implement steps b and c using the Shift-And method (Section 4.2) on a set of patterns.
Another approach, found in the paper of Pevzner and Waterman [373] and elsewhere, uses
hashing to identify long exact matching substrings of P and T. Of course, one can use
suffix trees to find long common substrings, and one could develop a Karp-Rabin type
method as well. Hashing, or approaches based on suffix trees, that look directly for long
common substrings between P and T, seem a bit more robust than BYP because there is
no string partition involved. But the only stated time bounds in [373] are the same as those
for BYP.
In the checking phase, step d, the algorithm executes some approximate matching
algorithm between P and an interval of T of length O(n), for each index in Z. Naively, each
of these checks can be done in O(n²) time by dynamic programming (global alignment).
Even this time bound will be adequate to establish an expected O(m) overall running
time for the range of error rates that will be detailed below. Alternately, the Landau-Vishkin method (Section 12.2) based on suffix trees could be used, so that each check
The CL search is executed on 2m/n regions of T. For any region R let j' be the last
value of j (i.e., the value of j when the count reaches k or when j - j* exceeds n/2). Thus, in
R, matching statistics are computed for the interval of length j' - j* < n/2. With the
matching statistics algorithm in Section 7.8.1, the time used to compute those matching
statistics is O(j' - j*). Now the expected value of j' - j* is less than or equal to k times
the expected value of ms(i), for any i. Let E(M) denote the expected value of a matching
statistic, and let e denote the expected number of regions that survive the search phase.
Then the expected time for the search phase is O(2mk E(M)/n), and the expected time
for the checking phase is O(kne).
In the following analysis, we assume that P is a random string where each character is
chosen uniformly from an alphabet of size σ.
Lemma 12.3.3. E(M), the expected value of a matching statistic, is O(log_σ n).
PROOF For fixed length d, there are roughly n substrings of length d in P, and there are
σ^d strings of length d that can be constructed. So, for any specific string α of length
d, the probability that α is found somewhere in P is less than n/σ^d. This is true for any
d, but vacuously true until σ^d = n (i.e., when d = log_σ n).
Let X be the random variable that has value log_σ n for ms(i) ≤ log_σ n; otherwise it has
value ms(i). Then
Corollary 12.3.1. The expected time that CL spends in the search phase is O(2mk log_σ n/n),
which is sublinear in m for k < n/log_σ n.
The analysis for e, the expected number of surviving regions, is too difficult to present
here. It is shown in [94] that when k = O(n/log_σ n), then e = m/n⁴, so the expected
time that CL spends in the checking phase is O(km/n³) = o(m). The search phase of CL
is so effective in excluding regions of T that the checking phase has very small expected
running time.
Figure 12.8: Each full region in T has length r = n/2. This assures that no matter how P is aligned with
T, P spans one full region.
Figure 12.9: Blowup of one region in T aligned with one copy of P. Each black box shows a mismatch
between a character in P and its counterpart in T.
any substring of P with at most k mismatches. These regions are excluded, and then an
interval around each surviving region is checked using an approximate matching method,
as in BYP. The search phase of CL relies heavily on the matching statistics discussed in
Section 7.8.1.
Recall that the value of matching statistic ms(i) is the length of the longest substring
starting at position i of T that matches a substring somewhere (an unspecified location)
in P. Recall also that for any string S, all the matching statistics for the positions in S
can be computed in O(|S|) total time. This is true even when S is a substring of a larger
string T.
Now let T' be the substring of one of the regions of T's partition that matches a substring
P' of P with at most k mismatches (see Figure 12.9). The alignment of P' and T' can be
divided into at most k + 1 intervals where no mismatches occur, alternating with intervals
containing only mismatches. Let i be the starting position of any one of those matching
intervals, and let l be its length. Then clearly, ms(i) ≥ l. The CL search phase exploits this
observation. It executes the following algorithm for each region R in the partition of T:
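The per-region loop can be sketched as follows (the names are ours; ms is computed naively here, whereas the machinery of Section 7.8.1 computes all matching statistics in linear total time):

    def cl_search(P, T, k):
        # Returns the start positions of regions that survive the search phase
        # and must then be handed to an approximate matcher, as in BYP.
        n, m = len(P), len(T)
        r = max(1, n // 2)                     # region length

        def ms(j):                             # naive matching statistic at j
            l = 0
            while j + l < m and T[j:j + l + 1] in P:
                l += 1
            return l

        survivors = []
        for start in range(0, m, r):
            j, jumps = start, 0
            while jumps <= k and j - start <= r:
                j = j + ms(j) + 1              # skip a match interval plus one mismatch
                jumps += 1
            if j - start > r:                  # k+1 greedy jumps crossed the region
                survivors.append(start)
        return survivors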
Lemma 12.3.2. When the CL search declares a region R excluded, then there is no
occurrence of P in T with at most k mismatches that completely contains region R.
The proof is easy and is left to the reader, as is its use in a formal proof of the correctness
of CL. Now we consider the time analysis.
Since the intervals of interest double in length, the time used per interval grows fourfold
in each successive iteration. However, the number of surviving matches is expected to
fall hyper-exponentially in each successive iteration, more than offsetting the increase in
computation time per interval.
With this iterative expansion, the effort expended to check any initial surviving match
is doled out incrementally throughout the O(log n) iterations, and is not continued
for any surviving match past an iteration where it is excluded. We now describe in a bit
more detail how the initial surviving matches are found and how they are incrementally
extended in successive iterations.
For example, over the two-letter alphabet {a, b}, if S = aba and d = 1, then the
1-neighborhood of S is {bba, aaa, abb, aaba, abaa, baba, abba, abab, ba, aa, ab}. It is
created from S by the operations of mismatch, insertion, and deletion respectively. The
condensed d-neighborhood of S is created from the d-neighborhood of S by removing
any string that is a prefix of another string in the d-neighborhood. The condensed
1-neighborhood of S is {bba, aaa, aaba, abaa, baba, abba, abab}.
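A small Python sketch that generates these sets (the names are ours; the d-neighborhood is built by repeated single edits, and S itself is dropped at the end to match the example above):

    def neighborhood(s, d, alphabet):
        # All strings obtainable from s by at most d single-character edits.
        out = {s}
        for _ in range(d):
            new = set(out)
            for t in out:
                for i in range(len(t) + 1):
                    for c in alphabet:
                        new.add(t[:i] + c + t[i:])          # insertion before i
                        if i < len(t):
                            new.add(t[:i] + c + t[i + 1:])  # substitution at i
                    if i < len(t):
                        new.add(t[:i] + t[i + 1:])          # deletion of t[i]
            out = new
        out.discard(s)
        return out

    def condensed(nbhd):
        # Drop any string that is a proper prefix of another string in the set.
        return {t for t in nbhd
                if not any(u != t and u.startswith(t) for u in nbhd)}

    # condensed(neighborhood("aba", 1, "ab"))
    #   == {"bba", "aaa", "aaba", "abaa", "baba", "abba", "abab"}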
Recall that pattern P is initially partitioned into subpatterns of length log_σ m (assumed
to be an integer). Let 𝒫 be the set of these subpatterns. In the first iteration, the algorithm
(conceptually) constructs the condensed d-neighborhood for each subpattern in 𝒫, and
then finds all locations of substrings in text T that exactly match one of the substrings
in one of the condensed d-neighborhoods. In this way, the method finds all substrings of
T that ε-match one of the subpatterns in 𝒫. These ε-matches form the initial surviving
matches.
In actuality, the tasks of generating the substrings in the condensed d-neighborhoods
and of searching for their exact occurrences in T are intertwined and require text T to
have been preprocessed into some index structure. This structure could be a suffix tree, a
suffix array or a hash table holding short substrings of T. Details are found in [342].
Myers [342] shows that when the length of the subpatterns is O(log_σ m), then the first
iteration can be implemented to run in O(km^{p(ε)} log m) expected time. The function p(ε)
is complicated, but it is concave (negative second derivative) increasing, and increases more
slowly as the alphabet size grows. For DNA, it has value less than one for ε ≤ 1/3, and for
proteins it has value less than one for ε ≤ 0.56.
Successive iterations
To explain the central idea, let α = α0α1, where |α0| is assumed equal to |α1|.
Lemma 12.3.4. Suppose α ε-matches β. Then β can be divided into two substrings β0
and β1 such that β = β0β1, and either α0 ε-matches β0 or α1 ε-matches β1.
This lemma (used in reverse) is the key to determining how to expand the intervals
around the surviving matches in each iteration. For simplicity, assume that n is a power of
two and that log_σ m is also a power of two. Let B be a binary tree representing successive
divisions of P into two equal-size parts, until each part has length log_σ m (see Figure 12.10).
The substrings written at the leaves are the subpatterns used in the first iteration of Myers's
algorithm. Iteration i of the algorithm examines substrings of P that label (some) nodes
of B i levels above the leaves (counting the leaves as level 1).
here, but we can introduce some of the ideas it uses to address deficiencies in the other
exclusion methods.
There are two basic problems with the Baeza-Yates-Perlberg and the Chang-Lawler
methods (and the other exclusion methods we have mentioned). First, the exclusion criteria
they use permit a large expected number of surviving regions compared to the expected
number of true approximate matches. That is, not every initial surviving region is actually contained in an approximate match, and the ratio of expected survivors to expected
matches is fairly high (for random patterns and text). Further, the higher the permitted
error rate, the more severe is the problem. Second, when a surviving region is first located,
the methods move directly to full dynamic programming computations (or some other relatively expensive operations) to check for an approximate match in a large interval around
the surviving region. Hence the methods are required to do a large amount of computation
for a large number of intervals that don't contain any approximate match.
Compared to the other exclusion methods, Myers's method contains two different ideas
to make it both more selective (finding fewer initial surviving regions) and less expensive
to test the ones that are found. Myers's algorithm begins in a manner similar to the other
exclusion methods. It partitions P into short substrings (to be specified later) and then
finds all locations in T where these substrings appear with a small number of allowed
differences. The details of the search are quite different from the other methods, but the
intent (to exclude a large portion of T from further consideration) is the same. Each of
these initial alignments of a substring of P that is found (approximately) in T is called
a surviving match. A surviving match roughly plays the role of a surviving region in the
other exclusion methods, but it specifies two substrings (one in P and one in T ) rather
than just a single substring, as a surviving region does. Another way to think of a surviving
match is as a roughly diagonal subpath in the alignment graph for P and T.
Having found the initial surviving matches (or surviving regions), all the other exclusion
methods we have mentioned would next check a full interval of length roughly 2n around
each surviving region in T to see if it contains an approximate match to P. In contrast,
Myers's method will incrementally extend and check a growing interval around each initial
surviving match to create longer surviving matches or to exclude a surviving match from
further consideration. This is done in about O(log n) iterations. (Recall that n is the length
of the pattern and m is the length of the text.)
Definition For a given error rate ε, a string S ε-matches a substring of T if S matches
the substring using at most ε|S| insertions, deletions, and mismatches.
For example, let S = aba and ε = 2/3. Then ac ε-matches S using one mismatch and
one deletion operation.
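The definition is easy to check with a plain edit-distance computation; a minimal Python sketch (the names are ours):

    def eps_match(S, sub, eps):
        # Edit distance between S and sub, compared against the budget eps*|S|.
        n, m = len(S), len(sub)
        D = list(range(m + 1))
        for i in range(1, n + 1):
            prev, D[0] = D[0], i
            for j in range(1, m + 1):
                cost = prev + (S[i - 1] != sub[j - 1])
                prev, D[j] = D[j], min(cost, D[j] + 1, D[j - 1] + 1)
        return D[m] <= eps * len(S)

    # eps_match("aba", "ac", 2/3) -> True: one mismatch plus one deletion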
In the first iteration, the pattern P is partitioned into consecutive, nonoverlapping subpatterns of length log_σ m (assumed to be an integer), and the algorithm finds all substrings
in T that ε-match one of these short subpatterns (discussed in more detail below). The
length of these subpatterns is short enough that all the ε-matches can be found in sublinear expected time for a wide range of ε values. These ε-matches are the initial surviving
matches.
The algorithm next tries to extend each initial surviving match to become an ε-match
between substrings (in P and T) that are roughly twice as long as those in the current
surviving match. This is done by dynamic programming in an appropriate interval around
the surviving match. In each successive iteration, the method applies a more selective and
expensive filter, trying to double the length of the ε-match around each surviving match.
Two problems
We assume the existence of a scoring matrix used to compute the value of any alignment,
and hence "edit distance" here refers to weighted edit distance. We will discuss two
problems in the text and introduce two more related problems in the exercises.
1. The P-against-all problem Given strings P and T, compute the edit distance between
P and every substring T' of T.
2. The threshold all-against-all problem Given strings P and T and a threshold d, find
every pair of substrings P' of P and T' of T such that the edit distance between P' and
T' is less than d.
The threshold all-against-all problem is similar to problems mentioned in Section 12.2.1
concerning the construction of nonredundant sequence databases. However, the threshold
all-against-all problem is harder, because it asks for the alignment of all pairs of substrings,
Figure 12.10: Binary tree B defining the successive divisions of P and its partition into regions of length
log_σ m (equal to two in this figure).
Suppose at iteration i - 1 that substrings P' and T' in the query and text, respectively,
form a surviving match (i.e., are found to align to form an ε-match). Let P'' be the parent
of P' in tree B. If P' is a left child of P'', then in iteration i, the algorithm tries to ε-match
P'' to a substring of T in an interval that extends T' to the right. Conversely, if P' is a right
child of P'', then the algorithm tries to ε-match P'' with a substring in an interval that
extends T' to its left. By Lemma 12.3.4, if the ε-match of P' to T' is part of an ε-match
of P to a substring of T, then P'' will ε-match the appropriate substring of T. Moreover,
the specified interval in T that must be compared against P'' is just twice as long as the
interval for T'. The end result, as detailed in [342], is that all of the checking, and hence
the entire algorithm, runs in O(km^{p(ε)} log m) expected time.
There are several points to emphasize. First, the exposition given above is only intended
to be an outline of Myers's method, without any analysis. The full details of the algorithm
and analysis are found in [342]; [337] provides an overview, in relation to other exclusion
methods. Second, unlike the BYP and CL methods, the error rates that establish sublinear
(or linear) running times do not depend on the length of P. In BYP and CL, the permitted
error rate decreases as the length of P increases. In Myers's method, the permitted error
rate depends only on the alphabet size. Third, although the expected running times for
both CL and for Myers's method are sublinear (for the proper range of error rates), there
is an important difference in the nature of these sublinearities. In the CL method, the
sublinearity is due to a multiplicative factor that is less than one. But in Myers's method,
the sublinearity is due to an exponent that is less than one. So as a function of m, the CL
bound increases linearly (although for any fixed value of m the expected running time is
less than m), while the bound for Myers's method increases sublinearly in m. This is an
important distinction since many databases are rapidly increasing in size.
However, Myers's method assumes that the text T has already been preprocessed into
some index structure, and the time for that preprocessing (while linear in m) is not included
in the above time bounds. In contrast, the running times of the BYP and CL methods include
all the work needed for those methods. Finally, Myers has shown that in experiments on
problems of meaningful size in molecular biology (patterns of length 80 on texts of length
3 million), the k-difference algorithms of Sections 12.2.4 and 12.2.3 run 100 to 500 times
slower than his expected sublinear method.
Figure 12.11: A cartoon of the dynamic programming tables for computing the edit distance between P
and substring T' (top) and between P and substring T'' (bottom). The two tables share the subtable for P
and substring A (shown as a shaded rectangle). This shaded subtable only needs to be computed once.
Figure 12.12: A piece of the suffix tree for T. The traversal from the root to node v is accompanied by the
computation of subtable A (from the previous figure). At that point, the last row and column of subtable A
are stored at node v. Computing the subtable B corresponds to the traversal from v to the leaf representing
substring T'. After the traversal reaches the leaf for T', it backs up to node v, retrieves the row and column
stored there, and uses them to compute the subtable C needed to compute the edit distance between P
and T''.
between P and every substring beginning at position i of T. When the depth-first traversal
backs up to a node v, and v has an unvisited child v', the row and column stored at v are
retrieved and extended as the traversal follows a new (v, v') edge (see Figure 12.12).
It should be clear that this suffix-tree approach does correctly compute the edit distance
between P and every substring of T, and it does exploit repeated substrings (small or
large) that may occur in T. But how effective is it compared to the Θ(nm²)-time dynamic
programming approach?
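The mechanism can be sketched in Python with a naive trie of T's suffixes standing in for the compacted suffix tree (the names are ours; the trie is quadratic in size, which is fine for illustration). One dynamic programming column is extended per trie edge, and the column for a branch node is kept on the recursion stack, so a substring repeated in T is processed only once:

    def p_against_all(P, T):
        # Edit distance between P and every substring of T.
        trie = {}
        for s in range(len(T)):                  # naive suffix trie of T
            node = trie
            for c in T[s:]:
                node = node.setdefault(c, {})
        n = len(P)
        dist = {}

        def dfs(node, col, sub):
            for c, child in node.items():
                new = [col[0] + 1]               # extend the column by character c
                for i in range(1, n + 1):
                    new.append(min(new[i - 1] + 1,
                                   col[i] + 1,
                                   col[i - 1] + (P[i - 1] != c)))
                dist[sub + c] = new[n]           # distance of P to this substring
                dfs(child, new, sub + c)         # ancestor columns stay on the stack

        dfs(trie, list(range(n + 1)), "")
        return dist

    # p_against_all("ab", "aab")["ab"] -> 0;  ...["aa"] -> 1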
not just the alignment of all pairs of strings. This critical distinction has been the source
of some confusion in the literature [50], [56].
Recent estimates put the amount of repeated human DNA at 50 to 60%. That is, 50 to 60% of all human DNA is
contained in nontrivial-length, structured substrings that show up repeatedly throughout the genome. Similar levels
of redundancy appear in many other organisms.
in DNA) should give rise to suffix trees with lengths that are small enough to make this
method useful. We examined this question empirically for DNA strings up to one million
characters, and the lengths of the resulting suffix trees were around m²/10.
An O(C + R)-time method
The method uses a suffix tree 𝒯_P for string P and a suffix tree 𝒯_T for string T. The worst-case time for the method will be shown to be O(C + R), where C is the length of 𝒯_P
times the length of 𝒯_T, independent of whatever the output criteria are, and R is the size
of the output. (The definition of the length of a suffix tree is found in Section 12.4.1.)
That is, the method will compute certain dynamic programming cell values, which will
be the same no matter what the output criteria are, and then when a cell value satisfies the
particular output criteria, the algorithm will collect the relevant substrings associated with
that cell. Hence our description of the method holds for the full all-against-all problem,
the threshold version of the problem, or any other version with different reporting criteria.
To start, recall that each node in 𝒯_P represents a substring of P and that every substring
of P is a prefix of a substring represented by a node of 𝒯_P. In particular, each suffix of P
is represented by a leaf of 𝒯_P. The same is true of T and 𝒯_T.
Definition The dynamic programming table for a pair of nodes (u, v), from 𝒯_P and 𝒯_T,
respectively, is defined as the dynamic programming table for the edit distance between
the string represented by node u and the string represented by node v.
Definition The string-length of an edge label in a suffix tree is the length of the string
labeling that edge (even though the label is compactly represented by a constant number
of characters). The length of a suffix tree is the sum of the string-lengths for all of its
edges.
The length of a suffix tree 𝒯 for a string T of length m can be anywhere between Θ(m)
and Θ(m²), depending on how much repetition exists in T. In computational experiments
using long substrings of mammalian DNA (length around one million), the string-lengths
of the resulting suffix trees have been around m²/10. Now the number of dynamic programming columns that are generated during the depth-first traversal of 𝒯 is exactly the
length of 𝒯. Each column takes O(n) time to generate, and so we can state
Lemma 12.4.1. The time used to generate the needed columns in the depth-first traversal
is O(n × (length of 𝒯)).
We must also account for the time and space used to write the rows and columns stored
at each node of 𝒯. In a suffix tree with m leaves there are O(m) internal nodes, and a single
row and column take at most O(m + n) time and space to write. Therefore, the time and
space needed for the row and column stores is Θ(m² + nm) = O(m²). Hence, we have
Theorem 12.4.1. The total time for the suffix-tree approach is O(n × (length of 𝒯) + m²),
and the maximum space used is O(m²).
Reducing space
The size of the required output is Θ(m²), since the problem calls for the edit distance
between P and each of Θ(m²) substrings of T, making the Θ(m²) term in the time bound
acceptable. On the other hand, the space used seems excessive, since the space needed by
the dynamic programming solution without using a suffix tree is just O(nm) and can be
reduced to O(m). We now modify the suffix-tree approach to also use only O(n + m)
space and the same time bounds as before.
First, there is no need to store the current column at each node v. When backing up
from a child v' of v, we can use the current column at v' and the string labeling edge
(v, v') to recompute the column for node v. This does, however, double the total time for
computing the columns. There is also no need to keep the current row n at each node v.
Instead, only O(n) space is needed for row entries. The key idea is that the current table is
expanded columnwise, so if the string-depth of v is j and the string-depth of v' is j + d,
then the row n stored at v and v' would be identical for the first j entries. We leave it as
an exercise to work out the details. In summary, we have
Theorem 12.4.2. The hybrid suffix-tree/dynamic-programming approach to the P-against-all problem can be implemented to run in O(n × (length of 𝒯) + m²) time and O(n + m)
space.
The above time and space bounds should be compared to the Θ(nm²) time and O(n + m)
space bounds that result from a straightforward application of dynamic programming. The
effectiveness in practice of this method depends on the length of 𝒯 for realistic strings. It is
known that for random strings, the length of 𝒯 is Θ(m²), making the method unattractive.
(For random strings, the suffix tree is bushy for string-depths of log_σ m or less, where σ
is the size of the alphabet. But beyond that depth, the suffix tree becomes very sparse,
since the probability is very low that a substring of length greater than log_σ m occurs
more than once in the string.) However, strings with more structured repetitions (as occur
' Actually, any topological numbering will do, but string-depth has some advantages when heuristic accelerations
are added.
Figure 12.13: The dynamic programming table for (u, v) is shown below the suffix trees for P and T. The
string on the path to node u is Zα and the string to node v is XYβ. Every cell in the (u, v) table, except any
in the lower right rectangle, is also in the (u, v'), (u', v), or (u', v') tables. The new part of the (u, v) table
can be computed from the shaded entries and substrings α and β. The shaded entries contain exactly one
entry from the (u', v') table; |α| entries from the last column in the (u, v') table; and |β| entries from the last
row in the (u', v) table.
Lemma 12.4.2. Let u' be the parent of node u in 𝒯_P and let α be the string labeling
the edge between them. Similarly, let v' be the parent of v in 𝒯_T and let β be the string
labeling the edge between them. Then, all but the bottom right |α| × |β| entries in the dynamic
programming table for the pair (u, v) appear in one of the tables for (u', v'), (u', v), or
(u, v'). Moreover, that bottom right part of the (u, v) table can be obtained from the other
three tables in O(|α||β|) time. (See Figure 12.13.)
The proof of this lemma is immediate from the definitions and the edit distance recurrences.
The computation for the new part of the (u, v) table produces an |α| by |β| rectangular
subtable that forms the lower right section of the (u, v) table. In the algorithm to be
developed below, we will store and associate with each node pair (u, v) the last column
and the last row of this |α| by |β| subtable.
We can now fully describe the algorithm.
For example, if Π = 5, 3, 4, 9, 6, 2, 1, 8, 7, 10, then 3, 4, 6, 8, 10 and 5, 9, 10 are
both increasing subsequences in Π. (Recall the distinction between subsequences and substrings.) We are interested in the problem of computing a longest increasing subsequence
in Π. The method we develop here will later be used to solve the problem of finding the
longest common subsequence of two (or more) strings.
Definition A decreasing subsequence of Π is a subsequence of Π whose numbers
are nonincreasing from left to right.
We will develop an O(n log n)-time method that simultaneously constructs a longest
increasing subsequence (lis) and a smallest cover of Π. The following lemma is the key.
" \
Lemma 12.4.3. The time used by the algorithm for all the needed dynamic programming
computations and cell examinations is proportional to the product of the length of 𝒯_P and
the length of 𝒯_T. Hence that time, defined as C, ranges between nm and n²m².
PROOF In the algorithm, each pair of nodes is processed exactly once. At the point a
pair (u, v) is processed, the algorithm spends O(|α||β|) time to compute a subtable and
examine it, where α and β are the labels on the edges into u and v, respectively. Each
edge-label in 𝒯_P therefore forms exactly one dynamic programming table with each of
the edge-labels in 𝒯_T. The time to build those tables is |α| × (length of 𝒯_T). Summing over
all edges in 𝒯_P gives the claimed time bound.
The above lemma counts all the time used in the algorithm except the time used to
collect and report pairs of substrings (by their starting position, length, and edit distance).
But since the algorithm collects substrings when it sees a cell value that satisfies the
reporting criteria, the time devoted to output is just the time needed to traverse the tree to
collect output pairs. We have already seen that this time is proportional to the number of
pairs collected, R. Hence, we have
Theorem 12.4.3. The complete time for the algorithm is O(C + R).
We will shortly see how to reduce the time needed to find the greedy cover to O(n log n),
but we first show that the greedy cover is a smallest cover of Π and that a longest increasing
subsequence can easily be extracted from it.
Algorithmically, we can find a longest increasing subsequence given the greedy cover
as follows:
end.
Since no number is examined twice during this algorithm, a longest increasing subsequence can be found in O(n) time given the greedy cover.
An alternate approach is to use pointers. As the greedy cover is being constructed,
whenever a number x is added to subsequence i, connect a pointer from x to the number
at the current end of subsequence i - 1. After the greedy algorithm finishes, pick any
number in the last decreasing subsequence and follow the unique path of pointers starting
from it and ending at the first subsequence.
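A Python sketch combining the greedy cover with the pointer idea (the names are ours); binary search on the list of subsequence tails keeps the cost at O(n log n):

    import bisect

    def lis_via_greedy_cover(pi):
        L = []                 # L[i] = number currently ending subsequence i
        ends = []              # ends[i] = index in pi of that number
        back = [None] * len(pi)
        for idx, x in enumerate(pi):
            i = bisect.bisect_left(L, x)   # left-most subsequence with tail >= x
            if i == len(L):
                L.append(x); ends.append(idx)
            else:
                L[i] = x; ends[i] = idx
            if i > 0:
                back[idx] = ends[i - 1]    # pointer to end of subsequence i-1
        seq, j = [], ends[-1]              # walk the pointers backward
        while j is not None:
            seq.append(pi[j]); j = back[j]
        return seq[::-1], len(L)           # an lis and the size of the cover

    # lis_via_greedy_cover([5, 3, 4, 9, 6, 2, 1, 8, 7, 10])
    #   -> ([3, 4, 6, 7, 10], 5)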
Lemma 12.5.1. If I is an increasing subsequence of Π with length equal to the size of a
cover of Π, call it C, then I is a longest increasing subsequence of Π and C is a smallest
cover of Π.
Lemma 12.5.1 is the basis of a method to find a longest increasing subsequence and
a smallest cover of Π. The idea is to decompose Π into a cover C such that there is an
increasing subsequence I containing exactly one number from each decreasing subsequence in C. Without concern for efficiency, a cover of Π can be built in the following
straightforward way:
Naive cover algorithm Starting from the left of Π, examine each successive number in Π and place it at the end of the first (left-most) decreasing subsequence that it
can extend. If there are no decreasing subsequences it can extend, then start a new
(decreasing) subsequence to the right of all the existing decreasing subsequences.
To elaborate, if x denotes the current number from Π being examined, then x extends a
subsequence i if x is smaller than or equal to the current number at the end of subsequence
i, and if x is strictly larger than the last number of each subsequence to the left of i.
For example, with Π as before, the first two numbers examined are put into a decreasing
subsequence {5, 3}. Then the number 4 is examined, which is in position 3 of Π. Number
4 cannot be placed at the end of the first subsequence because 4 is larger than 3. So 4
begins a new subsequence of its own to the right of the first subsequence. Next, the number
9 is considered, and since it cannot be added to the end of either subsequence {5, 3} or {4},
it begins a third subsequence. Next, 6 is considered; it can be added to 9 but not to the
end of either of the two subsequences to the left of 9. The final cover of Π produced by the
algorithm is shown in Figure 12.15, where each subsequence runs vertically.
Clearly, this algorithm produces a cover of Π, which we call the greedy cover. To see
whether a number x can be added to any particular decreasing subsequence, we only have
to compare x to the number, say y, currently at the end of the subsequence - x can be added
if and only if x ≤ y. Hence if there are k subsequences at the time x is considered, then
the time to add x to the correct subsequence is O(k). Since k ≤ n, we have the following:
the list associated with the character S1(i). For example, the list Π(S1, S2) for the above two
strings is 6, 3, 2, 4, 1, 6, 3, 2, 5.
To understand the importance of Π(S1, S2), we examine what an increasing subsequence
in that list means in terms of the original strings.
Π(S1, S2) is a list of r integers, and the longest increasing subsequence problem can be
solved in O(r log l) time on an r-length list when the longest increasing subsequence is
of length l. If n ≤ m then l ≤ n, yielding the following theorem:
Theorem 12.5.3. The longest common subsequence problem can be solved in O(r log n)
time.
The O(r log n) result for lcs was first obtained by Hunt and Szymanski [238]. Their
algorithm is superficially very different from the one above, but in retrospect one can see
similar ideas embodied in it. The relationship between the lcs and lis problems was partly
identified by Apostolico and Guerra [25, 27] and made explicit by Jacobson and Vo [2U]
and independently by Pevzner and Waterman [370].
The lcs method based on lis is an example of what is called sparse dynamic programming, where the input is a relatively sparse set of pairs that are permitted to align. This
approach, and in fact the solution technique discussed here, has been very extensively
generalized by a number of people and appears in detail in [137] and [138].
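The whole reduction fits in a few lines of Python (the names are ours; the lis step here returns only the length, via the greedy cover with binary search):

    from bisect import bisect_left

    def lcs_length(S1, S2):
        # Build Pi(S1, S2): for each character of S1 in order, splice in the
        # decreasing list of positions where that character occurs in S2.
        pos = {}
        for j, c in enumerate(S2, 1):
            pos.setdefault(c, []).insert(0, j)
        Pi = [p for c in S1 for p in pos.get(c, [])]
        # The length of a longest strictly increasing subsequence of Pi
        # equals the length of an lcs of S1 and S2.
        tails = []
        for x in Pi:
            i = bisect_left(tails, x)
            if i == len(tails):
                tails.append(x)
            else:
                tails[i] = x
        return len(tails)

    # lcs_length("abacx", "baabca") -> 3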
is, the last number from any subsequence i - 1 appears in L before the last number from
subsequence i.
Lemma 12.5.4. At any point in the execution of the algorithm, the list L is sorted in
increasing order.
PROOF Assume inductively that the lemma holds through iteration k - 1. When examining
the kth number in Π, call it x, suppose x is to be placed at the end of subsequence i. Let
w be the current number at the end of subsequence i - 1, let y be the current number at
the end of subsequence i (if any), and let z be the number at the end of subsequence i + 1
(if it exists). Then w < x ≤ y by the workings of the algorithm, and since y < z by the
inductive assumption, x < z also. In summary, w < x < z, so the new list L
remains sorted.
Note that L itself need not be (and generally will not be) an increasing subsequence
of Π. Although x < z, x appears to the right of z in Π. Despite this, the fact that L is
in sorted order means that we can use binary search to implement each iteration of the
algorithm building the greedy cover. Each iteration k considers the kth number x in Π
and the current list L to find the left-most number in L greater than or equal to x. Since L is in sorted
order, this can be done in O(log n) time by binary search. The list Π has n numbers, so
we have
Theorem 12.5.1. The greedy cover can be constructed in O(n log n) time. A longest
increasing subsequence and a smallest cover of Π can therefore be found in O(n log n)
time.
In fact, if p is the length of the lis, then it can be found in O(n log p) time.
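
For concreteness, here is a sketch of the greedy cover construction (our Python rendering, not the book's pseudocode). By Lemma 12.5.4 the list L of subsequence tails stays sorted, so each number of Π is placed with a single binary search:

from bisect import bisect_left

def greedy_cover(pi):
    L = []        # tails of the decreasing subsequences; sorted (Lemma 12.5.4)
    cover = []    # cover[i] is decreasing subsequence i, left to right
    for x in pi:
        i = bisect_left(L, x)     # left-most tail y with x <= y
        if i == len(L):           # x is larger than every tail, so it
            L.append(x)           # begins a new subsequence on the right
            cover.append([x])
        else:
            L[i] = x              # x extends subsequence i
            cover[i].append(x)
    return cover                  # the number of subsequences equals the
                                  # length of a longest increasing subsequence

print(greedy_cover([5, 3, 4, 9, 6]))   # -> [[5, 3], [4], [9, 6]], as in the text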
r = Σ_{i=1}^{n} r(i).
For example, suppose we are using the normal English alphabet; when S1 = abacx
and S2 = baabca then r(1) = 3, r(2) = 2, r(3) = 3, r(4) = 1, and r(5) = 0, so r = 9.
Clearly, for any two strings, r will fall in the range 0 to nm. We will solve the lcs problem in
O(r log n) time (where n ≤ m), which is inferior to O(nm) when r is large. However,
r is often substantially smaller than nm, depending on the alphabet Σ. We will discuss
this more fully later.
The reduction

For each alphabet character x that occurs at least once in S1, create a list of the positions
where character x occurs in string S2; write this list in decreasing order. Two distinct
alphabet characters will have totally disjoint lists. In the above example (S1 = abacx and
S2 = baabca) the list for character a is 6, 3, 2 and the list for b is 4, 1.
Now create a list called Π(S1, S2) of length r, in which each character instance in S1 is
replaced with the associated list for that character. That is, for each position i in S1, insert
abacx and S2 = baabca (as above) and S3 = babbac, then the list for character a is
(6,5), (6,2), (3,5), (3,2), (2,5), (2,2).
The lists for each character are again concatenated in the order that the characters
appear in string S1, forming the sequence of pairs Π(S1, S2, S3). We define an increasing
subsequence in Π(S1, S2, S3) to be a subsequence of pairs such that the first numbers of
the pairs form an increasing subsequence of integers, and the second numbers of the
pairs also form an increasing subsequence of integers. We can easily modify the greedy
cover algorithm to find a longest increasing subsequence of pairs under this definition.
This increasing subsequence is used as follows.
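
A small sketch of this construction (the helper below is ours) that reproduces the pair list for character a in the example above:

from itertools import product

def pair_list(s1, s2, s3):
    pos2, pos3 = {}, {}
    for j, c in enumerate(s2, start=1):
        pos2.setdefault(c, []).append(j)
    for k, c in enumerate(s3, start=1):
        pos3.setdefault(c, []).append(k)
    pi = []
    for c in s1:
        # all (position in s2, position in s3) pairs for c, in decreasing
        # lexicographic order, concatenated in the order of s1's characters
        pi.extend(sorted(product(pos2.get(c, []), pos3.get(c, [])),
                         reverse=True))
    return pi

print(pair_list("abacx", "baabca", "babbac")[:6])
# -> [(6, 5), (6, 2), (3, 5), (3, 2), (2, 5), (2, 2)]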
Under this weighting model, the cost to initiate a gap is at most 35.03, and it declines
with increasing evolutionary (PAM) distance between the two sequences. In addition to
this initiation weight, the function adds 17.02 log₁₀ q for the actual length, q, of the gap.
It is hard to believe that a function this precise could be correct, but the key point is
that, for a fixed PAM distance, the proposed gap weight is a convex function of its length.¹
The alignment problem with convex gap weights is more difficult to solve than with
affine gap weights, but it is not as difficult as the problem with arbitrary gap weights. In
this section we develop a practical algorithm to optimally align two strings of lengths n
and m > n, when the gap weights are specified by a convex function of the gap length.
The algorithm runs in O(nm log m) time, in contrast to the O(nm)-time bound for affine
gap weights and the O(nm²) time for arbitrary gap weights. The speedup for the convex
case was established by Miller and Myers [322] and independently by Galil and Giancarlo
¹ Unfortunately, there is no standard agreement on terminology, and some of the papers refer to the model as the
"convex" gap weight model, while others call it the "concave" gap model. In this book, a convex function is one
with a negative or zero second derivative, and a concave function is one with a positive second derivative.
Constrained lcs

The lcs method based on lis has another advantage over the standard dynamic programming
approach. In some applications there are additional constraints imposed on which pairs of
positions are permitted to align in the lcs. That is, in addition to the constraint that position
i in S1 can align with position j in S2 only if S1(i) = S2(j), some additional constraints
may apply. The reduction of lcs to lis can be easily modified to incorporate these additional
constraints, and we leave the details to the reader. The effect is to reduce the size of r and
consequently to speed up the entire lcs computation. This is another example and variant
of sparse dynamic programming.
those recurrences. For convenience, we restate the general recurrences for arbitrary gap
weights:

V(i, j) = max[E(i, j), F(i, j), G(i, j)],

E(i, j) = max_{0 ≤ k ≤ j-1} [V(i, k) - w(j - k)],

F(i, j) = max_{0 ≤ l ≤ i-1} [V(l, j) - w(i - l)].
Simplifying notation

The value E(i, j) depends on i only through the values V(i, k) for k < j. Hence, in any
fixed row, we can drop the reference to the row index i, simplifying the recurrence for E.
That is, in any fixed row we define

E(j) = max_{0 ≤ k ≤ j-1} [V(k) - w(j - k)].

Defining Cand(k, j) = V(k) - w(j - k), we therefore have

E(j) = max_{0 ≤ k ≤ j-1} Cand(k, j).

The term Cand stands for "candidate"; the meaning of this will become clear later.
Figure 12.16: A convex gap weight function w, illustrating that for q < q' and any fixed d,
the increase from w(q) to w(q + d) is at least the increase from w(q') to w(q' + d).
[170]. However, the solution in the second paper is given in terms of edit distance rather
than similarity. Similarity is often more useful than edit distance because it can be used to
handle the extremely important case of local comparison. Hence we will discuss convex
gap weights in terms of similarity (maximum weighted alignment) and leave it to the
reader to derive the analogous algorithms for computing edit distance with convex gap
weights. More advanced results on alignment with convex or concave gap weights appear
in [136], [138], and [276].
Recall from the discussion of arbitrary gap weights that w(q) is the weight given to a
gap of length q. That gap then contributes a penalty of -w(q) to the total weight of the
alignment.

Definition Assume that w(q) is a nonnegative function of q. Then w(q) is convex if
and only if w(q + 1) - w(q) ≤ w(q) - w(q - 1) for every q ≥ 1.
That is, as a gap length increases, the additional penalty contributed by each additional
unit of the gap decreases. It follows that w(q + d) - w(q) ≥ w(q' + d) - w(q')
for q < q' and any fixed d (see Figure 12.16). Note that the function w can have regions of
both positive and negative slope, although any region of positive slope must be to the left
of the region of negative slope. Note also that the definition allows w(q) to become negative
for large enough q. At that point, -w(q) becomes positive, which is probably not
desirable. Hence, gap weight functions with negative slope must be used with care.
The convex gap weight was introduced in [466] with the suggestion that mutational
events that insert or delete varying length blocks of DNA can be more meaningfully
modeled by convex gap weights, compared to affine or constant gap weights. A convex gap
penalty allows the modeler more specificity in reflecting the cost or probability of different
gap lengths, and yet it can be more efficiently handled than arbitrary gap weights. One
particular convex function that is appealing in this context is the log function, although it
is not clear which base of the logarithm might be most meaningful.
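
As a quick numeric illustration, the snippet below checks that a logarithmic gap weight of the form quoted earlier has shrinking increments, i.e., is convex in the sense used in this book. The particular function w(q) = 35.03 + 17.02 log₁₀ q is an assumption for the check, fixing the PAM-dependent initiation cost at its maximum:

import math

def w(q):
    # assumed illustrative gap weight: initiation cost plus log term
    return 35.03 + 17.02 * math.log10(q)

increments = [w(q + 1) - w(q) for q in range(1, 6)]
assert all(a >= b for a, b in zip(increments, increments[1:]))
print(increments)   # each additional unit of gap adds less penalty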
The argument for or against convex gap weights is still open, and the affine gap model
remains dominant in practice. Still, even if the convex gap model never becomes popular
in molecular biology it could well find application elsewhere. Furthermore, the algorithm
for alignment with convex gaps is of interest in itself, as a representative of a number of
related algorithms in the general area of "sparse dynamic programming".
To solve the convex gap weight case we use the same dynamic programming recurrences
developed for arbitrary gap weights (page 242), but reduce the time needed to evaluate
j"
j'
Figure 12.17: Graphical illustration of the key observation. Winning candidates are shown with a solid
curve and losers with a dashed curve. If the candidate from j loses to the candidate from k at cell j ' , then
the candidate from j will lose to the candidate from k at every cell j" to the right of j ' .
Key observation Let j be the current cell. If Cand(j, j') ≤ E(j') for some j' > j,
then Cand(j, j") ≤ E(j") for every j" > j'. That is, "one strike and you're out".

Hence the current cell j need not send forward any candidate values to the right of
the first cell j' > j where Cand(j, j') is less than or equal to E(j'). This suggests the
obvious practical speedup of stopping the loop labeled {Loop 1} in the forward dynamic
programming algorithm as soon as j's candidate loses. But this improvement does not
lead directly to a better (worst-case) time bound. For that, we will have to use one more
trick. But first, we prove the key observation with the following more precise lemma.
Lemma 12.6.1. Let k < j < j' < j" be any four cells in the same row. If Cand(j, j') ≤
Cand(k, j') then Cand(j, j") ≤ Cand(k, j"). See Figure 12.17 for reference.
PROOF Cand(k, j') ≥ Cand(j, j') implies that V(k) - w(j' - k) ≥ V(j) - w(j' - j),
so V(k) - V(j) ≥ w(j' - k) - w(j' - j).

Trivially, (j' - k) = (j' - j) + (j - k). Similarly, (j" - k) = (j" - j) + (j - k). For
future use, note that (j' - k) < (j" - k).

Now let q denote (j' - j), let q' denote (j" - j), and let d denote (j - k). Since j' < j",
we have q < q'. By convexity, w(q + d) - w(q) ≥ w(q' + d) - w(q') (see Figure 12.16).
Translating back, we have w(j' - k) - w(j' - j) ≥ w(j" - k) - w(j" - j). Combining
this with the result in the first paragraph gives V(k) - V(j) ≥ w(j" - k) - w(j" - j), and
rewriting gives V(k) - w(j" - k) ≥ V(j) - w(j" - j), i.e., Cand(k, j") ≥ Cand(j, j"),
as claimed. □
In the forward implementation, we first initialize a variable E(j') to Cand(0, j') for
each cell j' > 0 in the row. The E values are set left to right in the row, as in backward
dynamic programming. However, to set the value of E(j) (for any j > 0) the algorithm
merely takes the current stored value of E(j), since every cell to the left of j will already
have contributed a candidate value to cell j. Then, before setting the value of E(j + 1), the
algorithm traverses forward in the row to set E(j') (for each j' > j) to be the maximum
of the current E(j') and Cand(j, j'). To summarize, the forward implementation for a
fixed row is sketched below:
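
(A Python rendering, ours rather than the book's pseudocode; it assumes, as the text does elsewhere, that the needed F and G values for the row and the boundary value V(row, 0) have already been computed.)

def forward_row_E(V0, F, G, w, m):
    # F[j], G[j] for j = 1..m assumed precomputed; V0 = V(row, 0).
    V = [V0] + [None] * m
    E = [None] + [V0 - w(j) for j in range(1, m + 1)]   # Cand(0, j)
    for j in range(1, m + 1):
        # E(j) is final here: every k < j has already sent Cand(k, j).
        V[j] = max(G[j], E[j], F[j])
        for jp in range(j + 1, m + 1):                  # {Loop 1}
            E[jp] = max(E[jp], V[j] - w(jp - j))        # Cand(j, j')
    return V, E

This naive version does O(m²) work per row; the block-partition machinery developed next reduces it to O(m log m).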
Figure 12.19: The three possible ways that the block-partition changes after E(1) is set. The curves with
arrows represent the common pointer for the block and leave from the last entry in the block.
Cells 2 through m might get divided into two blocks, where the common pointer for the
first block is b = 1, and the common pointer for the second is b = 0. This happens (again
by Lemma 12.6.1) if and only if, for some k < m, Cand(1, j') > E(j') for j' from 2 to k,
and Cand(1, j') ≤ E(j') for j' from k + 1 to m.

Cells 2 through m might remain in a single block, but now the common pointer b is set to
1. This happens if and only if Cand(1, j') > E(j') for j' from 2 to m.
Figure 12.18: Partition of the cells j + 1 through m into maximal blocks of consecutive cells such that all
the cells in any block have the same b value. The common b value in any block is less than the common b
value in the preceding block.
k < j' that has contributed the best candidate yet seen for cell j'. Pointer b(j') is updated
every time the value of E( j') changes. The use of these pointers combined with the next
lemma leads ultimately to the desired speedup.
Lemma 12.6.2. Consider the point when j is the current cell, but before j sends forward
any candidate values. At that point, b(j') ≥ b(j' + 1) for every cell j' from j + 1 to m - 1.

PROOF For notational simplicity, let b(j') = k and b(j' + 1) = k'. Then, by the
selection of k, Cand(k, j') ≥ Cand(k', j'). Now suppose k < k'. Then, by Lemma 12.6.1,
Cand(k, j' + 1) ≥ Cand(k', j' + 1), in which case b(j' + 1) should be set to k, not k'.
Hence k ≥ k' and the lemma is proved. □
Corollary 12.6.1. At the point that j is the current cell but before j sends forward any
candidates, the values of the b pointers form a nonincreasing sequence from left to right.
Therefore, cells j, j + 1, j + 2, . . . , m are partitioned into maximal blocks of consecutive
cells such that all b pointers in the block have the same value, and the pointer values
decline in successive blocks.
Definition The partition of cells j through m referred to in Corollary 12.6.1 is called
the current block-partition. See Figure 12.18.
Given Corollary 12.6.1, the algorithm doesn't need to explicitly maintain a b pointer
for every cell but only record the common b pointer for each block. This fact will next be
exploited to achieve the desired speedup.
The E values in these three cases are the values before any E changes.
V(j) := max[G(j), E(j), F(j)];
{As before, we assume that the needed F and G values have been computed.}
{Now see how j's candidates change the block-partition.}
Set j' equal to the first entry on the end-of-block list.
{Look for the first index s in the end-of-block list where j's candidate loses.}
If Cand(b(j'), j + 1) < Cand(j, j + 1) then {j's candidate wins one}
begin
    While the end-of-block list is not empty and Cand(b(j'), j') < Cand(j, j') do
    begin
        remove the first entry on the end-of-block list,
        and remove the corresponding b-pointer;
        If the end-of-block list is not empty then
            set j' to the new first entry on the end-of-block list;
    end {while};
    If the end-of-block list is empty then
        place m at the head of that list, recording j as its b-pointer
    Else {when the end-of-block list is not empty}
    begin
        Let p_s denote the first end-of-block entry.
        Using binary search over the cells in block s, find the
        right-most point p in that block such that Cand(j, p) > Cand(b_s, p).
        Add p to the head of the end-of-block list, recording j as its b-pointer;
    end;
end;
Time analysis

Work is done either when an E value is computed for the current cell or when the algorithm
does a comparison involved in maintaining the current block-partition. Hence the total
time for the algorithm is proportional to the number of those comparisons. In iteration j,
when j is the current cell, the comparisons are divided into those used to find block s and
those used in the binary search to split block s. If the algorithm does l ≥ 2 comparisons
to find s in iteration j, then at least l - 1 full blocks coalesce into a single block. The
binary search then splits at most one block into two. Hence if, in iteration j, the algorithm
does l ≥ 2 comparisons to find s, then the total number of blocks decreases by at least
l - 2. If it does one or two comparisons, then the total number of blocks increases by at
most one. Since the algorithm begins with a single block and there are m iterations, it
follows that over the entire algorithm there can be at most O(m) comparisons done to find
every s, excluding the comparisons done during the binary searches. Clearly, the total
number of comparisons used in the m binary searches is O(m log m). Hence we have
Theorem 12.6.1. For any fixed row, all the E(j) values can be computed in O(m log m)
total time.
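
A compact sketch of the algorithm behind this theorem (our Python rendering, simplified: the V values for the row are taken as given rather than being set from E, F, and G as each cell becomes current; blocks are kept as (end cell, common b-pointer) pairs):

def row_E(V, w, m):
    def cand(k, j):
        return V[k] - w(j - k)

    E = [None] * (m + 1)
    blocks = [(m, 0)]                    # one block, all cells point to cell 0
    for j in range(1, m + 1):
        E[j] = cand(blocks[0][1], j)     # cell j lies in the first block
        if blocks[0][0] == j:            # cell j leaves the partition
            blocks.pop(0)
        if j == m:
            break
        # coalesce leading blocks whose end cells j's candidate beats
        new_end = None
        while blocks and cand(j, blocks[0][0]) > cand(blocks[0][1], blocks[0][0]):
            new_end = blocks.pop(0)[0]
        if not blocks:                   # j's candidate wins every remaining cell
            blocks = [(m, j)]
            continue
        # binary search the surviving block for the right-most cell where
        # j's candidate still wins (losses form a suffix, by Lemma 12.6.1)
        start = j + 1 if new_end is None else new_end + 1
        lo, hi, p = start, blocks[0][0], start - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if cand(j, mid) > cand(blocks[0][1], mid):
                p, lo = mid, mid + 1
            else:
                hi = mid - 1
        if p >= start:
            blocks.insert(0, (p, j))
        elif new_end is not None:
            blocks.insert(0, (new_end, j))
    return E

Every full coalescing comparison removes a block and the binary search adds at most one, which is exactly the charging argument used in the time analysis above.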
Figure 12.20: To update the block-partition the algorithm successively examines cells p_i to find the first
index s where E(p_s) ≥ Cand(j, p_s). In this figure, s is 4. Blocks 1 through s - 1 = 3 coalesce into a single
block with some initial part of block s = 4. Blocks to the right of s remain unchanged.
for i from 1 to r, until either the end-of-block list is exhausted, or until it finds the first
index s with E(p_s) ≥ Cand(j, p_s). In the first case, the cells j + 1, . . . , m fall into a single
block with common pointer to cell j. In the second case, the blocks s + 1 through r remain
unchanged, but all the blocks 1 through s - 1 coalesce with some initial part (possibly all)
of block s, forming one block with common pointer to cell j (see Figure 12.20). Note that
every comparison but the last one results in two neighboring blocks coalescing into one.
Having found block s, the algorithm finds the proper place to split block s by doing
binary search over the cells in the block. This is exactly as in the case already discussed
for j = 1.
Figure 12.21: A single block with t = 4 drawn inside the full dynamic programming table. The distance
values in the part of the block labeled F are determined by the values in the parts labeled A, B, and C
together with the substrings of S1 and S2 in D and E. Note that A is the intersection of the first row and
column of the block.
Consider the standard dynamic programming approach to computing the edit distance of
two strings S1 and S2. The value D(i, j) given to any cell (i, j), when i and j are both greater
than 0, is determined by the values in its three neighboring cells, (i - 1, j - 1), (i - 1, j),
and (i, j - 1), and by the characters in positions i and j of the two strings. By extension,
the values given to the cells in an entire t-block, with upper left-hand corner at position
(i, j) say, are determined by the values in the first row and column of the t-block together
with the substrings S1[i..i + t - 1] and S2[j..j + t - 1] (see Figure 12.21). Another way
to state this observation is the following:
Lemma 12.7.1. The distance values in a t-block starting in position (i, j) are a function of
the values in its first row and column and the substrings S1[i..i + t - 1] and S2[j..j + t - 1].

Definition Given Lemma 12.7.1, and using the notation shown in Figure 12.21, we
define the block function as the function from the five inputs (A, B, C, D, E) to the
output F.

It follows that the values in the last row and column of a t-block are also a function of
the inputs (A, B, C, D, E). We call the function from those inputs to the values in the last
row and column of a t-block the restricted block function.
Notice that the total size of the input and the size of the output of the restricted block
function is O(t).
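
For concreteness, a direct sketch of the restricted block function (the helper is ours): it fills a t-block from its first row and column and the block's substrings by running the standard edit distance recurrence, and returns the block's last row and column.

def restricted_block(first_row, first_col, a, b):
    # first_row and first_col each hold t values and share the corner A;
    # a and b are the t - 1 characters entering the block's rows and columns.
    t = len(first_row)
    D = [first_row[:]] + [[first_col[i]] + [0] * (t - 1) for i in range(1, t)]
    for i in range(1, t):
        for j in range(1, t):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return D[t - 1], [D[i][t - 1] for i in range(t)]   # last row, last column

Evaluated directly like this it costs Θ(t²) time; the whole point of the Four-Russians idea below is to precompute its outputs so that each use costs only O(t).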
Theorem 12.6.2. When the gap weight w is a convex function of the gap length, an optimal
alignment can be computed in O(nm log m) time, where m > n are the lengths of the two
strings.
The rough idea of the Four-Russians method¹ is to partition the dynamic programming
table into t-blocks and compute the essential values in the table one t-block at a time,
rather than one cell at a time. The goal is to spend only O(t) time per block (rather than
Θ(t²) time), achieving a factor-of-t speedup over the standard dynamic programming
solution. In the exposition given below, the partition will not be exactly achieved, since
neighboring t-blocks will overlap somewhat. Still, the rough idea given here does capture
the basic flavor and advantage of the method presented below. That method will compute
the edit distance in O(n²/log n) time, for two strings of length n (again assuming a fixed
alphabet).
¹ This reflects our general level of ignorance about ethnicities in the then Soviet Union.
In the case of edit distance, the precomputation suggested by the Four-Russians idea
is to enumerate all possible inputs to the restricted block function (the proper size of the
block will be determined later), compute the resulting output values (a t-length row and a
t-length column) for each input, and store the outputs indexed by the inputs. Every time
a specific restricted block function must be computed in step 3 of the block edit distance
algorithm, the value of the function is then retrieved from the precomputed values and
need not be computed. This clearly works to compute the edit distance D(n, n), but is it
any faster than the original O(n²) method? Astute readers should be skeptical, so please
suspend disbelief for now.
Accounting detail

Assume first that all the precomputation has been done. What time is needed to execute
the block edit distance algorithm? Recall that the sizes of the input and the output of the
restricted block function are both O(t). It is not difficult to organize the input-output values
of the (precomputed) restricted block function so that the correct output for any specific
input can be retrieved in O(t) time. Details are left to the reader. There are O(n²/t²) blocks,
hence the total time used by the block edit distance algorithm is O(n²/t). Setting t to
Θ(log n), the time is O(n²/log n). Moreover, in the unit-cost RAM model of computation,
each output value can be retrieved in constant time since t = O(log n). In that case, the
time for the method is reduced to O(n²/(log n)²).
But what about the precomputation time? The key issue involves the number of input
choices to the restricted block function. By definition, every cell has an integer value from
zero to n, so there are (n + 1)^t possible values for any t-length row or column. If the alphabet
has size σ, then there are σ^t possible substrings of length t. Hence the number of distinct
input combinations to the restricted block function is (n + 1)^{2t}σ^{2t}. For each input, it takes
Θ(t²) time to evaluate the last row and column of the resulting t-block (by running the
standard dynamic program). Thus the overall time used in this way to precompute the
function outputs for all possible input choices is O((n + 1)^{2t}σ^{2t}t²). But t must be at least
one, so Ω(n²) time is used in this way. No progress yet! The idea is right, but we need
another trick to make it work.
Lemma 12.7.2. In any row, column, or diagonal of the dynamic programming table for
edit distance, two adjacent cells can have values that differ by at most one.

PROOF
Figure 12.22: An edit distance table for n = 9. With t = 4, the table is covered by nine overlapping blocks.
The center block is outlined with darker lines for clarity. In general, if n = k(t - 1) then the (n + 1) by (n + 1)
table will be covered by k² overlapping t-blocks.
1. Cover the (n + 1) by (n + 1) dynamic programming table with t-blocks, where the last
column of every t-block is shared with the first column of the t-block to its right (if any),
and the last row of every t-block is shared with the first row of the t-block below it (if
any). (See Figure 12.22.) In this way, and since n = k(t - 1), the table will consist of k
rows and k columns of partially overlapping t-blocks.
2. Initialize the values in the first row and column of the full table according to the base
conditions of the recurrence.
3. In a rowwise manner, use the restricted block function to successively determine the values
in the last row and last column of each block. By the overlapping nature of the blocks, the
values in the last column (or row) of a block are the values in the first column (or row) of
the block to its right (or below it).
4. The value in cell (n, n) is the edit distance of S1 and S2.
end.
Of course, the heart of the algorithm is step 3, where specific instances of the restricted
block function must be computed. Any instance of the restricted block function can be
computed in O(t²) time, but that gains us nothing. So how is the restricted block function
computed?
Time analysis

As in the analysis of the block edit distance algorithm, the execution of the Four-Russians
edit distance algorithm takes O(n²/log n) time (or O(n²/(log n)²) time in the unit-cost
RAM model) by setting t to Θ(log n). So again, the key issue is the time needed to
precompute the block offset function. Recall that the first entry of an offset vector must be
zero, so there are 3^{2(t-1)} possible pairs of offset vectors. There are σ^t ways to specify a substring
over an alphabet with σ characters, and so there are 3^{2(t-1)}σ^{2t} ways to specify the input to
the offset function. For any specific input choice, the output is computed in O(t²) time (via
dynamic programming), hence the entire precomputation takes O(3^{2t}σ^{2t}t²) time. Setting
t equal to (log_{3σ} n)/2, the precomputation time is just O(n(log n)²). In summary, we have

Theorem 12.7.2. The edit distance of two strings of length n can be computed in O(n²/log n)
time, or in O(n²/(log n)²) time in the unit-cost RAM model.
then D(i - 1, j - 1) ≤ D(i, j) + 1. If the optimal alignment doesn't align i against j, then at
least one of the characters, S1(i) or S2(j), must align against a space, and
D(i - 1, j - 1) ≤ D(i, j).
Given Lemma 12.7.2, we can encode the values in a row of a t-block by a vector
specifying the value of the first entry in the row, followed by the difference
(offset) of each successive cell value from its left neighbor: A zero indicates equality, a one
indicates an increase by one, and a minus one indicates a decrease by one. For example,
the row of distances 5, 4, 4, 5 would be encoded by the row of offsets 5, -1, 0, +1.
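
A tiny sketch of this encoding and its inverse (function names are ours):

def encode_row(values):
    # keep the first entry, then differences in {-1, 0, +1}
    return values[0], [b - a for a, b in zip(values, values[1:])]

def decode_row(first, offsets):
    values = [first]
    for d in offsets:
        values.append(values[-1] + d)
    return values

print(encode_row([5, 4, 4, 5]))   # -> (5, [-1, 0, 1])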
Similarly, we can encode the values in any column by such an offset encoding. Since there
are only (n + 1)3^{t-1} distinct vectors of this type, a change to offset encoding is surely a
move in the right direction. We can, however, reduce the number of possible vectors even
further.
Definition The offset vector is a t-length vector of values from {-1, 0, 1}, where the
first entry must be zero.
The key to making the Four-Russians method efficient is to compute edit distance using
only offset vectors rather than actual distance values. Because the number of possible offset
vectors is much less than the number of possible vectors of distance values, much less
precomputation will be needed. We next show that edit distance can be computed using
offset vectors.
Theorem 12.7.1. Consider a t-block with upper left corner in position (i, j). The two
offset vectors for the last row and last column of the block can be determined from the two
offset vectors for the first row and column of the block and from substrings S1[i..i + t - 1]
and S2[j..j + t - 1]. That is, no D value is needed in the input in order to determine the
offset vectors in the last row and column of the block.
PROOF The proof is essentially a close examination of the dynamic programming
recurrences for edit distance. Denote the unknown value of D(i, j) by C. Then for column q in
the block, D(i, q) equals C plus the total of the offset values in row i from column j + 1 to
column q. Hence even if the algorithm doesn't know the value of C, it can express D(i, q)
as C plus an integer that it can determine. Each D(q, j) can be similarly expressed. Let
D(i, j + 1) be C + J and let D(i + 1, j) be C + I, where the algorithm can know I and
J. Now consider cell (i + 1, j + 1). D(i + 1, j + 1) is equal to D(i, j) = C if character
S1(i + 1) matches S2(j + 1). Otherwise D(i + 1, j + 1) equals the minimum of D(i, j + 1) + 1,
D(i + 1, j) + 1, and D(i, j) + 1, i.e., the minimum of C + J + 1, C + I + 1, and C + 1.
The algorithm can make this comparison by comparing I and J (which it knows) to the
number zero. So the algorithm can correctly express D(i + 1, j + 1) as C, C + I + 1,
C + J + 1, or C + 1. Continuing in this way, the algorithm can correctly express each
D value in the block as an unknown C plus some integer that it can determine. Since
every term involves the same unknown constant C, the offset vectors can be correctly
determined by the algorithm. □
Definition The function that determines the two offset vectors for the last row and last
column of a block from the two offset vectors for the first row and column of the block,
together with substrings S1[i..i + t - 1] and S2[j..j + t - 1], is called the offset function.

We now have all the pieces of the Four-Russians-type algorithm to compute edit
distance. We again assume, for simplicity, that each string has length n = k(t - 1) for
some k.
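
Putting the pieces together, here is a condensed sketch of the whole method (our Python rendering, simplified in two ways: the offset function is memoized on demand rather than exhaustively precomputed, and its inputs include the blocks' actual substrings, which presumes the strings are known):

from functools import lru_cache

def edit_distance_4r(s1, s2, t=4):
    n = len(s1)
    assert len(s2) == n and n % (t - 1) == 0      # n = k(t - 1), as in the text
    k, step = n // (t - 1), t - 1

    @lru_cache(maxsize=None)
    def offset_fn(row_off, col_off, a, b):
        # One t-block, computed relative to the unknown corner value C;
        # by Theorem 12.7.1 the resulting offsets are still correct.
        D = [[0] * t for _ in range(t)]
        for j in range(1, t):
            D[0][j] = D[0][j - 1] + row_off[j]
        for i in range(1, t):
            D[i][0] = D[i - 1][0] + col_off[i]
        for i in range(1, t):
            for j in range(1, t):
                D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                              D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        last_row = (0,) + tuple(D[t - 1][j] - D[t - 1][j - 1]
                                for j in range(1, t))
        last_col = (0,) + tuple(D[i][t - 1] - D[i - 1][t - 1]
                                for i in range(1, t))
        return last_row, last_col                 # offsets, by Lemma 12.7.2

    ones = (0,) + (1,) * step                     # row 0 and column 0 offsets
    row_off = {(0, q): ones for q in range(k)}
    col_off = {(p, 0): ones for p in range(k)}
    for p in range(k):                            # rowwise over the k x k blocks
        for q in range(k):
            a = s1[p * step: p * step + step]     # characters entering the rows
            b = s2[q * step: q * step + step]     # characters entering the columns
            lr, lc = offset_fn(row_off[p, q], col_off[p, q], a, b)
            row_off[p + 1, q] = lr                # shared with the block below
            col_off[p, q + 1] = lc                # shared with the block to its right
    # D(n, 0) = n; accumulate offsets along the bottom row of the table.
    return n + sum(sum(row_off[k, q][1:]) for q in range(k))

print(edit_distance_4r("abcabc", "ababbc"))       # -> 2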
Prove the lemma and then show how to exploit it in the solution to the threshold P-against-all
problem. Try to estimate how effective the lemma is in practice. Be sure to consider how
the output is efficiently collected when the dynamic programming ends high in the tree,
before a leaf is reached.
11. Give a complete proof of the correctness of the all-against-all suffix tree algorithm.
12. Another, faster, alternative to the P-against-all problem is to change the problem slightly as
follows: For each position i in T such that there is a substring starting at i with edit distance
less than d from P, report only the smallest such substring starting at position i. This is the
(P-against-all) starting location problem, and it can be solved by modifying the approach
discussed for the threshold P-against-all problem. The starting location problem (actually
the equivalent ending location problem) is the subject of a paper by Ukkonen [437]. In that
paper, Ukkonen develops three hybrid dynamic programming methods in the same spirit
as those presented in this chapter, but with additional technical observations. The main
result of that paper was later improved by Cobbs [105].
Detail a solution to the starting location problem, using a hybrid dynamic programming
approach.
13. Show that the suffix tree methods and time bounds for the P-against-all and the all-against-all problems extend to the problem of computing similarity instead of edit distance.
14. Let R be a regular expression. Show how to modify the P-against-all method to solve the R-against-all problem. That is, show how to use a suffix tree to efficiently search for a substring
in a large text T that matches the regular expression R. (This problem is from [63].)
Now extend the method to allow for a bounded number of errors in the match.
15. Finish the proof of Theorem 12.5.2.
16. Show that in any permutation of n integers from 1 to n, there is either an increasing
subsequence of length at least √n or a decreasing subsequence of length at least √n. Show
that, averaged over all the n! permutations, the average length of the longest increasing
subsequence is at least √n/2. Show that the lower bound of √n/2 cannot be tight.
17. What do the results from the previous problem imply for the lcs problem?
18. If S is a subsequence of another string S', then S' is said to be a supersequence of S. If
two strings S1 and S2 are subsequences of S', then S' is a common supersequence of S1
and S2. That leads to the following natural question: Given two strings S1 and S2, what is
the shortest supersequence common to both S1 and S2? This problem is clearly related to
the longest common subsequence problem. Develop an explicit relationship between the
two problems and the lengths of their solutions. Then develop efficient methods to find a
shortest common supersequence of two strings. For additional results on subsequences
and supersequences see [240] and [241].
19. Can the results in the previous problem be generalized to the case of more than two strings?
For instance, is there a natural relationship between the longest common subsequence and
the shortest common supersequence of three strings?
20. Let T be a string whose characters come from an alphabet Σ with σ characters. A subsequence S of T is nondecreasing if each successive character in S is lexically greater
than or equal to the preceding character. For example, using the English alphabet, let T =
characterstring; then S = aacrst is a nondecreasing subsequence of T. Give an algorithm
that finds the longest nondecreasing subsequence of a string T in time O(nσ), where n is
the length of T. How does this bound compare to the O(n log n) bound given for the longest
increasing subsequence problem over integers?
21. Recall the definition of r given for two strings in Section 12.5.2 on page 290. Extend the
12.8. Exercises
1. Show how to compute the value V(n, m) of the optimal alignment using only min(n, m) + 1
space in addition to the space needed to represent the two input strings.
2. Modify Hirschberg's method to work for alignment with a gap penalty (affine and general)
in the objective function. It may be helpful to use both the affine gap recurrences developed
in the text and the alternative recurrences that pay for a gap when it is terminated. The latter
recurrences were developed in Exercise 27 of Chapter 11.
3. Hirschberg's method computes one optimal alignment. Try to find ways to modify the
method to produce more (all?) optimal alignments, while still achieving substantial space
reduction and maintaining a good time bound compared to the O(nm)-time and space
method. I believe this is an open area.
4. Show how to reduce the size of the strip needed in the method of Section 12.2.3, when
|m - n| < k.
5. Fill in the details of how to find the actual alignments of P in T that occur with at most k
differences. The method uses the O(km) values stored during the k differences algorithm.
The solution is somewhat simpler if the k differences algorithm also stores a sparse set of
pointers recording how each farthest-reaching d-path extends a farthest-reaching (d - 1)-path.
These pointers only take O(km) space and are a sparse version of the standard
dynamic programming pointers. Fill in the details for this approach as well.
6. The k differences problem is an unweighted (or unit weighted) alignment problem defined
in terms of the number of mismatches and spaces. Can the O(km) result be extended
to operator- or alphabet-weighted versions of alignment? The answer is: not completely.
Explain why not. Then find special cases of weighted alignment, and plausible uses for
these cases, where the result does extend.
7. Prove Lemma 12.3.2 from page 274.
8. Prove Lemma 12.3.4 from page 277.
9. Prove Theorem 12.4.2 that concerns space use in the P-against-all problem.
The P-against-all problem was introduced first because it most directly illustrates one
general approach to using suffix trees to speed up dynamic programming computations.
And, it has been proposed that such a massive study of how P relates to substrings of T
can be important in certain problems [183]. Nonetheless, for most applications the output
of the P-against-all problem is excessive, and a more focused computation is desirable.
The threshold P-against-all problem is of this type: Given strings P and T and a threshold
d, find every substring T' of T such that the edit distance between P and T' is less than
d. Of course, it would be cheating to first solve the P-against-all problem and then filter
out the substrings of T whose edit distance to P is d or greater. We want a method whose
speed is related to d. The computation should increase in speed as d falls.
The idea is to follow the solution to the P-against-all problem, doing a depth-first traversal
of the suffix tree of T, but recognizing subtrees that need not be traversed. The following lemma
is the key.
Lemma 12.8.1. In the P-against-all problem, suppose that the current path in the suffix tree
specifies a substring S of T and that the current dynamic programming column (including
the zero row) contains no values below d. Then the column representing any extension
of S will also contain no values below d. Hence no columns need be computed for any
extensions of S.
method seems more justified. In fact, why not pick a "reasonable" value for t, do the precomputation of the offset function once for that t, and then embed the offset function in
an edit distance algorithm to be used for all future edit distance computations? Discuss the
merits and demerits of this proposal.
32. The Four-Russians method presented in the text only computes the edit distance. How can
it be modified to compute the edit transcript as well?
33. Show how to apply the Four-Russians method to strings of unequal length.
34. What problems arise in trying to extend the Four-Russians method and the improved time
bound to the weighted edit distance problem? Are there restrictions on weights (other than
equality) that make the extension easier?
35. Following the lines of the previous question, show in detail how the Four-Russians approach
can be used to solve the longest common subsequence problem between two strings of
length n, in O(n²/log n) time.
definition for r to the longest common subsequence problem for more than two strings, and
use r to express the time for finding an lcs in this case.
22. Show how to model and solve the lis problem as a shortest path problem in a directed,
acyclic graph. Are there any advantages to viewing the problem in this way?
23. Suppose we only want to learn the length of the lcs of two strings S1 and S2. That can be
done, as before, in O(r log n) time, but now using only linear space. The key is to keep only
the last element in each list of the cover (when computing the lis), and not to generate all
of Π(S1, S2) at once, but to generate (in linear space) parts of Π(S1, S2) on the fly. Fill in
the details of these ideas and show that the length of the lcs can be computed as quickly
as before in only linear space.
Open problem: Extend the above combinatorial ideas to show how to compute the actual
lcs of two strings using only linear space, without increasing the needed time. Then extend
to more than two strings.
24. (This problem requires a knowledge of systolic arrays.) Show how to implement the longest
increasing subsequence algorithm to run in O(n) time on an O(n)-element systolic array
(remember that each array element has only constant memory). To make the problem
simpler, first consider how to compute the length of the lis, and then work out how to
compute the actual increasing subsequence.
25. Work out how to compute the lcs in O(n) time on an O(n)-element systolic array.
26. We have reduced the lcs problem to the lis problem. Show how to do the reduction in the
opposite direction.
27. Suppose each character in S1 and S2 is given an individual weight. Give an algorithm to
find an increasing subsequence of maximum total weight.
28. Derive an O(nm log m)-time method to compute edit distance for the convex gap weight
model.
29. The idea of forward dynamic programming can be used to speed up (in practice) the (global)
alignment of two strings, even when gaps are not included in the objective function. We
will explain this in terms of computing unweighted edit distance between strings S1 and S2
(of lengths n and m respectively), but the basic idea works for computing similarity as well.
Suppose a cell (i, j) is reached during the (forward) dynamic programming computation
of edit distance and the value there is D(i, j). Suppose also that there is a fast way to
compute a lower bound, L(i, j), on the distance between substrings S1[i + 1, . . . , n] and
S2[j + 1, . . . , m]. If D(i, j) + L(i, j) is greater than or equal to a known distance between
S1 and S2 obtained from some particular alignment, then there is no need to propagate
candidate values forward from cell (i, j). The question now is to find efficient methods to
compute "effective" values of L(i, j). One simple one is |n - m + j - i|. Explain this. Try it
out in practice to see how effective it is. Come up with other simple lower bounds that are
much more effective.
Hint: Use the count of the number of times each character appears in each string.
30. As detailed in the text, the Four-Russians method precomputes the offset function for
3^{2(t-1)}σ^{2t} specifications of input values. However, the problem statement and time bound
allow the precomputation of the offset function to be done after strings S1 and S2 are
known. Can that observation be used to reduce the running time?
An alternative encoding of strings allows the σ^{2t} term to be changed to (t + 2)^t even
in problem settings where S1 and S2 are not known when the precomputation is done.
Discover and explain the encoding and how edit distance is computed when using it.
31. Consider the situation when the edit distance must be computed for each pair of strings from
a large set of strings. In that situation, the precomputation needed by the Four-Russians