Approximate String Matching
Gonzalo Navarro
Dept. of Computer Science, University of Chile
Blanco Encalada 2120 - Santiago - Chile
[email protected], http://www.dcc.uchile.cl/gnavarro
Abstract
We survey the current techniques to cope with the problem of string matching allowing
errors. This is becoming a more and more relevant issue for many fast growing areas such as
information retrieval and computational biology. We focus on online searching, explaining the
problem and its relevance, its statistical behavior, its history and current developments, and
the central ideas of the algorithms and their complexities. We present a number of experiments
to compare the performance of the different algorithms and show which are the best choices
according to each case. We conclude with some future work directions and open problems.
1 Introduction
This work focuses on the problem of string matching allowing errors, also called approximate string
matching. The general goal is to perform string matching of a pattern in a text where one or
both of them have suffered some kind of (undesirable) corruption. Some examples are recovering
the original signals after their transmission over noisy channels, finding DNA subsequences after
possible mutations, and text searching under the presence of typing or spelling errors.
The problem, in its most general form, is to find the positions of a text where a given pattern
occurs, allowing a limited number of "errors" in the matches. Each application uses a different error
model, which defines how different two strings are. The idea for this "distance" between strings is
to make it small when one of the strings is likely to be an erroneous variant of the other under the
error model in use.
The goal of this survey is to present an overview of the state of the art in approximate string
matching. We focus on online searching, explaining the problem and its relevance, its statistical
behavior, its history and current developments, and the central ideas of the algorithms and their
complexities. We also consider some variants of the problem which are of interest. We present a
number of experiments to compare the performance of the different algorithms and show which are
the best choices according to each case. We conclude with some future work directions and open
problems.
Unfortunately, the algorithmic nature of the problem strongly depends on the type of "errors"
considered, and the solutions range from linear time to NP-complete. The scope of our subject is
so broad that we are forced to specialize our focus on a subset of the possible error models. We
consider only those defined in terms of replacing some substrings by others at varying costs. Under
this light, the problem becomes minimizing the total cost to transform the pattern and its text
occurrence to make them equal, and reporting the text positions where this cost is low enough.
* Partially supported by Fondecyt grant 1-990627.
One of the best studied particular cases of this error model is the so-called edit distance, which
allows deleting, inserting and replacing single characters (by a different one) in both strings. If the
different operations have different costs or the costs depend on the characters involved, we speak
of general edit distance. Otherwise, if all the operations cost 1, we speak of simple edit distance
or just edit distance (ed). In this last case we simply seek the minimum number of insertions,
deletions and replacements that make both strings equal. For instance ed("survey","surgery") = 2.
The edit distance has received a lot of attention because its generalized version is powerful enough
for a wide range of applications. Although most existing algorithms concentrate on the simple
edit distance, many of them can be easily adapted to the generalized edit distance, and we pay
attention to this issue throughout this work. Moreover, the few algorithms that exist for the general
error model we consider are generalizations of edit distance algorithms.
On the other hand, most of the algorithms designed for the edit distance are easily specialized
to other cases of interest. For instance, by allowing only insertions and deletions at cost 1 we can
compute the longest common subsequence (LCS) between two strings. Another simplification that
has received a lot of attention is the variant that allows only replacements (Hamming distance).
An extension of the edit distance enriches it with transpositions (i.e. a replacement of the form
ab → ba at cost 1). Transpositions are very important in text searching applications because they
are typical typing errors, but few algorithms exist to handle them. However, many algorithms for
edit distance can be easily extended to include transpositions, and we keep track of this fact in this
work.
Since the edit distance is by far the best studied case, this survey focuses basically on the simple
edit distance. However, we also pay attention to extensions such as generalized edit distance,
transpositions and general substring replacement, as well as to simplifications such as LCS and
Hamming distance. In addition, we also pay attention to some extensions of the type of pattern to
search: when the algorithms allow it, we mention the possibility of searching some extended patterns
and regular expressions allowing errors. We point out now what we are not covering in this work.
First, we do not cover other distance functions that do not fit in the model of substring
replacement. This is because they are too different from our focus and the paper would
lose cohesion. Some of these are: Hamming distance (short survey in [Nav98a]), reversals
[KS95] (which allow reversing substrings), block distance [Tic84, EH88, Ukk92, LT97] (which
allows rearranging and permuting the substrings), q-gram distance [Ukk92] (based on finding
common substrings of fixed length q), allowing swaps [AAL+97, LKPC97], etc. Hamming
distance, despite being a simplification of the edit distance, is not covered because specialized
algorithms exist for it that go beyond the simplification of an existing algorithm for edit
distance.
Second, we consider pattern matching over sequences of symbols, and at most generalize
the pattern to a regular expression. Extensions such as approximate searching in multidimensional
texts (short survey in [NBY99a]), in graphs [ALL97, Nav98b] or multipattern
approximate searching [MM96, BYN97b, Nav97a, BYN98] are not considered. None of these
areas is very developed and the algorithms should be easy to grasp once approximate pattern
matching under the simple model is well understood. Many existing algorithms for these
problems borrow from those we present here.
Third, we leave aside non-standard algorithms, such as approximate or parallel algorithms
[TU88, LL85, LV89].
Finally, an important area that we leave aside in this survey is indexed searching, i.e. the
process of building a persistent data structure (an index) on the text to speed up the search
later. Typical reasons that prevent keeping indices on the text are: extra space requirements
(as the indices for approximate searching tend to take many times the text size), volatility of
the text (as building the indices is quite costly and needs to be amortized over many searches)
and simply inadequacy (as the field of indexed approximate string matching is quite immature
and the speedup that the indices provide is not always satisfactory).
Indexed approximate searching is a difficult problem, and the area is quite new and active [JU91, Gon92, Ukk93, Mye94a, HS94, MW94, Cob95, ST96, BYN97a, ANZ97, NBY98b,
NBY99b, MNZBY99]. The problem is very important because the texts to handle are so large
in some applications that no online algorithm can provide adequate performance. However,
virtually all the indexed algorithms are strongly based on online algorithms, and therefore understanding and improving the current online solutions is of interest for indexed approximate
searching as well.
These issues have been left aside to keep a reasonable scope in the present work. They certainly
deserve separate surveys. Our goal in this survey is to explain the basic tools of approximate string
matching, as many of the extensions we are leaving aside are built on the basic algorithms designed
for online approximate string matching.
This work is organized as follows. In Section 2 we present in detail some of the most important
application areas for approximate string matching. In Section 3 we formally introduce the problem
and the basic concepts necessary to follow the rest of the paper. In Section 4 we show some
analytical and empirical results about the statistical behavior of the problem.
Sections 5 to 8 cover all the work of interest we could trace on approximate string matching
under the edit distance. We divided it in four sections that correspond to dierent approaches
to the problem: dynamic programming, automata, bit-parallelism and ltering algorithms. Each
section is presented as a historical tour, so that we do not only explain the work done but also show
how it was developed.
Section 9 presents experimental results comparing the most efficient algorithms presented. Finally, we give our conclusions and discuss open questions and future work directions in Section
10.
There exist other surveys on approximate string matching, which are however too old for this
fast moving area [HD80, SK83, AG85, GG88, JTU96] (the last one was in its definitive form in
1991). So all previous surveys lack coverage of the latest developments. Our aim is to provide a
long awaited update. This work is partially based on [Nav98a], but the coverage of previous work
is much more detailed here. The subject is also covered, albeit with less depth, in some textbooks
on algorithms [CR94, BYR99].
DNA and protein sequences can be seen as long texts over specific alphabets (e.g. {A,C,G,T} in
DNA). Those sequences represent the genetic code of living beings. Searching specific sequences
over those texts appeared as a fundamental operation for problems such as assembling the DNA
chain from the pieces obtained by the experiments, looking for given features in DNA chains, or
determining how different two genetic sequences are. This was modeled as searching for given
"patterns" in a "text". However, exact searching was of little use for this application, since the
patterns rarely matched the text exactly: the experimental measures have errors of different kinds
and even the correct chains may have small differences, some of them significant due to mutations
and evolutionary alterations and others unimportant. Finding DNA chains very similar to those
sought represents a significant result as well. Moreover, establishing how different two sequences
are is important to reconstruct the tree of evolution (phylogenetic trees). All these problems
required a concept of "similarity", as well as an algorithm to compute it.
This gave a motivation to "search allowing errors". The errors were those operations that
biologists knew to occur commonly in genetic sequences. The "distance" between two sequences
was defined as the minimum (i.e. most likely) sequence of operations to transform one into the
other. With regard to likelihood, the operations were assigned a "cost", such that the more likely
operations were cheaper. The goal was then to minimize the total cost.
Computational biology has since then evolved and developed a lot, with a special push in
recent years due to the "genome" projects that aim at the complete decoding of the DNA and
its potential applications. There are other, more exotic problems, such as structure matching or
searching for unknown patterns. Even the simple problem where the pattern is known is believed
to be NP-complete under some distance functions (e.g. reversals).
Some good references for the applications of approximate pattern matching to computational
biology are [Sel74, NW70, SK83, Mye94b, Wat95, YFM96].
Another early motivation came from signal processing. One of the largest areas deals with speech
recognition, where the general problem is to determine, given an audio signal, a textual message
which is being transmitted. Even simplified problems such as discerning a word from a small set
of alternatives are complex, since parts of the signal may be compressed in time, parts of the
speech may not be pronounced, etc. A perfect match is practically impossible.
Another problem of this field is error correction. The physical transmission of signals is error-prone. To ensure correct transmission over a physical channel, it is necessary to be able to recover
the correct message after a possible modification (error) introduced during the transmission. The
probability of such errors is obtained from signal processing theory and used to assign a cost to
them. In this case we may not even know what we are searching for; we just want a text which is
correct (according to the error correcting code used) and closest to the received message. Although
this area has not developed much with respect to approximate searching, it has generated the
most important measure of similarity, known as the Levenshtein distance [Lev65, Lev66] (also called
"edit distance").
Signal processing is a very active area today. The rapidly evolving field of multimedia databases
demands the ability to search by content in image, audio and video data, which are potential
applications for approximate string matching. We expect in the next years a lot of pressure on
non-written human-machine communication, which involves speech recognition. Strong error correcting
codes are also sought given the current interest in wireless networks, since the air is a low quality
transmission medium.
Good references for the relations of approximate pattern matching with signal processing are
[Lev65, Vin68, DM79].
The problem of correcting misspelled words in written text is rather old, perhaps the oldest potential
application for approximate string matching. We could find references from the twenties [Mas27],
and perhaps there are older ones. Since the sixties, approximate string matching has been one of the
most popular tools to deal with this problem. For instance, 80% of these errors are corrected allowing
just one insertion, deletion, replacement or transposition [Dam64].
There are many areas where this problem appears, and Information Retrieval (IR) is one of the
most demanding. IR is about finding the relevant information in a large text collection, and string
matching is one of its basic tools.
However, classical string matching is normally not enough, because the text collections are
becoming larger (e.g. the Web has largely surpassed the terabyte), more heterogeneous (different
languages, for instance) and more error prone. Many are so large and grow so fast that it is
impossible to control their quality (e.g. in the Web). A word which is entered incorrectly in the
database cannot be retrieved anymore. Moreover, the pattern itself may have errors, for instance in
cross-lingual scenarios where a foreign name sought is incorrectly spelled, or in ancient texts that
use outdated versions of the language.
For instance, text collections digitized via optical character recognition (OCR) contain a
non-negligible percentage of errors (7% to 16%). The same happens with typing (1% to 3.2%) and
spelling (1.5% to 2.5%) errors. Experiments with typing Dutch surnames (by Dutch people) reached 38%
of spelling errors. All these percentages were obtained from [Kuk92]. Our own experiments with
the name "Levenshtein" in Altavista gave more than 30% of errors allowing just one deletion or
transposition.
Nowadays, there is virtually no text retrieval product that does not allow some extended search
facility to recover from errors in the text or pattern. Other text processing applications are spelling
checkers, natural language interfaces, command language interfaces, computer aided tutoring and
language learning, to name a few.
A very recent extension, which became possible thanks to word-oriented text compression methods, is the possibility of performing approximate string matching at the word level [MNZBY99]. That
is, the user supplies a phrase to search and the system searches the text positions where the phrase
appears with a limited number of word insertions, deletions and replacements. It is also possible
to disregard the order of the words in the phrases. This allows the query to survive different
wordings of the same idea, which extends the applications of approximate pattern matching well
beyond the recovery of syntactic mistakes.
Good references about the relation of approximate string matching and information retrieval
are [WF74, LW75, Nes86, OM88, Kuk92, ZD96, FPS97, BYR99].
The number of applications for approximate string matching grows every day. We have found
solutions to the most diverse problems based on approximate string matching, for instance handwriting recognition [LT94], virus and intrusion detection [KS94], image compression [LS97], data
mining [DFG+97], pattern recognition [GT78], optical character recognition [EL90], file comparison
[Hec78] and screen updating [Gos91], to name a few. Many more applications are mentioned in
[SK83, Kuk92].
3 Basic Concepts
We present in this section the important concepts needed to understand all the development that
follows. Basic knowledge is assumed on design and analysis of algorithms and data structures,
basic text algorithms, and formal languages. If this is not the case we refer the reader to good
books in these subjects, such as [AHU74, CLR91, Knu73] (for algorithms), [GBY91, CR94] (for
text algorithms) and [HU79] (for formal languages).
We start with some formal definitions related to the problem. Then we cover some data structures, not widely known, which are relevant for this survey (they are also explained in [GBY91,
CR94]). Finally, we make some comments about the tour itself.
In the discussion that follows, we use s, x, y, z, v, w to represent arbitrary strings, and a, b, c, ... to
represent letters. Writing a sequence of strings and/or letters represents their concatenation. We
assume that concepts such as prefix, suffix and substring are known. For any string s ∈ Σ* we
denote its length as |s|. We also denote s_i the i-th character of s, for an integer i ∈ {1..|s|}. We
denote s_{i..j} = s_i s_{i+1} ... s_j (which is the empty string if i > j). The empty string is denoted ε.
In the Introduction we have defined the problem of approximate string matching as that of
finding the text positions that match a pattern with up to k errors. We give now a more formal
definition.
Let Σ be a finite¹ alphabet of size |Σ| = σ.

¹ However, many algorithms can be adapted to infinite alphabets with an extra O(log m) factor in their cost. This
is because the pattern can have at most m different letters and all the rest can be considered equal for our purposes.
A table of size σ could be replaced by a search structure over at most m + 1 different letters.

In the simplified definition, all the operations cost 1. This can be rephrased as "the
minimal number of insertions, deletions and replacements to make two strings equal".
In the literature the search problem is in many cases called "string matching with k
differences". The distance is symmetric, and it holds 0 ≤ d(x, y) ≤ max(|x|, |y|).
- Hamming distance [SK83]: allows only replacements, which cost 1 in the simplified
definition. In the literature the search problem is in many cases called "string matching
with k mismatches". The distance is symmetric, and it is finite whenever |x| = |y|. In
this case it holds 0 ≤ d(x, y) ≤ |x|.

- Episode distance [DFG+97]: allows only insertions, which cost 1. In the literature
the search problem is in many cases called "episode matching", since it models the case
where a sequence of events is sought, where all of them must occur within a short period.
This distance is not symmetric, and it may not be possible to convert x into y in this
case. Hence, d(x, y) is either |y| − |x| or ∞.

- Longest Common Subsequence distance [NW70, AG87]: allows only insertions and
deletions, all costing 1. The name of this distance refers to the fact that it measures
the length of the longest pairing of characters that can be made between both strings,
so that the pairings respect the order of the letters. The distance is the number of
unpaired characters. The distance is symmetric, and it holds 0 ≤ d(x, y) ≤ |x| + |y|.
In all cases, except the episode distance, one can think that the changes can be made over x or
y. Insertions on x are the same as deletions in y and vice versa, and replacements can be made in
any of the two strings to match the other.
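For instance, for x = "survey" and y = "surgery" the longest common subsequence is "surey", so the LCS distance is d(x, y) = 6 + 7 − 2·5 = 3; the Hamming and episode distances between these two strings are ∞, because their lengths differ and because "surgery" cannot be obtained from "survey" by insertions alone.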
This paper is most concerned with the simple edit distance, which we denote ed(). Although
transpositions are of interest (especially in the case of typing errors), there are few algorithms to deal
with them. However, we will consider them at some points of this work (note that a transposition
can be simulated with an insertion plus a deletion, but the cost is different). We will also point out
when the algorithms can be extended to have different costs for the operations (which is of special
interest in computational biology), including the extreme case of not allowing some operations.
This includes the other distances mentioned.
Note that if the Hamming or edit distance is used, then the problem makes sense for 0 < k < m,
since if we can perform m operations we can make the pattern match at any text position by means
of m replacements. The case k = 0 corresponds to exact string matching and is therefore excluded
from this work. Under these distances, we call α = k/m the error level, which given the above
conditions satisfies 0 < α < 1. This value gives an idea of the "error ratio" allowed in the match
(i.e. the fraction of the pattern that can be wrong).
We finish this section with some notes about the algorithms we are going to consider. Like string
matching, this area is suitable for very theoretical and for very practical contributions. There exist
a number of algorithms with important improvements in their theoretical complexity but which are
very slow in practice. Of course, for carefully built scenarios (say, m = 100,000 and k = 2) these algorithms
could be a practical alternative, but these cases do not appear in applications. Therefore, we
point out now which are the parameters of the problem that we consider "practical", i.e. likely to
be of use in some application, and when we say later "in practice" we mean under the following
assumptions.
- The pattern length can be as short as 5 letters (e.g. text retrieval) and as long as a
few hundred letters (e.g. computational biology).

- The number of errors allowed, k, satisfies that k/m is a moderately low value. Reasonable values range from 1/m to 1/2.

- The text length can be as short as a few thousand letters (e.g. computational biology)
Suffix trees [Wei73, Knu73, AG85] are widely used data structures for text processing [Apo85]. Any
position i in a string S defines automatically a suffix of S, namely S_{i..}. In essence, a suffix tree is
a trie data structure built over all the suffixes of S. At the leaf nodes the pointers to the suffixes
are stored. Each leaf represents a suffix and each internal node represents a unique substring of S.
Every substring of S can be found by traversing a path from the root. Each node representing the
substring ax has a suffix link that leads to the node representing the substring x.
To improve space utilization, this trie is compacted into a Patricia tree [Mor68]. This involves
compressing unary paths. At the nodes which root a compressed path, an indication of how many
characters to skip is stored. Once unary paths are not present the tree has O(|S|) nodes instead
of the worst-case O(|S|²) of the trie (see Figure 1). The structure can be built in time O(|S|)
[McC76, Ukk95].
[Figure 1 shows the suffix trie and the suffix tree built on the string "abracadabra".]
Figure 1: The suffix trie and suffix tree for a sample string. The "$" is a special marker to denote
the end of the text. Two suffix links are exemplified in the trie: from "abra" to "bra" and then to
"ra". The internal nodes of the suffix tree show the character position to inspect in the string.
A DAWG (deterministic acyclic word graph) [Cro86, BBH+85] built on a string S is a deterministic
automaton able to recognize all the substrings of S. As each node in the suffix tree corresponds
to a substring, the DAWG is no more than the suffix tree augmented with failure links for the letters
not present in the tree. Since final nodes are not distinguished, the DAWG is smaller. DAWGs
have similar applications to those of suffix trees, and also need O(|S|) space and construction time.
Figure 2 illustrates.
"d"
"c"
"r"
"b"
"a"
"r"
"b"
"a"
"a"
"c"
"d"
"b"
"a"
"r"
"a"
"c"
"d"
Figure 2: The DAWG or the sux automaton for the sample string. If all the states are nal, it is
a DAWG. If only the rightmost state is nal then it is a sux automaton.
A suffix automaton on S is an automaton that recognizes all the suffixes of S. The nondeterministic version of this automaton has a very regular structure and is shown in Figure 3 (the
deterministic version can be seen in Figure 2).
Sections 5 to 8 present a historical tour across the four main approaches to online approximate
string matching (see Figure 4). In those historical discussions, keep in mind that there may be
a long gap between the time when a result is discovered and when it finally gets published in its
definitive form. Some apparent inconsistencies can be explained in this way (e.g. algorithms which
are "finally" analyzed before they appear). We did our best in the bibliography to trace the earliest
version of the works, although the full reference corresponds generally to the final version.
At the beginning of each of these sections we give a taxonomy to help guide the tour. The
taxonomy is an acyclic graph where the nodes are the algorithms and the edges mean that the
lower work can be seen as an evolution of the upper work (although sometimes the developments
are in fact independent).
Finally, we specify some notation regarding time and space complexity. When we say that an
algorithm is O(x) time we refer to its worst case (although sometimes we say that explicitly). If
the cost is on average, we say so explicitly. We also say sometimes that the algorithm is O(x) cost,
[Figure 4: taxonomy of the four main approaches to online approximate string matching — algorithms based on the dynamic programming matrix, automata, bit-parallelism (for moderate patterns) and filters — classified by worst-case versus average-case orientation.]
the average edit distance. Let f(m, k) be the probability of a random pattern of length m matching
a given text position with k errors or less under the edit distance (i.e. that the text position is
reported as the end of a match). In [BYN99, Nav98a, NBY99b] upper and lower bounds on the
maximum error level α* for which f(m, k) is exponentially decreasing on m are found. This is
important because many algorithms search for potential matches that have to be verified later, and
the cost of such verifications is polynomial in m, typically O(m²). Therefore, if that event occurs
with probability O(γ^m) for some γ < 1 then the total cost of verifications is O(m² γ^m) = o(1),
which makes the verification cost negligible.
We first show the analytical bounds for f(m, k), then give a new result on the average edit distance,
and finally present an experimental verification.
The upper bound for α* comes from the proof that the matching probability is f(m, k) = O(γ^m)
for

    γ = ( 1 / ( σ · α^{2α/(1−α)} · (1−α)^2 ) )^{1−α}        (1)

where we note that γ is 1/σ for α = 0 and grows to 1 as α grows. This matching probability is
exponentially decreasing on m as long as γ < 1, which is equivalent to

    α < 1 − e/√σ − O(1/σ)  ≤  1 − e/√σ                      (2)

Therefore, α < 1 − e/√σ is a conservative condition on the error level that ensures "few"
matches. Therefore, the maximum level satisfies α* ≥ 1 − e/√σ.
The proof is obtained using a combinatorial model. Based on the observation that m − k
common characters must appear in the same order in two strings that match with k errors, all
the possible alternatives to select the matching characters from both strings are enumerated. This
model, however, does not take full advantage of the properties of the edit distance: even if m − k
characters match, the distance can be larger than k. For example, ed("abc", "bcd") = 2, i.e. although
two characters match, the distance is not 1.
On the other hand, the only optimistic bound we know of is based on considering that only replacements are allowed (i.e. Hamming distance). This distance is simpler to analyze but its matching
probability is much lower. Using again a combinatorial model it is shown that the matching probability is f(m, k) ≈ γ^m m^{−1/2}, where now

    γ = (σ − 1)^α / ( α^α (1−α)^{1−α} σ )

Therefore an upper bound for the maximum α* value is 1 − 1/σ, since otherwise it can be
proved that f(m, k) is not exponentially decreasing on m (i.e. it is Ω(m^{−1/2})).
We can now prove that the average edit distance is larger than m(1 − e/√σ) for any σ. We define
p(m, k) as the probability that the edit distance between two strings of length m is at most k. Note
that p(m, k) ≤ f(m, k) because in the latter case we can match with any text suffix of length from
m − k to m + k. Then the average edit distance is

    Σ_{k=0..m} k Pr(ed = k)  =  Σ_{k=0..m} Pr(ed > k)  =  Σ_{k=0..m} (1 − p(m, k))  =  m − Σ_{k=0..m} p(m, k)
                             ≥  m − (K p(m, K) + (m − K))  =  K (1 − p(m, K))

for any K of our choice. In particular, for K/m < 1 − e/√σ we have that p(m, K) ≤ f(m, K) =
O(γ^m) for γ < 1. Therefore choosing K = m(1 − e/√σ) − 1 yields that the average edit distance is at least
m(1 − e/√σ) + O(1), for any σ. As we see later, this proof converts a conjecture about the average
running time of an algorithm [CL92] into a fact.
We verify the analysis experimentally in this section (this is also taken from [BYN99, Nav98a]).
The experiment consists of generating a large random text (n = 10 Mb) and running the search
of a random pattern on that text, allowing k = m errors. At each text character, we record the
minimum allowed error k for which that text position would match the pattern. We repeat the
experiment with 1,000 random patterns.
Finally, we build the cumulative histogram, finding how many text positions have matched with
up to k errors, for each k value. We consider that k is "low enough" up to where the histogram
values become significant, that is, as long as few text positions have matched. The threshold is set
to n/m², since m² is the normal cost of verifying a match. However, the selection of this threshold
is not very important, since the histogram is extremely concentrated. For example, for m in the
hundreds, it moves from almost zero to almost n in just five or six increments of k.
Figure 5 shows the results for σ = 32. On the left we show the histogram we have built, where
the matching probability undergoes a sharp increase at α*. On the right we show the α* value as
m grows. It is clear that α* is essentially independent of m, although it is a bit lower for short
patterns. The increase in the left plot at α* is so sharp that the right plot would be the same if we
plotted the value of the average edit distance divided by m.
Figure 6 uses a stable m = 300 to show the α* value as a function of σ. The curve α = 1 − 1/√σ
is included to show its closeness to the experimental data. Least squares give the approximation
α* = 1 − 1.09/√σ, with a relative error smaller than 1%. This shows that the upper bound analysis
(Eq. (2)) matches reality better, provided we replace e by 1.09 in the formulas.
Therefore, we have shown that the matching probability has a sharp behavior: for low α it is
very low, not as low as 1/σ^m like exact string matching, but still exponentially decreasing in m,
with an exponent base larger than 1/σ. At some α value (that we called α*) it sharply increases
and quickly becomes almost 1. This point is close to α* = 1 − 1/√σ in practice.
Figure 5: On the left, the probability f(m, αm) of an approximate match as a function of the error level α (m = 300).
On the right, the observed α* value as a function of the pattern length m. Both cases correspond
to random text with σ = 32.
[Figure 6 plots, as functions of σ: the experimental α* data, the curve 1 − 1/√σ, the exact lower bound with γ = 1 (Eq. (1)), and the conservative lower bound of Eq. (2).]
Figure 6: Theoretical and practical values for α*, for m = 300 and different σ values.
This is why the problem has interest only up to a given error level, since for higher errors almost
all text positions match. This is also the reason why some algorithms have good average
behavior only for low enough error levels. The point α* = 1 − 1/√σ matches the conjecture of
[SM83].
We present now the first algorithm to solve the problem. It has been rediscovered many times in
the past, in different areas, e.g. [Vin68, NW70, San72, Sel74, WF74, LW75] (there are more in
[Ull77, SK83, Kuk92]). However, this algorithm computed the edit distance, and it was converted
into a search algorithm only in 1980 by Sellers [Sel80]. Although the algorithm is not very efficient,
it is among the most flexible ones to adapt to different distance functions.
We first show how to compute the edit distance between two strings x and y. Later, we extend
that algorithm to search a pattern in a text allowing errors. Finally, we show how to handle more
general distance functions.
The algorithm is based on dynamic programming. Imagine that we need to compute ed(x, y).
A matrix C_{0..|x|, 0..|y|} is filled, where C_{i,j} represents the minimum number of operations needed to
match x_{1..i} to y_{1..j}. This is computed as follows

    C_{i,0} = i
    C_{0,j} = j
    C_{i,j} = if (x_i = y_j) then C_{i−1,j−1}
              else 1 + min(C_{i−1,j}, C_{i,j−1}, C_{i−1,j−1})

where at the end C_{|x|,|y|} = ed(x, y). The rationale of the above formula is as follows. First, C_{i,0} and
C_{0,j} represent the edit distance between a string of length i or j and the empty string. Clearly i
(respectively j) deletions are needed on the nonempty string. For two non-empty strings of length
i and j, we assume inductively that all the edit distances between shorter strings have already been
computed, and try to convert x_{1..i} into y_{1..j}.
[Figure 7: taxonomy of the algorithms based on the dynamic programming matrix. Edit distance computation: [MP80] Four Russians, O(mn/log² n) time, O(n) extra space; [Ukk85a, Mye86b] O(k²) edit distance, O(k²) space. Text searching, theoretical: [Sel80] first O(mn) search algorithm, O(m) space; [LV88] diagonal transition, O(k²n) time, O(m) space; [LV89] O(kn) time, O(n) space; [Mye86a] all matches, O(kn) time, O(n) space; [GG88] partial suffix tree, O(kn) time, O(m) space; [GP90] prefix matrix, O(kn) time, O(m²) space; [UW93] suffix automaton of the pattern, O(kn) time, O(m) space; [GP90, CL94] suffix tree of the pattern, O(kn) time, O(m) space; [SV97] and [CH98] pattern periodicity, O(n(1 + k^c/m)) time (c = 3 or 4). Text searching, practical: [Ukk85b] cut-off heuristic, O(kn) expected time, O(m) space, analyzed in [CL92, BYN99]; [Mye86a, GP90] brute force, O(kn) expected time, O(k) space; [CL92] column partitioning, O(kn/√σ) expected time, O(m) space.]
Consider the last characters x_i and y_j. If they are equal, then we do not need to consider them
and we proceed in the best possible way to convert x_{1..i−1} into y_{1..j−1}. On the other hand, if they
are not equal, we must deal with them in some way. Following the three allowed operations, we
can delete x_i and convert in the best way x_{1..i−1} into y_{1..j}, insert y_j at the end of x_{1..i} and convert
in the best way x_{1..i} into y_{1..j−1}, or replace x_i by y_j and convert in the best way x_{1..i−1} into y_{1..j−1}.
In all cases, the cost is 1 plus the cost for the rest of the process (already computed). Notice that
the insertions in one string are equivalent to deletions in the other.
The dynamic programming algorithm must fill the matrix in such a way that the upper, left, and
upper-left neighbors of a cell are computed prior to computing that cell. This is easily achieved
by either a row-wise left-to-right traversal or a column-wise top-to-bottom traversal. Figure 8
illustrates this algorithm to compute ed("survey", "surgery").
           s  u  r  g  e  r  y
        0  1  2  3  4  5  6  7
    s   1  0  1  2  3  4  5  6
    u   2  1  0  1  2  3  4  5
    r   3  2  1  0  1  2  3  4
    v   4  3  2  1  1  2  3  4
    e   5  4  3  2  2  1  2  3
    y   6  5  4  3  3  2  2  2

Figure 8: The dynamic programming algorithm to compute the edit distance between "survey"
and "surgery". The bold entries show the path to the final result.
Therefore, the algorithm is O(|x||y|) time in the worst and average case. However, the space
required is only O(min(|x|, |y|)). This is because, in the case of a column-wise processing, only
the previous column must be stored in order to compute the new one, and therefore we just keep
one column and update it. We can process the matrix row-wise or column-wise so that the space
requirement is minimized.
On the other hand, the sequence of operations performed to transform x into y can be easily
recovered from the matrix, simply by proceeding from the cell C_{|x|,|y|} to the cell C_{0,0} following the
path (i.e. sequence of operations) that matches the update formula (multiple paths may exist).
In this case, however, we need to store the complete matrix or at least an area around the main
diagonal.
This matrix has some properties which can be easily proved by induction (see, e.g. [Ukk85a])
and which make it possible to design better algorithms. Some of the most used are that the
values of neighboring cells differ in at most one, and that upper-left to lower-right diagonals are
nondecreasing.
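As an illustration of the recurrence and of the space observation above, the following sketch (the function name is ours) keeps a single vector over the shorter string; it is a straightforward transcription of the formula, not an optimized implementation.

    def edit_distance(x, y):
        # Classical dynamic programming for ed(x, y) using O(min(|x|, |y|)) space.
        if len(y) > len(x):
            x, y = y, x                      # ed() is symmetric; keep the vector over the shorter string
        C = list(range(len(y) + 1))          # C[j] = C_{0,j} = j
        for i, xc in enumerate(x, 1):
            prev_diag = C[0]                 # holds C_{i-1,j-1}
            C[0] = i                         # C_{i,0} = i
            for j, yc in enumerate(y, 1):
                old = C[j]                   # C_{i-1,j}
                if xc == yc:
                    C[j] = prev_diag
                else:
                    C[j] = 1 + min(prev_diag, old, C[j - 1])
                prev_diag = old
        return C[len(y)]

    # edit_distance("survey", "surgery") == 2, in agreement with Figure 8.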
We show now how to adapt this algorithm to search a short pattern P in a long text T. The
algorithm is basically the same, with x = P and y = T (proceeding column-wise so that O(m)
space is required). The only difference is that we must allow any text position to be the potential
start of a match. This is achieved by setting C_{0,j} = 0 for all j ∈ 0..n. That is, the empty pattern
matches with zero errors at any text position (because it matches with a text substring of length
zero).
The algorithm then initializes its column C_{0..m} with the values C_i = i, and processes the text
character by character. At each new text character T_j, its column vector is updated to C′_{0..m}. The
update formula is

    C′_i = if (P_i = T_j) then C_{i−1}
           else 1 + min(C′_{i−1}, C_i, C_{i−1})

and the text positions where C_m ≤ k are reported.
The search time of this algorithm is O(mn) and its space requirement is O(m). This is a sort
of worst case in the analysis of all the algorithms that we consider later. Figure 9 exemplifies this
algorithm applied to search the pattern "survey" in the text "surgery" (a very short text indeed)
with at most k = 2 errors. In this case there are 3 occurrences.
           s  u  r  g  e  r  y
        0  0  0  0  0  0  0  0
    s   1  0  1  1  1  1  1  1
    u   2  1  0  1  2  2  2  2
    r   3  2  1  0  1  2  2  3
    v   4  3  2  1  1  2  3  3
    e   5  4  3  2  2  1  2  3
    y   6  5  4  3  3  2  2  2

Figure 9: The dynamic programming algorithm to search "survey" in the text "surgery" with
two errors. Bold entries indicate matching text positions.
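In code, the search version can be sketched as follows (an illustrative transcription of the update formula; the function name is ours).

    def search_dp(P, T, k):
        # Sellers' O(mn)-time, O(m)-space search allowing k errors.
        m = len(P)
        C = list(range(m + 1))              # column for the empty text prefix: C[i] = i
        occ = []
        for j, c in enumerate(T, 1):
            prev_diag = C[0]                # C_{i-1} of the previous column
            C[0] = 0                        # the empty pattern matches at every text position
            for i in range(1, m + 1):
                old = C[i]
                if P[i - 1] == c:
                    C[i] = prev_diag
                else:
                    C[i] = 1 + min(prev_diag, old, C[i - 1])
                prev_diag = old
            if C[m] <= k:
                occ.append(j)               # report the end position of an occurrence
        return occ

    # search_dp("survey", "surgery", 2) == [5, 6, 7]: the three occurrences of Figure 9.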
Finally, we point out that although we have presented a column-wise algorithm to fill the matrix,
many other works are based on alternative filling styles. There are applications for row-wise filling
[Nav98b] or even by (upper-left to lower-right) diagonals or "secondary" (upper-right to lower-left)
diagonals. Some filling styles, such as diagonals, need to set up a different recurrence to compute the
cells. We cover them later.
It is easy to adapt this algorithm to the other distance functions mentioned. If the operations have
different costs, we add the cost instead of adding 1 when computing C_{i,j}, i.e.

    C_{0,0} = 0
    C_{i,j} = min( C_{i−1,j−1} + δ(x_i, y_j),  C_{i−1,j} + δ(x_i, ε),  C_{i,j−1} + δ(ε, y_j) )

where we assume δ(a, a) = 0 for any a ∈ Σ and that C_{−1,j} = C_{i,−1} = ∞ for all i, j.
For distances that do not allow some operations, we just take them out of the minimization
formula, or, which is the same, we assign ∞ to their cost. For transpositions, we allow a fourth
rule that says that C_{i,j} can be C_{i−2,j−2} + 1 if x_{i−1} x_i = y_j y_{j−1} [LW75].
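A sketch of the generalized recurrence follows (the function name is ours); delta is an arbitrary cost function supplied by the caller, and forbidden operations are modeled with infinite cost.

    import math

    def weighted_edit_distance(x, y, delta):
        # General edit distance: delta(a, b) is the replacement cost, delta(a, "") the
        # deletion cost, delta("", b) the insertion cost, and delta(a, a) must be 0.
        C = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(1, len(x) + 1):
            C[i][0] = C[i - 1][0] + delta(x[i - 1], "")
        for j in range(1, len(y) + 1):
            C[0][j] = C[0][j - 1] + delta("", y[j - 1])
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                C[i][j] = min(C[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),
                              C[i - 1][j] + delta(x[i - 1], ""),
                              C[i][j - 1] + delta("", y[j - 1]))
        return C[len(x)][len(y)]

    # With delta(a, b) = (0 if a == b else 1) this is ed(); forbidding replacements
    # (delta(a, b) = math.inf for a != b, both nonempty) yields the LCS distance.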
The most complex case is to allow general substring replacements, in the form of a finite set R
of rules. The formula is given in [Ukk85a].

    C_{0,0} = 0
    C_{i,j} = min( C_{i−1,j−1} if x_i = y_j,
                   C_{i−|s1|, j−|s2|} + δ(s1, s2) for each (s1, s2) ∈ R such that x_{1..i} = x′ s1 and y_{1..j} = y′ s2 )

An interesting problem is how to compute this recurrence efficiently. A naive approach takes
O(|R| mn), where |R| is the sum of all the lengths of the strings in R. A better solution is to build
two Aho-Corasick automata [AC75] with the left and right hand sides of the rules, respectively.
The automata are run as we advance in both strings (left hand sides in x and right hand sides
in y). For each pair of states (i1, i2) of the automata we precompute the set of replacements that
can be tried (i.e. those δ's whose left and right hand sides match the suffixes of x and y, respectively,
represented by the automata states). Hence, we know in constant time (per cell) the set of possible
replacements. The complexity is now much lower; in the worst case it is O(c mn), where c is the
maximum number of rules applicable to a single text position.
As said, the dynamic programming approach is unbeaten in flexibility, but its time requirements
are indeed high. A number of improved solutions have been proposed over the years. Some of
them work only for the edit distance, while others can still be adapted to other distance functions.
Before considering the improvements, we mention that there exists a way to see the problem as a
shortest path problem on a graph built on the pattern and the text [Ukk85a]. This reformulation
has been conceptually useful for more complex variants of the problem.
this area is as old as the Sellers [Sel80] algorithm itself. In 1980, Masek and Paterson [MP80] found
an algorithm whose worst case cost is O(mn/log² n) and which requires O(n) extra space. This is an
improvement over the O(mn) classical complexity.
The algorithm is based on the Four Russians technique [ADKF75]. Basically, it replaces the
alphabet Σ by r-tuples (i.e. Σ^r) for a small r. Considered algorithmically, it first builds a table of
solutions of all the possible problems (i.e. portions of the matrix) of size r × r, and then uses the
table to solve the original problem in blocks of size r. Figure 10 illustrates.
The values inside the r × r size cells depend on the corresponding letters in the pattern and
the text, which gives σ^{2r} possibilities. They also depend on the values in the last column and row
of the upper and left cells, as well as the bottom-right state of the upper left cell (see Figure 10).
Since neighboring cells differ in at most one, there are only three choices for adjacent cells once the
current cell is known. Therefore, this adds only m(3^{2r}) possibilities. In total, there are m(3σ)^{2r}
different cells to precompute. Using O(n) memory we have enough space for r = log_{3σ} n, and since
we finally compute mn/r² cells, the final complexity follows.
Figure 10: The Masek and Paterson algorithm partitions the dynamic programming matrix in cells
(r = 2 in this example). On the right, we shaded the entries of adjacent cells that influence the
current one.
The algorithm is only of theoretical interest, since, as the same authors estimate, it will not beat
the classical algorithm for texts below 40 GB in size (and it would need that extra space!). Adapting
it to other distance functions seems not difficult, but the dependencies among different cells may
become more complex.
Ukkonen 1983 In 1983, Ukkonen [Ukk85a] presented an algorithm able to compute the edit
distance between two strings x and y in O(ed(x, y)²) time, or to check in time O(k²) whether that
distance is at most k or not. This is the first member of what has been called "diagonal transition
algorithms", since it is based on the fact that the diagonals of the dynamic programming matrix
(running from the upper-left to the lower-right cells) are monotonically increasing (more than that,
C_{i+1,j+1} ∈ {C_{i,j}, C_{i,j} + 1}). The algorithm is based on computing in constant time the positions
where the values along the diagonals are incremented. Only O(k²) such positions are computed to
reach the lower-right decisive cell.
Figure 11 illustrates the idea. Each diagonal stroke represents a number of errors, and is a
sequence of positions where both strings match. A new stroke of e errors is guaranteed to extend at
least as far as the adjacent strokes of e − 1 errors allow, and from there it continues for as long as
both strings keep matching. To compute each stroke in constant time
we need to know until where it matches the text. The way to do this in constant time is explained
shortly.
Figure 11: On the left, the O(k²) algorithm to compute the edit distance. On the right, the way
to compute the strokes in diagonal transition algorithms. The solid bold line is guaranteed to be
part of the new stroke of e errors, while the dashed part continues as long as both strings match.
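As an illustration of this idea, the following sketch (names and representation are ours) computes the edit distance with the diagonal transition recurrence; the strokes are extended here by plain character comparisons, so it shows the O(k²) strokes but not the constant-time stroke extension that is the subject of the algorithms below.

    def edit_distance_diagonal(x, y):
        # L[d] is the furthest row i reached on diagonal d (matching x[1..i]
        # against y[1..i+d]) with the current number of errors e.
        m, n = len(x), len(y)
        NEG = -(m + n + 2)                       # marks a diagonal not reached yet
        i = 0
        while i < m and i < n and x[i] == y[i]:
            i += 1                               # stroke with 0 errors on diagonal 0
        if i == m and i == n:
            return 0
        L = {0: i}
        for e in range(1, m + n + 1):
            newL = {}
            for d in range(-min(e, m), min(e, n) + 1):
                i = max(L.get(d, NEG) + 1,       # replacement: advance in both strings
                        L.get(d - 1, NEG),       # insertion: advance only in y
                        L.get(d + 1, NEG) + 1)   # deletion: advance only in x
                if i < 0:
                    continue
                i = min(i, m, n - d)
                while i < m and i + d < n and x[i] == y[i + d]:
                    i += 1                       # extend the stroke while characters match
                newL[d] = i
                if i == m and i + d == n:
                    return e
            L = newL
        return max(m, n)                         # never reached

    # edit_distance_diagonal("survey", "surgery") == 2.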
Landau and Vishkin 1985 In 1985 and 1986, Landau and Vishkin found the first worst-case
time improvements for the search problem. All of them and the thread that followed were diagonal
transition algorithms. In 1985 [LV88] they show an algorithm which is O(k²n) time and O(m)
space, and in 1986 [LV89] they obtain O(kn) time and O(n) space.
The main idea of Landau and Vishkin was to adapt to text searching Ukkonen's diagonal
transition algorithm for edit distance [Ukk85a]. Basically, the dynamic programming matrix was
computed diagonal-wise (i.e. stroke by stroke) instead of column-wise. They wanted to compute
in constant time the length of each stroke (i.e. the point where the values along a diagonal were to
be incremented). Since a text position was to be reported when matrix row m was reached before
incrementing more than k times the values along the diagonal, this gave immediately the O(kn)
algorithm. Another way to see it is that each diagonal is abandoned as soon as the k-th stroke
ends; there are n diagonals and hence nk strokes, each of them computed in constant time (recall
Figure 11).
A recurrence on diagonals (d) and numbers of errors (e), instead of rows (i) and columns (j), is
set up. Denoting by L_{d,e} the last row of the stroke with e errors on diagonal d, the stroke begins
where the best of the three strokes with e − 1 errors (on diagonals d − 1, d and d + 1) leaves it, and
is then extended while the pattern and text characters match:

    L_{d,e} = i + max_ℓ { ℓ : P_{i+1..i+ℓ} = T_{d+i+1..d+i+ℓ} },
    where i = max( L_{d,e−1} + 1, L_{d−1,e−1}, L_{d+1,e−1} + 1 )

with suitable boundary values for e = −1 and for the diagonals not yet reached. The external loop
updates e from 0 to k and the internal one updates d from −e to n. Negative numbered diagonals
are those virtually starting before the first text position. Figure 12 shows our search example using
this recurrence.
[Figure 12 shows the L_{d,e} values of the example, with one column per diagonal d = −3..7 and one row per number of errors e = 0..2.]
Figure 12: The diagonal transition matrix to search "survey" in the text "surgery" with two
errors. Bold entries indicate matching diagonals. The rows are e values and the columns are the d
values.
The difficult part is how to compute the strokes in constant time (i.e. the max_ℓ(...)). The
problem is equivalent to knowing the longest prefix of P_{i..} that matches T_{j..}. This data
has been called thereafter "matching statistics". The algorithms of this section differ basically in
how they manage to compute the matching statistics fast.
We defer for later the explanation of [LV88]. In [LV89], this is done by building the suffix tree
(see Section 3.2) of T;P (text concatenated with pattern), which is where the huge O(n) extra space comes
from. The longest prefix common to both suffixes P_{i..} and T_{j..} can be visualized in the suffix tree
as follows: imagine the root to leaf paths that end in each of the two suffixes. Both paths share
the beginning of the path (at least they share the root). The last suffix tree node common to both
paths represents a substring which is precisely the longest common prefix. In the literature, this
last common node is called the lowest common ancestor (LCA) of the two nodes.
Despite being conceptually clear, it is not easy to find this node in constant time. In 1986,
the only existing LCA algorithm was [HT84], which had constant amortized time, i.e. it answered
n′ > n LCA queries in O(n′) time. In our case we have kn queries, so each one finally cost O(1).
The resulting algorithm, however, is quite slow in practice.
Myers 1986 In 1986, Myers also found an algorithm with O(kn) worst-case behavior [Mye86a].
It needed O(n) extra space, and shared the idea of computing the k new strokes using the previous
ones, as well as the use of a suffix tree on the text for the LCA algorithm. Unlike other algorithms,
this one is able to report the O(kn) matching substrings of the text (not only the endpoints) in
O(kn) time. This makes the algorithm suitable for more complex applications, for instance in
computational biology. The original reference is a technical report that never went to press, but it
has recently been included in a larger work [LMS98].
Galil and Giancarlo 1988 In 1988, Galil and Giancarlo [GG88] obtained the same time complexity as Landau and Vishkin using O(m) space. Basically, the suffix tree of the text is built by
overlapping pieces of size O(m). The algorithm scans the text four times, being even slower than
[LV89]. Therefore, the result was of theoretical interest.
Galil and Park 1989 One year later, in 1989, Galil and Park [GP90] obtained O(kn) worst-case
time and O(m²) space, worse in theory than [GG88] but much better in practice. This time the
idea was to build the matching statistics of the pattern against itself (longest match between P_{i..}
and P_{j..}, hence the O(m²) complexity), resembling in some sense the basic ideas of [KMP77]. This
algorithm is still slow in practice anyway.
The algorithm goes diagonal by diagonal, computing its k strokes, and using the k strokes of
the previous diagonal to save some character comparisons in the current one. Note that if there is
a stroke in the previous diagonal d spanning rows i1 to i2, this means that T_{d+i1..d+i2} = P_{i1..i2}, and
therefore we can compare the pattern against itself instead of against the text.
As we have precomputed all the longest matches of the pattern against itself, we can use that
information to know the longest match between pattern and text. So we consider the longest match
of the pattern against itself, starting at the two positions corresponding to the current stroke and
the stroke below it (in the previous diagonal, which we call the "stroke-below"). Depending on
whether this longest match is shorter or longer than the stroke-below itself, there are three
alternatives (recall Figure 11). First, if the stroke-below ends before the longest match, then the
current stroke finishes at the same row as the stroke-below (since it differed from the text while the
current stroke was still equal to it). Second, if the stroke-below ends after the longest match, the current
stroke ends where the longest match does (since the stroke-below was still equal to the text when
it differed from the current stroke). Finally, if both end together, we cannot know the result and
we need to consider the next stroke of the previous diagonal.
As shown in [GP90], the total cost to compute the k strokes of the new diagonal is O(k), since
a linear pass is performed over the previous diagonal overall. Although the algorithm can work O(k) for
a single stroke of the new diagonal, it cannot work more than O(k) for the whole diagonal (there
are also up to k direct comparisons performed between text and pattern, when the stroke-below is
of length zero). A very similar algorithm had been presented before by Landau and Vishkin [LV88],
but it has O(k²) cost to compute each new diagonal.
Finally, Galil and Park show that the O(m²) extra space can be reduced to O(m) by using a
suffix tree of the pattern (not of the text as in previous work) and LCA algorithms, so we add different
entries in Figure 7. They also show how to add transpositions to the edit operations at the same
complexity. This technique can be extended to all these diagonal transition algorithms. We believe
that allowing different integral costs for the operations or forbidding some of them can be achieved
with simple modifications of the algorithms.
Ukkonen and Wood 1990 An idea similar to that of using the suffix tree of the pattern (and
similarly slow in practice) was independently discovered by Ukkonen and Wood in 1990 [UW93].
They use a suffix automaton (described in Section 3.2) on the pattern to find the matching statistics,
instead of the table. As the algorithm progresses over the text, the suffix automaton keeps track of
the pattern substrings that match the text at any moment. Although they report O(m²) space
for the suffix automaton, it can take O(m) space.
Chang and Lawler 1990 Also in 1990, Chang and Lawler [CL94] repeated the idea that was
briefly mentioned in [GP90]: the matching statistics can be computed using the suffix tree of the
pattern and LCA algorithms. However, they used a newer and faster LCA algorithm [SV88], truly
O(1), and reported the best time among algorithms with guaranteed O(kn) performance. However,
the algorithm is still not competitive in practice.
Cole and Hariharan 1998 In 1998, Cole and Hariharan [CH98] presented an algorithm with
worst case O(n(1 + k^c/m)), where c = 3 if the pattern is "mostly aperiodic" and c = 4 otherwise.
The idea is that, unless a pattern has a lot of self-repetition, only a few diagonals of a diagonal
transition algorithm need to be computed.
This algorithm can be thought of as a filter (see the next sections) with a worst case guarantee, useful
for very small k. It resembles some ideas of the filters developed in [CL94]. Probably other filters
can be proved to have good worst cases under some periodicity assumptions on the pattern, but
this thread has not been explored up to now. This algorithm is an evolution of a previous one
[SV97], which is more complex and has a worse complexity, namely O(nk^8 (σ log n)^{1/log 3}). In any
case, the interest of this work is theoretical too.
algorithm, a short note at the end of [Ukk85b], improved the dynamic programming algorithm
to O(kn) expected time and O(m) space. This algorithm has later been called the "cut-off
heuristic". The main idea is that, since a pattern does not normally match in the text, the values
at each column (read from top to bottom) quickly reach k + 1 (i.e. mismatch), and that if a cell has a
value larger than k + 1, the result of the search does not depend on its exact value. A cell is called
active if its value is at most k. The algorithm simply keeps track of which is the last active cell
and avoids working on the rest of the cells.
To keep the last active cell, we must be able to recompute it for each new column. At each new
column, the last active cell can be incremented by at most one, so we check whether we have activated
the next cell at O(1) cost. However, it is also possible that the cell which was the last active one becomes
inactive now. In this case we have to search upwards for the new last active cell. Although
we can work O(m) in a given column, we cannot work more than O(n) overall, because there
are at most n increments of this value in the whole process, and hence there are no more than n
decrements. Hence, the last active cell is maintained at O(1) amortized cost per column.
Ukkonen conjectured that this algorithm was O(kn) on average, but this was proven only in
1992 by Chang and Lampe [CL92]. The proof was refined in 1996 by Baeza-Yates and Navarro
[BYN99]. The result can probably be extended to more complex distance functions, although with
substrings the last active cell must exceed k by enough to ensure that it can never return to a value
smaller than k. In particular, it must have the value k + 2 if transpositions are allowed.
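A sketch of the cut-off heuristic follows (the function name is ours); it updates each column only up to one cell past the last active one and then restores the last-active pointer, as described above.

    def search_cutoff(P, T, k):
        # Column-wise search with Ukkonen's cut-off: cells below the last active one
        # (value <= k) are not recomputed, giving O(kn) expected time.
        m = len(P)
        C = list(range(m + 1))
        last = min(k, m)                     # last active cell of the initial column
        occ = []
        for j, c in enumerate(T, 1):
            prev_diag = C[0]
            C[0] = 0
            top = min(last + 1, m)           # only rows 1..top need to be computed
            for i in range(1, top + 1):
                old = C[i]
                if P[i - 1] == c:
                    C[i] = prev_diag
                else:
                    C[i] = 1 + min(prev_diag, old, C[i - 1])
                prev_diag = old
            if top == m and C[m] <= k:
                occ.append(j)                # occurrence ending at text position j
            last = top                       # recompute the last active cell
            while last > 0 and C[last] > k:
                last -= 1
        return occ

    # search_cutoff("survey", "surgery", 2) == [5, 6, 7], as with the plain algorithm.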
Myers 1986 An algorithm in [Mye86a] is based on diagonal transitions like those of the previous
sections, but the strokes are simply computed by brute force. Myers showed that the resulting
algorithm is O(kn) on average. This is clear because the length of the strokes is σ/(σ − 1) = O(1)
on average. The same algorithm was proposed again in 1989 by Galil and Park [GP90]. Since only
the k strokes need to be stored, the space is O(k).
Chang and Lampe 1992 In 1992, Chang and Lampe [CL92] gave a new algorithm called "column partitioning", based on exploiting a different property of the dynamic programming matrix.
They consider again the fact that, along each column, the numbers are normally increasing. They
work on "runs" of consecutive increasing cells (a run ends when C_{i+1} ≠ C_i + 1). They manage to
work O(1) per run in the column update process.
To update each run in constant time, they precompute loc(j, x) = min { j′ ≥ j : P_{j′} = x } for all pattern
positions j and all characters x (hence it needs O(mσ) space). At each column of the matrix, they
consider the current text character x and the current row j, and know in constant time where the
run is going to end (i.e. at the next character match). The run can also end earlier, namely where the parallel
run of the previous column ends.
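A sketch of the loc precomputation follows (names are ours); the run-processing logic itself is not shown.

    def build_loc(P, alphabet):
        # loc[c][j] = smallest position j' >= j (1-based) with P[j'] == c,
        # or m + 1 if there is none; this is the O(m*sigma)-space table mentioned above.
        m = len(P)
        loc = {c: [m + 1] * (m + 2) for c in alphabet}
        for j in range(m, 0, -1):
            for c in alphabet:
                loc[c][j] = loc[c][j + 1]
            loc[P[j - 1]][j] = j
        return loc

    # e.g. loc = build_loc("survey", set("survey")); then loc['r'][1] == 3 and
    # loc['r'][4] == 7 (no further 'r' from position 4 on).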
Based on empirical observations, they conjecture that the average length of the runs is O(√σ).
Notice that this matches our result that the average edit distance is m(1 − e/√σ), since this is the
number of increments along a column, and therefore there are O(m/√σ) non-increments (i.e. runs).
From there it is clear that each run has average length O(√σ). Therefore, we have just proved
Chang and Lampe's conjecture.
Since the paper uses the cut-off heuristic of Ukkonen, their average search time is O(kn/√σ).
This is, in practice, the fastest algorithm of this class.
Unlike the other algorithms of this section, it seems difficult to adapt [CL92] to other distance functions, since the idea strongly relies on the unitary costs. It is mentioned that the algorithm could run in average time O(kn log log(m)/σ), but it would be impractical.
[Figure (historical development of the automaton-based algorithms): [MP80] Four Russians technique; [Mel96] improved analysis, (k + 2)^{m−k}(k + 1)! states, σ replaced by min(σ, m); [Kur96, Nav97b] lazy automaton; [WMM96] O(mn/ log s) time and O(s) space.]
An alternative and very useful way to consider the problem is to model the search with a non-deterministic automaton (NFA). This automaton (in its deterministic form) was first proposed in [Ukk85b], and first used in non-deterministic form (although implicitly) in [WM92a]. It is shown explicitly in [BY91, BY96, BYN99].
Consider the NFA for k = 2 errors under edit distance shown in Figure 14. Every row denotes the number of errors seen (the first row zero, the second row one, etc.). Every column represents matching a pattern prefix. Horizontal arrows represent matching a character (i.e. if the pattern
and text characters match, we advance in the pattern and in the text). All the others increment
the number of errors (move to the next row): vertical arrows insert a character in the pattern (we
advance in the text but not in the pattern), solid diagonal arrows replace a character (we advance
in the text and pattern), and dashed diagonal arrows delete a character of the pattern (they are
"-transitions, since we advance in the pattern without advancing in the text). The initial self-loop
allows a match to start anywhere in the text. The automaton signals (the end of) a match whenever
a rightmost state is active. If we do not care about the number of errors of the occurrences, we can
consider final states those of the last full diagonal.
It is not hard to see that once a state in the automaton is active, all the states of the same
column and higher rows are active too. Moreover, at a given text character, if we collect the
smallest active rows at each column, we obtain the vertical vector of the dynamic programming
algorithm (in this case [0, 1, 2, 3, 3, 3, 2], compare to Figure 9).
Figure 14: An NFA for approximate string matching of the pattern "survey" with two errors. The
shaded states are those active after reading the text "surgery".
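To make the transitions concrete, the following Python sketch (ours; names are hypothetical) simulates the NFA of Figure 14 with explicit sets of active states. It is of course much slower than the bit-parallel simulations of Section 7, but it shows exactly which arrows exist: states are pairs (prefix length, errors); deletions are the ε-transitions handled by the closure.

    def eps_closure(states, m, k):
        """Follow the dashed diagonal (deletion) arrows, which consume no text."""
        stack, closed = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in closed:
                closed.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return closed

    def nfa_search(pattern, text, k):
        """Report end positions of occurrences with at most k errors (edit distance)."""
        m = len(pattern)
        current = eps_closure({(0, 0)}, m, k)
        matches = []
        for j, c in enumerate(text):
            nxt = {(0, 0)}                       # initial self-loop: a match may start anywhere
            for (i, e) in current:
                if i < m and pattern[i] == c:    # horizontal arrow: match a character
                    nxt.add((i + 1, e))
                if e < k:
                    nxt.add((i, e + 1))          # vertical arrow: insertion into the pattern
                    if i < m:
                        nxt.add((i + 1, e + 1))  # solid diagonal: substitution
            current = eps_closure(nxt, m, k)
            if any(i == m for (i, e) in current):
                matches.append(j)                # a rightmost state is active: occurrence ends at j
        return matches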
Other types of distances (Hamming, LCS and Episode) are obtained by deleting some arrows of the automaton. Different integer costs for the operations can also be modeled by changing the arrows. For instance, if insertions cost 2 instead of 1, we make the vertical arrows move from row i to row i + 2. Transpositions are modeled by adding an extra state S_{i,j} between each pair of states at positions (i, j) and (i + 1, j + 2), with an arrow labeled P_{i+2} from state (i, j) to S_{i,j} and an arrow labeled P_{i+1} from S_{i,j} to (i + 1, j + 2) [Mel96]. Adapting to general substring replacement needs more complex setups, but it is always possible.
This automaton can be simply made deterministic to obtain O(n) worst case search time.
However, as we see next, the main problem becomes the construction of the DFA (deterministic
finite automaton). An alternative solution is based on simulating the NFA instead of making it
deterministic.
Ukkonen 1985 In 1985, Ukkonen [Ukk85b] proposed making this automaton deterministic, where each state corresponds to a possible column of the dynamic programming matrix. He showed that the elements in the columns that were larger than k + 1 could be replaced by k + 1 without affecting the output of the search (the lemma was used in the same paper to design the cut-off heuristic described in Section 5.3). This reduced the potential number of different columns. He also showed that adjacent cells in a column differ by at most one. Hence, the column states could be defined as a vector of m incremental values in the set {−1, 0, 1}.
All this made it possible to obtain in [Ukk85b] a nontrivial bound on the number of states of the automaton, namely O(min(3^m, m(2mσ)^k)). This size, although much better than the obvious O((k + 1)^m), is still very large except for short patterns or very low error levels. The resulting space complexity of the algorithm is m times the above value. This exponential space, together with the preprocessing time to build the automaton, has to be added to the O(n) search time.
As a final comment, Ukkonen suggested that the columns could be computed only partially, up to, say, 3k/2 entries. Since he conjectured (and it was later proved in [CL92]) that the columns of interest are O(k) on average, this would normally not affect the algorithm, while reducing the number of possible states. If at some point the states not computed were really needed, the
algorithm would compute them by dynamic programming.
Notice that to incorporate transpositions and substring replacements into this scheme we
need to consider that each state is the set of the j last columns of the dynamic programming matrix,
where j is the longest left hand side of a rule. In this case it is better to build the automaton of
Figure 14 explicitly and make it deterministic.
Wu, Manber and Myers 1992 It was not until 1992 that Wu, Manber and Myers looked again
into this problem [WMM96]. The idea was to trade time for space using a Four Russians technique
[ADKF75]. Since the cells could be expressed using only values in {−1, 0, 1}, the columns were
partitioned into blocks of r cells (called \regions") which took 2r bits each. Instead of precomputing
the transitions from a whole column to the next, the transitions from a region to the next region
in the column were precomputed, although the current region could now depend on three previous
regions (see Figure 15). Since the regions were smaller than the columns, much less space was
necessary. The total amount of work was O(m/r) per column in the worst case, and O(k/r) on average. The space requirement was exponential in r. By using O(n) extra space, the algorithm was O(kn/ log n) on average and O(mn/ log n) in the worst case. Notice that this shares with [MP80] the Four Russians approach, but there is an important difference: the states in this case do not depend on the letters of the pattern and text. The states of the "automaton" of [MP80], on the other hand, depend on the text and pattern.
This Four Russians approach is so flexible that this work was extended to handle regular expressions allowing errors [WMM95]. The technique for exact regular expression searching is to pack portions of the deterministic automaton in bits and compute transition tables for each portion. The few transitions among portions are left nondeterministic and simulated one by one. To allow errors, each state is no longer simply active or inactive; rather, it keeps count of the minimum number of errors that makes it active, in O(log k) bits.
Melichar 1995 In 1995, Melichar [Mel96] studied again the size of the deterministic automaton. By considering the properties of the NFA of Figure 14, he refined the bound of [Ukk85b] to O(min(3^m, m(2mt)^k, (k + 2)^{m−k}(k + 1)!)), where t = min(m + 1, σ). The space complexity and
Figure 15: On the left, the automaton of [Ukk85b] where each column is a state. On the right, the
automaton of [WMM96] where each region is a state. Both compute the columns of the dynamic
programming matrix.
preprocessing time of the automaton is t times the number of states. Melichar also conjectured
that this automaton is bigger when there are periodicities in the pattern, which matches with the
results of [CH98] (Section 5.2), in the sense that periodic patterns are more problematic. This is
in fact a property shared with other problems in string matching.
Kurtz 1996 In 1996, Kurtz [Kur96] proposed another way to reduce the space requirements to
at most O(mn). It is an adaptation of [BYG94], which first proposed it for the Hamming distance. The idea was to build the automaton in lazy form, i.e. build only the states and transitions actually reached in the processing of the text. The automaton starts as just one initial state, and the states and transitions are built as needed. By doing this, the transitions that Ukkonen [Ukk85b] considered unnecessary are simply never built, with no need to guess which ones they are. The price is the extra overhead of a lazy construction versus a direct construction, but the idea pays off. Kurtz also proposed building only the initial part of the automaton (which should include the most commonly traversed states) to save space.
Navarro studied in [Nav97b, Nav98a] the growth of the complete and lazy automata as a function of m, k and n (this last value for the lazy automaton only). The empirical results show that the lazy automaton grows with the text at a sublinear rate O(n^β), for some 0 < β < 1 that depends on σ, m and k. Some replacement policies designed to work with bounded memory are proposed in [Nav98a].
7 Bit-Parallelism
These algorithms are based on exploiting the parallelism of the computer when it works on bits.
This is also a new (after 1990) and very active area. The basic idea is to \parallelize" another
algorithm using bits. The results are interesting from the practical point of view, and are especially
significant when short patterns are involved (typical in text retrieval). They may work effectively for any error level.
In this section we find elements which could strictly belong to other sections, since we parallelize other algorithms. There are two main trends: parallelize the work of the non-deterministic
automaton that solves the problem (Figure 14), and parallelize the work of the dynamic programming matrix.
We first explain the technique and then the results achieved using it. Figure 16 shows the
historical development of this area.
[Figure 16 (historical development of bit-parallel algorithms): parallelizing the automaton: [BY89] birth of bit-parallelism; [WM92a] bit-parallel NFA, O(k⌈m/w⌉n) time; [BYN99] NFA parallelized by diagonals, O(⌈km/w⌉n) worst case, O(k²n/w) on average. Parallelizing the matrix: [Wri94] parallelized DP matrix, O(mn log(σ)/w) time; [Mye98] optimal parallelization of the DP matrix, O(mn/w) worst case, O(kn/w) on average.]
This technique, in common use in string matching [BY91, BY92], was born in the PhD thesis of Baeza-Yates [BY89]. It consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By cleverly using this fact, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in the computer word. Since in current architectures w is 32 or 64, the speedup is very significant in practice and improves with technological progress. In order to relate the behavior of bit-parallel algorithms to other work, it is normally assumed that w = Θ(log n), as dictated by the RAM model of computation. We prefer, however, to keep w as an independent value. We introduce now some
notation we use for bit-parallel algorithms.
We use C-like syntax for operations on the bits of computer words: "|" is the bitwise-or, "&" is the bitwise-and, "^" is the bitwise-xor and "~" complements all the bits. The shift-left operation, "<<", moves the bits to the left and enters zeros from the right, i.e. b_m b_{m−1} ... b_2 b_1 << r = b_{m−r} ... b_2 b_1 0^r. The shift-right, ">>", moves the bits in the other direction. Finally, we can perform arithmetic operations on the bits, such as addition and subtraction, which operate on the bits as if they formed a number. For instance, b_ℓ ... b_x 10000 − 1 = b_ℓ ... b_x 01111.
We explain now the first bit-parallel algorithm, Shift-Or [BYG92], since it is the basis of much of what follows. The algorithm searches a pattern in a text (without errors) by parallelizing the operation of a non-deterministic finite automaton that looks for the pattern. Figure 17 illustrates this automaton.
The algorithm first builds a table B which, for each character c, stores a bit mask B[c] whose i-th bit is set if and only if P_i = c. The search state is kept in a mask D whose i-th bit is set whenever P_{1..i} matches the end of the text read up to now, and a match is reported whenever the m-th bit of D is set. Upon reading the text character T_j, D is updated as

    D' ← ((D << 1) | 0^{m−1}1) & B[T_j]
The formula is correct because the i-th bit is set if and only if the (i − 1)-th bit was set for the previous text character and the new text character matches the pattern at position i. In other words, T_{j−i+1..j} = P_{1..i} if and only if T_{j−i+1..j−1} = P_{1..i−1} and T_j = P_i. It is possible to relate this formula to the movement that occurs in the non-deterministic automaton for each new text character: each state gets the value of the previous state, but this happens only if the text character matches the corresponding arrow.
For patterns longer than the computer word (i.e. m > w), the algorithm uses ⌈m/w⌉ computer words for the simulation (not all of them are active all the time). The algorithm is O(n) on average.
The real algorithm uses the bits with the inverse meaning, and therefore the operation "| 0^{m−1}1" is not necessary. It also shifts in the other direction to ensure that a fresh zero fills the hole left by the shift, which is more machine dependent for the right shift. We have preferred to explain this more didactic version.
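The didactic version above is easy to code. The following Python sketch (ours; Python integers stand in for machine words, and the B table is a dictionary) reports the end positions of exact occurrences.

    def shift_and_search(pattern, text):
        """Didactic Shift-And: bit i of D is set iff P[0..i] matches the text read so far."""
        m = len(pattern)
        B = {}
        for i, c in enumerate(pattern):
            B[c] = B.get(c, 0) | (1 << i)
        D = 0
        matches = []
        for j, c in enumerate(text):
            D = ((D << 1) | 1) & B.get(c, 0)
            if D & (1 << (m - 1)):
                matches.append(j)          # an exact occurrence ends at text position j
        return matches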
It is easy to extend Shift-Or to handle classes of characters. In this extension, each position
in the pattern matches with a set of characters rather than with a single character. The classical
string matching algorithms are not so easily extended. In Shift-Or, it is enough to set the i-th bit
of B[c] for every c ∈ P_i (P_i is now a set). For instance, to search for "survey" in case-insensitive form, we just set to 1 the first bit of B["s"] and of B["S"], and the same with the rest. Shift-Or
can also search for multiple patterns (where the complexity is O(mn/w) if we consider that m is
the total length of all the patterns), and it was later enhanced [WM92a] to support a larger set of
extended patterns and even regular expressions.
Many online text algorithms can be seen as implementations of an automaton (classically, in its
deterministic form). Since its invention, bit-parallelism has become a general way to simulate simple non-deterministic automata instead of converting them to deterministic ones. It has the advantage of being much simpler, in many cases faster (since it makes better use of the computer registers), and easier to extend to handle complex patterns than its classical counterparts. Its main disadvantage is the limitation it imposes with regard to the size of the computer word. In many cases its adaptations to cope with longer patterns are not so efficient.
Wu and Manber 1992 In 1992, Wu and Manber published a number of ideas [WM92a] that had a great impact on the future of practical text searching. They first extended the Shift-Or algorithm to handle wild cards (i.e. allow an arbitrary number of characters between two given positions in the pattern) and regular expressions (the most flexible pattern that can be efficiently searched). What is of most interest to us is that they presented a simple scheme to combine any of the preceding extensions with approximate string matching.
The idea is to simulate the NFA of Figure 14 using bit-parallelism, so that each row i of the
automaton fits in a computer word R_i (each state is represented by a bit). For each new text character, all the transitions of the automaton are simulated using bit operations among the k + 1 computer words. Notice that all the k + 1 computer words have the same structure (i.e. the same bit is aligned to the same text position). The update formula to obtain the new R'_i values at text position j from the current R_i values is

    R'_0 = ((R_0 << 1) | 0^{m−1}1) & B[T_j]
    R'_{i+1} = ((R_{i+1} << 1) & B[T_j]) | R_i | (R_i << 1) | (R'_i << 1) | 0^{m−1}1

where B is the table of the Shift-Or algorithm and the terms of the second line account, respectively, for matching, substituting, inserting and deleting a character (the last term uses the already updated R'_i, reflecting the ε-transitions). The simulation costs O(k⌈m/w⌉) per text character. Moreover, the extensions designed for Shift-Or carry over directly: it suffices to change the basic matching step (e.g. a different B table for classes of characters, or the transitions that reflect the arrows that connect the words) and one has an algorithm to find the same pattern allowing
errors. Hence, they are able to perform approximate string matching with sets of characters, wild
cards, and regular expressions. They also allow some extensions unique to approximate searching:
a part of the pattern can be searched with errors and another may be forced to match exactly, and
different integer costs of the edit operations can be accommodated (including not allowing some of them). Finally, they are able to search a set of patterns at the same time, but this capability is very limited (since all the patterns must fit in a computer word).
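A small Python sketch of this row-wise simulation follows (ours, with hypothetical names); Python integers play the role of the k + 1 machine words, and the update follows the formula given above.

    def wm_approx_search(pattern, text, k):
        """Row-wise bit-parallel simulation of the NFA of Figure 14 (one word per error level)."""
        m = len(pattern)
        mask = (1 << m) - 1
        B = {}
        for i, c in enumerate(pattern):
            B[c] = B.get(c, 0) | (1 << i)
        # before reading any text, e errors allow deleting the first e pattern characters
        R = [(1 << e) - 1 for e in range(k + 1)]
        matches = []
        for j, c in enumerate(text):
            Bc = B.get(c, 0)
            old = R[0]
            new = [((R[0] << 1) | 1) & Bc]            # exact row: plain Shift-And
            for e in range(1, k + 1):
                cur = ((R[e] << 1) | 1) & Bc          # match a character
                cur |= old | (old << 1)               # insertion / substitution from row e-1
                cur |= new[e - 1] << 1                # deletion: uses the new row e-1 (epsilon)
                cur |= 1                              # the empty prefix always matches
                old = R[e]
                new.append(cur & mask)
            R = new
            if R[k] & (1 << (m - 1)):
                matches.append(j)                     # occurrence with <= k errors ends at j
        return matches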
The great flexibility obtained encouraged the authors to build the software Agrep [WM92b] (available at ftp.cs.arizona.edu), where all these capabilities are implemented (although some particular cases are solved in a different manner). This software has been taken as a reference in all the subsequent research.
Baeza-Yates and Navarro 1996 In 1996, Baeza-Yates and Navarro presented a new bit-parallel
algorithm able to parallelize the computation of the automaton even more [BYN99]. The classical dynamic programming algorithm can be thought of as a column-wise \parallelization" of the
automaton [BY96], and Wu and Manber [WM92a] proposed a row-wise parallelization. Neither
algorithm was able to increase the parallelism (even if all the NFA states fit in a computer word) because of the ε-transitions of the automaton, which cause what we call zero-time dependencies.
That is, the current values of two rows or two columns depend on each other, and hence cannot be
computed in parallel.
In [BYN99] the bit-parallel formula for a diagonal parallelization was found. They packed the states of the automaton along diagonals instead of rows or columns, which run in the same direction as the diagonal arrows (notice that this is totally different from the diagonals of the dynamic programming matrix). This idea had been mentioned much earlier by Baeza-Yates [BY91], but no bit-parallel formula had been found. There are m − k + 1 complete diagonals (the others are not really necessary), which are numbered from 0 to m − k. The number D_i is the row of the first active state in diagonal i (all the subsequent states in the diagonal are active because of the ε-transitions). The
new D'_i values after reading text position j are computed from the old values of the neighboring diagonals, using a constant number of arithmetic and bit operations per computer word. The resulting algorithm is O(⌈k(m − k)/w⌉n) worst case time, and O(⌈k²/w⌉n) on average, since the Ukkonen cut-off heuristic is used
(see Section 5.3). The scheme can handle classes of characters, wild cards and different integral
costs in the edit operations.
Wright 1994 In 1994, Wright [Wri94] presented a bit-parallel algorithm that parallelized the dynamic programming matrix instead of the automaton. The idea was to consider secondary diagonals (i.e. those that run from the upper-right to the bottom-left) of the matrix. The main observation is that the cells along such diagonals can be packed in a computer word and updated together with a few arithmetic operations, which yields an O(mn log(σ)/w) time algorithm.
Myers 1998 In 1998, Myers [Mye98] found a better way to parallelize the computation of the dynamic programming matrix. He represented the differences along columns instead of the columns themselves, so that two bits per cell were enough (in fact this algorithm can be seen as the bit-parallel implementation of the automaton which is made deterministic in [WMM96], see Section 6.2). A new recurrence is found where the cells of the dynamic programming matrix are expressed using horizontal and vertical differences, i.e. v_{i,j} = C_{i,j} − C_{i−1,j} and h_{i,j} = C_{i,j} − C_{i,j−1}:

    v_{i,j} = min(−Eq_{i,j}, v_{i,j−1}, h_{i−1,j}) + (1 − h_{i−1,j})
    h_{i,j} = min(−Eq_{i,j}, v_{i,j−1}, h_{i−1,j}) + (1 − v_{i,j−1})

where Eq_{i,j} is 1 if P_i = T_j and zero otherwise. The idea is to keep packed binary vectors representing the current (i.e. j-th) values of the differences, and to find a way to update the vectors in a single operation. Each cell C_{i,j} is seen as a small processor that receives inputs v_{i,j−1}, h_{i−1,j} and Eq_{i,j} and produces outputs v_{i,j} and h_{i,j}. There are 3 × 3 × 2 = 18 possible inputs, and a simple formula is found to express the cell logic (unlike [Wri94], the approach is logical rather than arithmetical). The hard part is to parallelize the work along the column, because of the zero-time dependency problem. The author finds a solution which, although a very different model is used, is very similar to that of [BYN99].
The result is an algorithm that makes better use of the bits of the computer word, with a worst case of O(⌈m/w⌉n) and an average case of O(⌈k/w⌉n), since it uses the Ukkonen cut-off (Section 5.3). The update formula is a little more complex than that of [BYN99] and hence the algorithm is a bit slower, but it adapts better to longer patterns because fewer computer words are needed.
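The scheme is compact enough to sketch. The following Python code (ours, assuming the pattern fits in a single word; Pv/Mv encode the vertical +1/−1 deltas and Ph/Mh the horizontal ones, following common presentations of this algorithm) reports end positions with edit distance at most k.

    def myers_search(pattern, text, k):
        """Bit-parallel DP of the vertical/horizontal differences (a sketch for m <= word size)."""
        m = len(pattern)
        Peq = {}
        for i, c in enumerate(pattern):
            Peq[c] = Peq.get(c, 0) | (1 << i)
        mask = (1 << m) - 1
        msb = 1 << (m - 1)
        Pv, Mv, score = mask, 0, m           # column 0 of the matrix: C[i][0] = i
        matches = []
        for j, c in enumerate(text):
            Eq = Peq.get(c, 0)
            Xv = Eq | Mv
            Xh = ((((Eq & Pv) + Pv) & mask) ^ Pv) | Eq
            Ph = Mv | (~(Xh | Pv) & mask)    # horizontal deltas of the new column
            Mh = Pv & Xh
            if Ph & msb:
                score += 1                   # score tracks C[m][j]
            elif Mh & msb:
                score -= 1
            Ph = (Ph << 1) & mask            # shift so that row i reads the delta of row i-1
            Mh = (Mh << 1) & mask            # (the fresh 0 encodes the always-zero delta of row 0)
            Pv = Mh | (~(Xv | Ph) & mask)    # new vertical deltas
            Mv = Ph & Xv
            if score <= k:
                matches.append(j)
        return matches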
As it is difficult to improve over O(kn) algorithms, this algorithm may be the last word with respect to the asymptotic efficiency of parallelization, except for the possibility of parallelizing an O(kn) worst case algorithm. As it is now common to expect in bit-parallel algorithms, this scheme is able to search some extended patterns as well, but it seems difficult to adapt it to other distance functions.
8 Filtering Algorithms
Our last category is quite young, starting in 1990 and still very active. It is formed by algorithms
that lter the text, quickly discarding text areas which cannot match. Filtering algorithms address
only the average case, and their major interest is the potential for algorithms that do not inspect
all text characters. The major theoretical achievement is an algorithm with average cost O(n(k +
log m)=m), which has been proven optimal. In practice, ltering algorithms are the fastest too.
All of them, however, are limited in their applicability by the error level . Moreover, they need a
non-lter algorithm to check the potential matches.
We first explain the general concept and then consider the developments that have occurred in
this area. See Figure 18.
Filtering is based on the fact that it may be much easier to tell that a text position does not match
than to tell that it matches. For instance, if neither "sur" nor "vey" appear in a text area, then
"survey" cannot be found there with one error under the edit distance. This is because a single
edit operation cannot alter both halves of the pattern.
Most filtering algorithms take advantage of this fact by searching pieces of the pattern without errors. Since the exact searching algorithms can be much faster than approximate searching ones, filtering algorithms can be very competitive (in fact, they dominate on a large range of parameters).
It is important to notice that a filtering algorithm is normally unable to discover the matching text positions by itself. Rather, it is used to discard (hopefully large) areas of the text which cannot contain a match. For instance, in our example, it is necessary that either "sur" or "vey" appears in an approximate occurrence, but it is not sufficient. Any filtering algorithm must be coupled with a process that verifies all those text positions that could not be discarded by the filter.
Virtually any non-filtering algorithm can be used for this verification, and in many cases the developers of a filtering algorithm do not bother looking for the best verification algorithm, but just use the dynamic programming algorithm. That selection is normally independent, but the verification algorithm must behave well on short texts because it can be started at many different text positions to work on small text areas. By careful programming it is almost always possible to keep the worst-case behavior of the verifying algorithm (i.e. avoid verifying overlapping areas).
Finally, the performance of filtering algorithms is very sensitive to the error level α. Most filters work very well at low error levels and very badly otherwise. This is related to the amount of text
[Figure 18 (taxonomy of filtering algorithms), divided into filters for moderate patterns ([TU93] Horspool-like filter; [BYN99] partitioning into fewer errors, and superimposition; [NBY98a] hierarchical verification; [JTU96, Nav97a] counting filter; [NR98b] suffix automata) and filters for long patterns ([CL94] LET and SET; [Ukk92] generalization to q-grams; [Tak94] text h-samples; [ST95] many h-samples; [Shi96] k + s pieces; [NBY98c] hierarchical verification; [CM94] optimal algorithm and lower bound; [GKHO97] dynamic filtering), with their time complexities and the error levels α up to which they work.]
Tarhio and Ukkonen 1990 In 1990, Tarhio and Ukkonen [TU93] presented an algorithm that used Boyer-Moore-Horspool techniques [BM77, Hor80] to filter the text. The idea is to align the pattern with a text window and scan the text backwards. The scanning ends where more than k "bad" text characters are found. A "bad" character is one that not only does not match the pattern position it is aligned with, but also does not match any pattern character at a distance of k characters or less. More formally, assume that the window starts at text position j + 1, and therefore T_{j+i} is aligned with P_i. Then T_{j+i} is bad when Bad(i, T_{j+i}) holds, where Bad(i, c) has been precomputed as c ∉ {P_{i−k}, P_{i−k+1}, ..., P_i, ..., P_{i+k}}.
The idea of the bad characters is that we know for sure that we have to make an error to match
them, i.e. they will not match as a byproduct of inserting or deleting other characters. When more
than k characters that are errors for sure are found, the current text window can be abandoned and
shifted forward. If, on the other hand, the beginning of the window is reached, the area T_{j+1−k..j+m} must be checked with a classical algorithm.
To know how much we can shift the window, the authors show that there is no point in shifting P to a new position j' where none of the k + 1 text characters at the end of the current window (T_{j+m−k}, ..., T_{j+m}) matches the corresponding character of P, i.e. where T_{j+m−r} ≠ P_{m−r−(j'−j)}. If those differences are fixed with replacements, we make k + 1 errors; and if they can be fixed with fewer than k + 1 operations, it is because we aligned some of the involved pattern and text characters using insertions and deletions. In this case, we would have obtained the same effect by aligning the matching characters from the start.
So, for each pattern position i ∈ {m − k..m} and each text character a that could be aligned to position i (i.e. for all a ∈ Σ), the shift to align a in the pattern is precomputed: Shift(i, a) = min_{s>0} {s : P_{i−s} = a} (or m if no such s exists). Later, the shift for the window is computed as min_{i ∈ m−k..m} Shift(i, T_{j+i}). This last minimum is computed together with the backward window traversal.
The analysis in [TU93] shows that the search time is O(kn(k/σ + 1/(m − k))), without considering verifications. In the Appendix we show that the amount of verification is negligible for α < e^{−(2k+1)/σ}. The analysis is valid for m ≫ σ > k, so we can simplify the search time to O(k²n/σ). The algorithm is competitive in practice for low error levels. Interestingly, the version k = 0 corresponds exactly to the Horspool algorithm [Hor80]. Like Horspool, it does not take proper advantage of very long patterns. The algorithm can probably be adapted to other simple distance functions if we define k as the minimum number of errors needed to reject a string.
Jokinen, Tarhio and Ukkonen 1991 In 1991, Jokinen, Tarhio and Ukkonen [JTU96] adapted
a previous filter for the k-mismatches problem [GL89]. The filter is based on the simple fact that inside any match with at most k errors there must be at least m − k letters belonging to the pattern. The filter does not care about the order of those letters. This is a simple version of [CL94] (see Section 8.3), with less filtering efficiency but simpler implementation.
The search algorithm slides a window of length m over the text and keeps count of the number of window characters that belong to the pattern. This is easily done with a table that, for each
character a stores a counter of a's in the pattern that have not yet been seen in the text window. The counter is decremented when an a enters the window and incremented when it leaves the window. Each time a positive counter is decremented, the window character is considered as belonging to the pattern. When there are m − k such characters, the area is verified with a classical algorithm.
The algorithm was analyzed by Navarro in 1997 [Nav97a] using a model of urns and balls. He shows that the algorithm is O(n) time for α < e^{−m/σ}. Some possible extensions are studied in [Nav98a].
The resulting algorithm is competitive in practice for short patterns, but it worsens for long
ones. It is simple to adapt to other distance functions, just by determining how many characters
must match in an approximate occurrence.
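A Python sketch of this counting filter follows (ours; hypothetical names). It maintains, per character, how many of its pattern occurrences are still unaccounted for in the window, and reports candidate window starts that must then be verified by a non-filter algorithm.

    from collections import Counter

    def counting_filter(pattern, text, k):
        """Report window starts with at least m-k characters 'belonging' to the pattern."""
        m = len(pattern)
        need = Counter(pattern)      # pattern characters not yet accounted for in the window
        count = 0                    # window characters currently credited to the pattern
        candidates = []
        for j, c in enumerate(text):
            if need[c] > 0:          # this character is still missing from the window
                count += 1
            need[c] -= 1
            if j >= m:               # the character leaving the window gives its credit back
                d = text[j - m]
                need[d] += 1
                if need[d] > 0:
                    count -= 1
            if j >= m - 1 and count >= m - k:
                candidates.append(j - m + 1)   # verify this area with a classical algorithm
        return candidates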
Wu and Manber 1992 In 1992, a very simple filter was proposed by Wu and Manber [WM92a] (among many other ideas of that work). The basic idea is in fact very old [Riv76]: if a pattern is cut in k + 1 pieces, then at least one of the pieces must appear unchanged in an approximate occurrence. This is evident, since k errors cannot alter all the k + 1 pieces. The proposal was then to split the pattern in k + 1 pieces of approximately equal length, search the pieces in the text, and check the neighborhood of their matches (of length m + 2k). They used an extension of Shift-Or [BYG92] to search all the pieces simultaneously in O(mn/w) time. In the same year 1992, Baeza-Yates and Perleberg [BYP96] suggested better algorithms for the multipattern search: an Aho-Corasick machine [AC75] to guarantee O(n) search time (excluding verifications), or Commentz-Walter [CW79].
Only in 1996 was the improvement actually implemented [BYN99], by adapting the Boyer-Moore-Sunday algorithm [Sun90] to multipattern search (using a trie of patterns and a pessimistic shift table). The resulting algorithm is surprisingly fast in practice for low error levels.
There is no closed expression for the average case cost of this algorithm [BYR90], but we show in the Appendix that a gross approximation is O(kn log_σ(m)/m). Two independent proofs in [BYN99, BYP96] show that the cost of the search dominates for α < 1/(3 log_σ m). A simple way to see it is to consider that checking a text area costs O(m²) and is done whenever any of the k + 1 pieces of length m/(k + 1) matches, which happens with probability near k/σ^{1/α}. The result follows from requiring the average verification cost to be O(1).
This filter can be adapted, with some care, to other distance functions. The main issue is to determine how many pieces an edit operation can destroy and how many edit operations can be
made before surpassing the error threshold. For example, a transposition can destroy two pieces in
one operation, so we would need to split the pattern in 2k +1 pieces to ensure that one is unaltered.
A more clever solution for this case is to leave a hole of one character between each pair of pieces,
so the transposition cannot alter both.
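For concreteness, the following Python sketch (ours) implements the basic k + 1 pieces filter, using str.find as a stand-in for the exact multipattern search (the actual proposals use Shift-Or, Aho-Corasick or a Sunday-like algorithm). It assumes k + 1 ≤ m and returns the left ends of the text areas that must be verified.

    def pieces_filter(pattern, text, k):
        """Split the pattern into k+1 pieces; some piece appears unchanged in any occurrence."""
        m = len(pattern)
        piece_len = m // (k + 1)                      # assumes k + 1 <= m
        candidates = set()
        for p in range(k + 1):
            start = p * piece_len
            end = start + piece_len if p < k else m   # the last piece takes the remainder
            piece = pattern[start:end]
            pos = text.find(piece)
            while pos != -1:
                candidates.add(max(0, pos - start - k))   # left end of the area to verify
                pos = text.find(piece, pos + 1)
        return sorted(candidates)   # each area text[c : c + m + 2k] is checked by a non-filter algorithm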
Baeza-Yates and Navarro 1996 The bit-parallel algorithms presented in Section 7 [BYN99]
were also the basis for novel filtering techniques. As the basic algorithm is limited to short patterns,
they split longer patterns in j parts, making them short enough to be searchable with the basic
bit-parallel automaton (using one computer word).
The method is based on a more general version of the partition into k + 1 pieces [Mye94a,
BYN99]. For any j, if we cut the pattern in j pieces, then at least one of them appears with ⌊k/j⌋ errors in any occurrence of the pattern. This is clear because if each piece needs more than k/j
errors to match, then the complete match needs more than k errors.
Hence, the pattern was split in j pieces (of length m/j) which were searched with k/j errors using the basic algorithm. Each time a piece was found, the neighborhood was verified to check for the complete pattern. Notice that the error level α for the pieces is kept unchanged.
The resulting algorithm is O(n√(mk/σ)/w) on average. Its maximum α value is 1 − e m^{O(1/√w)}/√σ, smaller than 1 − e/√σ and worsening as m grows. This may be surprising since the error level is the same for the subproblems. The reason is that the verification cost is still O(m²), but the matching probability for the pieces decreases exponentially with m/j rather than with m, and hence is much larger (see Section 4).
In 1997, the technique was enriched with \superimposition" [BYN99]. The idea is to avoid
performing one separate search for each piece of the pattern. A multipattern approximate searching
is designed using the ability of bit-parallelism to search for classes of characters. Assume that we
want to search "survey" and "secret". We search the pattern "s[ue][rc][vr]e[yt]", where
[ab] means fa; bg. In the NFA of Figure 14, the horizontal arrows are traversable by more than
one letter. Clearly any match of each of the two patterns is also a match of the superimposed
pattern, but not vice versa (e.g. "servet" matches with zero errors). So the filter is weakened, but the search is made faster. Superimposition allowed lowering the average search time to O(n) for α < 1 − e m^{O(1/√w)}/√(σw), and to O(n√(mk/(σw))) for the maximum α of the 1996 version.
By using a j value smaller than the one necessary to make the automata fit in single machine words, an intermediate scheme was obtained that adapts smoothly to higher error levels. The algorithm is O(kn log(m)/w) for α < 1 − e/√σ.
Navarro and Baeza-Yates 1998 The final twist to the previous scheme was the introduction of "hierarchical verification" in 1998 [NBY98a]. For simplicity assume that the pattern is partitioned in j = 2^r pieces, although the technique is general. The pattern is split in two halves, each one to be searched with ⌊k/2⌋ errors. Each half is recursively split in two and so on, until the pattern is short enough to make its NFA fit in a computer word (see Figure 19). The leaves of this tree are the pieces actually searched. When a leaf finds a match, instead of checking the whole pattern as in the previous technique, its parent is checked (in a small area around the piece that matched). If the parent is not found, the verification stops; otherwise it continues with the grandparent until the root (i.e. the whole pattern) is found. This is correct because the partitioning scheme applies to each level of the tree: the grandparent cannot appear if none of its children appear, even if a grandchild appeared.
Figure 19 shows an example. If one searches the pattern "aaabbbcccddd" with four errors in the text "xxxbbxxxxxxx", and splits the pattern in four pieces to be searched with one error, the piece "bbb" will be found in the text. In the original approach, one would verify the complete pattern in the text area, while with the new approach one verifies only its parent "aaabbb" and immediately determines that there cannot be a complete match.
An orthogonal hierarchical verification technique is also presented in [NBY98a] to include superimposition in this scheme. If the superimposition of 4 patterns matches, the set is split in two
sets of two patterns each, and it is checked whether some of them match instead of verifying all
the 4 patterns one by one.
The analysis in [Nav98a, NBY98a] shows that the average verification cost drops to O((m/j)²). Only now does the problem scale well (i.e. a verification probability that decreases exponentially with m/j, and an O((m/j)²) verification cost).

[Figure 19 shows the tree: the root "aaabbbcccddd", its children "aaabbb" and "cccddd", and the leaves "aaa", "bbb", "ccc" and "ddd".]
Figure 19: The hierarchical verification method for a pattern split in 4 parts. The boxes (leaves) are the elements which are really searched, and the root represents the whole pattern. At least one pattern at each level must match in any occurrence of the complete pattern. If the bold box is found, all the bold lines may be verified.
With hierarchical verification, the verification cost remains negligible for α < 1 − e/√σ. All the simple extensions of bit-parallel algorithms apply, although the partition into j pieces may need some redesign for other distances. Notice that it is very difficult to break the barrier of α = 1 − e/√σ for any filter because, as shown in Section 4, there are too many real matches, and even the best filters must check the real matches.
Also in 1998, the same authors [NBY98c, Nav98a] added hierarchical verification to the filter that splits the pattern in k + 1 pieces and searches them with zero errors. The analysis shows that with this technique the verification cost does not dominate the search time for α < 1/ log_σ m. The resulting filter is the fastest for most cases of interest.
Navarro and Raffinot 1998 In 1998, Navarro and Raffinot [NR98b, Nav98a] presented a novel approach based on suffix automata (see Section 3.2). They adapted an exact string matching algorithm, BDM, to allow errors.
The idea of the original BDM algorithm is as follows [CCG+94, CR94]. The deterministic suffix automaton of the reverse pattern is built, so it recognizes the reverse prefixes of the pattern. Then the pattern is aligned with a text window, and the window is scanned backwards with the automaton (this is why the pattern is reversed). The automaton is active as long as what it has read is a substring of the pattern. Each time the automaton reaches a final state, it has seen a pattern prefix, so we remember the last time it happened. If the automaton arrives at the beginning of the window with active states, then the pattern has been found; otherwise what is there is not a substring of the pattern and hence the pattern cannot be in the window. In any case, the last window position that matched a pattern prefix gives the next initial window position. The algorithm BNDM [NR98a] is a bit-parallel implementation (using the nondeterministic suffix automaton, see Figure 3) which is much faster in practice and allows searching for classes of characters, etc.
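Since the approximate variant builds directly on it, a Python sketch of the exact BNDM algorithm (ours; bits follow the reversed pattern, and Python integers stand in for the machine word) may help fix ideas before the modification is described.

    def bndm_search(pattern, text):
        """Exact BNDM: scan each window backwards with the nondeterministic suffix automaton."""
        m, n = len(pattern), len(text)
        B = {}
        for i, c in enumerate(pattern):           # pattern position i -> bit m-1-i
            B[c] = B.get(c, 0) | (1 << (m - 1 - i))
        occurrences = []
        pos = 0
        while pos <= n - m:
            j, last = m, m
            D = (1 << m) - 1                      # every pattern factor is still possible
            while D != 0:
                D &= B.get(text[pos + j - 1], 0)  # read the window backwards
                j -= 1
                if D & (1 << (m - 1)):            # the characters read form a pattern prefix
                    if j > 0:
                        last = j                  # remember it: the next window starts here
                    else:
                        occurrences.append(pos)   # a prefix of length m is an occurrence
                        break
                D <<= 1
            pos += last
        return occurrences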
The modification of [NR98b, Nav98a] is to build an NFA to search the reversed pattern allowing errors, modify it to match any pattern suffix, and apply essentially the same BNDM algorithm using this automaton. Figure 20 shows the resulting automaton.
This automaton recognizes any reverse prefix of P allowing k errors. The window is abandoned when no pattern substring matches what was read with k errors. The window is shifted to the next pattern prefix found with k errors. The matches must start exactly at the initial window position. The window length is m − k, not m, to ensure that if there is an occurrence starting at
the window position, then a substring of the pattern occurs in any suffix of the window (so we do not abandon the window before reaching the occurrence). Reaching the beginning of the window does not guarantee a match, however, so we have to check the area by computing edit distance from the beginning of the window (at most m + k text characters).

Figure 20: The construction to search any reverse prefix of "survey" allowing 2 errors.
In the Appendix it is shown that the average complexity is O(n(α + log_σ(m)/m)/((1 − α)√σ − α)), and that the filter works well for α < (1 − e/√σ)/(2 − e/√σ), which for large alphabets tends to 1/2. The result is competitive for low error levels, but the pattern cannot be very long because of the bit-parallel implementation. Notice that trying to do this with the deterministic BDM would have generated a very complex construction, while the algorithm is simple with the nondeterministic automaton. Moreover, a deterministic automaton would have too many states, just as in Section 6.2. All the simple extensions of bit-parallelism apply, provided the window length m − k is carefully reconsidered.
Chang and Lawler 1990 In 1990, Chang and Lawler [CL94] presented two filtering algorithms (best analyzed in [GKHO97]). The first one, called LET (for "linear expected time"), works as follows: the text is traversed linearly, and at each point the longest pattern substring that matches the text is maintained. When the substring cannot be further extended, it starts again from the current text position. Figure 21 illustrates.
The crucial observation is that, if fewer than m − k text characters have been covered by concatenating k longest substrings, then the text area does not match the pattern. This is evident since a match is formed by k + 1 correct strokes (recall Section 5.2) separated by k errors. Moreover, the strokes need to be ordered, which is not required by the filter.
Figure 21: Algorithms LET and SET. LET covers all the text with pattern substrings, while SET works only at block beginnings and stops when it finds k differences.
The algorithm uses a suffix tree on the pattern to determine in a linear pass the longest pattern substring matching the text seen up to now. Notice that the article is from 1990, the same year when Ukkonen and Wood did the same with a suffix automaton [UW93] (see Section 5.2). Therefore, the filtering takes O(n) time. The authors use [LV89] as the verifying algorithm, and therefore the worst case is O(kn). The authors show that the filtering time dominates for α < 1/ log_σ m + O(1). The constants are involved, but practical figures are 0.35 for σ = 64, or 0.15 for σ = 4.
The second algorithm presented was called SET (for "sublinear expected time"). The idea is similar to LET, except that the text is split in fixed blocks of size (m − k)/2, and the check for k contiguous strokes starts only at block boundaries. Since the shortest match is of length m − k, one of these blocks is always completely contained in a match. If one is able to discard the block, no occurrence can contain it. This is also illustrated in Figure 21.
The sublinearity is clear once it is proven that a block is discarded in O(k log_σ m) comparisons on average. Since 2n/(m − k) blocks are considered, the average time is O(αn log_σ(m)/(1 − α)). The maximum α level stays the same as in LET, so the complexity can be simplified to O(n log_σ m).
Although the proof that limits the comparisons per block is quite involved, it is not hard to see intuitively why it is true: the probability of finding in the pattern a stroke of length ℓ is limited by m/σ^ℓ, and the detailed proof shows that ℓ = log_σ m is on average the longest stroke found. This contrasts with the result of [Mye86a] (Section 5.3), which shows that k strokes add up to O(k) length. The difference is that here we can take the strokes from anywhere in the pattern.
Both LET and SET are effective for very long patterns only, since their overhead does not pay off on short patterns. Different distance functions can be accommodated after re-reasoning the adequate k values.
Ukkonen 1992 In 1992, Ukkonen [Ukk92] independently rediscovered some of the ideas of Chang and Lawler. He presented two filtering algorithms, one of which (based on what he called "maximal matches") is similar to LET of [CL94] (in fact Ukkonen presents it as a new "block distance" computable in linear time and shows that it serves as a filter for the edit distance). The other filter is the first reference to "q-grams" for online searching (there are much older ones in indexed searching [Ull77]).
A q-gram is a substring of length q. A filter was proposed based on counting the number of q-grams shared between the pattern and a text window (this is presented in terms of a new "q-gram distance" which may be of interest on its own). A pattern of length m has (m − q + 1) overlapping q-grams. Each error can alter q q-grams of the pattern, and therefore (m − q + 1 − kq) pattern q-grams must appear in any occurrence. Figure 22 illustrates.
Figure 22: Q-gram algorithm. The left one [Ukk92] counts the number of pattern q -grams in a text
window. The right one [ST95] finds sequences of pattern q -grams in approximately the same text
positions (we have put in bold a text sample and the possible q -grams to match it).
Notice that this is a generalization of the counting filter of [JTU96] (Section 8.2), which would correspond to q = 1. The search algorithm is similar as well, although of course keeping a table with a counter for each of the σ^q possible q-grams is impractical (especially because only m − q + 1 of them are present). Ukkonen uses a suffix tree to keep count in linear time of the last q-gram seen (the relevant information can be attached to the m − q + 1 important nodes at depth q in the suffix tree).
The filter therefore takes linear time. There is no analysis to show which is the maximum error level tolerated by the filter, so we attempt a gross analysis in the Appendix, valid for large m. The result is that the filter works well for α < O(1/ log_σ m), and that the optimal q to obtain it is q = log_σ m. The search algorithm is more complicated than that of [JTU96]. Therefore, using larger q values only pays off for larger patterns. Different distance functions are easily accommodated by recomputing the number of q-grams that must be preserved in any occurrence.
Takaoka 1994 In 1994, Takaoka [Tak94] presented a simplification of [CL94]. He considered h-samples of the text (which are non-overlapping q-grams of the text taken every h characters, for h ≥ q). The idea was that if one h-sample was found in the pattern, then a neighborhood of the area was verified.
By using h = ⌊(m − k − q + 1)/(k + 1)⌋ one cannot miss a match. The easiest way to see this is to start with k = 0. Clearly we need h = m − q + 1 so as not to lose any matches. For larger k, recall that if the pattern is split in k + 1 pieces, some of them must appear with no errors. The filter divides h by k + 1 to ensure that any occurrence of those pieces will be found (we are assuming q < m/(k + 1)).
Using a suffix tree of the pattern, the h-sample can be found in O(q) time, and therefore the filtering time is O(qn/h), which is O(αn log_σ(m)/(1 − α)) if the optimal q = log_σ m is used. The error level is again α < O(1/ log_σ m), which makes the time O(n log_σ m).
Chang and Marr 1994 It looks like O(n log_σ m) is the best complexity achievable by using filters, and that it will work only for α = O(1/ log_σ m), but in 1994 Chang and Marr obtained an algorithm which was

    O(n (k + log_σ m) / m)

for α smaller than a threshold that depends only on σ and tends to 1 − e/√σ for very large σ. At the same time, they proved that this was a lower bound for the average complexity of the problem (and therefore their algorithm was optimal on average). This is a major theoretical breakthrough.
The lower bound is obtained by taking the maximum (or sum) of two simple facts: the first one is the O(n log_σ(m)/m) bound of [Yao79] for exact string matching, and the second one is the obvious fact that in order to discard a block of m text characters, at least k characters should be examined to find the k errors (and hence O(kn/m) is a lower bound). Also, the maximum error level is optimal according to Section 4. What is impressive is that an algorithm with such complexity was found.
The algorithm is a variation of SET [CL94]. It uses polynomial space in m, i.e. O(m^t) space for some constant t which depends on σ. It is based on splitting the text in contiguous substrings of length ℓ = t log_σ m. Instead of finding in the pattern the longest exact matches starting at the beginning of blocks of size (m − k)/2, it searches the text substrings of length ℓ in the pattern allowing errors.
The algorithm proceeds as follows. The best matches allowing errors inside P are precomputed for every ℓ-tuple (hence the O(m^t) space). Starting at the beginning of the block, it searches consecutive ℓ-tuples in the pattern (each in O(ℓ) time), until the total number of errors made exceeds k. If by that time it has not yet covered m − k text characters, the block can be safely skipped.
The reason why this works is a simple extension of that for SET. We have found an area
contained in the possible occurrence which cannot be covered with k errors (even allowing the use
of unordered portions of the pattern for the match). The algorithm is only practical for very long
patterns, and can be extended for other distances with the same ideas of the other filtration and
q-gram methods.
It is interesting to notice that 1 − e/√σ is the limit we have discussed in Section 4, which is a firm barrier for any filtering mechanism. Chang and Lawler proved an asymptotic result, while a general bound is proved in [BYN99]. The filters of [CM94, NBY98a] reduce the problem to fewer errors instead of to zero errors. An interesting observation is that it seems that all the filters that partition the problem into exact searching can be applied for α = O(1/ log_σ m), and that in order to improve this to 1 − e/√σ we must partition the problem into (smaller) approximate searching subproblems.
Sutinen and Tarhio 1995 Sutinen and Tarhio generalized the Takaoka filter in 1995 [ST95], improving its filtering efficiency. This is the first filter that takes into account the relative positions of the pattern pieces that match in the text (all the previous ones matched pieces of the pattern in any order). The generalization is to force s q-grams of the pattern to match (not just one). The pieces must conserve their relative ordering in the pattern and must not be more than k characters away from their correct position (otherwise we need to make more than k errors to use them). This
method is also illustrated in Figure 22.
In this case, the sampling step is reduced to h = ⌊(m − k − q + 1)/(k + s)⌋. The reason for this reduction is that, to ensure that s pieces of the pattern match, we need to cut the pattern in k + s pieces. To search the pieces while forcing them not to be too far away from their correct positions, the pattern is divided in k + s pieces and a hashed set is created for each piece. The set contains the q-grams of the piece and some neighboring ones too (because the sample can be slightly misaligned). At search time, instead of a single h-sample, they consider text windows of contiguous sequences of k + s h-samples. Each of these h-samples is searched for in the corresponding set, and if at least s are found the area is verified. This is a sort of Hamming distance, and the authors resort to an efficient algorithm for that distance [BYG92] to process the text.
The resulting algorithm is O(n log_σ m) on average using the optimal q = log_σ m, and works well for α < 1/ log_σ m. The algorithm is better suited for long patterns, although with s = 2 it can be reasonably applied to short ones as well. In fact the analysis is done only for s = 2 in [ST95].
Shi 1996 In 1996, Shi [Shi96] proposed to extend the idea of the k + 1 pieces (explained in Section 8.2) to k + s pieces, so that at least s pieces must match. This idea is implicit in the filter of Sutinen and Tarhio but had not been explicitly written down. Shi compared his filter against the simple one, finding that the filtering efficiency was improved. However, this improvement is noticeable only for long patterns. Moreover, the online searching efficiency is degraded because the pieces are shorter (which affects any Boyer-Moore-like search), and because the verification logic is more complex. No analysis is presented in the paper, but we conjecture that the optimum s is O(1) and therefore the same complexity and tolerance to errors are maintained.
Giegerich, Kurtz, Hischke and Ohlebusch 1996 Also in 1996, a general method to improve filters was developed [GKHO97]. The idea is to mix the phases of filtering and checking, so that the verification of a text area is abandoned as soon as the combined information from the filter (number of guaranteed differences left) and the verification in progress (number of actual differences seen) shows that a match is not possible. As they show, however, the improvement occurs in a very narrow range of α. This is a consequence of the statistics of this problem that we have discussed in Section 4.
9 Experiments
We perform in this section an empirical comparison among the algorithms described throughout this work. Our goal is to show which are the best options at hand depending on the case. Nearly 40 algorithms have been surveyed, some of them with no existing implementation and many of them already known to be impractical. To avoid an excessively lengthy comparison among algorithms known not to be competitive, we have left aside many of them.
A large group of excluded algorithms is that of the theoretical ones based on the dynamic programming matrix. We remark that all these algorithms, despite not being competitive in practice, represent (or represented at their time) a valuable contribution to the development of the algorithmic aspect of the problem. The dynamic programming algorithm [Sel80] is excluded because the cut-off heuristic of Ukkonen [Ukk85b] is well known to be faster (e.g. in [CL92] and in our internal
tests); the algorithm of [MP80] is argued in the same paper to be worse than dynamic programming
(which is quite bad) for n < 40 Gb; [LV88] has bad complexity and was improved later by many
others in theory and practice; [LV89] is implemented with a better LCA algorithm in [CL92] and
found too slow; [Mye86a] is considered slow in practice by the same author in [WMM96]; [GG88]
is clearly slower than [LV89]; [GP90], one of the fastest among the O(kn) worst case algorithms,
is shown to be extremely slow in [UW93, CL92, Wri94] and in internal tests done by ourselves;
[UW93] is shown to be slow in [JTU96]; the O(kn) algorithm implemented in [CL94] is in the same paper argued to be the fastest of the group and shown to be not competitive in practice; [SV97, CH98] are clearly theoretical: their complexities show that the patterns have to be very long and the error level too low for them to be of practical application. To give an idea of how slow "slow" is, we found [GP90]
10 times slower than Ukkonen's cut-off heuristic (a similar result is reported by Chang and Lampe [CL92]). Finally, other O(kn) average time algorithms are proposed in [Mye86a, GP90], and they are shown to be very similar to Ukkonen's cut-off [Ukk85b] in [CL92]. Since the cut-off heuristic is already not very competitive, we leave aside the other similar algorithms. Therefore, from the group based on dynamic programming we consider only the cut-off heuristic (mainly as a reference) and [CL92], which is the only one competitive in practice.
From the algorithms based on automata we consider the DFA algorithm [Ukk85b], but prefer its
lazy version implemented in [Nav97b], which is equally fast for small automata and much faster for
large automata. We also consider the Four Russians algorithm of [WMM96]. From the bit-parallel algorithms we consider [WM92a, BYN99, Mye98], leaving aside [Wri94]. As shown in the 1996 version of [BYN99], the algorithm of [Wri94] was competitive only on binary text, and this was shown not to hold anymore in [Mye98].
From the filtering algorithms, we have included [TU93]; the counting filter proposed in [JTU96] (as simplified in [Nav97a]); the algorithm of [NR98b, Nav98a]; and those of [ST95] and [Tak94] (this last one seen as the case s = 1 of [ST95], since this implementation worked better). We have also included the filters proposed in [BYN99, NBY98a, Nav98a], preferring to present only the last version, which incorporates all the twists of superimposition, hierarchical verification and mixed partitioning. Many previous versions are outperformed by this one. We have also included the best version of the filters that partition the pattern in k + 1 pieces, namely the one incorporating hierarchical verification [NBY98c, Nav98a]. In those publications it is shown that this version clearly outperforms the previous ones proposed in [WM92a, BYP96, BYN99]. Finally, we discard some filters [CL94, Ukk92, CM94, Shi96] which are applicable only to very long patterns, since this case is excluded from our experiments as explained shortly. Some comparisons among them were carried out by Chang and Lampe [CL92], showing that LET is equivalent to the cut-off algorithm with k = 20, and that the time for SET is 2 times that of LET. LET was shown to be the fastest with patterns hundreds of letters long and a few errors in [JTU96], but we recall that many modern filters were not included in that comparison.
We list now the algorithms included and the relevant comments on them. All the algorithms
implemented by ourselves represent our best coding effort and have been found to be similar to or faster than other implementations found elsewhere. The implementations coming from other authors were
checked with the same standards and in some cases their code was improved with better register
usage and I/O management.
We made our best effort to uniformize the algorithms. The I/O is the same in all cases: the
text is read in chunks of 64 Kb to improve locality (this is the optimum in our machine) and care is
taken to not lose or repeat matches in the borders; open is used instead of fopen because this last
one is slower. We also uniformize internal conventions: only a final special character (zero) is used at the end of the buffer to help algorithms recognize it; and only the number of matches found is
reported.
We separate in the experiments the filtering and non-filtering algorithms. This is because the filters can in general use any non-filter to check for potential matches, so the best algorithm is formed by a combination of both. All the filtering algorithms in the experiments use the cut-off algorithm [Ukk85b] as their verification engine, except for BPP (whose very essence is to switch smoothly to BPD) and BND (which uses a reverse BPR to search in the window and a forward BPR for the verifications).
Apart from the algorithms to be included and their details, we describe our experimental setup.
We measure CPU times and show the results in tenths of seconds per megabyte. Our machine is a
Sun UltraSparc-1 of 167 MHz with 64 Mb of main memory; we run Solaris 2.5.1 and the texts are on a local disk of 2 Gb. Our experiments were run on texts of 10 Mb and repeated 20 times (with different search patterns). The same patterns are used for all the algorithms.
Considering the applications, we have selected three types of texts.
DNA This file is formed by concatenating the 1.34 Mb DNA chain of H. influenzae with itself
until obtaining 10 Mb. Lines are cut at 60 characters. The patterns are selected randomly
from the text, avoiding line breaks if possible. The alphabet size is 4, except for a few exceptions
along the file, and the results are similar to those on a random four-letter text.
Natural language This file is formed by 1.29 Mb of writings of B. Franklin, filtered to lower case
and with separators converted to a space (except line breaks, which are respected). This mimics
common Information Retrieval scenarios. The text is replicated to obtain 10 Mb, and search
patterns are randomly selected from the same text at word beginnings. The results on this
text are roughly equivalent to those on a random text over 15 characters.
Speech We obtained speech files from discussions on U.S. law from Indiana University, in
PCM format with 8 bits per sample. Of course the standard edit distance is of no use here,
since it would have to take into account the absolute values of the differences between two characters.
We simplified the problem in order to use the edit distance: we reduced the range of values to 64 by
quantization, therefore considering equal two samples that lie in the same range. We used
the first 10 Mb of the resulting file. The results are similar to those on a random text of 50
letters, although the file shows smooth changes from a letter to the next.
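The quantization step is straightforward; a sketch follows. The division by 4 (i.e. 256/64) and the
shift by one, which keeps the value zero free for the terminating character mentioned above, are our
assumptions, since the text only states that the range was reduced to 64 values.

    #include <stddef.h>

    /* Maps 8-bit PCM samples (0..255) into 64 ranges (1..64), so that two
       samples falling in the same range become the same character. */
    void quantize_samples(unsigned char *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)(buf[i] / 4 + 1);
    }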
We present results using different pattern lengths and error levels, in two flavors: we fix m and
show the effect of increasing k, or we fix α and show the effect of increasing m. A given algorithm
may not appear at all in a plot when its times are above the y range of interest or its restrictions
on m and k do not intersect with the x range of interest. In particular, filters are shown only for
α ≤ 1/2. We recall that in most applications the error levels of interest are low.
9.3 Results
Figure 23 shows the results for short patterns (m = 10) and varying k. Among the non-filtering
algorithms BPD is normally the fastest, up to 30% faster than the next one, BPM. The DFA is also quite
close in most cases. For k = 1, a specialized version of BPR is slightly faster than BPD (recall that
for k > 3 BPR starts to use a nonspecialized algorithm, hence the jump). An exception to this
situation occurs on DNA text, where for k = 4 and k = 5 BPD shows a nonmonotonic behavior and
BPM becomes the fastest. This behavior comes from its O(k(m − k)n/w) complexity, which in
texts with larger alphabets is not noticeable because the cut-off heuristic keeps the cost unchanged
(another reason for the nonmonotonic behavior is that integer round-off effects come into play).
Indeed, the behavior of BPD would have been totally stable had we chosen m = 9 instead of m = 10,
because the problem would fit in a computer word all the time. BPM, on the other hand, handles
much longer patterns while keeping such stability, although it takes up to 50% more time than BPD.
With respect to filters, we see that EXP is the fastest for low error levels. The value of "low"
increases for larger alphabets. At some point, BPP starts to dominate. BPP adapts smoothly to
higher error levels by slowly switching to BPD, so BPP is a good alternative for intermediate error
levels, from where EXP ceases to work until it should switch to BPD. However, this range is void
on DNA and English text for m = 10. Other filters competitive with EXP are BND and BMH. In
fact, BND is the fastest for k = 1 on DNA, although really no filter works very well in that case.
Finally, QG2 does not appear because it only worked for k = 1 and it was worse than QG1.
The best choice for short patterns seems to be using EXP while it works and switching to the
best bit-parallel algorithm for higher errors. Moreover, the verification algorithm for EXP should
be BPR or BPD (which are the fastest where EXP dominates).
Figure 24 shows the case of longer patterns (m = 30). Many of the observations are still valid
in this case. However, the algorithm BPM shows in this case its advantage over BPD, since the
whole problem still fits in a computer word for BPM and it does not for BPD. Hence in the left plots the
best algorithm is BPM, except for low k, where BPR or BPD are better. With respect to filters,
EXP or BND are the fastest depending on the alphabet, until a certain error level is reached. At
that point BPP becomes the fastest, in some cases still faster than BPM. Notice that for DNA a
specialized version of BND for k = 4 and even 5 could be the fastest choice.
In Figure 25 we consider the case of fixed α = 0.1 and growing m. The results for non-filtering
algorithms repeat somewhat: BPR is the best for k = 1 (i.e. m = 10), then BPD is the
best until a certain pattern length (which varies from 30 on DNA to 80 on speech), and finally BPM
becomes the fastest. Note that for such a low error level the number of active columns is quite small,
which permits algorithms like BPD and BPM to keep their good behavior for patterns much longer
than what they can handle in a single machine word. The DFA is also quite competitive until its
memory requirements become unreasonable.
The real change, however, is in the filters. In this case PEX becomes the star filter on English
and speech texts, by far unbeaten. The situation on DNA, on the other hand, is quite complex. For
m ≤ 30 BND is the fastest, and indeed an extended implementation of it allowing longer patterns
could keep being the fastest for a few more points. However, that case would have to handle four
errors, and only a specialized implementation for fixed k = 4, 5, etc. could keep a competitive
performance. We have determined that such specialized code was worthwhile up to k = 3 only.
When BND ceases to be applicable, PEX becomes the fastest algorithm and finally QG2 beats it
(for m ≥ 60).
Figure 23: Results for m = 10 and varying k. The left plots show non-filtering and the right plots
show filtering algorithms. Rows 1 to 3 show DNA, English and speech files, respectively.
Figure 24: Results for m = 30 and varying k. The left plots show non-filtering and the right plots
show filtering algorithms. Rows 1 to 3 show DNA, English and speech files, respectively.
However, notice that for m > 30 all the filters are beaten by BPM and therefore
make little sense (on DNA).
There is a final phenomenon that deserves mention with respect to filters. The algorithms QG1
and QG2 improve as m grows. These algorithms are the most practical, and the only ones we
tested, among the family of algorithms suitable for very long patterns. This shows that, although
these algorithms are not competitive in our tests (where m ≤ 100), they should be
considered in scenarios where the patterns are much longer and the error level stays very low. In
such a scenario, those algorithms would finally beat all the algorithms we are considering here.
The situation becomes worse for the filters when we consider α = 0.3 and varying m (Figure 26).
On DNA, no filter can beat the non-filtering algorithms, and among these the tricks to keep few
active columns do not work well. This favors the algorithms that pack more information per bit,
which makes BPM the best in all cases except for m = 10 (where BPD is better). The situation
is almost the same on English text, except that BPP works reasonably well and becomes quite
similar to BPM (the periods where each one dominates are interleaved). On speech, on the other
hand, the scenario is similar for non-filtering algorithms, but the PEX filter still beats them all, as
30% of errors is low enough on the speech files. Note in passing that the error level is too high for
QG1 and QG2, which can only be applied in a short range and yield bad results.
To give an idea of the areas where each algorithm dominates, Figure 27 shows the case
of English text. There is more information in Figure 27 than what can be inferred from the previous
plots, such as the area where RUS is better than BPM. We have shown the non-filtering algorithms
and superimposed in gray the area where the filters dominate. Therefore, in the grayed area the
best choice is to use the corresponding filter with the dominating non-filter as its verification
engine. In the non-grayed area it is better to use the dominating non-filtering algorithm directly,
with no filter.
A code implementing such a heuristic (including only EXP, BPD and BPP) is publicly available
from the Web page of the author (http://www.dcc.uchile.cl/gnavarro/pubcode). This combined
code is faster than each isolated algorithm, although of course it is not really a single algorithm
but the combination of the best choices.
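The following sketch conveys the flavor of such a combined heuristic. It is not the author's code:
the α thresholds are invented for the example (the real cut points depend on the text, the alphabet
size and the machine, as Figure 27 shows), and the word-fit test for BPD is only a rough rendering
of the condition that its automaton representation fits in a single machine word.

    typedef enum { ALG_EXP, ALG_BPP, ALG_BPD, ALG_BPM } alg_t;

    /* Illustrative dispatcher: a filter for low error levels, a smooth
       filter/non-filter hybrid for intermediate ones, and a bit-parallel
       non-filter otherwise.  All thresholds are hypothetical. */
    alg_t choose_algorithm(int m, int k, int word_bits)
    {
        double alpha = (double)k / m;
        if (alpha <= 0.2)                    /* hypothetical: filtering pays off   */
            return ALG_EXP;                  /* partition into k+1 exact pieces    */
        if (alpha <= 0.4)                    /* hypothetical intermediate zone     */
            return ALG_BPP;
        if ((m - k) * (k + 2) <= word_bits)  /* roughly: BPD automaton fits a word */
            return ALG_BPD;
        return ALG_BPM;
    }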
10 Conclusions
We reach the end of this tour on approximate string matching. Our goal has been to present and
explain the main ideas behind the existing algorithms, to classify them according to the
type of approach proposed, and to show how they perform in practice in a subset of the possible
practical scenarios. We have shown that the oldest approaches, based on the dynamic programming
matrix, yielded the most important theoretical developments, but in general the algorithms have
been improved by modern developments based on filtering and bit-parallelism. In particular, the
fastest algorithms combine a fast filter to discard most of the text with a fast non-filter algorithm
to check the potential matches.
We show some plots summarizing the contents of the survey. Figure 28 shows the historical order
in which the algorithms appeared in the different areas. Figure 29 shows a worst-case time/space
complexity plot for the non-filtering algorithms. Figure 30 considers filtration algorithms, showing
their average-case complexity and the maximum error level for which they work. Some practical
assumptions have been made to order the different functions of k, m, σ, w and n.
Figure 25: Results for α = 0.1 and varying m. The left plots show non-filtering and the right plots
show filtering algorithms. Rows 1 to 3 show DNA, English and speech files, respectively.
Figure 26: Results for α = 0.3 and varying m. The left plots show non-filtering and the right plots
show filtering algorithms. Rows 1 to 3 show DNA, English and speech files, respectively.
Figure 27: The areas where each algorithm is the best on English text (pattern length m from 10
to 100, error level α from 0 to 0.7), graying that of the filtering algorithms.
Approximate string matching is a very active research area and it should continue in that
status in the foreseeable future: strong genome projects in computational biology, the pressure for
oral human-machine communication and the heterogeneity and spelling errors present in textual
databases are just a sample of the reasons that are driving researchers to look for faster and more
flexible algorithms for approximate pattern matching.
It is interesting to point out which theoretical and practical questions are still open in the area.
A difficult open question is about the exact matching probability and average edit distance
between two random strings. We found a new bound in this survey, but the problem is still
open.
A worst-case lower bound for the problem is clearly Ω(n), but the only algorithms achieving
that time have space and preprocessing cost exponential in m or k. The only improvements to the
worst case with polynomial space complexity are the O(kn) algorithms and, for very small k,
O(n(1 + k^4/m)). Is it possible to improve the algorithms or to find a better lower bound for
this case?
The previous question also has a practical side: is it possible to find an algorithm which
is O(kn) in the worst case and efficient in practice? Using bit-parallelism there are good
practical algorithms that achieve O(kn/w) on average and O(mn/w) in the worst case.
The lower bound of the problem for the average case is known to be Ω(n(k + log_σ m)/m) and
there exists an algorithm achieving it, so from the theoretical point of view that problem is
closed. However, from the practical side we have that the algorithms approaching those limits
work well only for very long patterns, while a much simpler algorithm (EXP) is the best for
moderate and short patterns. Is it possible to find a unified approach, good in practice and
with that theoretical complexity?
Figure 28: Historical development of the area, 1980 to 1998: first algorithm, first filters, first
bit-parallel algorithms, the average-case lower bound, and the fastest practical algorithms, with the
corresponding references.
Another practical question on filtering algorithms is: is it possible in practice to improve over
the current best existing algorithms?
Acknowledgements
The author wishes to thank the many top researchers in this area for their willingness to exchange
ideas and/or share their implementations: Amihood Amir, Ricardo Baeza-Yates, William Chang,
Udi Manber, Gene Myers, Erkki Sutinen, Tadao Takaoka, Jorma Tarhio, Esko Ukkonen and Alden
Wright.
References
[AAL+97] A. Amir, Y. Aumann, G. Landau, M. Lewenstein, and N. Lewenstein. Pattern matching with swaps. In Proc. FOCS'97, pages 144-153, 1997.
[AC75] A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Comm. of the ACM, 18(6):333-340, 1975.
Figure 29: Worst case time and space complexity of non-filtering algorithms. We replaced w by
Θ(log n).
[ADKF75] V. Arlazarov, E. Dinic, M. Konrod, and I. Faradzev. On economic construction of the transitive closure of a directed graph. Soviet Mathematics Doklady, 11:1209-1210, 1975. Original in Russian in Doklady Akademii Nauk SSSR, v. 194, 1970.
[AG85] A. Apostolico and Z. Galil. Combinatorial Algorithms on Words. NATO ASI Series. Springer-Verlag, 1985.
[AG87] A. Apostolico and C. Guerra. The Longest Common Subsequence problem revisited. Algorithmica, 2:315-336, 1987.
[AHU74] A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[ALL97] A. Amir, M. Lewenstein, and N. Lewenstein. Pattern matching in hypertext. In Proc. WADS'97, LNCS 1272, pages 160-173. Springer-Verlag, 1997.
[ANZ97] M. Araujo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.
[Apo85] A. Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, pages 85-96. Springer-Verlag, 1985.
Figure 30: Average time and maximum tolerated error level for the filtration algorithms.
[BBH+85] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31-55, 1985.
[BM77] R. Boyer and J. Moore. A fast string searching algorithm. Comm. of the ACM, 20(10):762-772, 1977.
[BY89] R. Baeza-Yates. Efficient Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Waterloo, May 1989. Also as Research Report CS-89-17.
[BY91] R. Baeza-Yates. Some new results on approximate string matching. In Workshop on Data Structures, Dagstuhl, Germany, 1991. Abstract.
[BY92] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, volume I, pages 465-476. Elsevier Science, 1992.
[BY96] R. Baeza-Yates. A unified view of string matching algorithms. In SOFSEM'96: Theory and Practice of Informatics, LNCS 1175, pages 1-15. Springer-Verlag, 1996.
[BYG92] R. Baeza-Yates and G. Gonnet. A new approach to text searching. Comm. of the ACM, 35(10):74-82, 1992. Preliminary version in ACM SIGIR'89, 1989.
[BYG94]
[CM94] W. Chang and T. Marr. Approximate string matching and local similarity. In Proc. CPM'94, LNCS 807, pages 259-273. Springer-Verlag, 1994.
[Cob95] A. Cobbs. Fast approximate matching using suffix trees. In Proc. CPM'95, pages 41-54, 1995.
[CR94] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, Oxford, UK, 1994.
[Cro86] M. Crochemore. Transducers and repetitions. Theoretical Computer Science, 45:63-86, 1986.
[CS75] V. Chvatal and D. Sankoff. Longest common subsequences of two random sequences. Journal of Applied Probability, 12:306-315, 1975.
[CW79] B. Commentz-Walter. A string matching algorithm fast on the average. In Proc. ICALP'79, LNCS 6, pages 118-132. Springer-Verlag, 1979.
[Dam64] F. Damerau. A technique for computer detection and correction of spelling errors. Comm. of the ACM, 7(3):171-176, 1964.
[Dek79] J. Deken. Some limit results for longest common subsequences. Discrete Mathematics, 26:17-31, 1979.
[DFG+97] G. Das, R. Fleischer, L. Gasieniec, D. Gunopulos, and J. Kärkkäinen. Episode matching. In Proc. CPM'97, LNCS 1264, pages 12-27. Springer-Verlag, 1997.
[DM79] R. Dixon and T. Martin, editors. Automatic speech and speaker recognition. IEEE Press, 1979.
[EH88] A. Ehrenfeucht and D. Haussler. A new distance metric on strings computable in linear time. Discrete Applied Mathematics, 20:191-203, 1988.
[EL90] D. Elliman and I. Lancaster. A review of segmentation and contextual analysis techniques for text recognition. Pattern Recognition, 23(3/4):337-346, 1990.
[FPS97] J. French, A. Powell, and E. Schulman. Applications of approximate word matching in information retrieval. In Proc. ACM CIKM'97, pages 9-15, 1997.
[GBY91] G. Gonnet and R. Baeza-Yates. Handbook of Algorithms and Data Structures. Addison-Wesley, 2nd edition, 1991.
[GG88] Z. Galil and R. Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4:33-72, 1988.
[GKHO97] R. Giegerich, S. Kurtz, F. Hischke, and E. Ohlebusch. A general technique to improve filter algorithms for approximate string matching. In Proc. WSP'97, pages 38-52. Carleton University Press, 1997. Preliminary version as Technical Report 96-01, Universität Bielefeld, Germany, 1996.
[GL89] R. Grossi and F. Luccio. Simple and efficient string matching with k mismatches. Information Processing Letters, 33(3):113-120, 1989.
[Gon92] G. Gonnet. A tutorial introduction to Computational Biochemistry using Darwin. Technical report, Informatik E.T.H., Zuerich, Switzerland, 1992.
[Gos91] J. Gosling. A redisplay algorithm. In Proc. ACM SIGPLAN/SIGOA Symp. on Text Manipulation, pages 123-129, 1991.
[GP90] Z. Galil and K. Park. An improved algorithm for approximate string matching. SIAM Journal on Computing, 19(6):989-999, 1990. Preliminary version in ICALP'89, LNCS 372, 1989.
[GT78] R. Gonzalez and M. Thomason. Syntactic pattern recognition. Addison-Wesley, 1978.
[HD80] P. Hall and G. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381-402, 1980.
[Hec78] P. Heckel. A technique for isolating differences between files. Comm. of the ACM, 21(4):264-268, 1978.
[Hor80] R. Horspool. Practical fast searching in strings. Software Practice and Experience, 10:501-506, 1980.
[HS94] N. Holsti and E. Sutinen. Approximate string matching using q-gram places. In Proc. 7th Finnish Symposium on Computer Science, pages 23-32. University of Joensuu, 1994.
[HT84] D. Harel and E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-355, 1984.
[HU79] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979.
[JTU96] P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software Practice and Experience, 26(12):1439-1458, 1996. Preliminary version in Technical Report A-1991-7, Dept. of Computer Science, Univ. of Helsinki, 1991.
[JU91] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. MFCS'91, volume 16, pages 240-248. Springer-Verlag, 1991.
[KM97] S. Kurtz and G. Myers. Estimating the probability of approximate matches. In Proc. CPM'97, LNCS 1264, pages 52-64. Springer-Verlag, 1997.
[KMP77] D. Knuth, J. Morris, Jr, and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(1):323-350, 1977.
[Knu73]
[LW75]
[Nav97b]
[Nav98a]
[Nav98b]
[NBY98a]
[NBY98b]
[NBY98c]
[NBY99a]
[NBY99b]
[Nes86]
[NR98a]
[NR98b]
[NW70]
[OM88]
[Riv76]
[RS97]
[San72]
[Sel74]
[Sel80]
[Shi96]
[SK83]
[SM83]
[ST95]
[ST96]
[Sun90]
[SV88]
[SV97]
[Tak94]
[Tic84]
[TU88]
[TU93]
[WMM96] S. Wu, U. Manber, and E. Myers. A sub-quadratic algorithm for approximate limited expression matching. Algorithmica, 15(1):50-67, 1996. Preliminary version as Technical Report TR29-36, Computer Science Dept., Univ. of Arizona, 1992.
[Wri94] A. Wright. Approximate string matching using within-word parallelism. Software Practice and Experience, 24(4):337-362, 1994.
[Yao79] A. Yao. The complexity of pattern matching for a random string. SIAM Journal on Computing, 8:368-387, 1979.
[YFM96] T. Yap, O. Frieder, and R. Martino. High performance computational methods for biological sequence analysis. Kluwer Academic Publishers, 1996.
[ZD96] J. Zobel and P. Dart. Phonetic string matching: lessons from information retrieval. In Proc. SIGIR'96, pages 166-172, 1996.
Tarhio and Ukkonen 1990  First, the probability of a text character being "bad" is that of not
matching 2k + 1 pattern positions, i.e. Pbad = (1 − 1/σ)^(2k+1) ≈ e^(−(2k+1)/σ), so we try on average
1/Pbad characters until finding a bad one. Since k + 1 bad characters have to be found, we work
O(k/Pbad) to abandon the window. On the other hand, the probability of verifying a text window
is that of reaching its beginning. We approximate that probability by equating m to the average
portion of the traversed window (k/Pbad), to obtain α < e^(−(2k+1)/σ).
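In other words (this is just our expansion of the argument above, writing α = k/m), verifications
remain negligible as long as the window is abandoned well before its beginning is reached:

    \[
      P_{bad} = \Bigl(1-\frac{1}{\sigma}\Bigr)^{2k+1} \approx e^{-(2k+1)/\sigma},
      \qquad \text{work per window} = O\!\Bigl(\frac{k}{P_{bad}}\Bigr),
    \]
    \[
      \frac{k}{P_{bad}} \ll m
      \;\Longleftrightarrow\;
      \alpha = \frac{k}{m} \ll P_{bad} \approx e^{-(2k+1)/\sigma} .
    \]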
Wu and Manber 1992  The Sunday algorithm can be analyzed as follows. To see how far we can
verify in the current window, consider that the (k + 1) patterns have to fail. Each one fails on
average after log_σ(m/(k + 1)) character comparisons, but the time for all of them to fail is longer. By
Yao's bound [Yao79], this cannot be less than log_σ m. Otherwise we could split the test of a single
pattern into (k + 1) tests of subpatterns and all of them would fail in less than log_σ m time, breaking
the lower bound. To compute the average shift, consider that k characters must be different from
the last window character, and therefore the average shift is σ/k. The final complexity is therefore
O(kn log_σ(m)/σ). This is optimistic, but we conjecture that it is the correct complexity. An upper
bound is obtained by replacing k by k^2 (i.e. adding the times for all the pieces to fail).
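The final figure follows by dividing the work done per window by the average shift (again, only our
spelling out of the step above):

    \[
      \text{cost} \;\approx\; n \cdot \frac{\text{work per window}}{\text{average shift}}
      \;\approx\; n \cdot \frac{\log_\sigma m}{\sigma/k}
      \;=\; O\!\Bigl(\frac{k\,n\,\log_\sigma m}{\sigma}\Bigr).
    \]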
Navarro and Raffinot 1998  The automaton matches the text window with k errors almost
surely until k/α* characters have been inspected (so that the error level becomes lower than α*).
From there on, the matching probability becomes exponentially decreasing with γ, which can be
made 1/σ in O(k) total
steps. From that point on we are in a case of exact string matching and then log_σ m characters
are inspected, for a total of O(k/α* + log_σ m). When the window is shifted to the last prefix that
matched with k errors, this is also at k/α* distance from the end of the window, on average. The
window length is m − k, and therefore we shift the window by m − k − k/α* on average. Therefore,
the total amount of work is O(n(α + log_σ(m)/m)/((1 − α)α* − α)). The filter works well unless
the probability of finding a pattern prefix with errors at the beginning of the window is high. This
is the same as saying that k/α* = m − k, which gives α < (1 − e/√σ)/(2 − e/√σ).
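The threshold value follows from the stated condition (our derivation, assuming α* = 1 − e/√σ, the
error-level limit used earlier in the survey):

    \[
      \frac{k}{\alpha^{*}} = m-k
      \;\Longrightarrow\;
      \alpha = \frac{k}{m} = \frac{\alpha^{*}}{1+\alpha^{*}}
      = \frac{1-e/\sqrt{\sigma}}{\,2-e/\sqrt{\sigma}\,}.
    \]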
Ukkonen 1992  The probability of finding a given q-gram in the text window is 1 − (1 − 1/σ^q)^m