Approximate String Matching

Ricardo Baeza-Yates
Center for Web Research
www.cwr.cl
Depto. de Ciencias de la Computación
Universidad de Chile
Santiago, CHILE
rbaeza@dcc.uchile.cl

Outline
- Problem
- String searching
- From automata to algorithms
- Filtering
- Indices
- ASM with Indices
- Concluding remarks
User’s point of view

Theory vs. Practice
Problem

Text T of length n
Pattern P of length m (m <= n); m is considered bounded
Problem: find all occurrences of P in T (exact match)

Search Models

1. An occurrence is any text substring
2. An occurrence is any sequence starting in an index-point
Some data structures assume the first model

Answer Models

Computation Models

- Text-Pattern comparisons
- Arithmetical/Bitwise operations
- Space complexity: the extra space used for the search (index)
- Worst-case
- Average-case (uniform text and pattern)
Algorithmic point of view

Input data:
- Preprocessed text (indices):
  - Inverted index
  - Suffix trees (tries, Patricia trees, ....)
  - Based on q-grams
  - Signature files
- Sequential search:
  - Brute force
  - Boyer-Moore like algorithms
  - Automata: DAWGs, suffix based KMP
  - Shift-or
- Hybrid solutions: RAC
  - Filtering or Filtration
  - Two Level TR

String-Matching Space-Time Trade-Offs

[Figure: space vs. time chart; brute force and sequential search (Boyer-Moore like) use no extra space but more time, signature files and hybrid solutions lie in between, and tries, Patricia trees and suffix trees trade the most space for the fastest search.]
String Matching: Definition

text: This is a text example ...
pattern: ex

Variations:
- Allow mismatches (Hamming distance)
- Language dependent measure: phonetic, morphemes, etc.

Software examples: the grep command in Unix (sequential) or Google on the Web (index based).

String Matching Complexity

m: size of the pattern
- Worst case: lower and upper bound
- Average case: lower and upper bound on the number of comparisons
- ASM: worst case, average case; several results, still open
Classical Algorithms

Knuth-Morris-Pratt: match heuristic
[Figure: pattern x slides over text y; the matched prefix determines the shift.]

Boyer-Moore: match heuristic (defines the BM automata) and occurrence heuristic
[Figure: pattern x compared backwards against text y; the mismatching text character (occurrence heuristic, as in Horspool and Sunday) determines the shift.]

String Searching: Historical View

[Figure: timeline from theory to practice:
1970 Knuth-Morris-Pratt, Fischer-Patterson, Boyer-Moore
1980 Karp-Rabin, Horspool, Galil, Rytter
1986 Apostolico-Giancarlo
1988 Baeza-Yates, Baeza-Yates/Gonnet, Regnier, Abrahamson
1990 Cole, Choffrut, Sunday, Colussi-Galil-Giancarlo, Hume-Sunday, Crochemore-Perrin, Wu-Manber, Baeza-Yates
1992 Cole-Hariharan, Baeza-Yates/Perleberg]
Knuth-Morris-Pratt Algorithm

Fascinating story.... from theory and practice

Preprocessing: the next table, computed by running the same matching procedure of the pattern against itself

Example:

  pattern  a b r a c a d a b r a
  next[j]  0 1 1 0 2 0 2 0 1 1 0 5

Algorithm

search( text, n, pat, m )  // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int next[MAX_PATTERN_SIZE];
    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    kmp( pat, m+1, pat, m+1, next );  // Preprocess pattern
    kmp( text, n, pat, m, next );     // Search text
    pat[m+1] = END_OF_STRING;
}

kmp( text, n, pat, m, next )
char text[], pat[];
int n, m, next[];
{
    static dosearch = 0;
    int i, j;
    i = 1;
    if( !dosearch )          // Preprocessing
        j = next[1] = 0;
    else j = 1;
    do {
        if( j == 0 || text[i] == pat[j] )
        {
            i++; j++;
            if( !dosearch ) {  // Preprocessing
                if( text[i] != pat[j] ) next[i] = j;
                else next[i] = next[j];
            }
            else if( j > m ) { // Search: match found
                Report_match_at_position( i-j+1 );
                j = next[j];
            }
        }
        else j = next[j];
    } while( i <= n );
    dosearch = 1;
}
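The slide code is a compact K&R-era version. As a cross-check of the same failure-function idea, here is a self-contained 0-indexed sketch in modern C (function names are mine, not from the slides):

```c
#include <assert.h>
#include <string.h>

// Build the KMP failure table: fail[j] = length of the longest proper
// border (prefix that is also a suffix) of pat[0..j].
static void kmp_table(const char *pat, int m, int *fail) {
    fail[0] = 0;
    int k = 0;
    for (int j = 1; j < m; j++) {
        while (k > 0 && pat[j] != pat[k]) k = fail[k-1];
        if (pat[j] == pat[k]) k++;
        fail[j] = k;
    }
}

// Return the position of the first occurrence of pat in text, or -1.
int kmp_search(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    if (m == 0) return 0;
    if (m > n) return -1;
    int fail[256];                 // sketch: assumes m < 256
    kmp_table(pat, m, fail);
    for (int i = 0, j = 0; i < n; i++) {
        while (j > 0 && text[i] != pat[j]) j = fail[j-1];
        if (text[i] == pat[j]) j++;
        if (j == m) return i - m + 1;   // match ends at text[i]
    }
    return -1;
}
```

Each text character is inspected, and j can back up only as far as it previously advanced, which gives the O(n + m) worst case claimed for KMP.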
Boyer-Moore-Horspool-Sunday Algorithm

The match heuristic can be extended: BM automata, suffix automata
In practice the occurrence heuristic is the key issue:

search( text, n, pat, m )  // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int d[MAX_ALPHABET_SIZE], i, j, k, lim;
    // Preprocessing
    for( k=0; k<MAX_ALPHABET_SIZE; k++ )
        d[k] = m+1;
    for( k=1; k<=m; k++ )
        d[pat[k]] = m+1-k;
    // Search
    lim = n-m+1;
    for( k=1; k <= lim; k += d[text[k+m]] )
    {
        i=k;
        for( j=1; j<=m && text[i] == pat[j]; j++ )  // Could use an optimal comparison order
            i++;
        if( j == m+1 )
            Report_match_at_position( k );
    }
}

Counting: Baeza-Yates/Perleberg, 1992

A simple example of filtering:
Idea: Count the number of matches for all possible positions of the pattern
Straight implementation: Brute force algorithm

[Figure: a pattern and a text with, below the text, the count of matching characters for every possible pattern position.]

Improvements: Preprocess the pattern, computing which characters of the alphabet should update a counter (extra space needed)
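A minimal self-contained sketch of the Sunday variant of the occurrence heuristic in modern C (0-indexed; names mine): the shift is taken from the text character just after the current window.

```c
#include <assert.h>
#include <string.h>

// Sunday's occurrence heuristic: after trying window text[k..k+m-1],
// shift by d[c] where c = text[k+m], the character just past the window.
// Characters absent from the pattern allow the maximal shift m+1.
int bmh_sunday_search(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    if (m == 0) return 0;
    if (m > n) return -1;
    int d[256];
    for (int c = 0; c < 256; c++) d[c] = m + 1;      // not in pattern
    for (int k = 0; k < m; k++) d[(unsigned char)pat[k]] = m - k;
    for (int k = 0; k + m <= n; ) {
        if (memcmp(text + k, pat, m) == 0) return k; // window matches
        if (k + m == n) break;                       // no next character
        k += d[(unsigned char)text[k + m]];          // occurrence shift
    }
    return -1;
}
```

On random text most windows are skipped after a single table lookup, which is why this family is sublinear on average.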
Example

  Pattern = t h a n
  Text    = t h i s i s a n e x a m p l e t h a t

Code: for all j such that pattern[j] = text[i], increment count[i-j+1]

Running time

Each step costs the number of pattern positions holding text[i]; on average the total cost is low
Cost is independent of the number of mismatches
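The counting rule above (increment count[i-j+1] for every pattern position j holding text[i]) can be sketched literally in C; the function name and the caller-supplied count array are mine:

```c
#include <assert.h>
#include <string.h>

// Brute-force counting filter: count[p] = number of characters that
// match when the pattern is aligned at text position p (0-indexed).
// Alignments with count[p] >= m - k pass the filter and get verified.
// count[] must have room for n - m + 1 entries.
void count_matches(const char *text, const char *pat, int *count) {
    int n = strlen(text), m = strlen(pat);
    for (int p = 0; p + m <= n; p++) count[p] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            if (text[i] == pat[j]) {
                int p = i - j;               // implied alignment start
                if (p >= 0 && p + m <= n) count[p]++;
            }
}
```

With pattern "than" and text "this is an example that", the alignment at the final "that" scores 3 matching characters (t, h, a), so it passes a filter allowing one mismatch.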
Bit Parallelism: Baeza-Yates/Gonnet, 1989 [2]

Parallel algorithm using one processor per pattern position
Processor j: 1 if the pattern prefix of length j does not match ending at the current character, 0 otherwise
The output processor signals a match at the current character

[Figure: searching "text" in "t h i s i s a t e x t"; after each text character, the bit vector of the processors is shown, the output bit dropping to 0 when "text" ends.]

Bit sequence simulation

For finite alphabets, all possible comparisons can be precomputed before the search

In the example:

        T[t] T[e] T[x] T[*]
   t     1    0    0    0
   e     0    1    0    0
   x     0    0    1    0
   t     1    0    0    0

Complexity

Preprocessing time, search time and space (in words) depend only on n, m and the word size

Code:

// Preprocessing
for( i=0; i<MAXSYM; i++ ) T[i] = ~0;
for( lim=0, j=1; *pattern != EOS; lim |= j, j <<= B, pattern++ )
    T[*pattern] &= ~j;
lim = ~(lim >> B);
// Search
matches = 0; state = ~0;  // Initial state
for( ; *text != EOS; text++ )
{
    state = (state << B) | T[*text];
    if( state < lim ) matches++;  // Match at the current position
}

the shift-and/or algorithm
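A minimal sketch of the complemented shift-and formulation in modern C (active states are 1-bits, so the update is a shift, an or of the initial state, and a table and; names are mine, and the sketch assumes m <= 63):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Shift-And: bit j of `state` is 1 iff pat[0..j] matches the text
// suffix ending at the current character; bit m-1 set means a match.
int shift_and_search(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    uint64_t T[256] = {0}, state = 0;
    for (int j = 0; j < m; j++)               // Preprocessing: T table
        T[(unsigned char)pat[j]] |= (uint64_t)1 << j;
    for (int i = 0; i < n; i++) {             // Search
        state = ((state << 1) | 1) & T[(unsigned char)text[i]];
        if (state & ((uint64_t)1 << (m - 1)))
            return i - m + 1;                 // first occurrence
    }
    return -1;
}
```

Every text character costs a constant number of word operations, independently of m, as long as the pattern fits in one machine word.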
Extensions

- Every pattern element is a class of symbols: just change T!
- Don’t care symbols on the text
- Multiple patterns: just one longer bit sequence
- Mismatches: count the number of mismatches (an overflow bit plus a few bits per position)
- Agrep: was the fastest approximate search tool for Unix, now nrgrep

Approximate String Matching: Dynamic Programming

C[i,j]: minimum number of errors to match P[1..i] to a suffix of T[1..j]

Example: searching "survey" in "surgery":

        s u r g e r y
      0 0 0 0 0 0 0 0
  s   1 0 1 1 1 1 1 1
  u   2 1 0 1 2 2 2 2
  r   3 2 1 0 1 2 2 3
  v   4 3 2 1 1 2 3 3
  e   5 4 3 2 2 1 2 3
  y   6 5 4 3 3 2 2 2

The bit-wise approach to DP is the fastest for long strings (> 8)
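The matrix above is generated column by column by a simple recurrence (Sellers' text-searching variant, where row 0 stays 0 so an occurrence may start anywhere). A self-contained sketch in C, using O(m) space; names are mine:

```c
#include <assert.h>
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

// C[i] = minimum errors to match pat[0..i-1] against some suffix of the
// text read so far; C[0] stays 0. Returns the smallest value reached by
// C[m], i.e. the best match distance over all text positions.
int dp_best_match(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    int C[128], best = m;              // sketch: assumes m < 128
    for (int i = 0; i <= m; i++) C[i] = i;
    for (int j = 0; j < n; j++) {
        int prev = C[0];               // diagonal cell from previous column
        for (int i = 1; i <= m; i++) {
            int tmp = C[i];
            C[i] = min3(prev + (text[j] != pat[i-1]),  // match/substitute
                        C[i-1] + 1,                    // deletion
                        tmp + 1);                      // insertion
            prev = tmp;
        }
        if (C[m] < best) best = C[m];
    }
    return best;
}
```

For pattern "survey" and text "surgery" the best value in the bottom row of the matrix is 2, matching the table above.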
From Automata to Bit-parallelism

Exploit the automaton structure
Consider the NFA to search for "text" in a text

[Figure: linear NFA t-e-x-t with a self-loop on the initial state; the final state is the output.]

Processors holding a 1 correspond to the active states of the standard simulation

Approximate string searching

Consider the NFA for searching "text" with at most k errors

[Figure: one copy of the t-e-x-t row per error level (no errors, 1 error, 2 errors), connected by vertical, diagonal and epsilon transitions that model insertions, substitutions and deletions.]

Be careful with the epsilon-closure
Horizontal bit parallelism: Wu & Manber

[Figure: the rows of the NFA for "text" (no errors, 1 error, 2 errors), each row packed into one machine word.]

Initially each row holds its level number of ones
Drawback: dependency on k (one word update per error level)

Vertical bit parallelism

[Figure: the same NFA for "text", packed by columns instead of rows.]

Initially the no-error state is active and the columns are set accordingly
More space is needed
Diagonal bit parallelism: Baeza-Yates & Navarro [1996]

[Figure: the NFA for "patt" (no errors, 1 error, 2 errors), packed by diagonals into a single word D, where the update formula merges all error levels in a constant number of operations.]

ASM: Sequential Algorithms
Dynamic Programming

Automata and Bit-Parallelism
Filtering or Filtration

Discard most of the text, verifying only the areas that may contain an occurrence; the search time is then dominated by the filter on average
Filtration can be done by a sequential scan or by an index
There is always a maximum error ratio up to which filtering works; beyond it, the areas to verify cover almost all the text
Verification can be done in a hierarchical fashion

A First Lemma for Filtering

Lemma 1: Let A and B be two strings such that ed(A,B) <= k. Let A = A1 x1 A2 x2 ... x(j-1) Aj, for arbitrary strings Ai and xi. Then, at least j-k strings Ai appear in B. Moreover, their relative distances inside B cannot differ from those in A by more than k.

Proof: Consider the sequence of at most k edit operations that converts A into B. Each edit operation can affect at most one of the Ai's, so at least j-k of them must remain unaltered. Relative distances: the edit operations cannot produce misalignments larger than k.
Example

  A = A1 x1 A2 x2 A3 x3 A4 x4 A5
  B = A1 A2' A3 A4' A5

There are actually 3 such unaltered segments because one of the errors appeared in some xi.
Another possible reason could have been more than one error occurring in a single Ai.

Filtering Algorithms
Worst Case Complexity and Space

Average Case Complexity and Error Ratio

Best Algorithms

Data Structures

Suffix arrays permit the same operations as suffix trees but are slightly slower.
q-grams allow searching for any text substring not longer than q.
q-samples permit the same but only for some text substrings.
Inverted Indices

Idea: store all words and their positions

Text (with the byte position of each word):
  This(1) is(6) a(9) text(11). A(17) text(19) has(24) many(28) words(33). Words(40) are(46) made(50) from(55) letters(60).

Vocabulary and occurrences:
  letters  60 ...
  made     50 ...
  many     28 ...
  text     11, 19 ...
  words    33, 40 ...

Vocabulary search: hashing, sorted array, etc.
Linear space
Granularity of the occurrences depends on what we want to answer: file, word, byte

Inverted Files: Space

Posting file: linear space (one occurrence = one pointer)
Word distribution: Zipf’s law
Stopwords: half the posting file
Complex Patterns

[Figure: six short documents d1-d6 about used cars and trucks, indexed through a vocabulary trie whose leaves point into the posting file; e.g. the entry [d5, 2-3] addresses words 2-3 of document d5.]

Building Inverted Indices

[Figure: inversion by merging: initial dumps I-1 ... I-8 (Level 1) are merged pairwise, level by level, until a single inverted file remains.]
Two-level Text Retrieval: Block addressed inverted files

Each entry indicates only the blocks where the word appears (1 byte per block)
First we search in the inverted file, next in the corresponding blocks using a fast sequential algorithm
Complexity depends on the number of occurrences
For large texts, empirical results show that the index requires less than 5% of the text size

Inverted File Space in Practice

Index size as a fraction of the text, for three collections of increasing size (each pair of columns: without / with stopwords):

  Document addressing       19%   26%   18%   32%    26%   47%
  Block (64K) addressing    27%   41%   18%   32%     5%    9%
  Block (256) addressing    18%   25%  1.7%  2.4%   0.5%  0.7%
Tries and Suffix Trees

  Text:      a b r a c a d a b r a
  Position:  1 2 3 4 5 6 7 8 9 10 11

[Figure: suffix trie of the text; each leaf stores the starting position (1-11) of its suffix, with "$" marking the end of the text.]

Problem: space can be quadratic in a trie
Compact suffix trees (Patricia trees): cut unary paths
To remember the depth, a count is added at every node or the string associated to the path is stored

Suffix Arrays

  Text:          to be at the beach or to be at work, that is the real question $
  Index points:  1 4 7 10 14 20 23 26 29 32 38 43 46 50 55

Suffixes:
   1: to be at the beach or to be at work, that is the real question
   4: be at the beach or to be at work, that is the real question
   7: at the beach or to be at work, that is the real question
  10: the beach or to be at work, that is the real question
  ......
  55: question

Suffix array (index points sorted by the lexicographic order of their suffixes):

   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   7 29  4 26 14 43 20 55 50 38 10 46  1 23 32
Suffix Array Search

The prefix relation can be used for lexicographic order
Hence, two binary searches are enough to obtain the suffix array range where all occurrences of P appear

Example: searching for "be" amounts to finding the range of suffixes between "be" and "bf":

   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   7 29  4 26 14 43 20 55 50 38 10 46  1 23 32   Suffix Array

  Index points: 1 4 7 10 14 20 23 26 29 32 38 43 46 50 55

Suffix Array: Construction

But suffixes are suffixes of suffixes
For large texts: build small suffix arrays in main memory and merge them into the long suffix array with a sequential scan, counting how many small-text suffixes fall between consecutive entries
Building time is now linear (2003!)
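The two binary searches can be sketched directly in C (0-indexed; the function name and the caller-supplied arrays are mine): one search finds the first suffix whose prefix is >= P, the other the first whose prefix is > P.

```c
#include <assert.h>
#include <string.h>

// Compute the suffix-array range [*lo, *hi) of the suffixes whose first
// m characters equal pattern p: two binary searches, O(m log n) total.
void sa_range(const char *text, const int *sa, int n,
              const char *p, int *lo, int *hi) {
    int m = strlen(p), l = 0, r = n;
    while (l < r) {                  // lower bound: first suffix >= p
        int mid = (l + r) / 2;
        if (strncmp(text + sa[mid], p, m) < 0) l = mid + 1; else r = mid;
    }
    *lo = l;
    r = n;
    while (l < r) {                  // upper bound: first suffix > p
        int mid = (l + r) / 2;
        if (strncmp(text + sa[mid], p, m) <= 0) l = mid + 1; else r = mid;
    }
    *hi = l;
}
```

Usage: for "abracadabra" the 0-indexed suffix array is {10,7,0,3,5,8,1,4,6,9,2}; searching "abra" yields the range [1, 3), i.e. the suffixes starting at positions 7 and 0.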
Q-gram Indexes

In a q-gram index, every different text q-gram is stored, and all its positions are stored in increasing text order.

  Text:      a b r a c a d a b r a
  Position:  1 2 3 4 5 6 7 8 9 10 11

[Figure: trie of the text 3-grams with their occurrence lists: abr -> 1, 8; bra -> 2, 9; rac -> 3; aca -> 4; cad -> 5; ada -> 6; dab -> 7.]

Q-Sample Indexes

In a q-sample index, only some text q-grams are stored. In this case the samples do not overlap.

[Figure: the pattern P, allowing errors, split into its q-grams, matched against the non-overlapping text q-samples.]

Useful to search long strings, using much less space (from 0.5 down to a small fraction of the text)
Algorithms for ASM using an Index

Partitioning into exact search: some pattern pieces must appear unaltered in any approximate occurrence; the algorithm uses the index to search for those substrings, and checks the text areas surrounding them.
Assuming that the errors occur in the pattern or in the text leads to radically different approaches.
Intermediate partitioning extracts substrings from the pattern that are searched for allowing fewer errors using neighborhood generation.

Current Results on this Taxonomy

               Neighborhood         Partitioning into              Intermediate
               generation           exact search                   partitioning
  Suffix       [10] Gonnet 88                                      [21] Navarro &
  array        [8] Cobbs 95                                             Baeza-Yates 99
  Q-grams      n/a                  [13] Jokinen & Ukkonen 91      [17] Myers 90
                                    [12] Holsti & Sutinen 94
                                    [20] Navarro & Baeza-Yates 97
  Q-samples    n/a                  [26] Sutinen & Tarhio 96       [22] Navarro et al. 2000
Neighborhood Generation

The Neighborhood of the Pattern: the set of strings at distance at most k from P; generate it and search the text for all its occurrences [17]
The same idea can be used to compare a whole text against another one or against itself [6]

Backtracking

Use a suffix tree or array to find the neighborhood in the text [4, 10, 29]
Just some branches will be followed, but they factorize many matches
While searching we have three cases at node x:
a) ed(x, P) <= k, which means that x is in the neighborhood, and we report all the leaves of the current subtree as answers.
b) ed(x, P') > k for every prefix P' of P, which means that x is not a prefix of any string in the neighborhood and we can abandon this branch.
c) Otherwise, we keep descending by every branch.
Example

The matrix can be seen now as a stack that grows to the right

[Figure: backtracking over the suffix trie of the text with pattern "survey"; descending by s, u, r, g, e, ... pushes one new DP column per branch character, the minimum column entries (5, 4, 2, 2, 2, ...) shown along the branches.]

In the example:
With k = 2 the backtracking ends indeed after reading "surge"
With k = 1 the search would have been pruned after considering "surger" and "surga", since in both cases no entry of the matrix is <= k

Partitioning into Exact Search: Errors in the Pattern

We use Lemma 1 under the setting j = k+1 and empty xi's. That is, the pattern is split in k+1 pieces, and hence at least one of the pieces must appear unaltered inside any occurrence.
Then, the pieces are searched and the text areas where some piece appears are verified.
Search time in the index is small, but the checking time dominates.
The case proposed in [20] shows a sublinear average time.
If j grows beyond k+1, the pieces get shorter and hence there are more matches to check, but on the other hand, forcing j-k pieces to match makes the filter stricter [24]. Recent results show that this is slower.
Note that, since we cannot know where the pattern pieces can be found in the text, all the text positions must be searchable.
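A minimal sketch of the k+1 pieces filter in C (0-indexed; names mine). It only produces the candidate positions where some piece occurs exactly; the surrounding areas would then be verified with a DP algorithm, which this sketch omits.

```c
#include <assert.h>
#include <string.h>

// Split pat into k+1 contiguous pieces; by Lemma 1 (j = k+1, empty xi)
// any occurrence with <= k errors contains at least one piece unaltered.
// Report candidate pattern start positions (piece position minus piece
// offset, clamped to 0). cand[] must hold up to max_cand entries.
int exact_partition_candidates(const char *text, const char *pat,
                               int k, int *cand, int max_cand) {
    int n = strlen(text), m = strlen(pat), nc = 0;
    int pieces = k + 1;
    for (int i = 0; i < pieces; i++) {
        int from = i * m / pieces, to = (i + 1) * m / pieces;
        int plen = to - from;                 // piece pat[from..to-1]
        for (int p = 0; p + plen <= n; p++)   // naive exact scan
            if (strncmp(text + p, pat + from, plen) == 0) {
                int c = p - from;             // implied pattern start
                if (c < 0) c = 0;
                if (nc < max_cand) cand[nc++] = c;
            }
    }
    return nc;   // number of candidates (may contain duplicates)
}
```

Usage: pattern "survey" with k = 1 gives pieces "sur" and "vey"; in the text "surgery" only "sur" occurs (at position 0), so position 0 is the single candidate handed to the verifier.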
Partitioning into Exact Search: Errors in the Text

Assume now that the errors occur in the text, i.e., O is an occurrence of P in T with errors.
We extract substrings of length q at fixed text intervals of length h.
Those q-samples correspond to the Ai's of Lemma 1, and the text in between to the xi's.
What the lemma ensures is that, inside any occurrence of P containing j text q-samples, at least j-k of them appear in P at about the same positions (within k).
Choosing q:
- Should be small to avoid a very large set of different q-samples.
- Should be large to minimize the amount of verification.
Some analysis [25] shows which q is the optimal value.

Search Algorithm

At search time, all the (overlapping) pattern q-grams are extracted and searched for in the index of text q-samples.
When enough pattern q-grams match in the text at the proper distances, the area is verified.
This idea is presented in [26], and earlier versions in [13, 12, 27].
Intermediate Partitioning

We filter the search by looking for pattern pieces, but those pieces are large and still may appear with errors in the occurrences.
However, they appear with fewer errors, and then we use neighborhood generation to search for them.

Lemma 2: Let A and B be two strings such that ed(A,B) <= k. Let A = A1 x1 A2 x2 ... x(j-1) Aj. Then, at least one string Ai appears with at most floor(k/j) errors in B.

Proof and Example

The proof is easy: if every Ai needs more than k/j errors to match in B, then the total distance would exceed j * (k/j) = k.
Note that in particular we can choose the same length for every Ai.

  A = A1 x1 A2 x2 A3
  B = A1' A2' A3'

Let j = 3. At least one of the Ai's has at most one error (in this case A2').
Intermediate Partitioning: Errors in the Pattern

Search approaches based on this method have been proposed in [17, 21]. The algorithm is:
- Split the pattern in j pieces, for some 1 <= j <= k+1.
- Use neighborhood generation to find the text positions where the pieces appear with floor(k/j) errors.
- For each such text position, check with an on-line algorithm the surrounding text.

What value for j?

In [17], the pattern is partitioned because they use a q-gram index, so they use the minimum j that gives short enough pieces.
In [21] the index can search for pieces of any length, and the partitioning is done in order to optimize the search time; j ranges from 1 (neighborhood generation) to k+1 (partitioning into exact search).
- We search for pieces of length m/j with k/j errors, so the error level stays about the same for the subpatterns.
- As j moves to 1, the cost to search for the neighborhood of the pieces grows exponentially with their length. So, to find the pieces, a larger j is better.
Cost to verify the occurrences: consider a pattern that is split in j pieces, for increasing j. Start with j = 2.
- Lemma 2 states that every occurrence of the pattern involves an occurrence of at least one of its two halves with floor(k/2) errors, although there may be occurrences of the halves that yield no occurrences of the pattern.
- Likewise, every occurrence of the halves involves an occurrence of at least one quarter, and there may be occurrences of the quarters that yield no occurrences of a pattern half.
- Hence, the verification cost grows from zero at j = 1 to its maximum at j = k+1.

Trade-off

[Figure: search cost decreases and verification cost increases as j grows, from neighborhood generation (j = 1) through intermediate partitioning to partitioning into exact search (j = k+1).]

In [21] we show the optimal j, yielding a time complexity of O(n^lambda), for lambda < 1.
This is sublinear (lambda < 1) for alpha < 1 - e/sqrt(sigma) (a pessimistic bound; e is replaced by 1 in practice).
The same results are obtained in [17] by a suitable setting of j.
The experiments in [21] show that this intermediate approach is by far superior to both extremes.
Intermediate Partitioning: Errors in the Text

Consider an occurrence containing a sequence of text q-samples, which must be chosen at steps of h.
By Lemma 2, one of the q-samples must appear in the pattern with few errors.
This method [26, 22] searches every block in the index of q-samples using backtracking, so as to find the least number of errors to match each text q-sample inside P.
If a zone of consecutive samples is found whose errors add up to at most k, the area is verified.

To allow efficient neighborhood searching, we need to limit the maximum error level allowed. Permitting k errors may be too expensive, as every text q-sample would then match.
We choose a limit s and assume that every text q-sample not found within s errors indeed matches with s+1 errors.
We search the pattern blocks permitting only s errors. Every q-sample found with e <= s errors changes its estimation from s+1 to e; otherwise it stays at the optimistic bound s+1.
- For a small s value, the search of the neighborhoods is cheaper.
- Using larger s values gives more exact estimates of the actual number of errors of each text q-sample, reducing useless verifications in exchange for a higher cost to search the index.
Optimal s? In [22] it is mentioned that, as the cost of the search grows exponentially with s, the minimal s can be a good choice. Experimentally this scheme tolerates higher error ratios.
Concluding remarks

Problem reduction works for text searching. Example: multiple string searching plus checking:
- Two dimensional case [Baeza-Yates and Regnier, 1990]
- Approximate pattern matching [Wu and Manber, 1991]
The final optimal algorithm depends on the input
Further study of input adaptive algorithms?
New uses for old concepts. Example: q-grams
Indexing for ASM on NL text can be done better
Approximation algorithms with worst-case performance guarantees [16].
Use a metric space to search [7].
New text indexes tailored to special cases: ASM

References

[3] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, volume I, pages 465–476. Elsevier Science, 1992.
[4] R. Baeza-Yates and G.H. Gonnet. Fast text searching for regular expressions or automaton searching on tries. Journal of the ACM, 43(6):915–936, Nov 1996.
[5] R. Baeza-Yates. A unified view of string matching algorithms. In Keith Jeffery, Jaroslav Král, and Miroslav Bartosek, editors, SOFSEM'96: Theory and Practice of Informatics, volume 1175 of Lecture Notes in Computer Science, pages 1–15, Milovy, Czech Republic, November 1996. Springer Verlag.
[6] R. Baeza-Yates and G. Gonnet. A fast algorithm on average for all-against-all sequence matching. In Proc. 6th Symp. on String Processing and Information Retrieval (SPIRE'99). IEEE CS Press, 1999. Previous version unpublished, Dept. of Computer Science, Univ. of Chile, 1990.
[7] E. Chávez and G. Navarro. A metric index for approximate string matching. In Proc. 5th Symp. on ...
[8] A. Cobbs. Fast approximate matching using suffix trees. In Proc. 6th Ann. Symp. on Combinatorial Pattern Matching (CPM'95), LNCS 807, pages 41–54, 1995.
[9] R. Giegerich, S. Kurtz, and J. Stoye. Efficient implementation of lazy suffix trees. In Proc. 3rd Workshop on Algorithm Engineering (WAE'99), LNCS 1668, pages 30–42, 1999.
[10] G. Gonnet. A tutorial introduction to Computational Biochemistry using Darwin. Technical report, Informatik E.T.H., Zurich, Switzerland, 1992.
[11] G. Gonnet, R. Baeza-Yates, and T. Snider. Information Retrieval: Data Structures and Algorithms, chapter 3: New indices for text: Pat trees and Pat arrays, pages 66–82. Prentice-Hall, 1992.
[12] N. Holsti and E. Sutinen. Approximate string matching using q-gram places. In Proc. 7th Finnish Symp. on Computer Science, pages 23–32. Univ. of Joensuu, 1994.
[13] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. 2nd Ann. Symp. on Mathematical Foundations of Computer Science (MFCS'91), pages 240–248, 1991.
[14] U. Manber and E. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. on Computing, 22(5):935–948, 1993.
[15] E. McCreight. A space-economical suffix tree construction algorithm. J. of the ACM, 23(2):262–272, 1976.
[16] S. Muthukrishnan and C. Sahinalp. Approximate nearest neighbors and sequence comparisons with block operations. In Proc. ACM Symp. on the Theory of Computing, pages 416–424, 2000.
[17] E. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374, 1994. Earlier version in Tech. report TR-90-25, Dept. of CS, Univ. of Arizona, 1990.
[18] G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395–415, 1999.
[19] G. Navarro. A guided tour to approximate string matching. ACM Comp. Surv., 33(1):31–88, 2001.
[20] G. Navarro and R. Baeza-Yates. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal, 1(2), 1998. http://www.clei.cl. Earlier version in Proc. CLEI'97.
[21] G. Navarro and R. Baeza-Yates. A hybrid indexing method for approximate string matching. J. of Discrete Algorithms, 1(1):205–239, 2000. Hermes Science Publishing. Earlier version in CPM'99.
[22] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Ann. Symp. on Combinatorial Pattern Matching (CPM'2000), LNCS 1848, pages 350–363, 2000.
[23] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 2000.
[24] F. Shi. Fast approximate string matching with q-blocks sequences. In Proc. 3rd South American Workshop on String Processing (WSP'96), pages 257–271. Carleton University Press, 1996.
[25] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In Proc. 3rd European Symp. on Algorithms (ESA'95), LNCS 979, pages 327–340, 1995.
[26] E. Sutinen and J. Tarhio. Filtration with q-samples in approximate string matching. In Proc. 7th Ann. Symp. on Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 50–61, 1996.
[27] T. Takaoka. Approximate pattern matching with samples. In Proc. 5th Int'l. Symp. on Algorithms and Computation (ISAAC'94), LNCS 834, pages 234–242, 1994.
[28] E. Ukkonen. Finding approximate patterns in strings. J. of Algorithms, 6:132–137, 1985.
[29] E. Ukkonen. Approximate string matching over suffix trees. In Proc. 4th Ann. Symp. on Combinatorial Pattern Matching (CPM'93), LNCS 684, pages 228–242, 1993.
[30] E. Ukkonen. Constructing suffix trees on-line in linear time. Algorithmica, 14(3):249–260, 1995.