0% found this document useful (0 votes)

75 views

Approximate String

The document outlines approximate string matching techniques for text retrieval, discussing the differences between theoretical and practical approaches. It describes problems in string searching, indexing techniques, and approximate string matching with indices. The summary is based on surveys by Baeza-Yates and own work on applications to areas like web retrieval, XML processing, and bioinformatics.

Uploaded by

api-3784370

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

75 views

Approximate String

Uploaded by

api-3784370

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 36

Outline

Approximate String Matching

Text Retrieval

Theory vs. Practice

Problem

String searching
Ricardo Baeza-Yates
From automata to algorithms
Center for Web Research
Filtering
www.cwr.cl
Depto. de Ciencias de la Computación Indices
Universidad de Chile
Santiago, CHILE ASM with Indices
rbaeza@dcc.uchile.cl
Concluding remarks

Based in surveys by Baeza-Yates [3], Baeza-Yates [5], Navarro

[19] and Navarro et al. [23], and own work and views.

2
User’s point of view Theory vs. Practice

How we can measure the goodness of an algorithm?

Asymptotic worst case behavior

Text Query
Asymptotic average case behavior
User-defined text normalization

User-defined Practical behavior

index points Indexing
and structure Algorithm
D. Knuth [IFIP’89 Invited speech]
Index Searching
Algorithm
Balance between theory and practice
Text Answer
User Software is hard
Interface

? The best theory

Tools vs.
is
Intelligence
inspired by practice
—
Applications to other areas: The best practice
Web retrieval, XML processing, NL processing, text mining, is
multimedia search, bioinformatics, signal processing, .... inspired by theory

3 4
Problem Search Models

: finite alphabet of size 1. is a word (depends on the language)

2. is any sequence starting in an index-point

Text

of length

Pattern of length ( )

Some data structures assume the first model
is considered bounded

Answer Models
Problem: find all occurrences of in

Space complexity : the extra space used for the search (index)
exact match

RAM model: words of size

approximate match (distance function needed)

Time complexity : time needed to find the pattern

closest match or all matches at a certain distance
or equivalent measure (for example, comparisons)

– Worst-case
Computation Models
– Average-case (uniform text and pattern)

Text-Pattern comparisons

Arithmetical/Bitwise operations

5 6
Algorithmic point of view String-Matching Space-Time Trade-Offs

Input data:

Raw pattern and text

Space Complexity
- sequential, on-line, real-time algorithms
Tries
Preprocessing of the pattern Indexed search

- pattern is known in advance Suffix arrays

Inverted files

Preprocessing of the text index

Patricia
trees
Hybrid solutions
– Inverted index

Signature files
– Suffix trees (tries, Patricia trees, ....)

– Suffix arrays Two level TR

Sequential search
– Based on -grams

Boyer-Moore like algorithms
– Automata: DAWGs, suffix based KMP

Shift-or

Brute force
Hybrid solutions: RAC

Time

– Filtering or Filtration Complexity

– Two Level TR

7 8
String Matching: Definition String Matching Complexity

Basic problem: find exact occurrences of a pattern in a text

: size of the text

Variations
: size of the pattern

– Allow mismatches (Hamming distance)

– Allow insertions (Episode distance, not symmetric)

Raw text
– Allos insertions and deletions (LCS distance)
– Worst case:
– Allow mismatches, insertions and deletions

lower and upper bound of comparisons

– Language dependent measure: phonetic, morphems, etc.

– Average case: lower and upper bound

Examples:

– ASM: worst case, average case

text
Preprocessed text:
text

– Index construction: time and space (finite alphabet)

text: This is a text example ...

t ext – Worst case: comparisons

– Average case: comparisons

Software examples: grep command in Unix (sequential) or Google – ASM: several results, still open
in the Web (index based).
9 10
Classical Algorithms String Searching: Historical View

Knuth-Morris-Pratt

x
1992 Cole-Hariharan Baeza-Yates/Perleberg
Text y
Hume-Sunday
Colussi-Galil-Giancarlo
1990 Cole Choffrut Sunday
Boyer-Moore Crochemore-Perrin
Match heuristic Wu-Manber Baeza-Yates
x
Regnier Baeza-Yates/Gonnet
1988 Baeza-Yates Baeza-Yates/Gonnet
Text y

y Abrahamson
Occurrence heuristic 1986
Apostolico-Giancarlo
x
Text y

Horspool Sunday
1980 Karp-Rabin Horspool
Rytter
Match heuristic defines BM automata Galil

Boyer-Moore

Fischer-Patterson

1970 Knuth-Morris-Pratt
Theory Practice

11 12
Knuth-Morris-Pratt Algorithm Algorithm
search( text, n, pat, m ) // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
Fascinating story.... from theory and practice {
int next[MAX_PATTERN_SIZE];

pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
Preprocessing: kmp( pat, m+1, pat, m+1, next ); // Preprocess pattern
kmp( text, n, pat, m, next ); // Search text
pat[m+1] = END_OF_STRING;
}

kmp( text, n, pat, m, next )

char text[], pat[];

int n, m, next[];
{
static dosearch = 0;
for .

int i, j;

i = 1;
if( !dosearch ) // Preprocessing
Example: j = next[1] = 0;
else j = 1;
do {
a b r a c a d a b r a if( j == 0 || text[i] == pat[j] )
{
i++; j++;
next[j] 0 1 1 0 2 0 2 0 1 1 0 5 if( !dosearch ) { // Preprocessing
if( text[i] != pat[j] ) next[i] = j;
else next[i] = next[j];
}
}

Worst case complexity:

else j = next[j];

if( dosearch && j > m ) { // Search

Report_match_at_position( i-m );
Extension to multiple patterns: Aho-Corasick j = next[m+1];
}
}
while( i <= n ) ;
dosearch = 1;
}

13 14
Boyer-Moore-Horspool-Sunday Algorithm Counting: Baeza-Yates/Perleberg, 1992
A simple example of filtering:
Match heuristic can be extended: BM automata, suffix automata
Idea: Count the number of matches for all
In practice the occurrence heuristic is the key issue:
possible positions of the pattern

Straight implementation: Brute force algorithm

with worst and average case time

search( text, n, pat, m ) // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
int d[MAX_ALPHABET_SIZE], i, j, k, lim;

// Preprocessing Pattern =
for( k=0; k<MAX_ALPHABET_SIZE; k++ )
d[k] = m+1; Text =
for( k=1; k<=m; k++ )
d[pat[k]] = m+1-k; Count = 2000 0020 0 0 0 10 0 0 0 0003001
// Search
lim = n-m+1;
for( k=1; k <= lim; k += d[text[k+m]] )
{
i=k; // Could optimal order Improvements
for( j=1; j<=m && text[i] == pat[j]; j++ )
i++;
if( j == m+1 ) Preprocess the pattern computing which
Report_match_at_position( k );
} characters of the alphabet should update a counter
}

We need only the last counters

Complexity ranges between and

extra space

15 16
Example Running time

Pattern = t h a n

total cost

Text = t hi s i s a n ex a m p l e t hat

Count = 2000 0020 0 0 0 10 0 0 0 0003001

is the number of text-pattern symbol matches

Each step is:

On average

For all j such that pattern[j] = text[i]
increment count[i-j+1] Cost is independent of number of mismatches

Not suitable for (e.g. DNA)

Code

for (i=0; i<n; i++) {

if ((off1=(aptr=&alpha[c=*t++])->offset) >= 0) {
count[(i+off1)&MOD256]--;
for (aptr=aptr->next; aptr!=NULL; aptr=aptr->next)
count[(i+aptr->offset)&MOD256]--;
}
if (count[i&MOD256] <= k) printf("%d",count[i&MOD256]);
count[i&MOD256] = m;
}

17 18
Bit Parallelism: Baeza-Yates/Gonnet, 1989 [2] Example:

1
Parallel algorithm using processors
1 t

0 t e

1 output
0 t e x

text
1 t e x t

current character
Output

Processor : 1 if

t h i s i s a t e x t

0 otherwise

19 20
Bit sequence simulation Complexity

For the uniform cost RAM model, we have

One bit per processor simulation with a bit vector!

Preprocessing time:

Search time:

For finite alphabets, all possible comparisons can be precomputed

Space needed: words

before the search

Code:
In the example: // Preprocessing
for( i=0; i<MAXSYM; i++ ) T[i] = ˜0;
T[t] T[e] T[x] T[*]
for( lim=0, j=1; *pattern != EOS; lim |= j, j <<= B, pattern++ )
t 1 0 0 0
T[*pattern] &= ˜j;
e 0 1 0 0
lim = ˜(lim >> B);
x 0 0 1 0
// Search
t 1 0 0 0
matches = 0; state = ˜0; // Initial state
for( ; *text != EOS; text++ )
{
0 1 shift-and/or algorithm

state = (state << B) | T[*text]; // Next state

Handbook of Algorithms and Data Structures, 2 Ed, 1991

if( state < lim )

// Match at current position-len(pattern)+1
}

21 22
Extensions Approximate String Matching: Dynamic Programming

Every pattern element is a class of symbols Minimum number of errors to match to a suffix of

Just change !

Don’t care symbols on the text:

Multiple patterns: just one longer sequence

Mismatches: count the number of mismatches
Example:

instead of &, using bits per position

is maximal number of mismatches allowed s u r g e r y

1 2 ... m 0 0 0 0 0 0 0 0
s 1 0 1 1 1 1 1 1
u 2 1 0 1 2 2 2 2
Overflow bit

bits r 3 2 1 0 1 2 2 3

Insertions and deletions [Wu & Manber, 1991] v 4 3 2 1 1 2 3 3

bit sequences

e 5 4 3 2 2 1 2 3
Agrep: Was the fastest approximate search tool for Unix, now y 6 5 4 3 3 2 2 2
nrgrep Bit-wise approach to DP is the fastest for long strings (¿ 8)

23 24
From Automata to Bit-parallelism Approximate string searching

Exploit automata structure Consider the NFA for searching text with at most errors

Consider the NFA to search for text

t e x t
no errors

t e x t
t e x t
1 error

Processors in 1 Active states of standard simulation

t e x t
2 errors
Be careful with -closure

Related to hardware implementations Longest positional match wanted?

Based in Baeza-Yates [5]

25 26
Horizontal bit parallelism: Wu & Manber Vertical bit parallelism:

t e x t

t e x t no errors
no errors

t e x t
t e x t 1 error
1 error

t e x t
t e x t 2 errors
2 errors

Key information: highest (smallest error) active state per column

State of the search: numbers on the range .

Initially ( ones)

Initially and

Drawback: Dependency on

Related to dynamic programming:

Complexity: search time

Longest common subsequence, string editing

space

Related to Ukkonen’s automata approach

27 28
Diagonal bit parallelism: Baeza-Yates & Navarro [1996] ASM: Sequential Algorithms

p a t t
no errors

p a t t
1 error

p a t t
2 errors

Each diagonal represents an -closure (longest match):

where

Advantage: all s can be computed in parallel

Mixing this with filtration we get time

Related to simulation of DP over a suffix array (later) and bioin-

formatic applications

29 30
Dynamic Programming Automata and Bit-Parallelism

31 32
Filtering or Filtration A First Lemma for Filtering

Find potential matches and then apply a sequential algorithm to

check each candidate

Lemma 1: Let and be two strings such that . Let

Search time is , for strings and and for any

. Then, at least strings appear in . Moreover, their

relative distances inside cannot differ from those in by more
Filtration can be done by a sequential scan or by an index than .

There is trade-off between and

Consider the sequence of at most edit operations that convert

There is always a maximum error ratio up to where into .

Filtration is useful, as for larger error levels the text areas to

Each edit operation can affect at most one of the ’s, at least

verify cover almost all the text
of them must remain unaltered.
Verification can be done in a hierarchical fashion
Relative distances: the edit operations cannot produce mis-
alignments larger than .

33 34
Example Filtering Algorithms

A1 x1 A2 x2 A3 x3 A4 x4 A5
A

A1 A2’ A3 A4’ A5
B

An example of Lemma 1 with and :

At least 2 of the survive unaltered.

They are actually 3 such segments because one of the errors ap-
peared in .

Another possible reason could have been more than one error

occurring in a single .

35 36
Worst Case Complexity and Space Average Case Complexity and Error Ratio

37 38
Best Algorithms Data Structures

Inverted indices permit searching for any word in the text.

Suffix trees allow searching for any substring of the text.

Suffix arrays permit the same operations but are slightly slower.

-grams allow searching for any text substring not longer than
.

-samples permit the same but only for some text substrings.

39 40
Inverted Indices Inverted Files: Space

Vocabulary: Heaps’ law

Idea: all words and their positions

Posting file: linear space (one occurrence = one pointer)
1 6 9 11 17 19 24 28 33 40 46 50 55 60

This is a text. A text has many words. Words are made from letters. Word distribution: Zipf’s law

Text

Vocabulary Occurrences

letters 60...
50...
made
where
many 28...

text 11, 19... Inverted Index

words 33, 40...

Stopwords: half the posting file
Vocabulary search: Hashing, sorted, etc.
Linear space
Granularity of the occurrences depends in what we want to an-
swer: file, word, byte Vocab. Vocab.

Text length Posting lists

Heaps’ Law Zipf’s Law

41 42
Complex Patterns Building Inverted Indices

Search the vocabulary sequentially (e.g. ASM) and do set opera-

Process text pieces as large as possible
1 6 9 11 17 19 24 28 33 40 46 50 55 60
tions
This is a text. A text has many words. Words are made from letters.

d1 d2 d3 d4 d5 d6 Text
letters: 60
Used Cars Excellent Second Cars & Change
"l" made: 50
cars used offer of hand trucks car for "d"
and trucks trucks cards bargan a truc
"m" "a"

many: 28 Vocabulary trie

"t" "n"
text: 11, 19
sell d1 [1] "w"
car d1,d2,d4,d5,d6 [2][1][1][1][2] words: 33, 40
(1) truck d1,d3,d5 [3][3][2] [d1,3][d3,3],[d5,2],[d6,3]
use d2 [2] [d5,3]
excellent d3 [1]
offer d3 [2] Merge partial indexes
second d4 [2]
hand d4 [3]
(2) bargan d5 [3]
change d6 [1] Level 4
(1) truc d6 [3] I-1..8
(final index)

[d5,2-3] 7
Vocabulary Posting file Inversion

I-1..4 I-5..8 Level 3

"bargain for trucks" [approximate]
3 6

I-1..2 I-3..4 I-5..6 I-7..8 Level 2

1 2 4 5

Level 1
I-1 I-2 I-3 I-4 I-5 I-6 I-7 I-8 (initial dumps)

43 44
Two-level Text Retrieval: Block addressed inverted files Inverted File Space in Practice

Idea used in PIRS (Personal Information Retrieval System)

Index Small base Medium base Large base
[Wu and Manber 1993]
(1 MB) (200 MB) (2 GB)
The text is divided in 256 blocks of the same size
Full
An inverted file of all the different words of the text is built inverted 45% 73% 36% 64% 35% 63%

Each entry indicates only the blocks where the word appears Document

1 byte per block addressing 19% 26% 18% 32% 26% 47%
Block (64K)
First we search in the inverted file
addressing 27% 41% 18% 32% 5% 9%
next in the corresponding blocks using a fast sequential algo-
Block (256)
rithm
addressing 18% 25% 1.7% 2.4% 0.5% 0.7%
Complexity depends on the number of occurrences

locality of reference is important

For large texts, empirical results show that the index requires
less than 5% of the text size

This idea works reasonable well up to 200Mbs

45 46
Tries and Suffix Trees Suffix Arrays

"a" "c"

"$"
3
to be at the beach or to be at work, that is the real question $ Text
"r"
"d" 7 10
Suffix Trie 1 2 3 4 5 6 7 8 9 10 11
Text
"c"
a b r a c a d a b r a
5
1 4 7 10 14 20 23 26 29 32 38 43 46 50 55 Index points
9
"b" "$"
"r" "a"
"c" Suffix Array
2

"a" 6 11 8 1 4 6 2 9 5 7 10 3
"c" 4 1: to be at the beach or to be at work, that is the real question
"d" 1
"r" "a"
"c"
4: be at the beach or to be at work, that is the real question
"b" "$" Suffixes
"$"
8 7: at the beach or to be at work, that is the real question
11 10: the beach or to be at work, that is the real question
......
55: question

Search time is optimal:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
7 29 4 26 14 43 20 55 50 38 10 46 1 23 32 Suffix Array
Problem: space can be quadratic in a trie

Compact suffix trees (Patricia trees): cut unary paths to be at the beach or to be at work, that is the real question $ Text

1 4 7 10 14 20 23 26 29 32 38 43 46 50 55 Index points
To remember the depth, a count is added at every node or the
string associated to the path is stored

Space is now linear ( )

Useful for complex queries: regular expressions in sublinear av-

erage time [4]

47 48
Suffix Array Search Suffix Array: Construction

Every substring is a prefix of a suffix In principle is a lexicographical order

The prefix relation can be used for lexicographic order But suffixes are suffixes of suffixes

However, random access to the text is the bottleneck

Best solution: sequential scan with counting
Hence, two binary searches are enough to obtain the suffix array
range where all occurrences of appear

Number of occurrences: range size

a) small text
b) small text

Time is logarithmic in the size of the array

small suffix array long text
small suffix array

counters
be bf

c)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 small text long suffix array
7 29 4 26 14 43 20 55 50 38 10 46 1 23 32 Suffix Array
small suffix array

counters final suffix array

to be at the beach or to be at work, that is the real question $ Text

Index points
Building time is now linear (2003!)
1 4 7 10 14 20 23 26 29 32 38 43 46 50 55

49 50
Q-gram Indexes Q-Sample Indexes

In a -gram index, every different text -gram is stored, and all its

positions are stored in increasing text order. In a -sample index, only some text -grams are stored. In this

case the samples do not overlap.
1 2 3 4 5 6 7 8 9 10 11
a b r a c a d a b r a Text
"a" "c"
Useful to search long strings and using much less space (from .5
3
"r" rac 3 to )

"a" "b"
dab 7 q-Gram Index
7
"d" cad 5

"c" "a" "d"

bra 2, 9 Search for a string can be constant on average
ada 6
5
"b"
aca 4
"r" "a" abr 1,8 Building them takes linear time
2, 9

"a" "a"
6
"d"
"c" "a"
cad 5
4 bra 9 q-Samples Index
abr 1
"b" "r"
1, 8

P error

q-grams

P q-grams

51 52
Algorithms for ASM using an Index Current Results on this Taxonomy

Search approaches: Search Approach

Data Structure Neighborhood Partitioning into Intermediate
Neighborhood generation generates and searches for, using an Generation Exact Searching Partitioning
Errors in Text Errors in Pattern Errors in Text Errors in Pattern
index, all the strings that are at distance or less from the pattern
[13] Jokinen &
Suffix Tree Ukkonen 91 [24] Shi 96
Partitioning into exact searching selects pattern substrings that [29] Ukkonen 93

must appear unaltered in any approximate occurrence, uses the [8] Cobbs 95
Suffix Array [10] Gonnet 88 [21] Navarro &
index to search for those substrings, and checks the text areas Baeza-Yates 99
[13] Jokinen &
surrounding them.
Q-grams n/a Ukkonen 91 [20] Navarro & Myers 90 [17]
[12] Holsti & Baeza-Yates 97
Assuming that the errors occur in the pattern or in the text leads
Sutinen 94
to radically different approaches. Q-samples n/a [26] Sutinen & n/a [22] Navarro n/a
Tarhio 96 et al. 2000
Intermediate partitioning extracts substrings from the pattern
that are searched for allowing fewer errors using neighborhood
generation.

53 54
Neighborhood Generation Backtracking

Use a suffix tree or array to find in the text [4, 10, 29]

The Neighborhood of the Pattern: Just some branches will be followed, but they factorize many
matches

Let be the “ -neighborhood”

While searching we have three cases at node :

of (is s finite set).

a) , which means that , and we report all

Generate and use an index to search for their text occur-

rences [17] the leaves of the current subtree as answers.

b) for every , which means that is not a prefix of

Problem: is quite large.

any string in and we can abandon this branch

Good bounds [28, 17] show [28]

c) Otherwise, we continue descending by every branch of that

This approach works well for small and .

node. If we arrive at a leaf node, we have to use a sequential

algorithm in the rest

Some improvements [13, 30, 8] avoid processing some redun-

dant nodes at the cost of a more complex node processing

The same idea can be used to compare a whole text against an-
other one or against itself [6]

55 56
Example Partitioning into Exact Search: Errors in the Pattern

We use Lemma 1 under the setting , . That is, the

The matrix can be seen now as a stack that grows to the right
pattern is split in pieces, and hence of the pieces must

appear inside any occurrence.

Then, the pieces are searched and the text areas where of

u those pieces appear under the stated distance requirements are

r
verified for a complete match.

Search time in the index is or , but the checking

e a 4

time dominates.

y
2
The case , proposed in [20], shows an average time to

2cm In the example:

check the candidates of .

With the backtracking ends indeed after reading "surge"

The case is proposed in [24] without any analysis.

With the search would have been pruned after considering

If grows, the pieces get shorter and hence there are more matches

"surger", and "surga", since in both cases no entry of the to check, but on the other hand, forcing pieces to match makes

matrix is

the filter stricter [24]. Recent results show that this is slower.

Note that, since we cannot know where the pattern pieces can be
found in the text, all the text positions must be searchable.
57 58
Partitioning into Exact Search: Errors in the Text Search Algorithm

Assume now that the errors occur in the text, i.e., is an occur- At search time, all the (overlapping) pattern -grams

rence of in . are extracted and searched for in the index of text -samples.

We extract substrings of length at fixed text intervals of length When pattern -grams match in the text at the proper distances,

. the text area is verified.

Those -samples correspond to the ’s of Lemma 1, and the This idea is presented in [26], and earlier versions in [13, 12, 27].

space between -samples to the ’s.

The best value of :

What the lemma ensures is that, inside any occurrence of con-
– Should be small to avoid a very large set of different -

taining text -samples, at least of them appear in at

samples.
about the same positions ( ).
– Should be large to minimize the amount of verification.

Now we need to ensure that any occurrence of in contains

Some analysis [25] show that is the optimal value.

at least text -samples, i.e., .

The best value? A larger may trigger fewer verifications.

59 60
Intermediate Partitioning Proof and Example

We filter the search by looking for pattern pieces, but those pieces The proof is easy: if every needs more than errors to match

are large and still may appear with errors in the occurrences. in , then the total distance cannot be less than .

However, they appear with fewer errors, and then we use neigh- Note that in particular we can choose for every .

borhood generation to search for them.

A1 x1 A2 x2 A3

Lemma 2: Let and be two strings such that . Let

A1’ A2’ A3’

, for strings and and for any . Let

be any set of nonnegative numbers such that .

Then, at least one string appears with at most errors in .

Let and . At least one of the ’s has at most one error

(in this case )

61 62
Intermediate Partitioning: Errors in the Pattern What value for ?

In [17], the pattern is partitioned because they use a -gram in-

Search approaches based on this method have been proposed in
dex, so they use the minimum that gives short enough pieces

[17, 21]. The algorithm is:

(they are of length ).

Split the pattern in pieces, for some .

In [21] the index can search for pieces of any length, and the
Use neighborhood generation to find the text positions where partitioning is done in order to optimize the search time.

those pieces appear, allowing errors.

Consider the evolution of the search time as moves from 1

For each such text position, check with an on-line algorithm the (neighborhood generation) to (partitioning into exact search).

surrounding text.

– We search for pieces of length with errors, so the

error level stays about the same for the subpatterns.

– As moves to 1, the cost to search for the neighborhood of

the pieces grows exponentially with their length.

– As moves to this cost decreases, reaching even

when . So, to find the pieces, a larger is better.

63 64
Cost to verify the occurrences: consider a pattern that is split in Trade-off
pieces, for increasing . Start with .

– Lemma 2 states that every occurrence of the pattern involves

an occurrence of at least one of its two halves with er-

verify

rors, although there may be occurrences of the halves that

Neighborhood generation Intermediate partitioning Partitioning into exact search
yield no occurrences of the pattern.

– Consider now halving the halves ( ), so we have four

pieces now (call them “quarters”). Each occurrence of one

In [21] we show that the optimal is , yielding a

of the halves involves an occurrence of at least one quarter

$
time complexity of , for .

with errors, but there may be many quarter occurrences

%

$
that yield no occurrences of a pattern half. This is sublinear ( ) for , a pessimistic and

– Hence, the verification cost grows from zero at to its is replaced by 1 in practice).

maximum at .

The same results are obtained in [17] by setting .

The experiments in [21] show that this intermediate approach is
by far superior to both extremes.

65 66

Intermediate Partitioning: Errors in the Text We chose and assume that every text -sample

indeed matches with errors.

Consider an occurrence containing a sequence of -samples,

We search the pattern blocks permitting only errors. Every

which must be chosen at steps of .

-sample found with errors changes its estimation from

to , otherwise it stays at the optimistic bound .

By Lemma 2, one of the -samples must appear in the pattern

with errors at most. There is a trade-off here:

Moreover, if every -sample appears in the pattern block

– For a small value, the search of the -neighborhoods is

with errors, then it must hold that .

cheaper, but as we must assume that the text -samples not

found have errors, some useless verifications are done.

This method [26, 22] searches every block in the index of

-samples using backtracking, so as to find the least number of – Using larger values gives more exact estimates of the ac-

errors to match each text -sample inside . tual number of errors of each text -sample, reducing use-

If a zone of consecutive samples is found whose errors add up less verifications in exchange for a higher cost to search the

to at most , the area is verified. -environments.

To allow efficient neighborhood searching, we need to limit the Optimal ? In [22] it is mentioned that, as the cost of the search

maximum error level allowed. grows exponentially with , the minimal can be a

good choice. Experimentally this scheme tolerates higher error
Permitting errors may be too expensive, as every text -sample

levels than the corresponding partitioning into exact search.

will be considered.
67 68
Future References
[1] A. Apostolico and Z. Galil. Combinatorial Algorithms on Words. Springer-Verlag, 1985.
Further study on the power of non-comparison based algorithms:
[2] R. Baeza-Yates and G.H. Gonnet. A new approach to text searching. Communications of the ACM,
Many new bit-based algorithms 35:74–82, Oct 1992.

[3] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, volume I,
Problem reduction works for text searching
pages 465–476. Elsevier Science, 1992.

Example: Multiple string searching plus checking [4] R. Baeza-Yates and G.H. Gonnet. Fast text searching for regular expressions or automaton searching
on tries. Journal of the ACM, 43(6):915–936, Nov 1996.
– Two dimensional case [Baeza-Yates and Regnier, 1990]
[5] R. Baeza-Yates. A unified view of string matching algorithms. In Keith Jeffery, Jaroslav Král, and
Miroslav Bartosek, editors, SOFSEM’96: Theory and Practice of Informatics, volume 1175 of Lecture
– Approximate pattern matching [Wu and Manber, 1991]
Notes in Computer Science, pages 1–15, Milovy, Czech Republic, November 1996. Springer Verlag.

The final optimal algorithm depends on the input [6] R. Baeza-Yates and G. Gonnet. A fast algorithm on average for all-against-all sequence matching. In
Proc. 6th Symp. on String Processing and Information Retrieval (SPIRE’99). IEEE CS Press, 1999.
Further study of input adaptive algorithms? Previous version unpublished, Dept. of Computer Science, Univ. of Chile, 1990.

[7] E. Chávez and G. Navarro. A metric index for approximate string matching. In Proc. 5th Symp. on
New uses for old concepts. Example: -grams

Latin American Theoretical Informatics (LATIN), 2002. Cancun, Mexico.

Indexing for ASM on NL text can be done better [8] A. Cobbs. Fast approximate matching using suffix trees. In Proc. 6th Ann. Symp. on Combinatorial
Pattern Matching (CPM’95), LNCS 807, pages 41–54, 1995.

Approximation algorithms with worst-case performance guar- [9] R. Giegerich, S. Kurtz, and J. Stoye. Efficient implementation of lazy suffix trees. In Proc. 3rd
Workshop on Algorithm Engineering (WAE’99), LNCS 1668, pages 30–42, 1999.
antees [16].
[10] G. Gonnet. A tutorial introduction to Computational Biochemistry using Darwin. Technical report,

Use a metric space to search [7]. Informatik E.T.H., Zurich, Switzerland, 1992.

[11] G. Gonnet, R. Baeza-Yates, and T. Snider. Information Retrieval: Data Structures and Algorithms,
New text indexes tailored to special cases: ASM chapter 3: New indices for text: Pat trees and Pat arrays, pages 66–82. Prentice-Hall, 1992.

69 70
[12] N. Holsti and E. Sutinen. Approximate string matching using -gram places. In Proc. 7th Finnish [25] E. Sutinen and J. Tarhio. On using -gram locations in approximate string matching. In Proc. 3rd
Symp. on Computer Science, pages 23–32. Univ. of Joensuu, 1994. European Symp. on Algorithms (ESA’95), LNCS 979, pages 327–340, 1995.

[13] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. [26] E. Sutinen and J. Tarhio. Filtration with -samples in approximate string matching. In Proc. 7th Ann.
2nd Ann. Symp. on Mathematical Foundations of Computer Science (MFCS’91), pages 240–248, 1991. Symp. on Combinatorial Pattern Matching (CPM’96), LNCS 1075, pages 50–61, 1996.

[14] U. Manber and E. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. on [27] T. Takaoka. Approximate pattern matching with samples. In Proc. 5th Int’l. Symp. on Algorithms and
Computing, 22(5):935–948, 1993. Computation (ISAAC’94), LNCS 834, pages 234–242, 1994.

[15] E. McCreight. A space-economical suffix tree construction algorithm. J. of the ACM, 23(2):262–272, [28] E. Ukkonen. Finding approximate patterns in strings. J. of Algorithms, 6:132–137, 1985.
1976.
[29] E. Ukkonen. Approximate string matching over suffix trees. In Proc. 4th Ann. Symp. on Combinatorial
[16] S. Muthukrishnan and C. Sahinalp. Approximate nearest neighbors and sequence comparisons with Pattern Matching (CPM’93), LNCS 684, pages 228–242, 1993.
block operations. In Proc. ACM Symp. on the Theory of Computing, pages 416–424, 2000.
[30] E. Ukkonen. Constructing suffix trees on-line in linear time. Algorithmica, 14(3):249–260, 1995.
[17] E. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374,
1994. Earlier version in Tech. report TR-90-25, Dept. of CS, Univ. of Arizona, 1990.

[18] Gene Myers. A fast bit-vector algorithm for approximate string matching based on dynamic program-
ming, Journal of the ACM, 46 (3), 395–415, 1999.

[19] G. Navarro. A guided tour to approximate string matching. ACM Comp. Surv., 33(1):31–88, 2001.

[20] G. Navarro and R. Baeza-Yates. A practical -gram index for text retrieval allowing errors. CLEI
Electronic Journal, 1(2), 1998. http://www.clei.cl. Earlier version in Proc. CLEI’97.

[21] G. Navarro and R. Baeza-Yates. A hybrid indexing method for approximate string matching. J. of
Discrete Algorithms, 1(1):205–239, 2000. Hermes Science Publishing. Earlier version in CPM’99.

[22] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate -grams. In Proc.
11th Ann. Symp. on Combinatorial Pattern Matching (CPM’2000), LNCS 1848, pages 350–363, 2000.

[23] Gonzalo Navarro, Ricardo Baeza-Yates, Erkki Sutinen, Jorma Tarhio. Indexing Methods for Approxi-
mate String Matching. IEEE Data Engineering Bulletin, 2000.

[24] F. Shi. Fast approximate string matching with -blocks sequences. In Proc. 3rd South American
Workshop on String Processing (WSP’96), pages 257–271. Carleton University Press, 1996.

71 72

Chat Arch
No ratings yet
Chat Arch
56 pages
AI Based Student's Assignments Plagiarism Detector
No ratings yet
AI Based Student's Assignments Plagiarism Detector
11 pages
B.Sc. CS 3rd Sem.
No ratings yet
B.Sc. CS 3rd Sem.
13 pages
Deep Learning Applications
No ratings yet
Deep Learning Applications
11 pages
Chapter 1
No ratings yet
Chapter 1
29 pages
Yadav 2020
No ratings yet
Yadav 2020
6 pages
MachinLearning
No ratings yet
MachinLearning
6 pages
Project Report
No ratings yet
Project Report
8 pages
SIC_AI_Chapter 3. Exploratory Data Analysis_v2.1
No ratings yet
SIC_AI_Chapter 3. Exploratory Data Analysis_v2.1
520 pages
SIC - AI - Chapter 3. Exploratory Data Analysis - Rev2.0
No ratings yet
SIC - AI - Chapter 3. Exploratory Data Analysis - Rev2.0
527 pages
AI Generated Text Detection Synopsis
No ratings yet
AI Generated Text Detection Synopsis
29 pages
Text Mining An Improvised Feature Based
No ratings yet
Text Mining An Improvised Feature Based
5 pages
Analyzing Sentiment Using IMDb Dataset
No ratings yet
Analyzing Sentiment Using IMDb Dataset
4 pages
Communications201608 DL PDF
No ratings yet
Communications201608 DL PDF
108 pages
Unit 5-1
No ratings yet
Unit 5-1
22 pages
2 Studyof Different Algorithmsfor Pattern Matching
No ratings yet
2 Studyof Different Algorithmsfor Pattern Matching
7 pages
DM Report
No ratings yet
DM Report
8 pages
Exact String Matching Algorithms Survey Issues and
No ratings yet
Exact String Matching Algorithms Survey Issues and
25 pages
ML1 Foundations
No ratings yet
ML1 Foundations
39 pages
New 8 Sem Syll ITgu
No ratings yet
New 8 Sem Syll ITgu
15 pages
(CS3244) Final Poster Group 19
No ratings yet
(CS3244) Final Poster Group 19
1 page
Deep Learning for Natural Language GDG Bloomington 1690248059
No ratings yet
Deep Learning for Natural Language GDG Bloomington 1690248059
41 pages
An Approximate Algorithm For Maximum Inner Product Search Over Streaming Sparse Vectors
No ratings yet
An Approximate Algorithm For Maximum Inner Product Search Over Streaming Sparse Vectors
44 pages
Comparative Study Between Traditional Machine Learning and Deep Learning Approaches For Text Classification
No ratings yet
Comparative Study Between Traditional Machine Learning and Deep Learning Approaches For Text Classification
11 pages
Research On Text Generation Model of Natural Language Processing Based On Computer Artificial Intelligence
No ratings yet
Research On Text Generation Model of Natural Language Processing Based On Computer Artificial Intelligence
6 pages
Abhaya Agarwal
No ratings yet
Abhaya Agarwal
5 pages
WWW Geeksforgeeks
No ratings yet
WWW Geeksforgeeks
2 pages
Zharmagambetov 2015
No ratings yet
Zharmagambetov 2015
4 pages
5298-Original PDF-10611-2-10-20200619
No ratings yet
5298-Original PDF-10611-2-10-20200619
13 pages
Deep Learning A Review
No ratings yet
Deep Learning A Review
11 pages
Text-Based Intent Analysis Using Deep Learning
No ratings yet
Text-Based Intent Analysis Using Deep Learning
8 pages
Algorithm For Devanagari Character
No ratings yet
Algorithm For Devanagari Character
6 pages
Overview On Movie Recommender System
100% (1)
Overview On Movie Recommender System
4 pages
Supervised Learning Based Approach To Aspect Based Sentiment Analysis
No ratings yet
Supervised Learning Based Approach To Aspect Based Sentiment Analysis
5 pages
Applications of CNN for Sentiement Analysis
No ratings yet
Applications of CNN for Sentiement Analysis
6 pages
SESSION_1_LLMs
No ratings yet
SESSION_1_LLMs
40 pages
Introduction To DSA in C
No ratings yet
Introduction To DSA in C
10 pages
Pdsa-Brochure 04092019
No ratings yet
Pdsa-Brochure 04092019
2 pages
Machine_Learning_approach_to_Document_Classificati
No ratings yet
Machine_Learning_approach_to_Document_Classificati
5 pages
Neural Network Models for Paraphrase Identification, Semantic Textual
No ratings yet
Neural Network Models for Paraphrase Identification, Semantic Textual
14 pages
Comparing Topic Modeling and Named Entity Recognition Techniques For The Semantic Indexing of A Landscape Architecture Textbook
No ratings yet
Comparing Topic Modeling and Named Entity Recognition Techniques For The Semantic Indexing of A Landscape Architecture Textbook
6 pages
Handwritten Text Recognition Using Machine Learning Techniques in Application of NLP
No ratings yet
Handwritten Text Recognition Using Machine Learning Techniques in Application of NLP
4 pages
Gradascent at Emoint-2017: Character-And Word-Level Recurrent Neural Network Models For Tweet Emotion Intensity Detection
No ratings yet
Gradascent at Emoint-2017: Character-And Word-Level Recurrent Neural Network Models For Tweet Emotion Intensity Detection
7 pages
BinaryInferno2023Chandler 4
No ratings yet
BinaryInferno2023Chandler 4
18 pages
Topic Modeling A Comprehensive Review
No ratings yet
Topic Modeling A Comprehensive Review
17 pages
Text Databases and Information Retrieval: Riloff, Hollaar@cs - Utah.edu&
No ratings yet
Text Databases and Information Retrieval: Riloff, Hollaar@cs - Utah.edu&
3 pages
Speech Recognition Using Neural Networks: A Review: Dhavale Dhanashri, S.B. Dhonde
No ratings yet
Speech Recognition Using Neural Networks: A Review: Dhavale Dhanashri, S.B. Dhonde
4 pages
Hybrid
No ratings yet
Hybrid
10 pages
(IJCST-V11I4P8) :nishita Raghavendra, Manushree Shah
No ratings yet
(IJCST-V11I4P8) :nishita Raghavendra, Manushree Shah
5 pages
(IJCST-V5I2P25) :S. Gopi, A. Berno Raj, M. Abinav, P.Gokul Sarathy, D.P. Bharath
No ratings yet
(IJCST-V5I2P25) :S. Gopi, A. Berno Raj, M. Abinav, P.Gokul Sarathy, D.P. Bharath
4 pages
NILES2021 Paper 43
No ratings yet
NILES2021 Paper 43
5 pages
Module1 Introduction
No ratings yet
Module1 Introduction
35 pages
Common DS Interview Questions and Answers - 1
No ratings yet
Common DS Interview Questions and Answers - 1
4 pages
Chapter 8 - Applications of NLP-3
No ratings yet
Chapter 8 - Applications of NLP-3
72 pages
09 Modelling
No ratings yet
09 Modelling
19 pages
Artificial Intelligence Techniques IJERTCONV2IS10061
No ratings yet
Artificial Intelligence Techniques IJERTCONV2IS10061
4 pages
query-classification-based-on-semantic-similarity-2uhb6qtj7v
No ratings yet
query-classification-based-on-semantic-similarity-2uhb6qtj7v
8 pages
PLN 65 06
No ratings yet
PLN 65 06
6 pages
Sentiment Analysis Based On Weighted Word2vec and Att-LSTM
No ratings yet
Sentiment Analysis Based On Weighted Word2vec and Att-LSTM
5 pages
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
From Everand
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
Fouad Sabry
No ratings yet
Automatic Image Annotation: Fundamentals and Applications
From Everand
Automatic Image Annotation: Fundamentals and Applications
Fouad Sabry
No ratings yet
Apache Tomcat Tuning
100% (1)
Apache Tomcat Tuning
20 pages
NPTL Design
No ratings yet
NPTL Design
17 pages
Tomcat Performance
No ratings yet
Tomcat Performance
28 pages