Princeton Substring Search
Substring search
Goal. Find pattern of length M in a text of length N. (typically N >> M)
pattern  N E E D L E
text     I N A H A Y S T A C K N E E D L E I N A
match                          N E E D L E
Substring search applications
Computer forensics. Search memory or disk for signatures,
e.g., all URLs or RSA keys that the user has entered.
http://citp.princeton.edu/memory

Identify patterns indicative of spam.
・ PROFITS
・ L0SE WE1GHT
・ herbal Viagra
・ There is no catch.
・ This is a one-time mailing.
・ This message is sent in compliance with spam regulations.
Electronic surveillance.
Need to monitor all internet traffic. (security)
No way! (privacy)
Well, we’re mainly interested in “ATTACK AT DAWN”.
OK. Build a machine that just looks for that.
[Figure: a substring search machine scanning for “ATTACK AT DAWN”; output: found.]

Screen scraping. Extract relevant data from web page.
Ex. Find string delimited by <b> and </b> after first occurrence of pattern Last Trade:.

http://finance.yahoo.com/q?s=goog
...
<tr>
<td class= "yfnc_tablehead1"
width= "48%">
Last Trade:
</td>
<td class= "yfnc_tabledata1">
<big><b>452.92</b></big>
</td></tr>
<td class= "yfnc_tablehead1"
width= "48%">
Trade Time:
</td>
<td class= "yfnc_tabledata1">
...
Screen scraping: Java implementation
Java library. The indexOf() method in Java's string library returns the index
of the first occurrence of a given string, starting at a given offset.
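
The screen-scraping example can be sketched with indexOf() alone. This is a minimal sketch, not the course's implementation: the class name ScreenScrape and the method extractAfter are hypothetical, and the page is passed in as a string rather than fetched from the web.

```java
class ScreenScrape {
    // Returns the text between <b> and </b> that follows the first
    // occurrence of label in html, or null if any piece is missing.
    // (Hypothetical helper; names are illustrative only.)
    static String extractAfter(String html, String label) {
        int start = html.indexOf(label);          // first occurrence of the label
        if (start < 0) return null;
        int from = html.indexOf("<b>", start);    // first <b> at or after the label
        if (from < 0) return null;
        int to = html.indexOf("</b>", from);      // first </b> after that <b>
        if (to < 0) return null;
        return html.substring(from + 3, to);      // skip the 3 chars of "<b>"
    }
}
```

Each call to indexOf(String, int) resumes the scan at the offset found by the previous call, so the text is examined left to right.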
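Check for pattern starting at each text position.

 i  j  i+j   0 1 2 3 4 5 6 7 8 9 10
       txt   A B A C A D A B R A C
 0  2   2    A B R A                  pat
 1  0   1      A B R A
 2  1   3        A B R A
 3  0   3          A B R A
 4  1   5            A B R A
 5  0   5              A B R A
 6  4  10                A B R A
return i when j is M
(In the original figure, entries in red are mismatches, entries in gray are for reference only, and entries in black match the text.)

Check for pattern starting at each text position.

 i  j  i+j   0 1 2 3 4 5 6 7 8 9 10
       txt   A B A C A D A B R A C
 4  3   7            A D A C R
 5  0   5              A D A C R

public static int search(String pat, String txt)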
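
The search() signature above has no body in this text. A minimal sketch of the brute-force method it describes follows, assuming the convention used elsewhere in these slides of returning N when there is no match; the enclosing class name BruteForce is hypothetical.

```java
class BruteForce {
    // Check for the pattern starting at each text position i.
    // Returns the index of the first match, or N (the text length) if none.
    static int search(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++)
                if (txt.charAt(i + j) != pat.charAt(j))
                    break;                 // mismatch: give up on position i
            if (j == M) return i;          // all M pattern chars matched
        }
        return N;                          // not found
    }
}
```

On the trace above, BruteForce.search("ABRA", "ABACADABRAC") stops at i = 6 with j = 4.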
Brute-force algorithm can be slow if text and pattern are repetitive.

 i  j  i+j   0 1 2 3 4 5 6 7 8 9
       txt   A A A A A A A A A B
 0  4   4    A A A A B              pat
 1  4   5      A A A A B
 2  4   6        A A A A B
 3  4   7          A A A A B
 4  4   8            A A A A B
 5  5  10              A A A A B
Brute-force substring search (worst case)

In many applications, we want to avoid backup in text stream.
・Treat input as stream of data.
[Figure: a text stream fed to a machine searching for “ATTACK AT DAWN”; output: found.]

Brute-force algorithm needs backup for every mismatch.
[Figure: text A A A A A A A A A A A A A A A A A A A A A B and pattern A A A A A B, shown twice; labels: matched chars, mismatch, backup.]
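
To make the worst case concrete, the brute-force scan can be instrumented to count character compares. This is an illustrative sketch only; the class BruteForceCount and the method compares() are hypothetical names, not part of the course code.

```java
class BruteForceCount {
    // Runs the brute-force scan and returns the number of character
    // compares performed up to and including the first match (if any).
    static int compares(String pat, String txt) {
        int M = pat.length(), N = txt.length(), count = 0;
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++) {
                count++;                                   // one char compare
                if (txt.charAt(i + j) != pat.charAt(j)) break;
            }
            if (j == M) break;                             // match found: stop
        }
        return count;
    }
}
```

For the text A A A A A A A A A B and pattern A A A A B, each of the 6 candidate positions costs 5 compares, so the count is M (N - M + 1) = 30, which grows like M N.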
Same sequence of char compares as previous implementation.
・ i points to end of sequence of already-matched chars in text.
・ j stores # of already-matched chars (end of sequence in pattern).

 i  j   0 1 2 3 4 5 6 7 8 9 10
        A B A C A D A B R A C
 7  3           A D A C R
 5  0             A D A C R

public static int search(String pat, String txt)
{
   int i, N = txt.length();
   int j, M = pat.length();
   for (i = 0, j = 0; i < N && j < M; i++)
   {
      if (txt.charAt(i) == pat.charAt(j)) j++;
      else { i -= j; j = 0; }              // explicit backup
   }
   if (j == M) return i - M;
   else        return N;
}

Brute-force is not always good enough.
Theoretical challenge. Linear-time guarantee. (fundamental algorithmic problem)
Practical challenge. Avoid backup in text stream. (often no room or time to save text)

Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party.
Algorithms
Robert Sedgewick | Kevin Wayne
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt
‣ Boyer-Moore
‣ Rabin-Karp

Knuth-Morris-Pratt substring search
[Figure: text A B A A A A B A A A A A A A A A, pattern B A A A A A A A A A. After a mismatch on the sixth char, brute force backs up to try the pattern at the next text position, and then at each following position in turn.]
[Figure: graphical representation of the DFA for the pattern A B A B A C. States 0 through 6, with match transitions 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6; the remaining edges (labeled A, B, C, or B,C) lead to the same or an earlier state. Reaching state 6 means substring found.]
Knuth-Morris-Pratt substring search: Java implementation
Key differences from brute-force implementation.
・Need to precompute dfa[][] from pattern.
・Text pointer i never decrements.
・Could use input stream.

public int search(String txt)
{
   int i, j, N = txt.length();
   for (i = 0, j = 0; i < N && j < M; i++)
      j = dfa[txt.charAt(i)][j];           // no backup
   if (j == M) return i - M;
   else        return N;
}

public int search(In in)
{
   int i, j;
   for (i = 0, j = 0; !in.isEmpty() && j < M; i++)
      j = dfa[in.readChar()][j];           // no backup
   if (j == M) return i - M;
   else        return NOT_FOUND;
}

              j    0  1  2  3  4  5
  pat.charAt(j)    A  B  A  B  A  C
              A    1  1  3  1  5  1
  dfa[][j]    B    0  2  0  4  0  4
              C    0  0  0  0  0  6

Running time.
・Simulate DFA on text: at most N character accesses.
Include one state for each character in pattern (plus accept state).

              j    0  1  2  3  4  5
  pat.charAt(j)    A  B  A  B  A  C
              A    1  1  3  1  5  1
  dfa[][j]    B    0  2  0  4  0  4
              C    0  0  0  0  0  6
Constructing the DFA for KMP substring search for A B A B A C
How to build DFA from pattern?
Include one state for each character in pattern (plus accept state).
Match transition. If in state j and next char c == pat.charAt(j), go to j+1.

              j    0  1  2  3  4  5
  pat.charAt(j)    A  B  A  B  A  C
              A    1     3     5
  dfa[][j]    B       2     4
              C                   6
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA (still under construction!) and take transition c.
Running time. Seems to require j steps.
Running time. Takes only constant time if we maintain state X.
[Figure: DFA for A B A B A C with the current state j marked; the simulation of B A B A ends in state X.]
Knuth-Morris-Pratt demo: DFA construction in linear time
Include one state for each character in pattern (plus accept state).

              j    0  1  2  3  4  5
  pat.charAt(j)    A  B  A  B  A  C
              A    1  1  3  1  5  1
  dfa[][j]    B    0  2  0  4  0  4
              C    0  0  0  0  0  6
Constructing the DFA for KMP substring search for A B A B A C
Constructing the DFA for KMP substring search: Java implementation
For each state j:
・Copy dfa[][X] to dfa[][j] for mismatch case.
・Set dfa[pat.charAt(j)][j] to j+1 for match case.
・Update X.
Running time. M character accesses (but space/time proportional to R M).

KMP substring search analysis
Proposition. KMP substring search accesses no more than M + N chars
to search for a pattern of length M in a text of length N.
Pf. Each pattern char accessed once when constructing the DFA;
each text char accessed once (in the worst case) when simulating the DFA.

[Figure: KMP NFA for ABABAC (NFA corresponding to the string A B A B A C).]
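
The three construction steps listed above (copy the mismatch cases from column X, set the match case, update X) fit in a short constructor. The following is a sketch under stated assumptions, not the course's exact code: it assumes an extended-ASCII alphabet (R = 256) and a non-empty pattern, and the accessor transition() is a hypothetical helper added only so the table can be inspected.

```java
class KMP {
    private static final int R = 256;   // ASSUMPTION: extended-ASCII alphabet
    private final int M;                // pattern length
    private final int[][] dfa;          // dfa[c][j] = next state from state j on char c

    // Build the DFA from a non-empty pattern.
    KMP(String pat) {
        M = pat.length();
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];          // copy mismatch cases
            dfa[pat.charAt(j)][j] = j + 1;      // set match case
            X = dfa[pat.charAt(j)][X];          // update restart state X
        }
    }

    // Simulate the DFA on the text; i never decrements.
    int search(String txt) {
        int i, j, N = txt.length();
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];
        if (j == M) return i - M;   // found
        else        return N;       // not found
    }

    // Hypothetical accessor, for inspecting the table only.
    int transition(char c, int j) { return dfa[c][j]; }
}
```

For the pattern A B A B A C this produces the table shown in the slides, for example dfa['A'][5] = 1 and dfa['B'][5] = 4.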
Vol. 6, No. 2, June 1977
DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT
Abstract. An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language {αα^R}*, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.
Key words. pattern, string, text-editing, pattern-matching, trie memory, searching, period of a string, palindrome, optimum algorithm, Fibonacci string, regular expression

http://algs4.cs.princeton.edu
‣ brute force
‣ Knuth-Morris-Pratt
‣ Boyer-Moore
‣ Rabin-Karp
 i  j   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
  text  F  I  N  D  I  N  A  H  A  Y  S  T  A  C  K  N  E  E  D  L  E  I  N  A
 0  5   N  E  E  D  L  E      pattern
 5  5                  N  E  E  D  L  E
11  4                                    N  E  E  D  L  E
15  0                                                N  E  E  D  L  E
return i = 15
Mismatched character heuristic for right-to-left (Boyer-Moore) substring search

Case 1. Mismatch character not in pattern.
before
txt  . . . . . . T L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . T L E . . . . . .
pat                N E E D L E
mismatch character 'T' not in pattern: increment i one character beyond 'T'
Case 2a. Mismatch character in pattern.
before
txt  . . . . . . N L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . N L E . . . . . .
pat              N E E D L E
mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N'

Case 2b. Mismatch character in pattern (but heuristic no help).
before
txt  . . . . . . E L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . E L E . . . . . .
pat    N E E D L E
mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' ?
Boyer-Moore: mismatched character heuristic
before
txt  . . . . . . E L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . E L E . . . . . .
pat          N E E D L E
mismatch character 'E' in pattern: increment i by 1

right = new int[R];
for (int c = 0; c < R; c++)
   right[c] = -1;
for (int j = 0; j < M; j++)
   right[pat.charAt(j)] = j;

               N   E   E   D   L   E
  c            0   1   2   3   4   5   right[c]
  A       -1  -1  -1  -1  -1  -1  -1     -1
  B       -1  -1  -1  -1  -1  -1  -1     -1
  C       -1  -1  -1  -1  -1  -1  -1     -1
  D       -1  -1  -1  -1   3   3   3      3
  E       -1  -1   1   2   2   2   5      5
  ...                                    -1
  L       -1  -1  -1  -1  -1   4   4      4
  M       -1  -1  -1  -1  -1  -1  -1     -1
  N       -1   0   0   0   0   0   0      0
  ...                                    -1
Boyer-Moore skip table computation
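
The right[] table can drive a right-to-left scan as follows. This is a sketch of the mismatched character heuristic only, assuming an extended-ASCII alphabet (R = 256); the class name and the accessor rightmost() are illustrative. The skip is forced to be at least 1, which covers Case 2b, where aligning with the rightmost occurrence would move the pattern backwards.

```java
class BoyerMoore {
    private static final int R = 256;   // ASSUMPTION: extended-ASCII alphabet
    private final int[] right;          // rightmost index of each char in pat, or -1
    private final String pat;

    BoyerMoore(String pat) {
        this.pat = pat;
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;                        // char not in pattern
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;             // later occurrences overwrite earlier ones
    }

    // Returns the index of the first match, or N if there is none.
    int search(String txt) {
        int N = txt.length(), M = pat.length();
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {    // compare right to left
                if (pat.charAt(j) != txt.charAt(i + j)) {
                    // Cases 1 and 2a: j - right[c]; Case 2b: at least 1.
                    skip = Math.max(1, j - right[txt.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) return i;              // no mismatch: match at i
        }
        return N;
    }

    // Hypothetical accessor, for inspecting the skip table only.
    int rightmost(char c) { return right[c]; }
}
```

On the NEEDLE trace, the pattern start i takes the values 0, 5, 11 and 15 before the match is reported.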
Math trick. To keep numbers small, take intermediate results modulo Q.
Ex. (10000 + 535) * 1000 (mod 997)
    = (30 + 535) * 3 (mod 997)
    = 1695 (mod 997)
    = 698 (mod 997)

Modular hash function. Using the notation t_i for txt.charAt(i), we wish to compute
$$x_i = t_i R^{M-1} + t_{i+1} R^{M-2} + \cdots + t_{i+M-1} R^{0} \pmod{Q}$$

Challenge. How to efficiently compute x_{i+1} given that we know x_i.
$$x_i = t_i R^{M-1} + t_{i+1} R^{M-2} + \cdots + t_{i+M-1} R^{0}$$
First R entries: Use Horner's rule.
Remaining entries: Use rolling hash (and % to avoid overflow).

public class RabinKarp
{
   private long patHash;    // pattern hash value
   private int M;           // pattern length
   private long Q;          // modulus
   private int R;           // radix
   private long RM1;        // R^(M-1) % Q

   public RabinKarp(String pat) {
      M = pat.length();
      R = 256;
      Q = longRandomPrime();            // a large prime (but avoid overflow)

      RM1 = 1;                          // precompute R^(M-1) (mod Q)
      for (int i = 1; i <= M-1; i++)
         RM1 = (R * RM1) % Q;
      patHash = hash(pat, M);
   }

   public int search(String txt)
   { /* see next slide */ }
}

Monte Carlo version. Return match if hash match.

public int search(String txt)
{
   int N = txt.length();
   long txtHash = hash(txt, M);
   if (patHash == txtHash) return 0;
   for (int i = M; i < N; i++)
   {
      // check for hash collision using rolling hash function
      txtHash = (txtHash + Q - RM1*txt.charAt(i-M) % Q) % Q;
      txtHash = (txtHash*R + txt.charAt(i)) % Q;
      if (patHash == txtHash) return i - M + 1;
   }
   return N;
}

Las Vegas version. Check for substring match if hash match;
continue search if false collision.
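
A Las Vegas variant can be sketched by adding an explicit substring check to the rolling-hash loop. The sketch below makes two assumptions that are not in the text above: it uses the fixed small prime Q = 997 (the modulus of the arithmetic example) in place of longRandomPrime(), and it supplies a hash() helper written with Horner's rule. The class name is illustrative.

```java
class RabinKarpLasVegas {
    private final String pat;
    private final long patHash;        // pattern hash value
    private final int M;               // pattern length
    private final long Q = 997;        // ASSUMPTION: small fixed prime, for demonstration only
    private final int R = 256;         // radix
    private long RM1;                  // R^(M-1) % Q

    RabinKarpLasVegas(String pat) {
        this.pat = pat;
        M = pat.length();
        RM1 = 1;
        for (int i = 1; i <= M - 1; i++)
            RM1 = (R * RM1) % Q;
        patHash = hash(pat, M);
    }

    // Horner's rule: hash of the first m chars of key, modulo Q.
    private long hash(String key, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // Returns the index of the first match, or N if there is none.
    int search(String txt) {
        int N = txt.length();
        if (N < M) return N;
        long txtHash = hash(txt, M);
        if (patHash == txtHash && txt.startsWith(pat)) return 0;
        for (int i = M; i < N; i++) {
            // Rolling hash: remove leading char, then append trailing char.
            txtHash = (txtHash + Q - RM1 * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            int offset = i - M + 1;
            // Las Vegas step: confirm the match; on a false collision keep searching.
            if (patHash == txtHash && txt.regionMatches(offset, pat, 0, M))
                return offset;
        }
        return N;
    }
}
```

With a modulus this small, false collisions are likely, so the explicit check is what keeps the answer correct; a large random prime makes them rare.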
Rabin-Karp analysis

Rabin-Karp fingerprint search
Rabin-Karp substring search is known as a fingerprint search because it uses a small amount of information to represent a (potentially very large) pattern. Then it looks for this fingerprint (the hash value) in the text. The algorithm is efficient because the fingerprints can be efficiently computed and compared.

Substring search cost summary
Summary. The table at the bottom of the page summarizes the algorithms that we have discussed for substring search. As is often the case when we have several algorithms for the same task, each of them has attractive features. Brute force search is easy to implement and works well in typical cases (Java’s indexOf() method in String uses brute-force search); Knuth-Morris-Pratt is guaranteed linear-time with no backup in the input; Boyer-Moore is sublinear (by a factor of M) in typical situations; and Rabin-Karp is linear. Each also has drawbacks: brute-force might require time proportional to MN; Knuth-Morris-Pratt and Boyer-Moore use extra space; and Rabin-Karp has a relatively long inner loop (several arithmetic operations, as opposed to character compares in the other methods). These characteristics are summarized in the table below.

Cost of searching for an M-character pattern in an N-character text.

                                                                     operation count     backup                extra
algorithm            version                                         guarantee  typical  in input?  correct?  space
brute force          -                                               M N        1.1 N    yes        yes       1
Knuth-Morris-Pratt   full DFA (Algorithm 5.6)                        2 N        1.1 N    no         yes       M R
                     mismatch transitions only                       3 N        1.1 N    no         yes       M
Boyer-Moore          full algorithm                                  3 N        N / M    yes        yes       R
                     mismatched char heuristic only (Algorithm 5.7)  M N        N / M    yes        yes       R
Rabin-Karp†          Monte Carlo (Algorithm 5.8)                     7 N        7 N      no         yes†      1
                     Las Vegas                                       7 N†       7 N      no†        yes       1

† probabilistic guarantee, with uniform hash function
Cost summary for substring-search implementations