Substring Search
pattern N E E D L E
text I N A H A Y S T A C K N E E D L E I N A
match
Substring search
Substring search applications
http://citp.princeton.edu/memory
Electronic surveillance.
Security: "Need to monitor all internet traffic."
Privacy: "No way!"
Resolution: "OK. Build a machine that just looks for that."
[Figure: a substring search machine scans for "ATTACK AT DAWN" and reports "found".]
Substring search applications
Ex. Find string delimited by <b> and </b> after first occurrence of
pattern Last Trade:.
...
<tr>
<td class= "yfnc_tablehead1"
width= "48%">
Last Trade:
</td>
<td class= "yfnc_tabledata1">
<big><b>452.92</b></big>
</td></tr>
<td class= "yfnc_tablehead1"
width= "48%">
Trade Time:
</td>
<td class= "yfnc_tabledata1">
...
http://finance.yahoo.com/q?s=goog
Screen scraping: Java implementation
Java library. The indexOf() method in Java's string library returns the index
of the first occurrence of a given string, starting at a given offset.
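A minimal sketch of the screen-scraping idea using only String.indexOf(). The class name StockQuoteSketch, the method lastTrade, and the hard-coded HTML string are illustrative assumptions; a real scraper would read the page from the network.

```java
class StockQuoteSketch {
    // Returns the text between <b> and </b> that follows the first
    // occurrence of "Last Trade:", or null if any marker is missing.
    static String lastTrade(String html) {
        int start = html.indexOf("Last Trade:");
        if (start < 0) return null;
        int from = html.indexOf("<b>", start);     // search from offset start
        if (from < 0) return null;
        int to = html.indexOf("</b>", from);
        if (to < 0) return null;
        return html.substring(from + 3, to);       // skip the 3 chars of "<b>"
    }

    public static void main(String[] args) {
        // Hypothetical page fragment modeled on the example above.
        String html = "<tr><td class=\"yfnc_tablehead1\" width=\"48%\">Last Trade:</td>"
                    + "<td class=\"yfnc_tabledata1\"><big><b>452.92</b></big></td></tr>";
        System.out.println(lastTrade(html));       // prints 452.92
    }
}
```

Each indexOf() call starts at the offset returned by the previous one, so the search never looks at text before the "Last Trade:" marker.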
 i   j  i+j   0  1  2  3  4  5  6  7  8  9 10
        txt   A  B  A  C  A  D  A  B  R  A  C
 0   2    2   A  B  R  A                         pat
 1   0    1      A  B  R  A
 2   1    3         A  B  R  A
 3   0    3            A  B  R  A
 4   1    5               A  B  R  A
 5   0    5                  A  B  R  A
 6   4   10                     A  B  R  A
return i when j is M (match)
(in the original figure, entries in red are mismatches, entries in gray are for reference only, and entries in black match the text)
Brute-force substring search: Java implementation
 i   j  i+j   0  1  2  3  4  5  6  7  8  9 10
        txt   A  B  A  C  A  D  A  B  R  A  C
 4   3    7               A  D  A  C  R          pat
 5   0    5                  A  D  A  C  R
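The trace above corresponds to the usual brute-force method: check for the pattern starting at each text position. A minimal sketch, assuming the convention used elsewhere in these slides that a failed search returns N (the text length); the class name BruteForceSketch is illustrative.

```java
class BruteForceSketch {
    // Returns the index of the first occurrence of pat in txt,
    // or txt.length() if pat does not occur.
    static int search(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++)
                if (txt.charAt(i + j) != pat.charAt(j))
                    break;                 // mismatch: try next text position
            if (j == M) return i;          // all M pattern chars matched
        }
        return N;                          // not found
    }

    public static void main(String[] args) {
        System.out.println(search("ABRA", "ABACADABRAC"));   // prints 6
        System.out.println(search("ADACR", "ABACADABRAC"));  // prints 11 (not found)
    }
}
```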
 i   j  i+j   0  1  2  3  4  5  6  7  8  9
        txt   A  A  A  A  A  A  A  A  A  B
 0   4    4   A  A  A  A  B                   pat
 1   4    5      A  A  A  A  B
 2   4    6         A  A  A  A  B
 3   4    7            A  A  A  A  B
 4   4    8               A  A  A  A  B
 5   5   10                  A  A  A  A  B
Brute-force substring search (worst case)
[Figure: backup in the text stream after a mismatch]
text     A A A A A A A A A A A A A A A A A A A A A B
pattern  A A A A A B
backup
text     A A A A A A A A A A A A A A A A A A A A A B
pattern    A A A A A B
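To make the backup visible in code, brute force can also be written with a single text pointer i that is explicitly moved back after a mismatch. This is a sketch under the same conventions as before (return N when not found); the class name BruteForceBackupSketch is an assumption.

```java
class BruteForceBackupSketch {
    // i points just past the last text char examined; j counts matched
    // pattern chars. On a mismatch, i backs up by j positions.
    static int search(String pat, String txt) {
        int i, N = txt.length();
        int j, M = pat.length();
        for (i = 0, j = 0; i < N && j < M; i++) {
            if (txt.charAt(i) == pat.charAt(j)) j++;
            else { i -= j; j = 0; }        // explicit backup in the text
        }
        if (j == M) return i - M;          // found
        else        return N;              // not found
    }

    public static void main(String[] args) {
        System.out.println(search("ABRA", "ABACADABRAC"));   // prints 6
    }
}
```

The statement i -= j is the step that a streaming input cannot support without buffering the last M characters.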
Algorithmic challenges in substring search
Practical challenge. Avoid backup in text stream (often there is no room or time to save the text).
Now is the time for all people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for many good people to come to the aid of their party.
Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good
people to come to the aid of their party. Now is the time for all of the good people to come to the aid of
their party. Now is the time for all good people to come to the aid of their party. Now is the time for
each good person to come to the aid of their party. Now is the time for all good people to come to the aid
of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the
time for all good people to come to the aid of their party. Now is the time for many or all good people to
come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now
is the time for all good Democrats to come to the aid of their party. Now is the time for all people to
come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now
is the time for many good people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for a lot of good people to come to the aid of their
party. Now is the time for all of the good people to come to the aid of their party. Now is the time for
all good people to come to the aid of their attack at dawn party. Now is the time for each person to come
to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is
the time for all good Republicans to come to the aid of their party. Now is the time for all good people
to come to the aid of their party. Now is the time for many or all good people to come to the aid of their
party. Now is the time for all good people to come to the aid of their party. Now is the time for all good
Democrats to come to the aid of their party.
5.3 SUBSTRING SEARCH
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt
‣ Boyer-Moore
text                              A B A A A A B A A A A A A A A A
after mismatch on sixth char        B A A A A A A A A A            pattern
brute-force backs up to try this      B A A A A A A A A A
and this                                B A A A A A A A A A
and this                                  B A A A A A A A A A
and this                                    B A A A A A A A A A
and this                                      B A A A A A A A A A
but no backup is needed                       B A A A A A A A A A
graphical representation
[Figure: DFA for the pattern A B A B A C, with states 0 through 6 and one transition per character A, B, C out of each state; the table below lists every transition.]
txt  A A B A C A A B A B A C A A
                j    0  1  2  3  4  5
    pat.charAt(j)    A  B  A  B  A  C
                A    1  1  3  1  5  1
    dfa[][j]    B    0  2  0  4  0  4
                C    0  0  0  0  0  6
match transitions: 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6
DFA simulation demo
txt  A A B A C A A B A B A C A A
[Figure: the DFA for A B A B A C simulated on txt one character at a time, starting in state 0 and ending in state 6.]
substring found
Interpretation of Knuth-Morris-Pratt DFA
[Figure: DFA for the pattern A B A B A C, states 0 through 6.]
Knuth-Morris-Pratt substring search: Java implementation
if (j == M) return i - M;
else return N;
}
Running time.
・Simulate DFA on text: at most N character accesses.
・Build DFA: how to do efficiently? [warning: tricky algorithm ahead]
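The simulation step can be tried in isolation by hard-coding the transition table shown earlier for the pattern A B A B A C. This is only a sketch: the class name DfaSimulationSketch is an assumption, and the text is assumed to contain only the characters A, B, and C so that a character maps to a row by subtracting 'A'.

```java
class DfaSimulationSketch {
    // Transition table for pattern A B A B A C (columns are states 0..5).
    static final int[][] DFA = {
        {1, 1, 3, 1, 5, 1},   // next state on 'A'
        {0, 2, 0, 4, 0, 4},   // next state on 'B'
        {0, 0, 0, 0, 0, 6},   // next state on 'C'
    };
    static final int M = 6;   // pattern length = accept state

    // Assumes txt contains only the characters A, B, C.
    static int search(String txt) {
        int i, j, N = txt.length();
        for (i = 0, j = 0; i < N && j < M; i++)
            j = DFA[txt.charAt(i) - 'A'][j];   // one table lookup per text char
        if (j == M) return i - M;              // found
        else        return N;                  // not found
    }

    public static void main(String[] args) {
        System.out.println(search("AABACAABABACAA"));   // prints 6
    }
}
```

The text index i only ever increases, which is why the simulation makes at most N character accesses.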
Knuth-Morris-Pratt substring search: Java implementation
if (j == M) return i - M;
else return NOT_FOUND;
}
How to build DFA from pattern?
Include one state for each character in pattern (plus accept state).
                j    0  1  2  3  4  5
    pat.charAt(j)    A  B  A  B  A  C
                A
    dfa[][j]    B
                C
states: 0 1 2 3 4 5 6
Match transition. If the first j characters of the pattern have already been matched and the next char matches, then the first j+1 characters of the pattern have now been matched.
                j    0  1  2  3  4  5
    pat.charAt(j)    A  B  A  B  A  C
                A    1     3     5
    dfa[][j]    B       2     4
                C                   6
0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6
[Figure: DFA for the pattern A B A B A C with state j marked and the annotation "simulation of BABA"; a second copy of the figure marks a state X alongside j.]
Knuth-Morris-Pratt construction demo (in linear time)
Include one state for each character in pattern (plus accept state).
                j    0  1  2  3  4  5
    pat.charAt(j)    A  B  A  B  A  C
                A    1  1  3  1  5  1
    dfa[][j]    B    0  2  0  4  0  4
                C    0  0  0  0  0  6
Constructing the DFA for KMP substring search: Java implementation
Proposition. KMP substring search accesses no more than M + N chars to search for a pattern of length M in a text of length N.
Pf. Each pattern char accessed once when constructing the DFA;
each text char accessed once (in the worst case) when simulating the DFA.
Proposition. KMP constructs dfa[][] in time and space proportional to R M.
internal representation
                j    0  1  2  3  4  5
    pat.charAt(j)    A  B  A  B  A  C
          next[j]    0  0  0  0  0  3
[Figure: states 0 through 6 linked by the match transitions A, B, A, B, A, C.]
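A sketch of the linear-time construction, in which a restart state X is maintained so that each column's mismatch transitions can be copied from column X. The class name KmpSketch, the accessor transition(), and the alphabet size R = 256 are assumptions for illustration; the pattern is assumed to be non-empty and to use only characters below R.

```java
class KmpSketch {
    static final int R = 256;     // assumed alphabet size (extended ASCII)
    private final int M;
    private final int[][] dfa;    // dfa[c][j] = next state from state j on char c

    // Assumes a non-empty pattern whose chars are all < R.
    KmpSketch(String pat) {
        M = pat.length();
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];       // copy mismatch cases from state X
            dfa[pat.charAt(j)][j] = j + 1;   // set match case
            X = dfa[pat.charAt(j)][X];       // update restart state
        }
    }

    int transition(char c, int j) { return dfa[c][j]; }

    public static void main(String[] args) {
        KmpSketch kmp = new KmpSketch("ABABAC");
        // Print the rows for A, B, C; they should agree with the dfa[][j] table above.
        for (char c = 'A'; c <= 'C'; c++) {
            StringBuilder row = new StringBuilder(c + ":");
            for (int j = 0; j < 6; j++) row.append(' ').append(kmp.transition(c, j));
            System.out.println(row);
        }
    }
}
```

The inner loop over all R characters is what makes the time and space proportional to R M.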
SIAM J. COMPUT.
Vol. 6, No. 2, June 1977
Abstract. An algorithm is presented which finds all occurrences of one given string within
another, in running time proportional to the sum of the lengths of the strings. The constant of
proportionality is low enough to make this algorithm of practical use, and the procedure can also be
extended to deal with some more general pattern-matching problems. A theoretical application of the
algorithm shows that the set of concatenations of even palindromes, i.e., the language {αα^R}*, can be
recognized in linear time. Other algorithms which run even faster on the average are also considered.
Intuition.
・Scan characters in pattern from right to left.
・Can skip as many as M text chars when finding one not in the pattern.
 i  j   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
text    F  I  N  D  I  N  A  H  A  Y  S  T  A  C  K  N  E  E  D  L  E  I  N  A
 0  5   N  E  E  D  L  E                                                         pattern
 5  5                  N  E  E  D  L  E
11  4                                    N  E  E  D  L  E
15  0                                                N  E  E  D  L  E
return i = 15
Mismatched character heuristic for right-to-left (Boyer-Moore) substring search
Boyer-Moore: mismatched character heuristic
before
txt  . . . . . . T L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . T L E . . . . . .
pat                N E E D L E
mismatch character 'T' not in pattern: increment i one character beyond 'T'

before
txt  . . . . . . N L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . N L E . . . . . .
pat              N E E D L E
mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N'

before
txt  . . . . . . E L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . E L E . . . . . .
pat    N E E D L E
mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' ?

before
txt  . . . . . . E L E . . . . . .
pat        N E E D L E
after
txt  . . . . . . E L E . . . . . .
pat          N E E D L E
mismatch character 'E' in pattern: aligning with the rightmost pattern 'E' would move the pattern backward, so increment i by 1 instead
Boyer-Moore: mismatched character heuristic
right = new int[R];
for (int c = 0; c < R; c++)
   right[c] = -1;
for (int j = 0; j < M; j++)
   right[pat.charAt(j)] = j;

               N   E   E   D   L   E
  c            0   1   2   3   4   5   right[c]
  A       -1  -1  -1  -1  -1  -1  -1     -1
  B       -1  -1  -1  -1  -1  -1  -1     -1
  C       -1  -1  -1  -1  -1  -1  -1     -1
  D       -1  -1  -1  -1   3   3   3      3
  E       -1  -1   1   2   2   2   5      5
  ...                                    -1
  L       -1  -1  -1  -1  -1   4   4      4
  M       -1  -1  -1  -1  -1  -1  -1     -1
  N       -1   0   0   0   0   0   0      0
  ...                                    -1
Boyer-Moore skip table computation
Boyer-Moore: Java implementation
      }
      if (skip == 0) return i;    // match
   }
   return N;
}
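The pieces above (the right[] table, the right-to-left scan, and the skip rule) fit together roughly as follows. This is a sketch of the mismatched character heuristic only, not the full Boyer-Moore algorithm; the class name BoyerMooreSketch and the alphabet size R = 256 are assumptions, and all characters are assumed to be below R.

```java
class BoyerMooreSketch {
    // Mismatched character heuristic only. Returns the index of the first
    // occurrence of pat in txt, or txt.length() if there is none.
    static int search(String pat, String txt) {
        final int R = 256;                     // assumed alphabet size
        int M = pat.length(), N = txt.length();

        // right[c] = index of rightmost occurrence of c in pat (-1 if absent)
        int[] right = new int[R];
        for (int c = 0; c < R; c++) right[c] = -1;
        for (int j = 0; j < M; j++) right[pat.charAt(j)] = j;

        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) { // scan pattern right to left
                if (pat.charAt(j) != txt.charAt(i + j)) {
                    // align mismatched text char with its rightmost pattern
                    // occurrence; never move the pattern backward
                    skip = Math.max(1, j - right[txt.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) return i;           // match
        }
        return N;                              // not found
    }

    public static void main(String[] args) {
        System.out.println(search("NEEDLE", "FINDINAHAYSTACKNEEDLEINA"));  // prints 15
    }
}
```

Math.max(1, ...) covers the case where the rightmost occurrence of the mismatched character lies to the right of position j, which would otherwise produce a non-positive skip.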
Boyer-Moore: analysis
 i  skip   0  1  2  3  4  5  6  7  8  9
     txt   B  B  B  B  B  B  B  B  B  B
 0    0    A  B  B  B  B                   pat
 1    1       A  B  B  B  B
 2    1          A  B  B  B  B
 3    1             A  B  B  B  B
 4    1                A  B  B  B  B
 5    1                   A  B  B  B  B
Boyer-Moore-Horspool substring search (worst case)
txt.charAt(i)
 i    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
      3  1  4  1  5  9  2  6  5  3  5  8  9  7  9  3
 0    3  1  4  1  5                                   % 997 = 508
 1       1  4  1  5  9                                % 997 = 201
 2          4  1  5  9  2                             % 997 = 715
 3             1  5  9  2  6                          % 997 = 971
 4                5  9  2  6  5                       % 997 = 442
 5                   9  2  6  5  3                    % 997 = 929
 6                      2  6  5  3  5                 % 997 = 613  match
return i = 6
Efficiently computing the hash function
            i   ...  2  3  4  5  6  7  ...
         text     1  4  1  5  9  2  6  5
current value        4  1  5  9  2
    new value           1  5  9  2  6

    4 1 5 9 2     current value
  - 4 0 0 0 0
      1 5 9 2     subtract leading digit
        * 1 0     multiply by radix
    1 5 9 2 0
          + 6     add new trailing digit
    1 5 9 2 6     new value
Key computation in Rabin-Karp substring search (move right one position in the text)
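The same update can be carried out modulo a prime Q so that the values stay small. The sketch below uses the numbers from the example (radix R = 10, Q = 997, M = 5) and treats each character as a decimal digit; the class and method names (RollingHashSketch, hash, roll) are assumptions for illustration.

```java
class RollingHashSketch {
    static final long R = 10;    // radix: decimal digits, as in the example
    static final long Q = 997;   // modulus used in the example

    // Horner's method: hash of the first M digits of key, mod Q.
    static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + (key.charAt(j) - '0')) % Q;
        return h;
    }

    // Given h = hash of an M-digit window and RM = R^(M-1) mod Q, returns the
    // hash of the window shifted right by one digit.
    static long roll(long h, long RM, int leading, int trailing) {
        h = (h + Q - RM * leading % Q) % Q;   // subtract leading digit
        return (h * R + trailing) % Q;        // multiply by radix, add trailing digit
    }

    public static void main(String[] args) {
        long RM = 1;
        for (int i = 1; i <= 4; i++) RM = (R * RM) % Q;   // R^(M-1) mod Q for M = 5
        long h = hash("41592", 5);
        System.out.println(h);                    // prints 715
        System.out.println(roll(h, RM, 4, 6));    // prints 971, the hash of 15926
    }
}
```

Adding Q before subtracting keeps the intermediate value non-negative, which matters because Java's % operator can return a negative result for a negative left operand.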
Rabin-Karp substring search example
 i    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
      3  1  4  1  5  9  2  6  5  3  5  8  9  7  9  3
 0    3  % 997 = 3     (Q = 997)
Rabin-Karp: Java implementation
   RM = 1;                           // precompute R^(M-1) (mod Q)
   for (int i = 1; i <= M-1; i++)
      RM = (R * RM) % Q;
   patHash = hash(pat, M);
}
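For context, the constructor fragment above might sit in a class like the following sketch. Several choices here are assumptions rather than the slides' code: the class name RabinKarpSketch, the fixed prime Q = 1000000007 (a random large prime is the more usual choice), R = 256, and the explicit startsWith() check on a hash match, which makes this a Las Vegas style variant that never reports a false match.

```java
class RabinKarpSketch {
    static final int  R = 256;               // assumed radix
    static final long Q = 1_000_000_007L;    // assumed fixed prime modulus
    private final String pat;
    private final int M;
    private final long patHash;
    private long RM;                         // R^(M-1) mod Q

    RabinKarpSketch(String pat) {
        this.pat = pat;
        M = pat.length();
        RM = 1;                              // precompute R^(M-1) (mod Q)
        for (int i = 1; i <= M - 1; i++)
            RM = (R * RM) % Q;
        patHash = hash(pat, M);
    }

    // Horner's method: hash of key[0..M-1], mod Q.
    private static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // Returns the index of the first occurrence of pat, or txt.length() if none.
    int search(String txt) {
        int N = txt.length();
        if (N < M) return N;
        long txtHash = hash(txt, M);
        if (patHash == txtHash && txt.startsWith(pat, 0)) return 0;
        for (int i = M; i < N; i++) {
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;  // remove leading char
            txtHash = (txtHash * R + txt.charAt(i)) % Q;               // add trailing char
            int offset = i - M + 1;
            // confirm the candidate so a hash collision cannot cause a false match
            if (patHash == txtHash && txt.startsWith(pat, offset)) return offset;
        }
        return N;
    }

    public static void main(String[] args) {
        System.out.println(new RabinKarpSketch("26535").search("3141592653589793"));  // prints 6
    }
}
```

Dropping the startsWith() check gives the Monte Carlo behavior: no backup in the text, at the cost of a small chance of reporting a false match.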
Rabin-Karp fingerprint search
Advantages.
・Extends to 2d patterns.
・Extends to finding multiple patterns.
Disadvantages.
・Arithmetic ops slower than char compares.
・Las Vegas version requires backup.
・Poor worst-case guarantee.
The algorithm is efficient because the fingerprints can be efficiently computed and compared.
Substring search cost summary
Cost of searching for an M-character pattern in an N-character text.
Summary. The table at the bottom of the page summarizes the algorithms that we have discussed for
substring search. As is often the case when we have several algorithms for the same task, each of them
has attractive features. Brute force search is easy to implement and works well in typical cases (Java's
indexOf() method in String uses brute-force search); Knuth-Morris-Pratt is guaranteed linear-time with
no backup in the input; Boyer-Moore is sublinear (by a factor of M) in typical situations; and
Rabin-Karp is linear. Each also has drawbacks: brute-force might require time proportional to MN;
Knuth-Morris-Pratt and Boyer-Moore use extra space; and Rabin-Karp has a relatively long inner loop
(several arithmetic operations, as opposed to character compares in the other methods). These
characteristics are summarized in the table below.

algorithm            version                          operation count       backup      correct?   extra
                                                      guarantee   typical   in input?              space
brute force          -                                MN          1.1 N     yes         yes        1
Knuth-Morris-Pratt   full DFA (Algorithm 5.6)         2N          1.1 N     no          yes        MR
                     mismatch transitions only        3N          1.1 N     no          yes        M
Boyer-Moore          full algorithm                   3N          N/M       yes         yes        R
                     mismatched char heuristic only   MN          N/M       yes         yes        R
                     (Algorithm 5.7)
Rabin-Karp†          Monte Carlo (Algorithm 5.8)      7N          7N        no          yes†       1
                     Las Vegas                        7N†         7N        no†         yes        1
† probabilistic guarantee, with uniform hash function
Cost summary for substring-search implementations