5 1 Stringsearch
1
String Search
A common word processor facility is to search
for a given word in a document. Generally, the
problem is to search for occurrences of a short
string in a long string.
2
History of String Search
The brute force algorithm:
    invented in the dawn of computer history
    re-invented many times, still common
Knuth & Pratt invented a better one in 1970
    invented independently by Morris
    published 1977 as “Knuth-Morris-Pratt”
3
The obvious algorithm is to try the word at each possible
place, and compare all the characters:
for i := 0 to n-m do                 (doc length n)
    for j := 0 to m-1 do             (word length m)
        compare word[j] with doc[i+j]; on a mismatch, try the next i
    if all m characters matched, report a match at i
5
Improved string search, continued
Whenever the document character being examined
does not occur anywhere in the word, we can move
the word along m places.
When the character does occur in the word, the move is smaller.
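
The slide does not name a specific algorithm for this skip. One well-known way to realize the idea is the bad-character shift used in Boyer-Moore-Horspool search, and the sketch below assumes that reading. The names doc, word, n and m follow the earlier slide; horspool_search and shift are hypothetical names chosen for illustration.

```python
def horspool_search(doc, word):
    """Return the first index of word in doc, or -1 (Horspool-style sketch)."""
    n, m = len(doc), len(word)
    if m == 0:
        return 0
    # For each character in word (except the last position), the distance
    # from its rightmost occurrence to the end of the word.
    shift = {ch: m - 1 - j for j, ch in enumerate(word[:-1])}
    i = 0
    while i <= n - m:
        if doc[i:i + m] == word:
            return i
        # Look at the document character under the last box of the word.
        # If it does not occur in the word, move along m places;
        # otherwise move by the smaller precomputed distance.
        i += shift.get(doc[i + m - 1], m)
    return -1


print(horspool_search("ABCEFGABCDE", "ABCD"))  # 6
```

In this run the word first moves 4 places (E is not in the word), then 2 places (B is), and then matches at position 6.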
6
Problem Definition, terminology
Let p be the pattern string
Let t be the target string (document)
Let k be the index of the character in the target
string that “lies over” the first character of the
pattern
Given two strings, p and t, over the alphabet Σ,
determine whether p occurs as a substring of t
That is, determine whether there exists k such
that p = Substring(t, k, |p|).
7
Straightforward string searching
function SimpleStringSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
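
The slide gives only the function header. A minimal Python sketch of a body that satisfies the stated contract (return the location, or -1 if p is not a substring of t) might look as follows; the loop structure is an assumption based on the "try every position" description, not the slide's own code.

```python
def simple_string_search(p, t):
    """Find p in t; return its location or -1 if p is not a substring of t."""
    for k in range(len(t) - len(p) + 1):   # k: target index under p[0]
        j = 0
        # Compare box by box until a mismatch or the end of the pattern.
        while j < len(p) and p[j] == t[k + j]:
            j += 1
        if j == len(p):
            return k
    return -1


print(simple_string_search("ABCD", "ABCEFGABCDE"))  # 6
```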
8
SimpleStringSearch
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
A    B    C    E    F    G    A    B    C    D    E
p[0] p[1] p[2] p[3]
A    B    C    D
Y    Y    Y    N
9
SimpleStringSearch
After each mismatch the pattern moves one box to the right
and the comparison starts again at p[0]:
k = 1: p[0] = A against t[1] = B: N
k = 2: p[0] = A against t[2] = C: N
k = 3: p[0] = A against t[3] = E: N
k = 4: p[0] = A against t[4] = F: N
k = 5: p[0] = A against t[5] = G: N
15
SimpleStringSearch
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
A    B    C    E    F    G    A    B    C    D    E
                              p[0] p[1] p[2] p[3]
                              A    B    C    D
                              Y    Y    Y    Y
16
Straightforward string searching
Worst case:
Pattern string always matches completely except for the last character
Example: search for XXXXXXY in a target string of
XXXXXXXXXXXXXXXXXXXX
Outer loop executed once for every character in the target string
Inner loop executed once for every character in the pattern
Θ(|p| * |t|)
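
To make the worst case concrete, the sketch below counts character comparisons for the example above. The helper name count_comparisons is hypothetical. Note that the outer loop here stops at the last position where the pattern still fits, so it runs |t| - |p| + 1 times rather than exactly |t| times.

```python
def count_comparisons(p, t):
    """Brute-force search that also counts character comparisons.

    Returns (location or -1, number of comparisons made).
    """
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        j = 0
        while j < len(p):
            comparisons += 1
            if p[j] != t[k + j]:
                break
            j += 1
        if j == len(p):
            return k, comparisons
    return -1, comparisons


print(count_comparisons("XXXXXXY", "X" * 20))  # (-1, 98)
```

There are 14 positions to try, and each one costs 7 comparisons (six matches, then the failing Y), giving 14 * 7 = 98.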
17
Knuth-Morris-Pratt
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
X    Y    X    Y    X    Y    c
X    Y    X    Y    Z
Y    Y    Y    Y    N
          X    Y    X    Y    Z
          Y    Y    Y    Y    ?
18
Knuth-Morris-Pratt
Straightforward search takes Θ(|p| * |t|) in the worst case;
Knuth-Morris-Pratt takes Θ(|p| + |t|)
Key idea:
if pattern fails to match, slide pattern to right by
as many boxes as possible without permitting a
match to go unnoticed
19
Knuth-Morris-Pratt
Correct motion of the pattern depends on both the
location of the mismatch and the mismatching
character
For the pattern X Y X Y Z compared at p[4] against target character c:
If c == X : move 2 boxes to right
If c == E : move 5 boxes to right
If c == Z : target found; algorithm terminates
20
Knuth-Morris-Pratt
Goal: determine d, the number of boxes the
pattern should move to the right; the smallest d such
that:
p[0] = t[k+d]
p[1] = t[k+d+1]
p[2] = t[k+d+2]
…
p[i-d] = t[k+i]
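
The conditions above can be checked directly by trying d = 1, 2, ... in turn. The sketch below does this by brute force, assuming a mismatch has just occurred at p[i] against t[k+i]; the function name smallest_shift is hypothetical, and returning i + 1 when no d fits (the pattern moves completely past the mismatching character) is an assumption consistent with the "move 5 boxes" case for a 5-character pattern.

```python
def smallest_shift(p, t, k, i):
    """Smallest d >= 1 with p[0] = t[k+d], ..., p[i-d] = t[k+i].

    Assumes p[i] != t[k+i]. If no such d exists, the pattern must
    move past t[k+i] entirely, which is a shift of i + 1.
    """
    for d in range(1, i + 1):
        # p[0..i-d] must line up with t[k+d..k+i].
        if p[:i - d + 1] == t[k + d:k + i + 1]:
            return d
    return i + 1


print(smallest_shift("XYXYZ", "XYXYXYc", 0, 4))  # 2
print(smallest_shift("XYXYZ", "XYXYE", 0, 4))    # 5
```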
21
Knuth-Morris-Pratt
Note: can be stated largely in terms of
pattern alone.
Value of d depends only on:
The pattern
The value of i
The mismatching character c (at t[k+i])
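
The reason the target is not needed: at the moment of the mismatch, the boxes t[k] to t[k+i-1] are known to equal p[0] to p[i-1], so the relevant part of the target is just p[0..i-1] followed by c. A hedged sketch of computing d from the pattern, i and c alone follows; the names shift_from_pattern and shift_row are hypothetical, and returning 0 when c actually matches p[i] is an assumed convention.

```python
def shift_from_pattern(p, i, c):
    """Shift d determined by the pattern p, position i and character c only."""
    if p[i] == c:
        return 0                      # no mismatch: nothing to shift
    window = p[:i] + c                # what the target must look like here
    for d in range(1, i + 1):
        if p[:i - d + 1] == window[d:]:
            return d
    return i + 1                      # move past the mismatching character


def shift_row(p, c):
    """Shifts for character c at every pattern position 0..len(p)-1."""
    return [shift_from_pattern(p, i, c) for i in range(len(p))]


print(shift_row("XYXYZ", "X"))  # [0, 1, 0, 3, 2]
print(shift_row("XYXYZ", "E"))  # [1, 2, 3, 4, 5]
```

Because the result depends only on the pattern, all of these values can be computed once, before the search starts.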
22
Knuth-Morris-Pratt
algorithm kmp_search:
    input:
        an array of characters, t (the text to be searched)
        an array of characters, p (the word sought)
    output:
        an integer (the zero-based position in t at which p is found)
    define variables:
        an integer, m ← 0 (the beginning of the current match in t)
        an integer, i ← 0 (the position of the current character in p)
        an array of integers, T (the table, computed elsewhere, with T[0] = -1)
    while (m + i) is less than the length of t, do:
        if p[i] = t[m + i],
            let i ← i + 1
            if i equals the length of p,
                return m
        otherwise,
            let m ← m + i - T[i],
            if i is greater than 0,
                let i ← T[i]
    (if we reach here, we have searched all of t unsuccessfully)
    return the length of t
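
A runnable Python sketch of this search follows. The slide says the table T is "computed elsewhere", so kmp_table below is an assumption: it builds the usual failure table with T[0] = -1, which is the convention the search loop needs in order to advance m when i is 0. Handling of an empty pattern is also an added assumption.

```python
def kmp_table(p):
    """Failure table for p, with T[0] = -1 (assumed convention)."""
    T = [0] * len(p)
    if p:
        T[0] = -1
    pos, cnd = 2, 0          # T[1] is always 0
    while pos < len(p):
        if p[pos - 1] == p[cnd]:
            cnd += 1
            T[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = T[cnd]     # fall back to a shorter candidate prefix
        else:
            T[pos] = 0
            pos += 1
    return T


def kmp_search(t, p):
    """Return the zero-based position of p in t, or len(t) if absent."""
    if not p:
        return 0
    T = kmp_table(p)
    m = i = 0                # m: start of current match in t; i: index in p
    while m + i < len(t):
        if p[i] == t[m + i]:
            i += 1
            if i == len(p):
                return m
        else:
            m = m + i - T[i]
            if i > 0:
                i = T[i]
    return len(t)            # searched all of t unsuccessfully


print(kmp_search("ABCEFGABCDE", "ABCD"))  # 6
```

Returning the length of t on failure follows the pseudocode; a caller who prefers the -1 convention of SimpleStringSearch can translate the result.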
23
Knuth-Morris-Pratt
m = 8, i = 0: Fails
m = 15, i = 0
24
Knuth-Morris-Pratt
For pattern XYXYZ:
                                Mismatch at pattern position i
                                0    1    2    3    4
And the mismatching        X    0    1    0    3    2
character in the           Y    1    0    3    0    5
target is this:            Z    1    2    3    4    0
                       other    1    2    3    4    5
Then skip this many spaces (0 means the character matches at that position)
26