
String Searching

Princeton University • COS 423 • Theory of Algorithms • Spring 2002 • Kevin Wayne

Reference: Chapter 19, Algorithms in C by R. Sedgewick, Addison Wesley, 1990.

Strings

String.
■ Sequence of characters over some alphabet.
  – binary { 0, 1 }
  – ASCII, UNICODE

Some applications.
■ Word processors.
■ Virus scanning.
■ Text information retrieval systems. (Lexis, Nexis)
■ Digital libraries.
■ Natural language processing.
■ Specialized databases.
■ Computational molecular biology.
■ Web search engines.

Brute Force

Brute force.
■ Check for pattern starting at every text position.

Search Pattern:  n e e d l e

Successful search:
n n e e n l e d e n e e n e e d l e n l d
                        n e e d l e

Parameters.
■ N = # characters in text.
■ M = # characters in pattern.
■ Typically, N >> M.
  – e.g., N = 1 million, M = 1 hundred

Brute Force String Search

int brutesearch(char p[], char t[]) {
    int i, j;
    int M = strlen(p);                 // pattern length
    int N = strlen(t);                 // text length

    for (i = 0; i <= N - M; i++) {
        for (j = 0; j < M; j++) {
            if (t[i+j] != p[j]) break;
        }
        if (j == M) return i;          // found at offset i
    }
    return -1;                         // not found
}
Analysis of Brute Force

Analysis of brute force.
■ Running time depends on pattern and text.
  – can be slow when strings repeat themselves
■ Worst case: MN comparisons.
  – too slow when M and N are large
■ Need better ideas in general.

Search Pattern:  a a a a a b
Search Text:     a a a a a a a a a a a a a a a a a a a a b

How To Save Comparisons

How to avoid recomputation?
■ Pre-analyze search pattern.
■ Ex: suppose that first 5 characters of pattern are all a's.
  – If t[0..4] matches p[0..4], then t[1..4] matches p[0..3].
  – no need to check i = 1, j = 0, 1, 2, 3
  – saves 4 comparisons

Knuth-Morris-Pratt

KMP algorithm.
■ Use knowledge of how search pattern repeats itself.
■ Build FSA from pattern.
■ Run FSA on text.
■ O(M + N) worst-case running time.

Search Pattern:  a a b a a a
Search Text:     a a a b a a b a a a b

[FSA diagram for the pattern a a b a a a: states 0 through 6 with forward transitions on matching characters, mismatch transitions on b back to state 0 or 3, and state 6 the accept state. The deck steps this FSA through the search text one character per slide.]

Knuth-Morris-Pratt

KMP algorithm.
■ Use knowledge of how search pattern repeats itself.
■ Build FSA from pattern.
■ Run FSA on text.
■ O(M + N) worst-case running time.
  – FSA simulation takes O(N) time
  – can build FSA in O(M) time with cleverness

Search Pattern:  a a b a a a

FSA Representation

FSA used in KMP has special property.
■ Upon character match, go forward one state.
■ Only need to keep track of where to go upon character mismatch.
  – go to state next[j] if character mismatches in state j

State j    0  1  2  3  4  5
a          1  2  2  4  5  6
b          0  0  3  0  0  3
next[j]    0  0  2  0  0  3

(state 6 is the accept state)
KMP Algorithm

Given the FSA, string search is easy.
■ The array next[] contains next FSA state if character mismatches.

KMP String Search

int kmpsearch(char p[], char t[], int next[]) {
    int i, j = 0;
    int M = strlen(p);                  // pattern length
    int N = strlen(t);                  // text length
    for (i = 0; i < N; i++) {
        if (t[i] == p[j]) j++;          // char match
        else j = next[j];               // char mismatch
        if (j == M) return i - M + 1;   // found
    }
    return -1;                          // not found
}

FSA Construction for KMP

FSA construction for KMP.
■ FSA builds itself!

Example. Building FSA for aabaaabb.
■ State 6. p[0..5] = aabaaa
  – assume you know state for p[1..5] = abaaa:             X = 2
  – if next char is b (match): go forward to 6 + 1 = 7
  – if next char is a (mismatch): go to state for abaaaa:  state X on 'a' = 2
  – update X to state for p[1..6] = abaaab:                state X on 'b' = 3

FSA Construction for KMP

Example. Building FSA for aabaaabb.
■ State 7. p[0..6] = aabaaab
  – assume you know state for p[1..6] = abaaab:             X = 3
  – if next char is b (match): go forward to 7 + 1 = 8
  – if next char is a (mismatch): go to state for abaaaba:  state X on 'a' = 4
  – update X to state for p[1..7] = abaaabb:                state X on 'b' = 0
FSA Construction for KMP

Example. Building FSA for aabaaabb.

[completed FSA diagram for a a b a a a b b: states 0 through 8, with state 8 the accept state]

Crucial insight.
■ To compute transitions for state n of FSA, suffices to have:
  – FSA for states 0 to n-1
  – state X that FSA ends up in with input p[1..n-1]
■ To compute state X' that FSA ends up in with input p[1..n], it suffices to have:
  – FSA for states 0 to n-1
  – state X that FSA ends up in with input p[1..n-1]


FSA Construction for KMP

Search Pattern:  a a b a a a b b

 j   pattern[1..j]    X   next[j]
 0   (empty)          0   0
 1   a                1   0
 2   a b              0   2
 3   a b a            1   0
 4   a b a a          2   0
 5   a b a a a        2   3
 6   a b a a a b      3   2
 7   a b a a a b b    0   4

State j    0  1  2  3  4  5  6  7
a          1  2  2  4  5  6  2  4
b          0  0  3  0  0  3  7  8

(state 8 is the accept state)

Code for FSA construction in KMP algorithm.

FSA Construction for KMP

void kmpinit(char p[], int next[]) {
    int j, X = 0, M = strlen(p);
    next[0] = 0;

    for (j = 1; j < M; j++) {
        if (p[X] == p[j]) {
            next[j] = next[X];
            X = X + 1;
        }
        else {
            next[j] = X + 1;
            X = next[X];
        }
    }
}
Specialized KMP Implementation

Specialized C program for aabaaabb pattern.

Hardwired FSA for aabaaabb

int kmpsearch(char t[]) {
    int i = 0;
    s0: if (t[i++] != 'a') goto s0;
    s1: if (t[i++] != 'a') goto s0;
    s2: if (t[i++] != 'b') goto s2;    // next[2] = 2
    s3: if (t[i++] != 'a') goto s0;
    s4: if (t[i++] != 'a') goto s0;
    s5: if (t[i++] != 'a') goto s3;    // next[5] = 3
    s6: if (t[i++] != 'b') goto s2;    // next[6] = 2
    s7: if (t[i++] != 'b') goto s4;    // next[7] = 4
    return i - 8;
}

Ultimate search program for aabaaabb pattern.
■ Machine language version of above.

Summary of KMP

KMP summary.
■ Build FSA from pattern.
■ Run FSA on text.
■ O(M + N) worst case string search.
■ Good efficiency for patterns and texts with much repetition.
  – binary files
  – graphics formats
■ Less useful for text strings.
■ On-line algorithm.
  – virus scanning
  – Internet spying

History of KMP

History of KMP.
■ Inspired by a theorem of Cook that says an O(M + N) algorithm should be possible.
■ Discovered in 1976 independently by two groups.
  – Knuth-Pratt.
  – Morris was a hacker trying to build an editor.
    – annoying problem that you needed a buffer when performing text search

Resolved theoretical and practical problems.
■ Surprise when it was discovered.
■ In hindsight, seems like the right algorithm.

Boyer-Moore

Boyer-Moore algorithm (1974).
■ Right-to-left scanning.
  – find offset i in text by moving left to right.
  – compare pattern to text by moving right to left.

Search Pattern:  s t i n g
Search Text:     a s t r i n g s s e a r c h c o n s i s t i n g o f

[trace: the pattern slides left to right across the text; at each offset, characters are compared right to left and each position ends in Mismatch, Match, or No comparison]
Boyer-Moore

Boyer-Moore algorithm (1974).
■ Right-to-left scanning.
■ Heuristic 1: advance offset i using "bad character rule."
  – upon mismatch of text character c, look up j = index[c]
  – increase offset i so that jth character of pattern lines up with text character c

Index table for the pattern s t i n g:
  g   4
  i   2
  n   3
  s   0
  t   1
  *  -1

Search Pattern:  s t i n g
Search Text:     a s t r i n g s s e a r c h c o n s i s t i n g o f

Boyer-Moore

Boyer-Moore algorithm (1974).
■ Right-to-left scanning.
■ Heuristic 1: advance offset i using "bad character rule."
  – extremely effective for English text
■ Heuristic 2: use KMP-like suffix rule.
  – effective with small alphabets
  – different rules lead to different worst-case behavior

Text:     x x x x x x x b a b x x x x x x x x x x x x x x
Pattern:  x c a b d a b d a b
          (shift under the bad character heuristic vs. the strong good suffix rule)

Boyer-Moore analysis.
■ O(N / M) average case if given letter usually doesn't occur in string.
  – English text: 10 character search string, 26 char alphabet
  – time decreases as pattern length increases
  – sublinear in input size!
■ O(M + N) worst-case with Galil variant.
  – proof is quite difficult
Karp-Rabin

Idea: use hashing.
■ Compute hash function for each text position.
■ No explicit hash table!
  – just compare with pattern hash

Example.
■ Hash "table" size = 97.

Search Pattern:  5 9 2 6 5        59265 % 97 = 95

Search Text:     3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6
  31415 % 97 = 84
  14159 % 97 = 94
  41592 % 97 = 76
  15926 % 97 = 18
  59265 % 97 = 95   (match)

Karp-Rabin

Idea: use hashing.
■ Compute hash function for each text position.

Problems.
■ Need full compare on hash match to guard against collisions.
  – 59265 % 97 = 95
  – 59362 % 97 = 95
■ Hash function depends on M characters.
  – running time on search miss = MN

Karp-Rabin

Key idea: fast to compute hash function of adjacent substrings.
■ Use previous hash to compute next hash.
■ O(1) time per hash, except first one.

Example.
■ Pre-compute: 10000 % 97 = 9
■ Previous hash: 41592 % 97 = 76
■ Next hash: 15926 % 97

Observation.
■ 15926 = (41592 – (4 * 10000)) * 10 + 6
■ 15926 % 97 ≡ ((41592 – (4 * 10000)) * 10 + 6) % 97
            ≡ (76 – 4 * 9) * 10 + 6
            ≡ 406
            ≡ 18   (mod 97)

Karp-Rabin (Sedgewick, p. 290)

#define q 3355439   // table size
#define d 256       // radix

int rksearch(char p[], char t[]) {
    int i, j, dM = 1, h1 = 0, h2 = 0;
    int M = strlen(p), N = strlen(t);

    for (j = 1; j < M; j++)              // precompute d^(M-1) % q
        dM = (d * dM) % q;

    for (j = 0; j < M; j++) {
        h1 = (h1*d + p[j]) % q;          // hash of pattern
        h2 = (h2*d + t[j]) % q;          // hash of first M text chars
    }

    for (i = M; i < N; i++) {
        if (h1 == h2) return i - M;              // match found
        h2 = (h2 - t[i-M]*dM % q + q) % q;       // remove high-order digit
        h2 = (h2*d + t[i]) % q;                  // insert low-order digit
    }
    if (h1 == h2) return N - M;                  // match ending at last char
    return -1;                                   // not found
}
Karp-Rabin

Karp-Rabin algorithm.
■ Choose table size at RANDOM to be a huge prime.
■ Expected running time is O(M + N).
■ O(MN) worst-case, but this is (unbelievably) unlikely.

Randomized algorithms.
■ Monte Carlo: don't check for collisions.
  – algorithm can be wrong, but running time guaranteed linear
■ Las Vegas: if collision, start over with new random table size.
  – algorithm always correct, but running time is expected linear

Advantages.
■ Extends to 2d patterns and other generalizations.
