Notes 04 String Matching
Sebastian Wild
24 February 2020
4 String Matching
4.1 Introduction
4.2 Brute Force
4.3 String Matching with Finite Automata
4.4 The Knuth-Morris-Pratt algorithm
4.5 Beyond Optimal? The Boyer-Moore Algorithm
4.6 The Rabin-Karp Algorithm
4.1 Introduction
Ubiquitous strings
string = sequence of characters
▶ universal data type for . . . everything!
  ▶ natural language texts
  ▶ programs (source code)
  ▶ websites
  ▶ XML documents
  ▶ DNA sequences
  ▶ bitstrings
  ▶ . . . a computer’s memory ⇝ ultimately any data is a string
Notations
▶ alphabet Σ: finite set of allowed characters (“a string over alphabet Σ”); 𝜎 = |Σ|
  ▶ letters (Latin, Greek, Arabic, Cyrillic, Asian scripts, . . . )
  ▶ “what you can type on a keyboard”; Unicode: a comprehensive standard character set including emoji and all known symbols
  ▶ {0, 1}; nucleotides {A, C, G, T}; . . .
Clicker Question
A True ✓    B False
pingo.upb.de/622222
String matching – Definition
Search for a string (pattern) in a large body of text
▶ Input:
  ▶ 𝑇 ∈ Σ^𝑛: The text (haystack) being searched within
  ▶ 𝑃 ∈ Σ^𝑚: The pattern (needle) being searched for; typically 𝑛 ≫ 𝑚
▶ Output:
  ▶ the first occurrence (match) of 𝑃 in 𝑇: min { 𝑖 ∈ [0..𝑛 − 𝑚] : 𝑇[𝑖..𝑖 + 𝑚) = 𝑃 }
  ▶ or NO_MATCH if there is no such 𝑖 (“𝑃 does not occur in 𝑇”)
▶ Example:
  ▶ 𝑇 = “Where is he?”
  ▶ 𝑃1 = “he” ⇝ 𝑖 = 1
  ▶ 𝑃2 = “who” ⇝ NO_MATCH
Brute-force method
▶ Example: 𝑇 = abbbababbab, 𝑃 = abba
  T: a b b b a b a b b a b
     a b b a
       a
         a
           a
             a b b
               a
                 a b b a   ← match at 𝑖 = 6
⇝ 15 char cmps (vs 𝑛 · 𝑚 = 44): not too bad!
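The procedure sketched above can be written out as a short Python function (function and variable names are mine, not from the notes):

```python
def brute_force_match(T, P):
    """Try every guess i; compare P to T[i..i+m) left to right."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):             # each possible guess
        j = 0
        while j < m and T[i + j] == P[j]:  # compare characterwise
            j += 1
        if j == m:                         # all m characters matched
            return i
    return None                            # NO_MATCH

# The example above: P = abba first occurs in T = abbbababbab at i = 6.
print(brute_force_match("abbbababbab", "abba"))  # -> 6
```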
Brute-force method – Discussion
Brute-force method can be good enough
▶ typically works well for natural language text
▶ also for random strings
▶ but brute force does ‘obviously’ stupid repetitive comparisons ⇝ can we avoid that?
Roadmap
▶ Approach 1 (this week): Use preprocessing on the pattern 𝑃 to eliminate guesses (avoid ‘obvious’ redundant work)
  ▶ Deterministic finite automata (DFA)
  ▶ Knuth-Morris-Pratt algorithm
  ▶ Boyer-Moore algorithm
  ▶ Rabin-Karp algorithm
4.3 String Matching with Finite Automata
Theoretical Computer Science to the rescue!
▶ string matching = deciding whether 𝑇 ∈ Σ★ · 𝑃 · Σ★
⇝ Job done!
String matching with DFA
▶ Assume first that we already have a deterministic automaton
▶ How does string matching work?
Example: 𝑃 = ababaca, 𝑇 = aabacaababacaa
[DFA for 𝑃 = ababaca: states 0..7, state 7 accepting with Σ self-loop; match transitions 0 →a 1 →b 2 →a 3 →b 4 →a 5 →c 6 →a 7; mismatch transitions lead back to earlier states]
text:  a a b a c a a b a b a c a a
state: 0 1 1 2 3 0 1 1 2 3 4 5 6 7 7
String matching DFA – Intuition
Why does this work?
Example: 𝑃 = ababaca, 𝑇 = aabacaababacaa
▶ Main insight: State 𝑞 means: “we have seen 𝑃[0..𝑞) until here (but not any longer prefix of 𝑃)”
text:  a a b a c a a b a b a c a a
state: 0 1 1 2 3 0 1 1 2 3 4 5 6 7 7
▶ trivial part: the chain of match transitions 0 →a 1 →b 2 →a 3 →b 4 →a 5 →c 6 →a 7
NFA instead of DFA?
It remains to construct the DFA.
▶ trivial part: the chain 0 →a 1 →b 2 →a 3 →b 4 →a 5 →c 6 →a 7, plus Σ self-loops at states 0 and 7 ⇝ an NFA for Σ★ · 𝑃 · Σ★
Example: sets of active NFA states while reading 𝑇
text:  a a b a c a a b a b a c a a
state: {0}, {0,1}, {0,1}, {0,2}, {0,1,3}, {0}, {0,1}, {0,1}, {0,2}, {0,1,3}, {0,2,4}, {0,1,3,5}, {0,6}, {0,1,7}, {0,1,7}
Computing DFA directly
You have an NFA and want a DFA? Simply apply the power-set construction (and maybe DFA minimization)!
https://cuvids.io/app/video/194/watch
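As a sanity check, the power-set construction on this matching NFA (Σ self-loops at state 0 and at the accepting state, a chain spelling 𝑃 in between) can be sketched in Python; all names here are mine:

```python
def powerset_dfa(P, sigma):
    """Determinize the NFA for Sigma* P Sigma* by the power-set construction."""
    m = len(P)

    def step(S, c):
        nxt = {0}                      # Sigma self-loop at NFA start state 0
        for q in S:
            if q == m:
                nxt.add(m)             # Sigma self-loop at accepting state m
            elif P[q] == c:
                nxt.add(q + 1)         # advance along the pattern chain
        return frozenset(nxt)

    start = frozenset({0})
    delta, seen, todo = {}, {start}, [start]
    while todo:                        # explore all reachable subsets
        S = todo.pop()
        for c in sigma:
            nS = step(S, c)
            delta[(S, c)] = nS
            if nS not in seen:
                seen.add(nS)
                todo.append(nS)
    return start, delta

# Reproduce the subset trace for P = ababaca on T = aabacaababacaa:
start, delta = powerset_dfa("ababaca", "abc")
S = start
for c in "aabacaababacaa":
    S = delta[(S, c)]
print(sorted(S))  # -> [0, 1, 7]: contains accepting state 7, so T matched
```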
String matching with DFA – Discussion
▶ Time:
  ▶ Matching: 𝑛 table lookups for DFA transitions
  ▶ building DFA: Θ(𝑚𝜎) time (constant time per transition edge)
  ⇝ Θ(𝑚𝜎 + 𝑛) time for string matching.
▶ Space:
  ▶ Θ(𝑚𝜎) space for transition matrix.
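The Θ(𝑚𝜎)-time construction can be sketched as follows; this is the classic “restart state” formulation of the string-matching DFA, with names chosen by me:

```python
def build_dfa(P, sigma):
    """Transition table dfa[c][q]: next state from state q on character c."""
    m = len(P)
    dfa = {c: [0] * (m + 1) for c in sigma}
    dfa[P[0]][0] = 1
    x = 0                          # restart state: state after simulating P[1..q)
    for q in range(1, m):
        for c in sigma:
            dfa[c][q] = dfa[c][x]  # copy mismatch transitions from restart state
        dfa[P[q]][q] = q + 1       # overwrite with the match transition
        x = dfa[P[q]][x]           # advance the simulation by one character
    for c in sigma:
        dfa[c][m] = m              # Sigma self-loop at the accepting state
    return dfa

def dfa_match(T, P, sigma):        # assumes T only contains characters from sigma
    dfa, q = build_dfa(P, sigma), 0
    for i, c in enumerate(T):
        q = dfa[c][q]
        if q == len(P):
            return i - len(P) + 1  # first occurrence
    return None

print(dfa_match("aabacaababacaa", "ababaca", "abc"))  # -> 6
```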
4.4 The Knuth-Morris-Pratt algorithm
Failure Links
▶ Recall: String matching with a DFA is fast, but needs a table of 𝑚 × 𝜎 transitions.
▶ In the fast DFA construction, we used that all simulations differ only by the last symbol
▶ KMP’s third insight: do this last step of simulation from state 𝑥 during matching! . . . but how?
[failure-link automaton for 𝑃 = ababaca: match transitions 0 →a 1 →b 2 →a 3 →b 4 →a 5 →c 6 →a 7, a Σ−a self-loop at 0 and a Σ self-loop at 7; on a mismatch (×), follow a failure link back and retry the current character]
Failure link automaton – Example
Example: 𝑇 = abababaabab, 𝑃 = ababaca
𝑇: a b a b a b a a b a b
𝑃: a b a b a ×              (mismatch ⇝ failure link to state 3)
      (a) (b) (a) b a ×     (mismatch ⇝ failure links to states 1, 0, then read a)
              a b a b
𝑞: 1 2 3 4 5 (3, 4) 5 (3, 1, 0, 1) 2 3 4   (after reading this character)
Clicker Question
A Θ(1)        C Θ(𝑚) ✓
B Θ(log 𝑚)    D Θ(𝑚²)
pingo.upb.de/622222
The Knuth-Morris-Pratt Algorithm
procedure KMP(𝑇[0..𝑛 − 1], 𝑃[0..𝑚 − 1])
    fail[0..𝑚] := failureLinks(𝑃)   // only need single array fail for failure links
    𝑖 := 0   // current position in 𝑇
    𝑞 := 0   // current state of KMP automaton (procedure failureLinks later)
    while 𝑖 < 𝑛 do
        if 𝑇[𝑖] == 𝑃[𝑞] then
            𝑖 := 𝑖 + 1; 𝑞 := 𝑞 + 1
            if 𝑞 == 𝑚 then
                return 𝑖 − 𝑞   // occurrence found
        else   // i.e. 𝑇[𝑖] ≠ 𝑃[𝑞]
            if 𝑞 ≥ 1 then
                𝑞 := fail[𝑞]   // follow one ×
            else
                𝑖 := 𝑖 + 1
    end while
    return NO_MATCH
Analysis: (matching part)
▶ always have fail[𝑗] < 𝑗 for 𝑗 ≥ 1
▶ in each iteration
  ▶ either advance position in text (𝑖 := 𝑖 + 1)
  ▶ or shift pattern forward (guess 𝑖 − 𝑞)
▶ each can happen at most 𝑛 times
⇝ ≤ 2𝑛 symbol comparisons!
Computing failure links
▶ failure links point to error state 𝑥 (from DFA construction)
▶ run same algorithm, but store fail[𝑗] := 𝑥 instead of copying all transitions
▶ Space: Θ(𝑚) space for failure links
Clicker Question
What are the main advantages of KMP string matching (using the failure-link automaton) over string matching with DFAs? Check all that apply.
A faster preprocessing on pattern ✓
B faster matching in text
pingo.upb.de/622222
The KMP prefix function
▶ It turns out that the failure links are useful beyond KMP ⇝ memorize this!
fail[𝑗] = length of the longest prefix of 𝑃[0..𝑗) that is a suffix of 𝑃[1..𝑗).
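Both failureLinks and the KMP procedure above fit in a few lines of Python (a sketch with my own names; fail uses exactly the prefix-function definition on this slide):

```python
def failure_links(P):
    """fail[j] = length of longest prefix of P[0..j) that is a suffix of P[1..j)."""
    m = len(P)
    fail = [0] * (m + 1)
    k = 0                               # current border length of P[0..j)
    for j in range(1, m):
        while k > 0 and P[j] != P[k]:
            k = fail[k]                 # fall back along shorter borders
        if P[j] == P[k]:
            k += 1                      # border extends by one character
        fail[j + 1] = k
    return fail

def kmp(T, P):
    fail = failure_links(P)
    i = q = 0                           # position in T, state of KMP automaton
    while i < len(T):
        if T[i] == P[q]:
            i += 1; q += 1
            if q == len(P):
                return i - q            # occurrence found
        elif q >= 1:
            q = fail[q]                 # follow one failure link
        else:
            i += 1
    return None                         # NO_MATCH

print(failure_links("ababaca"))          # -> [0, 0, 0, 1, 2, 3, 0, 1]
print(kmp("aabacaababacaa", "ababaca"))  # -> 6
```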
4.5 Beyond Optimal? The Boyer-Moore Algorithm
Motivation
▶ KMP is an optimal algorithm, isn’t it? What else could we hope for?
[figure: 𝑇 = a a a a . . . a with only a few probed positions marked x ⇝ checking guesses from the right can skip most characters of 𝑇 entirely]
Boyer-Moore Algorithm
▶ Let’s check guesses from right to left!
▶ New rules:
  ▶ Bad character jumps: Upon mismatch at 𝑇[𝑖] = 𝑐:
    ▶ If 𝑃 does not contain 𝑐, shift 𝑃 entirely past 𝑖!
    ▶ Otherwise, shift 𝑃 to align the last occurrence of 𝑐 in 𝑃 with 𝑇[𝑖].
  ▶ Good suffix jumps: Upon a mismatch, shift so that the already matched suffix of 𝑃 aligns with a previous occurrence of that suffix (or part of it) in 𝑃. (Details follow; ideas similar to KMP failure links)
Bad character examples
1. 𝑃 = aldo
   𝑇 = w h e r e i s w a l d o
   ▶ guess at 𝑖 = 0: compare right to left: o vs 𝑇[3] = r; r does not occur in 𝑃 ⇝ shift 𝑃 entirely past 𝑇[3]
   ▶ guess at 𝑖 = 4: o vs 𝑇[7] = w; w does not occur in 𝑃 ⇝ shift 𝑃 entirely past 𝑇[7]
   ▶ guess at 𝑖 = 8: a l d o matches ⇝ occurrence at 𝑖 = 8
2. 𝑃 = moore
   𝑇 = b o y e r m o o r e
   ▶ guess at 𝑖 = 0: e vs 𝑇[4] = r; last occurrence of r in 𝑃 is 𝑃[3] ⇝ shift by 1
   ▶ guess at 𝑖 = 1: e vs 𝑇[5] = m; last occurrence of m in 𝑃 is 𝑃[0] ⇝ shift by 4
   ▶ guess at 𝑖 = 5: m o o r e matches ⇝ occurrence at 𝑖 = 5
Last-Occurrence Function
▶ Preprocess pattern 𝑃 and alphabet Σ
▶ last-occurrence function 𝜆[𝑐] defined as
  ▶ the largest index 𝑖 such that 𝑃[𝑖] = 𝑐, or
  ▶ −1 if no such index exists
▶ Example: 𝑃 = moore
  𝑐      m  o  r  e  all others
  𝜆[𝑐]   0  2  3  4  −1
  In 𝑇 = boyermoore: at 𝑖 = 0, 𝑗 = 4, 𝑇[𝑖 + 𝑗] = r, 𝜆[r] = 3 ⇝ shift by 𝑗 − 𝜆[𝑇[𝑖 + 𝑗]] = 1
▶ 𝜆 easily computed in 𝑂(𝑚 + |Σ|) time.
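Boyer-Moore restricted to the bad-character rule alone can be sketched like this (names are mine; the full algorithm additionally takes the maximum with the good-suffix jump 𝛾[𝑗]):

```python
def last_occurrence(P):
    """lambda[c]: largest i with P[i] = c; characters absent from P get -1."""
    return {c: i for i, c in enumerate(P)}

def bm_bad_character(T, P):
    lam = last_occurrence(P)
    n, m = len(T), len(P)
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and T[i + j] == P[j]:   # compare right to left
            j -= 1
        if j < 0:
            return i                          # match found
        # bad-character shift j - lambda[c]; the guard keeps the shift positive
        i += max(1, j - lam.get(T[i + j], -1))
    return None

print(bm_bad_character("boyermoore", "moore"))  # -> 5
```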
Good suffix examples
1. 𝑃 = sells␣shells
   𝑇 = s h e i l a ␣ s e l l s ␣ s h e l l s
   ▶ guess at 𝑖 = 0: right-to-left comparison matches the suffix ells, then h vs s fails
   ▶ ells occurs again in 𝑃 (inside sells) ⇝ shift to align that occurrence with the matched text ⇝ occurrence at 𝑖 = 7
2. 𝑃 = odetofood
   𝑇 = i l i k e f o o d f r o m m e x i c o
   ▶ guess at 𝑖 = 0: right-to-left comparison matches the suffix food, then o vs e fails
   ▶ food does not occur again in 𝑃, but the prefix od of 𝑃 is a suffix of the matched part ⇝ shift to align od with the end of the matched suffix
Good suffix jumps
▶ Precompute good suffix jumps 𝛾[0..𝑚 − 1]:
  ▶ For 0 ≤ 𝑗 < 𝑚, 𝛾[𝑗] stores the shift if the search failed at 𝑃[𝑗]
  ▶ At this point, we had 𝑇[𝑖+𝑗+1..𝑖+𝑚−1] = 𝑃[𝑗+1..𝑚−1], but 𝑇[𝑖+𝑗] ≠ 𝑃[𝑗]
  ▶ shift so that the matched part aligns with another occurrence of 𝑃[𝑗+1..𝑚−1] in 𝑃
    –OR– with a prefix 𝑃[0..ℓ] that is a suffix of 𝑃[𝑗+1..𝑚−1]
▶ Note: You do not need to know how to find the values 𝛾[𝑗] for the exam, but you should be able to find the next guess on examples.
Boyer-Moore algorithm – Discussion
▶ Worst-case running time ∈ 𝑂(𝑛 + 𝑚 + |Σ|) if 𝑃 does not occur in 𝑇. (follows from a not at all obvious analysis!)
▶ On typical English text, Boyer-Moore probes only approx. 25% of the characters in 𝑇!
⇝ Faster than KMP on English text.
4.6 The Rabin-Karp Algorithm
Space – The final frontier
▶ Knuth-Morris-Pratt has great worst-case and real-time guarantees
Rabin-Karp Fingerprint Algorithm – Idea
Idea: use hashing (but without explicit hash tables)
▶ Precompute & store only hash of pattern
▶ Compute hash for each guess
▶ If hashes agree, check characterwise
Example: (treat (sub)strings as decimal numbers)
𝑃 = 59265, 𝑇 = 3141592653589793238
Hash function: ℎ(𝑥) = 𝑥 mod 97 ⇝ ℎ(𝑃) = 95
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8
ℎ(31415) = 84
ℎ(14159) = 94
ℎ(41592) = 76
ℎ(15926) = 18
ℎ(59265) = 95 ✓
Rabin-Karp Fingerprint Algorithm – First Attempt
procedure rabinKarpSimplistic(𝑇[0..𝑛 − 1], 𝑃[0..𝑚 − 1])
    𝑀 := suitable prime number
    ℎ𝑃 := computeHash(𝑃[0..𝑚 − 1], 𝑀)
    for 𝑖 := 0, . . . , 𝑛 − 𝑚 do
        ℎ𝑇 := computeHash(𝑇[𝑖..𝑖 + 𝑚 − 1], 𝑀)
        if ℎ𝑇 == ℎ𝑃 then
            if 𝑇[𝑖..𝑖 + 𝑚 − 1] == 𝑃 then   // 𝑚 comparisons
                return 𝑖
    return NO_MATCH
Rabin-Karp Fingerprint Algorithm – Fast Rehash
▶ Crucial insight: We can update hashes in constant time.
  ▶ Use previous hash to compute next hash for above hash function!
  ▶ 𝑂(1) time per hash, except first one
Example:
▶ Pre-compute: 10000 mod 97 = 9
▶ Observation:
  15926 mod 97 = ((41592 − 4·10000) · 10 + 6) mod 97
               = ((76 − 4·9) · 10 + 6) mod 97
               = 406 mod 97 = 18
Rabin-Karp Fingerprint Algorithm – Code
▶ use a convenient radix 𝑅 ≥ 𝜎 (𝑅 = 10 in our examples; 𝑅 = 2^𝑘 is faster)
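Since the code itself is not reproduced in these notes, here is a hedged Python sketch of how the pieces fit together; it treats 𝑇 and 𝑃 as strings of decimal digits like the slides’ example (𝑅 = 10, 𝑀 = 97), and all names are mine:

```python
def rabin_karp(T, P, R=10, M=97):
    """Rolling-hash matcher; R = radix, M = prime modulus (97 as on the slides)."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return None
    RM = pow(R, m - 1, M)                  # R^(m-1) mod M: weight of leading digit

    def h(s):                              # Horner's rule, mod M
        v = 0
        for c in s:
            v = (v * R + int(c)) % M
        return v

    hP, hT = h(P), h(T[:m])
    for i in range(n - m + 1):
        if hT == hP and T[i:i + m] == P:   # verify: hashes can collide
            return i
        if i < n - m:                      # roll: drop T[i], append T[i+m]
            hT = ((hT - int(T[i]) * RM) * R + int(T[i + m])) % M
    return None

print(rabin_karp("3141592653589793238", "59265"))  # -> 4
```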
Rabin-Karp – Discussion
▶ Expected running time is 𝑂(𝑚 + 𝑛)
▶ Θ(𝑚𝑛) worst case, but this is very unlikely
String Matching Conclusion
                 Brute-Force   DFA        KMP     BM               RK              Suffix trees*
Preproc. time    —             𝑂(𝑚|Σ|)    𝑂(𝑚)    𝑂(𝑚 + 𝜎)         𝑂(𝑚)            𝑂(𝑛)
Search time      𝑂(𝑛𝑚)         𝑂(𝑛)       𝑂(𝑛)    𝑂(𝑛)             𝑂(𝑛 + 𝑚)        𝑂(𝑚)
                                                  (often better)   (expected)
Extra space      —             𝑂(𝑚|Σ|)    𝑂(𝑚)    𝑂(𝑚 + 𝜎)         𝑂(1)            𝑂(𝑛)
* (see Unit 6)