String Algorithms
Jaehyun Park
CS 97SI
Stanford University
June 30, 2015
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
String Matching Problem
String Matching Problem
Given a text T and a pattern P , find all occurrences of P
within T
Notations:
n and m: lengths of P and T
: set of alphabets (of constant size)
Pi : ith letter of P (1-indexed)
a, b, c: single letters in
x, y, z: strings
String Matching Problem
Example
T = AGCATGCTGCAGTCATGCTTAGGCTA
P = GCT
P appears three times in T
A naive method takes O(mn) time
Initiate string comparison at every starting point
Each comparison takes O(m) time
We can do much better!
String Matching Problem
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Hash Table
Hash Function
A function that takes a string and outputs a number
A good hash function has few collisions
i.e., If x 6= y, H(x) 6= H(y) with high probability
An easy and powerful hash function is a polynomial mod some
prime p
Consider each letter as a number (ASCII value is fine)
H(x1 . . . xk ) = x1 ak1 + x2 ak2 + + xk1 a + xk (mod p)
How do we find H(x2 . . . xk+1 ) from H(x1 . . . xk )?
Hash Table
Hash Table
Main idea: preprocess T to speedup queries
Hash every substring of length k
k is a small constant
For each query P , hash the first k letters of P to retrieve all
the occurrences of it within T
Dont forget to check collisions!
Hash Table
Hash Table
Pros:
Easy to implement
Significant speedup in practice
Cons:
Doesnt help the asymptotic efficiency
Can still take (nm) time if hashing is terrible or data is
difficult
A lot of memory consumption
Hash Table
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Knuth-Morris-Pratt (KMP) Algorithm
Knuth-Morris-Pratt (KMP) Matcher
A linear time (!) algorithm that solves the string matching
problem by preprocessing P in (m) time
Main idea is to skip some comparisons by using the previous
comparison result
Uses an auxiliary array that is defined as the following:
[i] is the largest integer smaller than i such that P1 . . . P[i] is
a suffix of P1 . . . Pi
... Its better to see an example than the definition
Knuth-Morris-Pratt (KMP) Algorithm
10
Table Example (from CLRS)
[i] is the largest integer smaller than i such that P1 . . . P[i]
is a suffix of P1 . . . Pi
e.g., [6] = 4 since abab is a suffix of ababab
e.g., [9] = 0 since no prefix of length 8 ends with c
Lets see why this is useful
Knuth-Morris-Pratt (KMP) Algorithm
11
Using the Table
T = ABC ABCDAB ABCDABCDABDE
P = ABCDABD
= (0, 0, 0, 0, 1, 2, 0)
Start matching at the first position of T :
Mismatch at the 4th letter of P !
Knuth-Morris-Pratt (KMP) Algorithm
12
Using the Table
We matched k = 3 letters so far, and [k] = 0
Thus, there is no point in starting the comparison at T2 , T3
(crucial observation)
Shift P by k [k] = 3 letters
Mismatch at T4 again!
Knuth-Morris-Pratt (KMP) Algorithm
13
Using the Table
We matched k = 0 letters so far
Shift P by k [k] = 1 letter (we define [0] = 1)
Mismatch at T11 !
Knuth-Morris-Pratt (KMP) Algorithm
14
Using the Table
[6] = 2 means P1 P2 is a suffix of P1 . . . P6
Shift P by 6 [6] = 4 letters
Again, no point in shifting P by 1, 2, or 3 letters
Knuth-Morris-Pratt (KMP) Algorithm
15
Using the Table
Mismatch at T11 again!
Currently 2 letters are matched
Shift P by 2 [2] = 2 letters
Knuth-Morris-Pratt (KMP) Algorithm
16
Using the Table
Mismatch at T11 yet again!
Currently no letters are matched
Shift P by 0 [0] = 1 letter
Knuth-Morris-Pratt (KMP) Algorithm
17
Using the Table
Mismatch at T18
Currently 6 letters are matched
Shift P by 6 [6] = 4 letters
Knuth-Morris-Pratt (KMP) Algorithm
18
Using the Table
Finally, there it is!
Currently all 7 letters are matched
After recording this match (at T16 . . . T22 , we shift P again in
order to find other matches
Shift by 7 [7] = 7 letters
Knuth-Morris-Pratt (KMP) Algorithm
19
Computing
Observation 1: if P1 . . . P[i] is a suffix of P1 . . . Pi , then
P1 . . . P[i]1 is a suffix of P1 . . . Pi1
Observation 2: all the prefixes of P that are a suffix of
P1 . . . Pi can be obtained by recursively applying to i
Well, obviously...
e.g., P1 . . . P[i] , P1 . . . , P[[i]] , P1 . . . , P[[[i]]] are all
suffixes of P1 . . . Pi
Knuth-Morris-Pratt (KMP) Algorithm
20
Computing
A non-obvious conclusion:
First, lets write (k) [i] as [] applied k times to i
e.g., (2) [i] = [[i]]
[i] is equal to (k) [i 1] + 1, where k is the smallest integer
that satisfies P(k) [i1]+1 = Pi
If there is no such k, [i] = 0
Intuition: we look at all the prefixes of P that are suffixes of
P1 . . . Pi1 , and find the longest one whose next letter
matches Pi
Knuth-Morris-Pratt (KMP) Algorithm
21
Implementation
pi[0] = -1;
int k = -1;
for(int i = 1; i <= m; i++) {
while(k >= 0 && P[k+1] != P[i])
k = pi[k];
pi[i] = ++k;
}
Knuth-Morris-Pratt (KMP) Algorithm
22
Pattern Matching Implementation
int k = 0;
for(int i = 1; i <= n; i++) {
while(k >= 0 && P[k+1] != T[i])
k = pi[k];
k++;
if(k == m) {
// P matches T[i-m+1..i]
k = pi[k];
}
}
Knuth-Morris-Pratt (KMP) Algorithm
23
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Suffix Trie
24
Suffix Trie
Suffix trie of a string T is a rooted tree that stores all the
suffixes (thus all the substrings)
Each node corresponds to some substring of T
Each edge is associated with an alphabet
For each node that corresponds to ax, there is a special
pointer called suffix link that leads to the node corresponding
to x
Surprisingly easy to implement!
Suffix Trie
25
Example
(Figure modified from Ukkonens original paper)
Suffix Trie
26
Incremental Construction
Given the suffix tree for T1 . . . Tn
Then we append Tn+1 = a to T , creating necessary nodes
Start at node u corresponding to T1 . . . Tn
Create an a-transition to a new node v
Take the suffix link at u to go to u , corresponding to
T2 . . . Tn
Create an a-transition to a new node v
Create a suffix link from v to v
Suffix Trie
27
Incremental Construction
Repeat the previous process:
Take the suffix link at the current node
Make a new a-transition there
Create the suffix link from the previous node
Stop if the node already has an a-transition
Because from this point, all nodes that are reachable via suffix
links already have an a-transition
Suffix Trie
28
Construction Example
Given the suffix trie for aba
We want to add a new letter c
Suffix Trie
29
Construction Example
Suffix Trie
30
Construction Example
Suffix Trie
31
Construction Example
Suffix Trie
32
Construction Example
Suffix Trie
33
Construction Example
Suffix Trie
34
Construction Example
Construction time is linear in the tree size
But the tree size can be quadratic in n
e.g., T = aa . . . abb . . . b
Suffix Trie
35
Construction Example
To find P , start at the root and keep following edges labeled
with P1 , P2 , etc.
Got stuck? Then P doesnt exist in T
Suffix Trie
36
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Suffix Array
37
Suffix Array
Suffix Array
38
Suffix Array
Memory usage is O(n)
Has the same computational power as suffix trie
Can be constructed in O(n) time (!)
There is an approachable O(n log2 n) algorithm
But its hard to implement
If you want to see how it works, read the paper on the course
website
http://cs97si.stanford.edu/suffix-array.pdf
Suffix Array
39
Notes on String Problems
Always be aware of the null-terminators
Simple hash works so well in many problems
If a problem involves rotations of some string, consider
concatenating it with itself and see if it helps
Stanford team notebook has implementations of suffix arrays
and the KMP matcher
Suffix Array
40