CS 97SI: INTRODUCTION TO
PROGRAMMING CONTESTS
Jaehyun Park
Last Lecture: String Algorithms
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Note on String Problems
String Matching Problem
Given a text $T$ and a pattern $P$, find all the
occurrences of $P$ within $T$
Notations:
$n$ and $m$: lengths of $T$ and $P$
$\Sigma$: the alphabet (set of letters), of constant size
$P_i$: $i$th letter of $P$ (1-indexed)
$a$, $b$, $c$: single letters in $\Sigma$
$x$, $y$, $z$: strings
String Matching Example
$T$ = AGCATGCTGCAGTCATGCTTAGGCTA
$P$ = GCT
A naive method takes $O(mn)$ time
We initiate string comparison at every starting point
Each comparison takes $O(m)$ time
We can certainly do better!
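The naive method above can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `naive_match` is made up here, and positions are reported 0-indexed (C++ convention) rather than 1-indexed as in the notation.

```cpp
#include <string>
#include <vector>

// Naive matcher: start a comparison at every position s of T,
// each comparison looks at up to m letters, so O(mn) in total.
// Returns the 0-indexed starting positions of all occurrences.
std::vector<int> naive_match(const std::string& T, const std::string& P) {
    std::vector<int> occ;
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    for (int s = 0; s + m <= n; s++) {
        int j = 0;
        while (j < m && T[s + j] == P[j]) j++;  // compare letter by letter
        if (j == m) occ.push_back(s);           // all m letters matched
    }
    return occ;
}
```

On the example above, `naive_match("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT")` returns the positions 5, 16 and 22 (0-indexed).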
Hash Function
A function that takes a string and outputs a number
A good hash function has few collisions
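i.e., if $x \ne y$, then $H(x) \ne H(y)$ with high probability
An easy and powerful hash function is a polynomial
mod some prime $p$
Consider each letter as a number (ASCII value is fine)
$H(x_1 \ldots x_k) = x_1 a^{k-1} + x_2 a^{k-2} + \cdots + x_{k-1} a + x_k \pmod{p}$, where $a$ is a fixed integer base
How do we find $H(x_2 \ldots x_{k+1})$ from $H(x_1 \ldots x_k)$?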
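One way to answer the last question is to subtract the leading term, multiply by the base, and add the new letter. The sketch below assumes a base of 257 and the prime 1000000007; both values, and the names `poly_hash` and `window_hashes`, are choices made for this illustration and do not come from the lecture.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Assumed parameters (not from the slides): base A and prime modulus MOD.
const std::uint64_t A = 257;
const std::uint64_t MOD = 1000000007ULL;

// H(x_1 ... x_k) = x_1 A^(k-1) + ... + x_{k-1} A + x_k (mod MOD), via Horner's rule.
std::uint64_t poly_hash(const std::string& x) {
    std::uint64_t h = 0;
    for (unsigned char c : x) h = (h * A + c) % MOD;
    return h;
}

// Hashes of every length-k window of s, each derived from the previous one in O(1).
std::vector<std::uint64_t> window_hashes(const std::string& s, int k) {
    std::vector<std::uint64_t> out;
    const int n = static_cast<int>(s.size());
    if (k <= 0 || k > n) return out;
    std::uint64_t top = 1;  // A^(k-1) mod MOD
    for (int i = 1; i < k; i++) top = top * A % MOD;
    std::uint64_t h = poly_hash(s.substr(0, k));
    out.push_back(h);
    for (int i = k; i < n; i++) {
        std::uint64_t lead = static_cast<unsigned char>(s[i - k]) * top % MOD;
        h = (h + MOD - lead) % MOD;                             // drop x_1 A^(k-1)
        h = (h * A + static_cast<unsigned char>(s[i])) % MOD;   // shift, append x_{k+1}
        out.push_back(h);
    }
    return out;
}
```

Every intermediate product stays far below $2^{64}$ because both factors are smaller than the modulus.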
Hash Table
Main idea: preprocess $T$ to speed up queries
Hash every substring of length $k$
$k$ is a small constant
For each query $P$, hash the first $k$ letters of $P$ to
retrieve all the occurrences of it within $T$
Don't forget to check collisions!
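A minimal sketch of this preprocessing idea is shown below. It is an illustration under assumptions: the struct name `KGramIndex` is made up, and `std::unordered_map` keyed by the substring itself stands in for a hand-written polynomial hash, so the container resolves hash collisions on the first $k$ letters. The explicit comparison in `find_all` is still needed to verify the rest of the pattern.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Index of all length-k substrings of T, mapping each one to its
// 0-indexed starting positions.
struct KGramIndex {
    std::string T;
    int k;
    std::unordered_map<std::string, std::vector<int>> table;

    KGramIndex(const std::string& text, int k_) : T(text), k(k_) {
        for (int s = 0; s + k <= static_cast<int>(T.size()); s++)
            table[T.substr(s, k)].push_back(s);
    }

    // Look up the first k letters of P, then verify each candidate position.
    // Patterns shorter than k are not supported and yield no matches.
    std::vector<int> find_all(const std::string& P) const {
        std::vector<int> occ;
        if (static_cast<int>(P.size()) < k) return occ;
        auto it = table.find(P.substr(0, k));
        if (it == table.end()) return occ;
        for (int s : it->second)
            if (T.compare(s, P.size(), P) == 0) occ.push_back(s);
        return occ;
    }
};
```

With `KGramIndex idx("AGCATGCTGCAGTCATGCTTAGGCTA", 2)`, the query `idx.find_all("GCT")` inspects only the five positions where GC occurs instead of all starting points.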
Hash Table
Pros:
Easy to implement
Significant speedup in practice
Cons:
Doesn't help the asymptotic efficiency
Can take $O(nm)$ time if hashing is terrible
A lot of memory consumption
Knuth-Morris-Pratt (KMP) Matcher
A linear time (!) algorithm that solves the string
matching problem by preprocessing $P$ in $O(m)$ time
Main idea is to skip some comparisons by using the
previous comparison result
Uses an auxiliary array $\pi$ that is defined as the
following:
$\pi[i]$ is the largest integer smaller than $i$ such that
$P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$
It's better to see an example than the definition
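$\pi$ Table Example (from CLRS)
$i$        1  2  3  4  5  6  7  8  9  10
$P_i$      a  b  a  b  a  b  a  b  c  a
$\pi[i]$   0  0  1  2  3  4  5  6  0  1
$\pi[i]$: the largest integer smaller than $i$ such that
$P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$
e.g., $\pi[6] = 4$ since abab is a suffix of ababab
e.g., $\pi[9] = 0$ since no prefix of length $\le 8$ ends with c
Let's see why this is useful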
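The definition can be checked directly with a brute-force sketch. This is only meant for verifying small tables by hand: it is much slower than the linear-time computation, and the name `pi_by_definition` is made up for this illustration.

```cpp
#include <string>
#include <vector>

// pi[i] (1-indexed, pi[0] unused and left at 0) is the largest k < i such that
// P_1 ... P_k is a suffix of P_1 ... P_i. Computed straight from the definition.
std::vector<int> pi_by_definition(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> pi(m + 1, 0);
    for (int i = 1; i <= m; i++) {
        for (int k = i - 1; k >= 1; k--) {
            // With 0-indexed storage the prefix is P[0..k) and the suffix is P[i-k..i).
            if (P.compare(0, k, P, i - k, k) == 0) {
                pi[i] = k;
                break;
            }
        }
    }
    return pi;
}
```

For the pattern ABCDABD this yields the values 0, 0, 0, 0, 1, 2, 0 for $i = 1, \ldots, 7$.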
Using the $\pi$ Table
$T$ = ABC ABCDAB ABCDABCDABDE
$P$ = ABCDABD
$\pi$ = (0, 0, 0, 0, 1, 2, 0)
Start matching at the first position of $T$:
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567
Mismatch at the 4th letter of $P$!
Using the $\pi$ Table
There is no point in starting the comparison at $T_2$, $T_3$
We matched $k = 3$ letters so far
Shift $P$ by $k - \pi[k] = 3$ letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
   ABCDABD
   1234567
Mismatch at $T_4$ again!
Using the $\pi$ Table
We define $\pi[0] = -1$
We matched $k = 0$ letters so far
Shift $P$ by $k - \pi[k] = 1$ letter
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
    ABCDABD
    1234567
Mismatch at $T_{11}$!
Using the $\pi$ Table
$\pi[6] = 2$ says $P_1 P_2$ is a suffix of $P_1 \ldots P_6$
Shift $P$ by $6 - \pi[6] = 4$ letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
    ABCDABD
        ||
        ABCDABD
        1234567
Again, no point in shifting $P$ by 1, 2, or 3 letters
Using the $\pi$ Table
Mismatch at $T_{11}$ again!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
        ABCDABD
        1234567
Currently 2 letters are matched
We shift $P$ by $2 = 2 - \pi[2]$ letters
Using the $\pi$ Table
Mismatch at $T_{11}$ yet again!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
          ABCDABD
          1234567
Currently no letters are matched
We shift $P$ by $1 = 0 - \pi[0]$ letter
Using the $\pi$ Table
Mismatch at $T_{18}$
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
           ABCDABD
           1234567
Currently 6 letters are matched
We shift $P$ by $4 = 6 - \pi[6]$ letters
Using the $\pi$ Table
Finally, there it is!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
               ABCDABD
               1234567
Currently all 7 letters are matched
After recording this match (match at $T_{16} \ldots T_{22}$), we shift
$P$ again in order to find other matches
Shift by $7 = 7 - \pi[7]$ letters
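The walkthrough above can be replayed with a small sketch that records every alignment of $P$ against $T$, shifting by $k - \pi[k]$ each time. The function name `kmp_alignments` is made up for this illustration; it stores the strings 0-indexed internally but reports the 1-indexed start positions used on the slides.

```cpp
#include <string>
#include <vector>

// Returns the 1-indexed positions s at which P is aligned with T_s ... T_{s+m-1},
// in the order visited when shifting by k - pi[k] after each mismatch or match.
std::vector<int> kmp_alignments(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    // pi is 1-indexed with pi[0] = -1; P itself is stored 0-indexed,
    // so the letter P_j of the slides is P[j - 1] here.
    std::vector<int> pi(m + 1, 0);
    pi[0] = -1;
    for (int i = 1, k = -1; i <= m; i++) {
        while (k >= 0 && P[k] != P[i - 1]) k = pi[k];
        pi[i] = ++k;
    }
    std::vector<int> starts;
    int s = 1;  // current alignment
    int k = 0;  // number of letters currently matched
    while (s + m - 1 <= n) {
        starts.push_back(s);
        while (k < m && T[s - 1 + k] == P[k]) k++;  // extend the match
        s += k - pi[k];                             // shift by k - pi[k] (always >= 1)
        k = (pi[k] < 0) ? 0 : pi[k];                // letters still known to match
    }
    return starts;
}
```

For $T$ = ABC ABCDAB ABCDABCDABDE and $P$ = ABCDABD the alignments visited are 1, 4, 5, 9, 11, 12, 16, the same sequence as in the slides.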
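Computing $\pi$
Observation 1: if $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$,
then $P_1 \ldots P_{\pi[i]-1}$ is a suffix of $P_1 \ldots P_{i-1}$
Well, obviously
Observation 2: all the prefixes of $P$ that are a
suffix of $P_1 \ldots P_i$ can be obtained by recursively
applying $\pi$ to $i$
e.g., $P_1 \ldots P_{\pi[i]}$, $P_1 \ldots P_{\pi[\pi[i]]}$, $P_1 \ldots P_{\pi[\pi[\pi[i]]]}$
are all suffixes of $P_1 \ldots P_i$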
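Computing $\pi$
A non-obvious conclusion:
First, let's write $\pi^{(k)}[i]$ as $\pi[\cdot]$ applied $k$ times to $i$
e.g., $\pi^{(2)}[i] = \pi[\pi[i]]$
$\pi[i]$ is equal to $\pi^{(k)}[i-1] + 1$, where $k$ is the smallest
integer that satisfies $P_{\pi^{(k)}[i-1]+1} = P_i$
If there is no such $k$, $\pi[i] = 0$
Intuition: we look at all the prefixes of $P$ that are
suffixes of $P_1 \ldots P_{i-1}$ and find the longest one
whose next letter matches $P_i$ too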
Implementation
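// P is 1-indexed here (P[1..m]); pi needs room for indices 0..m
pi[0] = -1;
int k = -1;
for(int i = 1; i <= m; i++) {
    // k equals pi[i-1] here; follow pi until P[k+1] matches P[i]
    while(k >= 0 && P[k+1] != P[i])
        k = pi[k];
    pi[i] = ++k;
}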
Pattern Matching Implementation
// T is 1-indexed here (T[1..n]); k is the number of letters of P matched so far
int k = 0;
for(int i = 1; i <= n; i++) {
    while(k >= 0 && P[k+1] != T[i])
        k = pi[k];
    k++;
    if(k == m) {
        // P matches T[i-m+1..i]
        k = pi[k];
    }
}
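The two fragments above can be combined into one self-contained function. The sketch below keeps the 1-indexed loops unchanged by padding both strings with a dummy character at index 0; the padding trick, the name `kmp_search`, and the choice to report nothing for an empty pattern are assumptions of this illustration.

```cpp
#include <string>
#include <vector>

// Returns the 1-indexed starting positions of all occurrences of pattern in text.
std::vector<int> kmp_search(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> found;
    if (m == 0) return found;           // empty pattern: report nothing (a choice made here)
    const std::string T = " " + text;   // pad so that T[1..n] and P[1..m] are 1-indexed
    const std::string P = " " + pattern;

    // Compute the pi table.
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    int k = -1;
    for (int i = 1; i <= m; i++) {
        while (k >= 0 && P[k + 1] != P[i]) k = pi[k];
        pi[i] = ++k;
    }

    // Scan the text, reusing pi to skip comparisons.
    k = 0;
    for (int i = 1; i <= n; i++) {
        while (k >= 0 && P[k + 1] != T[i]) k = pi[k];
        k++;
        if (k == m) {
            found.push_back(i - m + 1);  // P matches T[i-m+1..i]
            k = pi[k];
        }
    }
    return found;
}
```

For the walkthrough example, `kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD")` reports the single match starting at position 16. Overlapping matches are found as well: searching for aa in aaaa reports positions 1, 2 and 3.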
Suffix Trie
Suffix trie of a string $T$ is a rooted tree that stores
all the suffixes (thus all the substrings)
Each node corresponds to some substring of $T$
Each edge is associated with a letter of the alphabet
For each node that corresponds to $ax$, there is a
special pointer called suffix link that leads to the
node corresponding to $x$
Surprisingly easy to implement!
Suffix Trie Example
(Figure modified from Ukkonen's original paper)
Incremental Construction
Given the suffix trie for $T_1 \ldots T_n$
Then we append $T_{n+1} = a$ to $T$, creating necessary
nodes
Start at node $u$ corresponding to $T_1 \ldots T_n$
Create an $a$-transition to a new node $v$
Take the suffix link at $u$ to go to $u'$, corresponding
to $T_2 \ldots T_n$
Create an $a$-transition to a new node $v'$
Create a suffix link from $v$ to $v'$
Incremental Construction
We repeat the previous process:
Take the suffix link at the current node
Make a new $a$-transition there
Create the suffix link from the previous node
We stop if the node already has an $a$-transition
Because from this point, all nodes that are reachable
via suffix links already have an $a$-transition
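The incremental construction can be sketched as below. The struct and member names (`SuffixTrie`, `append`, `deepest`) are made up for this illustration. One detail is filled in as an assumption: when the walk stops, the last node created gets its suffix link set to the already existing child, or to the root if the walk fell off the root.

```cpp
#include <map>
#include <string>
#include <vector>

struct SuffixTrie {
    struct Node {
        std::map<char, int> next;  // outgoing edges, one per letter
        int link = -1;             // suffix link (-1 only at the root)
    };
    std::vector<Node> nodes;  // nodes[0] is the root (empty string)
    int deepest = 0;          // node corresponding to the whole string so far

    SuffixTrie() : nodes(1) {}

    // Append one letter a to the string.
    void append(char a) {
        int u = deepest;
        int prev = -1;   // last node created, still waiting for its suffix link
        int first = -1;  // first node created: it represents the extended string
        while (u != -1 && nodes[u].next.count(a) == 0) {
            int v = static_cast<int>(nodes.size());
            nodes.emplace_back();
            nodes[u].next[a] = v;                  // a-transition to a new node
            if (prev != -1) nodes[prev].link = v;  // suffix link from the previous new node
            else first = v;
            prev = v;
            u = nodes[u].link;                     // take the suffix link
        }
        // Stopped at a node that already has an a-transition, or fell off the root.
        if (prev != -1) nodes[prev].link = (u == -1) ? 0 : nodes[u].next[a];
        if (first != -1) deepest = first;
    }
};
```

Appending the letters of aba one at a time gives 6 nodes (the root plus the substrings a, b, ab, ba, aba); appending c afterwards creates four more nodes, for abac, bac, ac and c.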
Construction Example
(Figure: the suffix trie for aba, extended step by step)
Given the suffix trie for aba
We want to add a new letter c
1. Start at the green node and make a c-transition
2. Then follow the suffix link
3. Make a c-transition at the node reached
4. Make a suffix link from the previous new node to the new one
Suffix Trie Analysis
Construction time is linear in the tree size
But the tree size can be quadratic in $n$
e.g., $T = a^{n/2} b^{n/2}$, such as aaabbb
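The quadratic growth can be observed by counting distinct substrings, since the suffix trie has one node per distinct substring plus the root. The sketch below is a brute-force counter for small inputs only, and the name `trie_node_count` is made up for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Number of suffix trie nodes of T: one per distinct substring,
// with the empty string standing for the root.
std::size_t trie_node_count(const std::string& T) {
    std::set<std::string> subs;
    subs.insert("");
    for (std::size_t i = 0; i < T.size(); i++)
        for (std::size_t len = 1; i + len <= T.size(); len++)
            subs.insert(T.substr(i, len));
    return subs.size();
}
```

For a string made of $h$ a's followed by $h$ b's the distinct substrings are exactly $a^i b^j$ with $0 \le i, j \le h$, so the count is $(h+1)^2$: 16 nodes for aaabbb and 121 nodes for $h = 10$.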
Pattern Matching
To find $P$, start at the root and keep following
edges labeled with $P_1$, $P_2$, etc.
Got stuck? Then $P$ doesn't exist in $T$
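The lookup can be sketched as follows. To keep the example short, the trie here is built the simple way, by inserting every suffix letter by letter, rather than with suffix links; the struct name `Trie` and its members are made up for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Trie {
    std::vector<std::map<char, int>> next;  // next[u][c] = child of node u along letter c

    // Build the suffix trie of T by inserting each suffix (quadratic time).
    explicit Trie(const std::string& T) : next(1) {
        for (std::size_t s = 0; s < T.size(); s++) {
            int u = 0;
            for (std::size_t i = s; i < T.size(); i++) {
                auto it = next[u].find(T[i]);
                if (it != next[u].end()) {
                    u = it->second;
                } else {
                    int v = static_cast<int>(next.size());
                    next.emplace_back();
                    next[u][T[i]] = v;
                    u = v;
                }
            }
        }
    }

    // Follow edges labeled P_1, P_2, ...; getting stuck means P does not occur in T.
    bool contains(const std::string& P) const {
        int u = 0;
        for (char c : P) {
            auto it = next[u].find(c);
            if (it == next[u].end()) return false;
            u = it->second;
        }
        return true;
    }
};
```

With `Trie t("BANANA")`, the call `t.contains("NAN")` succeeds while `t.contains("NAB")` gets stuck at the third letter. The lookup itself takes time proportional to the length of $P$, independent of the length of $T$.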
Suffix Array
Input string
BANANA
Get all suffixes
1 BANANA
2 ANANA
3 NANA
4 ANA
5 NA
6 A
Sort the suffixes
6 A
4 ANA
2 ANANA
1 BANANA
5 NA
3 NANA
Take the indices
6,4,2,1,5,3
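The three steps above translate almost literally into code. The sketch below sorts the suffix start positions by comparing the suffixes directly, which is simple but can take $O(n^2 \log n)$ time in the worst case; the name `suffix_array_naive` is made up for this illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Returns the 1-indexed starting positions of the suffixes of T in sorted order.
std::vector<int> suffix_array_naive(const std::string& T) {
    std::vector<int> sa(T.size());
    std::iota(sa.begin(), sa.end(), 1);  // 1, 2, ..., n
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return T.compare(a - 1, std::string::npos, T, b - 1, std::string::npos) < 0;
    });
    return sa;
}
```

For BANANA this returns 6, 4, 2, 1, 5, 3, matching the table.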
Suffix Array
Memory usage is $O(n)$
Has the same computational power as suffix trie
Can be constructed in $O(n)$ time (!)
But it's hard to implement
There is an approachable $O(n \log^2 n)$ algorithm
If you want to see how it works, read the paper on the
course website
http://cs97si.stanford.edu/suffix-array.pdf
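One common way to reach $O(n \log^2 n)$ is prefix doubling: sort the suffixes by their first letter, then by their first 2, 4, 8, ... letters, reusing the ranks from the previous round as sort keys. The sketch below illustrates that general technique and is not necessarily the exact algorithm described in the linked paper; the name `suffix_array` is made up for this illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Returns the 1-indexed starting positions of the suffixes of T in sorted order.
std::vector<int> suffix_array(const std::string& T) {
    const int n = static_cast<int>(T.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank[i] = static_cast<unsigned char>(T[i]);
    // Invariant: at the start of a round, rank orders suffixes by their first len letters.
    for (int len = 1;; len *= 2) {
        // Key of suffix i: rank of its first len letters, then of the next len letters
        // (-1 if the suffix is too short, so shorter suffixes sort first).
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + len < n ? rank[i + len] : -1);
        };
        std::sort(sa.begin(), sa.end(), [&](int a, int b) { return key(a) < key(b); });
        if (n > 0) tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (key(sa[i - 1]) < key(sa[i]) ? 1 : 0);
        rank = tmp;  // ranks now reflect the first 2 * len letters
        if (n == 0 || rank[sa[n - 1]] == n - 1 || len >= n) break;  // all ranks distinct
    }
    for (int& s : sa) s++;  // convert to 1-indexed positions
    return sa;
}
```

There are $O(\log n)$ rounds and each round is dominated by an $O(n \log n)$ sort, which gives the $O(n \log^2 n)$ bound.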
Note on String Problems
Always be aware of the null-terminators
Simple hash works so well in many problems
Even for problems that aren't supposed to be solved by
hashing
If a problem involves rotations of a string, consider
concatenating it with itself and see if it helps
Stanford team notebook has implementations of
suffix arrays and the KMP matcher
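As an example of the rotation tip: every rotation of a string $x$ appears as a substring of $xx$, so a rotation test reduces to a length check plus one substring search. The name `is_rotation` is made up for this illustration, and `std::string::find` stands in for whichever matcher you prefer (KMP would guarantee linear time).

```cpp
#include <string>

// True if y can be obtained from x by moving some prefix of x to its end.
bool is_rotation(const std::string& x, const std::string& y) {
    return x.size() == y.size() && (x + x).find(y) != std::string::npos;
}
```

For instance, cdeab is a rotation of abcde because it occurs inside abcdeabcde, while abced is not.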