10 String Algorithms

CS 97SI: INTRODUCTION TO PROGRAMMING CONTESTS
Jaehyun Park

Last Lecture: String Algorithms

String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Note on String Problems

String Matching Problem

Given a text $T$ and a pattern $P$, find all the occurrences of $P$ within $T$
Notations:
  $n$ and $m$: lengths of $T$ and $P$
  $\Sigma$: set of alphabets (constant size)
  $P_i$: $i$th letter of $P$ (1-indexed)
  $a$, $b$, $c$: single letters in $\Sigma$
  $x$, $y$, $z$: strings

String Matching Example

$T$ = AGCATGCTGCAGTCATGCTTAGGCTA
$P$ = GCT
A naïve method takes $O(mn)$ time
  We initiate string comparison at every starting point
  Each comparison takes $O(m)$ time
We can certainly do better!
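
As a baseline, the naïve method can be sketched as below. The function name `naive_match` is an illustrative choice, not from the slides; it reports 1-indexed starting positions to match the notation used here.

```cpp
#include <string>
#include <vector>

// Naive O(mn) matcher: try every starting point of T and compare letter by letter.
// Returns the 1-indexed starting positions of every occurrence of P in T.
std::vector<int> naive_match(const std::string& T, const std::string& P) {
    std::vector<int> hits;
    const int n = T.size(), m = P.size();
    for (int s = 0; s + m <= n; ++s) {
        int j = 0;
        while (j < m && T[s + j] == P[j]) ++j;   // each comparison costs up to m steps
        if (j == m) hits.push_back(s + 1);       // convert to 1-indexed
    }
    return hits;
}
```

On the example above, `naive_match("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT")` finds occurrences starting at positions 6, 17 and 23.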

Hash Function

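A function that takes a string and outputs a number
A good hash function has few collisions
  i.e. if $x \neq y$, then $H(x) \neq H(y)$ with high probability
An easy and powerful hash function is a polynomial mod some prime $p$
  Consider each letter as a number (ASCII value is fine)
  $H(x_1 \ldots x_k) = x_1 a^{k-1} + x_2 a^{k-2} + \cdots + x_{k-1} a + x_k \pmod{p}$ (with a fixed integer base $a$)
  How do we find $H(x_2 \ldots x_{k+1})$ from $H(x_1 \ldots x_k)$?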
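
One way to answer the question is a rolling update: subtract the contribution of the leading letter, multiply by the base, and add the new letter, all mod the prime. The sketch below assumes a base of 257 and the prime 1000000007; both constants and the name `window_hashes` are illustrative choices, not values from the slides.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Assumed parameters (not from the slides): base A and prime modulus MOD.
const std::uint64_t A = 257, MOD = 1000000007ULL;

// Polynomial hashes of all length-k windows of s, left to right.
// Each window after the first is derived from the previous one in O(1).
std::vector<std::uint64_t> window_hashes(const std::string& s, int k) {
    std::vector<std::uint64_t> out;
    const int n = s.size();
    if (k <= 0 || k > n) return out;
    std::uint64_t top = 1;                        // A^(k-1) mod MOD
    for (int i = 1; i < k; ++i) top = top * A % MOD;
    std::uint64_t h = 0;
    for (int i = 0; i < k; ++i) h = (h * A + (unsigned char)s[i]) % MOD;
    out.push_back(h);
    for (int i = k; i < n; ++i) {
        std::uint64_t lead = (unsigned char)s[i - k] * top % MOD;
        h = (h + MOD - lead) % MOD;               // drop the leading letter
        h = (h * A + (unsigned char)s[i]) % MOD;  // shift by one and append the new letter
        out.push_back(h);
    }
    return out;
}
```

Equal windows always get equal hashes; unequal windows may still collide, so a match found by hash should be confirmed by comparing the letters.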

Hash Table

Main idea: preprocess $T$ to speed up queries
  Hash every substring of length $k$
  $k$ is a small constant
For each query $P$, hash the first $k$ letters of $P$ to retrieve all the occurrences of it within $T$
Don't forget to check collisions!
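
A minimal sketch of this idea follows. It is an assumption-laden illustration: the struct name `KmerIndex` is hypothetical, and the hashing is delegated to `std::unordered_map` (which uses `std::hash` on the substring) rather than to a hand-written polynomial hash. The full comparison inside `find` plays the role of the collision check and also verifies the letters of the query beyond the first $k$.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Index of every length-k substring of T, for queries of length >= k.
struct KmerIndex {
    std::string T;
    int k;
    std::unordered_map<std::string, std::vector<int>> buckets;  // substring -> 0-indexed starts

    KmerIndex(const std::string& text, int len) : T(text), k(len) {
        for (int s = 0; s + k <= (int)T.size(); ++s)
            buckets[T.substr(s, k)].push_back(s);
    }

    // 0-indexed starting positions of P in T; empty if P is shorter than k.
    std::vector<int> find(const std::string& P) const {
        std::vector<int> hits;
        if ((int)P.size() < k) return hits;
        auto it = buckets.find(P.substr(0, k));
        if (it == buckets.end()) return hits;
        for (int s : it->second)
            if (T.compare(s, P.size(), P) == 0)  // verify the whole query, not just the key
                hits.push_back(s);
        return hits;
    }
};
```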

Hash Table

Pros:
  Easy to implement
  Significant speedup in practice
Cons:
  Doesn't help the asymptotic efficiency
  Can take $\Theta(mn)$ time if hashing is terrible
  A lot of memory consumption

Knuth-Morris-Pratt (KMP) Matcher

A linear time (!) algorithm that solves the string matching problem by preprocessing $P$ in $\Theta(m)$ time
  Main idea is to skip some comparisons by using the previous comparison result
Uses an auxiliary array $\pi$ that is defined as the following:
  $\pi[i]$ is the largest integer smaller than $i$ such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$
It's better to see an example than the definition

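$\pi$ Table Example (from CLRS)

$i$       1  2  3  4  5  6  7  8  9  10
$P_i$     a  b  a  b  a  b  a  b  c  a
$\pi[i]$  0  0  1  2  3  4  5  6  0  1

$\pi[i]$: the largest integer smaller than $i$ such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$
  e.g. $\pi[6] = 4$ since abab is a suffix of ababab
  e.g. $\pi[9] = 0$ since no prefix of length at most 8 ends with c
Let's see why this is useful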
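
The definition can be checked directly with a brute-force sketch that tries every candidate length from the longest down. The name `pi_by_definition` is illustrative; the returned vector is 1-indexed (entry 0 is unused and left at 0), and the cubic running time makes it suitable only for checking small examples.

```cpp
#include <string>
#include <vector>

// pi[i] = largest len < i such that P_1..P_len is a suffix of P_1..P_i.
// Straight from the definition; the result is 1-indexed, entry 0 is unused.
std::vector<int> pi_by_definition(const std::string& P) {
    const int m = P.size();
    std::vector<int> pi(m + 1, 0);
    for (int i = 1; i <= m; ++i)
        for (int len = i - 1; len >= 1; --len)
            // compare the prefix of length len with the last len letters of P_1..P_i
            if (P.compare(0, len, P, i - len, len) == 0) { pi[i] = len; break; }
    return pi;
}
```

For the pattern ababababca this gives $\pi[6] = 4$ and $\pi[9] = 0$, in line with the two examples above.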

Using the $\pi$ Table

$T$ = ABC ABCDAB ABCDABCDABDE
$P$ = ABCDABD
$\pi$ = (0, 0, 0, 0, 1, 2, 0)
Start matching at the first position of $T$:
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567

Mismatch at the 4th letter of $P$!

Using the $\pi$ Table

There is no point in starting the comparison at $T_2$, $T_3$
  We matched $k = 3$ letters so far
  Shift $P$ by $k - \pi[k] = 3$ letters

12345678901234567890123
ABC ABCDAB ABCDABCDABDE
   ABCDABD
   1234567

Mismatch at $T_4$ again!

Using the $\pi$ Table

We define $\pi[0] = -1$
  We matched $k = 0$ letters so far
  Shift $P$ by $k - \pi[k] = 1$ letter

12345678901234567890123
ABC ABCDAB ABCDABCDABDE
    ABCDABD
    1234567

Mismatch at $T_{11}$!

Using the $\pi$ Table

$\pi[6] = 2$ says $P_1 P_2$ is a suffix of $P_1 \ldots P_6$
Shift $P$ by $6 - \pi[6] = 4$ letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
    ABCDABD
        ||
        ABCDABD
        1234567
Again, no point in shifting $P$ by 1, 2, or 3 letters

Using the $\pi$ Table

Mismatch at $T_{11}$ again!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
        ABCDABD
        1234567

Currently 2 letters are matched
We shift $P$ by $2 = 2 - \pi[2]$ letters

Using the $\pi$ Table

Mismatch at $T_{11}$ yet again!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
          ABCDABD
          1234567

Currently no letters are matched
We shift $P$ by $1 = 0 - \pi[0]$ letters

Using the $\pi$ Table

Mismatch at $T_{18}$
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
           ABCDABD
           1234567

Currently 6 letters are matched
We shift $P$ by $4 = 6 - \pi[6]$ letters

Using the $\pi$ Table

Finally, there it is!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
               ABCDABD
               1234567

Currently all 7 letters are matched
After recording this match (match at $T_{16} \ldots T_{22}$), we shift $P$ again in order to find other matches
Shift by $7 = 7 - \pi[7]$ letters
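
The walkthrough can be replayed with a small tracer that records where $P_1$ is aligned at each step and applies the shift $k - \pi[k]$ after every mismatch or full match. This is a sketch for illustration only: the name `alignment_starts` is hypothetical, and the tracer stops as soon as $P$ no longer fits inside $T$.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the 1-indexed positions of T at which P_1 is aligned, in order.
std::vector<int> alignment_starts(const std::string& text, const std::string& pat) {
    const int n = text.size(), m = pat.size();
    std::vector<int> starts;
    if (m == 0) return starts;
    const std::string T = " " + text, P = " " + pat;  // pad so indices are 1-based
    std::vector<int> pi(m + 1);
    pi[0] = -1;                                       // sentinel, as defined above
    for (int i = 1, k = -1; i <= m; ++i) {
        while (k >= 0 && P[k + 1] != P[i]) k = pi[k];
        pi[i] = ++k;
    }
    int s = 1, k = 0;                                 // P_1 sits under T_s, k letters matched
    while (s + m - 1 <= n) {
        starts.push_back(s);
        while (k < m && T[s + k] == P[k + 1]) ++k;    // extend the match
        s += k - pi[k];                               // shift by k - pi[k]
        k = std::max(pi[k], 0);                       // letters still matched after the shift
    }
    return starts;
}
```

On the example text and pattern the recorded alignments are 1, 4, 5, 9, 11, 12 and 16, the same sequence as in the steps above.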

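Computing $\pi$

Observation 1: if $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$, then $P_1 \ldots P_{\pi[i]-1}$ is a suffix of $P_1 \ldots P_{i-1}$
  Well, obviously
Observation 2: all the prefixes of $P$ that are a suffix of $P_1 \ldots P_i$ can be obtained by recursively applying $\pi$ to $i$
  e.g. $P_1 \ldots P_{\pi[i]}$, $P_1 \ldots P_{\pi[\pi[i]]}$, $P_1 \ldots P_{\pi[\pi[\pi[i]]]}$ are all suffixes of $P_1 \ldots P_i$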

Computing $\pi$

A non-obvious conclusion:
  First, let's write $\pi^{(k)}[i]$ as $\pi[\cdot]$ applied $k$ times to $i$
    e.g. $\pi^{(2)}[i] = \pi[\pi[i]]$
  $\pi[i]$ is equal to $\pi^{(k)}[i-1] + 1$, where $k$ is the smallest integer that satisfies $P_{\pi^{(k)}[i-1]+1} = P_i$
  If there is no such $k$, $\pi[i] = 0$
Intuition: we look at all the prefixes of $P$ that are suffixes of $P_1 \ldots P_{i-1}$ and find the longest one whose next letter matches $P_i$ too

Implementation
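    // P is 1-indexed: P[1..m]; pi[0] = -1 is a sentinel
    pi[0] = -1;
    int k = -1;
    for(int i = 1; i <= m; i++) {
        // follow pi until P[k+1] extends the matched prefix (or k hits -1)
        while(k >= 0 && P[k+1] != P[i])
            k = pi[k];
        pi[i] = ++k;
    }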

Pattern Matching Implementation


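    int k = 0;  // number of letters of P currently matched
    for(int i = 1; i <= n; i++) {
        // on a mismatch, fall back to the next shorter matching prefix of P
        while(k >= 0 && P[k+1] != T[i])
            k = pi[k];
        k++;
        if(k == m) {
            // P matches T[i-m+1..i]
            k = pi[k];
        }
    }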
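
The two fragments above assume 1-indexed arrays and declarations made elsewhere. A self-contained variant is sketched below with 0-indexed `std::string` inputs. It uses the common convention without the $\pi[0] = -1$ sentinel, so `pi[i]` here describes the prefix ending at 0-indexed position `i`; the function names are illustrative, not from the slides.

```cpp
#include <string>
#include <vector>

// pi[i] = length of the longest proper prefix of P[0..i] that is also its suffix.
std::vector<int> prefix_function(const std::string& P) {
    const int m = P.size();
    std::vector<int> pi(m, 0);
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && P[k] != P[i]) k = pi[k - 1];
        if (P[k] == P[i]) ++k;
        pi[i] = k;
    }
    return pi;
}

// 1-indexed starting positions of every occurrence of P in T, in O(n + m) time.
std::vector<int> kmp_match(const std::string& T, const std::string& P) {
    std::vector<int> hits;
    const int n = T.size(), m = P.size();
    if (m == 0) return hits;
    const std::vector<int> pi = prefix_function(P);
    for (int i = 0, k = 0; i < n; ++i) {
        while (k > 0 && P[k] != T[i]) k = pi[k - 1];  // fall back on mismatch
        if (P[k] == T[i]) ++k;
        if (k == m) {
            hits.push_back(i - m + 2);                // 1-indexed start of the match
            k = pi[k - 1];                            // keep going to find later matches
        }
    }
    return hits;
}
```

With the earlier example, `kmp_match("ABC ABCDAB ABCDABCDABDE", "ABCDABD")` reports a single match starting at position 16.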

Suffix Trie

Suffix trie of a string $T$ is a rooted tree that stores all the suffixes (thus all the substrings)
Each node corresponds to some substring of $T$
Each edge is associated with a letter of the alphabet
For each node that corresponds to $ax$, there is a special pointer called suffix link that leads to the node corresponding to $x$
Surprisingly easy to implement!

Suffix Trie Example

(Figure modified from Ukkonen's original paper)

Incremental Construction

Given the suffix trie for $T_1 \ldots T_n$
  Then we append $T_{n+1} = a$ to $T$, creating necessary nodes
Start at node $u$ corresponding to $T_1 \ldots T_n$
  Create an $a$-transition to a new node $v$
Take the suffix link at $u$ to go to $u'$, corresponding to $T_2 \ldots T_n$
  Create an $a$-transition to a new node $v'$
  Create a suffix link from $v$ to $v'$

Incremental Construction

We repeat the previous process:
  Take the suffix link at the current node
  Make a new $a$-transition there
  Create the suffix link from the previous node
We stop if the node already has an $a$-transition
  Because from this point, all nodes that are reachable via suffix links already have an $a$-transition
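
The incremental construction can be sketched as follows. This is one possible realization under stated assumptions: the names `TrieNode`, `SuffixTrie`, `append` and `contains` are illustrative, nodes live in a vector and are referred to by index, the root's suffix link is represented by -1, and `std::map` is used for transitions for brevity rather than speed.

```cpp
#include <map>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, int> next;  // letter -> child index
    int link = -1;             // suffix link; -1 means "none" (only the root)
};

struct SuffixTrie {
    std::vector<TrieNode> nodes;  // nodes[0] is the root (empty string)
    int top = 0;                  // node corresponding to the whole string so far

    explicit SuffixTrie(const std::string& s = "") : nodes(1) {
        for (char c : s) append(c);
    }

    // Append one letter, walking suffix links from the deepest node.
    void append(char c) {
        int cur = top, prev_new = -1;
        while (cur != -1 && !nodes[cur].next.count(c)) {
            const int v = nodes.size();
            nodes.emplace_back();
            nodes[cur].next[c] = v;                        // new c-transition
            if (prev_new != -1) nodes[prev_new].link = v;  // link from the previous new node
            prev_new = v;
            cur = nodes[cur].link;                         // take the suffix link
        }
        // Stopped at a node that already has a c-transition, or fell off the root.
        if (prev_new != -1)
            nodes[prev_new].link = (cur == -1) ? 0 : nodes[cur].next[c];
        top = nodes[top].next[c];
    }

    // True if P is a substring: follow edges labeled P_1, P_2, ... from the root.
    bool contains(const std::string& P) const {
        int cur = 0;
        for (char c : P) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) return false;
            cur = it->second;
        }
        return true;
    }
};
```

Building the trie for aba produces 6 nodes (the root plus the 5 distinct substrings); appending c adds 4 more.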

Construction Example

[Figures: suffix trie diagrams with edges labeled a, b and c; images not recovered]
Given the suffix trie for aba, we want to add a new letter c
1. Start at the green node (highlighted in the figure) and make a c-transition
2. Then follow the suffix link
3. Make a c-transition at the node reached
4. Make a suffix link from the node created in step 1 to the node created in step 3

Suffix Trie Analysis

Construction time is linear in the tree size
But the tree size can be quadratic in $n$
  e.g. $T = aa \ldots abb \ldots b$

Pattern Matching

To find $P$, start at the root and keep following edges labeled with $P_1$, $P_2$, etc.
Got stuck? Then $P$ doesn't exist in $T$

Suffix Array
Input string: BANANA

Get all suffixes:
1  BANANA
2  ANANA
3  NANA
4  ANA
5  NA
6  A

Sort the suffixes:
6  A
4  ANA
2  ANANA
1  BANANA
5  NA
3  NANA

Take the indices: 6, 4, 2, 1, 5, 3
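
The three steps translate almost literally into code. The sketch below sorts the suffix start positions by comparing the suffixes directly, which costs $O(n^2 \log n)$ in the worst case and is meant only to mirror the example; the name `suffix_array_naive` is illustrative.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by direct sorting; returns 1-indexed suffix start positions.
std::vector<int> suffix_array_naive(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // all suffix starts, 0-indexed
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        // lexicographic comparison of the suffixes starting at a and b
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    for (int& i : sa) ++i;               // convert to 1-indexed
    return sa;
}
```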

Suffix Array

Memory usage is $O(n)$
Has the same computational power as suffix trie
Can be constructed in $O(n)$ time (!)
  But it's hard to implement
There is an approachable $O(n \log^2 n)$ algorithm
  If you want to see how it works, read the paper on the course website
  http://cs97si.stanford.edu/suffix-array.pdf
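
A common way to reach $O(n \log^2 n)$ is prefix doubling: sort suffixes by their first 1, 2, 4, ... letters, reusing the ranks from the previous round as sort keys. The sketch below shows that technique; whether it matches the paper linked above line for line is an assumption, and the function name is illustrative. It returns 0-indexed positions.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by prefix doubling; returns 0-indexed suffix start positions.
std::vector<int> suffix_array_doubling(const std::string& s) {
    const int n = s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; ++i) rank[i] = (unsigned char)s[i];
    for (int len = 1;; len *= 2) {
        // Compare suffixes by (rank of first len letters, rank of the next len letters).
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            const int ra = a + len < n ? rank[a + len] : -1;  // -1: suffix ends early
            const int rb = b + len < n ? rank[b + len] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)                           // re-rank; ties share a rank
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;                  // all ranks distinct: done
    }
    return sa;
}
```

Each round costs one $O(n \log n)$ sort and there are at most about $\log n$ rounds, which gives the stated bound.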

Note on String Problems

Always be aware of the null-terminators
Simple hash works so well in many problems
  Even for problems that aren't supposed to be solved by hashing
If a problem involves rotations of a string, consider concatenating it with itself and see if it helps
Stanford team notebook has implementations of suffix arrays and the KMP matcher
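
As a small illustration of the rotation tip: a string $y$ is a rotation of $x$ exactly when the two have the same length and $y$ occurs inside $xx$. The sketch below uses `std::string::find` for the search, and the name `is_rotation` is illustrative; a linear-time matcher such as KMP could be substituted when the strings are long.

```cpp
#include <string>

// True if b is a rotation of a: same length and b appears inside a + a.
bool is_rotation(const std::string& a, const std::string& b) {
    return a.size() == b.size() && (a + a).find(b) != std::string::npos;
}
```

For example, NANABA is a rotation of BANANA because it appears inside BANANABANANA.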
