10 String Algorithms


CS 97SI: INTRODUCTION TO PROGRAMMING CONTESTS
Jaehyun Park

Last Lecture: String Algorithms

String Matching Problem

Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array
Note on String Problems

String Matching Problem

Given a text T and a pattern P, find all the occurrences of P within T
Notations:
n and m: lengths of T and P
Σ: set of alphabets (constant size)
P[i]: i-th letter of P (1-indexed)
a, b, c: single letters in Σ
x, y, z: strings

String Matching Example

T = AGCATGCTGCAGTCATGCTTAGGCTA
P = GCT

A naive method takes O(nm) time
Initiate string comparison at every starting point
Each comparison takes O(m) time

We can certainly do better!
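
As a baseline, the naive method can be sketched as follows. This is a minimal illustration, not code from the lecture; the function name naive_match is hypothetical, and positions are reported 1-indexed to match the notation used here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the 1-indexed starting positions of every occurrence of P in T.
// Tries every starting point and compares letter by letter: O(nm) worst case.
std::vector<int> naive_match(const std::string& T, const std::string& P) {
    assert(!P.empty());
    std::vector<int> hits;
    const int n = (int)T.size(), m = (int)P.size();
    for (int s = 0; s + m <= n; s++) {
        int j = 0;
        while (j < m && T[s + j] == P[j]) j++;  // compare until a mismatch
        if (j == m) hits.push_back(s + 1);      // convert to 1-indexed
    }
    return hits;
}
```

On the example text with the pattern GCT, this reports the starting positions 6, 17 and 23.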

Hash Function

A function that takes a string and outputs a number

A good hash function has few collisions
i.e. if x ≠ y, then H(x) ≠ H(y) with high probability

An easy and powerful hash function is a polynomial mod some prime p
Consider each letter as a number (ASCII value is fine)
H(x[1..k]) = x[1]*a^(k-1) + x[2]*a^(k-2) + ... + x[k-1]*a + x[k]  (mod p), for a fixed base a
How do we find H(x[2..k+1]) from H(x[1..k])?
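
One way to answer the question: subtract the contribution of the leading letter, multiply by the base, and add the new letter, all mod the prime. The sketch below assumes a base of 257 and a modulus of 1000000007; both values, and the names hash_direct and rolling_hashes, are my own choices for illustration and do not come from the slides.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

const uint64_t MOD = 1000000007ULL;  // prime modulus (assumed value)
const uint64_t BASE = 257ULL;        // polynomial base (assumed value)

// Polynomial hash of s[pos .. pos+k-1], computed directly in O(k).
uint64_t hash_direct(const std::string& s, int pos, int k) {
    uint64_t h = 0;
    for (int i = 0; i < k; i++)
        h = (h * BASE + (unsigned char)s[pos + i]) % MOD;
    return h;
}

// Hashes of all length-k windows of s; each one is derived from the
// previous window in O(1).
std::vector<uint64_t> rolling_hashes(const std::string& s, int k) {
    assert(k > 0);
    std::vector<uint64_t> out;
    if ((int)s.size() < k) return out;
    uint64_t top = 1;  // BASE^(k-1) mod MOD
    for (int i = 1; i < k; i++) top = top * BASE % MOD;
    uint64_t h = hash_direct(s, 0, k);
    out.push_back(h);
    for (int i = k; i < (int)s.size(); i++) {
        uint64_t lead = (unsigned char)s[i - k] * top % MOD;
        h = (h + MOD - lead) % MOD;                  // drop the leading letter
        h = (h * BASE + (unsigned char)s[i]) % MOD;  // shift, append new letter
        out.push_back(h);
    }
    return out;
}
```

Equal windows always get equal hashes; unequal windows can still collide, so a hash match should be confirmed by comparing the actual letters.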

Hash Table

Main idea: preprocess T to speed up queries
Hash every substring of length k
k is a small constant

For each query P, hash the first k letters of P to retrieve all the occurrences of it within T
Don't forget to check collisions!
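
A minimal sketch of this idea, assuming std::unordered_map keyed by the length-k substring itself (the standard library then does the hashing). The name KGramIndex and its members are hypothetical. Because only the first k letters of a query are looked up, every candidate position is verified against the whole pattern.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Maps every length-k substring of T to the list of its 0-indexed starts.
struct KGramIndex {
    std::string T;
    int k;
    std::unordered_map<std::string, std::vector<int>> table;

    KGramIndex(const std::string& text, int k_) : T(text), k(k_) {
        assert(k > 0);
        for (int s = 0; s + k <= (int)T.size(); s++)
            table[T.substr(s, k)].push_back(s);
    }

    // Returns the 0-indexed start positions of P in T; requires |P| >= k.
    std::vector<int> find(const std::string& P) const {
        assert((int)P.size() >= k);
        std::vector<int> hits;
        auto it = table.find(P.substr(0, k));
        if (it == table.end()) return hits;
        for (int s : it->second)
            // The table only vouches for the first k letters; check the rest.
            if (T.compare(s, P.size(), P) == 0) hits.push_back(s);
        return hits;
    }
};
```

Storing the substrings as keys costs O(nk) memory, which is one reason to keep k small.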

Hash Table

Pros:
Easy to implement
Significant speedup in practice

Cons:
Doesn't help the asymptotic efficiency
Can take O(nm) time if hashing is terrible
A lot of memory consumption

Knuth-Morris-Pratt (KMP) Matcher

A linear time (!) algorithm that solves the string matching problem by preprocessing P in O(m) time
Main idea is to skip some comparisons by using the previous comparison result

Uses an auxiliary array π that is defined as the following:
π[i] is the largest integer smaller than i such that P[1..π[i]] is a suffix of P[1..i]

It's better to see an example than the definition

π Table Example (from CLRS)

i     1  2  3  4  5  6  7  8  9  10
P[i]  a  b  a  b  a  b  a  b  c  a
π[i]  0  0  1  2  3  4  5  6  0  1

π[i]: the largest integer smaller than i such that P[1..π[i]] is a suffix of P[1..i]
e.g. π[6] = 4 since abab is a suffix of ababab
e.g. π[9] = 0 since no prefix of length ≤ 8 ends with c

Let's see why this is useful
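
Before moving on, the definition can be checked by brute force. The sketch below computes π straight from the definition, trying every candidate length from the largest down; it is far slower than the linear-time method and is only meant to make the definition concrete. The name pi_by_definition is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// pi[i] for i = 1..m (pi[0] is unused and left at 0), computed directly:
// the largest k < i such that P[1..k] is a suffix of P[1..i].
std::vector<int> pi_by_definition(const std::string& P) {
    assert(!P.empty());
    const int m = (int)P.size();
    std::vector<int> pi(m + 1, 0);
    for (int i = 1; i <= m; i++)
        for (int k = i - 1; k >= 1; k--)
            // Compare the prefix of length k with the last k letters of P[1..i].
            if (P.compare(0, k, P, i - k, k) == 0) { pi[i] = k; break; }
    return pi;
}
```

For a pattern such as ababababca, which is consistent with both examples above, this gives π[6] = 4 and π[9] = 0.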

Using the π Table

T = ABC ABCDAB ABCDABCDABDE
P = ABCDABD
π = (0, 0, 0, 0, 1, 2, 0)
Start matching at the first position of T:
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
ABCDABD
1234567

Mismatch at the 4th letter of P!

Using the π Table

There is no point in starting the comparison at T[2], T[3]
We matched k = 3 letters so far
Shift P by k - π[k] = 3 letters

12345678901234567890123
ABC ABCDAB ABCDABCDABDE
   ABCDABD
   1234567

Mismatch at T[4] again!

Using the π Table

We define π[0] = -1
We matched k = 0 letters so far
Shift P by k - π[k] = 1 letter

12345678901234567890123
ABC ABCDAB ABCDABCDABDE
    ABCDABD
    1234567

Mismatch at T[11]!

Using the π Table

π[6] = 2 says P[1..2] is a suffix of P[1..6]
Shift P by 6 - π[6] = 4 letters
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
    ABCDABD
        ||
        ABCDABD
        1234567
Again, no point in shifting P by 1, 2, or 3 letters

Using the π Table

Mismatch at T[11] again!
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
        ABCDABD
        1234567

Currently 2 letters are matched
We shift P by 2 = 2 - π[2] letters

Using the π Table

Mismatch at T[11] yet again!

12345678901234567890123
ABC ABCDAB ABCDABCDABDE
          ABCDABD
          1234567

Currently no letters are matched
We shift P by 1 = 0 - π[0] letters

Using the π Table

Mismatch at T[18]
12345678901234567890123
ABC ABCDAB ABCDABCDABDE
           ABCDABD
           1234567

Currently 6 letters are matched
We shift P by 4 = 6 - π[6] letters

Using the π Table

Finally, there it is!

12345678901234567890123
ABC ABCDAB ABCDABCDABDE
               ABCDABD
               1234567

Currently all 7 letters are matched
After recording this match (match at T[16..22]), we shift P again in order to find other matches
Shift by 7 = 7 - π[7] letters

Computing π

Observation 1: if P[1..π[i]] is a suffix of P[1..i], then P[1..π[i]-1] is a suffix of P[1..i-1]
Well, obviously

Observation 2: all the prefixes of P that are a suffix of P[1..i] can be obtained by recursively applying π to i
e.g. P[1..π[i]], P[1..π[π[i]]], P[1..π[π[π[i]]]] are all suffixes of P[1..i]

Computing π

A non-obvious conclusion:
First, let's write π^(k)[i] as π[.] applied k times to i
e.g. π^(2)[i] = π[π[i]]
π[i] is equal to π^(k)[i-1] + 1, where k is the smallest integer that satisfies P[π^(k)[i-1] + 1] = P[i]
If there is no such k, π[i] = 0

Intuition: we look at all the prefixes of P that are suffixes of P[1..i-1] and find the longest one whose next letter matches P[i] too

Implementation
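// P[1..m] is 1-indexed; pi has m+1 entries
pi[0] = -1;                 // sentinel, so that k can drop below 0
int k = -1;
for(int i = 1; i <= m; i++) {
    // fall back through shorter candidates until P[k+1] matches P[i]
    while(k >= 0 && P[k+1] != P[i])
        k = pi[k];
    pi[i] = ++k;            // longest prefix of P that is a proper suffix of P[1..i]
}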

Pattern Matching Implementation

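// T[1..n] and P[1..m] are 1-indexed; pi is the table computed from P
int k = 0;                  // number of letters of P currently matched
for(int i = 1; i <= n; i++) {
    // on a mismatch, shift P by k - pi[k] letters
    while(k >= 0 && P[k+1] != T[i])
        k = pi[k];
    k++;
    if(k == m) {
        // P matches T[i-m+1..i]
        k = pi[k];          // shift again to find other matches
    }
}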
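
The two loops above assume 1-indexed arrays. A self-contained way to run them on std::string, sketched here, is to pad both strings with one dummy character so the indices line up. The function name kmp_match and the padding trick are my own additions, not part of the slides.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the 1-indexed start positions of every occurrence of pat in text.
std::vector<int> kmp_match(const std::string& text, const std::string& pat) {
    assert(!pat.empty());
    const int n = (int)text.size(), m = (int)pat.size();
    // Pad with a dummy character so that T[1..n] and P[1..m] are 1-indexed.
    const std::string T = " " + text, P = " " + pat;

    // Compute the pi table, with pi[0] = -1 as the sentinel.
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    int k = -1;
    for (int i = 1; i <= m; i++) {
        while (k >= 0 && P[k + 1] != P[i]) k = pi[k];
        pi[i] = ++k;
    }

    // Scan the text, reusing the number of letters already matched.
    std::vector<int> hits;
    k = 0;
    for (int i = 1; i <= n; i++) {
        while (k >= 0 && P[k + 1] != T[i]) k = pi[k];
        k++;
        if (k == m) {
            hits.push_back(i - m + 1);  // P matches T[i-m+1..i]
            k = pi[k];
        }
    }
    return hits;
}
```

On the walkthrough example (T = ABC ABCDAB ABCDABCDABDE, P = ABCDABD) this returns the single position 16.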

Suffix Trie

Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)
Each node corresponds to some substring of T
Each edge is associated with a letter of the alphabet
For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x
Surprisingly easy to implement!

Suffix Trie Example

(Figure modified from Ukkonen's original paper)

Incremental Construction

Given the suffix trie for T[1..n]
Then we append T[n+1] = a to T, creating necessary nodes

Start at node u corresponding to T[1..n]
Create an a-transition to a new node v

Take the suffix link at u to go to u', corresponding to T[2..n]
Create an a-transition to a new node v'
Create a suffix link from v to v'

Incremental Construction

We repeat the previous process:
Take the suffix link at the current node
Make a new a-transition there
Create the suffix link from the previous node

We stop if the node already has an a-transition
Because from this point, all nodes that are reachable via suffix links already have an a-transition
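
The steps above can be sketched as follows. The names SuffixTrie, Node, extend and deepest are hypothetical, and children are kept in a std::map for simplicity. One detail is my own reading rather than something the slides spell out: when the walk stops, the last new node still needs a suffix link, either to the existing a-child of the node where it stopped or to the root if the walk ran past the root.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct SuffixTrie {
    struct Node {
        std::map<char, int> next;  // letter -> child index
        int link = -1;             // suffix link (-1 only for the root)
    };
    std::vector<Node> nodes;
    int deepest = 0;  // node corresponding to the whole string so far

    SuffixTrie() : nodes(1) {}  // node 0 is the root (empty string)

    // Append one letter a to the string.
    void extend(char a) {
        const int old_deepest = deepest;
        int u = deepest;
        int prev = -1;  // node created in the previous step, awaiting its link
        while (u != -1 && !nodes[u].next.count(a)) {
            int v = (int)nodes.size();
            nodes.emplace_back();
            nodes[u].next[a] = v;                  // new a-transition
            if (prev != -1) nodes[prev].link = v;  // link from the previous node
            prev = v;
            u = nodes[u].link;                     // follow the suffix link
        }
        // The deepest node never has an a-transition yet, so prev is set.
        assert(prev != -1);
        nodes[prev].link = (u == -1) ? 0 : nodes[u].next[a];
        deepest = nodes[old_deepest].next[a];
    }
};
```

Building the trie for aba this way gives 6 nodes (the root plus a, b, ab, ba, aba), and appending c brings it to 10.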

Construction Example

Given the suffix trie for aba
We want to add a new letter c

1. Start at the green node and make a c-transition
2. Then follow the suffix link
3. Make a c-transition at u'
4. Make a suffix link from v to v'

(Figures: the suffix trie after each step)

Suffix Trie Analysis

Construction time is linear in the tree size
But the tree size can be quadratic in n
e.g. T = aaabbb, and in general n/2 a's followed by n/2 b's

Pattern Matching

To find P, start at the root and keep following edges labeled with P[1], P[2], etc.
Got stuck? Then P doesn't exist in T
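
A minimal sketch of this lookup. To stay short it builds the trie naively, by inserting every suffix one letter at a time, rather than with the incremental suffix-link construction; the resulting trie is the same. The names Trie and contains are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Trie {
    std::vector<std::map<char, int>> next;  // next[node][letter] = child

    // Naive build: insert every suffix of T (quadratic time and space).
    explicit Trie(const std::string& T) : next(1) {
        assert(!T.empty());
        for (size_t s = 0; s < T.size(); s++) {
            int u = 0;  // start each suffix at the root
            for (size_t i = s; i < T.size(); i++) {
                auto it = next[u].find(T[i]);
                if (it == next[u].end()) {
                    int v = (int)next.size();
                    next.emplace_back();
                    next[u][T[i]] = v;
                    u = v;
                } else {
                    u = it->second;
                }
            }
        }
    }

    // Follow edges labeled with the letters of P, starting at the root.
    bool contains(const std::string& P) const {
        int u = 0;
        for (char c : P) {
            auto it = next[u].find(c);
            if (it == next[u].end()) return false;  // got stuck
            u = it->second;
        }
        return true;
    }
};
```

The lookup takes time proportional to the length of P, independent of the length of T, apart from the per-node map lookup.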

Suffix Array
Input string: BANANA

Get all suffixes:
1  BANANA
2  ANANA
3  NANA
4  ANA
5  NA
6  A

Sort the suffixes:
6  A
4  ANA
2  ANANA
1  BANANA
5  NA
3  NANA

Take the indices:
6,4,2,1,5,3
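
The three steps above translate almost literally into code. This sketch sorts the suffix start positions with a direct string comparison, so each comparison may scan O(n) letters; it is meant to show what a suffix array is, not how to build one fast. The name suffix_array_naive is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// 1-indexed suffix array, built by sorting the suffixes directly.
std::vector<int> suffix_array_naive(const std::string& s) {
    assert(!s.empty());
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // all suffix start positions
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    for (int& i : sa) i++;  // report 1-indexed positions
    return sa;
}
```

For BANANA this returns 6, 4, 2, 1, 5, 3.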

Suffix Array

Memory usage is O(n)
Has the same computational power as suffix trie
Can be constructed in O(n) time (!)
But it's hard to implement

There is an approachable O(n log^2 n) algorithm
If you want to see how it works, read the paper on the course website
http://cs97si.stanford.edu/suffix-array.pdf
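
For orientation, here is a sketch of one common O(n log^2 n) approach, prefix doubling: suffixes are ranked by their first 1, 2, 4, ... letters, and each round sorts by the pair (rank of the first half, rank of the second half). I am assuming this is the kind of algorithm meant; it may differ in details from the linked paper. The function name suffix_array is hypothetical and the result is 0-indexed.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// 0-indexed suffix array by prefix doubling: O(log n) rounds, each an
// O(n log n) sort.
std::vector<int> suffix_array(const std::string& s) {
    assert(!s.empty());
    const int n = (int)s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank[i] = (unsigned char)s[i];
    for (int len = 1;; len *= 2) {
        // Key of suffix i: rank of its first len letters, then rank of the
        // next len letters (-1 if the suffix is too short to have any).
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + len < n ? rank[i + len] : -1);
        };
        std::sort(sa.begin(), sa.end(),
                  [&](int a, int b) { return key(a) < key(b); });
        // Re-rank: equal keys share a rank, larger keys get the next one.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (key(sa[i - 1]) < key(sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: done
    }
    return sa;
}
```

For BANANA this gives 5, 3, 1, 0, 4, 2, the 0-indexed form of 6, 4, 2, 1, 5, 3.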

Note on String Problems

Always be aware of the null-terminators

Simple hash works so well in many problems
Even for problems that aren't supposed to be solved by hashing

If a problem involves rotations of a string, consider concatenating it with itself and see if it helps
Stanford team notebook has implementations of suffix arrays and the KMP matcher
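
As a small illustration of the rotation tip: every rotation of a string appears as a substring of the string concatenated with itself, so a rotation test reduces to one substring search. The name rotation_offset is hypothetical, and std::string::find stands in for whichever matcher is preferred.

```cpp
#include <cassert>
#include <string>

// Returns k such that b equals a rotated left by k letters, or -1 if b is
// not a rotation of a.
int rotation_offset(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;
    const size_t k = (a + a).find(b);
    if (k == std::string::npos) return -1;
    // Sanity check of the claim: rotating a by k really gives b.
    assert(a.substr(k) + a.substr(0, k) == b);
    return (int)k;
}
```

For example, cdeab is abcde rotated left by 2, while abced is not a rotation of abcde.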
