String Algorithms

Jaehyun Park
CS 97SI
Stanford University

June 30, 2015

Outline

String Matching Problem

Hash Table
Knuth-Morris-Pratt (KMP) Algorithm
Suffix Trie
Suffix Array

String Matching Problem

Given a text T and a pattern P, find all occurrences of P within T
Notations:

n and m: lengths of T and P, respectively
$\Sigma$: the alphabet, i.e., the set of letters (of constant size)
$P_i$: the ith letter of P (1-indexed)
a, b, c: single letters in $\Sigma$
x, y, z: strings

Example

T = AGCATGCTGCAGTCATGCTTAGGCTA

P = GCT

P appears three times in T

A naive method takes O(mn) time

Initiate string comparison at every starting point
Each comparison takes O(m) time
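
A minimal C++ sketch of the naive method, for reference. The function name `naive_match` is made up for this illustration; it returns 1-indexed starting positions to match the notation above, and it assumes a non-empty pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive matcher: start a comparison at every position of T.
// Each comparison costs O(m), so the total is O(mn).
std::vector<int> naive_match(const std::string& T, const std::string& P) {
    assert(!P.empty());  // assumption of this sketch: P has at least one letter
    std::vector<int> positions;
    const std::size_t n = T.size(), m = P.size();
    for (std::size_t s = 0; s + m <= n; ++s) {
        std::size_t j = 0;
        while (j < m && T[s + j] == P[j]) ++j;  // compare letter by letter
        if (j == m) positions.push_back(static_cast<int>(s) + 1);
    }
    return positions;
}
```

On the example above, `naive_match("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT")` reports the three occurrences starting at positions 6, 17, and 23.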

We can do much better!

Hash Table

Hash Function

A function that takes a string and outputs a number

A good hash function has few collisions
i.e., if $x \neq y$, then $H(x) \neq H(y)$ with high probability

An easy and powerful hash function is a polynomial mod some prime $p$
Consider each letter as a number (ASCII value is fine)
$H(x_1 \ldots x_k) = x_1 a^{k-1} + x_2 a^{k-2} + \cdots + x_{k-1} a + x_k \pmod{p}$
How do we find $H(x_2 \ldots x_{k+1})$ from $H(x_1 \ldots x_k)$?
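
One way to answer the question: subtract the leading term, multiply by $a$, and add the new letter, i.e. $H(x_2 \ldots x_{k+1}) = (H(x_1 \ldots x_k) - x_1 a^{k-1}) \cdot a + x_{k+1} \pmod{p}$. The sketch below illustrates this under assumed constants; `BASE`, `MOD`, `poly_hash`, and `window_hashes` are names and values chosen for the example, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative constants (assumptions for this sketch).
const std::uint64_t BASE = 257;         // plays the role of "a"
const std::uint64_t MOD = 1000000007;   // plays the role of the prime "p"

// H(x_1 ... x_k) = x_1 a^(k-1) + ... + x_k (mod p), evaluated by Horner's rule.
std::uint64_t poly_hash(const std::string& s, std::size_t start, std::size_t k) {
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < k; ++i)
        h = (h * BASE + static_cast<unsigned char>(s[start + i])) % MOD;
    return h;
}

// Hashes of all length-k windows of s; each one is derived from the
// previous window in O(1) time.
std::vector<std::uint64_t> window_hashes(const std::string& s, std::size_t k) {
    assert(k >= 1);
    std::vector<std::uint64_t> out;
    if (s.size() < k) return out;
    std::uint64_t top = 1;  // a^(k-1) mod p
    for (std::size_t i = 1; i < k; ++i) top = top * BASE % MOD;
    std::uint64_t h = poly_hash(s, 0, k);
    out.push_back(h);
    for (std::size_t i = k; i < s.size(); ++i) {
        const std::uint64_t lead = static_cast<unsigned char>(s[i - k]) * top % MOD;
        h = (h + MOD - lead) % MOD;                               // drop x_1 a^(k-1)
        h = (h * BASE + static_cast<unsigned char>(s[i])) % MOD;  // shift, append x_(k+1)
        out.push_back(h);
    }
    return out;
}
```

Adding `MOD` before subtracting keeps the intermediate value non-negative, which matters with unsigned arithmetic.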

Main idea: preprocess T to speed up queries

Hash every substring of length k
k is a small constant

For each query P, hash the first k letters of P to retrieve all the occurrences of it within T

Don't forget to check collisions!
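
The idea above can be sketched as follows. `KGramIndex` is a hypothetical helper written for this illustration, and the hash constants are arbitrary assumptions. It rehashes each window from scratch, which is acceptable because k is a small constant, and it verifies every candidate against T so that collisions cannot produce false matches. Patterns shorter than k are not handled in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: indexes every length-k substring of T by its hash.
struct KGramIndex {
    std::string T;
    std::size_t k;
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> table;

    // Polynomial hash of s[start .. start+len); constants are assumptions.
    static std::uint64_t hash(const std::string& s, std::size_t start, std::size_t len) {
        std::uint64_t h = 0;
        for (std::size_t i = 0; i < len; ++i)
            h = (h * 257 + static_cast<unsigned char>(s[start + i])) % 1000000007ULL;
        return h;
    }

    KGramIndex(const std::string& text, std::size_t k_) : T(text), k(k_) {
        assert(k >= 1);
        for (std::size_t s = 0; s + k <= T.size(); ++s)
            table[hash(T, s, k)].push_back(s);  // preprocessing step
    }

    // 1-indexed occurrences of P in T (empty if |P| < k in this sketch).
    std::vector<int> find(const std::string& P) const {
        std::vector<int> result;
        if (P.size() < k) return result;
        auto it = table.find(hash(P, 0, k));
        if (it == table.end()) return result;
        for (std::size_t s : it->second) {
            // Check the candidate: rules out collisions and compares the rest of P.
            if (T.compare(s, P.size(), P) == 0)
                result.push_back(static_cast<int>(s) + 1);
        }
        return result;
    }
};
```

For example, with `KGramIndex idx("AGCATGCTGCAGTCATGCTTAGGCTA", 3)`, the query `idx.find("GCT")` only inspects the positions whose 3-letter hash equals that of `GCT`.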

Pros:
Easy to implement
Significant speedup in practice

Cons:
Doesn't help the asymptotic efficiency
Can still take $\Theta(nm)$ time if hashing is terrible or data is difficult
A lot of memory consumption

Knuth-Morris-Pratt (KMP) Algorithm

Knuth-Morris-Pratt (KMP) Matcher

A linear time (!) algorithm that solves the string matching problem by preprocessing P in $\Theta(m)$ time
Main idea is to skip some comparisons by using the previous comparison result

Uses an auxiliary array $\pi$ that is defined as the following:
$\pi[i]$ is the largest integer smaller than $i$ such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$

... It's better to see an example than the definition

$\pi$ Table Example (from CLRS)

$\pi[i]$ is the largest integer smaller than $i$ such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$
e.g., $\pi[6] = 4$ since abab is a suffix of ababab
e.g., $\pi[9] = 0$ since no prefix of length $\leq 8$ ends with c
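
To make the definition concrete, here is a deliberately slow sketch that computes $\pi$ by applying the definition directly: for each $i$ it tries every $j < i$ from largest to smallest. The function name `prefix_table_bruteforce` is an assumption of this example, and this is not the linear-time method; it is only meant for checking small cases by hand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// pi[i] = largest j < i such that P_1..P_j is a suffix of P_1..P_i.
// The slides index P from 1; the std::string here is 0-indexed, so
// P_1..P_j is P.substr(0, j). pi[0] is left as 0 and not used.
std::vector<int> prefix_table_bruteforce(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> pi(m + 1, 0);
    for (int i = 1; i <= m; ++i) {
        for (int j = i - 1; j >= 1; --j) {
            // Compare the length-j prefix with the last j letters of P_1..P_i.
            if (P.compare(0, j, P, i - j, j) == 0) {
                pi[i] = j;
                break;
            }
        }
    }
    return pi;
}
```

For instance, `prefix_table_bruteforce("ababab")[6]` evaluates to 4, in line with the first example above.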

Let's see why this is useful

Using the $\pi$ Table

T = ABC ABCDAB ABCDABCDABDE

P = ABCDABD

$\pi$ = (0, 0, 0, 0, 1, 2, 0)

Start matching at the first position of T:

Mismatch at the 4th letter of P!

We matched $k = 3$ letters so far, and $\pi[k] = 0$

Thus, there is no point in starting the comparison at $T_2$, $T_3$ (crucial observation)

Shift P by $k - \pi[k] = 3$ letters

Mismatch at $T_4$ again!

We matched $k = 0$ letters so far

Shift P by $k - \pi[k] = 1$ letter (we define $\pi[0] = -1$)

Mismatch at $T_{11}$!

$\pi[6] = 2$ means $P_1 P_2$ is a suffix of $P_1 \ldots P_6$

Shift P by $6 - \pi[6] = 4$ letters

Again, no point in shifting P by 1, 2, or 3 letters

Mismatch at $T_{11}$ again!

Currently 2 letters are matched

Shift P by $2 - \pi[2] = 2$ letters

Mismatch at $T_{11}$ yet again!

Currently no letters are matched

Shift P by $0 - \pi[0] = 1$ letter

Mismatch at $T_{18}$

Currently 6 letters are matched

Shift P by $6 - \pi[6] = 4$ letters

Finally, there it is!

Currently all 7 letters are matched

After recording this match (at $T_{16} \ldots T_{22}$), we shift P again in order to find other matches

Shift by $7 - \pi[7] = 7$ letters
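
The walkthrough can be replayed in code. The sketch below is written from the "shift the pattern" point of view rather than the usual single-loop form: it keeps the current alignment `s` and the number of matched letters `k`, and records every shift $k - \pi[k]$. The names `KmpTrace` and `kmp_trace` are hypothetical, and a non-empty pattern is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct KmpTrace {
    std::vector<int> shifts;   // successive shift amounts k - pi[k]
    std::vector<int> matches;  // 1-indexed starting positions of matches
};

KmpTrace kmp_trace(const std::string& T, const std::string& P) {
    assert(!P.empty());
    const int n = static_cast<int>(T.size()), m = static_cast<int>(P.size());
    // pi[0] = -1, pi[1..m] as defined in the slides (strings are 0-indexed here).
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    for (int i = 1, k = -1; i <= m; ++i) {
        while (k >= 0 && P[k] != P[i - 1]) k = pi[k];
        pi[i] = ++k;
    }
    KmpTrace tr;
    int s = 0, k = 0;  // s: 0-indexed alignment of P in T; k: letters matched
    while (s + m <= n) {
        while (k < m && T[s + k] == P[k]) ++k;  // extend the current match
        if (k == m) tr.matches.push_back(s + 1);
        const int shift = k - pi[k];
        tr.shifts.push_back(shift);
        s += shift;
        k = std::max(pi[k], 0);  // letters known to match after the shift
    }
    return tr;
}
```

With T = `ABC ABCDAB ABCDABCDABDE` and P = `ABCDABD`, the recorded shifts are 3, 1, 4, 2, 1, 4, 7 and the single match starts at position 16, the same sequence as in the steps above.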

Computing $\pi$

Observation 1: if $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$, then $P_1 \ldots P_{\pi[i]-1}$ is a suffix of $P_1 \ldots P_{i-1}$

Observation 2: all the prefixes of P that are a suffix of $P_1 \ldots P_i$ can be obtained by recursively applying $\pi$ to $i$

Well, obviously...

e.g., $P_1 \ldots P_{\pi[i]}$, $P_1 \ldots P_{\pi[\pi[i]]}$, $P_1 \ldots P_{\pi[\pi[\pi[i]]]}$ are all suffixes of $P_1 \ldots P_i$

A non-obvious conclusion:
First, let's write $\pi^{(k)}[i]$ as $\pi[\cdot]$ applied $k$ times to $i$
e.g., $\pi^{(2)}[i] = \pi[\pi[i]]$
$\pi[i]$ is equal to $\pi^{(k)}[i-1] + 1$, where $k$ is the smallest integer that satisfies $P_{\pi^{(k)}[i-1]+1} = P_i$

If there is no such $k$, $\pi[i] = 0$

Intuition: we look at all the prefixes of P that are suffixes of $P_1 \ldots P_{i-1}$, and find the longest one whose next letter matches $P_i$

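Implementation

    // P[1..m] is the pattern (1-indexed); pi[] has m + 1 entries.
    pi[0] = -1;                 // sentinel: nothing shorter to fall back to
    int k = -1;                 // k holds pi[i-1] (-1 before the first letter)
    for(int i = 1; i <= m; i++) {
        // fall back to shorter prefixes until one can be extended by P[i]
        while(k >= 0 && P[k+1] != P[i])
            k = pi[k];
        pi[i] = ++k;
    }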
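
A self-contained version of the same loop, as a sketch: the pattern is padded with one dummy character so that the 1-indexed code can be kept as is. The function name `prefix_table` and the padding trick are choices made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns pi[0..m] with pi[0] = -1, using the 1-indexed convention of the slides.
std::vector<int> prefix_table(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    const std::string P = " " + pattern;  // pad so that P[1..m] is the pattern
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    int k = -1;
    for (int i = 1; i <= m; ++i) {
        while (k >= 0 && P[k + 1] != P[i]) k = pi[k];
        pi[i] = ++k;
    }
    return pi;
}
```

For P = `ABCDABD` this gives `pi[1..7]` = (0, 0, 0, 0, 1, 2, 0), the table used in the walkthrough. The inner `while` loop looks expensive, but `k` increases by at most one per iteration of the outer loop and every fallback decreases it, so the total work is linear in m.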

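Pattern Matching Implementation

    // T[1..n] is the text, P[1..m] the pattern (both 1-indexed);
    // pi[] is the table of P with pi[0] = -1.
    int k = 0;                  // number of letters of P currently matched
    for(int i = 1; i <= n; i++) {
        while(k >= 0 && P[k+1] != T[i])
            k = pi[k];          // mismatch: fall back (k = -1 ends the loop)
        k++;
        if(k == m) {
            // P matches T[i-m+1..i]
            k = pi[k];          // continue, so overlapping matches are found too
        }
    }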
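
Putting the two loops together gives a complete matcher. This is a sketch under the same conventions as the snippets above (1-indexed access via a padding character); the function names `prefix_table` and `kmp_match` are assumptions of this example, and match positions are returned 1-indexed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// pi[0..m] with pi[0] = -1, computed as in the implementation above.
std::vector<int> prefix_table(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    const std::string P = " " + pattern;
    std::vector<int> pi(m + 1);
    pi[0] = -1;
    int k = -1;
    for (int i = 1; i <= m; ++i) {
        while (k >= 0 && P[k + 1] != P[i]) k = pi[k];
        pi[i] = ++k;
    }
    return pi;
}

// Returns the 1-indexed starting positions of all occurrences of pattern in text.
std::vector<int> kmp_match(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());  // assumption of this sketch
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    const std::string T = " " + text, P = " " + pattern;  // shift to 1-indexing
    const std::vector<int> pi = prefix_table(pattern);
    std::vector<int> positions;
    int k = 0;
    for (int i = 1; i <= n; ++i) {
        while (k >= 0 && P[k + 1] != T[i]) k = pi[k];
        ++k;
        if (k == m) {
            positions.push_back(i - m + 1);  // P matches T[i-m+1..i]
            k = pi[k];
        }
    }
    return positions;
}
```

Because `k` is reset to `pi[m]` right after a match, `P[k + 1]` never reads past the end of the pattern.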

Suffix Trie

Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)

Each node corresponds to some substring of T

Each edge is associated with a letter

For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x

Surprisingly easy to implement!

Example

(Figure modified from Ukkonen's original paper)

Incremental Construction

Given the suffix trie for $T_1 \ldots T_n$

Then we append $T_{n+1} = a$ to T, creating necessary nodes

Start at node $u$ corresponding to $T_1 \ldots T_n$

Create an a-transition to a new node $v$

Take the suffix link at $u$ to go to $u'$, corresponding to $T_2 \ldots T_n$
Create an a-transition to a new node $v'$
Create a suffix link from $v$ to $v'$

Repeat the previous process:

Take the suffix link at the current node
Make a new a-transition there
Create the suffix link from the previous node

Stop if the node already has an a-transition

Because from this point, all nodes that are reachable via suffix links already have an a-transition
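
The incremental construction can be sketched as below. The type `SuffixTrie`, its members, and `build_suffix_trie` are hypothetical names for this example. Two details are spelled out here that the steps leave implicit: when the walk stops at a node that already has an a-transition, the last new node gets a suffix link to that existing a-child, and when the walk falls off the root, the last new node links to the root.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct SuffixTrie {
    struct Node {
        std::map<char, int> next;  // transitions, one per letter
        int link = -1;             // suffix link (-1 for the root)
    };
    std::vector<Node> nodes;
    int deepest = 0;  // node corresponding to the whole string read so far

    SuffixTrie() : nodes(1) {}  // node 0 is the root (empty string)

    // Append letter a: walk the suffix links from the deepest node, adding an
    // a-transition at every node that lacks one.
    void extend(char a) {
        int u = deepest;
        int prev = -1;         // most recently created node, still without a link
        int new_deepest = -1;
        while (u != -1 && nodes[u].next.count(a) == 0) {
            const int v = static_cast<int>(nodes.size());
            nodes.push_back(Node());
            nodes[u].next[a] = v;
            if (prev == -1) new_deepest = v;
            else nodes[prev].link = v;  // suffix link from the previous new node
            prev = v;
            u = nodes[u].link;
        }
        assert(prev != -1);  // the deepest node is a leaf, so one node is always created
        nodes[prev].link = (u == -1) ? 0 : nodes[u].next.at(a);
        deepest = new_deepest;
    }
};

SuffixTrie build_suffix_trie(const std::string& T) {
    SuffixTrie trie;
    for (char c : T) trie.extend(c);
    return trie;
}
```

Each node corresponds to one distinct substring, so the trie for `aba` has 6 nodes including the root, and appending `c` adds 4 more (for `abac`, `bac`, `ac`, and `c`).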

Construction Example

Given the suffix trie for aba

We want to add a new letter c

(Step-by-step figures omitted)

Construction time is linear in the tree size

But the tree size can be quadratic in n
e.g., T = aa...abb...b

To find P, start at the root and keep following edges labeled with $P_1$, $P_2$, etc.

Got stuck? Then P doesn't exist in T
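
A short sketch of the lookup. To keep it self-contained, the trie here is built the simple way, by inserting every suffix letter by letter, instead of using suffix links; that shortcut is a choice made for this example, and `SimpleSuffixTrie` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct SimpleSuffixTrie {
    std::vector<std::map<char, int>> next;  // next[u][c] = child of node u via letter c

    // Builds the trie by inserting each suffix of T starting from the root.
    explicit SimpleSuffixTrie(const std::string& T) : next(1) {
        for (std::size_t s = 0; s < T.size(); ++s) {
            int u = 0;
            for (std::size_t i = s; i < T.size(); ++i) {
                auto it = next[u].find(T[i]);
                if (it != next[u].end()) {
                    u = it->second;
                } else {
                    const int v = static_cast<int>(next.size());
                    next.emplace_back();
                    next[u][T[i]] = v;
                    u = v;
                }
            }
        }
    }

    // Follow the edges labeled P_1, P_2, ...; getting stuck means P is not in T.
    bool contains(const std::string& P) const {
        int u = 0;
        for (char c : P) {
            auto it = next[u].find(c);
            if (it == next[u].end()) return false;  // got stuck
            u = it->second;
        }
        return true;
    }
};
```

The lookup takes time proportional to the length of P (times the cost of a map lookup), independent of the length of T.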

Suffix Array

Memory usage is O(n)

Has the same computational power as suffix trie

Can be constructed in O(n) time (!)
But it's hard to implement

There is an approachable $O(n \log^2 n)$ algorithm
If you want to see how it works, read the paper on the course website
http://cs97si.stanford.edu/suffix-array.pdf
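
For orientation, here is a sketch of one common $O(n \log^2 n)$ construction, prefix doubling with a comparison sort. It is not necessarily the algorithm described in the linked paper. The function names `suffix_array` and `sa_contains` are assumptions of this example, and suffixes are identified by their 0-indexed starting positions.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Returns the starting positions (0-indexed) of the suffixes of s in sorted order.
std::vector<int> suffix_array(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    for (int i = 0; i < n; ++i) {
        sa[i] = i;
        rank[i] = static_cast<unsigned char>(s[i]);
    }
    for (int len = 1;; len *= 2) {
        // Sort key: (rank of the first len letters, rank of the next len letters).
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + len < n ? rank[i + len] : -1);
        };
        std::sort(sa.begin(), sa.end(), [&](int a, int b) { return key(a) < key(b); });
        if (n > 0) tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (key(sa[i - 1]) < key(sa[i]) ? 1 : 0);
        rank = tmp;
        if (n == 0 || rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: done
    }
    return sa;
}

// Pattern lookup by binary search over the sorted suffixes.
bool sa_contains(const std::string& s, const std::vector<int>& sa, const std::string& P) {
    auto it = std::lower_bound(sa.begin(), sa.end(), P,
        [&](int suffix, const std::string& pat) {
            return s.compare(suffix, pat.size(), pat) < 0;
        });
    return it != sa.end() && s.compare(*it, P.size(), P) == 0;
}
```

Each round sorts in $O(n \log n)$ and doubles the compared length, so there are $O(\log n)$ rounds. The lookup illustrates how a sorted list of suffixes can answer the same substring queries as the trie.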

Notes on String Problems

Always be aware of the null-terminators

Simple hash works so well in many problems

If a problem involves rotations of some string, consider concatenating it with itself and see if it helps

Stanford team notebook has implementations of suffix arrays and the KMP matcher
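
A small illustration of the rotation trick: y is a rotation of x exactly when the two have equal length and y occurs in x concatenated with itself. The function name `is_rotation` is made up for this sketch, and it uses `std::string::find` for brevity; any matcher (for example KMP) could be substituted to get a guaranteed linear running time.

```cpp
#include <cassert>
#include <string>

// True if y can be obtained by rotating x.
bool is_rotation(const std::string& x, const std::string& y) {
    return x.size() == y.size() && (x + x).find(y) != std::string::npos;
}
```

For example, `cdeab` is a rotation of `abcde` because it appears inside `abcdeabcde`.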
