String Matching Algorithms

The document summarizes three common string matching algorithms: Naive, Rabin-Karp, and Knuth-Morris-Pratt. The Naive algorithm has O(mn) worst-case runtime because it compares characters at every possible shift. Rabin-Karp improves this to O(m+n) by comparing hash values instead of characters, at the cost of a small probability of false matches caused by hash collisions. Knuth-Morris-Pratt also has O(m+n) runtime, in the worst case, by precomputing a prefix function of the pattern so that text characters are never re-checked.

Uploaded by

Aditya Pratap Singh


STRING MATCHING

Aditya Pratap Singh

215/CO/15
Netaji Subhas Institute Of Technology
CONTENTS

● Introduction
● String Matching
● Basic Classification
● Naive Algorithm
● Rabin-Karp Algorithm
○ String Hashing
○ Hash value for substrings
● Knuth-Morris-Pratt Algorithm
○ Prefix Function
○ KMP Matcher
● Summary
INTRODUCTION

● String matching algorithms are an important class of string algorithms that try to find one or more indices at which one or several strings (patterns) occur in a larger string (the text)

● Why do we need string matching?

String matching is used in applications such as spell checkers, spam filters, search engines, plagiarism detectors, bioinformatics and DNA sequencing.
STRING MATCHING

● Goal: find all occurrences of a pattern in a given text

● Formally, given a pattern P[1..m] and a text T[1..n], find all occurrences of P in T. Both P and T belong to Σ*, the set of finite strings over the alphabet Σ
● P occurs with shift s (beginning at position s+1) if P[1] = T[s+1], P[2] = T[s+2], …, P[m] = T[s+m]
● If so, s is called a valid shift; otherwise it is an invalid shift
● Note: one occurrence can start within another one, i.e. overlapping is allowed. E.g. for P = abab and T = abcabababbc, P occurs at s = 3 and s = 5.

*Text is the string that we are searching in

*Pattern is the string that we are searching for
*Shift is an offset into the text
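
The definition of a valid shift can be checked directly in code. The following is a minimal sketch, assuming 0-based shifts (so shift s means the occurrence starts at position s+1 in the 1-based notation above); the names `is_valid_shift` and `valid_shifts` are illustrative and do not come from the slides.

```cpp
#include <string>
#include <vector>

// Returns true when shift s is valid, i.e. pattern occurs in text
// starting at 0-based index s.
bool is_valid_shift(const std::string& text, const std::string& pattern, std::size_t s) {
    if (s + pattern.size() > text.size()) return false;  // pattern would run past the end
    return text.compare(s, pattern.size(), pattern) == 0;
}

// Collects every valid shift; overlapping occurrences are all reported.
std::vector<std::size_t> valid_shifts(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> shifts;
    for (std::size_t s = 0; s + pattern.size() <= text.size(); ++s)
        if (is_valid_shift(text, pattern, s)) shifts.push_back(s);
    return shifts;
}
```

For P = abab and T = abcabababbc, `valid_shifts` returns the two overlapping shifts 3 and 5 from the note above.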
BASIC CLASSIFICATION

1. Naive Algorithm - The naive approach performs a brute-force comparison of each character in the pattern at each possible placement of the pattern in the text. It is O(mn) in the worst case

2. Rabin-Karp Algorithm - It compares the strings’ hash values rather than the strings themselves. It performs well in practice and generalizes to algorithms for related problems such as 2D string matching

3. Knuth-Morris-Pratt Algorithm - It improves on the brute-force algorithm and runs in O(m+n) in the worst case. It improves the running time by taking advantage of the prefix function
NAIVE ALGORITHM

One of the most obvious approaches to the string matching problem is to compare the first element of the pattern to be searched, ‘p’, with the first element of the string ‘s’ in which to locate ‘p’.

If the first element of ‘p’ matches the first element of ‘s’, compare the second elements, and so on, until the entire ‘p’ is found. If a mismatch is found at any position, shift ‘p’ one position to the right and restart the comparison from the first element of ‘p’.

This approach is easy to understand and implement, but it can be too slow in some cases.
In the worst case it may take (m*n) iterations to complete the task.
PSEUDOCODE

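#include <string>
#include <vector>
using namespace std;

// Returns every 0-based shift at which pattern occurs in text.
vector<int> naive(string const& text, string const& pattern) {
    int n = text.size(), m = pattern.size();
    vector<int> occurrences;
    for (int i = 0; i + m <= n; i++) {      // each possible shift
        int j = 0;
        while (j < m && text[i + j] == pattern[j])
            j++;                            // advance while characters match
        if (j == m)                         // no mismatch: match found at shift i
            occurrences.push_back(i);
    }
    return occurrences;
}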
ILLUSTRATION

String S = a b c a b a a b c a b a c
Pattern P = a b a a

Step 1: Compare P[1] with S[1]

abcabaabcabac

abaa

Step 2: Compare P[2] with S[2]

abcabaabcabac

abaa
ILLUSTRATION

Step 3: Compare P[3] with S[3]

abcabaabcabac

abaa

Since a mismatch is detected, shift ‘p’ one position to the right and
repeat the comparison from the first element of ‘p’, as in steps 1 to 3.
Whenever a mismatch is detected, shift ‘p’ one more position to the
right and repeat the matching procedure.
ILLUSTRATION

Finally, a match is found after shifting ‘p’ three times to the right.

abcabaabcabac

   abaa

Drawbacks: If ‘m’ is the length of pattern P and ‘n’ is the length of text T, then the worst-case matching time is O(n*m), which becomes very slow when both the text and the pattern are long.
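
To make the O(n*m) bound concrete, the sketch below counts the character comparisons the naive matcher performs. It is an illustration only; the function name `naive_comparisons` is hypothetical and not part of the slides.

```cpp
#include <string>

// Counts the character comparisons made by the naive matcher
// over all shifts of pattern against text.
long long naive_comparisons(const std::string& text, const std::string& pattern) {
    long long comparisons = 0;
    const std::size_t n = text.size(), m = pattern.size();
    for (std::size_t i = 0; i + m <= n; ++i) {       // each possible shift
        for (std::size_t j = 0; j < m; ++j) {
            ++comparisons;
            if (text[i + j] != pattern[j]) break;    // mismatch: move to next shift
        }
    }
    return comparisons;
}
```

A text of n copies of 'a' searched for the pattern a…ab of length m forces m comparisons at each of the n-m+1 shifts. For n = 10 and m = 4 that is 4 * 7 = 28 comparisons, whereas the illustration above (S = abcabaabcabac, P = abaa) needs only 21 because most shifts fail on the first or second character.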
RABIN-KARP ALGORITHM

This is essentially the naive approach augmented with a powerful programming technique: the hash function

Algorithm :
1. Calculate the hash of the pattern P
2. Calculate the hash values of all the prefixes of the text T.
3. Now we can compare any substring of T of length |P| with P in constant
time using the calculated hashes.

This algorithm was published by Michael Rabin and Richard Karp in 1987.
STRING HASHING

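Problem - Given a string S of length n = |S|, calculate the hash value of S

Solution - use the polynomial hash

$$\text{hash}(S) = \left( \sum_{i=0}^{n-1} S[i] \cdot p^{i} \right) \bmod m$$

where S[i] is the integer code of the i-th character, and p and m are suitably chosen prime numbers.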

CHOICE OF PARAMETERS

‘p’ should be taken roughly equal to the number of characters in the input alphabet. If the input is composed of only lowercase letters of the English alphabet, p=31 is a good choice. If the input may contain both uppercase and lowercase letters, then p=53 is a good choice.

‘m’ should be a large prime. A popular choice is m = 10^9+7 (the code in this deck uses 10^9+9, which is also prime).

This is a large number, but still small enough that we can
multiply two values below m using 64-bit integers without overflow.
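
As a small sketch of the hash described above, assuming lowercase input, p = 31 and m = 10^9+9 (the values used in the Rabin-Karp code of this deck). Mapping 'a' to 1 rather than 0 is an assumption made here so that strings such as "a" and "aa" do not all hash to 0; the function name `compute_hash` is illustrative.

```cpp
#include <string>

// Polynomial hash of a lowercase string:
// sum over i of (s[i] - 'a' + 1) * p^i, taken modulo m.
long long compute_hash(const std::string& s) {
    const long long p = 31, m = 1000000009LL;
    long long hash_value = 0;
    long long p_pow = 1;                       // p^i mod m for the current i
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}
```

For example, the hash of "ab" is 1 + 2 * 31 = 63. Every intermediate product stays far below 2^63, which is why a prime of about 10^9 is convenient with 64-bit integers.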
HASH CALCULATION OF SUBSTRINGS OF GIVEN STRING

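Problem: Given a string S and indices i and j, find the hash value
of S[i..j]

Solution:
By definition we have,

$$\text{hash}(S[i..j]) = \left( \sum_{k=i}^{j} S[k] \cdot p^{k-i} \right) \bmod m$$

Multiplying by p^i gives,

$$\text{hash}(S[i..j]) \cdot p^{i} \equiv \sum_{k=i}^{j} S[k] \cdot p^{k} \equiv \text{hash}(S[0..j]) - \text{hash}(S[0..i-1]) \pmod m$$

So by knowing the hash value of each prefix of string S, we can compute the hash of any substring, scaled by the known factor p^i, in constant O(1) time. Two substrings can then be compared by bringing both values to the same power of p.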
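
The prefix-hash idea can be sketched as follows, assuming lowercase input and the same p = 31, m = 10^9+9 as in the rest of the deck. The struct `PrefixHasher` and its members are hypothetical names chosen for this example. Instead of dividing by p^i, the comparison multiplies each side by the other side's power of p, which avoids computing a modular inverse.

```cpp
#include <string>
#include <vector>

constexpr long long kBase = 31;
constexpr long long kMod = 1000000009LL;

struct PrefixHasher {
    std::vector<long long> h;      // h[i] = hash of the prefix s[0..i-1]
    std::vector<long long> p_pow;  // p_pow[i] = kBase^i mod kMod

    explicit PrefixHasher(const std::string& s)
        : h(s.size() + 1, 0), p_pow(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            p_pow[i + 1] = p_pow[i] * kBase % kMod;
            h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % kMod;
        }
    }

    // hash(s[i..j]) * p^i mod m, obtained from two prefix hashes in O(1).
    long long shifted_hash(std::size_t i, std::size_t j) const {
        return (h[j + 1] + kMod - h[i]) % kMod;
    }

    // True when the substrings of length len starting at i1 and i2 have
    // equal hashes (equal substrings always do; unequal ones rarely do).
    bool same_hash(std::size_t i1, std::size_t i2, std::size_t len) const {
        if (len == 0) return true;
        long long a = shifted_hash(i1, i1 + len - 1) * p_pow[i2] % kMod;
        long long b = shifted_hash(i2, i2 + len - 1) * p_pow[i1] % kMod;
        return a == b;
    }
};
```

With the string abcabababbc, `same_hash(3, 5, 4)` compares the two overlapping occurrences of abab and reports equal hashes, after O(n) preprocessing and O(1) work per query.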
PSEUDOCODE
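#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Returns every 0-based index at which pat occurs in text.
// Assumes both strings contain only lowercase letters 'a'..'z'.
vector<int> rabin_karp(string const& pat, string const& text) {
    const int p = 31, m = 1e9 + 9;
    int S = pat.size(), T = text.size();

    // p_pow[i] = p^i mod m (the +1 guarantees that p_pow[0] exists)
    vector<long long> p_pow(max(S, T) + 1);
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    // h[i] = hash of the prefix text[0..i-1]
    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;

    // h_s = hash of the pattern
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (pat[i] - 'a' + 1) * p_pow[i]) % m;

    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++) {
        // cur_h = hash(text[i..i+S-1]) * p^i, so scale h_s by p^i to compare
        long long cur_h = (h[i+S] + m - h[i]) % m;
        if (cur_h == h_s * p_pow[i] % m)
            occurrences.push_back(i);  // hashes agree; a collision is unlikely but possible
    }
    return occurrences;
}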
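
Equal hashes do not strictly prove equal strings, so a cautious variant confirms each candidate with a direct comparison. The sketch below is one way to do that under the same assumptions (lowercase input, p = 31, m = 10^9+9); the name `rabin_karp_verified` is hypothetical. The verification costs O(|pat|) per hash hit, so the worst case returns to O(mn) when many positions match.

```cpp
#include <string>
#include <vector>

// Rabin-Karp with an explicit character check on every hash match,
// so the result contains no false positives.
std::vector<int> rabin_karp_verified(const std::string& pat, const std::string& text) {
    const long long p = 31, m = 1000000009LL;
    const int S = pat.size(), T = text.size();
    std::vector<int> occurrences;
    if (S == 0 || S > T) return occurrences;

    // p_pow[i] = p^i mod m, h[i] = hash of text[0..i-1]
    std::vector<long long> p_pow(T + 1, 1), h(T + 1, 0);
    for (int i = 0; i < T; i++) {
        p_pow[i + 1] = p_pow[i] * p % m;
        h[i + 1] = (h[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;
    }

    long long h_s = 0;  // hash of the pattern
    for (int i = 0; i < S; i++)
        h_s = (h_s + (pat[i] - 'a' + 1) * p_pow[i]) % m;

    for (int i = 0; i + S <= T; i++) {
        long long cur_h = (h[i + S] + m - h[i]) % m;
        // Cheap hash test first; compare characters only on a hash match.
        if (cur_h == h_s * p_pow[i] % m && text.compare(i, S, pat) == 0)
            occurrences.push_back(i);
    }
    return occurrences;
}
```

Called with the pattern abaa and the text abcabaabcabac from the naive illustration, it reports the single occurrence at index 3.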
KNUTH-MORRIS-PRATT ALGORITHM

Knuth, Morris and Pratt proposed a linear-time algorithm for the
string matching problem.

A matching time of O(n) is achieved by avoiding comparisons with elements of ‘S’ that have previously been involved in a comparison with some element of the pattern ‘p’ to be matched, i.e. backtracking on the string ‘S’ never occurs.

KMP makes use of the ‘prefix function’

PREFIX FUNCTION

The prefix function of a string s of length n is defined as an array Ⲡ of length n, where Ⲡ[i] is the length of the longest proper prefix of the substring s[0..i] which is also a suffix of this substring.
A proper prefix of a string is a prefix that is not equal to the string
itself. So by definition Ⲡ[0] = 0

Mathematically,

$$\Pi[i] = \max_{k = 0, \dots, i} \{\, k : s[0..k-1] = s[i-(k-1)..i] \,\}$$
EXAMPLE
S = “aabaaab”

i   PREFIX s[0..i]   Ⲡ[i]
0   a                0
1   aa               1
2   aab              0
3   aaba             1
4   aabaa            2
5   aabaaa           2
6   aabaaab          3
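
The definition can be turned into code literally, which is useful for checking small tables like the one above. This is a deliberately slow O(n^3) sketch, not the algorithm the deck builds toward; the name `prefix_function_naive` is hypothetical.

```cpp
#include <string>
#include <vector>

// Computes the prefix function straight from its definition:
// for each i, try every proper border length k of s[0..i], longest first.
std::vector<int> prefix_function_naive(const std::string& s) {
    const int n = s.size();
    std::vector<int> pi(n, 0);
    for (int i = 0; i < n; i++) {
        for (int k = i; k >= 1; k--) {
            // Compare the prefix s[0..k-1] with the suffix s[i-k+1..i].
            if (s.compare(0, k, s, i - k + 1, k) == 0) {
                pi[i] = k;
                break;  // the first hit is the longest
            }
        }
    }
    return pi;
}
```

For S = aabaaab it produces 0, 1, 0, 1, 2, 2, 3, matching the table.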
ALGORITHM TO COMPUTE PREFIX FUNCTION

● We compute the prefix values Ⲡ[i] in a loop iterating from i = 1 to i = n-1 (Ⲡ[0] is simply assigned 0)

● To calculate the current value Ⲡ[i], we use a variable j denoting the length of the best suffix ending at position i-1. Initially j = Ⲡ[i-1]

● Test whether the suffix of length j+1 is also a prefix by comparing s[j] and s[i]. If they are equal, we assign Ⲡ[i] = j+1. Otherwise, we reduce j to Ⲡ[j-1] and repeat this step

● If we have reached the length j = 0 and still don’t have a match, then we assign Ⲡ[i] = 0 and go to the next index i+1
PSEUDOCODE

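#include <string>
#include <vector>
using namespace std;

vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);                  // pi[0] = 0 by definition
    for (int i = 1; i < n; i++) {
        int j = pi[i-1];                // best border length for s[0..i-1]
        while (j > 0 and s[i] != s[j])  // fall back to the next shorter border
            j = pi[j-1];
        if (s[i] == s[j]) ++j;          // the border can be extended by s[i]
        pi[i] = j;
    }
    return pi;
}

Runtime - O(n)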
KMP MATCHER

● This is a classical application of the prefix function, which we just learned
● Given a text T and a string S, we need to find all occurrences of S
in T
● Denote by n the length of the string S and by m the length
of the text T, i.e. n = |S| and m = |T|
● Generate the string S + # + T, where # is a separator that appears in
neither S nor T. Now calculate the prefix function of this
string
● By definition, for a position i inside the T part, Ⲡ[i] is the length of the longest
prefix of S that ends at position i
● Note: Ⲡ[i] cannot be larger than n because of the separator #
that we used
● If Ⲡ[i] == n, then the string S appears completely, ending at
this position
EXAMPLE

S = “aba”
T = “aababac”
Generated string(G) = “aba#aababac”
Index (i) in G   PREFIX of T ending at i   Ⲡ[i]
4                a                         1
5                aa                        1
6                aab                       2
7                aaba                      3
8                aabab                     2
9                aababa                    3
10               aababac                   0

Ⲡ[i] = n (= 3) at positions i = 7 and i = 9 of G. Each match starts at index i - 2n of the text, so the pattern S occurs at indices 1 and 3 of T.
PSEUDOCODE

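// Uses prefix_function() defined earlier.
// Assumes '#' occurs in neither pattern nor text.
vector<int> kmp(string pattern, string text) {
    string str = pattern + "#" + text;
    int n = pattern.length(), m = str.length();
    vector<int> pi = prefix_function(str);
    vector<int> ret;
    for (int i = n + 1; i < m; i++) {
        // A full match ends at position i of str, so it starts at
        // i - (n - 1) in str, which is index i - 2n of text.
        if (pi[i] == n) ret.push_back(i - 2*n);
    }
    return ret;
}

Runtime: O(n+m)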
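
The concatenated string S + # + T needs O(n+m) extra memory and a separator that is guaranteed not to occur in the input. A common alternative, sketched here under the name `kmp_streaming` (hypothetical, not from the slides), computes the prefix function of the pattern only and then scans the text while carrying the current match length.

```cpp
#include <string>
#include <vector>

// KMP without building pattern + '#' + text: only the prefix function
// of the pattern is stored, so extra memory is O(|pattern|).
std::vector<int> kmp_streaming(const std::string& pattern, const std::string& text) {
    const int n = pattern.size();
    std::vector<int> occurrences;
    if (n == 0) return occurrences;

    // Prefix function of the pattern alone.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && pattern[i] != pattern[j]) j = pi[j - 1];
        if (pattern[i] == pattern[j]) ++j;
        pi[i] = j;
    }

    // j = length of the longest pattern prefix that is a suffix of the text read so far.
    int j = 0;
    for (int i = 0; i < (int)text.size(); i++) {
        while (j > 0 && text[i] != pattern[j]) j = pi[j - 1];
        if (text[i] == pattern[j]) ++j;
        if (j == n) {
            occurrences.push_back(i - n + 1);  // match ends at i
            j = pi[n - 1];                     // fall back so overlapping matches are found
        }
    }
    return occurrences;
}
```

For S = aba and T = aababac it reports indices 1 and 3, the same result as the example above, and the text index never moves backwards.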
SUMMARY

Algorithm             Time Complexity   Key Idea                                                      Approach

Brute Force (Naive)   O(m*n)            Try every shift and compare character by character            Linear searching

Rabin-Karp            Θ(m+n)            Compare the text and the pattern using their hash values      Hashing based
                                        (hash-only comparison; rare false matches are possible)

Knuth-Morris-Pratt    O(m+n)            Use the prefix function of the pattern to avoid re-checking   Prefix-function based
                                        text characters

n = |pattern| , length of pattern

m = |text| , length of text
THANK YOU
