SOU Lecture Handout ADA Unit-8
1010043316 (ANALYSIS & DESIGN OF ALGORITHM)
LECTURE COMPANION SEMESTER: 5 PREPARED BY: MONALI SUTHAR

CHAPTER 8 – String Matching

1. Introduction

A string matching algorithm is also called a "string searching algorithm." This is an important class of string algorithms whose task is to find the places where one or several strings (the patterns) occur within a larger string (the text).

Given a text array T[1..n] of n characters and a pattern array P[1..m] of m characters, the problem is to find an integer s, called a valid shift, where 0 ≤ s ≤ n - m and T[s+1..s+m] = P[1..m]. In other words, we want to find whether P occurs in T, i.e., whether P is a substring of T. The elements of P and T are characters drawn from some finite alphabet such as {0, 1} or {A, B, ..., Z, a, b, ..., z}.

Given a string T[1..n], the substrings are represented as T[i..j] for some 1 ≤ i ≤ j ≤ n: the string formed by the characters in T from index i to index j, inclusive. This means that a string is a substring of itself (take i = 1 and j = n).

A proper substring of string T[1..n] is T[i..j] for some 1 ≤ i ≤ j ≤ n where we must have either i > 1 or j < n.

Using these descriptions, we can say that, given any string T[1..n], the substrings are

T[i..j] = T[i] T[i+1] T[i+2] ... T[j] for some 1 ≤ i ≤ j ≤ n,

and the proper substrings are

T[i..j] = T[i] T[i+1] T[i+2] ... T[j] for some 1 ≤ i ≤ j ≤ n with i > 1 or j < n.

2. The naive string matching algorithm

• The string-matching problem is defined as follows.

• We assume that the text is an array T[1…n] of length n and that the pattern is an array P[1…m] of length m ≤ n.

• We further assume that the elements of P and T are characters drawn from a finite alphabet Σ. For example, we may have Σ = {0, 1} or Σ = {a, b, ..., z}.
DEPARTMENT OF COMPUTER ENGINEERING Page | 1
*Proprietary material of SILVER OAK UNIVERSITY

• The character arrays P and T are often called strings of characters.

• We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s + 1 in text T) if 0 ≤ s ≤ n - m and T[s + 1…s + m] = P[1…m] (that is, if T[s + j] = P[j], for 1 ≤ j ≤ m).

• If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.

• The naive algorithm finds all valid shifts using a loop that checks the condition P[1…m] = T[s + 1…s + m] for each of the n - m + 1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
1 n ← length[T]
2 m ← length[P]
3 for s ← 0 to n – m do
4     if P[1…m] == T[s + 1…s + m]
5         then print "Pattern occurs with shift" s

• The naive string-matching procedure can be interpreted graphically as sliding a "template" containing the pattern over the text, noting for which shifts all of the characters on the template equal the corresponding characters in the text, as illustrated in the figure.

• The for loop beginning on line 3 considers each possible shift explicitly.

• The test on line 4 determines whether the current shift is valid or not; this test involves an implicit loop to check corresponding character positions until all positions match successfully or a mismatch is found.

• Line 5 prints out each valid shift s.
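The procedure above can be sketched in C++ roughly as follows. This is a minimal illustration, not the handout's own code: the function name naiveStringMatcher is an assumption, strings are indexed from 0 rather than 1, and the valid shifts are returned in a vector instead of being printed.

```cpp
#include <string>
#include <vector>

// Returns every valid shift s at which pattern occurs in text.
// With 0-based strings, shift s means text[s .. s+m-1] == pattern[0 .. m-1].
std::vector<int> naiveStringMatcher(const std::string& text, const std::string& pattern) {
    std::vector<int> shifts;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int s = 0; s + m <= n; ++s) {          // the n - m + 1 candidate shifts
        int j = 0;
        // The "implicit loop": compare until all m positions match or one mismatches.
        while (j < m && text[s + j] == pattern[j]) ++j;
        if (j == m) shifts.push_back(s);        // every position matched: s is valid
    }
    return shifts;
}
```

Calling naiveStringMatcher("AABAACAADAABAABA", "AABA") would return the shifts 0, 9 and 12.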


2.1. Example:

• In the above example, the valid shift is s = 3, for which we found the occurrence of pattern P in text T.

• Procedure NAIVE-STRING-MATCHER takes time O((n - m + 1)m), and this bound is tight in the worst case. The running time of NAIVE-STRING-MATCHER is equal to its matching time, since there is no preprocessing.
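To see why the bound is tight, one can count character comparisons on an input such as a text of n copies of 'a' and a pattern of m copies of 'a', where every one of the n - m + 1 shifts needs all m comparisons. The helper below is a hypothetical sketch for that experiment (the name countNaiveComparisons is an assumption):

```cpp
#include <string>

// Counts the character comparisons performed by the naive matcher.
long long countNaiveComparisons(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    long long comparisons = 0;
    for (int s = 0; s + m <= n; ++s) {
        for (int j = 0; j < m; ++j) {
            ++comparisons;                          // one comparison per inner step
            if (text[s + j] != pattern[j]) break;   // stop at the first mismatch
        }
    }
    return comparisons;
}
```

For n = 10 and m = 3 with all characters equal to 'a', the count is (10 - 3 + 1) · 3 = 24, while a pattern whose first character never matches costs only one comparison per shift.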

3. The Rabin-Karp algorithm

• This algorithm makes use of elementary number-theoretic notions such as the equivalence of two numbers modulo a third number.

• Let us assume that Σ = {0, 1, 2, ..., 9}, so that each character is a decimal digit. (In the general case, we can assume that each character is a digit in radix-d notation, where d = |Σ|.)

• We can then view a string of k consecutive characters as representing a length-k decimal number. The character string 31415 thus corresponds to the decimal number 31,415.

• Given a pattern P[1…m], let p denote its corresponding decimal value.

• In a similar manner, given a text T[1…n], let t_s denote the decimal value of the length-m substring T[s + 1…s + m], for s = 0, 1, ..., n - m.

• Certainly, t_s = p if and only if T[s + 1…s + m] = P[1…m]; thus, s is a valid shift if and only if t_s = p.

• We can compute p in time Θ(m) using Horner's rule:

p = P[m] + 10(P[m - 1] + 10(P[m - 2] + · · · + 10(P[2] + 10P[1])…))

• The value t_0 can be similarly computed from T[1…m] in time Θ(m).



• To compute the remaining values t_1, t_2, ..., t_{n-m} in time Θ(n - m), it suffices to observe that t_{s+1} can be computed from t_s in constant time, since

t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1]

• Subtracting 10^{m-1} T[s + 1] removes the high-order digit from t_s, multiplying the result by 10 shifts the number left one position, and adding T[s + m + 1] brings in the appropriate low-order digit.

• For example, if m = 5 and t_s = 31415, then we wish to remove the high-order digit T[s + 1] = 3 and bring in the new low-order digit (suppose it is T[s + 5 + 1] = 2) to obtain t_{s+1} = 10(31415 - 10000 · 3) + 2 = 14152.

• The only difficulty with this procedure is that p and t_s may be too large to work with conveniently.

• There is a simple cure for this problem: compute p and the t_s values modulo a suitable modulus q.

t_{s+1} = (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q

• Here h ≡ d^{m-1} (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.

• The solution of working modulo q is not perfect, however: t_s ≡ p (mod q) does not imply that t_s = p. On the other hand, t_s ≢ p (mod q) definitely implies t_s ≠ p, so that shift s is invalid.

• Any shift s for which t_s ≡ p (mod q) must be tested further to see whether s is really valid or just a spurious hit.

• This additional test explicitly checks the condition T[s + 1…s + m] = P[1…m].
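The Horner computation of t_0 and the constant-time update from t_s to t_{s+1} can be sketched as below. This is an illustrative sketch under stated assumptions: the input is a string of decimal digits (so d = 10), the function name rollingWindowValues is hypothetical, and q is added before taking the remainder so that the intermediate difference never becomes negative in C++.

```cpp
#include <string>
#include <vector>

// Returns t_s mod q for every length-m window of a string of decimal digits.
std::vector<long long> rollingWindowValues(const std::string& digits, int m, long long q) {
    const long long d = 10;                         // radix: decimal digits assumed
    const int n = static_cast<int>(digits.size());
    std::vector<long long> t;
    if (m <= 0 || m > n) return t;

    long long h = 1;                                // h = d^(m-1) mod q
    for (int i = 1; i < m; ++i) h = (h * d) % q;

    long long ts = 0;                               // t_0 by Horner's rule
    for (int i = 0; i < m; ++i) ts = (d * ts + (digits[i] - '0')) % q;
    t.push_back(ts);

    for (int s = 0; s + m < n; ++s) {
        long long high = (digits[s] - '0') * h % q; // contribution of the leading digit
        // Remove the high-order digit, shift left by one digit, bring in the new digit.
        ts = (d * ((ts - high + q) % q) + (digits[s + m] - '0')) % q;
        t.push_back(ts);
    }
    return t;
}
```

With a modulus larger than any window value, for instance q = 1000003 and m = 5, the text 314152 yields the exact values 31415 and 14152 from the example above; with q = 13 the same two windows reduce to 7 and 8.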

Algorithm RABIN-KARP-MATCHER(T, P, d, q)

n ← length[T]
m ← length[P]
h ← d^{m-1} mod q
p ← 0
t_0 ← 0
for i ← 1 to m do
    p ← (d·p + P[i]) mod q
    t_0 ← (d·t_0 + T[i]) mod q
for s ← 0 to n – m do
    if p == t_s then
        if P[1..m] == T[s+1..s+m] then
            print "Pattern occurs with shift" s
    if s < n – m then
        t_{s+1} ← (d(t_s – T[s+1]h) + T[s+m+1]) mod q
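A C++ rendering of this pseudocode might look like the sketch below. It is an illustration under assumptions rather than a reference implementation: the name rabinKarpMatcher is hypothetical, strings are 0-indexed, each character is treated as a digit whose value is its byte code, and q is assumed small enough that d · q fits in a long long.

```cpp
#include <string>
#include <vector>

// Returns all valid shifts of pattern in text, using hashing modulo q with radix d.
std::vector<int> rabinKarpMatcher(const std::string& text, const std::string& pattern,
                                  long long d, long long q) {
    std::vector<int> shifts;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return shifts;

    long long h = 1;                               // h = d^(m-1) mod q
    for (int i = 1; i < m; ++i) h = (h * d) % q;

    long long p = 0, t = 0;                        // p and t_0, both by Horner's rule
    for (int i = 0; i < m; ++i) {
        p = (d * p + static_cast<unsigned char>(pattern[i])) % q;
        t = (d * t + static_cast<unsigned char>(text[i])) % q;
    }

    for (int s = 0; s + m <= n; ++s) {
        // A hash hit must be verified explicitly to rule out a spurious hit.
        if (p == t && text.compare(s, m, pattern) == 0) shifts.push_back(s);
        if (s + m < n) {
            long long high = static_cast<unsigned char>(text[s]) * h % q;
            t = (d * ((t - high + q) % q) + static_cast<unsigned char>(text[s + m])) % q;
        }
    }
    return shifts;
}
```

Because every hash hit is verified, the returned shifts are correct for any modulus q ≥ 2; the choice of q only affects how many spurious hits have to be checked.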

3.1. Analysis

• RABIN-KARP-MATCHER takes Θ(m) preprocessing time, and its matching time is Θ(m(n – m + 1)) in the worst case.

3.2. Example

Given T = 31415926535 and P = 26

We choose q = 11

p mod q = 26 mod 11 = 4
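The rest of this example can be worked through with a small helper that reduces every length-2 window of T modulo 11 and separates real matches from spurious hits. The sketch below is an assumption-laden illustration: the names Hits and classifyHits are hypothetical, the input must consist of decimal digits, and each window value is recomputed from scratch for clarity instead of being rolled.

```cpp
#include <string>
#include <vector>

struct Hits {
    std::vector<int> valid;     // shifts where the window really equals the pattern
    std::vector<int> spurious;  // shifts where only the residues modulo q agree
};

Hits classifyHits(const std::string& text, const std::string& pattern, int q) {
    Hits hits;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return hits;

    // Decimal value of s[start .. start+len-1] reduced modulo q (Horner's rule).
    auto residue = [q](const std::string& s, int start, int len) {
        int value = 0;
        for (int i = 0; i < len; ++i) value = (10 * value + (s[start + i] - '0')) % q;
        return value;
    };

    const int p = residue(pattern, 0, m);
    for (int s = 0; s + m <= n; ++s) {
        if (residue(text, s, m) != p) continue;    // residues differ: s is invalid
        if (text.compare(s, m, pattern) == 0) hits.valid.push_back(s);
        else hits.spurious.push_back(s);
    }
    return hits;
}
```

For T = 31415926535, P = 26 and q = 11, the ten windows 31, 14, 41, 15, 59, 92, 26, 65, 53, 35 have residues 9, 3, 8, 4, 4, 4, 4, 10, 9, 2. Shifts 3, 4 and 5 are therefore spurious hits, and shift 6 is the only valid shift.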


4. String Matching with finite automata

• Many string-matching algorithms build a finite automaton that scans the text string T for all occurrences of the pattern P.

• We begin with the definition of a finite automaton. We then examine a special string-matching automaton and show how it can be used to find occurrences of a pattern in a text.

• Finally, we shall show how to construct the string-matching automaton for a given input pattern.

4.1. Finite automata

A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where

• Q is a finite set of states,

• q0 ∈ Q is the start state,

• A ⊆ Q is a distinguished set of accepting states,


• Σ is a finite input alphabet,

• δ is a function from Q × Σ into Q, called the transition function of M.

• The finite automaton begins in state q0 and reads the characters of its input string one at a time.

• If the automaton is in state q and reads input character a, it moves ("makes a transition") from state q to state δ(q, a).

• Whenever its current state q is a member of A, the machine M is said to have accepted the string read so far. An input that is not accepted is said to be rejected.

• The following figure illustrates these definitions with a simple two-state automaton.
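The 5-tuple definition translates quite directly into code. In the sketch below, the type FiniteAutomaton and the helper makeOddTrailingAs are hypothetical names, and the two-state machine is an assumed example chosen here for illustration: over Σ = {a, b}, state 1 is accepting and is reached exactly when the input ends in an odd number of a's.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

struct FiniteAutomaton {
    int startState = 0;                            // q0
    std::set<int> acceptingStates;                 // A
    std::map<std::pair<int, char>, int> delta;     // transition function on Q x Sigma

    // Runs the automaton over input and reports whether it ends in an accepting state.
    // Throws std::out_of_range if a character lies outside the alphabet.
    bool accepts(const std::string& input) const {
        int q = startState;
        for (char a : input) q = delta.at({q, a}); // move from q to delta(q, a)
        return acceptingStates.count(q) > 0;
    }
};

// Assumed example: Q = {0, 1}, q0 = 0, A = {1}, Sigma = {a, b}.
FiniteAutomaton makeOddTrailingAs() {
    FiniteAutomaton m;
    m.startState = 0;
    m.acceptingStates = {1};
    m.delta = {{{0, 'a'}, 1}, {{0, 'b'}, 0}, {{1, 'a'}, 0}, {{1, 'b'}, 0}};
    return m;
}
```

On input abaaa this machine visits states 1, 0, 1, 0, 1 and accepts; on input abbaa it visits 1, 0, 0, 1, 0 and rejects.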

4.2. String-matching automata

• There is a string-matching automaton for every pattern P; this automaton must be constructed from the pattern in a pre-processing step before it can be used to search the text string.

• Our example pattern is P = ababaca.

• In order to search for the pattern, we define a suffix function σ, which reports how much of the pattern matches the end of the text read so far: σ(x) is the length of the longest prefix of P that is a suffix of x.

σ(x) = max {k : P_k ⊐ x}

Here P_k denotes the prefix P[1..k] of the pattern, and P_k ⊐ x means that P_k is a suffix of x. Note that σ(x) is a length (an integer), not a string. For P = ababaca the first few prefixes are

P_1 = a

P_2 = ab

P_3 = aba

P_4 = abab

For example, σ(caba) = 3, because P_3 = aba is the longest prefix of P that is a suffix of caba.
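The suffix function can be computed directly from its definition by trying the longest candidate prefix first. The function below is a minimal sketch with a hypothetical name (suffixFunction); it is meant to illustrate the definition, not to be efficient.

```cpp
#include <algorithm>
#include <string>

// sigma(x): length of the longest prefix of pattern that is also a suffix of x.
int suffixFunction(const std::string& pattern, const std::string& x) {
    int k = static_cast<int>(std::min(pattern.size(), x.size()));
    // Try k = min(|P|, |x|), then smaller values, until P_k is a suffix of x.
    // k = 0 always succeeds, because the empty prefix is a suffix of every string.
    while (k > 0 && x.compare(x.size() - k, k, pattern, 0, k) != 0) --k;
    return k;
}
```

For the pattern ababaca, suffixFunction returns 3 for x = caba and 7 for x = ababaca; for the pattern ab it returns 1 for x = ccaca and 2 for x = ccab.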

• We define the string-matching automaton that corresponds to a given pattern P[1…m] as follows.

• The state set Q is {0, 1, ..., m}. The start state q0 is state 0, and state m is the only accepting state.

• The transition function δ is defined by the following equation, for any state q and character a, where P_q a denotes the prefix P_q followed by the character a:

δ(q, a) = σ(P_q a)

ALGORITHM FINITE-AUTOMATON-MATCHER(T, δ, m)

n ← length[T]
q ← 0
for i ← 1 to n do
    q ← δ(q, T[i])
    if q == m
        then print "Pattern occurs with shift" i – m

ALGORITHM COMPUTE-TRANSITION-FUNCTION(P, Σ)

m ← length[P]
for q ← 0 to m do
    for each character a ∈ Σ do
        k ← min(m + 1, q + 2)
        repeat k ← k - 1
        until P_k ⊐ P_q a
        δ(q, a) ← k
return δ

• This procedure computes δ(q, a) in a straightforward manner according to its definition.

• The two nested for loops consider all states q and characters a, and the body of the inner loop sets δ(q, a) to be the largest k such that P_k ⊐ P_q a. The code starts with the largest conceivable value of k, which is min(m, q + 1), and decreases k until P_k ⊐ P_q a.

• Time complexity for string matching algorithm


4.3. Example

• Repeating this procedure for q = 0 to 7 gives the transition function table; the pattern can then be matched using this table.
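Building the table and matching with it can be sketched in C++ as follows. This is a hedged illustration of COMPUTE-TRANSITION-FUNCTION and FINITE-AUTOMATON-MATCHER, not the handout's code: the alias TransitionTable and both function names are assumptions, strings are 0-indexed, and a text character outside the given alphabet is assumed to send the automaton back to state 0.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using TransitionTable = std::vector<std::map<char, int>>;  // delta[q][a]

// delta[q][a] = sigma(P_q a), computed straight from the definition.
TransitionTable computeTransitionFunction(const std::string& pattern,
                                          const std::string& alphabet) {
    const int m = static_cast<int>(pattern.size());
    TransitionTable delta(m + 1);
    for (int q = 0; q <= m; ++q) {
        for (char a : alphabet) {
            const std::string candidate = pattern.substr(0, q) + a;  // P_q a
            int k = std::min(m, q + 1);            // largest conceivable value of k
            // Decrease k until P_k is a suffix of P_q a (k = 0 always qualifies).
            while (k > 0 &&
                   candidate.compare(candidate.size() - k, k, pattern, 0, k) != 0) --k;
            delta[q][a] = k;
        }
    }
    return delta;
}

// Scans text once and returns every shift at which state m is reached.
std::vector<int> finiteAutomatonMatcher(const std::string& text,
                                        const TransitionTable& delta, int m) {
    std::vector<int> shifts;
    int q = 0;
    for (int i = 0; i < static_cast<int>(text.size()); ++i) {
        auto it = delta[q].find(text[i]);
        q = (it == delta[q].end()) ? 0 : it->second;  // unknown character: restart
        if (q == m) shifts.push_back(i + 1 - m);      // match ends at position i
    }
    return shifts;
}
```

For P = ababaca over the alphabet {a, b, c}, the table contains for instance δ(5, b) = 4, δ(5, c) = 6, δ(7, a) = 1 and δ(7, b) = 2, and scanning the text abababacaba reports the single shift 2.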


5. The Knuth-Morris-Pratt algorithm

Given a text txt[0 . . . N-1] and a pattern pat[0 . . . M-1], write a function search(char pat[],
char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that N > M.

Examples:

Input: txt[] = “THIS IS A TEST TEXT”, pat[] = “TEST”

Output: Pattern found at index 10

Input: txt[] = “AABAACAADAABAABA”

pat[] = “AABA”

Output: Pattern found at index 0, Pattern found at index 9, Pattern found at index 12

Pattern searching is an important problem in computer science. When we search for a string in a Notepad or Word file, a browser, or a database, pattern-searching algorithms are used to show the search results.


The worst-case complexity of the naive algorithm is O(m(n-m+1)). The time complexity of the KMP algorithm is O(n+m) in the worst case.

Knuth-Morris-Pratt (KMP) is an algorithm that checks the characters from left to right. When a prefix of the pattern appears again later within the pattern, KMP uses that property to avoid re-comparing text characters that have already been matched, which improves the time complexity even in the worst case.

After O(m) preprocessing of the pattern, the matching phase of KMP takes O(n) time.

5.1. Input and Output

Input:

Main String: “AAAABAAAAABBBAAAAB”, The pattern “AAAB”

Output:

Pattern found at location: 1

Pattern found at location: 7

Pattern found at location: 14

5.2. Algorithm

findPrefix(pattern, m, prefArray)

Input − The pattern, the length of the pattern, and an array to store the prefix lengths

Output − The array filled so that prefArray[i] is the length of the longest proper prefix of pattern[0..i] that is also a suffix of it

Begin
   length := 0
   prefArray[0] := 0
   for i := 1 to m - 1, do
      if pattern[i] = pattern[length], then
         increase length by 1
         prefArray[i] := length
      else
         if length ≠ 0 then
            length := prefArray[length - 1]
            decrease i by 1 (so that the same index i is examined again)
         else
            prefArray[i] := 0
   done
End

kmpAlgorithm(text, pattern)

Input − The main text, and the pattern which will be searched

Output − The locations where the pattern is found

Begin
   n := size of text
   m := size of pattern
   i := 0
   j := 0
   call findPrefix(pattern, m, prefArray)

   while i < n, do
      if text[i] = pattern[j], then
         increase i and j by 1
      if j = m, then
         print the location (i-j) as there is the pattern
         j := prefArray[j-1]
      else if i < n AND pattern[j] ≠ text[i] then
         if j ≠ 0 then
            j := prefArray[j - 1]
         else
            increase i by 1
   done
End

Example

#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Fills prefArray so that prefArray[i] is the length of the longest proper
// prefix of pattern[0..i] that is also a suffix of it.
void findPrefix(const string& pattern, int m, vector<int>& prefArray) {
   int length = 0;
   prefArray[0] = 0; // first place is always 0 as there is no proper prefix

   for (int i = 1; i < m; i++) {
      if (pattern[i] == pattern[length]) {
         length++;
         prefArray[i] = length;
      } else {
         if (length != 0) {
            length = prefArray[length - 1];
            i--; // cancel the loop increment so the same index is retried
         } else
            prefArray[i] = 0;
      }
   }
}

// Stores the start index of every match in locArray and the number of matches in loc.
// locArray must already have room for every match (at most mainString.size() entries).
void kmpPattSearch(const string& mainString, const string& pattern, vector<int>& locArray, int& loc) {
   int n, m, i = 0, j = 0;
   n = mainString.size();
   m = pattern.size();
   loc = 0;
   if (m == 0 || m > n)
      return; // nothing to search for

   vector<int> prefixArray(m); // prefix array has the same size as the pattern
   findPrefix(pattern, m, prefixArray);

   while (i < n) {
      if (mainString[i] == pattern[j]) {
         i++; j++;
      }

      if (j == m) {
         locArray[loc] = i - j; // pattern found at position i-j
         loc++;
         j = prefixArray[j - 1]; // get the prefix length from the array
      } else if (i < n && pattern[j] != mainString[i]) {
         if (j != 0)
            j = prefixArray[j - 1];
         else
            i++;
      }
   }
}

int main() {
   string str = "AAAABAAAAABBBAAAAB";
   string patt = "AAAB";
   vector<int> locationArray(str.size()); // room for every possible match position
   int index;
   kmpPattSearch(str, patt, locationArray, index);

   for (int i = 0; i < index; i++) {
      cout << "Pattern found at location: " << locationArray[i] << endl;
   }
}

Output

Pattern found at location: 1


Pattern found at location: 7

Pattern found at location: 14
