
Lecture#8 - String Matching Algorithm

Uploaded by

Pritom Das

STRING MATCHING

CSE-237 : ALGORITHM DESIGN AND ANALYSIS
BASIC

String Matching Algorithms


STRING MATCHING

• Problem : Find whether a pattern P of length m occurs within a text T of length n.
• Given a pattern P[1..m] and a text T[1..n], find all occurrences of P in T.
• Both P and T are strings over a finite alphabet Σ, i.e. they belong to Σ*.

• Solution : P occurs in T with shift s (beginning at position s+1) if 0 ≤ s ≤ n-m and
P[1]=T[s+1], P[2]=T[s+2], …, P[m]=T[s+m].
• If so, s is called a valid shift; otherwise, it is an invalid shift.
• Example : P = ABABDE occurs in T = ABABABDEAA with shift s=2 (P lines up with T[3..8]).
• Note : one occurrence may begin within another one.
• P=abab, T=abcabababbc: P occurs at s=3 and s=5.



APPLICATIONS OF STRING MATCHING

• Remember :
• Text is the string that we are searching in.
• Pattern is the string that we are searching for.
• Shift is an offset into the text at which the pattern is placed.

• Why do we need string matching?


• String matching is used in various applications like spell checkers, keyword
matching, spam filters, search engines, plagiarism detectors,
bioinformatics & DNA sequencing, database searching etc.



NOTATION AND TERMINOLOGY

• Prefix : w is a prefix of x if x = wy for some y ∈ Σ*. Denoted as w ⊏ x.
• Suffix : w is a suffix of x if x = yw for some y ∈ Σ*. Denoted as w ⊐ x.
• For example, we have ab ⊏ abcca and cca ⊐ abcca.

• Overlapping-Suffix Lemma :
Suppose x, y and z are strings such that x ⊐ z and y ⊐ z. Then
a) if |x| < |y|, then x ⊐ y;
b) if |x| > |y|, then y ⊐ x;
c) if |x| = |y|, then x = y.
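
The prefix and suffix relations, and the conclusion of the lemma, can be checked mechanically. The following is a minimal Python sketch; the function names is_prefix, is_suffix and lemma_holds are chosen here for illustration and do not come from the slides.

```python
def is_prefix(w: str, x: str) -> bool:
    """True if x = w + y for some (possibly empty) string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """True if x = y + w for some (possibly empty) string y."""
    return x.endswith(w)


def lemma_holds(x: str, y: str, z: str) -> bool:
    """Check the lemma's conclusion for one triple (x, y, z)."""
    if not (is_suffix(x, z) and is_suffix(y, z)):
        return True  # the lemma only makes a claim when both are suffixes of z
    if len(x) < len(y):
        return is_suffix(x, y)
    if len(x) > len(y):
        return is_suffix(y, x)
    return x == y


print(is_prefix("ab", "abcca"), is_suffix("cca", "abcca"))
print(lemma_holds("ca", "cca", "abcca"))
```

Both suffixes of z end at the same position, which is why the shorter one must also be a suffix of the longer one.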



BASIC CLASSIFICATION

• Naive Algorithm : performs a brute-force comparison of each character in the pattern at each possible placement of the pattern in the text.
• It is O(mn) in the worst case.

• Rabin-Karp Algorithm : compares the strings' hash values rather than the strings themselves.
• Performs well in practice and generalizes to algorithms for related problems such as 2D pattern matching.

• Knuth-Morris-Pratt Algorithm : a modification of the brute-force algorithm that runs in O(m+n) in the worst case.
• It improves the running time by taking advantage of the prefix function.



NAÏVE ALGORITHM



NAÏVE STRING MATCHING

• Naïve (Brute Force) approach : The naive algorithm finds all valid
shifts using a loop that checks the condition P[1..m] = T[s+1..s+m]
for each of the n - m + 1 possible values of s.
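
The loop described above can be sketched in Python as follows. This is a minimal sketch, assuming 0-indexed strings, so a returned shift s means the pattern lines up with text[s:s+m], which is T[s+1..s+m] in the 1-indexed notation of the slides. The name naive_match is chosen here for illustration.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s, i.e. every s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # the n - m + 1 candidate shifts
        if text[s:s + m] == pattern:    # up to m character comparisons
            shifts.append(s)
    return shifts


print(naive_match("abcabababbc", "abab"))  # [3, 5]
```

Each candidate shift costs up to m comparisons, which gives the O(m(n-m+1)) worst case.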



NAÏVE STRING MATCHING ...

• Problem with naïve algorithm : Suppose P = ABABC, T = CABABABCD

• Whenever a mismatch occurs after several characters have matched, the comparison
goes back in T and restarts from the character that follows the start of the previous attempt,
so characters of T that were already examined are compared again.
• Worst-case complexity : O(m(n-m+1)). Average-case performance is surprisingly good,
provided the strings are neither long nor full of repeated letters.

• Can we do better, i.e. avoid going back in T?



RABIN-KARP ALGORITHM



RABIN-KARP ALGORITHM

• The Rabin-Karp string matching algorithm is essentially the naive approach
augmented with a powerful programming technique: a hash function.
• Algorithm
• Calculate the hash value of the pattern P.
• Calculate the hash value of each m-character substring (window) of the text T.
• If the two hash values are equal (a constant-time check), compare the pattern
with the m-character window character by character, using the brute-force approach, to confirm the match.
• If the hash values for a particular window are unequal, the algorithm
moves on and calculates the hash value of the next m-character window.



PSEUDOCODE
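
A minimal Python sketch of the Rabin-Karp steps, not a reproduction of the slide's pseudocode. It assumes 0-indexed strings, character codes from ord() as digit values, base d = 256 and prime q = 101; these defaults and the name rabin_karp are illustrative choices.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all valid shifts, verifying every hash hit character by character."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # weight of the leading character, mod q
    p_hash = t_hash = 0
    for i in range(m):                   # hash of the pattern and of the first window
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes are only a "hit"; the direct comparison rules out spurious hits.
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                    # roll the window one character to the right
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abcabababbc", "abab"))  # [3, 5]
```

A small q makes spurious hits more frequent but never changes the result, because every hit is verified before it is reported.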



EXAMPLE



CALCULATING HASH VALUE

• Let’s associate one integer with every letter of the alphabet.
  • Hence we can say ‘A’ corresponds to 1, ‘B’ corresponds to 2, ‘C’ corresponds to 3, and so on.
  • Similarly, all other letters correspond to their index values.

• The hash value of the string “CAT” will be : 3*100 + 1*10 + 20 = 330



WHAT IF TWO VALUES COLLIDE

• If the hash values of two strings match, it is called a ‘hit’.
• Two or more different strings may produce the same hash value.
String 1: “CBJ” hash code = 3*100 + 2*10 + 10 = 330
String 2: “CAT” hash code = 3*100 + 1*10 + 20 = 330
• Hence it is necessary to check whether a hit is a genuine match.
• Every hit has to be tested to verify that it is not spurious, i.e. that
P[1..m] = T[s+1..s+m]
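
The collision above can be reproduced with a few lines of Python. This is a sketch of the toy hash used on the slide, assuming letters map to A=1 … Z=26 and are combined as base-10 digits; the names letter_value and toy_hash are illustrative.

```python
def letter_value(ch: str) -> int:
    """Map 'A' -> 1, 'B' -> 2, ..., 'Z' -> 26."""
    return ord(ch.upper()) - ord("A") + 1


def toy_hash(s: str) -> int:
    """Combine the letter values as digits of a base-10 number."""
    value = 0
    for ch in s:
        value = value * 10 + letter_value(ch)
    return value


print(toy_hash("CBJ"), toy_hash("CAT"))  # 330 330
```

Because letter values above 9 overflow into the next digit, different strings can share a hash value, so equal hashes alone do not prove a match.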



MATHEMATICAL RESOLUTION

• Let’s treat an m-character sequence as an m-digit number in base b.
Then the text substring t[i .. i+m-1] is mapped to the number
x(i) = t[i]*b^(m-1) + t[i+1]*b^(m-2) + … + t[i+m-1]

• If m is large, this number becomes very large, so we reduce it
modulo a prime number q and use x(i) mod q as the hash value.



REHASHING AND COMPLEXITY

• Rehashing : The hash at the next shift must be efficiently computable (O(1))
from the current hash value and the next character in the text. [ To
rehash, we take off the most significant digit and add the
new least significant digit to the hash value. ]
hash(t[s+1 .. s+m]) = (d*(hash(t[s .. s+m-1]) - t[s]*h) + t[s+m]) mod q
• where
• hash(t[s .. s+m-1]) : Hash value at shift s.
• hash(t[s+1 .. s+m]) : Hash value at the next shift (shift s+1).
• d : Number of characters in the alphabet (the base of the m-digit number), q : A prime number and h : d^(m-1)
• Time complexity : Best-case = O(n+m), Worst-case = O(nm)
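
The rehashing formula can be checked against a from-scratch hash of every window. The sketch below assumes ord() as the digit value, d = 256 and q = 101, and reduces h modulo q; full_hash and roll are illustrative names.

```python
def full_hash(window: str, d: int = 256, q: int = 101) -> int:
    """Hash a window from scratch (O(m)) using Horner's rule."""
    value = 0
    for ch in window:
        value = (d * value + ord(ch)) % q
    return value


def roll(prev: int, outgoing: str, incoming: str, h: int,
         d: int = 256, q: int = 101) -> int:
    """O(1) update: drop the leading character, shift left, append the new one."""
    return (d * (prev - ord(outgoing) * h) + ord(incoming)) % q


text, m = "abracadabra", 4
h = pow(256, m - 1, 101)                 # d^(m-1) mod q
current = full_hash(text[:m])
for s in range(len(text) - m):
    current = roll(current, text[s], text[s + m], h)
    assert current == full_hash(text[s + 1:s + 1 + m])
print("rolling hash agrees with the direct hash at every shift")
```

Python's % operator returns a non-negative result for a positive modulus, so the subtraction needs no extra correction here; in languages where % can be negative, q would have to be added back before reducing.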
KNUTH-MORRIS-PRATT ALGORITHM



MOTIVATION OF KMP ALGORITHM

• The Knuth-Morris-Pratt algorithm compares the pattern to the text from
left to right, but shifts the pattern more intelligently than the brute-force
algorithm.

• Idea : after some number of characters (say q) of P have matched T and then a
mismatch occurs, the q matched characters allow us to determine
immediately that certain shifts are invalid. So we go directly to the next shift
that is potentially valid.
• The matched characters in T are in fact a prefix of P, so P alone
is enough to determine whether a shift is invalid or not.



MOTIVATION OF KMP ALGORITHM ...

• Question : When a mismatch occurs, what is the most we can shift
the pattern so as to avoid redundant comparisons?

• Answer : the shift is determined by the largest proper prefix of
P[1..j-1] that is also a suffix of
P[1..j-1]
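
The answer can be made concrete with a brute-force sketch. It assumes 0-indexed Python strings and that `matched` characters of the pattern agreed with the text before the mismatch; longest_border and safe_shift are illustrative names, and this version is for exposition only because it is slower than the prefix function used by KMP.

```python
def longest_border(s: str) -> int:
    """Length of the longest proper prefix of s that is also a suffix of s."""
    for k in range(len(s) - 1, 0, -1):   # try the longest candidate first
        if s[:k] == s[-k:]:
            return k
    return 0


def safe_shift(pattern: str, matched: int) -> int:
    """How far the pattern can slide after `matched` characters agreed."""
    if matched == 0:
        return 1                          # nothing matched: move one position
    return matched - longest_border(pattern[:matched])


print(longest_border("pappa"))    # 2, the border "pa"
print(safe_shift("pappar", 5))    # 3
```

Sliding by matched minus the border length keeps the border aligned with text that is already known to match, so no valid shift is skipped.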



COMPONENTS OF KMP ALGORITHM

• The prefix function, Π :
The prefix function, Π, for a pattern encapsulates knowledge about how the
pattern matches against shifts of itself. This information can be used to avoid
useless shifts of the pattern ‘p’. In other words, it lets us avoid
backtracking on the text ‘T’.

• The KMP Matcher :
With text ‘T’, pattern ‘p’ and prefix function ‘Π’ as inputs, it finds the occurrences of
‘p’ in ‘T’ and returns the number of shifts of ‘p’ after which each occurrence is found.



FUNCTIONALITY OF Π

• Naive :
Step#1:
p: pappar
t: pappappapparrassanuaragh
Step#2:
p:  pappar
t: pappappapparrassanuaragh

• Smarter technique :
• We can slide the pattern ahead so that the longest PREFIX of p that we
have already processed matches the longest SUFFIX of t that we have
already matched.
p:    pappar
t: pappappapparrassanuaragh



LPS (LONGEST PREFIX SUFFIX) CALCULATION

• Initialization

• Step#1

• Step#2



LPS CALCULATION ...

• Step#3

• Step#4



LPS CALCULATION ...

• Step#5

• Step#6



LPS CALCULATION ...

• After iterating 6 times, the prefix function computation is complete:

• The running time of the prefix function is O(m).



PSEUDOCODE FOR THE PREFIX FUNCTION, Π
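
A minimal Python sketch of the standard linear-time prefix-function computation, not a copy of the slide's pseudocode. It assumes 0-indexed strings, so pi[q] holds the length of the longest proper prefix of pattern[0..q] that is also a suffix of it. The pattern "ababaca" in the usage line is an illustrative choice.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of pattern[:q+1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                  # length of the border currently being extended
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                  # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The value k grows by at most one per iteration and every fallback shrinks it, so the total work is O(m).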



PSEUDOCODE FOR KMP MATCHER
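
A minimal Python sketch of a KMP matcher, again assuming 0-indexed strings and not reproducing the slide's pseudocode. The prefix table is rebuilt inside the function so that the block is self-contained. The strings in the usage line are assumptions, chosen only to agree with the sizes n = 15 and m = 7 used in the trace that follows.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all shifts s with text[s:s+m] == pattern, never moving back in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Prefix function of the pattern.
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    # Scan the text once; q counts how many pattern characters currently match.
    shifts = []
    q = 0
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                  # slide the pattern, keep text index i
        if pattern[q] == text[i]:
            q += 1
        if q == m:                         # full match ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q - 1]                  # keep going to find overlapping matches
    return shifts


print(kmp_search("bacbabababacaca", "ababaca"))  # [6]
```

The text index i only moves forward, which is the property that lets KMP avoid backtracking on T.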



KMP MATCHER

• Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘T’.

• For ‘p’ the prefix function, Π will be :



KMP MATCHER ...

• Initialization : n = size of T = 15; m = size of p = 7


• Step#1 : i = 1, q = 0;
comparing p[1] with T[1]

• Step#2 : i = 2, q = 0;
comparing p[1] with T[2]



KMP MATCHER ...

• Step#3 : i = 3, q = 1;
comparing p[2] with T[3]

• Step#4 : i = 4, q = 0;
comparing p[1] with T[4]



KMP MATCHER ...

• Step#5 : i = 5, q = 0;
comparing p[1] with T[5]

• Step#6 : i = 6, q = 1;
comparing p[2] with T[6]



KMP MATCHER ...

• Step#7 : i = 7, q = 2;
comparing p[3] with T[7]

• Step#8 : i = 8, q = 3;
comparing p[4] with T[8]



KMP MATCHER ...

• Step#9 : i = 9, q = 4;
comparing p[5] with T[9]

• Step#10 : i = 10, q = 5;
comparing p[6] with T[10]



KMP MATCHER ...

• Step#11 : i = 11, q = 4;
comparing p[5] with T[11]

• Step#12 : i = 12, q = 5;
comparing p[6] with T[12]



KMP MATCHER ...

• Step#13 : i = 13, q = 6;
comparing p[7] with T[13]

• Pattern ‘p’ has been found to occur completely in text ‘T’. The total
number of shifts needed to find the match is i - m = 13 - 7 = 6.
• The running time of the KMP-Matcher function is O(n).



PERFORMANCE

• Advantages :
• The running time of the KMP algorithm is optimal (O(m + n)): linear in the
combined length of the pattern and the text.
• The algorithm never needs to move backwards in the input text T. This makes
it well suited to processing very large files.

• Disadvantages :
• It doesn’t work as well as the size of the alphabet increases, because
mismatches become more likely.



COMPUTATIONAL GEOMETRY
Explore it on NEXT DAY
