String Matching

The document discusses string matching algorithms, focusing on the Knuth-Morris-Pratt (KMP) algorithm and Rabin-Karp method. KMP uses a pre-computed Partial Match Table to efficiently search for patterns within a string, while Rabin-Karp employs hashing for quick filtering of non-matches. Both algorithms have distinct complexities and applications, including plagiarism detection.

Uploaded by

the.nexus.9870

String Matching Algorithms

Purpose of String Matching


• The most basic case of string searching involves one (often very long) string, sometimes called the haystack, and one (often very short) string, sometimes called the needle. The goal is to find one or more occurrences of the needle within the haystack.

• The C library function strstr does this

– char *strstr(const char *haystack, const char *needle);
Worst-case Complexity:
KNUTH-MORRIS-PRATT ALGORITHM
Overview of KMP
• Searches a pattern of length m in a string of length n

• Worst-case Complexity: O(n + m)

• Needs a pre-computed Partial Match Table


What is done
• At any given time, the algorithm is in a state determined by two integers:
– i: position within S, for next prospective match
– j: index of current character in W
• Each step of algorithm compares
– W[j] with S[i + j]
– increments j if equal
– Uses T to re-evaluate i (and sometimes j)
Pseudocode
algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched, haystack)
        an array of characters, W (the word sought, i.e. needle)
    output:
        an array of integers, P (positions in S at which W is found)

    define variables:
        an integer, i ← 0 (the position of the current character in S where W is aligned)
        an integer, j ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    let nP ← 0

    while i + j < length(S) do
        if W[j] == S[i + j] then
            let j ← j + 1
            if j == length(W) then (occurrence found)
                let P[nP] ← i, nP ← nP + 1
                let i ← i + j - T[j]
                let j ← T[j]
        else
            let i ← i + j - T[j]
            let j ← T[j]
            if j < 0 then
                let j ← j + 1
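
The search loop can be sketched in Python as follows. This is a minimal sketch rather than a definitive implementation: the name kmp_search mirrors the pseudocode, the table T is passed in precomputed (written out by hand here for W = "ABAB"), W is assumed to be non-empty, and the handling after a full match (record i, then shift using the last table entry) is one common way to keep searching for further occurrences.

```python
def kmp_search(S, W, T):
    """Return every index in S at which W occurs (W assumed non-empty).

    T is the precomputed table with len(W) + 1 entries and T[0] == -1.
    """
    P = []   # positions in S at which W is found
    i = 0    # position in S where W is currently aligned
    j = 0    # position of the current character in W
    while i + j < len(S):
        if W[j] == S[i + j]:
            j += 1
            if j == len(W):
                # Occurrence found: record it, then shift using the last
                # table entry so that overlapping matches are not missed.
                P.append(i)
                i = i + j - T[j]
                j = T[j]
        else:
            # Mismatch: shift the alignment; i + j stays put unless T[j] is -1.
            i = i + j - T[j]
            j = T[j]
            if j < 0:
                j += 1
    return P


# Table for W = "ABAB", written out by hand for this example.
T_ABAB = [-1, 0, -1, 0, 2]
print(kmp_search("ABABABCABAB", "ABAB", T_ABAB))  # [0, 2, 7]
```

The first two matches overlap, which is why the extra table entry T[length(W)] is needed when all occurrences are wanted.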
A detour before analysis
Assume that there are two persons X and Y such that either they are "at the same position" or X is at most m positions behind Y. Initially, both X and Y are at position = 0. Processing ends if either X falls behind Y by m OR Y has reached position = n.

In one stage, one of the two things happens:

Y goes one step forward

X goes k (0 < k < m) steps forward. However, if that way X catches up with Y, both move one step forward

Question: Total MAXIMUM how many stages are required?

Answer: 2 * n
Analysis
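Facts about T (to be proved later)
• T[j] < j
• T[0] == -1

Part A (characters match): j is incremented, so i + j moves one step forward.

Part B (characters differ): i + j does NOT change. But j is decremented. Hence, this part (B) cannot be executed more times than j is incremented in part A before i + j increments again in part A. (When T[j] == -1, i + j moves one step forward instead.)

Hence, together the two parts are executed at most 2 * length(S) times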


The Failure Table T
• The idea is to find at T[k], the (last index of)
longest prefix of W that is a suffix of W[0..k]
– T[k] = -1 if no such prefix found
– Usually, T has length 1 more than length(W)

If S[i + 18] is not W[18] = 'A', backtrack to match S[i + 18] with W[3]
Failure Table Building Pseudocode
algorithm kmp_table:
    input:
        an array of characters, W (the word to be analyzed)
    output:
        an array of integers, T (the table to be filled)

    define variables:
        an integer, pos ← 1 (the current position we are computing in T)
        an integer, cnd ← 0 (the zero-based index in W of the next character
                             of the current candidate substring)

    let T[0] ← -1

    while pos < length(W) do
        if W[pos] = W[cnd] then
            let T[pos] ← T[cnd]
        else
            let T[pos] ← cnd
            while cnd ≥ 0 and W[pos] ≠ W[cnd] do
                let cnd ← T[cnd]
        let pos ← pos + 1, cnd ← cnd + 1

    let T[pos] ← cnd (only needed when all word occurrences are searched)
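
The table-building pseudocode translates almost line for line into Python. The sketch below assumes W is non-empty and always fills the extra final entry; the function name kmp_table is taken from the pseudocode, and the test words are only illustrative.

```python
def kmp_table(W):
    """Build the table T for a non-empty word W (len(W) + 1 entries)."""
    T = [0] * (len(W) + 1)
    T[0] = -1
    pos = 1   # current position being computed in T
    cnd = 0   # index in W of the next character of the current candidate
    while pos < len(W):
        if W[pos] == W[cnd]:
            T[pos] = T[cnd]
        else:
            T[pos] = cnd
            while cnd >= 0 and W[pos] != W[cnd]:
                cnd = T[cnd]
        pos += 1
        cnd += 1
    T[pos] = cnd  # extra entry, used when all occurrences are searched
    return T


print(kmp_table("ABCDABD"))                       # [-1, 0, 0, 0, -1, 0, 2, 0]
print(kmp_table("PARTICIPATE IN PARACHUTE")[18])  # 3
```

The second call reproduces the backtracking example: a mismatch at W[18] resumes the comparison at W[3].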


Rabin-Karp’s Method
• Uses hashing to find an exact match
• Rolling hash for quickly filtering non-matches
• Then checks for full match
• Expected complexity O(n + m)
• However, worst-case complexity is O(nm)
• A practical application is detecting plagiarism
Rabin-Karp Pseudocode
function RabinKarp(
        string s[1..n], string pattern[1..m])

    hpattern := hash(pattern[1..m]);

    for i from 1 to n-m+1
        hs := hash(s[i..i+m-1])
        if hs = hpattern
            if s[i..i+m-1] = pattern[1..m]
                return i
    return not found
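
A minimal Python sketch of the same idea follows. It differs from the pseudocode in a few labeled ways: indices are 0-based, -1 stands in for "not found", the window hash is updated incrementally instead of being recomputed, and the defaults base = 256 and mod = 101 are illustrative choices, not requirements. Returning -1 for an empty pattern is also an assumption of this sketch.

```python
def rabin_karp(s, pattern, base=256, mod=101):
    """Return the 0-based index of the first match, or -1 if there is none."""
    n, m = len(s), len(pattern)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)   # weight of the window's leading character

    hpattern = hs = 0
    for k in range(m):             # hash the pattern and the first window
        hpattern = (hpattern * base + ord(pattern[k])) % mod
        hs = (hs * base + ord(s[k])) % mod

    for i in range(n - m + 1):
        # Equal hashes only filter candidates; compare to rule out collisions.
        if hs == hpattern and s[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll: drop s[i], shift by one base position, append s[i + m].
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1


print(rabin_karp("abracadabra", "bra"))  # 1
```

With a modulus as small as 101, collisions are frequent, so the explicit substring comparison does real work here; a larger prime reduces the number of false candidates.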
Some Issues
• Key to performance is efficient hash value
computation
– of the successive substrings of the text
• Rabin fingerprint is popular and effective rolling hash
function
– treats every substring as a number in some base
– the base being usually the size of the character set
– Somewhat like number system
• Example: for substring "hi" (base = 256) prime modulus = 101, hash value is:

• [(104 × 256) % 101 + 105] % 101 = 65


– (ASCII of 'h' is 104 and of 'i' is 105)
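
The hash above can be computed for a substring of any length by processing one character at a time (Horner's rule). In this sketch the name poly_hash is hypothetical, and base 256 with modulus 101 are the values used in the example.

```python
def poly_hash(text, base=256, mod=101):
    """Hash text as a number written in the given base, reduced modulo mod."""
    h = 0
    for ch in text:
        # Shift the digits seen so far by one position, then add the new one.
        h = (h * base + ord(ch)) % mod
    return h


print(poly_hash("hi"))  # 65
```

Reducing modulo the prime after every step keeps intermediate values small, which matters in languages with fixed-width integers.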
Rolling Hash in Action
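• Example: Text = "abracadabra", Pattern-length = 3
– hash of first substring, "abr" (Base: 256, Prime-modulus: 101)
– hash("abr") = [([(97 × 256) % 101 + 98] % 101) × 256 + 114] % 101 = 4
– // ASCII a = 97, b = 98, r = 114

• Compute the hash of next substring "bra" from the hash of "abr"
1. subtract number added for the first 'a' of "abr", i.e. 97 × 256^2
2. multiply by base
3. add the last 'a' of "bra", i.e. 97 × 256^0

• hash("bra") = [(4 + 101 - (97 × 256^2) % 101) × 256 + 97] % 101 = 30
– (101 is added so that the subtraction never goes negative)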
