Unit 3-Pattern Matching

Uploaded by

emmadisetty.cs22
Copyright
© © All Rights Reserved
String Matching Algorithms

Unit 3
String Matching Problem
Motivations: text-editing, pattern matching in DNA sequences

[Figure 32.1]

Text: array T[1...n]. Pattern: array P[1...m], with m ≤ n.

Array element: a character from a finite alphabet Σ.
Pattern P occurs with shift s in T (for 0 ≤ s ≤ n-m) if P[1...m] = T[s+1...s+m].
String Matching Algorithms
• Divide running time into preprocessing and matching time.
• Preprocessing: Set up a data structure based on the pattern P.
• Matching: Perform the actual matching by comparing characters of T with P and the precomputed data structure.

• Naive Algorithm
– Worst-case running time in O((n-m+1) m)
• Rabin-Karp
– Worst-case running time in O((n-m+1) m)
– Better than this on average and in practice
• Finite Automaton-Based
– Worst-case running time in O(n + m|Σ|)
• Knuth-Morris-Pratt
– Worst-case running time in O(n + m)
Notation & Terminology
• Σ* = set of all finite-length strings formed using
characters from alphabet Σ
• Empty string: ε
• |x| = length of string x
• w is a prefix of x: w ⊏ x, e.g. ab ⊏ abcca

• w is a suffix of x: w ⊐ x, e.g. cca ⊐ abcca

• The prefix and suffix relations are transitive


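Overlapping Suffix Lemma (Lemma 32.1)
Suppose x, y and z are strings such that x ⊐ z and y ⊐ z (x and y are both suffixes of z).
• If |x| ≤ |y|, then x ⊐ y.
• If |x| ≥ |y|, then y ⊐ x.
• If |x| = |y|, then x = y.

[Figure 32.3: graphical proof of Lemma 32.1]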
String Matching Algorithms

Naive Algorithm
Naive String Matching

Worst-case running time is in Θ((n-m+1)m).

[Figure 32.4]
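The naive matcher's pseudocode did not survive extraction. A minimal Python sketch of the idea is shown here, assuming 0-based string indexing (so a shift s means the pattern starts at text[s]); the function name naive_match is illustrative, not from the slides.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # n-m+1 candidate shifts
        if text[s:s + m] == pattern:      # up to m character comparisons
            shifts.append(s)
    return shifts


print(naive_match("gtgatcagatcact", "tca"))   # [4, 9]
```

The outer loop runs n-m+1 times and each slice comparison can cost up to m character comparisons, which is where the Θ((n-m+1)m) worst case comes from.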
String Matching Algorithms

Rabin-Karp
Rabin-Karp Algorithm
• Rabin-Karp string searching algorithm calculates a numerical (hash) value for the
pattern p, and for each m-character substring of text t.
• Then it compares the numerical values instead of comparing the actual
symbols.
• The algorithm slides the pattern along the text one position at a time and compares the pattern's hash value with the hash value of the current substring of the text.
• If the hash values match, it verifies the candidate by comparing the pattern with the substring character by character (the naive approach).
• Otherwise it shifts to the next substring of t to compare with p.
• The use of hashing converts the string to a numeric value which speeds up the
process of matching.
• The algorithm exploits the fact that if two strings are equal then their hash values
are also equal.
• Thus, the string matching is reduced to computing the hash value of the search
pattern and then looking for substring with that hash value.
Rabin-Karp (1987)
• Consider (sub)strings as numbers. Characters in a string correspond to digits in a
number written in radix-d notation (where d = |Σ|).
Rabin-Karp (1987)
Compute the remaining $t_s$ values in O(n-m) total time:
$t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$

Check out: "fedc"


Rabin-Karp
• Assume each character is a digit in radix-d notation (e.g. d = 10)
• p = decimal value of the pattern
• $t_s$ = decimal value of the substring T[s+1..s+m], for s = 0, 1, ..., n-m

Compute the remaining $t_s$ values in O(n-m) total time:

$t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
We can encode the position of each character by multiplying it by a constant raised to the power that corresponds to its position (e.g. 10^(n-1) for the first of n digits).
Now H(1234) != H(4321), and the same holds for the other permutations.
Rabin-Karp

If the pattern were 1000 characters long, we would need to multiply by 10^999, which is a huge number (integer overflow).
Therefore, reduce the value modulo a prime number (e.g. 113; the hash value will then always be less than 113).
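As a small illustration of the two points above, the sketch below compares a position-blind digit sum with a positional hash reduced modulo 113. The function names and the digit-string input are assumptions made for this example. Note that reducing modulo q keeps values small but no longer guarantees that different strings get different hash values.

```python
def digit_sum_hash(s: str) -> int:
    """Position-blind hash: adds the digits, so permutations collide."""
    return sum(int(ch) for ch in s)


def positional_hash(s: str, d: int = 10, q: int = 113) -> int:
    """Radix-d value of the digit string s, reduced modulo q."""
    h = 0
    for ch in s:
        h = (h * d + int(ch)) % q   # reduce at every step so h stays below q
    return h


print(digit_sum_hash("1234"), digit_sum_hash("4321"))      # 10 10
print(positional_hash("1234"), positional_hash("4321"))    # 104 27
```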
Example

• Example (1):
• Input: T = gtgatcagatcact, P = tca
• Output: Yes. P occurs in T with shifts s = 4 and s = 9.

• Example (2):
• Input: T = 189342670893, P = 1673
• Output: No. P does not occur in T.
Rabin-Karp Algorithm
• Strategy:
– compute p in O(m) time (which is in O(n))
– compute all $t_s$ values in a total of O(n) time
– find all valid shifts s in O(n) time by comparing p with each $t_s$
Rabin-Karp scheme
• Consider each character as a digit in a radix system, e.g. the English alphabet in radix 26.
• Read off each m-length "number" for shifts 0 through n-m.
• So, T = gtgatcagatcact, P = tga
in radix 4 (a/0, t/1, g/2, c/3) becomes
□ gtg = '212' in base 4 = 2*4^2 + 1*4^1 + 2*4^0 = 32 + 4 + 2 = 38
□ tga = '120' in base 4 = 1*4^2 + 2*4^1 + 0*4^0 = 16 + 8 + 0 = 24
• Then compare with P numerically.
• (1) Preprocess: compute (n-m+1) numbers on T and 1 for P,
• (2) compare the number for P with those computed on T.
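The radix-4 example can be checked with a short sketch. The digit assignment is the one given on the slide (a/0, t/1, g/2, c/3); the helper names radix_value and window_values are hypothetical. No modulus is used here, so equal numbers imply equal substrings.

```python
# Digit assignment taken from the slide: a/0, t/1, g/2, c/3
DIGIT = {"a": 0, "t": 1, "g": 2, "c": 3}


def radix_value(s: str, d: int = 4) -> int:
    """Interpret s as a number written in radix d."""
    v = 0
    for ch in s:
        v = v * d + DIGIT[ch]
    return v


def window_values(text: str, m: int, d: int = 4) -> list[int]:
    """Value of every m-character window, for shifts 0 through n-m."""
    return [radix_value(text[s:s + m], d) for s in range(len(text) - m + 1)]


T, P = "gtgatcagatcact", "tga"
values = window_values(T, len(P))
p = radix_value(P)
print(values[:2], p)                                   # [38, 24] 24
print([s for s, t in enumerate(values) if t == p])     # [1]
```

This version recomputes every window from scratch, which costs O(m) per window; the recurrence for the window values removes that cost.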
Rabin-Karp scheme
• Problem: each number (p and $t_s$) may be too large to compare in constant time.
• Solution: hash the values using modular arithmetic with respect to a prime q.

• Example: 31415 mod 13 = 7
• New recurrence formula:
• $t_{s+1} = (d\,(t_s - h\,T[s+1]) + T[s+m+1]) \bmod q$,
• where $h = d^{m-1} \bmod q$.
• q is chosen to be a prime, typically as large as possible while d·q still fits in one machine word, which keeps spurious hits rare.
• The comparison is not perfect and may produce a spurious hit (see next slide).
• So a naive character-by-character comparison is still needed whenever the values agree modulo q.
Rabin-Karp Algorithm (continued)
$t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$

The comparison is not perfect and may produce a spurious hit (see example below).
So a naive character-by-character comparison is still needed whenever the values agree modulo q.

[Figure: Rabin-Karp example with p = 31415, marking a spurious hit]
Rabin-Karp Algorithm (continued)

source: 91.503 textbook Cormen et al.


Rabin-Karp Algorithm
• Compute p in O(m) time using Horner’s rule:
– p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + dP[1])))
• Compute $t_0$ similarly from T[1..m] in O(m) time
• Compute the remaining $t_s$ values in O(n-m) total time:
– $t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$

• Advantage: each window's value can be computed from the previous one instead of from scratch.

• Consider the decimal text 43592...: the window 4359 is followed by the window 3592.
• 3592 = (4359 - 4*1000)*10 + 2
= (359)*10 + 2 = 3590 + 2
= 3592
• General formula: $t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$ in radix d, where $t_s$ is the number corresponding to the substring T[s+1..s+m]. Note that m is the length of P.
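The decimal example can be reproduced with a few lines of Python. This is only a sketch of Horner's rule and the rolling update without a modulus; the names horner and roll are made up for the illustration.

```python
def horner(digits: list[int], d: int) -> int:
    """Value of a digit sequence in radix d, using Horner's rule."""
    value = 0
    for x in digits:
        value = d * value + x
    return value


def roll(t_s: int, outgoing: int, incoming: int, d: int, m: int) -> int:
    """t_{s+1} = d * (t_s - d^(m-1) * outgoing) + incoming."""
    return d * (t_s - d ** (m - 1) * outgoing) + incoming


digits = [4, 3, 5, 9, 2]
m, d = 4, 10
t0 = horner(digits[:m], d)                   # first window: 4359
t1 = roll(t0, digits[0], digits[m], d, m)    # drop the 4, append the 2
print(t0, t1)                                # 4359 3592
```

Each call to roll does a constant number of arithmetic operations, independent of m, provided d^(m-1) is computed once in advance.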
Rabin-Karp Algorithm (continued)
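Annotations on the matcher pseudocode (figure not reproduced):
• d is the radix; q is the modulus
• $h = d^{m-1} \bmod q$ is the high-order digit position for an m-digit window
• Preprocessing: compute p and $t_0$
• Matching loop invariant: whenever line 10 is executed, $t_s = T[s+1..s+m] \bmod q$
• The explicit comparison of P with T[s+1..s+m] rules out spurious hits
• Worst-case running time is in Θ((n-m+1)m); average-case running time is in O(n+m)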
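Since the pseudocode itself is missing from this extraction, here is a hedged Python sketch of a Rabin-Karp matcher along the lines described above. The defaults d = 256 and q = 101 are assumptions chosen for the example (characters are mapped to digits with ord), and shifts are reported 0-based.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all 0-based shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # weight of the high-order digit, mod q
    p = t = 0
    for i in range(m):            # preprocessing: Horner's rule mod q
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Invariant: t is the value of text[s:s+m] reduced modulo q.
        if p == t and text[s:s + m] == pattern:   # verify to rule out spurious hits
            shifts.append(s)
        if s < n - m:             # roll the window one position to the right
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("gtgatcagatcact", "tca"))   # [4, 9]
print(rabin_karp("189342670893", "1673"))    # []
```

Python's % operator always returns a non-negative result for a positive modulus, so no extra correction is needed after the subtraction; in languages where % can be negative, q would have to be added back.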


Exercise: Find the number of spurious hits that occur when the Rabin-Karp matcher searches for the pattern below in the given text, working modulo q = 11.
TEXT: 31415926535
PATTERN: 26
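One way to check an answer to this exercise is to classify every window whose value agrees with the pattern modulo q. The sketch below assumes decimal digit strings and, for simplicity, recomputes each window's value directly instead of rolling it; count_hits is a hypothetical helper name.

```python
def count_hits(text: str, pattern: str, q: int) -> tuple[list[int], list[int]]:
    """Return (valid shifts, spurious shifts), both 0-based, for decimal strings."""
    m = len(pattern)
    p = int(pattern) % q
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:                 # hash values agree
            if window == pattern:
                valid.append(s)                  # genuine match
            else:
                spurious.append(s)               # same residue, different string
    return valid, spurious


valid, spurious = count_hits("31415926535", "26", 11)
print(valid, spurious, len(spurious))    # [6] [3, 4, 5] 3
```

Here 26 mod 11 = 4, and the windows 15, 59 and 92 also leave remainder 4, so under these assumptions there are three spurious hits and one valid match.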
String Matching Algorithms

Finite Automata
Finite Automata

[Figure 32.6]

Strategy: Build an automaton for the pattern, then examine each text character exactly once.

Worst-case running time is in Θ(n) plus the automaton creation time.


Finite Automata
String-Matching Automaton
Pattern P = ababaca

The automaton accepts exactly the strings that end in P.

[Figure 32.7: string-matching automaton for P = ababaca]


String-Matching Automaton
Simulate behavior of string-matching automaton that finds
occurrences of pattern P of length m in T[1..n]

assuming automaton has already been created...

worst-case running time of matching is in Θ(n)
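A minimal sketch of the matching loop, assuming the transition function is already available as a dictionary keyed by (state, character). The table DELTA_AB below is hand-built for the tiny pattern "ab" over the alphabet {a, b}, purely for illustration.

```python
def automaton_match(text: str, delta: dict, m: int) -> list[int]:
    """Run the automaton over text; state m means the pattern was just read."""
    q = 0
    shifts = []
    for i, ch in enumerate(text):
        q = delta[(q, ch)]             # one table lookup per text character
        if q == m:
            shifts.append(i - m + 1)   # 0-based shift of this occurrence
    return shifts


# Hand-built transition table for P = "ab" over the alphabet {a, b}
DELTA_AB = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

print(automaton_match("abaabab", DELTA_AB, 2))   # [0, 3, 5]
```

Each text character triggers exactly one lookup, which matches the Θ(n) matching time. A character outside the automaton's alphabet raises KeyError in this sketch.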

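String-Matching Automaton
Suffix function for P:
σ(x) = length of the longest prefix of P that is a suffix of x

Transition function (32.3): $\delta(q, a) = \sigma(P_q a)$, where $P_q$ denotes the prefix P[1..q]

Automaton's operational invariant (32.4): $\phi(T_i) = \sigma(T_i)$, where $\phi(T_i)$ is the state reached after reading $T_i = T[1..i]$

At each step, the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far.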
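The suffix function can be evaluated directly from its definition. The sketch below is a brute-force illustration, not an efficient implementation; the function name sigma and the argument order are choices made for this example.

```python
def sigma(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), 0, -1):   # try longest first
        if x.endswith(pattern[:k]):
            return k
    return 0        # only the empty prefix is a suffix of x


print(sigma("ab", ""), sigma("ab", "ccaca"), sigma("ab", "ccab"))   # 0 1 2
print(sigma("ababaca", "ababab"))                                    # 4
```

The last call reproduces one automaton transition for P = ababaca: after reading ababa (state 5), the character b leads to state 4, because abab is the longest prefix of P that is a suffix of ababab.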
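String-Matching Automaton (continued)
Correctness of matching procedure...

• Lemma 32.2 (suffix-function inequality, Figure 32.8): for any string x and character a, $\sigma(xa) \le \sigma(x) + 1$.
• Lemma 32.3 (suffix-function recursion, Figure 32.9): for any string x and character a, if $q = \sigma(x)$, then $\sigma(xa) = \sigma(P_q a)$. The proof uses Lemmas 32.1 and 32.2.
• Theorem 32.4: if $\phi$ is the final-state function of the string-matching automaton for P and T[1..n] is the input text, then $\phi(T_i) = \sigma(T_i)$ for i = 0, 1, ..., n. The proof is by induction on i, using Lemma 32.3 and equation (32.3).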


String-Matching Automaton (continued)

Worst-case running time of automaton creation is in $O(m^2 |\Sigma|)$; this can be improved to $O(m |\Sigma|)$.

Worst-case running time of the entire string-matching strategy is in $O(m |\Sigma|) + O(n)$, where $O(m |\Sigma|)$ is the automaton creation time and $O(n)$ is the pattern matching time.
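The construction pseudocode is not reproduced in this extraction. The sketch below builds the transition table straight from the definition δ(q, a) = σ(P_q a); it is deliberately simple and does more work per entry than the improved bound (roughly cubic in m overall), so treat it as an illustration and not as the optimized method. The function name and the alphabet-as-string parameter are assumptions.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """delta[(q, a)] = longest prefix of pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)                     # cannot advance by more than one
            while k > 0 and not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


P = "ababaca"
delta = compute_transition_function(P, "abc")
print(delta[(5, "c")], delta[(5, "b")], delta[(5, "a")], delta[(7, "b")])   # 6 4 1 2

# Run the automaton over a sample text and report 0-based shifts.
state, found = 0, []
for i, ch in enumerate("abababacaba"):
    state = delta[(state, ch)]
    if state == len(P):
        found.append(i - len(P) + 1)
print(found)   # [2]
```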


The Knuth-Morris-Pratt algorithm
Time complexity: O(m + n)
O(m): time taken to construct the π table, where m is the size of the pattern
O(n): time taken to scan the text, where n is the size of the text
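The slides give only the running time for Knuth-Morris-Pratt. As a hedged sketch of how the π table and the scan fit together, here is a standard 0-based Python formulation; the function names are illustrative, and pi[q] stores the length of the longest proper prefix of pattern[:q+1] that is also a suffix of it.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Build the pi table in O(m) time."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                 # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_match(text: str, pattern: str) -> list[int]:
    """Return all 0-based shifts of pattern in text in O(n) time after preprocessing."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts, q = [], 0                     # q = number of pattern characters matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]
        if pattern[q] == ch:
            q += 1
        if q == m:                        # full match ending at position i
            shifts.append(i - m + 1)
            q = pi[q - 1]                 # keep going to find overlapping matches
    return shifts


print(compute_prefix_function("ababaca"))     # [0, 0, 1, 2, 3, 0, 1]
print(kmp_match("gtgatcagatcact", "tca"))     # [4, 9]
```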
