Pattern Matching

The document discusses different algorithms for pattern matching in strings:
1. The brute-force algorithm compares the pattern to every possible substring of the text and has a running time of O(nm), where n and m are the sizes of the text and pattern.
2. The Boyer-Moore algorithm improves on this by using heuristics like skipping ahead in the text based on character matches. It precomputes a last-occurrence function to determine how far to shift the pattern.
3. The algorithm first builds a last-occurrence function mapping each character to its last index in the pattern. It then searches for matches, shifting the pattern based on character matches or mismatches.

Uploaded by

Savi Ojha
Copyright
© Attribution Non-Commercial (BY-NC)

Pattern Matching 5/29/2002 11:27 AM

Outline and Reading

Strings (§9.1.1)
Pattern matching algorithms
  - Brute-force algorithm (§9.1.2)
  - Boyer-Moore algorithm (§9.1.3)
  - Knuth-Morris-Pratt algorithm (§9.1.4)

[Title slide figure: the pattern a b a c a b aligned under a text beginning a b a c a a b, with comparisons numbered 1 to 4]


Strings

A string is a sequence of characters
Examples of strings:
  - Java program
  - HTML document
  - DNA sequence
  - Digitized image
An alphabet Σ is the set of possible characters for a family of strings
Example of alphabets:
  - ASCII
  - Unicode
  - {0, 1}
  - {A, C, G, T}
Let P be a string of size m
  - A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j
  - A prefix of P is a substring of the type P[0 .. i]
  - A suffix of P is a substring of the type P[i .. m − 1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
Applications:
  - Text editors
  - Search engines
  - Biological research

Brute-Force Algorithm

The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  - a match is found, or
  - all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm)
Example of worst case:
  - T = aaa … ah
  - P = aaah
  - may occur in images and DNA sequences
  - unlikely in English text

Algorithm BruteForceMatch(T, P)
    Input text T of size n and pattern P of size m
    Output starting index of a substring of T equal to P or −1 if no such substring exists
    for i ← 0 to n − m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i { match at i }
        else
            break while loop { mismatch }
    return −1 { no match anywhere }
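The BruteForceMatch pseudocode can be sketched in Python roughly as follows. This is a minimal illustration rather than the slides' own code; the function name `brute_force_match` and the use of Python strings for T and P are assumptions made here.

```python
def brute_force_match(text, pattern):
    """Return the lowest index at which pattern occurs in text, or -1.

    Hypothetical helper mirroring BruteForceMatch(T, P) from the slide.
    """
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # test every shift i = 0 .. n - m
        j = 0
        # Extend the match while characters agree.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            return i
        # Otherwise a mismatch occurred; try the next shift.
    return -1                           # no match anywhere


print(brute_force_match("abacaabaccabacabaabb", "abacab"))
```

On the worst-case input from the slide (T = aaa … ah, P = aaah), the inner loop runs almost to the end of the pattern at every shift, which is where the O(nm) bound comes from.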

Boyer-Moore Heuristics

The Boyer-Moore pattern matching algorithm is based on two heuristics
Looking-glass heuristic: Compare P with a subsequence of T moving backwards
Character-jump heuristic: When a mismatch occurs at T[i] = c
  - If P contains c, shift P to align the last occurrence of c in P with T[i]
  - Else, shift P to align P[0] with T[i + 1]
Example
    a p a t t e r n m a t c h i n g a l g o r i t h m
[Figure: the pattern r i t h m is shifted along the text above; comparisons are numbered 1 to 11. Comparisons 1 to 6 are mismatches, each followed by a shift, and comparisons 7 to 11 match the pattern right to left in its final position.]

Last-Occurrence Function

Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
  - the largest index i such that P[i] = c or
  - −1 if no such index exists
Example:
  - Σ = {a, b, c, d}
  - P = abacab

    c      a   b   c   d
    L(c)   4   5   3   −1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ
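As a sketch of how the last-occurrence function might be built in O(m + s) time, the snippet below first initialises every character of the alphabet to −1 and then scans the pattern once. The name `last_occurrence` and the use of a dictionary (instead of the array indexed by character codes mentioned on the slide) are assumptions made for illustration.

```python
def last_occurrence(pattern, alphabet):
    """Map each character c of alphabet to the largest i with pattern[i] == c, or -1.

    Hypothetical helper; a dict stands in for the slide's array
    indexed by numeric character codes.
    """
    last = {c: -1 for c in alphabet}    # O(s): default for absent characters
    for i, c in enumerate(pattern):     # O(m): later positions overwrite earlier ones
        last[c] = i
    return last


print(last_occurrence("abacab", "abcd"))
```

For Σ = {a, b, c, d} and P = abacab this reproduces the table on the slide: a maps to 4, b to 5, c to 3 and d to −1.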


The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurenceFunction(P, Σ)
    i ← m − 1
    j ← m − 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i { match at i }
            else
                i ← i − 1
                j ← j − 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m − min(j, 1 + l)
            j ← m − 1
    until i > n − 1
    return −1 { no match }

Case 1: j ≤ 1 + l
[Figure: the last occurrence of T[i] = a in P lies at or to the right of position j; i advances by m − j]
Case 2: 1 + l ≤ j
[Figure: the last occurrence of T[i] = a in P lies to the left of position j; i advances by m − (1 + l)]

Example

    a b a c a a b a d c a b a c a b a a b b
[Figure: the pattern a b a c a b is shifted along the text above; comparisons are numbered 1 to 13, with comparisons 8 to 13 matching the pattern right to left in its final position.]
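The BoyerMooreMatch pseudocode can be sketched in Python as below. This is an illustrative version under a few assumptions: the name `boyer_moore_match` is made up here, the last-occurrence table is a dictionary built only from the pattern (characters absent from it default to −1), and the repeat/until loop is written as a `while` loop so that patterns longer than the text are handled without an index error.

```python
def boyer_moore_match(text, pattern):
    """Return the index of a match of pattern in text, or -1 (character-jump heuristic only)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                        # assumption: empty pattern matches at 0
    # Last-occurrence table: later positions overwrite earlier ones.
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1                       # start at the right end of the pattern
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                # match at i
            i -= 1                      # looking-glass: move backwards
            j -= 1
        else:
            l = last.get(text[i], -1)   # character-jump
            i += m - min(j, 1 + l)
            j = m - 1
    return -1                           # no match


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))
```

Taking `min(j, 1 + l)` covers both cases on the slide: when the last occurrence of the mismatched character is to the right of j, the pattern still moves forward by one position instead of sliding backwards.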

Analysis

Boyer-Moore's algorithm runs in time O(nm + s)
Example of worst case:
  - T = aaa … a
  - P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text
[Figure: T = a a a a a a a a a and P = b a a a a a; comparisons are numbered 1 to 24, six per alignment, each alignment compared right to left]

The KMP Algorithm - Motivation

Knuth-Morris-Pratt's algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
[Figure: the pattern a b a a b a is aligned under the text . . a b a a b x . . . . . and mismatches at position j; after the shift, the first two pattern characters are marked "No need to repeat these comparisons" and the next position is marked "Resume comparing here"]
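The claims on the Analysis slide can be checked by counting character comparisons. The sketch below instruments a brute-force scan and a Boyer-Moore scan (character-jump heuristic only) with a counter; the function names are hypothetical, a non-empty pattern is assumed, and the two test inputs are the ones shown in the slide figures.

```python
def brute_force_count(text, pattern):
    """Return (match index or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


def boyer_moore_count(text, pattern):
    """Return (match index or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    last = {c: k for k, c in enumerate(pattern)}   # last-occurrence table
    comparisons = 0
    i = j = m - 1
    while 0 <= i <= n - 1:
        comparisons += 1
        if text[i] == pattern[j]:
            if j == 0:
                return i, comparisons
            i -= 1
            j -= 1
        else:
            i += m - min(j, 1 + last.get(text[i], -1))
            j = m - 1
    return -1, comparisons


english = "a pattern matching algorithm"
print(brute_force_count(english, "rithm"), boyer_moore_count(english, "rithm"))
print(brute_force_count("aaaaaaaaa", "baaaaa"), boyer_moore_count("aaaaaaaaa", "baaaaa"))
```

On the English example Boyer-Moore needs 11 comparisons against 29 for brute force, while on T = aaaaaaaaa, P = baaaaa it performs all 24 comparisons shown in the figure, since every alignment is scanned in full before the mismatch on b.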

KMP Failure Function

Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1)

    j      0   1   2   3   4   5
    P[j]   a   b   a   a   b   a
    F(j)   0   0   1   1   2   3

[Figure: the pattern a b a a b a is aligned under the text . . a b a a b x . . . . . and mismatches at position j; the pattern is then shifted so that its prefix of length F(j − 1) stays aligned with the text]

The KMP Algorithm

The failure function can be represented by an array and can be computed in O(m) time
At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop
Thus, KMP's algorithm runs in optimal time O(m + n)

Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m − 1
                return i − j { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j − 1]
            else
                i ← i + 1
    return −1 { no match }
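A possible Python rendering of KMPMatch is sketched below. To keep the block self-contained and to make the definition of F(j) concrete, the failure function here is computed directly from its definition, which is deliberately naive and not O(m); both function names are assumptions introduced for this example.

```python
def failure_by_definition(pattern):
    """F[j] = size of the largest prefix of pattern[0..j] that is also a suffix of pattern[1..j].

    Naive, definition-based helper for illustration only (not linear time).
    """
    m = len(pattern)
    F = [0] * m
    for j in range(m):
        for k in range(j, 0, -1):                   # candidate sizes j, j-1, ..., 1
            if pattern[:k] == pattern[j - k + 1:j + 1]:
                F[j] = k
                break
    return F


def kmp_match(text, pattern):
    """Return the index of a match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                                    # assumption: empty pattern matches at 0
    F = failure_by_definition(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j                        # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                            # shift pattern, keep i in place
        else:
            i += 1
    return -1                                       # no match


print(failure_by_definition("abaaba"))
print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

Note that i never decreases: after a mismatch only j is reset through F, which is what bounds the loop at 2n iterations.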


Computing the Failure Function

The failure function can be represented by an array and can be computed in O(m) time
The construction is similar to the KMP algorithm itself
At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j − 1]
        else
            F[i] ← 0 { no match }
            i ← i + 1

Example

    a b a c a a b a c c a b a c a b a a b b
[Figure: the pattern a b a c a b is shifted along the text above by the KMP algorithm; comparisons are numbered 1 to 19, with comparisons 14 to 19 matching the pattern left to right in its final position.]

    j      0   1   2   3   4   5
    P[j]   a   b   a   c   a   b
    F(j)   0   0   1   0   1   2
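The failureFunction pseudocode translates almost line for line into Python. The sketch below is one such translation; the name `failure_function` and the list representation of F are assumptions, and the empty-pattern case (returning an empty list) is a choice made here that the slides do not address.

```python
def failure_function(pattern):
    """Compute the KMP failure function of pattern in O(m) time."""
    m = len(pattern)
    F = [0] * m                 # F[0] = 0 by definition
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            F[i] = j + 1        # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # use the failure function to shift the pattern
        else:
            F[i] = 0            # no match
            i += 1
    return F


print(failure_function("abacab"))
```

For P = abacab this gives the F(j) row of the table on the Example slide, and for P = abaaba the row on the KMP Failure Function slide.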
