
Pattern Matching 2

This document discusses different algorithms for pattern matching in strings, including the brute force algorithm, Knuth-Morris-Pratt (KMP) algorithm, and Boyer-Moore algorithm. It provides details on how the brute force algorithm works by checking each position in the text for a match. It then describes the KMP algorithm, which is more efficient by intelligently shifting the pattern on a mismatch instead of rechecking from the beginning. Key aspects of KMP include precomputing a border function to determine the optimal shift amount. Pseudocode is provided for KMP pattern matching and computing the border function.

Uploaded by

adi s. nugroho
Copyright
© Attribution Non-Commercial (BY-NC)

240-301, Computer Engineering Lab III (Software)

Semester 1, 2006-2007

Pattern Matching
Dr. Andrew Davison
WiG Lab (teachers room), CoE

[Figure: the pattern "abacab" drawn under the start of a text, with numbered character comparisons]

240-301 Comp. Eng. Lab III (Software), Pattern Matching 1


Overview

1. What is Pattern Matching?
2. The Brute Force Algorithm
3. The Knuth-Morris-Pratt Algorithm
4. The Boyer-Moore Algorithm
5. More Information


1. What is Pattern Matching?
• Definition:
– given a text string T and a pattern string P, find the pattern inside the text
• T: "the rain in spain stays mainly on the plain"
• P: "n th"

• Applications:
– text editors, Web search engines (e.g. Google), image analysis
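
As a quick illustration of the definition, the slide's T and P can be searched with Java's built-in String.indexOf(). This is only a sketch; the class name FindPattern and the helper find() are made up for this example.

```java
final class FindPattern {
    // Returns the 0-based index where pattern first starts in text, or -1.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        String text = "the rain in spain stays mainly on the plain";
        String pattern = "n th";
        // "n th" first occurs where "on the" begins its 'n'
        System.out.println("Pattern starts at posn " + find(text, pattern)); // 32
    }
}
```

The algorithms on the following slides do the same job as indexOf(), but make the search strategy explicit.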


String Concepts
• Assume S is a string of size m.
S = x1 x2 … xm

• A prefix of S is a substring S[1 .. k-1]
• A suffix of S is a substring S[k+1 .. m]
– k is any index between 1 and m
– an empty range, such as S[1 .. 0] or S[m+1 .. m], is the empty string ""


Examples
S: a n d r e w   (character positions labelled 0 to 5 in the figure)
• All possible prefixes of S:
– "", "a", "an", "and", "andr", "andre"

• All possible suffixes of S:
– "", "w", "ew", "rew", "drew", "ndrew"
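
The two lists above can be generated mechanically. The sketch below uses Java's 0-based substring(); the class and method names are invented for illustration. As on the slide, the whole string itself is left out of both lists.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixSuffix {
    // Prefixes of s, shortest first, excluding s itself.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int len = 0; len < s.length(); len++)
            out.add(s.substring(0, len));
        return out;
    }

    // Suffixes of s, shortest first, excluding s itself.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int len = 0; len < s.length(); len++)
            out.add(s.substring(s.length() - len));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("andrew"));
        System.out.println(suffixes("andrew"));
    }
}
```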


2. The Brute Force Algorithm
• Check each position in the text T to see if the pattern P starts in that position

T: a n d r e w        T: a n d r e w
P: r e w              P:   r e w

P moves 1 char at a time through T ...

Brute Force in Java

// Return index where pattern starts, or -1
public static int brute(String text, String pattern)
{
  int n = text.length();      // n is length of text
  int m = pattern.length();   // m is length of pattern
  int j;
  for (int i = 0; i <= (n-m); i++) {
    j = 0;
    while ((j < m) &&
           (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;   // match at i
  }
  return -1;   // no match
} // end of brute()


Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BruteSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = brute(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}


Analysis
• Brute force pattern matching runs in time O(mn) in the worst case.
• But most searches of ordinary text take O(m+n), which is very quick.


• The brute force algorithm is fast when the alphabet of the text is large
– e.g. A..Z, a..z, 1..9, etc.

• It is slower when the alphabet is small
– e.g. 0, 1 (as in binary files, image files, etc.)


• Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"

• Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
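
One way to see the difference between the two cases is to count character comparisons. The sketch below repeats the brute force loop with a counter added; the class BruteCount and its method are hypothetical names, not part of the lab code.

```java
final class BruteCount {
    // Number of character comparisons brute force makes before it
    // finds the first match (or gives up).
    static int comparisons(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int count = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                count++;                                   // one comparison
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;
                j++;
            }
            if (j == m)
                return count;                              // match at i
        }
        return count;                                      // no match
    }

    public static void main(String[] args) {
        System.out.println(comparisons("aaaaaaaaaaaaaaaaaaaaaaaaaah", "aaah"));
        System.out.println(comparisons("a string searching example is standard", "store"));
    }
}
```

In the worst case every position costs almost m comparisons; in ordinary text most positions fail on the first or second character.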


3. The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
• But it shifts the pattern more intelligently than the brute force algorithm.


• If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
• Answer: shift P so that the largest prefix of P[1 .. j-1] that is also a suffix of P[2 .. j-1] stays lined up with the text


Example
[Figure: pattern P under text T, with a mismatch at text position i and pattern position j = 6; after the shift, jnew = 3]


Why
j == 6

• Find largest prefix (start) of:
"a b a a b" ( P[1 .. j-1] )
which is suffix (end) of:
"b a a b" ( P[2 .. j-1] )

• Answer: "a b"
• Set j = 3 // the new j value

KMP Border Function
• KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1).
• The border function b(k) is defined as the size of the largest prefix of P[1..k] that is also a suffix of P[2..k].


Border Function Example
• P: "abaaba"
j: 1 2 3 4 5 6

(k == j-1)
k:    1 2 3 4 5
b(k): 0 0 1 1 2

b(k) is the size of the largest border.

• In code, b() is represented by an array, like the table.


Why is b(5) == 2? P: "abaaba"
• b(5) means
– find the size of the largest prefix of P[1..5] that is also a suffix of P[2..5]
= find the size of the largest prefix of "abaab" that is also a suffix of "baab"
= find the size of "ab"
= 2
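
The table can be checked by computing b(k) straight from its definition. The sketch below is deliberately naive (it tries every candidate length) and is not the efficient method KMP uses; the class name BorderTable is invented, and b[k-1] holds the slides' b(k) because Java arrays start at 0.

```java
final class BorderTable {
    // b[k-1] = size of the largest prefix of the first k characters
    // that is also a suffix of them, excluding the whole k characters.
    static int[] border(String p) {
        int m = p.length();
        int[] b = new int[m];
        for (int k = 1; k <= m; k++) {
            // try the longest candidate first; len < k excludes the whole string
            for (int len = k - 1; len > 0; len--) {
                if (p.substring(0, len).equals(p.substring(k - len, k))) {
                    b[k - 1] = len;
                    break;
                }
            }
        }
        return b;
    }

    public static void main(String[] args) {
        // first five entries are b(1)..b(5) from the table: 0 0 1 1 2
        System.out.println(java.util.Arrays.toString(border("abaaba")));
    }
}
```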


Using the Failure Function
• Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.
– if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then
    k = j-1;
    j = b(k) + 1; // obtain the new j
– "failure function" is another name for the border function b()


KMP in Java

// Return index where pattern starts, or -1
public static int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();
  if (m == 0)
    return 0;   // an empty pattern matches at the start

  // fail[k-1] holds b(k), since Java indexes start at 0
  int fail[] = computeFail(pattern);

  int i = 0;   // position in text
  int j = 0;   // position in pattern
  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];   // shift the pattern; i does not move
    else
      i++;
  }
  return -1;   // no match
} // end of kmpMatch()


// Similar code to kmpMatch()
public static int[] computeFail(String pattern)
{
  int m = pattern.length();
  int fail[] = new int[m];
  if (m == 0)
    return fail;   // nothing to compute for an empty pattern
  fail[0] = 0;

  int j = 0;
  int i = 1;
  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {            // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
} // end of computeFail()


Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java KmpSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = kmpMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}


Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b
[Figure: P is drawn under T at five successive alignments; the character comparisons are numbered 1 to 19 in the order they are made]

k:    1 2 3 4 5
b(k): 0 0 1 0 1
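
The 19 comparisons in the example can be reproduced by adding a counter to the KMP loop. This is a sketch for checking the figure, assuming a non-empty pattern; KmpCount and match() are hypothetical names, and the result is returned as a two-element array {index, comparisons}.

```java
final class KmpCount {
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    // Returns {match index or -1, number of character comparisons}.
    // Assumes pattern is not empty.
    static int[] match(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] fail = computeFail(pattern);
        int i = 0, j = 0, count = 0;
        while (i < n) {
            count++;                                  // one text/pattern comparison
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return new int[] { i - m + 1, count };
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;
            }
        }
        return new int[] { -1, count };
    }

    public static void main(String[] args) {
        int[] r = match("abacaabaccabacabaabb", "abacab");
        System.out.println("posn " + r[0] + ", comparisons " + r[1]); // posn 10, comparisons 19
    }
}
```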


Why is b(5) == 1? P: "abacab"
• b(5) means
– find the size of the largest prefix of P[1..5] that is also a suffix of P[2..5]
= find the size of the largest prefix of "abaca" that is also a suffix of "baca"
= find the size of "a"
= 1


KMP Advantages
• KMP runs in optimal time: O(m+n)
– very fast

• The algorithm never needs to move backwards in the input text, T
– this makes the algorithm good for processing very large files that are read in from external devices or through a network stream
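
Because the text index never moves backwards, the text can be consumed one character at a time from a stream. The sketch below shows one way this might look with java.io.Reader; it is not part of the lab code, and the class KmpStream and its indexOf() method are invented for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class KmpStream {
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i++] = ++j;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                fail[i++] = 0;
            }
        }
        return fail;
    }

    // Returns the 0-based position of the first match in the stream, or -1.
    // Each character is read exactly once and never revisited.
    static long indexOf(Reader in, String pattern) {
        int m = pattern.length();
        if (m == 0)
            return 0;
        int[] fail = computeFail(pattern);
        long pos = 0;   // position of the character just read
        int j = 0;      // number of pattern characters currently matched
        try {
            int c;
            while ((c = in.read()) != -1) {
                while (j > 0 && pattern.charAt(j) != (char) c)
                    j = fail[j - 1];            // shift pattern, keep the same text char
                if (pattern.charAt(j) == (char) c)
                    j++;
                if (j == m)
                    return pos - m + 1;
                pos++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("abacaabaccabacabaabb");
        System.out.println(indexOf(in, "abacab")); // 10
    }
}
```

For a real file or socket, the Reader would normally be wrapped in a BufferedReader so that read() is cheap.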


KMP Disadvantages
• KMP doesn't work so well as the size of the alphabet increases
– more chance of a mismatch (more possible mismatches)
– mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later


KMP Extensions
• The basic algorithm doesn't take into account the letter in the text that caused the mismatch.

T: a b a a b x
P: a b a a b a
               a b a a b a    (basic KMP does not do this)
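
The slide does not say which extension it has in mind. One well-known formulation that does use the mismatched text character is the automaton (DFA) version of KMP described in Sedgewick and Wayne's Algorithms: the next state depends on both the current state and the text character. The sketch below assumes every character has a code below 256; the class name KmpDfa is made up.

```java
final class KmpDfa {
    static final int R = 256;   // assumed alphabet size; larger chars are not supported

    // dfa[c][j] = next state after reading character c in state j,
    // where state j means "the first j pattern characters are matched".
    static int[][] buildDfa(String pat) {
        int m = pat.length();
        int[][] dfa = new int[R][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][x];         // mismatch: behave like restart state x
            dfa[pat.charAt(j)][j] = j + 1;     // match: advance
            x = dfa[pat.charAt(j)][x];         // update restart state
        }
        return dfa;
    }

    // Returns index where pattern starts, or -1.
    static int search(String txt, String pat) {
        int n = txt.length();
        int m = pat.length();
        if (m == 0)
            return 0;
        int[][] dfa = buildDfa(pat);
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++)
            j = dfa[txt.charAt(i)][j];         // one table lookup per text character
        return (j == m) ? i - m : -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abaabxabaaba", "abaaba")); // 6
    }
}
```

With the slide's text, reading x in any state leads straight back to state 0, so no further comparisons are made against x.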


4. The Boyer-Moore Algorithm
• The Boyer-Moore pattern matching algorithm is based on two techniques.

• 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end


• 2. The character-jump technique
– when a mismatch occurs at T[i] == x
– the character in pattern P[j] is not the same as T[i]

• There are 3 possible cases, tried in order.

[Figure: T contains "x a" with i under x; P ends in "b a" with j under b]

Case 1
• If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i].

[Figure: before the shift, T contains "x a" and P is "x c b a", with i and j at the mismatch. After the shift, the x in P is under T[i], and i and j have moved right so that j is at the end of P (inew, jnew).]

Case 2
• If P contains x somewhere, but a shift right to the last occurrence is not possible, then shift P right by 1 character to T[i+1].

[Figure: before the shift, T contains "x a x" and P is "c w a x"; the x in P is after the j position. After the shift, P has moved right by 1, and i and j have moved right so that j is at the end of P (inew, jnew).]

Case 3
• If cases 1 and 2 do not apply, then shift P to align P[1] with T[i+1].

[Figure: before the shift, T contains "x a" and P is "d c b a"; there is no x in P. After the shift, P[1] is under T[i+1], and i and j have moved right so that j is at the end of P (inew, jnew).]

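
The three cases collapse into one formula for how far the text index i moves. The sketch below uses the same 0-based quantities as the Java code in these notes: m is the pattern length, j the mismatch position in P, and lo the index of the last occurrence of x in P (-1 if x is not in P). The class CharJump and method advance() are names invented here.

```java
final class CharJump {
    // Amount added to i after a mismatch; j is then reset to m-1.
    //   case 1 (lo <  j):  i moves by m - (1 + lo)
    //   case 2 (lo >  j):  i moves by m - j, i.e. P shifts right by 1
    //   case 3 (lo == -1): i moves by m, i.e. P jumps past T[i]
    static int advance(int m, int j, int lo) {
        return m - Math.min(j, 1 + lo);
    }

    public static void main(String[] args) {
        System.out.println(advance(6, 5, 3));   // case 1: 2
        System.out.println(advance(6, 2, 4));   // case 2: 4
        System.out.println(advance(6, 5, -1));  // case 3: 6
    }
}
```
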
Boyer-Moore Example (1)
T: "a pattern matching algorithm"
P: "rithm"
[Figure: P is drawn under T at seven successive alignments; the character comparisons are numbered 1 to 11 in the order they are made]


Last Occurrence Function
• Boyer-Moore's algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L()
– L() maps all the letters in A to integers

• L(x) is defined as: // x is a letter in A
– the largest index i such that P[i] == x, or
– -1 if no such index exists


L() Example
• A = {a, b, c, d}
• P: "abacab"

P:     a b a c a b
index: 1 2 3 4 5 6

x:    a  b  c  d
L(x): 5  6  4  -1

L() stores indexes into P[]
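
The table can be built in one pass over P. The sketch below keeps the slide's 1-based indexes and stores the result in a Map; the class LastOccurrence and its build() method are hypothetical, and passing the alphabet as a String is just a convenience for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LastOccurrence {
    // 1-based last occurrence table; -1 for letters that are not in the pattern.
    static Map<Character, Integer> build(String pattern, String alphabet) {
        Map<Character, Integer> last = new LinkedHashMap<>();
        for (char x : alphabet.toCharArray())
            last.put(x, -1);                       // default: letter not found
        for (int i = 0; i < pattern.length(); i++)
            last.put(pattern.charAt(i), i + 1);    // later occurrences overwrite earlier ones
        return last;
    }

    public static void main(String[] args) {
        System.out.println(build("abacab", "abcd")); // {a=5, b=6, c=4, d=-1}
    }
}
```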


Note
• In Boyer-Moore code, L() is calculated when the pattern P is read in.
• Usually L() is stored as an array
– something like the table in the previous slide


Boyer-Moore Example (2)
T: a b a c a a b a d c a b a c a b a a b b
P: a b a c a b
[Figure: P is drawn under T at six successive alignments; the character comparisons are numbered 1 to 13 in the order they are made]

x:    a  b  c  d
L(x): 5  6  4  -1

Boyer-Moore in Java

// Return index where pattern starts, or -1
public static int bmMatch(String text, String pattern)
{
  int last[] = buildLast(pattern);
  int n = text.length();
  int m = pattern.length();
  if (m == 0)
    return 0;    // an empty pattern matches at the start
  int i = m-1;

  if (i > n-1)
    return -1;   // no match if pattern is longer than text

  int j = m-1;
  do {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == 0)
        return i;   // match
      else {        // looking-glass technique
        i--;
        j--;
      }
    }
    else {   // character jump technique
      char c = text.charAt(i);
      int lo = (c < 128) ? last[c] : -1;   // last occ; a non-ASCII char is never in P
      i = i + m - Math.min(j, 1+lo);
      j = m - 1;
    }
  } while (i <= n-1);

  return -1;   // no match
} // end of bmMatch()


public static int[] buildLast(String pattern)
/* Return array storing index of last
   occurrence of each ASCII char in pattern. */
{
  int last[] = new int[128];   // ASCII char set

  for (int i = 0; i < 128; i++)
    last[i] = -1;   // initialize array

  for (int i = 0; i < pattern.length(); i++)
    last[pattern.charAt(i)] = i;

  return last;
} // end of buildLast()


Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BmSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = bmMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}


Analysis
• Boyer-Moore worst case running time is O(nm + A), where A is the size of the alphabet

• But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.
– e.g. good for English text, poor for binary

• Boyer-Moore is significantly faster than brute force for searching English text.


Worst Case Example
• T: "aaaaa…a"
• P: "baaaaa"

[Figure: with T drawn as nine a's, P is tried at four alignments, each shifted right by 1 from the one before. Every alignment costs 6 comparisons, made right to left, so the comparisons are numbered 1 to 24.]
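
The comparison counts in the Boyer-Moore figures can be checked with a counting version of the matcher. This sketch assumes a non-empty ASCII pattern and ASCII text, as buildLast() does; BmCount and match() are hypothetical names, and the result is returned as {index, comparisons}.

```java
import java.util.Arrays;

final class BmCount {
    static int[] buildLast(String pattern) {
        int[] last = new int[128];                 // ASCII only (assumption)
        Arrays.fill(last, -1);
        for (int i = 0; i < pattern.length(); i++)
            last[pattern.charAt(i)] = i;
        return last;
    }

    // Returns {match index or -1, number of character comparisons}.
    static int[] match(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] last = buildLast(pattern);
        int i = m - 1, j = m - 1, count = 0;
        while (i <= n - 1) {
            count++;                               // one text/pattern comparison
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0)
                    return new int[] { i, count }; // match
                i--;                               // looking-glass technique
                j--;
            } else {
                int lo = last[text.charAt(i)];     // character jump technique
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        }
        return new int[] { -1, count };
    }

    public static void main(String[] args) {
        int[] worst = match("aaaaaaaaa", "baaaaa");
        System.out.println("posn " + worst[0] + ", comparisons " + worst[1]); // posn -1, comparisons 24
        int[] ex2 = match("abacaabadcabacabaabb", "abacab");
        System.out.println("posn " + ex2[0] + ", comparisons " + ex2[1]);     // posn 10, comparisons 13
    }
}
```

In the worst case the mismatch always happens at the first character of P, after m comparisons, and the last occurrence table only allows a shift of 1.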


5. More Information
• Algorithms in C++
Robert Sedgewick
Addison-Wesley, 1992
– chapter 19, String Searching
– this book is in the CoE library

• Online Animated Algorithms:
– http://www.ics.uci.edu/~goodrich/dsa/11strings/demos/pattern/
– http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM1.html
– http://www-igm.univ-mlv.fr/~lecroq/string/