Pattern Matching

The document provides information about pattern matching algorithms. It describes the brute force algorithm, which checks for a pattern match at each position in a text in O(mn) time. It then covers the Boyer-Moore and Knuth-Morris-Pratt (KMP) algorithms, which are more efficient by skipping over text when a mismatch is found. Boyer-Moore uses a last occurrence table and character jumping, while KMP uses a failure function to determine the maximum shift of the pattern. Pseudocode and examples are given for each algorithm.

Uploaded by

Adeel Ahmad
Copyright
© Attribution Non-Commercial (BY-NC)

240-301, Computer Engineering Lab III (Software) Semester 1, 2006-2007

Pattern Matching
Dr. Andrew Davison WiG Lab (teachers room), CoE

[Title-slide figure: a pattern P aligned against a text T, with numbered character comparisons.]

240-301 Comp. Eng. Lab III (Software), Pattern Matching

Overview
1. What is Pattern Matching?
2. The Brute Force Algorithm
3. The Boyer-Moore Algorithm
4. The Knuth-Morris-Pratt Algorithm
5. More Information

1. What is Pattern Matching?


Definition:
given a text string T and a pattern string P, find the pattern inside the text

T: "the rain in spain stays mainly on the plain"
P: "n th"

Applications:
text editors, Web search engines (e.g. Google), image analysis
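
As a quick illustration of the definition, Java's built-in String.indexOf() already performs this kind of search. The sketch below runs it on the T and P shown on this slide; the class name FindDemo and the method find() are made up for the example.

```java
// Illustration only: FindDemo and find() are hypothetical names.
final class FindDemo {
    // Returns the index in text where pattern first starts, or -1 if absent.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        String t = "the rain in spain stays mainly on the plain";
        String p = "n th";
        System.out.println(find(t, p));   // prints 32 (the "n th" inside "on the")
    }
}
```

The algorithms in the rest of these slides return the same kind of result: the start index of the first match, or -1.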

String Concepts
Assume S is a string of size m.

A substring S[i .. j] of S is the string fragment between indexes i and j.

A prefix of S is a substring S[0 .. i].
A suffix of S is a substring S[i .. m-1].

i is any index between 0 and m-1.

Examples
S: a n d r e w    (indexes 0 to 5)

Substring: S[1..3] == "ndr"

All possible prefixes of S:
"andrew", "andre", "andr", "and", "an", "a"

All possible suffixes of S:
"andrew", "ndrew", "drew", "rew", "ew", "w"
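
The string concepts above can be sketched in Java as follows. One thing to watch: the slides use inclusive indexes in S[i .. j], while Java's substring() takes an exclusive end index. The class and method names (StringConcepts, sub, prefixes, suffixes) are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: all names here are hypothetical.
final class StringConcepts {
    // S[i .. j] with both indexes inclusive, as on the slides.
    // Java's substring() end index is exclusive, hence j + 1.
    static String sub(String s, int i, int j) {
        return s.substring(i, j + 1);
    }

    // All prefixes S[0 .. i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(sub(s, 0, i));
        return out;
    }

    // All suffixes S[i .. m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(sub(s, i, s.length() - 1));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sub("andrew", 1, 3));   // ndr
        System.out.println(prefixes("andrew"));    // [andrew, andre, andr, and, an, a]
        System.out.println(suffixes("andrew"));    // [andrew, ndrew, drew, rew, ew, w]
    }
}
```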

2. The Brute Force Algorithm


Check each position in the text T to see if the pattern P starts in that position.

T: a n d r e w
P: r e w

T: a n d r e w
P:   r e w

P moves 1 char at a time through T

....

Brute Force in Java

Return index where pattern starts, or -1

public static int brute(String text, String pattern) {
    int n = text.length();      // n is length of text
    int m = pattern.length();   // m is length of pattern
    int j;
    for (int i = 0; i <= (n - m); i++) {
        j = 0;
        while ((j < m) && (text.charAt(i + j) == pattern.charAt(j)))
            j++;
        if (j == m)
            return i;   // match at i
    }
    return -1;   // no match
}  // end of brute()

Usage
public static void main(String args[]) {
    if (args.length != 2) {
        System.out.println("Usage: java BruteSearch <text> <pattern>");
        System.exit(0);
    }
    System.out.println("Text: " + args[0]);
    System.out.println("Pattern: " + args[1]);
    int posn = brute(args[0], args[1]);
    if (posn == -1)
        System.out.println("Pattern not found");
    else
        System.out.println("Pattern starts at posn " + posn);
}

Analysis
Brute force pattern matching runs in time O(mn) in the worst case.

But most searches of ordinary text take O(m+n), which is very quick.

The brute force algorithm is fast when the alphabet of the text is large
  e.g. A..Z, a..z, 1..9, etc.

It is slower when the alphabet is small
  e.g. 0, 1 (as in binary files, image files, etc.)

Example of a worst case:
T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
P: "aaah"

Example of a more average case:
T: "a string searching example is standard"
P: "store"
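
To see the difference between the two cases, the brute force search can be instrumented to count character comparisons. This is only a sketch: BruteCount, bruteCount() and repeat() are hypothetical names, and the worst-case text is built as 26 a's followed by h, which is an assumption about the length of the run on the slide.

```java
// Illustration only: all names here are hypothetical.
final class BruteCount {
    static int comparisons;   // character comparisons made by the last call

    // Same search as brute(), but counts every character comparison.
    static int bruteCount(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;   // mismatch: try the next text position
                j++;
            }
            if (j == m)
                return i;    // match at i
        }
        return -1;           // no match
    }

    // Helper: the character c repeated k times.
    static String repeat(char c, int k) {
        StringBuilder sb = new StringBuilder();
        for (int x = 0; x < k; x++)
            sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Worst-case shape: every position costs m comparisons.
        int pos = bruteCount(repeat('a', 26) + "h", "aaah");
        System.out.println(pos + " after " + comparisons + " comparisons");  // 23 after 96 comparisons

        // More average case: most positions fail on the first character.
        pos = bruteCount("a string searching example is standard", "store");
        System.out.println(pos + " after " + comparisons + " comparisons");
    }
}
```

In the worst case the count is (n-m+1)*m = 24*4 = 96, while in the average case it stays close to the length of the text.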

3. The Boyer-Moore Algorithm


The Boyer-Moore pattern matching algorithm is based on two techniques.

1. The looking-glass technique

find P in T by moving backwards through P, starting at its end

2. The character-jump technique

when a mismatch occurs at T[i] == x, the character in pattern P[j] is not the same as T[i]

There are 3 possible cases, tried in order.

[Figure: T shows x a, with i marking x; P ends in b a, with j marking b.]

Case 1
If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i], and move i and j right, so j is at the end of P.

[Figure: before the shift, T shows x a with i at x, and P shows x c b a with j at b. After the shift, the x in P sits under T[i], and inew and jnew mark the last position of P.]

Case 2
If P contains x somewhere, but a shift right to the last occurrence is not possible, then shift P right by 1 character to T[i+1], and move i and j right, so j is at the end of P.

[Figure: T shows x a x with i at the mismatched x; P shows c w a x with j at w, so the x in P is after the j position. After the shift, P has moved right by 1, and inew and jnew mark the last position of P.]

Case 3
If cases 1 and 2 do not apply, then shift P to align P[0] with T[i+1], and move i and j right, so j is at the end of P.

[Figure: T shows x a with i at x; P shows d c b a with j at b, and there is no x in P. After the shift, P[0] (index 0) sits under T[i+1], and inew and jnew mark the last position of P.]
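
The three cases can be written out as one small function that computes the new text index i after a mismatch. This is a sketch, not the code from the slides: the names CharacterJump and jump() are hypothetical, and lo stands for the index of the last occurrence of the mismatched text character x in P (or -1 if x is not in P).

```java
// Illustration only: CharacterJump and jump() are hypothetical names.
final class CharacterJump {
    // New text index i after a mismatch between T[i] and P[j].
    // m is the pattern length; lo is the last occurrence of T[i] in P, or -1.
    static int jump(int i, int j, int m, int lo) {
        if (lo != -1 && lo < j)
            return i + m - (lo + 1);   // case 1: align last x in P with T[i]
        else if (lo > j)
            return i + m - j;          // case 2: last x is after j, shift P by 1
        else
            return i + m;              // case 3: no x in P, P[0] goes to T[i+1]
    }

    public static void main(String[] args) {
        // All three cases collapse into the single expression
        // i + m - Math.min(j, 1 + lo); check that on every valid (j, lo) pair.
        int m = 6, i = 10;
        for (int j = 0; j < m; j++)
            for (int lo = -1; lo < m; lo++)
                if (lo != j && jump(i, j, m, lo) != i + m - Math.min(j, 1 + lo))
                    throw new AssertionError("mismatch at j=" + j + ", lo=" + lo);
        System.out.println("three cases agree with i + m - min(j, 1 + lo)");
    }
}
```

In every case j is then reset to m-1, the end of the pattern. The value lo == j cannot occur, because P[j] is by definition not equal to x.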

Boyer-Moore Example (1)


T: "a pattern matching algorithm"
P: "rithm"

[Figure: P is slid along T through successive alignments; the character comparisons are numbered 1 to 11 in the order they are made.]

Last Occurrence Function


Boyer-Moore's algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L()
  L() maps all the letters in A to integers

L(x) is defined as:    // x is a letter in A
  the largest index i such that P[i] == x, or -1 if no such index exists

L() Example
A = {a, b, c, d}
P: "abacab"

P       a   b   a   c   a   b
index   0   1   2   3   4   5

x       a   b   c   d
L(x)    4   5   3   -1

L() stores indexes into P[]
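
As a small illustration, L() can also be held in a map instead of an array, which avoids fixing the alphabet size in advance. This is a sketch of an alternative representation, not the one used later in these slides (which is an array indexed by ASCII code); LastOccurrence, buildLastMap() and lookup() are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: all names here are hypothetical.
final class LastOccurrence {
    // Map each character of the pattern to the largest index where it occurs.
    static Map<Character, Integer> buildLastMap(String pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++)
            last.put(pattern.charAt(i), i);   // later indexes overwrite earlier ones
        return last;
    }

    // L(x): last index of x in the pattern, or -1 if x does not occur.
    static int lookup(Map<Character, Integer> last, char x) {
        return last.getOrDefault(x, -1);
    }

    public static void main(String[] args) {
        Map<Character, Integer> last = buildLastMap("abacab");
        for (char x : new char[] {'a', 'b', 'c', 'd'})
            System.out.println(x + " " + lookup(last, x));   // a 4, b 5, c 3, d -1
    }
}
```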

Note
In Boyer-Moore code, L() is calculated when the pattern P is read in.

Usually L() is stored as an array
  something like the table in the previous slide

Boyer-Moore Example (2)


P: "abacab"

[Figure: P is slid along T through successive alignments; the character comparisons are numbered 1 to 13 in the order they are made.]

x       a   b   c   d
L(x)    4   5   3   -1

Boyer-Moore in Java

Return index where pattern starts, or -1

public static int bmMatch(String text, String pattern) {
    int last[] = buildLast(pattern);
    int n = text.length();
    int m = pattern.length();
    if (m == 0)
        return 0;    // empty pattern matches at 0, as in brute()
    int i = m - 1;
    if (i > n - 1)
        return -1;   // no match if pattern is longer than text
    int j = m - 1;
    do {
        if (pattern.charAt(j) == text.charAt(i)) {
            if (j == 0)
                return i;   // match
            else {          // looking-glass technique
                i--;
                j--;
            }
        } else {            // character jump technique
            int lo = last[text.charAt(i)];   // last occ
            i = i + m - Math.min(j, 1 + lo);
            j = m - 1;
        }
    } while (i <= n - 1);
    return -1;   // no match
}  // end of bmMatch()

public static int[] buildLast(String pattern)
/* Return array storing index of last occurrence
   of each ASCII char in pattern. */
{
    int last[] = new int[128];   // ASCII char set
    for (int i = 0; i < 128; i++)
        last[i] = -1;            // initialize array
    for (int i = 0; i < pattern.length(); i++)
        last[pattern.charAt(i)] = i;
    return last;
}  // end of buildLast()

Usage
public static void main(String args[]) {
    if (args.length != 2) {
        System.out.println("Usage: java BmSearch <text> <pattern>");
        System.exit(0);
    }
    System.out.println("Text: " + args[0]);
    System.out.println("Pattern: " + args[1]);
    int posn = bmMatch(args[0], args[1]);
    if (posn == -1)
        System.out.println("Pattern not found");
    else
        System.out.println("Pattern starts at posn " + posn);
}

Analysis
Boyer-Moore worst case running time is O(nm + A), where A here stands for the size of the alphabet.

But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.
  e.g. good for English text, poor for binary

Boyer-Moore is significantly faster than brute force for searching English text.

Worst Case Example


T: "aaaaa...a"
P: "baaaaa"

T: a a a a a a a a a
P: b a a a a a

[Figure: P is tried at four successive alignments, each one character to the right of the last. Every alignment costs 6 comparisons before the mismatch at b, so the comparisons are numbered 1 to 24.]
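
The comparison counts in the Boyer-Moore figures can be reproduced with an instrumented copy of the matcher. This is a sketch under a few assumptions: BmCount and bmCount() are hypothetical names, the loop is written as a while loop rather than the do-while of bmMatch(), and the worst-case text is taken to be the nine a's drawn in the figure.

```java
import java.util.Arrays;

// Illustration only: BmCount and bmCount() are hypothetical names.
final class BmCount {
    static int comparisons;   // character comparisons made by the last call

    // Last occurrence table for ASCII characters, as in buildLast().
    static int[] buildLast(String pattern) {
        int[] last = new int[128];
        Arrays.fill(last, -1);
        for (int i = 0; i < pattern.length(); i++)
            last[pattern.charAt(i)] = i;
        return last;
    }

    // Same logic as bmMatch(), but counts every character comparison.
    static int bmCount(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        if (m == 0)
            return 0;
        int[] last = buildLast(pattern);
        int i = m - 1;
        int j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0)
                    return i;           // match
                i--;                    // looking-glass technique
                j--;
            } else {                    // character jump technique
                int lo = last[text.charAt(i)];
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        }
        return -1;                      // no match
    }

    public static void main(String[] args) {
        int pos = bmCount("a pattern matching algorithm", "rithm");
        System.out.println(pos + " after " + comparisons + " comparisons");  // 23 after 11 comparisons
        pos = bmCount("aaaaaaaaa", "baaaaa");
        System.out.println(pos + " after " + comparisons + " comparisons");  // -1 after 24 comparisons
    }
}
```

The first call needs only 11 comparisons for a 28-character text, while the second spends 6 comparisons on each of its 4 alignments.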

4. The KMP Algorithm


The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).

But it shifts the pattern more intelligently than the brute force algorithm.

If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]

Example
[Figure: T and P aligned, with a mismatch at j = 5; after the shift, comparison resumes at jnew = 2.]

Why
j == 5

Find largest prefix (start) of:
  "a b a a b"    ( P[0 .. j-1] )
which is suffix (end) of:
  "b a a b"      ( P[1 .. j-1] )

Answer: "a b"
Set j = 2    // the new j value

KMP Failure Function


KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

j = mismatch position in P[]
k = position before the mismatch (k = j-1).

The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example


P: "abaaba"
j:  012345

(k == j-1)

k       0   1   2   3   4
F(k)    0   0   1   1   2

F(k) is the size of the largest prefix.

In code, F() is represented by an array, like the table.

Why is F(4) == 2?
P: "abaaba"

F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = find the size of "ab"
  = 2
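
The definition of F(k) can be coded directly, which is handy for checking a table by hand. The sketch below tries every candidate length from the longest down. It is deliberately naive and slow, and is not the method KMP itself uses; NaiveFail and naiveFail() are hypothetical names.

```java
// Illustration only: NaiveFail and naiveFail() are hypothetical names.
final class NaiveFail {
    // F(k) straight from the definition: the size of the largest prefix
    // of P[0..k] that is also a suffix of P[1..k]. For checking only.
    static int[] naiveFail(String p) {
        int[] f = new int[p.length()];
        for (int k = 0; k < p.length(); k++) {
            // P[1..k] has k characters, so no candidate can be longer than k.
            for (int len = k; len >= 1; len--) {
                String prefix = p.substring(0, len);              // P[0 .. len-1]
                String suffix = p.substring(k - len + 1, k + 1);  // last len chars of P[1..k]
                if (prefix.equals(suffix)) {
                    f[k] = len;
                    break;   // first hit is the largest
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(naiveFail("abaaba")));   // [0, 0, 1, 1, 2, 3]
    }
}
```

For P = "abaaba" the first five entries agree with the table above; the sixth entry, F(5) = 3, corresponds to the prefix "aba".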

Using the Failure Function


Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.

if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then
  k = j-1;
  j = F(k);    // obtain the new j

KMP in Java

Return index where pattern starts, or -1

public static int kmpMatch(String text, String pattern) {
    int n = text.length();
    int m = pattern.length();
    if (m == 0)
        return 0;   // empty pattern matches at 0; also avoids fail[0] on an empty array
    int fail[] = computeFail(pattern);
    int i = 0;
    int j = 0;
    while (i < n) {
        if (pattern.charAt(j) == text.charAt(i)) {
            if (j == m - 1)
                return i - m + 1;   // match
            i++;
            j++;
        } else if (j > 0)
            j = fail[j - 1];
        else
            i++;
    }
    return -1;   // no match
}  // end of kmpMatch()

public static int[] computeFail(String pattern) {
    int fail[] = new int[pattern.length()];
    fail[0] = 0;
    int m = pattern.length();
    int j = 0;
    int i = 1;
    // the loop below is similar code to kmpMatch()
    while (i < m) {
        if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
            fail[i] = j + 1;
            i++;
            j++;
        } else if (j > 0)   // j follows matching prefix
            j = fail[j - 1];
        else {              // no match
            fail[i] = 0;
            i++;
        }
    }
    return fail;
}  // end of computeFail()

Usage
public static void main(String args[]) {
    if (args.length != 2) {
        System.out.println("Usage: java KmpSearch <text> <pattern>");
        System.exit(0);
    }
    System.out.println("Text: " + args[0]);
    System.out.println("Pattern: " + args[1]);
    int posn = kmpMatch(args[0], args[1]);
    if (posn == -1)
        System.out.println("Pattern not found");
    else
        System.out.println("Pattern starts at posn " + posn);
}

Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k       0   1   2   3   4
F(k)    0   0   1   0   1

[Figure: P is placed against T at five successive alignments; the character comparisons are numbered 1 to 19 in the order they are made.]
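
The 19 comparisons in this example can be reproduced by adding a counter to the matcher. This is only a sketch: KmpCount and kmpCount() are hypothetical names, and computeFail() is repeated here so that the block is self-contained.

```java
// Illustration only: KmpCount and kmpCount() are hypothetical names.
final class KmpCount {
    static int comparisons;   // text/pattern comparisons made by the last call

    // Failure function, same logic as computeFail() in the slides.
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0)
                j = fail[j - 1];
            else {
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    // Same logic as kmpMatch(), but counts every character comparison.
    static int kmpCount(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        if (m == 0)
            return 0;
        int[] fail = computeFail(pattern);
        int i = 0;
        int j = 0;
        while (i < n) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return i - m + 1;   // match
                i++;
                j++;
            } else if (j > 0)
                j = fail[j - 1];        // shift pattern, i stays where it is
            else
                i++;
        }
        return -1;                      // no match
    }

    public static void main(String[] args) {
        int pos = kmpCount("abacaabaccabacabaabb", "abacab");
        System.out.println(pos + " after " + comparisons + " comparisons");  // 10 after 19 comparisons
    }
}
```

Note that i never decreases: after a mismatch only j changes, which is where the saving over brute force comes from.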

Why is F(4) == 1?
P: "abacab"

F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = find the size of "a"
  = 1

KMP Advantages
KMP runs in optimal time: O(m+n)
  very fast

The algorithm never needs to move backwards in the input text, T
  this makes the algorithm good for processing very large files that are read in from external devices or through a network stream
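
Because the text index never moves backwards, KMP can consume its input one character at a time without buffering it. The sketch below shows one way this might look with a java.io.Reader. KmpStream and indexOf() are hypothetical names, and the inner loop is a slightly restructured (but equivalent) form of the kmpMatch() logic.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustration only: KmpStream and indexOf() are hypothetical names.
final class KmpStream {
    // Failure function, same logic as computeFail() in the slides.
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0;
        for (int i = 1; i < m; ) {
            if (pattern.charAt(j) == pattern.charAt(i))
                fail[i++] = ++j;
            else if (j > 0)
                j = fail[j - 1];
            else
                fail[i++] = 0;
        }
        return fail;
    }

    // Position in the stream where pattern first starts, or -1.
    // Each character is read exactly once and never revisited.
    static long indexOf(Reader in, String pattern) {
        int m = pattern.length();
        if (m == 0)
            return 0;
        int[] fail = computeFail(pattern);
        long i = 0;   // position of the character just read
        int j = 0;    // number of pattern characters currently matched
        try {
            int c;
            while ((c = in.read()) != -1) {
                while (j > 0 && pattern.charAt(j) != (char) c)
                    j = fail[j - 1];            // fall back inside the pattern only
                if (pattern.charAt(j) == (char) c)
                    j++;
                if (j == m)
                    return i - m + 1;           // match
                i++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;                              // no match
    }

    public static void main(String[] args) {
        Reader in = new StringReader("abacaabaccabacabaabb");
        System.out.println(indexOf(in, "abacab"));   // prints 10
    }
}
```

A StringReader stands in for the file or network stream here; any other Reader could be passed in the same way.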

KMP Disadvantages
KMP doesn't work so well as the size of the alphabet increases
  more chance of a mismatch (more possible mismatches)
  mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

KMP Extensions
The basic algorithm doesn't take into account the letter in the text that caused the mismatch.

T: a b a a b x
P: a b a a b a
               a b a a b a    (P shifted past x: Basic KMP does not do this.)

5. More Information

Algorithms in C++
Robert Sedgewick
Addison-Wesley, 1992
  chapter 19, String Searching

This book is in the CoE library.

Online Animated Algorithms:
  http://www.ics.uci.edu/~goodrich/dsa/11strings/demos/pattern/
  http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM1.html
  http://www-igm.univ-mlv.fr/~lecroq/string/
