Pattern Matching
Dr. Andrew Davison WiG Lab (teachers room), CoE
Overview
1. What is Pattern Matching?
2. The Brute Force Algorithm
3. The Boyer-Moore Algorithm
4. The Knuth-Morris-Pratt Algorithm
5. More Information
1. What is Pattern Matching?
Definition: given a text string T and a pattern string P, find the pattern inside the text.
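As a point of comparison, Java's standard library already solves this problem: String.indexOf() returns the index where a pattern first occurs in a text, or -1 if it is absent. The sketch below is only an illustration; the class name IndexOfDemo, the helper find() and the sample strings are assumptions made for this example.

```java
class IndexOfDemo {
    // Thin wrapper (hypothetical helper) so the call reads like the matchers in these notes.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);   // first index of pattern in text, or -1
    }

    public static void main(String[] args) {
        System.out.println(find("andrew", "rew"));   // 3
        System.out.println(find("andrew", "xyz"));   // -1
    }
}
```

The algorithms in the rest of these notes return the same kind of result: the start position of the first match, or -1.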
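String Concepts
Assume S is a string of size m.
A substring S[i..j] of S is the string fragment between indexes i and j.
A prefix of S is a substring S[0..i].
A suffix of S is a substring S[i..m-1].

Examples
S: "andrew" (indexes 0 to 5)
Substring S[1..3] == "ndr"
All possible prefixes of S: "andrew", "andre", "andr", "and", "an", "a"
All possible suffixes of S: "andrew", "ndrew", "drew", "rew", "ew", "w"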
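The substring, prefix and suffix notation can be tried out directly in Java. This is a minimal sketch: the class StringConcepts and its helper names are assumptions made for illustration. Note that the S[i..j] notation includes both ends, while Java's substring(begin, end) excludes the end index.

```java
import java.util.ArrayList;
import java.util.List;

class StringConcepts {
    // Substring S[i..j] with BOTH ends inclusive, as in the slide notation.
    // Java's substring(begin, end) excludes 'end', hence the j + 1.
    static String sub(String s, int i, int j) {
        return s.substring(i, j + 1);
    }

    // All non-empty prefixes S[0..i], shortest first.
    static List<String> prefixes(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            result.add(sub(s, 0, i));
        return result;
    }

    // All non-empty suffixes S[i..m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> result = new ArrayList<>();
        int m = s.length();
        for (int i = 0; i < m; i++)
            result.add(sub(s, i, m - 1));
        return result;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(sub(s, 1, 3));   // ndr
        System.out.println(prefixes(s));    // [a, an, and, andr, andre, andrew]
        System.out.println(suffixes(s));    // [andrew, ndrew, drew, rew, ew, w]
    }
}
```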
2. The Brute Force Algorithm
Check each position in the text T to see if the pattern P starts in that position.

T: a n d r e w
P: r e w

T: a n d r e w
P:   r e w
....
240-301 Comp. Eng. Lab III (Software), Pattern Matching 6
Brute Force in Java
public static int brute(String text, String pattern)
{
  int n = text.length();      // n is length of text
  int m = pattern.length();   // m is length of pattern
  int j;
  for (int i = 0; i <= (n-m); i++) {
    j = 0;
    while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;   // match at i
  }
  return -1;   // no match
}  // end of brute()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BruteSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = brute(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Analysis
Brute force pattern matching runs in time O(mn) in the worst case.
But most searches of ordinary text take O(m+n), which is very quick.
The brute force algorithm is fast when the alphabet of the text is large
  e.g. A..Z, a..z, 1..9, etc.
It is slower when the alphabet is small
  e.g. 0, 1 (binary data)
Example of a worst case:
  T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
  P: "aaah"
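The O(mn) behaviour of the worst case can be seen by counting character comparisons. The sketch below is a variant of brute() with a counter added; the class BruteCount, the comparisons field and the helper worstCaseText() are assumptions made for this illustration, and the text is built as 26 a's followed by h (an assumed length, in the spirit of the worst-case example).

```java
class BruteCount {
    static int comparisons;   // hypothetical counter, not part of the original brute()

    // Same search as brute(), but every character comparison is counted.
    static int brute(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;     // mismatch: try the next start position
                j++;
            }
            if (j == m)
                return i;      // match at i
        }
        return -1;             // no match
    }

    // Builds a text of 'count' a's followed by a single h.
    static String worstCaseText(int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++)
            sb.append('a');
        return sb.append('h').toString();
    }

    public static void main(String[] args) {
        int posn = brute(worstCaseText(26), "aaah");
        System.out.println("posn = " + posn);                // posn = 23
        System.out.println("comparisons = " + comparisons);  // comparisons = 96
    }
}
```

Each of the 24 start positions costs all m = 4 comparisons, because the mismatch (or the match) is only decided at the last character of the pattern.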
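3. The Boyer-Moore Algorithm
The Boyer-Moore pattern matching algorithm is based on two techniques.
1. The looking-glass technique
  find P in T by moving backwards through P, starting at its end
2. The character-jump technique
  when a mismatch occurs at T[i] == x
  the character in pattern P[j] is not the same as T[i]
There are 3 possible cases.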
Case 1
If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i], and move i and j right, so j is at the end of P.
Case 2
If P contains x somewhere, but a shift right to the last occurrence is not possible (x is after the j position in P), then shift P right by 1 character to T[i+1], and move i and j right, so j is at the end of P.
Case 3
If cases 1 and 2 do not apply (there is no x in P), then shift P to align P[0] with T[i+1], and move i and j right, so j is at the end of P.
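The three cases can be summarised by a single shift formula. The sketch below derives it from the index update used in bmMatch(), i = i + m - Math.min(j, 1+lo), where lo is the index of the last occurrence of x in P (or -1 if x is not in P). The class BmShift and the method shift() are assumptions made for this illustration; the sample values use the pattern "abacab".

```java
class BmShift {
    // Number of positions the pattern moves right after a mismatch at P[j],
    // where lo is the last index of the mismatched text character x in P
    // (or -1 if x does not occur in P).
    // Derived from bmMatch(): the old pattern start is i - j, and the new
    // start is (i + m - min(j, 1+lo)) - (m - 1).
    static int shift(int j, int lo) {
        return j + 1 - Math.min(j, 1 + lo);
    }

    public static void main(String[] args) {
        // P = "abacab", mismatch at j = 5
        System.out.println(shift(5, 3));    // case 1: x = 'c', last at 3 -> shift 2
        // P = "abacab", mismatch at j = 3
        System.out.println(shift(3, 5));    // case 2: x = 'b', last at 5 (after j) -> shift 1
        // P = "abacab", mismatch at j = 5
        System.out.println(shift(5, -1));   // case 3: x not in P -> shift 6 (whole pattern)
    }
}
```

In case 1 the shift is j - lo, in case 2 it is 1, and in case 3 it is j + 1, which moves P[0] to T[i+1].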
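Last Occurrence Function
The Boyer-Moore algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L().
L() maps all the letters in A to integers.
L(x) is defined as:   // x is a letter in A
  the largest index i such that P[i] == x, or
  -1 if no such index exists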
L() Example
A = {a, b, c, d}
P: "abacab"

index:  0  1  2  3  4  5
P:      a  b  a  c  a  b

x       a  b  c  d
L(x)    4  5  3  -1
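The table can be reproduced with a few lines of Java. This sketch stores L() in a HashMap so that it works for any character; that choice, and the names LastOccurrence and L(), are assumptions made for this illustration and are not the array-based buildLast() used in the Boyer-Moore code of these notes.

```java
import java.util.HashMap;
import java.util.Map;

class LastOccurrence {
    // Map each character of the pattern to the index of its last occurrence.
    static Map<Character, Integer> buildLast(String pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++)
            last.put(pattern.charAt(i), i);   // later indexes overwrite earlier ones
        return last;
    }

    // L(x): last index of x in the pattern, or -1 if x does not occur.
    static int L(Map<Character, Integer> last, char x) {
        return last.getOrDefault(x, -1);
    }

    public static void main(String[] args) {
        Map<Character, Integer> last = buildLast("abacab");
        for (char x : new char[] {'a', 'b', 'c', 'd'})
            System.out.println(x + " " + L(last, x));   // a 4, b 5, c 3, d -1
    }
}
```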
Note
In Boyer-Moore code, L() is calculated when the pattern P is read in.
Usually L() is stored as an array.
Boyer-Moore in Java
public static int bmMatch(String text, String pattern)
{
  int last[] = buildLast(pattern);
  int n = text.length();
  int m = pattern.length();
  int i = m-1;

  if (i > n-1)
    return -1;   // no match if pattern is longer than text

  int j = m-1;
  do {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == 0)
        return i;   // match
      else {   // looking-glass technique
        i--;
        j--;
      }
    }
    else {   // character jump technique
      int lo = last[text.charAt(i)];   // last occ
      i = i + m - Math.min(j, 1+lo);
      j = m - 1;
    }
  } while (i <= n-1);

  return -1;   // no match
}  // end of bmMatch()
public static int[] buildLast(String pattern)
/* Return array storing index of last
   occurrence of each ASCII char in pattern. */
{
  int last[] = new int[128];   // ASCII char set

  for (int i = 0; i < 128; i++)
    last[i] = -1;   // initialize array

  for (int i = 0; i < pattern.length(); i++)
    last[pattern.charAt(i)] = i;

  return last;
}  // end of buildLast()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BmSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = bmMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Analysis
Boyer-Moore worst case running time is O(nm + A).
But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.
  e.g. good for English text, poor for binary
Example of a worst case:
  T: "aaaaaaaaa"
  P: "baaaaa"

T: a a a a a a a a a
P: b a a a a a             comparisons 1-6 (right to left)
     b a a a a a           comparisons 7-12
       b a a a a a         comparisons 13-18
         b a a a a a       comparisons 19-24
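The worst case can be checked by counting comparisons. The sketch below repeats the Boyer-Moore code from these notes with a counter added; the class BmCount and the comparisons field are assumptions made for this illustration, and the text is taken to be nine a's with the pattern "baaaaa".

```java
import java.util.Arrays;

class BmCount {
    static int comparisons;   // hypothetical counter, not part of the original bmMatch()

    // Index of the last occurrence of each ASCII char in the pattern, or -1.
    static int[] buildLast(String pattern) {
        int[] last = new int[128];   // assumes ASCII input
        Arrays.fill(last, -1);
        for (int i = 0; i < pattern.length(); i++)
            last[pattern.charAt(i)] = i;
        return last;
    }

    static int bmMatch(String text, String pattern) {
        int[] last = buildLast(pattern);
        int n = text.length();
        int m = pattern.length();
        comparisons = 0;
        int i = m - 1;
        if (i > n - 1)
            return -1;               // pattern longer than text
        int j = m - 1;
        do {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0)
                    return i;        // match
                i--;                 // looking-glass technique
                j--;
            } else {                 // character jump technique
                int lo = last[text.charAt(i)];
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;                   // no match
    }

    public static void main(String[] args) {
        int posn = bmMatch("aaaaaaaaa", "baaaaa");
        System.out.println("posn = " + posn);                // posn = -1
        System.out.println("comparisons = " + comparisons);  // comparisons = 24
    }
}
```

Every alignment matches five a's before failing on the b, and each failure only moves the pattern right by one position, which gives 4 alignments of 6 comparisons each.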
4. The Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
But it shifts the pattern more intelligently than the brute force algorithm.
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]
Example
(Figure: T and P aligned with a mismatch at j = 5; after the shift, jnew = 2.)
KMP Failure Function
KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
  j = mismatch position in P[]
  k = position before the mismatch (k = j-1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
Failure Function Example
P: "abaaba"
j: 012345

k       0  1  2  3  4
F(k)    0  0  1  1  2
Why is F(4) == 2?
P: "abaaba"
F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = find the size of "ab"
  = 2
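The definition of F(k) can be coded directly, which is a handy way to check a failure table by hand. This is a deliberately simple sketch that follows the wording of the definition; it is not the efficient computeFail() used by KMP, and the class name FailureByDefinition is an assumption made for this illustration.

```java
class FailureByDefinition {
    // F(k): size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    // P[1..k] has k characters, so the candidate size is at most k.
    static int F(String p, int k) {
        for (int size = k; size > 0; size--) {                  // longest candidate first
            String prefix = p.substring(0, size);               // P[0..size-1]
            String suffix = p.substring(k - size + 1, k + 1);   // last 'size' chars of P[1..k]
            if (prefix.equals(suffix))
                return size;
        }
        return 0;   // no non-empty prefix is also a suffix
    }

    public static void main(String[] args) {
        String p = "abaaba";
        for (int k = 0; k <= 4; k++)
            System.out.print(F(p, k) + " ");   // 0 0 1 1 2
        System.out.println();
    }
}
```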
KMP in Java
public static int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();

  int fail[] = computeFail(pattern);

  int i = 0;
  int j = 0;

  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];
    else
      i++;
  }
  return -1;   // no match
}  // end of kmpMatch()
public static int[] computeFail(String pattern)
{
  int fail[] = new int[pattern.length()];
  fail[0] = 0;

  int m = pattern.length();
  int j = 0;
  int i = 1;

  // similar code to kmpMatch()
  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {   // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
}  // end of computeFail()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java KmpSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = kmpMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
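Example
T: "abacaabaccabacabaabb"
P: "abacab"

k       0  1  2  3  4
F(k)    0  0  1  0  1

The character comparisons, numbered in the order they are made:
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b                                comparisons 1-6 (mismatch at j = 5)
           a b a c a b                        comparison 7
             a b a c a b                      comparisons 8-12
                     a b a c a b              comparison 13
                       a b a c a b            comparisons 14-19 (match)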
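The example can be replayed in code by counting text/pattern comparisons. The sketch below repeats kmpMatch() and computeFail() from these notes with a counter added; the class KmpCount and the comparisons field are assumptions made for this illustration.

```java
class KmpCount {
    static int comparisons;   // hypothetical counter, not part of the original kmpMatch()

    // fail[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];   // fail[0] is 0 by default
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {    // fall back to a shorter matching prefix
                j = fail[j - 1];
            } else {               // no match
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    static int kmpMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] fail = computeFail(pattern);
        comparisons = 0;
        int i = 0;
        int j = 0;
        while (i < n) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return i - m + 1;   // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];        // shift pattern, keep i where it is
            } else {
                i++;
            }
        }
        return -1;   // no match
    }

    public static void main(String[] args) {
        int posn = kmpMatch("abacaabaccabacabaabb", "abacab");
        System.out.println("posn = " + posn);                // posn = 10
        System.out.println("comparisons = " + comparisons);  // comparisons = 19
    }
}
```

Note that i never decreases: after a mismatch only j is reset, so each text character is revisited at most through the failure function.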
Why is F(4) == 1?
P: "abacab"
F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = find the size of "a"
  = 1
KMP Advantages
KMP runs in O(m+n) time, which is very fast.
The algorithm never needs to move backwards in the text T.
KMP Disadvantages
KMP
KMP Extensions
The basic algorithm doesn't take into account the letter in the text that caused the mismatch.

T: a b a a b x
P: a b a a b a
               a b a a b a     Basic KMP does not do this.
5. More Information
Online Animated Algorithms: