Pattern Matching
References:
Algorithms in C (2nd edition), Chapter 19
http://www.cs.princeton.edu/introalgsds/63long
http://www.cs.princeton.edu/introalgsds/72regular
exact pattern matching
Knuth-Morris-Pratt
RE pattern matching
grep
Exact pattern matching
Problem: Find first match of a pattern of length M in a text stream of length N.
text: i n a h a y s t a c k a n e e d l e i n a    (N = 21)
Applications.
• parsers.
• spam filters.
• digital libraries.
• screen scrapers.
• word processors.
• web search engines.
• natural language processing.
• computational molecular biology.
• feature detection in digitized images.
...
Brute-force exact pattern match
a a a a a a a a a a a a a a a a b    text length N
a a a a a b                          pattern length M
  a a a a a b
    a a a a a b
      a a a a a b
        a a a a a b
          a a a a a b
            a a a a a b
              a a a a a b
                a a a a a b
                  a a a a a b
                    a a a a a b
                      a a a a a b
MN char compares
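The sliding picture corresponds to a doubly nested loop: try every text position, and compare pattern characters until a mismatch. A minimal Java sketch follows; the class name BruteForce and the method name search are illustrative choices, not taken from the slides.

```java
class BruteForce {
    // Returns the index of the first occurrence of p in t, or -1 if there is none.
    // Worst case is about M*N character compares (for example, pattern aaaaab in a text of a's).
    static int search(String p, String t) {
        int M = p.length(), N = t.length();
        for (int i = 0; i <= N - M; i++) {
            int j = 0;
            // Compare the pattern against the text starting at position i.
            while (j < M && t.charAt(i + j) == p.charAt(j)) j++;
            if (j == M) return i;   // all M characters matched
        }
        return -1;                  // no match at any position
    }

    public static void main(String[] args) {
        System.out.println(search("needle", "inahaystackaneedleina"));                    // prints 12
        System.out.println(search("aaaaab", "aaaa" + "aaaa" + "aaaa" + "aaaa" + "b"));    // prints 11
    }
}
```

Note that the inner loop re-reads text characters it has already seen, which is the backup that the rest of the lecture tries to avoid.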
Algorithmic challenges in pattern matching
Practical challenge: avoid backup in the text stream (there is often no room or time to save the text).
Now is the time for all people to come to the aid of their party. Now is the time for all
good people to come to the aid of their party. Now is the time for many good people to
come to the aid of their party. Now is the time for all good people to come to the aid of
their party. Now is the time for a lot of good people to come to the aid of their party.
Now is the time for all of the good people to come to the aid of their party. Now is the
time for all good people to come to the aid of their party. Now is the time for each good
person to come to the aid of their party. Now is the time for all good people to come to
the aid of their party. Now is the time for all good Republicans to come to the aid of
their party. Now is the time for all good people to come to the aid of their party. Now
is the time for many or all good people to come to the aid of their party. Now is the
time for all good people to come to the aid of their party. Now is the time for all good
Democrats to come to the aid of their party. Now is the time for all people to come to
the aid of their party. Now is the time for all good people to come to the aid of their
party. Now is the time for many good people to come to the aid of their party. Now is the
time for all good people to come to the aid of their party. Now is the time for a lot of
good people to come to the aid of their party. Now is the time for all of the good people
to come to the aid of their party. Now is the time for all good people to come to the aid
of their attack at dawn party. Now is the time for each person to come to the aid of
their party. Now is the time for all good people to come to the aid of their party. Now
is the time for all good Republicans to come to the aid of their party. Now is the time
for all good people to come to the aid of their party. Now is the time for many or all
good people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for all good Democrats to come to the aid
of their party.
Knuth-Morris-Pratt (KMP) exact pattern-matching algorithm
text:     a a a b a a b a a a b
pattern:  a a b a a a
text -> DFA for pattern -> accept (pattern in text) or reject (pattern NOT in text)
No backup in a DFA
Linear-time because each step is just a state change
Knuth-Morris-Pratt DFA example
[Figure: DFA for pattern a a b a a a, states 0 to 5, with match transitions drawn left to right and mismatch transitions drawn as back edges]
[Figure: DFA simulation on text a a a b a a b a a a b; states after successive steps: 0, 1, 2, 2, 3]
Knuth-Morris-Pratt DFA simulation
[Figure: DFA simulation on text a a a b a a b a a a b, continued; states after successive steps: 4, 5, 3, 4, 5, then accept!]
Knuth-Morris-Pratt DFA simulation
When in state i:
• have found match in i previous input chars
• that is the longest such match
0 a a a b a a b a a a b
1 a a a b a a b a a a b
2 a a a b a a b a a a b
2 a a a b a a b a a a b
3 a a a b a a b a a a b
4 a a a b a a b a a a b
5 a a a b a a b a a a b
3 a a a b a a b a a a b
4 a a a b a a b a a a b
5 a a a b a a b a a a b
6 a a a b a a b a a a b   accept!
[Figure: DFA for pattern a a b a a a]
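The state trace above can be reproduced by driving a transition table directly. The sketch below hard-codes the table for pattern a a b a a a over the alphabet {a, b}; the class name DfaTrace and the array layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DfaTrace {
    // DFA[c][state] for pattern "aabaaa": row 0 is input 'a', row 1 is input 'b'.
    static final int[][] DFA = {
        {1, 2, 2, 4, 5, 6},   // on 'a'
        {0, 0, 3, 0, 0, 3},   // on 'b'
    };
    static final int ACCEPT = 6;  // reached after matching all 6 pattern chars

    // Returns the sequence of states visited, starting in state 0 and
    // stopping as soon as the accept state is reached.
    static List<Integer> trace(String text) {
        List<Integer> states = new ArrayList<>();
        int state = 0;
        states.add(state);
        for (int i = 0; i < text.length() && state != ACCEPT; i++) {
            state = DFA[text.charAt(i) - 'a'][state];  // one state change per text char, no backup
            states.add(state);
        }
        return states;
    }

    public static void main(String[] args) {
        System.out.println(trace("aaabaabaaab"));  // prints [0, 1, 2, 2, 3, 4, 5, 3, 4, 5, 6]
    }
}
```

Each text character is read exactly once, which is why the simulation is linear in the text length.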
KMP implementation
state     0 1 2 3 4 5
pattern   a a b a a a
DFA  a    1 2 2 4 5 6
     b    0 0 3 0 0 3
next      0 0 2 0 0 3    (only need to store mismatches)
[Figure: DFA for pattern a a b a a a]
KMP implementation
int j = 0;
for (int i = 0; i < N; i++)
{
   if (t.charAt(i) == p.charAt(j)) j++;   // match
   else j = next[j];                      // mismatch
   if (j == M) return i - M + 1;          // found
}
return -1;                                // not found
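To try the loop, it can be wrapped in a method and given the mismatch table from the previous slide. This is a sketch under the slide's assumption of a two-letter alphabet, where a mismatch identifies the text character; the class name KmpBinarySearch and the explicit next parameter are illustrative.

```java
class KmpBinarySearch {
    // Search loop from the slide. next[j] is the DFA state to enter when the
    // text char does not match p.charAt(j). Valid only for a two-letter alphabet.
    static int search(String p, int[] next, String t) {
        int M = p.length(), N = t.length();
        int j = 0;
        for (int i = 0; i < N; i++) {
            if (t.charAt(i) == p.charAt(j)) j++;   // match: advance one state
            else j = next[j];                      // mismatch: follow the stored transition
            if (j == M) return i - M + 1;          // found: report start of match
        }
        return -1;                                 // not found
    }

    public static void main(String[] args) {
        int[] next = {0, 0, 2, 0, 0, 3};           // table for pattern aabaaa
        System.out.println(search("aabaaa", next, "aaabaabaaab"));  // prints 4
    }
}
```

The text index i only moves forward, so the text can be consumed as a stream.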
Knuth-Morris-Pratt: Iterative DFA construction
DFA for first i states contains the information needed to build state i+1
Key idea
• on mismatch at 7th char, need to simulate 6-char backup
• previous 6 chars are known (abaaaa in example)
• 6-state DFA (known) determines next state!
Keep track of DFA state for start at 2nd char of pattern
• compare char at that position with next pattern char
• match/mismatch provides all needed info
[Figure: DFA simulated on the previous 6 chars a b a a a a, ending in state 2]
KMP iterative DFA construction: two cases
Let X be the next state in the simulation and j the next state to build.
[Figure: the two cases, extending the DFA for a a b a a a to the DFA for a a b a a a b and to the DFA for a a b a a a a; X marks the simulation state and j the state being built]
Knuth-Morris-Pratt DFA construction
X: current state in simulation
compare p[j] with p[X]
• match: copy and increment
     next[j] = next[X];
     X = X + 1;
• mismatch: do the opposite
     next[j] = X + 1;
     X = next[X];

j   p[0..j]        next[0..j]     p[j] vs p[X]
0   a              0
1   a a            0 0            match
2   a a b          0 0 2          mismatch
3   a a b a        0 0 2 0        match
4   a a b a a      0 0 2 0 0      match
5   a a b a a a    0 0 2 0 0 3    mismatch
[Figure: the DFA after each step, growing from state 0 to states 0 through 5]
Knuth-Morris-Pratt DFA construction examples
X: current state in simulation
compare p[j] with p[X]

ex: a a b a a a b
j   p[0..j]          next[0..j]       p[j] vs p[X]
0   a                0
1   a a              0 0              match
2   a a b            0 0 2            mismatch
3   a a b a          0 0 2 0          match
4   a a b a a        0 0 2 0 0        match
5   a a b a a a      0 0 2 0 0 3      mismatch
6   a a b a a a b    0 0 2 0 0 3 2    match

ex: a b b a b b b
j   p[0..j]          next[0..j]       p[j] vs p[X]
0   a                0
1   a b              0 1              mismatch
2   a b b            0 1 1            mismatch
3   a b b a          0 1 1 0          match
4   a b b a b        0 1 1 0 1        match
5   a b b a b b      0 1 1 0 1 1      match
6   a b b a b b b    0 1 1 0 1 1 4    mismatch
DFA construction for KMP: Java implementation
int X = 0;
int[] next = new int[M];
for (int j = 1; j < M; j++)
{
   if (p.charAt(X) == p.charAt(j))
   {  // match
      next[j] = next[X];
      X = X + 1;
   }
   else
   {  // mismatch
      next[j] = X + 1;
      X = next[X];
   }
}
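As a check, the construction can be wrapped in a method and run on the two example patterns from the previous slide. The class name KmpNextTable and the method name buildNext are illustrative; the loop body is the one shown above and, like it, assumes a two-letter alphabet.

```java
import java.util.Arrays;

class KmpNextTable {
    // Builds the mismatch table for pattern p over a two-letter alphabet.
    // X is the state the DFA would be in after reading p[1..j-1].
    static int[] buildNext(String p) {
        int M = p.length();
        int X = 0;
        int[] next = new int[M];        // next[0] stays 0
        for (int j = 1; j < M; j++) {
            if (p.charAt(X) == p.charAt(j)) {
                next[j] = next[X];      // match: copy and increment
                X = X + 1;
            } else {
                next[j] = X + 1;        // mismatch: do the opposite
                X = next[X];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("aabaaab")));  // prints [0, 0, 2, 0, 0, 3, 2]
        System.out.println(Arrays.toString(buildNext("abbabbb")));  // prints [0, 1, 1, 0, 1, 1, 4]
    }
}
```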
Optimized KMP implementation
General alphabet
• more difficult
• easy with next[][] indexed by mismatch position, character
• KMP paper has ingenious solution that is not difficult to implement
[ build NFA, then prove that it finishes in 2N steps ]
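The second bullet, a table indexed by mismatch position and character, can be sketched as follows. This is a commonly used table-driven variant, not the solution from the KMP paper; the alphabet size R = 256, the class name KmpGeneral, and the requirement that the pattern be non-empty with all chars below R are assumptions of the sketch.

```java
class KmpGeneral {
    static final int R = 256;  // assumed alphabet size (chars 0..255)

    // next[j][c] is the DFA state to enter from state j on input char c.
    static int[][] build(String p) {
        int M = p.length();
        int[][] next = new int[M][R];
        next[0][p.charAt(0)] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            System.arraycopy(next[X], 0, next[j], 0, R);  // mismatch cases: behave like state X
            next[j][p.charAt(j)] = j + 1;                 // match case: advance
            X = next[X][p.charAt(j)];                     // state after reading p[1..j]
        }
        return next;
    }

    // Returns the index of the first occurrence of p in t, or -1 if there is none.
    static int search(String p, String t) {
        int[][] next = build(p);
        int M = p.length(), j = 0;
        for (int i = 0; i < t.length(); i++) {
            j = next[j][t.charAt(i)];          // one table lookup per text char
            if (j == M) return i - M + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("needle", "inahaystackaneedleina"));  // prints 12
    }
}
```

The table costs M times R entries, which is the price paid for handling every character with a single lookup.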
Short history:
• inspired by esoteric theorem of Cook
[ linear time 2-way pushdown automata simulation is possible ]
• discovered in 1976 independently by two theoreticians and a hacker
Knuth: discovered linear time algorithm
Pratt: made running time independent of alphabet
Morris: trying to build a text editor.
• theory meets practice
Exact pattern matching: other approaches
pattern s y z y g y
text a a a b b a a b a b a a a b b a a a b a a
s y z y g y
s y z y g y
s y z y g y
Exact pattern match cost summary
Regular-expression pattern matching
Ex. (genomics)
• Fragile X syndrome is a common cause of mental retardation.
• human genome contains triplet repeats of cgg or agg
bracketed by gcg at the beginning and ctg at the end
• number of repeats is variable, and correlated with syndrome
• use regular expression to specify pattern: gcg(cgg|agg)*ctg
• do RE pattern match on person’s genome to detect Fragile X
text gcggcgtgtgtgcgagagagtgggtttaaagctggcgcggaggcggctggcgcggaggctg
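For illustration, the pattern can be tried on the sample text with Java's built-in java.util.regex classes. This is only a sketch of the matching step; the class name FragileX and the helper firstMatch are hypothetical, and real genomic screening involves far more than one regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragileX {
    static final Pattern REPEAT = Pattern.compile("gcg(cgg|agg)*ctg");

    // Returns the first substring of genome that matches the pattern, or null if none.
    static String firstMatch(String genome) {
        Matcher m = REPEAT.matcher(genome);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "gcggcgtgtgtgcgagagagtgggtttaaagctggcgcggaggcggctggcgcggaggctg";
        System.out.println(firstMatch(text));  // prints gcgcggaggcggctg
    }
}
```

The match found here has three repeats (cgg, agg, cgg) between gcg and ctg.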
RE pattern matching: applications
Regular expression examples
operation     regular expression    matches               does not match
wildcard      .u.u.u.               cumulus, jugulum      succubus, tumultuous
union         aa | baab             aa, baab              every other string
closure       ab*a                  aa, abbba             ab, ababa
parentheses   a(a|b)aab             aaaab, abaab          every other string
              (ab)*a                a, ababababa          aa, abbba
Regular expression examples (continued)
operation           regular expression      matches                      does not match
                    a* | (a*ba*ba*ba*)*     bbb, aaa, bbbaababbaa        b, bb, baabbbaa
                    (number of b’s is a multiple of 3)
one or more         a(bc)+de                abcde, abcbcde               ade, bcde
character classes   [A-Za-z][a-z]*          word, Capitalized            camelCase, 4illegal
exactly k           [0-9]{5}-[0-9]{4}       08540-1321, 19072-5541       111111111, 166-54-111
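The examples can be checked with String.matches, which tests whether the whole string matches a regular expression. A small sketch; the class name RegexExamples and the helper check are illustrative, and the spaces around | in the slides are dropped because Java treats a space as a literal character.

```java
class RegexExamples {
    // True if the entire string s is described by the regular expression re.
    static boolean check(String re, String s) {
        return s.matches(re);
    }

    public static void main(String[] args) {
        System.out.println(check(".u.u.u.", "cumulus"));                   // true
        System.out.println(check(".u.u.u.", "succubus"));                  // false
        System.out.println(check("ab*a", "abbba"));                        // true
        System.out.println(check("(ab)*a", "abbba"));                      // false
        System.out.println(check("a*|(a*ba*ba*ba*)*", "bbbaababbaa"));     // true
        System.out.println(check("a(bc)+de", "ade"));                      // false
        System.out.println(check("[A-Za-z][a-z]*", "camelCase"));          // false
        System.out.println(check("[0-9]{5}-[0-9]{4}", "08540-1321"));      // true
    }
}
```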
Regular expressions in Java
Regular expression caveat
Can the average web surfer learn to use REs?
Can the average TV viewer learn to use REs?
Can the average programmer learn to use REs?
GREP implementation: basic plan
text:     actgtgcaggaggcggcgcggcggaggaggctggcga
pattern:  gcg(cgg|agg)*ctg
text -> DFA for pattern -> accept (pattern in text) or reject (pattern NOT in text)
No backup in a DFA
Linear-time because each step is just a state change
Deterministic finite-state automata
DFA review.
int pc = 0;
while (!tape.isEmpty())
{
   boolean bit = tape.read();
   if      (pc == 0) { if (!bit) pc = 0; else pc = 1; }
   else if (pc == 1) { if (!bit) pc = 1; else pc = 2; }
   else if (pc == 2) { if (!bit) pc = 2; else pc = 0; }
}
if (pc == 0) System.out.println("accepted");
else         System.out.println("rejected");
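The fragment reads from a tape object that is not defined on the slide. A self-contained variant that reads the bits from a String of 0s and 1s might look like this; the class name MultipleOfThreeOnes and the method accepts are illustrative.

```java
class MultipleOfThreeOnes {
    // Three-state DFA: pc counts the 1 bits seen so far, modulo 3.
    // A 0 bit leaves the state unchanged; a 1 bit moves to the next state.
    static boolean accepts(String bits) {
        int pc = 0;
        for (int i = 0; i < bits.length(); i++) {
            boolean bit = bits.charAt(i) == '1';
            if (bit) pc = (pc + 1) % 3;
        }
        return pc == 0;   // state 0 is the accept state
    }

    public static void main(String[] args) {
        System.out.println(accepts("1011"));  // true: three 1 bits
        System.out.println(accepts("0110"));  // false: two 1 bits
    }
}
```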
Duality
Kleene's theorem.
• for any DFA, there exists a RE that describes the same set of strings
• for any RE, there exists a DFA that recognizes the same set of strings
DFA: [Figure: DFA diagram]
RE: 0* | (0*10*10*10*)*
NFA.
• may have 0, 1, or more transitions for each input symbol
• may have ε-transitions (move to another state without reading input)
• accept if any sequence of transitions leads to accept state
convention: unlabelled arrows are ε-transitions
in set: 111, 00011, 101001011
not in set: 110, 00011011, 00110
How to simulate an NFA? Maintain set of all possible states that NFA
could be in after reading in the first i symbols.
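The set-of-states idea can be sketched as follows. The representation here (one edge map per state plus a separate list of ε-edges) and the class name NfaSketch are assumptions of this sketch and differ from the 0-graph, 1-graph, and ε-graph representation used in the slides; the example machine is a hand-built NFA for (ab)*a.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NfaSketch {
    private final List<Map<Character, Set<Integer>>> edges = new ArrayList<>();
    private final List<Set<Integer>> epsEdges = new ArrayList<>();
    private final int start, accept;

    NfaSketch(int states, int start, int accept) {
        for (int s = 0; s < states; s++) {
            edges.add(new HashMap<>());
            epsEdges.add(new TreeSet<>());
        }
        this.start = start;
        this.accept = accept;
    }

    void addEdge(int from, char c, int to) {
        edges.get(from).computeIfAbsent(c, k -> new TreeSet<>()).add(to);
    }

    void addEpsilon(int from, int to) {
        epsEdges.get(from).add(to);
    }

    // All states reachable from the given set using only epsilon-transitions.
    private Set<Integer> closure(Set<Integer> states) {
        Set<Integer> seen = new TreeSet<>(states);
        Deque<Integer> stack = new ArrayDeque<>(states);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int t : epsEdges.get(s))
                if (seen.add(t)) stack.push(t);
        }
        return seen;
    }

    // pc holds every state the NFA could be in after the input read so far.
    boolean accepts(String input) {
        Set<Integer> pc = closure(Collections.singleton(start));
        for (int i = 0; i < input.length(); i++) {
            Set<Integer> next = new TreeSet<>();
            for (int s : pc)
                next.addAll(edges.get(s).getOrDefault(input.charAt(i), Collections.emptySet()));
            pc = closure(next);
        }
        return pc.contains(accept);
    }

    // Hand-built NFA for (ab)*a: 0 -a-> 1 -b-> 2 -eps-> 0, and 0 -a-> 3 (accept).
    static NfaSketch example() {
        NfaSketch nfa = new NfaSketch(4, 0, 3);
        nfa.addEdge(0, 'a', 1);
        nfa.addEdge(1, 'b', 2);
        nfa.addEpsilon(2, 0);
        nfa.addEdge(0, 'a', 3);
        return nfa;
    }

    public static void main(String[] args) {
        System.out.println(example().accepts("ababababa"));  // true
        System.out.println(example().accepts("abbba"));      // false
    }
}
```

State 0 has two transitions on a, so the machine is nondeterministic; tracking the whole set of possible states avoids having to guess which one to follow.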
NFA Simulation
NFA Representation
0-graph 1-graph ε-graph
NFA: Java Implementation
NFA Simulation
NFA Simulation: Java Implementation
while (!tape.isEmpty())
{  // Simulate NFA taking input from tape.
   char c = tape.read();
   int i = ALPHABET.indexOf(c);
   SET<Integer> next = G[i].neighbors(pc);   // all possible states after reading character c from tape
Converting from an RE to an NFA: basic transformations
[Figure: basic transformations, each drawn between a from state and a to state: single character c, concatenation R S, union R | S, closure c*]
Converting from an RE to an NFA example: ab* | a*b
[Figure: step-by-step construction of the NFA for ab* | a*b, first splitting the union into ab* and a*b, then expanding the closures b* and a*; states are numbered 0 to 5]
NFA Construction: Java Implementation
else if (re.length() == 1)
{  // single char
   char c = re.charAt(0);
   for (int i = 0; i < EPSILON; i++)
      if (c == ALPHABET.charAt(i) || c == '.')
         G[i].addEdge(from, to);
}
else if (or != -1)
{  // union R | S, where the | is at re.charAt(or)
   build(from, to, re.substring(0, or));
   build(from, to, re.substring(or + 1));
}
Regular expressions in Java (revisited)
Not-so-regular expressions
Back-references.
• \1 notation matches sub-expression that was matched earlier.
• Supported by typical RE implementations.
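A small illustration using Java's regex engine, which supports back-references: the pattern (.+)\1 matches strings that consist of some non-empty string written twice, a set that no DFA can recognize. The class name BackReference, the helper isDoubled, and the sample words are choices made for this sketch.

```java
class BackReference {
    // True if s has the form ww for some non-empty string w.
    // \1 refers back to whatever the first parenthesized group matched.
    static boolean isDoubled(String s) {
        return s.matches("(.+)\\1");
    }

    public static void main(String[] args) {
        System.out.println(isDoubled("couscous"));  // true: cous + cous
        System.out.println(isDoubled("aabb"));      // false
    }
}
```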
Context
Summary of pattern-matching algorithms
Programmer:
• Implement exact pattern matching by DFA simulation (KMP).
• REs are a powerful pattern matching tool.
• Implement RE pattern matching by NFA simulation (grep).
Theoretician:
• RE is a compact description of a set of strings.
• NFA is an abstract machine equivalent in power to RE.
• DFAs and REs have limitations.