12 - Strings Matching
algorithms
David Kauchak
cs161
Summer 2009
Administrative
⚫ Check your scores on coursework
⚫ SCPD Final exam: e-mail me with proctor
information
⚫ Office hours next week?
⚫ Reminder: HW6 due Wed. 8/12 before class
and no late homework
Where did “dynamic programming” come from?
⚫ Running time
⚫ O(n) where n is the length of the shorter string
String operations
⚫ Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’
⚫ Running time
⚫ Θ(n+m)
String operations
⚫ Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’
⚫ Running time
⚫ Θ(n)
String operations
⚫ Length: return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 22
⚫ Running time
⚫ O(1) or Θ(n) depending on implementation
String operations
⚫ Prefix: Get the first j characters in the string
⚫ Running time
⚫ Θ(j)
⚫ Suffix: Get the last j characters in the string
⚫ Running time
⚫ Θ(j)
String operations
⚫ Substring – Get the characters between i and
j inclusive
Substring(‘this is a string’, 4, 8) → ‘s is ’
⚫ Running time
⚫ Θ(j - i)
⚫ Prefix?
⚫ Prefix(S, i) = Substring(S, 1, i)
⚫ Suffix?
⚫ Suffix(S, i) = Substring(S, length(S) - i + 1, length(S))
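A minimal Python sketch of these operations, assuming the slides' 1-indexed, inclusive positions and taking Suffix(S, j) to mean the last j characters. The function names are illustrative and not from the lecture.

```python
def substring(s: str, i: int, j: int) -> str:
    """Characters i..j of s, 1-indexed and inclusive."""
    return s[i - 1:j]


def prefix(s: str, j: int) -> str:
    """First j characters of s, expressed through substring."""
    return substring(s, 1, j)


def suffix(s: str, j: int) -> str:
    """Last j characters of s, expressed through substring."""
    return substring(s, len(s) - j + 1, len(s))


print(repr(substring("this is a string", 4, 8)))  # 's is '
print(prefix("banana", 3), suffix("banana", 3))   # ban ana
```

Python slicing copies the selected characters, so each call costs time proportional to the number of characters returned.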
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:
ABACED
Deletion:
ABACED → BACED
Delete ‘A’
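Edit(Kitten, Mitten) = 1
Operations: substitute ‘M’ for ‘K’
Edit(Happy, Hilly) = 3
Operations: substitute ‘i’ for ‘a’, ‘l’ for ‘p’, ‘l’ for ‘p’
Edit(Banana, Car) = 5
Operations: substitute ‘C’ for ‘B’, substitute ‘r’ for ‘n’, delete ‘a’, ‘n’, ‘a’
Edit(Simple, Apple) = 3
Operations: delete ‘S’, substitute ‘A’ for ‘i’, substitute ‘p’ for ‘m’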
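⚫ Why is edit distance symmetric, i.e. Edit(s1, s2) = Edit(s2, s1)? Every operation can be undone by a single operation: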
⚫ sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
⚫ delete ‘i’ → insert ‘i’
⚫ insert ‘i’ → delete ‘i’
Calculating edit distance
X=ABCBDAB
Y=BDCABA
Ideas?
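Calculating edit distance
X=ABCBDA?
Y=BDCAB?
⚫ Look at the last characters of X and Y. Either they are equal, or the last operation is an insert, a delete or a substitution:
⚫ Insert: 1 + Edit(X, Y without its last character)
⚫ Delete: 1 + Edit(X without its last character, Y)
⚫ Substitution: 1 + Edit(X without its last character, Y without its last character)
⚫ Equal: Edit(X without its last character, Y without its last character), with no cost for the last character
⚫ Edit(X, Y) is the minimum over the cases that apply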
Θ(nm)
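The insert, delete, substitute and equal cases can be turned into a table-filling computation. The sketch below is one standard dynamic programming formulation, assuming unit cost for every operation and case-sensitive comparison. The names `edit_distance` and `dist` are chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance between x and y with unit costs."""
    n, m = len(x), len(y)
    # dist[i][j] holds the edit distance between x[:i] and y[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i          # delete all i characters of x[:i]
    for j in range(m + 1):
        dist[0][j] = j          # insert all j characters of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            dist[i][j] = min(
                dist[i][j - 1] + 1,                      # insert
                dist[i - 1][j] + 1,                      # delete
                dist[i - 1][j - 1] + (0 if same else 1)  # equal or substitute
            )
    return dist[n][m]


print(edit_distance("Kitten", "Mitten"))  # 1
print(edit_distance("Banana", "Car"))     # 5
```

The table has (n+1)(m+1) entries and each is filled in constant time, which is where the Θ(nm) bound comes from.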
Variants
⚫ Only include insertions and deletions
⚫ What does this do to substitutions?
⚫ Include swaps, i.e. swapping two adjacent
characters counts as one edit
⚫ Weight insertion, deletion and substitution
differently
⚫ Weight specific character insertion, deletion
and substitutions differently
⚫ Length normalize the edit distance
String matching
⚫ Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA
S = DCABABBABABA
Uses
⚫ grep/egrep
⚫ search
⚫ find
⚫ java.lang.String.contains()
Naive implementation
Is it correct?
Running time?
⚫ Best case
⚫ Θ(n) – when the first character of the pattern does
not occur in the string
⚫ Worst case
⚫ O((n-m+1)m)
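The slide's pseudocode is not reproduced in these notes. As a stand-in, here is a minimal sketch of what a naive matcher typically looks like, assuming 0-indexed match positions. The name `naive_match` is illustrative.

```python
def naive_match(p: str, s: str) -> list[int]:
    """Return every 0-indexed position where p occurs in s."""
    n, m = len(s), len(p)
    matches = []
    for i in range(n - m + 1):          # each candidate alignment
        j = 0
        while j < m and s[i + j] == p[j]:
            j += 1                      # extend the match one symbol
        if j == m:                      # all m symbols agreed
            matches.append(i)
    return matches


print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```

The inner loop stops at the first mismatch, which gives the best case. When every alignment matches almost to the end, it runs m times for each of the n-m+1 alignments.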
Worst case
P = AAAA
S = AAAAAAAAAAAAA
repeated work!
Patterns
⚫ Which of these patterns will have that
problem?
P = ABAB
P = ABDC
P = BAA
P = ABBCDDCAABB
Finite State Automata (FSA)
⚫ An FSA is defined by 5 components
⚫ Q is the set of states: q0, q1, q2, …, qn
⚫ q0 is the start state
⚫ A ⊆ Q is the set of accepting states
⚫ Σ is the alphabet (vocab)
⚫ δ is the transition function, mapping a state and a symbol to the next state, e.g. δ(q0, A) = q1, δ(q0, B) = q2, δ(q1, A) = q1
FSA operation
P = ABA
S = BABABBABABA
[Figure: FSA for P = ABA, stepped through S one symbol at a time]
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
σ(abcdab, ababcd) = 2
σ(daabac, abacac) = 4
σ(dabb, abacd) = 0
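A direct way to compute σ is to try candidate lengths from longest to shortest. This is a naive sketch, not an efficient implementation, and the name `sigma` is chosen here for illustration.

```python
def sigma(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):   # does the length-k prefix of y end x?
            return k
    return 0                    # only the empty string qualifies


print(sigma("abcdab", "ababcd"))  # 2
print(sigma("daabac", "abacac"))  # 4
print(sigma("dabb", "abacd"))     # 0
```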
Building a string matching automaton
⚫ Given a pattern P = p1, p2, …, pm, we’d like to build
an FSA that recognizes P in strings
P = ababaca
Ideas?
Building a string matching automaton
P = ababaca
⚫ Q = q1, q2, …, qm corresponding to each
symbol, plus a q0 starting state
⚫ the set of accepting states, A = {qm}
⚫ vocab Σ all symbols in P, plus one more
representing all symbols not in P
⚫ The transition function for q ∈ Q and a ∈ Σ is
defined as:
⚫ δ(q, a) = σ(p1…q a, P), where p1…q is the first q symbols of P
Transition function
P = ababaca
⚫ δ(q, a) = σ(p1…q a, P)
⚫ From q0: σ(a, ababaca) = 1 and σ(b, ababaca) = 0
⚫ In q3 we’ve seen ‘aba’ so far: on a, σ(abaa, ababaca) = 1
⚫ In q5 we’ve seen ‘ababa’ so far: on b, σ(ababab, ababaca) = 4
state  a  b  c  P
q0     1  0  0  a
q1     1  2  0  b
q2     3  0  0  a
q3     1  4  0  b
q4     5  0  0  a
q5     1  4  6  c
q6     7  0  0  a
q7     1  2  0
Matching runtime
⚫ Once we’ve built the FSA, what is the
runtime?
⚫ Θ(n) - Each symbol causes a state transition and
we only visit each character once
⚫ What is the cost to build the FSA?
⚫ How many entries in the table?
⚫ m|Σ| - Best case: Ω(m|Σ|)
⚫ How long does it take to calculate the suffix
function at each entry?
⚫ Naïve: O(m^2)
⚫ Overall naïve: O(m^3|Σ|)
⚫ Overall fast implementation O(m|Σ|)
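To make the two costs concrete, the sketch below builds the transition table with the naive suffix function (the O(m^3|Σ|) route, not the fast construction) and then scans S in one pass. The names `build_transitions` and `fsa_match` are illustrative. Symbols outside the given alphabet are assumed to send the automaton back to state 0, which stands in for the single extra symbol representing everything not in P.

```python
def sigma(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def build_transitions(p: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = sigma(p[:q] + a, p) for every state q and symbol a."""
    return [{a: sigma(p[:q] + a, p) for a in alphabet}
            for q in range(len(p) + 1)]


def fsa_match(p: str, s: str, alphabet: str) -> list[int]:
    """Return 0-indexed start positions of p in s using the automaton."""
    delta = build_transitions(p, alphabet)
    m, q, matches = len(p), 0, []
    for i, ch in enumerate(s):
        q = delta[q].get(ch, 0)      # unknown symbol: back to q0
        if q == m:                   # accepting state reached
            matches.append(i - m + 1)
    return matches


table = build_transitions("ababaca", "abc")
print([table[5][a] for a in "abc"])          # [1, 4, 6]
print(fsa_match("ABA", "BABABBABABA", "AB"))  # [1, 6, 8]
```

Once the table exists, the scan does one dictionary lookup per symbol of S, matching the Θ(n) bound.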
Rabin-Karp algorithm
- Use a function T that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
S = BABABBABABA
- Hash P: T(P)
- Hash m symbol sequences of S and compare each with T(P)
T(BAB) = T(P)?
T(ABA) = T(P)? match
T(BAB) = T(P)?
…
Rabin-Karp algorithm
⚫ For this to be useful/efficient, what needs to be true about T?
T(‘9847261’) = ?
Horner’s rule
$T(p_{1 \ldots m}) = p_m + 10\,(p_{m-1} + 10\,(p_{m-2} + \cdots + 10\,(p_2 + 10\,p_1)))$
9847261
9 * 10 = 90
(90 + 8) * 10 = 980
… = 9847261
⚫ Running time?
⚫ Θ(m)
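Horner's rule can be written as a single left-to-right loop. This is a minimal sketch assuming the symbols are decimal digits, as in the slide's example. The name `horner_value` is illustrative.

```python
def horner_value(digits: str) -> int:
    """Evaluate a decimal digit string with Horner's rule."""
    value = 0
    for d in digits:
        value = 10 * value + int(d)   # shift previous digits up, add the next
    return value


print(horner_value("9847261"))  # 9847261
```

The loop does one multiplication and one addition per symbol, hence Θ(m).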
Calculating the hash on the string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?
m = 4
963801572348267
T(si…i+m-1) = 3801
⚫ subtract highest order digit: 801
⚫ shift digits up: 8010
⚫ add in the lowest digit: 8015
$T(s_{i+1 \ldots i+m}) = 10\,(T(s_{i \ldots i+m-1}) - 10^{m-1} s_i) + s_{i+m}$
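The update rule translates into a one-line function. The sketch below assumes decimal digit strings, and the names `roll` and `window_values` are chosen here for illustration.

```python
def roll(t: int, outgoing: int, incoming: int, m: int) -> int:
    """Slide an m-digit window one step: drop `outgoing`, append `incoming`."""
    return 10 * (t - 10 ** (m - 1) * outgoing) + incoming


def window_values(s: str, m: int) -> list[int]:
    """Values of every length-m window of the digit string s, left to right."""
    if m <= 0 or m > len(s):
        return []
    t = int(s[:m])                    # first window computed directly
    values = [t]
    for i in range(len(s) - m):
        t = roll(t, int(s[i]), int(s[i + m]), m)
        values.append(t)
    return values


print(roll(3801, 3, 5, 4))            # 8015
print(window_values("963801", 4))     # [9638, 6380, 3801]
```

In practice 10^(m-1) would be computed once and reused, so that each update is a constant number of arithmetic operations.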
Algorithm so far…
⚫ Is it correct?
⚫ Each string has a unique numerical value and we
compare that with each value in the string
⚫ Running time
⚫ Preprocessing:
⚫ Θ(m)
⚫ Matching
⚫ Θ(n-m+1)
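Putting the pieces together, the algorithm so far can be sketched as below, assuming P and S are decimal digit strings and comparing exact integer values (no modulus yet). The name `rabin_karp_digits` is illustrative. The Θ(n-m+1) matching bound assumes each arithmetic operation takes constant time, which does not hold for Python integers once m is large.

```python
def rabin_karp_digits(p: str, s: str) -> list[int]:
    """Return 0-indexed positions where digit string p occurs in s."""
    n, m = len(s), len(p)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)              # weight of the highest-order digit
    target = window = 0
    for k in range(m):                # preprocessing: Horner's rule on both
        target = 10 * target + int(p[k])
        window = 10 * window + int(s[k])
    matches = []
    for i in range(n - m + 1):
        if window == target:          # equal values mean equal m-digit strings
            matches.append(i)
        if i + m < n:                 # roll the window one position right
            window = 10 * (window - high * int(s[i])) + int(s[i + m])
    return matches


print(rabin_karp_digits("8015", "963801572348267"))  # [3]
```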
Σ* 1…q