0% found this document useful (0 votes)
31 views

12 - Strings Matching

Uploaded by

22070073
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views

12 - Strings Matching

Uploaded by

22070073
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 111

String processing

algorithms
David Kauchak
cs161
Summer 2009
Administrative
⚫ Check your scores on coursework
⚫ SCPD Final exam: e-mail me with proctor
information
⚫ Office hours next week?
⚫ Reminder: HW6 due Wed. 8/12 before class
and no late homework
Where did “dynamic programming” come from?

Richard Bellman On the Birth of


Dynamic Programming
Stuart Dreyfus
http://www.eng.tau.ac.il/~ami/cd/o
r50/1526-5463-2002-50-01-
0048.pdf
Strings
⚫ Let Σ be an alphabet, e.g. Σ = ( , a, b, c, …,
z)
⚫ A string is any member of Σ*, i.e. any
sequence of 0 or more members of Σ
⚫ ‘this is a string’  Σ*
⚫ ‘this is also a string’  Σ*
⚫ ‘1234’  Σ*
String operations
⚫ Given strings s1 of length n and s2 of length m
⚫ Equality: is s1 = s2? (case sensitive or
insensitive)
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’

⚫ Running time
⚫ O(n) where n is length of shortest string
String operations
⚫ Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’

⚫ Running time
⚫ Θ(n+m)
String operations
⚫ Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

⚫ Running time
⚫ Θ(n)
String operations
⚫ Length: return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 24

⚫ Running time
⚫ O(1) or Θ(n) depending on implementation
String operations
⚫ Prefix: Get the first j characters in the string

Prefix(‘this is a string’, 4) → ‘this’

⚫ Running time
⚫ Θ(j)
⚫ Suffix: Get the last j characters in the string

Suffix(‘this is a string’, 6) → ‘string’

⚫ Running time
⚫ Θ(j)
String operations
⚫ Substring – Get the characters between i and
j inclusive

Substring(‘this is a string’, 4, 8) → ‘s is ’

⚫ Running time
⚫ Θ(j - i)
⚫ Prefix?
⚫ Prefix(S, i) = Substring(S, 1, i)
⚫ Suffix?
⚫ Suffix(S, i) = Substring(S, i+1, length(n))
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:

ABACED ABACCED DABACCED

Insert ‘C’ Insert ‘D’


Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED BACED

Delete ‘A’
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED BACED BACE

Delete ‘A’ Delete ‘D’


Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Substitution:

ABACED ABADED ABADES

Sub ‘D’ for ‘C’ Sub ‘S’ for ‘D’


Edit distance examples

Edit(Kitten, Mitten) = 1

Operations:

Sub ‘M’ for ‘K’ Mitten


Edit distance examples

Edit(Happy, Hilly) = 3

Operations:

Sub ‘a’ for ‘i’ Hippy


Sub ‘l’ for ‘p’ Hilpy
Sub ‘l’ for ‘p’ Hilly
Edit distance examples

Edit(Banana, Car) = 5

Operations:

Delete ‘B’ anana


Delete ‘a’ nana
Delete ‘n’ naa
Sub ‘C’ for ‘n’ Caa
Sub ‘a’ for ‘r’ Car
Edit distance examples

Edit(Simple, Apple) = 3

Operations:

Delete ‘S’ imple


Sub ‘A’ for ‘i’ Ample
Sub ‘m’ for ‘p’ Apple
Is edit distance symmetric?
⚫ that is, is Edit(s1, s2) = Edit(s2, s1)?

Edit(Simple, Apple) =? Edit(Apple, Simple)

⚫ Why?
⚫ sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
⚫ delete ‘i’ → insert ‘i’
⚫ insert ‘i’ → delete ‘i’
Calculating edit distance

X=ABCBDAB

Y=BDCABA

Ideas?
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

After all of the operations, X needs


to equal Y
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

Operations: Insert
Delete
Substitute
Insert

X=ABCBDA?

Y=BDCAB?
Insert

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) = 1 + Edit ( X 1...n , Y1...m −1 )


Delete

X=ABCBDA?

Y=BDCAB?
Delete

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m )


Substition

X=ABCBDA?

Y=BDCAB?
Substition

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m −1 )


Anything else?

X=ABCBDA?

Y=BDCAB?
Equal

X=ABCBDA?

Y=BDCAB?
Equal

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) = Edit ( X 1...n −1 , Y1...m −1 )


Combining results
Insert: Edit ( X , Y ) = 1 + Edit ( X 1...n , Y1...m −1 )

Delete: Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m )

Substitute: Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m −1 )

Equal: Edit ( X , Y ) = Edit ( X 1...n −1 , Y1...m −1 )


Combining results
 1 + Edit(X1...n,Y1...m −1 ) insertion

Edit( X , Y ) = min  1 + Edit( X 1...n −1 , Y1...m ) deletion
 Diff ( x , y ) + Edit( X
 n m 1...n −1 , Y1...m −1 ) equal/substitution
Running time

Θ(nm)
Variants
⚫ Only include insertions and deletions
⚫ What does this do to substitutions?
⚫ Include swaps, i.e. swapping two adjacent
characters counts as one edit
⚫ Weight insertion, deletion and substitution
differently
⚫ Weight specific character insertion, deletion
and substitutions differently
⚫ Length normalize the edit distance
String matching
⚫ Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
String matching
⚫ Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
Uses
⚫ grep/egrep
⚫ search
⚫ find
⚫ java.lang.String.contains()
Naive implementation
Is it correct?
Running time?

⚫ What is the cost of the equality check?


⚫ Best case: O(1)
⚫ Worst case: O(m)
Running time?

⚫ Best case
⚫ Θ(n) – when the first character of the pattern does
not occur in the string
⚫ Worst case
⚫ O((n-m+1)m)
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA

repeated work!
Worst case
P = AAAA

S = AAAAAAAAAAAAA

Ideally, after the first match, we’d


know to just check the next
character to see if it is an ‘A’
Patterns
⚫ Which of these patterns will have that
problem?

P = ABAB

P = ABDC

P = BAA

P = ABBCDDCAABB
Patterns
⚫ Which of these patterns will have that
problem?

P = ABAB If the pattern has a


suffix that is also a
P = ABDC prefix then we will
have this problem

P = BAA

P = ABBCDDCAABB
Finite State Automata (FSA)
⚫ An FSA is defined by 5 components
⚫ Q is the set of states
q0 q1 q2 … qn
Finite State Automata (FSA)
⚫ An FSA is defined by 5 components
⚫ Q is the set of states
q0 q1 q2 … qn

⚫ q0 is the start state q7

⚫ A  Q, is the set of accepting states where |A| > 0


⚫ Σ is the alphabet (e.g. {A, B}
⚫  is the transition function from Q x Σ to Q
QΣ Q B

q0 A q1
q0 B q2 q0 q1
A
q2 …
A
q1 A q1

FSA operation

B A A

q0 q1 q1 q1
A B A

B
B

An FSA starts at state q0 and reads the characters of the input


string one at a time.
If the automaton is in state q and reads character a, then it
transitions to state (q,a).
If the FSA reaches an accepting state (q  A), then the FSA has
found a match.
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

What pattern does this represent?


FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) = max ( xm−i +1...m = y1...i )


i

σ(abcdab, ababcd) = ?
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) = max ( xm−i +1...m = y1...i )


i

σ(abcdab, ababcd) = 2
Suffix function
⚫ The suffix function σ(x,y) is the index of the
longest suffix of x that is a prefix of y

 ( x, y ) = max ( xm−i +1...m = y1...i )


i

σ(daabac, abacac) = ?
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) = max ( xm−i +1...m = y1...i )


i

σ(daabac, abacac) = 4
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) = max ( xm−i +1...m = y1...i )


i

σ(dabb, abacd) = ?
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) = max ( xm−i +1...m = y1...i )


i

σ(dabb, abacd) = 0
Building a string matching
automata
⚫ Given a pattern P = p1, p2, …, pm, we’d like to build
an FSA that recognizes P in strings

P = ababaca

Ideas?
Building a string matching automata
P = ababaca
⚫ Q = q1, q2, …, qm corresponding to each
symbol, plus a q0 starting state
⚫ the set of accepting states, A = {qm}
⚫ vocab Σ all symbols in P, plus one more
representing all symbols not in P
⚫ The transition function for q  Q and a  Σ is
defined as:
⚫ (q, a) = σ(p1…qa, P)
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 ? a σ(a, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
B,C
state a b c P
q0 1 0 0 a q0 q1
A
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 ? b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 1 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 6 c
q6 7 0 0 a
q7 1 2 0
Matching runtime
⚫ Once we’ve built the FSA, what is the
runtime?
⚫ Θ(n) - Each symbol causes a state transition and
we only visit each character once
⚫ What is the cost to build the FSA?
⚫ How many entries in the table?
⚫ m|Σ| - Best case: Ω(m|Σ|)
⚫ How long does it take to calculate the suffix
function at each entry?
⚫ Naïve: O(m2)
⚫ Overall naïve: O(m3|Σ|)
⚫ Overall fast implementation O(m|Σ|)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA Hash P

T(P)

S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

match
Hash m symbol
S = BABABBABABA sequences and compare

T(ABA)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm

For this to be
useful/efficient, what
P = ABA needs to be true
about T?

S = BABABBABABA

T(BAB)
=
T(P)
Rabin-Karp algorithm

For this to be
useful/efficient, what
P = ABA needs to be true
about T?

⚫ Given T(si…i+m-1) we must be


able to efficiently calculate
S = BABABBABABA T(si+1…i+m)

T(BAB)
=
T(P)
Calculating the hash function
⚫ For simplicity, assume Σ = (0, 1, 2, …, 9). (in
general we can use a base larger than 10).
⚫ A string can then be viewed as a decimal
number
⚫ How do we efficiently calculate the numerical
representation of a string?

T(‘9847261’) = ?
Horner’s rule
T ( p1...m ) = pm + 10 ( pm −1 + 10 ( pm − 2 + ... + 10 ( p2 + 10 p1 )))

9847261
9 * 10 = 90

(90 + 8)*10 = 980

(980 + 4)*10 = 9840

(9840 + 7)*10 = 98470

… = 9847621
Horner’s rule
T ( p1...m ) = pm + 10 ( pm −1 + 10 ( pm − 2 + ... + 10 ( p2 + 10 p1 )))

9847261
Running time?
9 * 10 = 90
Θ(m)
(90 + 8)*10 = 980

(980 + 4)*10 = 9840

(9840 + 7)*10 = 98470

… = 9847621
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4
963801572348267
T(si…i+m-1)

T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 801
963801572348267
T(si…i+m-1) subtract highest order digit

T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 8010
963801572348267
T(si…i+m-1)
shift digits up

T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 8015
963801572348267
T(si…i+m-1)
add in the lowest digit

T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 Running time?


- Θ(m) for s1…m
963801572348267
- O(1) for the rest

T(si…i+m-1)

T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Algorithm so far…
⚫ Is it correct?
⚫ Each string has a unique numerical value and we
compare that with each value in the string
⚫ Running time
⚫ Preprocessing:
⚫ Θ(m)
⚫ Matching
⚫ Θ(n-m+1)

Is there any problem with this analysis?


Algorithm so far…
⚫ Is it correct?
⚫ Each string has a unique numerical value and we
compare that with each value in the string
⚫ Running time
⚫ Preprocessing:
⚫ Θ(m)
⚫ Matching
⚫ Θ(n-m+1)

How long does the check T(P) = T(si…i+m-1) take?


Modular arithmetics
⚫ The run time assumptions we made were
assuming arithmetic operations were
constant time, which is not true for large
numbers
⚫ To keep the numbers small, we’ll use modular
arithmetics, i.e. all operations are performed
mod q
⚫ a+b = (a+b) mod q
⚫ a*b = (a*b) mod q
⚫ …
Modular arithmetics
⚫ If T(A) = T(B), then T(A) mod q = T(B) mod q
⚫ In general, we can apply mods as many times as
we want and we will not effect the result
⚫ What is the downside to this modular
approach?
⚫ Spurious hits: if T(A) mod q = T(B) mod q that
does not necessarily mean that T(A) = T(B)
⚫ If we find a hit, we must check that the actual
string matches the pattern
Runtime
⚫ Preprocessing
⚫ Θ(m)
⚫ Running time
⚫ Best case:
⚫ Θ(n-m+1) – No matches and no spurious hits
⚫ Worst case
⚫ Θ((n-m+1)m)
Average case running time
⚫ Assume v valid matches in the string
⚫ What is the probability of a spurious hit?
⚫ As with hashing, assume a uniform mapping onto
values of q:

Σ* 1…q

⚫ What is the probability under this assumption?


Average case running time
⚫ Assume v valid matches in the string
⚫ What is the probability of a spurious hit?
⚫ As with hashing, assume a uniform mapping onto
values of q:

Σ* 1…q

⚫ What is the probability under this assumption? 1/q


Average case running time
⚫ How many spurious hits?
⚫ n/q
⚫ Average case running time:
O(n-m+1) + O(m(v+n/q)

iterate over the checking matches


positions and spurious hits
Matching running times

Algorithm Preprocessing time Matching time


Naïve 0 O((n-m+1)m)
FSA Θ(m|Σ|) Θ(n)
Rabin-Karp Θ(m) O(n-m+1)m)
Knuth-Morris-Pratt Θ(m) Θ(n)

You might also like