28 - Text Processing

Text processing is becoming a primary function of computers as more web applications are deployed. This involves editing, searching, transporting, and displaying documents which often involves string operations like pattern matching and substring testing. The document then describes several classic algorithms for pattern matching on strings including brute force, Boyer-Moore, and Knuth-Morris-Pratt which aim to improve on brute force by reusing previous comparison information.

Uploaded by

Meena Vinoth

Text Processing

Document Processing is rapidly becoming one of the primary functions of computers. As more
web-enabled applications are being deployed every day, the editing, searching, transporting, and
display of documents is increasing. Many of the computations involve character strings (text),
string pattern matching, and string similarity testing.
Character Strings
Typical string processing operations involve breaking longer strings into shorter strings.
A substring of an m-character string P is a string of the form P[i]P[i+1]P[i+2]…P[j],
for 0 ≤ i ≤ j ≤ m-1 (i.e., the characters in P from index i to index j, written P[i…j]).
A proper substring is a substring with either i > 0 or j < m-1.
A prefix of P is a substring of the form P[0…i], for 0 ≤ i ≤ m-1.
A suffix of P is a substring of the form P[j…m-1], for 0 ≤ j ≤ m-1.
The null string is a string of length zero (and is both a prefix and suffix of any string).
Example
P = “CGTAAACTG”
“CGTAA” is a prefix
“CTG” is a suffix
“CGTAAACTG” is a substring but not a proper substring
“AAA” is a proper substring
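The definitions above can be checked mechanically. The following is a minimal Python sketch; the helper names (`substrings`, `is_prefix`, `is_suffix`, `is_proper_substring`) are chosen here for illustration and do not come from the notes.

```python
def substrings(p):
    """Return every non-empty substring p[i..j] with 0 <= i <= j <= m-1."""
    m = len(p)
    return {p[i:j + 1] for i in range(m) for j in range(i, m)}

def is_prefix(s, p):
    # A prefix has the form p[0..i]; the null string also counts.
    return p.startswith(s)

def is_suffix(s, p):
    # A suffix has the form p[j..m-1]; the null string also counts.
    return p.endswith(s)

def is_proper_substring(s, p):
    # A proper substring is a substring other than p itself (i > 0 or j < m-1).
    return s in p and s != p

P = "CGTAAACTG"
print(is_prefix("CGTAA", P), is_suffix("CTG", P))        # True True
print(is_proper_substring(P, P))                          # False
print(is_proper_substring("AAA", P))                      # True
```

Python's `in`, `startswith`, and `endswith` already treat the empty string as a substring, prefix, and suffix of every string, which agrees with the remark about the null string.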
Pattern Matching Algorithms
The classic pattern matching problem on strings is to determine whether a pattern string P of
length m is a substring of a text string T of length n.
A match is a substring of T, starting at some index i, that matches P character by character (i.e.
T[i]=P[0], T[i+1]=P[1], … T[i+m-1]=P[m-1] or P=T[i…i+m-1]).
Output from a pattern matching algorithm is either some indication that P was not found or an
integer representing the starting index in T of the substring P.
Brute Force Algorithm
Brute-force pattern matching enumerates all possible placements of the substring P in
relation to the text T.
BruteForce(t, p)
    m = Length(p)
    n = Length(t)
    for i = 0 to n - m
        j = 0
        while j < m and t[i + j] == p[j]
            j++
        if j == m
            return i
    return SUBSTRING_NOT_FOUND
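Example
t = "abacaabadcabacabaabb" (n = 20), p = "abacab" (m = 6). The numbers count the character
comparisons in the order they are made.
i = 0: comparisons 1-6 (t[0…4] matches "abaca", then t[5] = a mismatches p[5] = b)
i = 1: comparison 7 (t[1] = b mismatches p[0] = a)
i = 2: comparisons 8-9 (t[2] matches, then t[3] = c mismatches p[1] = b)
i = 3 to 9: 12 comparisons (10-21), no match
i = 10: comparisons 22-27 (all six characters match, so the algorithm returns 10)
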

In the worst case, P is not found or is found in the last m characters of T, so the outer for loop
is executed n-m+1 times, and the inner loop is executed up to m times per iteration.
O((n − m + 1)m) = O(nm − m² + m) ≈ O(nm)
(Because n is typically much greater than m)
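As a rough check of the analysis above, here is a Python sketch of the brute-force pseudocode that also counts character comparisons. Returning -1 for SUBSTRING_NOT_FOUND and returning the count alongside the index are assumptions made for this illustration.

```python
def brute_force(t, p):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(t), len(p)
    comparisons = 0
    for i in range(n - m + 1):          # every possible placement of p in t
        j = 0
        while j < m:
            comparisons += 1
            if t[i + j] != p[j]:
                break                   # mismatch: give up on this placement
            j += 1
        if j == m:
            return i, comparisons       # all m characters matched
    return -1, comparisons

print(brute_force("abacaabadcabacabaabb", "abacab"))   # (10, 27)
print(brute_force("a" * 10, "aab"))                     # (-1, 24)
```

The second call is a worst-case input: each of the n-m+1 = 8 placements needs all m = 3 comparisons before the final character mismatches, giving (n-m+1)m = 24.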
Boyer-Moore Pattern Matching
Boyer-Moore pattern matching reduces the running time of the brute-force algorithm by
utilizing two heuristics:
Looking-Glass Heuristic: When testing a possible placement of P in T, begin the
comparisons from the end of P and move backward to the front of P.
Character-Jump Heuristic: When testing the possible placement of P in T, if a mismatch
of text character T[i] = c occurs with pattern character P[j], determine whether c is an
element of P. If not, shift P completely past T[i]. Otherwise, shift P until the last
occurrence of c in P is aligned with T[i], or by one position if that occurrence lies to the right of P[j].
General Idea
In each case below, the mismatched text character is a.
Case 1 (p = "bbad", indices 0-3): Mismatch occurred on a in t and b in p. Find the last
occurrence of a in p. If it is to the right of b, then shift p to the right one unit.
Case 2 (p = "adbcd", indices 0-4): Mismatch occurred on a in t and c in p. Find the last
occurrence of a in p. If it is to the left of c, then shift p to the right
index(c) - index(last(a)) units.
Case 3 (p = "cbdbcd", indices 0-5): Mismatch occurred on a in t and c in p. If there is no
occurrence of a in p, then shift p completely past the mismatched a in t.

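The three cases can be worked through numerically. The sketch below is an illustration only: the function names `last` and `shift` are introduced here, and the mismatch positions (j = 1, j = 3, j = 4) are assumed to be the first b or c reached when scanning each pattern from its right end. The shift formula j + 1 - min(j, 1 + last) is the pattern movement implied by the index update i = i + m - Min(j, 1 + Last(t[i], p)).

```python
def last(c, p):
    """Index of the last occurrence of c in p, or -1 if c does not occur."""
    return p.rfind(c)

def shift(p, j, c):
    """Number of positions p moves right when text character c mismatches p[j]."""
    last_index = last(c, p)
    return j + 1 - min(j, 1 + last_index)

# Case 1: last 'a' (index 2) is right of the mismatched 'b' (index 1): shift by 1.
print(shift("bbad", 1, "a"))     # 1
# Case 2: last 'a' (index 0) is left of the mismatched 'c' (index 3): shift by 3 - 0.
print(shift("adbcd", 3, "a"))    # 3
# Case 3: 'a' does not occur in p: shift p entirely past the mismatch, j + 1 = 5.
print(shift("cbdbcd", 4, "a"))   # 5
```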

BoyerMoore (t, p)
    m = Length(p)
    n = Length(t)
    i = m - 1
    j = m - 1
    do
        if p[j] == t[i]
            if j == 0
                return i
            else
                i--
                j--
        else
            i = i + m - Min(j, 1 + Last(t[i], p))
            j = m - 1
    while i <= n - 1
    return SUBSTRING_NOT_FOUND

Last (c, p)
    m = Length(p)
    for i = m - 1 downto 0
        if c == p[i]
            return i
    return -1
Example

In the worst case (for example, T = "aaa…a" and P = "baa…a"), every one of the n-m+1
placements is tested, and each test makes m comparisons before the mismatch at P[0] shifts P
by only one position.
O((n − m + 1)m) = O(nm − m² + m) ≈ O(nm)
(Because n is typically much greater than m)

Although this is the same worst-case efficiency as the brute force method, in practice, the
worst case is highly unlikely to occur in English text.
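The BoyerMoore pseudocode translates fairly directly into Python. The sketch below makes three changes that are assumptions of this illustration rather than part of the notes: the last-occurrence lookup is a dictionary built once instead of a scan on every mismatch, the do/while loop becomes a while loop (so a pattern longer than the text is rejected without indexing past the end), and -1 stands for SUBSTRING_NOT_FOUND.

```python
def boyer_moore(t, p):
    """Return the index of the first occurrence of p in t, or -1 if absent."""
    n, m = len(t), len(p)
    if m == 0:
        return 0
    # Last-occurrence table: later indices overwrite earlier ones.
    last = {c: k for k, c in enumerate(p)}
    i = j = m - 1                        # looking-glass: start at the end of p
    while i <= n - 1:
        if p[j] == t[i]:
            if j == 0:
                return i                 # matched all the way to the front of p
            i -= 1
            j -= 1
        else:
            # Character jump: realign using the last occurrence of t[i] in p.
            i += m - min(j, 1 + last.get(t[i], -1))
            j = m - 1
    return -1

print(boyer_moore("abacaabadcabacabaabb", "abacab"))   # 10
print(boyer_moore("aaaaaaa", "baa"))                    # -1
```

The second call is the worst-case shape described above: every placement matches from the right until the final comparison against b fails, and the pattern then moves by a single position.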
Knuth-Morris-Pratt Pattern Matching
Knuth-Morris-Pratt pattern matching improves on the worst-case running time of the brute-force
and Boyer-Moore algorithms.
With the brute-force and Boyer-Moore algorithms, if a pattern character does not match the
text, all the information gained by the sequence of comparisons is discarded and the algorithm
starts over at the next placement of the pattern.
The main idea behind this algorithm is that the pattern string P is preprocessed to compute a
failure function f that indicates the shift of P so that some previous comparisons can be
reused.
The failure function f(j) is defined as the length of the longest prefix of P that is a suffix
of P[1…j] (note that this is P[1…j], not P[0…j]).
The failure function encodes any repeated substrings that occur inside the pattern.
Example (failure function)
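As a small illustration, the sketch below computes the failure function straight from its definition, without the efficient algorithm given later. The function name `failure_by_definition` is hypothetical, and the two patterns are the ones used in the other examples in these notes.

```python
def failure_by_definition(p):
    """f[j] = length of the longest prefix of p that is a suffix of p[1..j]."""
    m = len(p)
    f = [0] * m
    for j in range(m):
        # Try the longest candidate first; k <= j keeps the suffix inside p[1..j].
        for k in range(j, 0, -1):
            if p[:k] == p[j - k + 1:j + 1]:
                f[j] = k
                break
    return f

print(failure_by_definition("abacab"))     # [0, 0, 1, 0, 1, 2]
print(failure_by_definition("aabacaab"))   # [0, 1, 0, 1, 0, 1, 2, 3]
```

For "abacab", f[5] = 2 because the prefix "ab" is also the suffix of "bacab". This direct approach takes cubic time in the worst case and is meant only to make the definition concrete.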

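General Idea
index:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
t:       a  a  b  c  a  a  b  a  c  a  a  a  b  a  c  a  a  b
p:       a  a  b  a  c  a  a  b
p:                a  a  b  a  c  a  a  b
p:                   a  a  b  a  c  a  a  b
p:                                  a  a  b  a  c  a  a  b
p:                                     a  a  b  a  c  a  a  b
(Successive placements of p start at indices 0, 3, 4, 9, and 10; the match is found at index 10.)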
KnuthMorrisPratt (t, p)
    m = Length(p)
    n = Length(t)
    Failure(f, p)
    i = 0
    j = 0
    while i < n
        if p[j] == t[i]
            if j == m - 1
                return i - m + 1
            i++
            j++
        else if j > 0
            j = f[j - 1]
        else
            i++
    return SUBSTRING_NOT_FOUND

Failure (f, p)
    m = Length(p)
    i = 1
    j = 0
    f[0] = 0
    while i < m
        if p[j] == p[i]
            f[i] = j + 1
            i++
            j++
        else if j > 0
            j = f[j - 1]
        else
            f[i] = 0
            i++
Example

Efficiency
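Characters that match are looked at only once.
Characters that fail to match the first character of the pattern are looked at only once.
When a match fails inside the pattern, the text character that caused the failure is compared
again after the pattern is shifted.
Every comparison either advances i or shifts the pattern to the right, and each of these can
happen at most n times, so the search makes at most 2n comparisons and is O(n). Computing the
failure function is O(m) by the same argument, so the whole algorithm is O(n + m).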
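The KnuthMorrisPratt and Failure pseudocode can be sketched in Python as follows. Returning -1 for SUBSTRING_NOT_FOUND, returning the failure table instead of filling a parameter, and counting comparisons are choices made for this illustration.

```python
def failure(p):
    """Failure table for p, following the Failure pseudocode."""
    m = len(p)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if p[j] == p[i]:
            f[i] = j + 1        # j + 1 characters of the prefix have matched
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # fall back to a shorter matching prefix
        else:
            f[i] = 0            # no prefix matches here
            i += 1
    return f

def kmp(t, p):
    """Return (index of the first match or -1, number of text comparisons)."""
    n, m = len(t), len(p)
    if m == 0:
        return 0, 0
    f = failure(p)
    i = j = comparisons = 0
    while i < n:
        comparisons += 1
        if p[j] == t[i]:
            if j == m - 1:
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # shift p, reusing the characters already matched
        else:
            i += 1
    return -1, comparisons

print(kmp("aabcaabacaaabacaab", "aabacaab"))   # (10, 21)
```

On this input (n = 18), the search uses 21 comparisons, within the 2n = 36 bound. The text character at index 11 is compared three times, which shows that the bound applies to the total number of comparisons rather than to each character individually.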
