28 - Text Processing

Text processing is becoming a primary function of computers as more web applications are deployed. It covers editing, searching, transporting, and displaying documents, which often requires string operations such as pattern matching and substring testing. The document describes several classic algorithms for pattern matching on strings: brute force, Boyer-Moore, and Knuth-Morris-Pratt. The latter two aim to improve on brute force, Knuth-Morris-Pratt by reusing previous comparison information.

Uploaded by

Meena Vinoth

Text Processing

Document Processing is rapidly becoming one of the primary functions of computers. As more
web-enabled applications are being deployed every day, the editing, searching, transporting, and
display of documents is increasing. Many of the computations involve character strings (text),
string pattern matching, and string similarity testing.
Character Strings
Typical string processing operations involve breaking longer strings into shorter strings.
A substring of an m-character string P is a string of the form P[i]P[i+1]P[i+2]…P[j],
for 0 ≤ i ≤ j ≤ m-1 (i.e. the characters in P from index i to j ‒ P[i…j]).
A proper substring is a substring with either i > 0 or j < m-1.
A prefix of P is a substring of the form P[0…i], for 0 ≤ i ≤ m-1.
A suffix of P is a substring of the form P[j…m-1], for 0 ≤ j ≤ m-1.
The null string is a string of length zero (and is both a prefix and suffix of any string).
Example
P = “CGTAAACTG”
“CGTAA” is a prefix
“CTG” is a suffix
“CGTAAACTG” is a substring but not a proper substring
“AAA” is a proper substring
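The definitions above map closely onto Python string operations. The following is a minimal sketch; the helper names (is_prefix, is_suffix, is_proper_substring) are hypothetical and not part of the original notes.

```python
# Illustrative helpers only; the function names are assumptions.
P = "CGTAAACTG"

def is_prefix(s, p):
    """True if s is P[0...i] for some i (the null string also counts)."""
    return p.startswith(s)

def is_suffix(s, p):
    """True if s is P[j...m-1] for some j (the null string also counts)."""
    return p.endswith(s)

def is_substring(s, p):
    """True if s occurs somewhere inside p."""
    return s in p

def is_proper_substring(s, p):
    """A substring that is not the whole string p."""
    return s in p and s != p

print(is_prefix("CGTAA", P))            # True
print(is_suffix("CTG", P))              # True
print(is_proper_substring(P, P))        # False
print(is_proper_substring("AAA", P))    # True
```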
Pattern Matching Algorithms
The classic pattern matching problem on strings is to determine whether a pattern string P of
length m is a substring of a text string T.
A match is a substring of T, starting at some index i, that matches P character by character (i.e.
T[i]=P[0], T[i+1]=P[1], … T[i+m-1]=P[m-1] or P=T[i…i+m-1]).
Output from a pattern matching algorithm is either some indication that P was not found or an
integer representing the starting index in T of the substring P.
Brute Force Algorithm
Brute-force pattern matching enumerates all possible placements of the substring P in
relation to the text T.
BruteForce(t, p)
    m = Length(p)
    n = Length(t)
    for i = 0 to n - m
        j = 0
        while j < m and t[i + j] == p[j]
            j++
        if j == m
            return i
    return SUBSTRING_NOT_FOUND
Example
index  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
t      a  b  a  c  a  a  b  a  d  c  a  b  a  c  a  b  a  a  b  b

p = a b a c a b; the numbers count the character comparisons in the order they are made.
i = 0: comparisons 1 to 6
i = 1: comparison 7
i = 2: comparisons 8 and 9
i = 3 to 9: 12 comparisons (10 to 21)
i = 10: comparisons 22 to 27 (match found)
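The trace above can be reproduced with a short Python sketch of the brute-force pseudocode. The comparison counter and the use of -1 for SUBSTRING_NOT_FOUND are additions made here for illustration.

```python
def brute_force(t, p):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(t), len(p)
    comparisons = 0
    for i in range(n - m + 1):      # every possible placement of p in t
        j = 0
        while j < m:
            comparisons += 1        # one comparison of t[i + j] with p[j]
            if t[i + j] != p[j]:
                break
            j += 1
        if j == m:                  # all m characters matched
            return i, comparisons
    return -1, comparisons

print(brute_force("abacaabadcabacabaabb", "abacab"))  # (10, 27)
```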

In the worst case, P is not found or is found in the last m characters of T, so the outer for loop
is executed n-m+1 times, and the inner loop is executed m times.
O((n − m + 1)m) = O(nm − m² + m) ≈ O(nm)
(Because n is typically much greater than m)
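As a quick check of the bound, the sketch below counts comparisons on an input chosen here to force the worst case: a text of all a's and a pattern that fails only on its last character. The input and the helper name count_comparisons are assumptions made for illustration.

```python
def count_comparisons(t, p):
    """Count character comparisons made by brute-force matching."""
    n, m = len(t), len(p)
    count = 0
    for i in range(n - m + 1):
        for j in range(m):
            count += 1
            if t[i + j] != p[j]:
                break
        else:
            return count            # inner loop finished: match found
    return count

n, m = 1000, 10
worst = count_comparisons("a" * n, "a" * (m - 1) + "b")
print(worst, (n - m + 1) * m)       # 9910 9910
```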
Boyer-Moore Pattern Matching
Boyer-Moore pattern matching reduces the running time of the brute-force algorithm by
utilizing two heuristics:
Looking-Glass Heuristic: When testing a possible placement of P in T, begin the
comparisons from the end of P and move backward to the front of P.
Character-Jump Heuristic: When testing the possible placement of P in T, if the text
character T[i] = c does not match the pattern character P[j], determine whether c is an
element of P. If not, shift P completely past T[i]. Otherwise, shift P until an occurrence
of c in P is aligned with T[i].
General Idea
Case 1 (p = b b a d): Mismatch occurred on a in t and b in p. Find the last occurrence
of a in p. If it is to the right of b, then shift p to the right one unit.

Case 2 (p = a d b c d): Mismatch occurred on a in t and c in p. Find the last occurrence
of a in p. If it is to the left of c, then shift p to the right index(c) - index(last(a)) units.

Case 3 (p = c b d b c d): Mismatch occurred on a in t and c in p. If there is no
occurrence of a in p, then shift p completely past the mismatched a in t.
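The three cases can be computed with one formula. In the BoyerMoore pseudocode, i advances by m - Min(j, 1 + Last(t[i], p)) and j is reset to m - 1, so the start of the pattern (i - j) moves right by j + 1 - min(j, 1 + last). The sketch below applies that formula; the function name pattern_shift and the mismatch indices j chosen for each case are assumptions, since the original figure does not state them.

```python
def last(c, p):
    """Index of the last occurrence of c in p, or -1 if c is not in p."""
    return p.rfind(c)

def pattern_shift(p, j, c):
    """Distance p moves right when p[j] mismatches the text character c."""
    return j + 1 - min(j, 1 + last(c, p))

# Case 1: last 'a' is to the right of the mismatched 'b' (assumed j = 1).
print(pattern_shift("bbad", 1, "a"))    # 1
# Case 2: last 'a' is to the left of the mismatched 'c' (j = 3), shift 3 - 0.
print(pattern_shift("adbcd", 3, "a"))   # 3
# Case 3: no 'a' in p (assumed mismatch at j = 4), shift past the text character.
print(pattern_shift("cbdbcd", 4, "a"))  # 5
```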

BoyerMoore (t, p)
    m = Length(p)
    n = Length(t)
    i = m - 1
    j = m - 1
    do
        if p[j] == t[i]
            if j == 0
                return i
            else
                i--
                j--
        else
            i = i + m - Min(j, 1 + Last(t[i], p))
            j = m - 1
    while i <= n - 1
    return SUBSTRING_NOT_FOUND

Last (c, p)
    m = Length(p)
    for i = m - 1 down to 0
        if c == p[i]
            return i
    return -1
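A minimal Python sketch of the pseudocode above follows. It deviates in two small ways that are choices made here, not part of the notes: the do/while loop is written as a while loop so that a pattern longer than the text is rejected safely, and -1 stands in for SUBSTRING_NOT_FOUND.

```python
def last(c, p):
    """Index of the last occurrence of c in p, or -1 if c is not in p."""
    return p.rfind(c)

def boyer_moore(t, p):
    """Return the index of the first occurrence of p in t, or -1."""
    n, m = len(t), len(p)
    if m == 0:
        return 0                    # the null string matches at index 0
    i = j = m - 1                   # start comparing at the end of p
    while i <= n - 1:
        if p[j] == t[i]:
            if j == 0:
                return i            # matched all the way to the front of p
            i -= 1                  # looking-glass: move backward
            j -= 1
        else:
            # character-jump: realign using the last occurrence of t[i] in p
            i += m - min(j, 1 + last(t[i], p))
            j = m - 1
    return -1

print(boyer_moore("abacaabadcabacabaabb", "abacab"))  # 10
```

Calling last on every mismatch rescans the pattern; an implementation concerned with speed would typically precompute a table of last occurrences once.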
Example

In the worst case, P is not found or is found in the last m characters of T, so up to n-m+1
placements of P are tested and up to m comparisons are made for each placement. For example,
searching a text of all a's for a pattern such as baa…a shifts P by only one position per mismatch.
O((n − m + 1)m) = O(nm − m² + m) ≈ O(nm)
(Because n is typically much greater than m)

Although this is the same efficiency as the brute force method, in practice, the worst case is
highly unlikely to occur in English text.
Knuth-Morris-Pratt Pattern Matching
Knuth-Morris-Pratt pattern matching improves on the worst-case running time of the brute-force
and Boyer-Moore algorithms.
With brute-force and Boyer-Moore matching, if a pattern character does not match the text, all
the information gained by the sequence of comparisons is discarded and the algorithm starts over
at the next placement of the pattern.
The main idea behind this algorithm is that the pattern string P is preprocessed to compute a
failure function f that indicates the shift of P so that some previous comparisons can be
reused.
The failure function f(j) is defined as the length of the longest prefix of P that is a suffix of
P[1…j] (note that this is P[1…j], not P[0…j]).
The failure function encodes any repeated substrings that occur inside the pattern.
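To make the definition concrete, the sketch below computes the failure function directly from it, trying each candidate prefix length from longest to shortest. This is deliberately naive and is not the Failure procedure from the notes; the function name failure_by_definition is an assumption.

```python
def failure_by_definition(p):
    """f[j] = length of the longest prefix of p that is a suffix of p[1...j]."""
    f = []
    for j in range(len(p)):
        tail = p[1:j + 1]           # p[1...j]; empty when j == 0
        k = len(tail)
        while k > 0 and not tail.endswith(p[:k]):
            k -= 1                  # try the next shorter prefix
        f.append(k)
    return f

print(failure_by_definition("aabacaab"))  # [0, 1, 0, 1, 0, 1, 2, 3]
```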
Example (failure function)

General Idea
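index  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
t      a  a  b  c  a  a  b  a  c  a  a  a  b  a  c  a  a  b
p      a  a  b  a  c  a  a  b
                   a  a  b  a  c  a  a  b
                                     a  a  b  a  c  a  a  b

The pattern is placed at index 0, then at index 4, and finally at index 10, where it matches.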
KnuthMorrisPratt (t, p)
    m = Length(p)
    n = Length(t)
    Failure(f, p)
    i = 0
    j = 0
    while i < n
        if p[j] == t[i]
            if j == m - 1
                return i - m + 1
            i++
            j++
        else if j > 0
            j = f[j - 1]
        else
            i++
    return SUBSTRING_NOT_FOUND

Failure (f, p)
    m = Length(p)
    i = 1
    j = 0
    f[0] = 0
    while i < m
        if p[j] == p[i]
            f[i] = j + 1
            i++
            j++
        else if j > 0
            j = f[j - 1]
        else
            f[i] = 0
            i++
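The two procedures above translate almost line for line into Python. In this sketch the failure function returns its table instead of filling a parameter, and -1 stands in for SUBSTRING_NOT_FOUND; both are choices made here for illustration.

```python
def failure(p):
    """Compute the KMP failure function of p as a list."""
    m = len(p)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if p[j] == p[i]:
            f[i] = j + 1            # j + 1 characters of the prefix match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]            # fall back to a shorter prefix
        else:
            f[i] = 0
            i += 1
    return f

def knuth_morris_pratt(t, p):
    """Return the index of the first occurrence of p in t, or -1."""
    n, m = len(t), len(p)
    if m == 0:
        return 0
    f = failure(p)
    i = j = 0
    while i < n:
        if p[j] == t[i]:
            if j == m - 1:
                return i - m + 1    # whole pattern matched
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]            # reuse earlier comparisons
        else:
            i += 1
    return -1

print(failure("aabacaab"))                                   # [0, 1, 0, 1, 0, 1, 2, 3]
print(knuth_morris_pratt("aabcaabacaaabacaab", "aabacaab"))  # 10
```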
Example

Efficiency
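Characters that match are looked at only once.
Characters that fail to match the first character of the pattern are looked at only once.
When a match fails inside the pattern, the text character that caused the failure is checked
again against an earlier pattern character.
Since every comparison either advances in the text or shifts the pattern to the right, the
algorithm makes at most 2n comparisons, so the search is O(n) (plus O(m) to compute the failure function).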
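The linear bound can be checked empirically by counting comparisons during a search. The sketch below is an instrumented variant of the KMP procedure; the counter and the function name kmp_count are additions made here, and the input is the text and pattern from the General Idea figure.

```python
def failure(p):
    """KMP failure function of p, returned as a list."""
    f = [0] * len(p)
    i, j = 1, 0
    while i < len(p):
        if p[j] == p[i]:
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            i += 1
    return f

def kmp_count(t, p):
    """Return (match index or -1, number of text/pattern comparisons).

    Assumes p is non-empty.
    """
    n, m = len(t), len(p)
    f = failure(p)
    i = j = comparisons = 0
    while i < n:
        comparisons += 1            # one comparison of p[j] with t[i]
        if p[j] == t[i]:
            if j == m - 1:
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            i += 1
    return -1, comparisons

t, p = "aabcaabacaaabacaab", "aabacaab"
print(kmp_count(t, p))              # (10, 21), and 21 <= 2 * len(t)
```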
