Information Retrieval Systems U6
UNIT-6
TEXT SEARCH ALGORITHM
Searching in Compressed Text (Overview)
• What is Text Compression
• Definition, The Shannon Bound, Huffman Codes, The Kolmogorov Measure
• Searching in Non-adaptive Codes: KMP in Huffman Codes
• Searching in Adaptive Codes
• The Lempel-Ziv Codes, Pattern Matching in Z-Compressed Files, Adapting Compression for Searching
Smartzworld.com jntuworldupdates.org
Smartworld.asia Specworld.in
• Boyer-Moore
• Shift-OR algorithm
• Rabin-Karp
Brute Force:
This approach is the simplest string matching algorithm. The idea is to try to match the
search string against the input text, character by character. As soon as a mismatch is
detected in the comparison process, shift the search string one position along the input
text and start the comparison process over.
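The brute-force procedure described above can be sketched in C as follows. This is a minimal illustration, not code from the notes: the function name brute_force_search and the convention of returning -1 when nothing is found are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the zero-based index of the first
   occurrence of pat in txt, or -1 if pat does not occur. */
long brute_force_search(const char *txt, const char *pat)
{
    assert(txt != NULL && pat != NULL);
    size_t n = strlen(txt);
    size_t m = strlen(pat);

    /* Try every alignment s of the pattern against the text. */
    for (size_t s = 0; s + m <= n; s++) {
        size_t j = 0;
        /* Compare until a mismatch or until the whole pattern matched. */
        while (j < m && txt[s + j] == pat[j])
            j++;
        if (j == m)
            return (long)s;   /* full match at alignment s */
        /* Mismatch: the loop shifts the pattern by one position. */
    }
    return -1;
}
```

In the worst case every alignment re-examines up to m characters, so the cost grows with the product of the text and pattern lengths, which is what the algorithms below try to avoid.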
KNUTH-MORRIS-PRATT
Even in the worst case, the search time is linear in the length of the input and does not
depend upon the length of the pattern being searched for. The basic concept behind the
algorithm is that whenever a mismatch is detected, the previously matched characters define
the number of characters that can be skipped in the input stream before starting the
comparison process again.
Position:        1 2 3 4 5 6 7 8
Input stream:    a b d a d e f g
Search pattern:  a b d f
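To make the skip rule concrete, the sketch below computes how far the pattern can move after a mismatch, given how many pattern characters had already matched. The names longest_border and kmp_skip are hypothetical, and the brute-force border computation is chosen for readability rather than speed; this is one way to express the idea, not necessarily how an implementation would do it.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest proper prefix of pat[0..len-1] that is also a
   suffix of it (checked by brute force for clarity). */
static int longest_border(const char *pat, int len)
{
    for (int k = len - 1; k > 0; k--) {
        if (memcmp(pat, pat + (len - k), (size_t)k) == 0)
            return k;
    }
    return 0;
}

/* Number of positions the pattern may be shifted after a mismatch that
   follows `matched` successfully compared characters. */
int kmp_skip(const char *pat, int matched)
{
    assert(pat != NULL);
    assert(matched >= 0 && matched <= (int)strlen(pat));
    if (matched == 0)
        return 1;             /* nothing matched: move one position */
    return matched - longest_border(pat, matched);
}
```

For the table above, three characters ("abd") matched before the mismatch at position 4 and no prefix of "abd" reappears as its suffix, so kmp_skip("abdf", 3) gives 3 and the pattern is realigned at position 4.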
To search for a word W in a text S, the algorithm keeps track of two integers:
m, denoting the position within S where the prospective match for W begins,
i, denoting the index of the currently considered character in W.
In each step the algorithm compares S[m+i] with W[i] and advances i if they are equal.
At the start of the run, this is depicted as follows:
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
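The algorithm compares successive characters of W with the corresponding characters of S,
incrementing i while they match. At the fourth step, S[3] = ' ' does not match W[3] = 'D'.
Rather than restarting the search at S[1], note that no 'A' occurs at positions 1 and 2 of S;
since those characters have already been checked, there is no chance of finding the beginning
of a match there. Therefore, the algorithm sets m = 3 and i = 0.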
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:    ABCDABD
i:    0123456
This match fails at the initial character, so the algorithm sets m = 4 and i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456
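Here i advances through the nearly complete match "ABCDAB" until i = 6, where W[6] = 'D'
does not match S[10] = ' '. The last two matched characters, "AB", are also the first two
characters of W, so they need not be checked again: the algorithm sets m = 8 and i = 2 and
continues matching from the current position.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456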
This search fails immediately, however, because W[2] = 'C' does not match S[10] = ' '. As in
the first trial, the algorithm returns to the beginning of W and resumes searching at the
mismatched character position of S: m = 10, i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:           ABCDABD
i:           0123456
The match at m=10 fails immediately, so the algorithm next tries m = 11 and i = 0 .
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456
Once again, the algorithm matches "ABCDAB" , but the next character, 'C' , does not match
the final character 'D' of the word W . Reasoning as before, the algorithm sets m = 15 , to
start at the two-character string "AB" leading up to the current position, set i = 2 , and
continue matching from the current position.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456
This time the match is complete, and the first character of the match is S[15].
Description of pseudocode for the search algorithm
The above example contains all the elements of the algorithm. For the moment, we assume
the existence of a "partial match" table T , described below, which indicates where we need
to look for the start of a new match in the event that the current one ends in a mismatch. The
entries of T are constructed so that if we have a match starting at S[m] that fails when
comparing S[m + i] to W[i] , then the next possible match will start at index m + i -
T[i] in S (that is, T[i] is the amount of "backtracking" we need to do after a mismatch).
This has two implications: first, T[0] = -1 , which indicates that if W[0] is a mismatch, we
cannot backtrack and must simply check the next character; and second, although the next
possible match will begin at index m + i - T[i] , as in the example above, we need not
actually check any of the T[i] characters after that, so that we continue searching
from W[T[i]] . The following is a sample pseudocode implementation of the KMP search
algorithm.
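algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)
    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then return m
            let i ← i + 1
        else if T[i] > -1 then
            let m ← m + i - T[i], i ← T[i]
        else
            let m ← m + 1, i ← 0
    return the length of S (all of S was searched unsuccessfully)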
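For readers who prefer compilable code, the search can be sketched in C as below. Several things here are assumptions rather than part of the notes: the function names kmp_table and kmp_search, the table construction (a standard one that follows the T[0] = -1 convention described above), and the choice to return -1 when W does not occur.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build the partial match table for W (wlen >= 1): T[0] = -1, and for
   i >= 1, T[i] is the length of the longest proper prefix of W that is
   also a suffix of W[0..i-1]. */
static void kmp_table(const char *W, int wlen, int *T)
{
    int pos = 2, cnd = 0;
    T[0] = -1;
    if (wlen > 1)
        T[1] = 0;
    while (pos < wlen) {
        if (W[pos - 1] == W[cnd]) {
            T[pos++] = ++cnd;      /* the current prefix extends */
        } else if (cnd > 0) {
            cnd = T[cnd];          /* fall back to a shorter prefix */
        } else {
            T[pos++] = 0;          /* no prefix ends here */
        }
    }
}

/* Zero-based position of the first occurrence of W in S, or -1. */
int kmp_search(const char *S, const char *W)
{
    assert(S != NULL && W != NULL);
    int slen = (int)strlen(S), wlen = (int)strlen(W);
    int m = 0, i = 0, result = -1;
    if (wlen == 0)
        return 0;
    int *T = malloc((size_t)wlen * sizeof *T);
    if (T == NULL)
        return -1;
    kmp_table(W, wlen, T);
    while (m + i < slen) {
        if (W[i] == S[m + i]) {
            if (i == wlen - 1) { result = m; break; }
            i++;
        } else if (T[i] > -1) {
            m += i - T[i];         /* realign on the matched prefix */
            i = T[i];
        } else {
            m++;                   /* W[0] mismatched: next character */
            i = 0;
        }
    }
    free(T);
    return result;
}
```

For the worked example, kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD") returns 15, and the table built for "ABCDABD" is -1, 0, 0, 0, 0, 1, 2.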
By construction of T, the next possible match after a mismatch always begins at a higher
index than m, so T[i] < i. This fact implies that the loop can execute at most 2n times,
where n is the length of S, since each iteration executes one of the two branches of the
loop. The first branch always increases i and does not change m, so the index m + i of the
currently scrutinized character of S is increased. The second branch adds i - T[i] to m,
and as we have seen, this is always a positive number. Thus the location m of the beginning
of the current potential match is increased. At the same time, the second branch leaves
m + i unchanged, because m gets i - T[i] added to it and immediately afterwards T[i] is
assigned as the new value of i; hence
new_m + new_i = old_m + old_i - T[old_i] + T[old_i] = old_m + old_i.
The loop ends when m + i = n. Therefore, each branch of the loop can be reached at most n
times, since the branches respectively increase either m + i or m, and m ≤ m + i: if m = n,
then certainly m + i ≥ n, and because m + i increases by at most one per iteration, we must
have had m + i = n at some point in the past, so either way we would be done.
Thus the loop executes at most 2n times, showing that the time complexity of the search
algorithm is O(n).
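The bound can be checked empirically by counting loop iterations. The sketch below is only an illustration of the argument: kmp_count_iterations is a hypothetical name, and the partial match table T is passed in by the caller (for "ABCDABD" under the T[0] = -1 convention it is -1, 0, 0, 0, 0, 1, 2).

```c
#include <assert.h>
#include <string.h>

/* Runs the KMP search loop and returns how many times the loop body
   executed. The match position (or -1) is stored in *pos. T must be
   the partial match table for W, with T[0] = -1. */
int kmp_count_iterations(const char *S, const char *W, const int *T, int *pos)
{
    assert(S != NULL && W != NULL && T != NULL && pos != NULL);
    int n = (int)strlen(S), wlen = (int)strlen(W);
    int m = 0, i = 0, iterations = 0;
    assert(wlen > 0);
    *pos = -1;
    while (m + i < n) {
        iterations++;
        if (W[i] == S[m + i]) {
            /* First branch: m + i grows by one. */
            if (i == wlen - 1) { *pos = m; break; }
            i++;
        } else if (T[i] > -1) {
            /* Second branch: m grows, m + i stays the same. */
            m += i - T[i];
            i = T[i];
        } else {
            /* W[0] mismatched: both m and m + i grow by one. */
            m++;
            i = 0;
        }
    }
    return iterations;
}
```

On the worked example (n = 23) the loop body runs 26 times before the match at position 15 is reported, comfortably below the bound 2n = 46.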
Here is another way to think about the runtime. Let us say we begin to match W and S at
positions i and p, respectively. If W exists as a substring of S at p, then
W[0..m] = S[p..p+m], where m here denotes the last index of W. Upon success, that is, when
the word and the text match at the current positions (W[i] = S[p+i]), we increase i by 1.
Upon failure, that is, when the word and the text do not match at the current positions
(W[i] ≠ S[p+i]), the text pointer is kept still, while the word pointer is rolled back by a
certain amount (i = T[i], where T is the jump table), and we attempt to match W[T[i]] with
S[p+i]. The maximum number of roll-backs of i is bounded by i; that is to say, for any
failure, we can only roll back as much as we have progressed up to the failure. It follows
that the runtime is bounded by 2n steps.
Examples:
1) Input:
Output:
2) Input:
txt[] = "AABAACAADAABAAABAA"
pat[] = "AABA"
Output:
The pattern occurs at indices 0, 9, and 13.
#include <limits.h>
#include <string.h>
#include <stdio.h>
...
int badchar[NO_OF_CHARS];
...
else
    /* Shift the pattern so that the bad character in text
       aligns with the last occurrence of it in pattern. ... */
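The fragments above appear to belong to a Boyer-Moore search that uses the bad-character rule. Since most of that listing is missing, the following is an independent sketch of the technique, not a reconstruction of the original code. The names build_badchar and bm_search, the output-array interface, and the definition of NO_OF_CHARS as UCHAR_MAX + 1 are all assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NO_OF_CHARS (UCHAR_MAX + 1)   /* assumed alphabet size */

static int max_int(int a, int b) { return a > b ? a : b; }

/* badchar[c] = index of the last occurrence of c in pat, or -1. */
static void build_badchar(const char *pat, int m, int badchar[NO_OF_CHARS])
{
    for (int c = 0; c < NO_OF_CHARS; c++)
        badchar[c] = -1;
    for (int j = 0; j < m; j++)
        badchar[(unsigned char)pat[j]] = j;
}

/* Stores up to max_out match positions in out and returns the total
   number of occurrences of pat in txt. */
int bm_search(const char *txt, const char *pat, int *out, int max_out)
{
    assert(txt != NULL && pat != NULL && out != NULL && max_out >= 0);
    int n = (int)strlen(txt), m = (int)strlen(pat);
    int badchar[NO_OF_CHARS];
    int count = 0, s = 0;              /* s = current shift of pat */
    if (m == 0)
        return 0;
    build_badchar(pat, m, badchar);
    while (s <= n - m) {
        int j = m - 1;
        /* Compare from the right end of the pattern. */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;
        if (j < 0) {
            if (count < max_out)
                out[count] = s;
            count++;
            /* Align the character after the match with its last
               occurrence in pat, or move by one at the text end. */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        } else {
            /* Align the bad text character with its last occurrence
               in pat; never shift by less than one. */
            s += max_int(1, j - badchar[(unsigned char)txt[s + j]]);
        }
    }
    return count;
}
```

Run on txt[] = "AABAACAADAABAAABAA" and pat[] = "AABA", this sketch reports three occurrences, at shifts 0, 9, and 13. Only the bad-character rule is used; the good-suffix rule of the full algorithm is left out to keep the example short.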
The Fast Data Finder (FDF) is the most recent specialized hardware text search unit still in
use in many organizations. It was developed to search text and has been used to search
English and foreign languages. The early Fast Data Finders consisted of an array of
programmable text processing cells connected in series, forming a pipeline hardware search
processor (Mettler-93). The cells are implemented using a VLSI chip. In the TREC tests each
chip contained 24 processor cells, with a typical system containing 3600 cells (the FDF-3
has a rack mount configuration with 10,800 cells). Each cell is a comparator for a single
character, limiting the total number of characters in a query to the number of cells. The
cells are interconnected with an 8-bit data path and an approximately 20-bit control path.
The text to be searched passes through each cell in a pipeline fashion until the complete
database has been searched. As data are analyzed at each cell, the states of the 20 control
lines are modified depending upon their current state and the results from the comparator.
An example of a Fast Data Finder system is shown in Figure 9.11.
A cell is composed of both a register cell (Rs) and a comparator (Cs). The input from the
document database is controlled and buffered by the microprocessor/memory and fed through
the comparators. The search characters are stored in the registers. The connection between
the registers reflects the control lines that are also passing state information.