Unit 5 Irs PDF

The document discusses text search algorithms and multimedia information, focusing on text streaming architecture, finite state automata, and software text search algorithms. It highlights the advantages and disadvantages of streaming systems versus indexed systems, as well as various algorithms like Brute Force, Knuth-Morris-Pratt, Boyer-Moore, and Karp-Rabin. Additionally, it covers hardware text search systems, specifically the Fast Data Finder, which utilizes specialized hardware for efficient text searching.

Uploaded by

boycoder310

UNIT 5

Text Search Algorithms and Multimedia Information

Text Streaming Architecture:
The basic concept of a text scanning system is that one or more users enter
queries, and the text to be searched is accessed and compared to the query terms. When all of
the text has been accessed, the query is complete.
One advantage of this type of architecture is that as soon as an item is identified as
satisfying a query, the results can be presented to the user for retrieval. The figure provides a
diagram of a text streaming search system. The database contains the full text of the items.
The term detector is the special hardware/software that contains all of the terms being
searched for and, in some systems, the logic between the terms. It takes the text as input and
detects the existence of the search terms. It outputs the detected terms to the query resolver to
allow for final logical processing of a query against an item. The query resolver performs two
functions. It accepts search statements from the users, extracts the logic and search terms,
and passes the search terms to the detector. It also accepts results from the detector and
determines which queries are satisfied by the item and possibly the weight associated with
the hit. The query resolver passes information to the user interface, which continually
updates the search status for the user and, on request, retrieves any items that satisfy the user's
search statement.
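
The division of work between the term detector and the query resolver can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names term_detector and stream_search are hypothetical, each query is assumed to be a set of terms that must all occur in an item (AND logic only), and terms are matched as whole lower-case words. Using a generator reflects the point made above that a hit can be presented as soon as an item satisfies a query.

```python
def term_detector(item_text, terms):
    """Return the subset of search terms that occur in one item (whole-word match)."""
    words = set(item_text.lower().split())
    return {term for term in terms if term in words}


def stream_search(items, queries):
    """Query resolver sketch: stream items past the detector and yield hits at once.

    items   -- iterable of (item_id, text) pairs, read sequentially
    queries -- dict mapping a query name to the set of terms it requires (assumed AND logic)
    """
    # The detector is loaded with every term from every query.
    all_terms = set().union(*queries.values())
    for item_id, text in items:
        detected = term_detector(text, all_terms)
        for name, required in queries.items():
            if required <= detected:      # all required terms were detected
                yield name, item_id       # hit is available before the scan ends


items = [(1, "fast data finder hardware"), (2, "software text search")]
queries = {"q1": {"text", "search"}, "q2": {"hardware"}}
hits = list(stream_search(items, queries))
```

Because the resolver only sees the set of detected terms, richer logic (OR, negation, weights) could be added there without changing the detector.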

The process is focused on finding at least one or all occurrences of a pattern of text
(query term) in a text stream. It is assumed that the same alphabet is used in both situations.
The worst case search for a pattern of m characters in a string of n characters is at least n - m
+ 1 or a magnitude of O(n) (Rivest-77). Some of the original brute force methods could
require O(n*m) symbol comparisons. More recent improvements have reduced the time to
O(n + m).
In the case of hardware search machines, multiple parallel search machines (term detectors)
may work against the same data stream allowing for more queries or against different data
streams reducing the time to access the complete database. In software systems, multiple
detectors may execute at the same time.
The major disadvantage of basing the search on streaming the text is the dependency of the
search on the slowest module in the computer (the I/O module). Inversions/indexes gain their
speed by minimizing the amount of data to be retrieved and provide the best ratio between the
total number of items delivered to the user versus the total number of items retrieved in
response to a query.
Streaming also has the advantage that hits may be returned to the user as soon as they are found.
Typically, in an index system, the complete query must be processed before any hits are
determined or available. Streaming systems also provide a very accurate estimate of current
search status and time to complete the query. Inversions/indexes also encounter problems in
fuzzy searches (m of n characters).

Finite State Automata:


A finite state automaton is a logical machine that is composed of five elements:
I: a set of input symbols from the alphabet supported by the automaton
S: a set of possible states
P: a set of productions that define the next state based upon the current state and the input symbol
S0: a special state called the initial state
SF: a set of one or more final states from the set S

A finite state automaton is represented by a directed graph consisting of a series of nodes
(states) and edges between nodes represented as transitions defined by the set of productions.
The symbol(s) associated with each edge define the inputs that allow a transition from one
node Si to another node Sj. The figure shows a finite state automaton that will identify the
character string CPU in any input stream. The automaton is defined by the automaton
definition.
The automaton remains in the initial state until it has an input symbol of “C”, which moves it to
state S1. It will remain in that state as long as it receives “C”s as input. If it receives a “P” it
will move to S2. If it receives anything else it falls back to the initial state. Once in state S2 it
will either go to the final state if “U” is the next symbol, go to S1 if a “C” is received, or go
back to the initial state S0 if anything else is received. It is possible to represent the
productions by a table with the states as the rows and the input symbols that cause state
transitions as each column. The states are representing the current state and the values in the
table are the next state given the particular input symbol.
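
The table representation described above can be sketched directly as a Python dictionary. This is a minimal sketch, assuming the state names S0, S1, S2 and SF from the text; the names TRANSITIONS and contains_cpu are hypothetical, and any symbol without an entry is assumed to send the automaton back to the initial state S0, as the description states.

```python
# Rows are current states, columns are the input symbols that cause a transition.
# Any symbol not listed for a state falls back to the initial state S0.
TRANSITIONS = {
    "S0": {"C": "S1"},
    "S1": {"C": "S1", "P": "S2"},
    "S2": {"C": "S1", "U": "SF"},
}


def contains_cpu(stream):
    """Return True as soon as the automaton reaches the final state SF."""
    state = "S0"
    for symbol in stream:
        state = TRANSITIONS[state].get(symbol, "S0")
        if state == "SF":
            return True
    return False
```

For example, the input CCPU is accepted because the automaton stays in S1 on the repeated “C”, while CPXU is rejected because the “X” returns it to S0.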

Software Text Search Algorithms:


In software streaming techniques, the item to be searched is read into memory and then the
algorithm is applied. Although nothing in the architecture described above prohibits software
streaming from being applied to many simultaneous searches against the same item, it is
more frequently used to resolve a particular search against a particular item.
There are five major algorithms associated with software text search:
1. The Brute force approach
2. Knuth-Morris-Pratt algorithm
3. Boyer-Moore algorithm
4. Karp-Rabin algorithm
5. Shift-OR (or) Shift-AND algorithm

1. The Brute-Force algorithm:

The Brute force approach is the simplest string matching algorithm. The idea is to
try and match the search string against the input text. As soon as a mismatch is
detected in the comparison process, shift the input text one position and start the
comparison process over. The expected number of comparisons when searching an
input text string of n characters for a pattern of m characters is

Example:

Position       : 1 2 3 4 5 6
Input Stream   : p q p p p q
Search pattern : p p q
                   ^
A mismatch occurs at the second position of the search pattern, so the search pattern
is shifted one position to the right.

Position       : 1 2 3 4 5 6
Input Stream   : p q p p p q
Search pattern :   p p q
                   ^
A mismatch again occurs at position 2 of the input stream, so the search pattern is
shifted one position at a time until the two patterns match.

Position       : 1 2 3 4 5 6
Input Stream   : p q p p p q
Search pattern :       p p q
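
The shifting procedure in the example can be sketched as follows. This is a minimal illustration rather than a tuned implementation; the function name brute_force_search is hypothetical, and it reports 0-based start offsets, so the match at position 4 in the example appears as offset 3.

```python
def brute_force_search(text, pattern):
    """Return every 0-based offset at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for shift in range(n - m + 1):          # try each alignment in turn
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[shift + j] == pattern[j]:
            j += 1
        if j == m:                          # every pattern character matched
            hits.append(shift)
    return hits


matches = brute_force_search("pqpppq", "ppq")
```

Each alignment may compare up to m characters, which is where the O(n*m) worst case for brute force methods comes from.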

2. The Knuth-Morris-Pratt algorithm:

3. Boyer-Moore algorithm:
Boyer and Moore recognized that the string algorithm could be significantly enhanced if the
comparison process started at the end of the search pattern, processing right to left, rather
than at the start of the search pattern. The advantage is that large jumps are possible when the
mismatched character in the input stream does not exist in the search pattern, which occurs
frequently. This leads to two possible ways of determining how many input characters can
be jumped.
Algorithm 1:
On a mismatch, the character in the input stream is compared to the search pattern to
determine the shifting of the search pattern (number of characters in the input stream to be
skipped) to align the input character with a character in the search pattern. If the character does
not exist in the search pattern, then it is possible to shift the length of the search pattern
matched to that position.
Algorithm 2:
When a mismatch occurs after a substring of the input text has already been matched, the matching
process can jump to the repeating occurrence in the pattern of the initially matched subpattern,
thus aligning that portion of the search pattern that is in the input text.
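
Algorithm 1 can be sketched on its own as shown below. This is a simplified sketch that implements only the first shift rule (commonly called the bad-character rule) and omits Algorithm 2, so it is not the complete Boyer-Moore algorithm; the function name bad_character_search and the 0-based offsets are assumptions made for illustration.

```python
def bad_character_search(text, pattern):
    """Right-to-left matching with the Algorithm 1 shift only (0-based offsets)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index of each character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    hits = []
    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare from the end of the pattern towards its start.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            hits.append(shift)
            shift += 1                      # conservative shift after a match
        else:
            # Align the mismatched input character with its rightmost
            # occurrence in the pattern, or jump past it if it is absent.
            shift += max(1, j - last.get(text[shift + j], -1))
    return hits
```

When the mismatched input character does not occur in the pattern at all, the lookup yields -1 and the pattern jumps j + 1 positions, which is the large jump described above.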

4. Karp-Rabin algorithm:
This approach, which is similar in function to n-grams and signature files, partitions
the input text into n-character strings and calculates a hash function, i.e., the
signature value, for each of the individual strings. The hash value calculated for the search
pattern is then compared to the hash values of the input text strings. Karp and Rabin
discovered a significantly enhanced signature function so that the hash values can be
calculated efficiently:
h(l) = l mod p
where p is a large prime number.
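
A rolling-hash version of this idea can be sketched as follows. The sketch rests on assumptions not given in the text: each window of the text is read as a number in base 256, the prime p is taken to be 1,000,003, and a direct string comparison confirms every hash match so that hash collisions cannot produce false hits. The function name karp_rabin_search is hypothetical.

```python
def karp_rabin_search(text, pattern, base=256, prime=1_000_003):
    """Return 0-based offsets of pattern in text using h(l) = l mod p signatures."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, prime)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):                      # signatures of pattern and first window
        p_hash = (p_hash * base + ord(pattern[i])) % prime
        t_hash = (t_hash * base + ord(text[i])) % prime
    hits = []
    for shift in range(n - m + 1):
        # Equal signatures are only a candidate; confirm with a real comparison.
        if p_hash == t_hash and text[shift:shift + m] == pattern:
            hits.append(shift)
        if shift < n - m:
            # Drop the leading character and append the next one.
            t_hash = ((t_hash - ord(text[shift]) * high) * base
                      + ord(text[shift + m])) % prime
    return hits
```

Each new window signature is derived from the previous one in constant time, so the text is scanned once without rehashing every window from scratch.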

GOTO Function (state transition):


The GOTO function is a directed graph where the letter(s) on the connecting line between
states (circles) specify the transition for that input given the current state.
The GOTO function is applied to a set of words, for example tool, tin, test and stone.
In the automaton for the words HE, SHE, HIS and HERS, whose failure and output functions
are tabulated below, if the current state is 1 and an E or an I is received, then the machine
will go to state 2 or state 6 respectively. The absence of an arrow, or a current input
character that is not on a line leading from the current node, represents a failure condition.
When a failure occurs, the failure function maps a state into another state (it could be to
itself) to continue the search process.
Certain states are defined as output states. Whenever they are reached it means one or more
query terms have been matched.

Failure function:
It is used when there is no directed line leading from the current state that carries the
current input character. Whenever a failure occurs at a particular state, the failure function
maps that state into another state, or into itself, to continue the search process.
i    : 1 2 3 4 5 6 7 8 9
f(i) : 0 0 0 1 2 0 3 0 3
Output function:
It is used to determine which query terms have been matched.
State   Output
2       HE
5       HE, SHE
7       HIS
9       HERS
Thus, suppose an H has been received and the system is in state 1. If the next input symbol is
an E the system moves to state 2; if an I is received it moves to state 6; if any other letter is
received, it is an error, and the Failure Function (the third column in 9.6(b)) specifies that the
system should move to state 0, where the same input character is applied again.
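
The three functions can be sketched together in Python. This is an illustrative sketch, not the construction used in the original figure: the names build_automaton and search are hypothetical, states are numbered in the order they are created, and the words are assumed to be inserted in the order HE, SHE, HIS, HERS, which happens to reproduce the state numbers and failure values tabulated above.

```python
from collections import deque


def build_automaton(words):
    """Build the GOTO, failure and output functions for a set of words."""
    goto = [{}]                 # goto[state][char] -> next state
    output = [set()]            # output[state] -> words recognised in that state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)
    failure = [0] * len(goto)   # states one step from state 0 fail to state 0
    queue = deque(goto[0].values())
    while queue:                # breadth-first over the GOTO graph
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fall = failure[state]
            while fall and ch not in goto[fall]:
                fall = failure[fall]
            failure[nxt] = goto[fall].get(ch, 0)
            output[nxt] |= output[failure[nxt]]   # inherit shorter matches
    return goto, failure, output


def search(text, goto, failure, output):
    """Return (0-based start offset, word) for every match in text."""
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = failure[state]          # re-apply the same character
        state = goto[state].get(ch, 0)
        for word in sorted(output[state]):
            hits.append((pos - len(word) + 1, word))
    return hits


goto, failure, output = build_automaton(["HE", "SHE", "HIS", "HERS"])
found = search("USHERS", goto, failure, output)
```

All query terms are detected in a single pass over the input, which is what makes this structure attractive for streaming many search terms against the same text.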

Hardware Text Search Systems:


Software text search is applicable to many circumstances but has encountered restrictions on
the ability to handle many search terms simultaneously against the same text and limits due to
I/O speeds. One approach that offloaded the resource intensive searching from the main
processors was to have a specialized hardware machine to perform the searches and pass the
results to the main computer which supported the user interface and retrieval of hits. Another
major advantage of using a hardware text search unit is in the elimination of the index that
represents the document database.
Other advantages are that new items can be searched as soon as received by the system rather
than waiting for the index to be created and the search speed is deterministic.
The following figure represents hardware as well as software text search solutions. The
algorithmic part of the system is focused on the term detector. There have been three approaches
to implementing term detectors: parallel comparators or associative memory, a cellular
structure, and a universal finite state automaton.
When the term comparator is implemented with parallel comparators, each term in the query
is assigned to an individual comparison element and input data are serially streamed into the
detector. When a match occurs, the term comparator informs the external query resolver
(usually in the main computer) by setting status flags. In some systems, some of the Boolean
logic between terms is resolved in the term detector hardware (e.g., in the GESCAN
machine).

The Fast Data Finder (FDF)


The Fast Data Finder (FDF) is the most recent specialized hardware text search unit still in
use in many organizations. It was developed to search text and has been used to search
English and foreign languages. The early Fast Data Finders consisted of an array of
programmable text processing cells connected in series forming a pipeline hardware search
processor (Mettler-93). The cells are implemented using a VLSI chip. In the TREC tests each
chip contained 24 processor cells with a typical system containing 3600 cells (the FDF-3 has
a rack mount configuration with 10,800 cells). Each cell will be a comparator for a single
character limiting the total number of characters in a query to the number of cells. The cells
are interconnected with an 8-bit data path and approximately 20-bit control path. The text to
be searched passes through each cell in a pipeline fashion until the complete database has
been searched. As data is analysed at each cell, the states of the 20 control lines are modified
depending upon their current state and the results from the comparator.
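
The idea of one comparator cell per query character, with match state handed from cell to cell, can be loosely imitated in software. The sketch below is only an analogy under strong simplifying assumptions and does not model the actual FDF hardware: each cell is assumed to hold a single match flag instead of roughly 20 control lines, and every cell is assumed to see the same input character on each step, whereas the real unit pipelines the text through the cells. The function name cell_chain_search is hypothetical.

```python
def cell_chain_search(text, term):
    """Imitate a chain of single-character comparator cells (0-based offsets)."""
    flags = [False] * len(term)     # one simplified control flag per cell
    hits = []
    for pos, ch in enumerate(text):
        # Update from the last cell back to the first so that each cell
        # reads the flag its upstream neighbour held on the previous step.
        for i in range(len(term) - 1, -1, -1):
            upstream = flags[i - 1] if i > 0 else True
            flags[i] = upstream and term[i] == ch
        if flags and flags[-1]:     # last cell matched: whole term detected
            hits.append(pos - len(term) + 1)
    return hits
```

As in the description above, the number of cells bounds the length of the term that can be searched, since each cell compares exactly one character.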
Architecture of FDF System:
A cell is composed of both a register cell (Rs) and a comparator (Cs). The input from the
document database is controlled and buffered by the microprocessor/memory and fed through
the comparators. The search characters are stored in the registers. The connections between
the registers reflect the control lines that are also passing state information.
Groups of cells are used to detect query terms, along with logic between the
terms, by appropriate programming of the control lines. When a pattern match is detected, a
hit is passed to the internal microprocessor that passes it back to the host processor, allowing
immediate access by the user to the hit item. The functions supported by the Fast Data Finder
are:
(i) Boolean Logic including negation
(ii) Proximity on an arbitrary pattern
(iii) Variable length “don’t cares”
(iv) Term counting and thresholds
(v) Fuzzy matching
(vi) Term weights
(vii) Numeric ranges
