
IRS Unit-5

This document introduces text search algorithms and systems, detailing both software and hardware approaches. It discusses various algorithms such as brute force, Knuth-Morris-Pratt, Boyer-Moore, and Shift-OR, highlighting their efficiencies and methodologies. Additionally, it contrasts software text search limitations with the advantages of hardware systems, including speed and immediate result availability.

Unit-5

INTRODUCTION TO
INFORMATION RETRIEVAL
SYSTEMS
5.1 Text Search Algorithms

• 5.1.1 Introduction to Text Search Techniques


• 5.1.2 Software Text Search Algorithms
• 5.1.3 Hardware Text Search Systems
5.1 Introduction to Text Search Techniques
• The basic concept of a text scanning system is the ability for one or more users to enter queries, with the text of the items to be searched sequentially accessed and compared to the query terms.
• When all of the text has been accessed, the query is complete.
• One advantage of this type of architecture is that as soon as an item is identified as satisfying a query, the results can be presented to the user for retrieval.
• The term detector is the special hardware/software that
contains all of the search terms and in some systems the
logic between the terms.
• The query resolver performs two major functions:
accepting search statements from the users and
extracting the logic and search terms to pass to the
detector.
• It also accepts results from the detector and determines which queries are satisfied by the item and possibly the relevance weight associated with the hit.
• In foreign language streamers, different encodings may have to be available for items from the same language.
• The worst case search for a pattern of m characters in a string of n characters requires at least n - m + 1 comparisons, a magnitude of O(n).
• Some of the original brute force methods could require
O(n*m) symbol comparisons
• More recent improvements have reduced the time to
O(n + m).
• In the case of hardware text search machines, multiple
parallel search machines (term detectors) may work
against the same data stream.
• This permits more queries or the same queries against
different data streams thereby reducing the time to
access the complete database.

• There are two approaches to the data stream.
• In the first approach, the complete database is sent to the detector(s), which function as a search of the database.
• In the second approach, randomly retrieved items are passed to the detectors.
• In this second case, an index search is performed that constrains the items from the database requiring additional processing.
• Examples where index searches may not be able to
satisfy the complete search statement are:
1. search for stop words
2. search for exact matches when stemming is
performed
3. search for terms that contain both leading and trailing
"don't cares"
4. search for symbols that are on the inter-word symbol list (e.g., the symbols , " and ;)
5. search for "fuzzy" search terms (m of n characters)

• Typically in an index system, the complete query must be processed before any hits are determined or available.
• Streaming systems also provide a very accurate estimate of current search status and time to complete the query.
• Most streaming algorithms locate embedded query terms, and some algorithms and hardware search units will also perform fuzzy searches.

• Many of the hardware and software text searchers use finite state automata as a basis for their search algorithms.
• A finite state automaton is a logical machine that is composed of five elements:
1. I - a set of input symbols from the alphabet supported by the automaton
2. S - a set of possible states
3. P - a set of productions that define the next state based upon the current state and input symbol
4. S0 - a special state called the initial state
5. SF - a set of one or more final states from the set S

• A finite state automaton can be represented by a directed graph consisting of a series of nodes (states) and edges between nodes representing transitions defined by the set of productions.
• Direction is indicated on the edges from the old state to the new state.
• The symbol(s) associated with each edge define the inputs that allow a transition from one node Si to another node Sj.

• The automaton remains in the initial state S0 until it has an input symbol of "C", which moves it to state S1.
• It will remain in that state as long as it receives "C"s as input.
• If it receives a "P" it will move to S2.
• If it receives anything else it falls back to the initial state.
• Once in state S2 it will either go to the final state if "U" is the next symbol, go to S1 if a "C" is received, or go back to the initial state S0 if anything else is received.
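The transitions described above can be written down as a small table of productions. The following is only a minimal sketch: the names PRODUCTIONS and find_cpu are made up for illustration, and returning the 0-based start position of the match is an assumption of this sketch, not something the slides specify.

```python
# States of the automaton described above (SF is the single final state).
S0, S1, S2, SF = "S0", "S1", "S2", "SF"

# Productions P: (current state, input symbol) -> next state.
# Any pair not listed falls back to the initial state S0.
PRODUCTIONS = {
    (S0, "C"): S1,
    (S1, "C"): S1,
    (S1, "P"): S2,
    (S2, "U"): SF,
    (S2, "C"): S1,
}


def find_cpu(stream: str) -> int:
    """Return the 0-based start of the first "CPU" in stream, or -1."""
    state = S0
    for pos, symbol in enumerate(stream):
        state = PRODUCTIONS.get((state, symbol), S0)
        if state == SF:
            return pos - 2  # "CPU" is three symbols long
    return -1


print(find_cpu("THE CCPU"))  # 5
```

Each input symbol is examined once, so the scan takes time proportional to the length of the stream.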
5.2 Software Text Search Algorithms
• The major algorithms associated with software text search are:
• The brute force approach
• Knuth-Morris-Pratt
• Boyer-Moore
• Shift-OR algorithm
• Rabin-Karp
• Of all of the algorithms, Boyer-Moore has been the fastest, requiring at most O(n + m) comparisons where n is the number of characters being searched and m is the size of the search string.
• Knuth-Morris-Pratt and Boyer-Moore both require O(m) preprocessing of the search string in addition to the search comparisons.
• The brute force approach is the simplest string matching
algorithm.
• The idea is to match the search string against the input text.
• Whenever a mismatch is detected in the comparison process, the input text is shifted one position, and the comparison process is initialized and restarted.
• The expected number of comparisons when searching an input text string of n characters for a pattern of m characters is
Nc = (c / (c - 1)) (1 - 1/c^m) (n - m + 1) + O(1)
• Nc is the expected number of comparisons and c is the size of the alphabet for the text.
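The brute force approach can be sketched in a few lines. This is an illustrative sketch only: the function name brute_force_search and the returned comparison counter are assumptions added here so that the cost of a search can be observed directly.

```python
def brute_force_search(text: str, pattern: str) -> tuple[int, int]:
    """Return (0-based index of first match or -1, symbol comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch: shift one position and restart
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


print(brute_force_search("opengenus", "genus"))  # (4, 9)
```

In the worst case every alignment compares up to m symbols, which is where the O(n*m) bound comes from.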
Knuth-Morris-Pratt Algorithm
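o Let the search stream be “S” and the input stream be “i” (rows S and i in the tables below).
o Row “P” gives the positions in the input stream and row “j” gives the positions in the search stream. In the steps, i and j denote the current positions being compared.
• Initially i=1, j=1 (positions are counted from 1).
• If the characters match, i++ and j++.
• If they do not match and j=1, i++.
• If they do not match and j>1, i stays where it is and j moves back to the position given by the characters already matched (to 1 when no part of the match can be reused).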

P   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
i   B  A  B  C  B  A  B  C  A  B  C  A  C  A  B  A
S   A  B  C  A  B  C  A  C  A  B
j   1  2  3  4  5  6  7  8  9 10
• i=1, j=1: there is a mismatch, so increment i (i++).
• i=2, j=1: there is a match, so increment both i and j (i++ and j++).
• i=3, j=2: there is a match, so increment both i and j (i++ and j++).
• i=4, j=3: there is a match, so increment both i and j (i++ and j++).
• i=5, j=4: there is a mismatch, so j moves back to 1 and i stays at 5.
• i=5, j=1
• i=5, j=1: there is a mismatch, so increment i (i++).
• i=6, j=1: there is a match, so increment both i and j (i++ and j++).

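Now continue in the same way, incrementing i and j until you get a mismatch.
If you get a mismatch, repeat the previous step. If there is no mismatch and the whole search stream has been matched in the input stream, you have found the relevant data. In this example the search stream matches input positions 6 to 15.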
[Figure: diagram containing the letter sequences H E R S, I S and S H E]
Shift OR algorithm
• The Shift OR algorithm uses bitwise techniques to check whether the given pattern is present in the string or not. It is efficient if the pattern length is less than the memory-word size of the machine (here the memory-word size is taken to be 64 bits). We are given the string, the string length, and the pattern. Our job is to return the starting index of the pattern if the pattern exists in the string and -1 if it does not exist.
• Example:
Input:
Text: Opengenus
Pattern: genus
Output: Pattern found at index: 4
Step 1: Take the string and the pattern as input.
Step 2: Create an array called pattern_mask of size 256 (one entry per 8-bit character code) and initialize every entry to ~0.
Step 3: Traverse the pattern and, for each position i, clear (set to 0) the ith bit from the right of pattern_mask[pattern[i]].
Step 4: Initialize a variable called R to ~1.
Step 5: Traverse the string from left to right.
Step 6: Set R to R | pattern_mask[text[i]].
Step 7: Shift R to the left by 1.
Step 8: If the mth bit of R from the right (m is the length of the pattern, bits counted from 0) is equal to 0, then the pattern is found at index i - m + 1.
Step 9: If no such i exists, return -1.
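The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name shift_or_search is chosen here, a dictionary stands in for the 256-entry pattern_mask array, and because Python integers are not limited to a machine word, the 64-bit restriction mentioned above does not apply to this sketch.

```python
def shift_or_search(text: str, pattern: str) -> int:
    """Return the 0-based index of the first match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    all_ones = ~0  # every bit set

    # Steps 2-3: bit i of pattern_mask[c] is 0 exactly when pattern[i] == c.
    pattern_mask: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        pattern_mask[ch] = pattern_mask.get(ch, all_ones) & ~(1 << i)

    r = ~1  # Step 4
    for i, ch in enumerate(text):            # Step 5
        r |= pattern_mask.get(ch, all_ones)  # Step 6
        r <<= 1                              # Step 7
        if r & (1 << m) == 0:                # Step 8
            return i - m + 1
    return -1                                # Step 9


print(shift_or_search("opengenus", "genus"))  # 4
```

Characters that do not occur in the pattern keep the all-ones mask, so they reset every partial match in a single OR.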
Example
• The given string is opengenus.
The given pattern is genus.
The length of the pattern is 5 (genus).
Bit columns, left to right: s u n e g
• pattern_mask[g] = 1 1 1 1 0
• pattern_mask[e] = 1 1 1 0 1
• pattern_mask[n] = 1 1 0 1 1
• pattern_mask[u] = 1 0 1 1 1
• pattern_mask[s] = 0 1 1 1 1
• pattern_mask[p] = 1 1 1 1 1
• pattern_mask[o] = 1 1 1 1 1
• R is equal to 1 1 1 1 0
• traverse opengenus from left to right.
R is equal to R | pattern_mask[o]
11110|11111
11111
R << 1 is equal to
11110
i is equal to 0.
• R is equal to R | pattern_mask[p]
11110|11111
11111
R << 1 is equal to
11110
i is equal to 1 ( 0 + 1)
• R is equal to R | pattern_mask[e]
11110|11101
11111
R << 1 is equal to
11110
i is equal to 2 (1 + 1)
• R is equal to R | pattern_mask[n]
11110|11011
11111
R << 1 is equal to
11110
i is equal to 3 (2 + 1)
• R is equal to R | pattern_mask[g]
11110|11110
11110
R << 1 is equal to
11100
i is equal to 4 (3 + 1)
• R is equal to R | pattern_mask[e]
11100|11101
11101
R << 1 is equal to
11010
i is equal to 5 (4 + 1)
• R is equal to R | pattern_mask[n]
11010|11011
11011
R << 1 is equal to
10110
i is equal to 6 (5 + 1)
• R is equal to R | pattern_mask[u]
10110|10111
10111
R << 1 is equal to
01110
i is equal to 7 (6 +1)
• R is equal to R | pattern_mask[s]
01110|01111
01111
R << 1 is equal to
011110
i is equal to 8 (7 + 1)
• R & (1 << 5) is equal to 0.
Therefore the pattern has been found.
return i - m + 1 = 8 - 5 + 1 = 4.
• The pattern is found at index 4.
• The Knuth-Morris-Pratt algorithm made a major improvement over previous algorithms in that even in the worst case it does not depend upon the length of the search term and does not require comparisons for every character in the input stream.
• The basic concept behind the algorithm is that whenever a mismatch is detected, the previous matched characters define the number of characters that can be skipped in the input stream prior to starting the comparison process again.
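A compact way to see how the previously matched characters determine the skip is to precompute, for each prefix of the search string, the length of its longest proper prefix that is also a suffix. The sketch below uses the common failure-table formulation of Knuth-Morris-Pratt; the names build_failure and kmp_search are chosen for illustration, and positions are 0-based, unlike the 1-based walk-through earlier in this section.

```python
def build_failure(pattern: str) -> list[int]:
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text: str, pattern: str) -> int:
    """Return the 0-based index of the first match, or -1."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, reuse the matched prefix instead of backing up in text.
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


print(build_failure("ABCABCACAB"))                   # [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]
print(kmp_search("BABCBABCABCACABA", "ABCABCACAB"))  # 5
```

With the input and search streams from the walk-through, the match starts at 0-based index 5, which is position 6 when counting from 1. The input position never moves backwards, which is why the search needs only O(n + m) comparisons.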
5.3 Hardware Text Search Systems
• Software text search is applicable to many circumstances but has
encountered restrictions on the ability to handle many search
terms simultaneously against the same text and limits due to I/O
speeds.
• With a hardware searcher, the only limit on speed is the time it takes to flow the text off secondary storage (i.e., disk drives) to the searchers.
• Another major advantage of using a hardware text search unit is
in the elimination of the index that represents the document
database.
• Typically the indexes are 70 per cent the size of the actual items.
• Other advantages are that new items can be searched as soon as
received by the system rather than waiting for the index to be
created and the search speed is deterministic.
• Even though streaming may be slower than using an index, the predictability of how long it will take to stream the data provides the user with an exact search time.
• As hits are discovered, they can immediately be made available to the user versus waiting for the total search to complete as in index searches.
• The algorithmic part of the system is focused on the term detector.
• There have been three approaches to implementing term detectors:
1. Parallel comparators or associative memory
2. A cellular structure
3. A universal finite state automata
• When the term comparator is implemented with parallel
comparators, each term in the query is assigned to an individual
comparison element and input data are serially streamed into the
detector.
• When a match occurs, the term comparator informs the external
query resolver by setting status flags.
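The parallel comparator arrangement can be imitated in software to make the data flow concrete. This is only a rough simulation under stated assumptions: the class Comparator, the function detect_terms and the hit flag are invented for illustration, terms are assumed to be non-empty, and real comparators work simultaneously in hardware rather than in a loop.

```python
class Comparator:
    """One comparison element holding a single (non-empty) query term."""

    def __init__(self, term: str) -> None:
        self.term = term
        self.window = ""   # the most recent len(term) input characters
        self.hit = False   # status flag read by the query resolver

    def feed(self, ch: str) -> None:
        self.window = (self.window + ch)[-len(self.term):]
        if self.window == self.term:
            self.hit = True


def detect_terms(stream: str, terms: list[str]) -> dict[str, bool]:
    """Stream the input serially past every comparator and report the flags."""
    comparators = [Comparator(t) for t in terms]
    for ch in stream:            # input data are serially streamed
        for c in comparators:    # every element sees the same character
            c.feed(ch)
    return {c.term: c.hit for c in comparators}


print(detect_terms("text search unit", ["search", "disk"]))
# {'search': True, 'disk': False}
```

A query resolver would then combine these flags according to the logic of each query.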
• Specialized hardware that interfaces with computers and is used to search secondary storage devices was developed from the early 1970s. The need for this hardware was driven by the limits in computer resources.
• The speed of search is then based on the speed of the I/O.
• One of the earliest hardware text string search units was the Rapid
Search Machine developed by General Electric.
• The machine consisted of a special purpose search unit in which a
single query was passed against a magnetic tape containing the
documents.
• A more sophisticated search unit was developed by Operating
Systems Inc. called the Associative File Processor.
• It is capable of searching against multiple queries at the same time.
• The GESCAN system uses a text array processor (TAP) that
simultaneously matches many terms and conditions against a given
text stream
• The TAP receives the query information from the user's computer and
directly accesses the textual data from secondary storage.
• The TAP consists of a large cache memory and an array of four to 128 query processors.
• The text is loaded into the cache and searched by the query
processors.
• Each query processor is independent and can be loaded at any time
• A complete query is handled by each query processor.
• Each row of the matrix is a query processor in which the first chip
performs the query resolution while the remaining chips match query
terms
• A query processor performs two operations in parallel: matching query terms to input text and Boolean logic resolution.
• Term matching is performed by a series of character cells, each
containing one character of the query
• A string of character cells is implemented on the same LSI chip
and the chips can be connected in series for longer strings.
• When a word or phrase of the query is matched, a signal is sent
to the resolution sub-process on the LSI chip.
• The resolution chip is responsible for resolving the Boolean
logic between terms and proximity requirements.
• If the item satisfies the query, the information is transmitted to
• Another approach for hardware searchers is to augment disk storage.
• The augmentation is a generalized associative search element placed between the read and write heads on the disk.
• The content addressable segment sequential memory (CASSM)
system uses these search elements in parallel to obtain structured
data from a database.
• The CASSM system was developed at the University of Florida as a
general purpose search device .
• It can be used to perform string searching across the database.
• Another special search machine is the relational associative
processor (RAP) developed at the University of Toronto
• Like CASSM, it performs searches across a secondary storage device using a series of cells comparing data in parallel.
• The Fast Data Finder (FDF) is the most recent specialized
hardware text search unit still in use in many organizations
• It was developed to search text and has been used to
search English and foreign languages.
• The early Fast Data Finders consisted of an array of
programmable text processing cells connected in series
forming a pipeline hardware search processor
• The cells are interconnected with an 8-bit data path and
approximately 20-bit control path.
• The text to be searched passes through each cell in a
pipeline fashion until the complete database has been
searched
• As data are analyzed at each cell, the states of the 20 control lines are modified depending upon their current state and the results from the comparator.
• A cell is composed of both a register cell (Rs) and a comparator (Cs).
• The input from the document database is controlled and buffered by the microprocessor/memory and fed through the comparators.
• The search characters are stored in the registers.
• The connection between the registers reflects the
control lines that are also passing state information
• When a pattern match is detected, a hit is passed to the internal
microprocessor that passes it back to the host processor,
allowing immediate access by the user to the Hit item.
• The functions supported by the Fast Data Finder are
1. Boolean Logic including negation
2. Proximity on an arbitrary pattern
3. Variable length "don't cares"
4. Term counting and thresholds
5. fuzzy matching
6. Term weights
7. numeric ranges.
5.2 Multimedia Information Retrieval
• 5.2.1 Spoken Language Audio Retrieval
• 5.2.2 Non-Speech Audio Retrieval
• 5.2.3 Graphical Retrieval
• 5.2.4 Imagery Retrieval
• 5.2.5 Video Retrieval
5.2.1 Spoken Language Audio Retrieval
• Just as a user may wish to search the archives of a large text collection, the ability to search the content of audio sources such as speeches, radio broadcasts, and conversations would be valuable for a range of applications.
• An assortment of techniques has been developed to support the automated recognition of speech (Waibel and Lee 1990). These have applicability for a range of application areas such as speaker verification, transcription, and command and control.
• For example, Jones et al. (1997) report a comparative evaluation of speech and text retrieval in the context of the Video Mail Retrieval (VMR) project. While speech transcription word error rates may be high (as much as 50% or more depending upon the source, speaker, dictation vs. conversation, environmental factors and so on), redundancy in the source material helps offset the error rates and still supports effective retrieval.
• In Jones et al.’s speech/text comparative experiments, using standard information retrieval evaluation techniques, speaker-dependent techniques retain approximately 95% of the performance of retrieval of text transcripts, and speaker-independent techniques about 75%. However, significant system challenges may remain.
• Some recent efforts have focused on the automated transcription of
broadcast news.
• For example, the figure illustrates BBN’s Rough’n’Ready prototype that aims to provide information access to spoken language from audio and video sources (Kubala et al 2000). Rough’n’Ready “creates a Rough summarization of speech that is Ready for browsing.”
• This figure illustrates a January 31, 1998 sample from ABC’s World News Tonight in which the left-hand column indicates the speaker, the center column shows the transcription with highlighted named entities (i.e., people, organizations, locations), and the rightmost column lists the topics of discussion.
• Rough’n’Ready’s transcription is created by the BYBLOS™ large vocabulary recognition system.
5.2.2 Non-Speech Audio Retrieval
5.2.3 Graphical Retrieval
5.2.4 Imagery Retrieval
5.2.5 Video Retrieval
