IRS Unit-5
INTRODUCTION TO
INFORMATION RETRIEVAL
SYSTEMS
5.1 Text Search Algorithms
Text P (i = 1 to 16):    B A B C B A B C A B C A C A B A
Pattern S (j = 1 to 10): A B C A B C A C A B
• i=1, j=1: mismatch, so increment i (i++).
• i=2, j=1: match, so increment both i and j (i++, j++).
• i=3, j=2: match, increment both (i++, j++).
• i=4, j=3: match, increment both (i++, j++).
• i=5, j=4: mismatch, so move j back (j--).
• i=5, j=1.
• i=5, j=1: mismatch, so increment i (i++).
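The pointer trace above can be grounded with a simple baseline matcher (a minimal Python sketch; the slide uses 1-based positions, so the match the trace is working toward at P position 6 corresponds to 0-based index 5 here):

```python
def naive_search(text, pattern):
    """Brute-force search: try every alignment of the pattern in the text.

    Returns the 0-based index of the first match, or -1 if none.
    """
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):        # candidate starting position in the text
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                    # characters match: advance the pattern pointer
        if j == m:                    # the whole pattern matched
            return i
    return -1                         # no alignment matched

print(naive_search("BABCBABCABCACABA", "ABCABCACAB"))  # -> 5
```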
Shift OR algorithm
• The Shift-OR algorithm uses bitwise operations to check whether a given
pattern is present in a string. It is efficient when the pattern length
does not exceed the memory-word size of the machine (assume a 64-bit
word here). Given a string, its length, and a pattern, the task is to
return the starting index of the pattern if it exists in the string,
and -1 if it does not.
• Example:
Input:
Text: Opengenus
Pattern: genus
Output: Pattern found at index: 4
Step 1: Take the string and the pattern as input.
Step 2: Create an array called pattern_mask of size 256
(the total number of ASCII characters) and initialize every entry to ~0.
Step 3: Traverse the pattern and set the ith bit from the right
of pattern_mask[pattern[i]] to 0.
Step 4: Initialize a variable R to ~1.
Step 5: Traverse the string from left to right.
Step 6: Set R equal to R | pattern_mask[text[i]].
Step 7: Shift R left by 1.
Step 8: If the mth bit (m = length of the pattern) of R from the right
is 0, then the pattern is found at index i - m + 1.
Step 9: If no such i exists, return -1.
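The steps above can be written out directly (a minimal Python sketch; Python's unbounded integers stand in for the 64-bit machine word):

```python
def shift_or(text, pattern):
    """Shift-OR (bitwise) search; returns the 0-based start index or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # Step 2: one mask per ASCII character, all bits initially set (~0).
    pattern_mask = [~0] * 256
    # Step 3: clear bit i of the mask for the i-th pattern character.
    for i, ch in enumerate(pattern):
        pattern_mask[ord(ch)] &= ~(1 << i)
    R = ~1                              # Step 4
    for i, ch in enumerate(text):       # Step 5
        R |= pattern_mask[ord(ch)]      # Step 6
        R <<= 1                         # Step 7
        if (R & (1 << m)) == 0:         # Step 8: m-th bit is 0 -> match
            return i - m + 1
    return -1                           # Step 9

print(shift_or("Opengenus", "genus"))  # -> 4
```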
Example
• The given string is opengenus.
The given pattern is genus.
The length of the pattern is 5(genus).
• Bit positions (right to left) correspond to the pattern characters: s u n e g
• pattern_mask[g] = 1 1 1 1 0
• pattern_mask[e] = 1 1 1 0 1
• pattern_mask[n] = 1 1 0 1 1
• pattern_mask[u] = 1 0 1 1 1
• pattern_mask[s] = 0 1 1 1 1
• pattern_mask[p] = 1 1 1 1 1
• pattern_mask[O] = 1 1 1 1 1
• R is equal to 1 1 1 1 0
• traverse opengenus from left to right.
R is equal to R | pattern_mask[o]
11110|11111
11111
R << 1 is equal to
11110
i is equal to 0.
• R is equal to R | pattern_mask[p]
11110|11111
11111
R << 1 is equal to
11110
i is equal to 1 ( 0 + 1)
• R is equal to R | pattern_mask[e]
11110|11101
11111
R << 1 is equal to
11110
i is equal to 2 (1 + 1)
• R is equal to R | pattern_mask[n]
11110|11011
11111
R << 1 is equal to
11110
i is equal to 3 (2 + 1)
• R is equal to R | pattern_mask[g]
11110|11110
11110
R << 1 is equal to
11100
i is equal to 4 (3 + 1)
• R is equal to R | pattern_mask[e]
11100|11101
11101
R << 1 is equal to
11010
i is equal to 5 (4 + 1)
• R is equal to R | pattern_mask[n]
11010|11011
11011
R << 1 is equal to
10110
i is equal to 6 (5 + 1)
• R is equal to R | pattern_mask[u]
10110|10111
10111
R << 1 is equal to
01110
i is equal to 7 (6 +1)
• R is equal to R | pattern_mask[s]
01110|01111
01111
R << 1 is equal to
011110
i is equal to 8 (7 + 1)
• R & (1 << 5) is equal to 0.
Therefore the pattern has been found.
return i - m + 1 = 8 - 5 + 1 = 4.
• The pattern is found at index 4.
• The Knuth-Morris-Pratt algorithm made a major improvement over
previous algorithms in that, even in the worst case, it does not
depend upon the length of the search term and does not require
comparisons for every character in the input.
• The basic concept behind the algorithm is that whenever
a mismatch is detected, the previously matched
characters define the number of characters that can be
skipped in the input stream prior to restarting the
comparison process.
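The skip described above can be sketched with the standard failure-table formulation of KMP (a minimal Python sketch; fail[j] records how far the pattern pointer can fall back after a mismatch without re-reading the text):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt search; returns the 0-based start index or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Failure table: fail[j] = length of the longest proper prefix of
    # pattern[:j+1] that is also a suffix of it.
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Scan the text; on a mismatch, the matched prefix tells us how far
    # the pattern pointer can fall back, so the text pointer never backs up.
    j = 0
    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            j = fail[j - 1]
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1

print(kmp_search("BABCBABCABCACABA", "ABCABCACAB"))  # -> 5
```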
5.3 Hardware Text Search Systems
• Software text search is applicable to many circumstances, but it is
restricted in its ability to handle many search terms simultaneously
against the same text and is limited by I/O speeds.
• With a hardware search unit, the only limit on speed is the time it
takes to stream the text off secondary storage (i.e., disk drives) to
the searchers.
• Another major advantage of using a hardware text search unit is
in the elimination of the index that represents the document
database.
• Typically the indexes are about 70 percent of the size of the actual items.
• Other advantages are that new items can be searched as soon as they
are received by the system, rather than waiting for an index to be
created, and that the search speed is deterministic.
• Even though it may be slower than using an
index, the predictability of how long it will
take to stream the data provides the user with
an exact search time.
• As hits are discovered, they can be made available to the
user immediately, rather than waiting for the total search
to complete as in index searches.
• The algorithmic part of the system is focused on the term detector.
• There have been three approaches to implementing term detectors:
1. Parallel comparators or associative memory
2. A cellular structure
3. A universal finite state automaton
• When the term comparator is implemented with parallel
comparators, each term in the query is assigned to an individual
comparison element and input data are serially streamed into the
detector.
• When a match occurs, the term comparator informs the external
query resolver by setting status flags.
• Specialized hardware that interfaces with computers and is used to
search secondary storage devices was developed from the early
1970s; the need for this hardware was driven by the limits in
computer resources.
• The speed of search is then based on the speed of the I/O.
• One of the earliest hardware text string search units was the Rapid
Search Machine developed by General Electric.
• The machine consisted of a special purpose search unit in which a
single query was passed against a magnetic tape containing the
documents.
• A more sophisticated search unit was developed by Operating
Systems Inc. called the Associative File Processor.
• It is capable of searching against multiple queries at the same time.
• The GESCAN system uses a text array processor (TAP) that
simultaneously matches many terms and conditions against a given
text stream
• The TAP receives the query information from the user's computer and
directly accesses the textual data from secondary storage.
• The TAP consists of a large cache memory and an array of four to 128
query processors.
• The text is loaded into the cache and searched by the query
processors.
• Each query processor is independent and can be loaded at any time
• A complete query is handled by each query processor.
• Each row of the matrix is a query processor in which the first chip
performs the query resolution while the remaining chips match query
terms
• A query processor performs two operations in parallel:
matching query terms to the input text, and Boolean logic
resolution.
• Term matching is performed by a series of character cells, each
containing one character of the query
• A string of character cells is implemented on the same LSI chip
and the chips can be connected in series for longer strings.
• When a word or phrase of the query is matched, a signal is sent
to the resolution sub-process on the LSI chip.
• The resolution chip is responsible for resolving the Boolean
logic between terms and proximity requirements.
• If the item satisfies the query, the information is transmitted to
the user's computer.
• Another approach for hardware searchers is to augment disc storage.
• The augmentation is a generalized associative search element placed
between the read and write heads on the disk.
• The content addressable segment sequential memory (CASSM)
system uses these search elements in parallel to obtain structured
data from a database.
• The CASSM system was developed at the University of Florida as a
general-purpose search device.
• It can be used to perform string searching across the database.
• Another special search machine is the relational associative
processor (RAP), developed at the University of Toronto.
• Like CASSM, RAP performs searches across a secondary storage device
using a series of cells that compare data in parallel.
• The Fast Data Finder (FDF) is the most recent specialized
hardware text search unit still in use in many organizations
• It was developed to search text and has been used to
search English and foreign languages.
• The early Fast Data Finders consisted of an array of
programmable text processing cells connected in series
forming a pipeline hardware search processor
• The cells are interconnected with an 8-bit data path and
approximately 20-bit control path.
• The text to be searched passes through each cell in a
pipeline fashion until the complete database has been
searched
• As data are analyzed at each cell, the states of the 20 control
lines are modified depending upon their current state
and the results from the comparator.
• A cell is composed of both a register cell (Rs) and a
comparator (Cs).
• The input from the document database is controlled
and buffered by the microprocessor/memory and fed
through the comparators.
• The search characters are stored in the registers.
• The connection between the registers reflects the
control lines that are also passing state information
• When a pattern match is detected, a hit is passed to the internal
microprocessor, which passes it back to the host processor,
allowing immediate access by the user to the hit item.
• The functions supported by the Fast Data Finder are:
1. Boolean logic, including negation
2. Proximity on an arbitrary pattern
3. Variable-length "don't cares"
4. Term counting and thresholds
5. Fuzzy matching
6. Term weights
7. Numeric ranges
5.2 Multimedia Information retrieval
• 5.2.1 Spoken Language Audio Retrieval
• 5.2.2 Non-Speech Audio Retrieval
• 5.2.3 Graphical Retrieval
• 5.2.4 Imagery Retrieval
• 5.2.5 Video Retrieval
5.2.1 Spoken Language Audio Retrieval
• Just as a user may wish to search the archives of a large text
collection, the ability to search the content of audio sources such as
speeches, radio broadcasts, and conversations would be valuable for a
range of applications.
• An assortment of techniques has been developed to support the
automated recognition of speech (Waibel and Lee 1990). These have
applicability in a range of areas such as speaker verification,
transcription, and command and control.
• For example, Jones et al. (1997) report a comparative evaluation of
speech and text retrieval in the context of the Video Mail
Retrieval (VMR) project. While speech transcription word error rates
may be high (as much as 50% or more, depending upon the source,
speaker, dictation vs. conversation, environmental factors, and so
on), redundancy in the source material helps offset the error rates
and still supports effective retrieval.
• In Jones et al.'s speech/text comparative experiments, using standard
information retrieval evaluation techniques, speaker-dependent techniques
retain approximately 95% of the retrieval performance achieved on text
transcripts, and speaker-independent techniques about 75%. Significant
system challenges may nonetheless remain.
• Some recent efforts have focused on the automated transcription of
broadcast news.
• For example, the figure illustrates BBN's Rough'n'Ready
prototype, which aims to provide information access to spoken
language from audio and video sources (Kubala et al 2000).
Rough'n'Ready "creates a rough summarization of speech
that is ready for browsing."
• The figure illustrates a January 31, 1998 sample from ABC's
World News Tonight, in which the left-hand column indicates
the speaker, the center column shows the transcription with
highlighted named entities (i.e., people, organizations, locations),
and the rightmost column lists the topics of discussion.