Unit 5 IRS
There are three classical text retrieval techniques that are defined for organizing items
in a textual database, for rapidly identifying the relevant items and for eliminating
items that do not satisfy the search.
They are
o Full text scanning (streaming)
o Word inversion
o Multi attribute retrieval
In addition to indexes, streaming of text has been used for searching text in information
systems. A text streaming search system has the following components.
database
o Contains the full text of the items.
term detector
o Special hardware/software that contains all of the terms being searched for.
o It will input the text and detect the existence of the search terms.
o It will output to the query resolver the detected terms to allow for final logical
processing of a query against an item.
o In Hardware search machines
Multiple parallel search machines (term detectors) may work against
the same data stream allowing for more queries or against different
data streams reducing the time to access the complete database.
o In software systems
Multiple detectors may execute at the same time.
o Two approaches to the data stream.
In the first approach, the complete database is sent to the
detector(s), which function as a search of the database.
In the second approach, randomly retrieved items are passed to the
detectors.
query resolver
o It performs two functions.
o It will accept search statements from the users, extract the logic and search
terms and pass the search terms to the detector.
o It also accepts results from the detector and determines which queries are
satisfied by the item and possibly the weight associated with the hit.
user interface
o The query resolver passes information to the user interface, which
continually updates the search status for the user.
o On request, it retrieves any items that satisfy the user search statement.
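The flow between the components above can be sketched in software. This is only a minimal illustration under stated assumptions, not the design of any particular system: the function names detect_terms, resolve and stream_search are hypothetical, the query is assumed to be a simple AND of single-word terms, and term detection is a naive word split rather than a real pattern matcher.

```python
def detect_terms(text, terms):
    """Term detector: scan one item's text and report which search terms occur."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return terms & words


def resolve(query_terms, detected):
    """Query resolver: an AND query is satisfied when every term was detected."""
    return query_terms <= detected


def stream_search(items, query_terms):
    """Stream (item_id, text) pairs past the detector; yield hits as they are found."""
    for item_id, text in items:
        if resolve(query_terms, detect_terms(text, query_terms)):
            yield item_id


database = [
    ("d1", "Hardware text search machines"),
    ("d2", "Software search of text."),
    ("d3", "Image retrieval"),
]
hits = list(stream_search(database, {"text", "search"}))
```

Because stream_search is a generator, each hit is available to the caller as soon as its item has been scanned, which mirrors the way a streaming system can report hits before the whole database has been read.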
Inversions/indexes
o gain their speed by minimizing the amount of data to be retrieved and provide
the best ratio between the total number of items delivered to the user versus
the total number of items retrieved in response to a query.
o require storage overheads of 50% to 300%.
o with streaming, by contrast, hits may be returned to the user as soon as found.
o with an index, the complete query must be processed before any hits are determined or available.
o encounter problems in fuzzy searches and imbedded string query terms.
o difficult to locate all the possible index values short of searching the complete
dictionary of possible terms.
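The trade-off described above can be illustrated with a small inverted index. This is a sketch under assumptions: build_index, and_query and embedded_string_terms are hypothetical names, and the index is a plain in-memory dictionary from each word to the set of item ids that contain it.

```python
from collections import defaultdict


def build_index(items):
    """Build an inversion: each word maps to the set of item ids containing it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for word in text.lower().split():
            index[word].add(item_id)
    return index


def and_query(index, terms):
    """Exact-term AND query: only the posting sets of the query terms are read."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def embedded_string_terms(index, fragment):
    """Embedded-string query: every dictionary entry has to be examined."""
    return sorted(t for t in index if fragment in t)


items = {
    "d1": "computer search hardware",
    "d2": "minicomputer search software",
    "d3": "image retrieval",
}
index = build_index(items)
exact_hits = and_query(index, ["search", "hardware"])
candidates = embedded_string_terms(index, "computer")
```

The exact query touches only two posting sets, while the embedded-string lookup has to walk the complete dictionary to find "minicomputer", which is the weakness noted for indexes.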
The productions (state transitions) of a finite state automaton can be represented by a
table with the states as the rows and the input symbols that cause state transitions as the columns.
Each row represents a current state, and the values in the table are the next
state given the particular input symbol.
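As an illustration of such a table, the sketch below hand-codes an automaton that detects the term "CPU" in a character stream. The table contents, the "other" column standing for every other character, and the name count_matches are assumptions made for this example only.

```python
# Rows are current states; columns are input symbols; values are next states.
# State 3 is the accepting state (the term "CPU" has just been seen).
TABLE = {
    0: {"C": 1, "P": 0, "U": 0, "other": 0},
    1: {"C": 1, "P": 2, "U": 0, "other": 0},
    2: {"C": 1, "P": 0, "U": 3, "other": 0},
    3: {"C": 1, "P": 0, "U": 0, "other": 0},
}
ACCEPT = 3


def count_matches(text):
    """Run the automaton over the text and count how often it accepts."""
    state = 0
    hits = 0
    for ch in text:
        symbol = ch if ch in ("C", "P", "U") else "other"
        state = TABLE[state][symbol]  # one table lookup per input character
        if state == ACCEPT:
            hits += 1
    return hits
```

Each input character costs a single table lookup, so the text is scanned once from start to end without backing up.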
5.3 Hardware Text Search Systems
If the item satisfies the query, the information is transmitted to the user’s
computer.
The text array processor uses these chips in a matrix arrangement as shown in
Figure 9.10.
Each row of the matrix is a query processor in which the first chip performs
the query resolution while the remaining chips match query terms.
The maximum number of characters in a query is restricted by the length of a
row while the number of rows limits the number of simultaneous queries that
can be processed.
Another approach for hardware searchers is to augment disk storage.
The augmentation is a generalized associative search element placed between
the read and write heads on the disk.
Examples
Content Addressable Segment Sequential Memory (CASSM) system
o developed at the University of Florida.
o uses search elements in parallel to obtain structured data from a database.
o can perform string searching across the database.
Groups of cells are used to detect query terms, along with logic between the terms,
by appropriate programming of the control lines.
When a pattern match is detected, a hit is passed to the internal microprocessor,
which passes it back to the host processor, allowing immediate access by the user to
the hit item.
SoundFisher
o It is a user-extensible sound classification and retrieval system.
o It draws on several disciplines, including signal processing,
psychoacoustics, speech recognition, computer music, and multimedia
databases.
o Just as image indexing algorithms use visual feature vectors to index and match
images, a vector of directly measurable acoustic features such as duration,
loudness, pitch, and brightness is used to index sounds.
o This enables users to search for sounds within specified feature ranges.
o Content-based retrieval application enables a user to browse and/or query a
sound database by acoustic (e.g., pitch, duration) and/or perceptual properties
(e.g., “scratchy”) and/or query by example.
o For example, SoundFisher supports complex content queries such as “Find all
AIFF encoded files with animal or human vocal sounds that are similar to
barking sounds without regard to duration or amplitude.”
o The user can also perform a weighted query-by-value.
o For example, foreground and transition with >.8 metallic and >.7 plucked
aural properties and 2000 Hz < average pitch < 3000 Hz and duration.
o The system can also be trained by example, so that perceptual properties (e.g.,
“scratchiness” or “buzziness”) that are more indirectly related to acoustic
features can be specified and retrieved.
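The two query styles described above, search within feature ranges and query by example, can be sketched over a small table of acoustic feature vectors. This is not SoundFisher's implementation: the file names, feature values and the function names range_query, distance and query_by_example are all invented for illustration, and the distance is a plain unnormalized Euclidean distance.

```python
import math

# Hypothetical feature vectors; the values are made up for this example.
SOUNDS = {
    "bark.aiff": {"duration": 0.4, "loudness": 0.8, "pitch": 450.0, "brightness": 0.6},
    "growl.aiff": {"duration": 1.5, "loudness": 0.5, "pitch": 120.0, "brightness": 0.3},
    "yelp.aiff": {"duration": 0.2, "loudness": 0.9, "pitch": 500.0, "brightness": 0.7},
    "bell.aiff": {"duration": 2.0, "loudness": 0.7, "pitch": 2200.0, "brightness": 0.9},
}


def range_query(sounds, ranges):
    """Return sounds whose features all fall inside the given (low, high) ranges."""
    return sorted(
        name
        for name, feats in sounds.items()
        if all(low <= feats[key] <= high for key, (low, high) in ranges.items())
    )


def distance(a, b, features):
    """Euclidean distance restricted to the chosen features."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in features))


def query_by_example(sounds, example, features):
    """Rank the other sounds by similarity to the example sound."""
    ref = sounds[example]
    others = (name for name in sounds if name != example)
    return sorted(others, key=lambda name: distance(sounds[name], ref, features))


in_range = range_query(SOUNDS, {"pitch": (400.0, 600.0)})
ranked = query_by_example(SOUNDS, "bark.aiff", ["pitch", "brightness"])
```

Leaving duration and loudness out of the feature list is one simple way to search "without regard to duration or amplitude". A real system would normally normalize or weight the features so that pitch, measured in Hz, does not dominate the distance.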
Another important media class is graphics, to include tables and charts (e.g., column,
bar, line, pie, scatter).
Graphs are constructed from more primitive data elements such as points, lines, and
labels.
Sagebook
An example of a graph retrieval system created at Carnegie Mellon University.
Enables both search and customization of stored data graphics.
Supports data graphic query, representation (content description), indexing, search,
and adaptation capabilities.
o Queries are formulated via a graphical direct-manipulation interface by
selecting and arranging spaces (e.g., charts, tables),
objects contained within those spaces (e.g., marks, bars), and
object properties (e.g., color, size, shape, position).
o Relevant graphics stored in a library are retrieved by matching the content and/or
properties of the graphical query.
o Both exact matching and similarity-based matching are performed.
Maintains an internal representation of the syntax and semantics of data-graphics
(spatial relationships between objects, relationships between data domains, and the
various graphic and data attributes)
Search is performed both on graphical and data properties to enable varying degrees
of match relaxation.
Provides automatic adaptation techniques that can modify retrieved graphics that
do not match the specified query.
The ability to retrieve graphics by content enables new capabilities.
5.7 Imagery Retrieval
Increasing volumes of images have raised the need for more effective and efficient
imagery access.
There is a need to index and search not only the metadata (e.g., captions,
annotations) associated with the imagery but also to retrieve directly on the content of
the imagery.
The automatic indexing of visual features of imagery (e.g., color, texture, shape) is used
for retrieving similar images without the burden of manual indexing.
However, the ultimate objective is semantic-based access to imagery.
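Indexing on a visual feature such as color can be illustrated with a coarse color histogram compared by histogram intersection. This is one common technique chosen for the example, not a method named in these notes. The function names color_histogram and histogram_intersection are hypothetical, and an image is assumed to be a non-empty list of (R, G, B) tuples with values from 0 to 255.

```python
def color_histogram(pixels, bins=4):
    """Index an image as a normalized histogram over bins**3 coarse RGB cells."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        cell = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[cell] += 1
    total = len(pixels)
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity between 0 and 1: the overlap of two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


red = color_histogram([(255, 0, 0)] * 4)
mostly_red = color_histogram([(250, 10, 5)] * 3 + [(0, 0, 255)])
blue = color_histogram([(0, 0, 255)] * 4)
```

Here histogram_intersection(red, mostly_red) is higher than histogram_intersection(red, blue), so the mostly red image would rank first for a red query image. The match is on color distribution only and says nothing about what the image depicts, which is why semantic access remains the harder goal.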