Unit 5 IRS

Irs unit 5.cse3 rd year 1 sem

Uploaded by

anjuanjani769

TEXT SEARCH ALGORITHMS

 There are three classical text retrieval techniques that are defined for organizing items
in a textual database, for rapidly identifying the relevant items and for eliminating
items that do not satisfy the search.
 They are
o Full text scanning (streaming)
o Word inversion
o Multi attribute retrieval
 In addition to indexes, streaming of text was used for searching text in information
systems.

5.1 Introduction to Text Search Techniques

Text scanning system


 Basic concept of a text scanning system
o One or more users can enter queries.
o The text to be searched is accessed and compared to the query terms.
o When all of the text has been accessed, the query is complete.
 Advantage of text scanning system
o As soon as an item is identified as satisfying a query, the results can be
presented to the user for retrieval.
 Architecture

 database
o Contains the full text of the items.
 term detector
o Special hardware/software that contains all of the terms being searched for.
o It will input the text and detect the existence of the search terms.
o It will output to the query resolver the detected terms to allow for final logical
processing of a query against an item.
o In hardware search machines
 Multiple parallel search machines (term detectors) may work against
the same data stream, allowing for more queries, or against different
data streams, reducing the time to access the complete database.
o In software systems
 Multiple detectors may execute at the same time.
o Two approaches to the data stream.
 In the first approach, the complete database is sent to the
detector(s), functioning as a search of the database.
 In the second approach, randomly retrieved items are passed to the
detectors.
 query resolver
o It performs two functions.
o It will accept search statements from the users, extract the logic and search
terms and pass the search terms to the detector.
o It also accepts results from the detector and determines which queries are
satisfied by the item and possibly the weight associated with the hit.
 user interface
o The query resolver passes information to the user interface, which
continually updates the search status for the user.
o On request, it retrieves any items that satisfy the user search statement.
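
The architecture above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the names detect_terms and scan, the whitespace tokenization, and the tiny in-memory database are all hypothetical and are not part of any system described in these notes. It shows a term detector reporting which terms it saw and a query resolver applying the query logic and releasing each hit as soon as it is found.

```python
from typing import Callable, Iterable, Iterator, Set, Tuple


def detect_terms(text: str, terms: Set[str]) -> Set[str]:
    """Term detector: report which of the search terms occur in the item text."""
    words = set(text.lower().split())  # assumption: simple whitespace tokenization
    return {t for t in terms if t in words}


def scan(items: Iterable[Tuple[str, str]],
         query_terms: Set[str],
         logic: Callable[[Set[str]], bool]) -> Iterator[str]:
    """Query resolver: stream every item past the detector, yield hits immediately."""
    for item_id, text in items:
        found = detect_terms(text, query_terms)
        if logic(found):      # final logical processing of the query against the item
            yield item_id     # the hit is available before the scan has finished


# Hypothetical three-item database.
database = [
    ("doc1", "hardware text search machines"),
    ("doc2", "inverted index storage overhead"),
    ("doc3", "text scanning without an index"),
]

# Query: text AND NOT index
hits = list(scan(database, {"text", "index"},
                 lambda found: "text" in found and "index" not in found))
print(hits)
```

Because scan is a generator, a caller can display the first hit while the rest of the database is still being streamed.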

 Inversions/indexes
o gain their speed by minimizing the amount of data to be retrieved and provide
the best ratio between the total number of items delivered to the user versus
the total number of items retrieved in response to a query.
o require storage overheads of 50% to 300%.
o must process the complete query before any hits are determined or available,
whereas a text scanning system can return hits to the user as soon as they are found.
o encounter problems in fuzzy searches and embedded string query terms.
o make it difficult to locate all the possible index values short of searching the
complete dictionary of possible terms.

 Finite state automata


 Many of the hardware and software text searchers use finite state automata.
 A finite state automaton is a logical machine that is composed of
o I - a set of input symbols from the alphabet supported by the automaton
o S - a set of possible states
o P - a set of productions that define the next state based upon the current state
and input symbol
o a special state called the initial state
o a set of one or more final states from the set S

 It is possible to represent the productions by a table with the states as the rows and the
input symbols that cause state transitions as the columns.
 Each row represents the current state, and the values in the table are the next
state given the particular input symbol.
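
As a small illustration of the table representation, the sketch below builds an automaton that reaches its final state once the term "cat" has appeared in the input. The term, the state numbering, and the catch-all column named "other" are assumptions chosen for the example.

```python
INITIAL_STATE = 0
FINAL_STATES = {3}

# Rows are the current state, columns are input symbols,
# and each value is the next state (the productions P).
TABLE = {
    0: {"c": 1, "a": 0, "t": 0, "other": 0},
    1: {"c": 1, "a": 2, "t": 0, "other": 0},
    2: {"c": 1, "a": 0, "t": 3, "other": 0},
    3: {"c": 3, "a": 3, "t": 3, "other": 3},  # final state is kept once reached
}


def accepts(text: str) -> bool:
    """Run the automaton over the text and report whether it ends in a final state."""
    state = INITIAL_STATE
    for ch in text:
        symbol = ch if ch in ("c", "a", "t") else "other"
        state = TABLE[state][symbol]
    return state in FINAL_STATES


print(accepts("concatenate"))  # the term occurs inside the word
print(accepts("coat"))         # the letters occur, but not contiguously
```

Each input character costs exactly one table lookup, which is why this structure suits both hardware and software text searchers.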
5.3 Hardware Text Search Systems

 Issues in Software text search systems


 Limited ability to handle many search terms simultaneously against the same text,
and limits due to I/O speeds.
 Hardware Text Search Systems
 A specialized hardware machine performs the searches and passes the results to
the main computer, which supports the user interface and retrieval of hits.
 Advantages of hardware text search systems
 Scalability by increasing the number of hardware search devices.
 Elimination of the index that represents the document database.
 New items can be searched as soon as received by the system rather than
waiting for the index to be created
 Search speed is deterministic.
 It is slower than using an index, but provides the user with an exact search
time.
 Architecture
 Figure represents hardware text search solutions.
 The algorithmic part of the system is focused on the term detector.
 There are three approaches to implement term detectors:
 parallel comparators or associative memory
 cellular structure
 universal finite state automata
 When the term comparator is implemented with parallel comparators, each term in the
query is assigned to an individual comparison element and input data are serially
streamed into the detector.
 When a match occurs, the term comparator informs the external query resolver
(usually in the main computer) by setting status flags.
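
A software analogy of the parallel-comparator approach is sketched below. It is only a model under assumptions: the Comparator class and the stream function are hypothetical names, and the inner loop runs sequentially where real hardware would run the comparison elements at the same time.

```python
class Comparator:
    """One comparison element, loaded with a single query term."""

    def __init__(self, term: str) -> None:
        self.term = term
        self.window = ""   # the most recent len(term) characters of the stream
        self.flag = False  # status flag read by the query resolver

    def feed(self, ch: str) -> None:
        self.window = (self.window + ch)[-len(self.term):]
        if self.window == self.term:
            self.flag = True


def stream(text: str, terms: list) -> dict:
    """Serially stream the text into all comparators and return their status flags."""
    comparators = [Comparator(t) for t in terms]
    for ch in text:
        for c in comparators:  # in hardware these elements work in parallel
            c.feed(ch)
    return {c.term: c.flag for c in comparators}


print(stream("fast data finder", ["data", "disk"]))
```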

 Examples of hardware text string search units


 Rapid Search Machine by General Electric.
 In this, a single query was passed against a magnetic tape containing the
documents.
 Associative File Processor (AFP) by Operating Systems Inc.
 It is capable of searching against multiple queries at the same time.
 High Speed Text Search (HSTS) machine by Operating Systems Inc.
 It uses a finite state machine algorithm that runs three parallel state machines.
 One state machine is dedicated to contiguous word phrases
 another to embedded term match
 the third to exact word match

 The GESCAN system


 Uses a text array processor (TAP) that simultaneously matches many terms
and conditions against a given text stream.
 The TAP receives the query information from the user’s computer and
directly accesses the textual data from secondary storage.
 TAP consists of a large cache memory and an array of four to 128 query
processors.
 Text is loaded into the cache and searched by the query processors.
 Each query processor is independent and can be loaded at any time.
 A complete query is handled by each query processor.
 Queries support exact term matches, fixed length don’t cares, variable length
don’t cares, terms may be restricted to specified zones, Boolean logic, and
proximity.
 A query processor performs two operations in parallel: matching query terms to
input text and Boolean logic resolution.
 Term matching is performed by a series of character cells each containing one
character of the query.
 A string of character cells is implemented on the same LSI chip and the chips
can be connected in series for longer strings.
 When a word or phrase of the query is matched, a signal is sent to the
resolution sub-process on the LSI chip.
 The resolution chip is responsible for resolving the Boolean logic between
terms and proximity requirements.

 If the item satisfies the query, the information is transmitted to the user’s
computer.
 The text array processor uses these chips in a matrix arrangement as shown in
Figure 9.10.
 Each row of the matrix is a query processor in which the first chip performs
the query resolution while the remaining chips match query terms.
 The maximum number of characters in a query is restricted by the length of a
row while the number of rows limits the number of simultaneous queries that
can be processed.
 Another approach for hardware searchers is to augment disk storage.
 The augmentation is a generalized associative search element placed between
the read and write heads on the disk.
 Examples
 Content Addressable Segment Sequential Memory (CASSM) system
 developed at the University of Florida
 uses search elements in parallel to obtain structured data from a database.
 performs string searching across the database.

 Relational Associative Processor (RAP)


 Another special search machine developed at the University of Toronto.
 Performs search across a secondary storage device using a series of cells
comparing data in parallel.

 Fast Data Finder (FDF)


 Most recent specialized hardware text search unit.
 It was developed to search text and has been used to search English and
foreign languages.
 The early Fast Data Finders consisted of an array of programmable text
processing cells connected in series forming a pipeline hardware search
processor.
 The cells are implemented using a VLSI chip.
 Each chip contained 24 processor cells, with a typical system containing 3600
cells.
 Each cell will be a comparator for a single character limiting the total number
of characters in a query to the number of cells.
 The cells are interconnected with an 8-bit data path and approximately 20-bit
control path.
 The text to be searched passes through each cell in a pipeline fashion until the
complete database has been searched.
 As data is analyzed at each cell, the states of the 20 control lines are modified
depending upon their current state and the results from the comparator.
 A cell is composed of both a register cell (Rs) and a comparator (Cs).
 The input from the document database is controlled and buffered by the
microprocessor/memory and fed through the comparators.
 The search characters are stored in the registers.
 The connection between the registers reflects the control lines that are also passing
state information.

 Groups of cells are used to detect query terms, along with logic between the terms,
by appropriate programming of the control lines.
 When a pattern match is detected, a hit is passed to the internal microprocessor,
which passes it back to the host processor, allowing immediate access by the user to
the hit item.
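
The idea of a chain of single-character cells can be modelled in software as follows. This is a simplified sketch and not the actual Fast Data Finder design: a single boolean per cell stands in for the control lines, the "?" symbol for a one-character don't care is an assumption, and the function name is hypothetical. It assumes a non-empty pattern.

```python
def cell_pipeline_search(text: str, pattern: str, dont_care: str = "?") -> list:
    """Return the start positions where the pattern matches the text.

    Cell i holds pattern[i]. After each input character, active[i] is True
    when cells 0..i have matched the most recent i + 1 characters.
    """
    n = len(pattern)
    active = [False] * n
    hits = []
    for pos, ch in enumerate(text):
        new_active = [False] * n
        for i in range(n):
            # Cell 0 is always armed; later cells depend on their left neighbour.
            armed = True if i == 0 else active[i - 1]
            matches = pattern[i] == dont_care or pattern[i] == ch
            new_active[i] = armed and matches
        active = new_active
        if active[-1]:                # the last cell fired: a complete match
            hits.append(pos - n + 1)
    return hits


print(cell_pipeline_search("text and test", "te?t"))
```

As in the hardware, the longest query that can be matched is bounded by the number of cells, here the length of the pattern list.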

 The functions supported by the Fast Data Finder are:


 Boolean logic, including negation
 Proximity on an arbitrary pattern
 Variable length “don’t cares”
 Term counting and thresholds
 Fuzzy matching
 Term weights
 Numeric ranges
Multimedia Information Retrieval

 Text elements that are used for indexing are


o Characters
o word stems
o words
o Phrases.
 Imagery, audio, and video elements are
o In audio: Phonemes (or basic units of sound) and their properties (e.g.,
loudness, pitch),
o In imagery: color, shape, texture, and location
o In video: imagery and audio elements, camera position and movement.
 The number of users demanding content-based access to these materials is increasing,
and approximately 10 million sites are on the World Wide Web.

5.4 Spoken Language Audio Retrieval


 The ability to search the content of audio sources such as speeches, radio broadcasts,
and conversations would be valuable for a range of applications.
 Techniques developed are
o automated recognition of speech
 application areas are
 speaker verification
 transcription
 command and control
Evaluation, Issues, and findings
 Speech and text retrieval in the context of the Video Mail Retrieval (VMR) project by
Jones et al.
o speech transcription word error rates may be high
o Redundancy in the source material helps offset these error rates and still
support effective retrieval.
o Speaker-dependent techniques retain approximately 95% of the performance of
retrieval of text transcripts.
o Speaker-independent techniques retain about 75%.
o System scalability remains a significant challenge.
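
Word error rate is normally computed as the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition; the function name and the example sentences are made up, and it assumes a non-empty reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[-1][-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```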

 BBN’s Rough ’n’ Ready prototype


o Provides information access to spoken language from audio and video sources.
o It creates a rough summarization of speech that is ready for browsing.
o Its transcription is created by the BYBLOS™ large vocabulary speech
recognition system, a continuous-density Hidden Markov Model (HMM) system
tested in annual formal evaluations for the past 12 years.
o BYBLOS runs at 3 times real-time, uses a 60,000 word dictionary, and
reported word error rates of 18.8% for the broadcast news transcription task.
o Multilingual information access is also being addressed.

 Tokyo Institute of Technology and NHK broadcasting


o It addresses transcription and topic extraction from Japanese broadcast news.
o It improves processing by modeling filled pauses, performing on-line
incremental speaker adaptation, and using a context-dependent language
model.
o The language model includes Chinese characters and two kinds of Japanese
characters.
5.5 Non-Speech Audio Retrieval

 In addition to content-based access to speech audio, noise/sound retrieval is also
important in such fields as music and movie/video production.

 SoundFisher
o It is a user-extensible sound classification and retrieval system.
o It draws from several disciplines, including signal processing,
psychoacoustics, speech recognition, computer music, and multimedia
databases.
o Just as image indexing algorithms use visual feature vectors to index and match
images, a vector of directly measurable acoustic features such as duration,
loudness, pitch, and brightness is used to index sounds.
o This enables users to search for sounds within specified feature ranges.
o Content-based retrieval application enables a user to browse and/or query a
sound database by acoustic (e.g., pitch, duration) and/or perceptual properties
(e.g., “scratchy”) and/or query by example.
o For example, SoundFisher supports complex content queries such as “Find all
AIFF encoded files with animal or human vocal sounds that are similar to
barking sounds without regard to duration or amplitude.”
o The user can also perform a weighted query-by-value.
o For example, foreground and transition with >.8 metallic and >.7 plucked
aural properties and 2000 Hz < average pitch < 300 Hz and duration.
o The system can also be trained by example, so that perceptual properties (e.g.,
“scratchiness” or “buzziness”) that are more indirectly related to acoustic
features can be specified and retrieved.

o Additional requirements identified by research are


 need for sound displays
 sound synthesis (a kind of query formulation/refinement tool)
 sound separation
 matching of trajectories of features over time
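
Searching within feature ranges and querying by example can both be illustrated on a toy feature table. Everything in this sketch is hypothetical: the three sounds, their feature values, and the function names are invented for illustration, and plain Euclidean distance on unnormalized features is an assumption, not the measure SoundFisher itself uses.

```python
import math

# Hypothetical database: one acoustic feature vector per sound.
sounds = {
    "dog_bark":  {"duration": 0.4, "loudness": 0.8, "pitch": 450.0, "brightness": 0.6},
    "door_slam": {"duration": 0.3, "loudness": 0.9, "pitch": 120.0, "brightness": 0.3},
    "flute":     {"duration": 2.0, "loudness": 0.5, "pitch": 880.0, "brightness": 0.7},
}


def range_query(db: dict, ranges: dict) -> list:
    """Return the sounds whose features all fall inside the given (low, high) ranges."""
    return sorted(name for name, feats in db.items()
                  if all(lo <= feats[k] <= hi for k, (lo, hi) in ranges.items()))


def query_by_example(db: dict, example: dict, features: list) -> list:
    """Rank sounds by distance to the example, using only the listed features."""
    def distance(feats: dict) -> float:
        return math.sqrt(sum((feats[k] - example[k]) ** 2 for k in features))
    return sorted(db, key=lambda name: distance(db[name]))


print(range_query(sounds, {"pitch": (400, 1000), "loudness": (0.7, 1.0)}))
# Similarity "without regard to duration": duration is left out of the feature list.
print(query_by_example(sounds, {"loudness": 0.85, "brightness": 0.55},
                       ["loudness", "brightness"]))
```

Leaving a feature out of the list is one simple way to express "without regard to" in a similarity query; a real system would also need to normalize features that have very different scales.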
5.6 Graph Retrieval

 Another important media class is graphics, including tables and charts (e.g., column,
bar, line, pie, scatter).
 Graphs are constructed from more primitive data elements such as points, lines, and
labels.

Sagebook
 An example of a graph retrieval system created at Carnegie Mellon University.
 Enables both search and customization of stored data graphics.
 Supports data graphic query, representation (content description), indexing, search,
and adaptation capabilities.
o Queries are formulated via a graphical direct-manipulation interface by
 selecting and arranging spaces (e.g., charts, tables),
 objects contained within those spaces (e.g., marks, bars)
 object properties (e.g., color, size, shape, position).
o Relevant graphics stored in a library are retrieved by matching the content and/or
properties of the graphical query.
o Both exact matching and similarity-based matching are performed.
 Maintains an internal representation of the syntax and semantics of data-graphics
(spatial relationships between objects, relationships between data domains, and the
various graphic and data attributes)
 Search is performed both on graphical and data properties to enable varying degrees
of match relaxation.
 Provides automatic adaptation techniques that can modify retrieved graphics that
do not exactly match the specified query.
 Provides new capabilities for retrieving graphics by content.
5.7 Imagery Retrieval

 Increasing volumes of images have raised the need for more effective and efficient
imagery access.
 There are needs for indexing and search of not only the metadata (e.g., captions,
annotations) associated with the imagery but also retrieval directly on the content of
the imagery.
 Automatic indexing of the visual features of imagery (e.g., color, texture, shape) can be
used to retrieve similar images without the burden of manual indexing.
 However, the ultimate objective is semantic based access to imagery.

Query By Image Content (QBIC) system


 UltimediaManager, the commercial version of QBIC, represents the imagery attribute
indexing approach.
 It provides access to imagery collections on the basis of visual properties such as
color, shape, texture, and sketches.
 Query facilities for specifying color parameters, drawing desired shapes, or selecting
textures replace the traditional keyword query found in text retrieval.
 As robust, domain independent object identification remains difficult and manual
image annotation is tedious, automated and semi automated object outlining tools
(e.g., foreground/background models to extract objects) were developed to facilitate
database population.
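
A query on color can be illustrated with normalized color histograms compared by histogram intersection. This is a common textbook technique used here as a stand-in; the notes do not say which measure QBIC uses. The color names, pixel lists, and function names are hypothetical.

```python
from collections import Counter


def color_histogram(pixels: list) -> dict:
    """Fraction of pixels falling into each color bin."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}


def histogram_intersection(h1: dict, h2: dict) -> float:
    """Similarity in [0, 1]: the overlap of two normalized histograms."""
    return sum(min(h1.get(c, 0.0), h2.get(c, 0.0)) for c in set(h1) | set(h2))


# Hypothetical images reduced to lists of quantized pixel colors.
query = color_histogram(["red", "red", "red", "blue"])
library = {
    "sunset": color_histogram(["red", "red", "blue", "blue"]),
    "forest": color_histogram(["green", "green", "green", "green"]),
}

ranked = sorted(library, key=lambda name: histogram_intersection(query, library[name]),
                reverse=True)
print(ranked)
```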

Content based imagery access to video retrieval.


 More recently researchers have investigated the application of content based imagery
access to video retrieval.
 For example, shot detection is performed, a representative frame (r-frame or
keyframe) is extracted for each shot, and a layered representation of moving
objects is derived.
 This enables queries such as “find me all shots panning left to right”, which yield a list
of relevance-ranked r-frames (each acting as a thumbnail), selection of which retrieves
the associated video shot.
 Additional research in image processing
o has addressed specific kinds of content-based retrieval problems.
o includes face processing, which distinguishes
 face detection (identifying a face or faces in a scene),
 face recognition (authenticating that a given face is of a particular
person),
 face retrieval (finding the closest matching face in a repository).
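
The shot detection and representative-frame step mentioned earlier in this section can be sketched with a simple frame-difference threshold. This is an assumption-laden toy: frames are short lists of pixel intensities, the threshold is arbitrary, the middle frame is chosen as the r-frame, and the function names are hypothetical. It assumes at least one frame, and real systems use far more robust cues.

```python
def frame_difference(a: list, b: list) -> float:
    """Mean absolute intensity difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_shots(frames: list, threshold: float) -> list:
    """Split the frame sequence into (first, last) index pairs at large differences."""
    shots, start = [], 0
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            shots.append((start, i - 1))  # a cut lies between frame i-1 and frame i
            start = i
    shots.append((start, len(frames) - 1))
    return shots


def keyframes(shots: list) -> list:
    """Pick the middle frame of each shot as its representative frame."""
    return [(first + last) // 2 for first, last in shots]


# Hypothetical video: three dark frames followed by two bright frames.
frames = [[10, 10], [12, 11], [11, 10], [200, 190], [198, 192]]
shots = detect_shots(frames, threshold=50)
print(shots, keyframes(shots))
```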

 Systems have also been developed


o to track human movement (e.g., heads, hands, feet)
o to differentiate human expressions such as a smile, surprise, anger, or disgust.
o to support research in emotion recognition in the context of human computer
interaction.

 Informedia Digital Video Library system


o Face recognition is also important in video retrieval.
o This system extracts information from audio and video and supports full
content search over digitized video sources.
o It provides a facility called named face which automatically associates a name
with a face and enables the user to search for a face given a name and vice
versa.
5.8 Video Retrieval

 Applications that need content-based access to video include


o access to video mail
o videotaped meetings
o surveillance video
o broadcast television
 Broadcast News Navigator (BNN) system
o It is a web-based tool that automatically captures, annotates, segments,
summarizes and visualizes stories from broadcast news video.
o Integrates text, speech, and image processing technologies to perform
multistream analysis of video to support content-based search and retrieval.
o Addresses the problem of time-consuming, manual video acquisition /
annotation techniques that frequently result in inconsistent, error-prone or
incomplete video catalogues.
o From BNN’s video query page, the user can
 search among thirty national or local news sources,
 specify an absolute or relative date range,
 search closed captions or speech transcriptions,
 run a pre-specified profile,
 search on text keywords or concepts that express topics or named
entities such as people, organizations, and locations.
o BNN automatically generates a custom query web page which includes menus
of people and location names from content extracted over the relevant time
period.
o It incorporates the Alembic natural language information extraction system.
o Supports simple browsing of stories during particular time intervals or from
particular sources.
o Ability to display a graph of named entity frequency over time.
o The user can automatically data mine the named entities in the analyzed stories
using the “search for correlations” link shown on the left panel.
o Users are able to find video content about six times as fast.
o Automated segmentation of news programs into individual stories using cross
media cues such as visual changes, speaker changes, and topic changes
enhanced the performance.

 Topic detection and tracking (TDT)


o Topic detection and tracking initiative for broadcast news and newswire
sources aims to investigate algorithms that perform
 story segmentation (detection of story boundaries)
 topic tracking (detection of stories that discuss a topic, for each given
target topic)
 topic detection (detection of stories that discuss an arbitrary topic, for
all topics).
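
Topic tracking can be illustrated with a bag-of-words model: the stories already known to discuss a target topic are merged into one term vector, and each incoming story is accepted if its cosine similarity to that vector reaches a threshold. This is a generic sketch, not the method of any TDT participant; the function names, the threshold of 0.3, and the example stories are assumptions.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (assumption: whitespace tokens, lower-cased)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def track_topic(topic_stories: list, incoming: list, threshold: float = 0.3) -> list:
    """Return the incoming stories that are similar enough to the target topic."""
    topic = Counter()
    for story in topic_stories:
        topic.update(vectorize(story))
    return [s for s in incoming if cosine(topic, vectorize(s)) >= threshold]


training = ["earthquake hits coastal city", "rescue teams search earthquake rubble"]
stream = ["earthquake rescue continues in city", "stock markets close higher"]
on_topic = track_topic(training, stream)
print(on_topic)
```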

 Geospatial News on Demand Environment (GeoNODE)


o Whereas BNN focuses on story segmentation, GeoNODE addresses topic
detection and tracking.
o It presents news in a geospatial and temporal context.
o An analyst can navigate the information space through indexed access into
multiple types of information sources (from broadcast video, on-line
newspapers to specialist archives)
o It incorporates information extraction, data mining/correlation and
visualization components.
o GeoNODE can automatically nominate and animate topics from
sources, thereby directing analysis to the relevant documents having the right
topic, the right time, and the right place.
