IRS Unit 5
This system allows for real-time search and retrieval of text items without needing extra storage, but may
face performance challenges with I/O speed and certain types of queries.
● Both the Aho-Corasick and Knuth-Morris-Pratt (KMP) algorithms examine the same number of
characters in the input text during a search.
● The new algorithm improves upon these by making state transitions independent of the number
of search terms, and the search operation is linear with respect to the number of characters in
the input stream.
● The comparison count is proportional to T (the length of the text) multiplied by a constant w > 1,
providing a significant improvement over KMP (which depends on the query size) and
Boyer-Moore (which handles only one search term).
Baeza-Yates and Gonnet's Extension:
● This algorithm can handle "don’t care" symbols, complement symbols, and up to k mismatches.
● It uses a vector of m states, where m is the length of the search pattern, with each state
corresponding to a specific portion of the pattern matched to the input text.
● The algorithm provides a fuzzy search by determining the number of mismatches between the
search pattern and the text. If mismatches occur, the vector helps track the positions where
matches may happen, supporting fuzzy searches.
Shift-Add Algorithm:
● The Shift-Add algorithm utilizes this vector representation and performs a comparison by
updating the vector as it moves through the text.
● For each character in the pattern, a table T(x) is maintained that stores the status (match or
mismatch). When the vector value is zero, a perfect match is found.
● Don’t care symbols and complementary symbols can be included, making the algorithm flexible
for varied search types.
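The state vector and per-character table T(x) described above can be sketched in bit-parallel form; the sketch below implements the closely related Shift-And (bitap) formulation of this idea, with "?" used as an assumed don't-care symbol (the source does not fix a concrete syntax):

```python
def shift_and_search(text, pattern, dont_care="?"):
    """Bit-parallel exact search (Shift-And / bitap sketch).

    Bit i of `state` is set when pattern[:i+1] matches a suffix of the
    text scanned so far; `mask` plays the role of the per-character
    T(x) table, and `wild` collects the don't-care positions."""
    m = len(pattern)
    wild = 0                        # bits matched by any character
    mask = {}
    for i, c in enumerate(pattern):
        if c == dont_care:
            wild |= 1 << i
        else:
            mask[c] = mask.get(c, 0) | (1 << i)
    state, hits = 0, []
    for j, c in enumerate(text):
        # shift in a length-0 match, keep only bits this character allows
        state = ((state << 1) | 1) & (mask.get(c, 0) | wild)
        if state & (1 << (m - 1)):  # top bit set: full pattern matched
            hits.append(j - m + 1)
    return hits
```

For example, `shift_and_search("abracadabra", "a?ra")` matches at positions 0 and 7, since the don't-care position accepts both "b" and "d".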
Extensions by Wu and Manber:
● The Shift-Add algorithm was extended by Wu and Manber to handle insertions, deletions, and
positional mismatches as well.
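The Wu and Manber extension keeps k+1 state vectors, one per allowed error count. A hedged sketch of that agrep-style recurrence (reporting end positions of matches within k edits; the exact formulation in the source may differ in detail):

```python
def fuzzy_bitap(text, pattern, k):
    """Bit-parallel approximate search after Wu & Manber (sketch).

    R[d] tracks which pattern prefixes match a suffix of the scanned
    text with at most d edits (substitution, insertion, deletion)."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    # prefixes of length <= d match the empty text via d deletions
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for j, c in enumerate(text):
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & mask.get(c, 0)
        for d in range(1, k + 1):
            tmp = R[d]
            R[d] = ((((R[d] << 1) | 1) & mask.get(c, 0))  # match
                    | old | (old << 1)                    # insertion | substitution
                    | (R[d - 1] << 1) | 1)                # deletion
            old = tmp
        if R[k] & (1 << (m - 1)):
            hits.append(j)      # end position of an approximate match
    return hits
```

With k = 0 this degenerates to the exact bit-parallel search; with k = 1, for instance, pattern "abcd" matches "abd" (one deletion) and "abXd" (one substitution).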
Hardware Implementation:
● One of the advantages of the Shift-Add algorithm is its ease of implementation in hardware,
making it efficient for real-time applications.
● Limitations: Software text search faces restrictions such as a limited ability to handle multiple
search terms simultaneously and slow I/O speeds.
● Hardware Solution: To offload the resource-intensive searching, specialized hardware search
machines were developed. These machines perform searches independently of the main
processor and send the results to the main computer.
1. Scalability: Speed improves with the addition of more hardware devices (one per disk), and the
only limiting factor is the speed of data transfer from secondary storage (disks).
2. No Need for Indexing: Unlike traditional systems that need large indexes (often 70% of the size
of the documents), hardware search machines do not require an index, allowing immediate
searches as new items arrive.
3. Deterministic Speed: While hardware searches may be slower than indexed searches, they
provide predictable search times, and results are available immediately as hits are found.
1. Paracel Searcher (formerly Fast Data Finder): A specialized hardware text search system with an
array of programmable processing cells. Each cell compares a single character in the query,
making it scalable based on the number of cells.
○ Fast Data Finder (FDF): The system uses cells interconnected in series, each handling a
single character comparison, and is used for complex searches, including Boolean logic,
proximity, and fuzzy matching.
2. GESCAN: Uses a Text Array Processor (TAP) that matches multiple terms and conditions in
parallel. It supports various search features like exact term matches, don’t-care symbols, Boolean
logic, and proximity.
3. Associative File Processor (AFP): An early hardware search unit capable of handling multiple
queries simultaneously.
4. Content Addressable Segment Sequential Memory (CASSM): Developed as a general-purpose
search device, this system can be used for string searching across a database.
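The one-character-per-cell design of the FDF-style arrays above can be modeled in software. This toy sketch passes a partial-match token down a chain of cells, one cell per streamed character; real hardware evaluates many such chains, for many queries, in parallel:

```python
def systolic_match(text, pattern):
    """Toy model of an FDF-style cell array: cell i holds pattern[i],
    and flags[i] records that pattern[:i+1] just matched, advancing
    one cell per character streamed past the array."""
    cells = list(pattern)
    flags = [False] * len(cells)
    hits = []
    for j, ch in enumerate(text):
        new = [False] * len(cells)
        for i, pc in enumerate(cells):
            # cell 0 is fed a constant "start" token; later cells take
            # the token from their upstream neighbor
            upstream = True if i == 0 else flags[i - 1]
            new[i] = upstream and (ch == pc)
        flags = new
        if flags[-1]:                       # token left the last cell
            hits.append(j - len(cells) + 1)
    return hits
```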
● The Fast Data Finder (FDF) has been adapted for genetic analysis, including DNA and protein
sequence matching. It is used in Smith-Waterman (S-W) dynamic programming for sequence
similarity searches and for identifying conserved regions in biological sequences.
● The Biology Tool Kit (BTK) integrates with the FDF to perform fuzzy matching for biological
sequence data.
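For reference, the Smith-Waterman dynamic programming the FDF accelerates fills a scoring matrix in which each cell takes the best of a diagonal match/mismatch step, a gap step, or zero. A minimal score-only sketch, with illustrative scoring parameters (not the FDF's):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Minimal Smith-Waterman local-alignment score.

    H[i][j] holds the best local alignment score ending at a[i-1],
    b[j-1]; clamping at zero is what makes the alignment local."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```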
Limitations:
● Expense: The cost of the specialized hardware has limited the widespread adoption of
hardware search systems.
● Database Size: The entire database must be streamed past the search unit for every query,
which is resource-intensive.
Multimedia Information Retrieval:
Spoken Language Audio Retrieval:
● Value of Speech Search: Just like text search, the ability to search audio sources such as
speeches, radio broadcasts, and conversations would be valuable for various applications,
including speaker verification, transcription, and command and control.
● Challenges in Speech Recognition:
○ Word Error Rates: Transcription of speech can be challenging due to high word error
rates, which can be as high as 50%, depending on factors like the speaker, the type of
speech (dictation vs. conversation), and environmental conditions. However, redundancy
in the source material can help offset these errors, still allowing effective retrieval.
○ Lexicon Size: While speech recognition systems often have lexicons of around 100,000
words, text systems typically contain much larger lexicons, sometimes over 500,000
words, which adds complexity to speech recognition.
○ Development Effort: A significant challenge is the effort needed to develop an
annotated corpus (e.g., a video mail corpus) to train and evaluate speech recognition
systems.
● Performance: Research by Jones et al. (1997) compared speech retrieval to text retrieval. The
results showed:
○ Speaker-dependent systems: Retain around 95% of the performance of text-based
retrieval.
○ Speaker-independent systems: Achieve about 75% of the performance of text-based
retrieval.
● System Scalability: Scalability remains a challenge in speech retrieval, partly due to the size of
the lexicon and the complexity of developing annotated corpora for training.
● Rough’n’Ready Prototype:
○ Purpose: Developed by BBN, Rough’n’Ready provides information access to spoken
language from audio and video sources, especially broadcast news. It generates a
summarization of speech for easy browsing.
○ Technology: The transcription is created by the BYBLOS™ system, a large vocabulary
speech recognition system that uses a continuous-density Hidden Markov Model
(HMM).
○ Performance: BYBLOS runs at three times real time with a 60,000-word dictionary. The
system has reported a word error rate (WER) of 18.8% for broadcast news
transcription.
● Multilingual Efforts:
○ LIMSI North American Broadcast News System: Reported a 13.6% word error rate and
focused on multilingual information access.
○ Tokyo Institute of Technology & NHK: Joint research focused on transcription and topic
extraction from Japanese broadcast news. This project aims to improve accuracy by
modeling filled pauses, performing online incremental speaker adaptation, and using a
context-dependent language model.
● Filled Pauses: One area of focus in improving speech recognition systems is the handling of filled
pauses (e.g., "um," "uh") in natural speech.
● Speaker Adaptation: Improving speaker adaptation is crucial, as different speakers have different
speaking styles and accents. On-line incremental adaptation is a key strategy to improve
recognition over time.
● Language Models: Using context-dependent language models can significantly improve the
performance of speech recognition systems by considering the sequence and context of words,
including special characters such as Kanji (Chinese characters used in Japanese), Hiragana, and
Katakana (Japanese syllabaries).
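The word error rates quoted throughout this section follow the standard definition: word-level edit distance (substitutions + deletions + insertions) normalized by the length of the reference transcript. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, computed as word-level edit distance
    between the reference transcript and the recognizer output,
    divided by the number of reference words N."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, against the reference "a b c d", the hypothesis "a x c" has one substitution and one deletion, giving a WER of 2/4 = 0.5.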
Challenges in Multilingual Speech Processing:
● Multilingual Transcription: Efforts are being made to extend broadcast news transcription
systems to support multiple languages, including German, French, and Japanese.
● Contextual Models for Multilingual Support: Developing models that handle multiple languages
with different writing systems (e.g., Chinese characters, Japanese characters) poses an additional
challenge in improving the accuracy and performance of speech recognition systems across
languages.
Non-Speech Audio Retrieval:
● Content-Based Search: Users can search for sounds by their acoustic properties (e.g., pitch,
loudness, duration) or perceptual properties (e.g., “scratchy”).
○ Example: A user might search for "all AIFF encoded files with animal or human vocal
sounds that resemble barking" without specifying the exact duration or amplitude.
○ Query by Example: Users can also train the system by example, where the system learns
to associate perceptual properties (like "scratchiness" or "buzziness") with the sound
features.
○ Weighted Queries: The system supports complex weighted queries based on different
sound characteristics. For example, a query might specify a foreground sound with
metallic and plucked properties, and a specific pitch range.
● Training by Example: Users can train the system to recognize and retrieve sounds with indirectly
related perceptual properties, allowing for more flexible and nuanced searches.
● Performance Evaluation: The system was tested using a database of 400 sound files, including
sounds from nature, animals, instruments, and speech.
● Additional System Requirements:
○ Sound Displays: Visual representations of sound data may be necessary for better
understanding and refining searches.
○ Sound Synthesis: This refers to a query formulation or refinement tool that helps users
create or refine sound queries.
○ Sound Separation: This involves separating overlapping sound features or elements
within a given sound.
○ Matching Feature Trajectories Over Time: The system also supports tracking how
features (e.g., pitch, loudness) evolve over time in a sound, allowing for more dynamic
and sophisticated search queries.
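Weighted acoustic queries of the kind described above can be sketched as a weighted distance over a feature vector. The feature names, weights, and example sounds below are illustrative, not the actual representation of any particular system:

```python
import math

def weighted_distance(query, features, weights):
    """Weighted Euclidean distance over named acoustic features."""
    return math.sqrt(sum(w * (query[k] - features[k]) ** 2
                         for k, w in weights.items()))

def rank_sounds(query, sounds, weights):
    """Return sound names ordered from most to least similar."""
    return sorted(sounds, key=lambda name: weighted_distance(
        query, sounds[name], weights))

# Hypothetical sound database: name -> extracted feature vector
sounds = {"bark":  {"pitch": 200,  "loudness": 0.9},
          "chirp": {"pitch": 3000, "loudness": 0.4}}
# A query emphasizing loudness over pitch
query = {"pitch": 250, "loudness": 0.8}
weights = {"pitch": 1.0, "loudness": 10.0}
ranking = rank_sounds(query, sounds, weights)
```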
Graph Retrieval:
● Purpose: SageBook provides a comprehensive system for querying, indexing, and retrieving data
graphics, which includes charts, tables, and other types of visual representations of data. It
allows users to search based on both the graphical elements (e.g., bars, lines) and the underlying
data represented by the graphic.
● Graphical Querying:
○ Users can formulate queries through a graphical direct-manipulation interface called
SageBrush. This allows them to select and arrange various components of a graphic,
such as:
■ Spaces (e.g., charts, tables),
■ Objects (e.g., marks, bars),
■ Object properties (e.g., color, size, shape, position).
○ The left side of Figure 10.3 (not shown here) illustrates how the user can design a query
by manipulating these elements.
● Matching Criteria:
○ The system performs both exact and similarity-based matching of the graphics. For a
successful match, the graphemes (graphical elements such as bars or lines) must not
only belong to the same class (e.g., bars, lines) but also match specific properties (e.g.,
color, shape, size).
○ The retrieved data-graphics are ranked based on their degree of similarity to the query.
For example, in a “close graphics matching strategy”, SageBook will prioritize results that
closely resemble the structure and properties of the query.
● Graphic Adaptation:
○ The system also allows users to manually adapt the retrieved graphics. For instance,
users can modify or eliminate certain elements that do not match the specifications of
the query.
● Grouping and Clustering:
○ To facilitate browsing large collections, SageBook includes data-graphic grouping
techniques based on both graphical and data properties, enabling users to efficiently
browse large collections of graphics.
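The class-then-properties matching described under Matching Criteria above can be sketched as a simple similarity score over graphemes. The property names are illustrative; SageBook's actual internal representation is richer:

```python
def grapheme_similarity(query, candidate,
                        props=("color", "shape", "size", "position")):
    """Return 0.0 if the graphemes belong to different classes
    (e.g., bar vs. line); otherwise the fraction of properties on
    which they agree, so 1.0 is an exact property match."""
    if query.get("class") != candidate.get("class"):
        return 0.0
    return sum(query.get(p) == candidate.get(p) for p in props) / len(props)
```

Retrieved graphics can then be ranked by this score, mirroring the "close graphics matching strategy" in which results that most closely resemble the query come first.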
Internal Representation:
Search Strategies:
● SageBook offers multiple search strategies with varying levels of match relaxation:
○ For graphical properties, users can perform searches with different strategies (e.g., exact
match, partial match).
○ For data properties, there are also alternative search strategies to accommodate
different matching requirements.
Applications:
● Versatility: SageBook’s capability to search and retrieve graphics by content is valuable across
various domains, including:
○ Business Graphics: for financial charts, reports, and presentations.
○ Cartography: for terrain, elevation, and feature maps.
○ Architecture: for blueprints and designs.
○ Communications and Networking: for routers, links, and network diagrams.
○ Systems Engineering: for component and connection diagrams.
○ Military Campaign Planning: for strategic maps and force deployment visualizations.
● Real-World Relevance: The system’s ability to handle complex graphical elements, relationships,
and data attributes makes it applicable in a broad range of fields where visual representations of
data are crucial for analysis, planning, and decision-making.
Imagery retrieval:
● Problem: Traditional image retrieval systems rely heavily on metadata, such as captions or tags,
but these often do not fully capture the visual content of images. As a result, there has been
significant research into content-based retrieval, where images are indexed and searched based
on their visual features.
● Early Approaches:
○ Initial efforts focused on indexing visual features such as color, texture, and shape to
allow for retrieving similar images without needing manual indexing. Notable works
include Niblack and Jain’s algorithm development for automatic indexing of visual
features.
○ QBIC System (Query By Image Content):
■ QBIC, developed by Flickner et al. (1997), is an example of a content-based image
retrieval system that supports queries based on visual properties such as color,
shape, texture, and even sketches.
■ For instance, users could query a collection of US stamps by selecting the color
red or searching for stamps associated with the keyword "president". QBIC
would retrieve images that match these criteria, allowing for more intuitive and
visual-based searching.
○ Refining Queries: Users can refine their search by adding additional constraints. For
example, a query might be refined to include images that contain a red round object
with coarse texture and a green square.
○ Automated and Semi-Automated Object Identification: Since manual annotation of
images is cumbersome, automated tools (e.g., foreground/background models) were
developed to help identify objects in images, facilitating the indexing process.
● Researchers extended the concepts from image retrieval to video retrieval. Flickner et al. (1997)
explored shot detection and keyframe extraction, allowing for queries such as “find me all shots
panning left to right” based on the content of video shots. The system retrieves a list of
keyframes (representative frames) that can then be used to retrieve the associated video shot.
● Informedia Digital Video Library: Developed by Wactlar et al. (2000), this system supports
content-based video retrieval by extracting information from both audio and video. It includes a
feature called named face that automatically associates a name with a face, enabling users to
search for faces by name or vice versa.
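Both QBIC-style color matching and the shot detection mentioned above can be sketched with one primitive, color-histogram intersection. The threshold value is an illustrative choice, not one taken from the systems described:

```python
def hist_intersection(h1, h2):
    """Similarity in [0, 1] between two color histograms of equal
    total mass: the fraction of bin counts that overlap."""
    return sum(min(a, b) for a, b in zip(h1, h2)) / max(sum(h2), 1)

def detect_shot_boundaries(frame_hists, threshold=0.5):
    """Naive shot detection: flag a boundary wherever consecutive
    frames' color histograms overlap less than `threshold`.
    Frames just after a boundary can then serve as keyframes."""
    return [i for i in range(1, len(frame_hists))
            if hist_intersection(frame_hists[i - 1], frame_hists[i]) < threshold]
```

For image retrieval, the same `hist_intersection` score can rank a collection against a query image's histogram; for video, the boundary indices partition the frame sequence into shots.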
Video retrieval:
● Personalcasts and Video Mail: The growing availability of video content (e.g., video mail, taped
meetings, surveillance video, broadcast television) has created a demand for more efficient
access methods. Content-based systems allow users to search video based on its content rather
than relying on manually added tags or metadata.
● BNN System: This system helps create personalcasts from broadcast news, enabling users to
search for and retrieve specific news stories from a large repository of video data.
○ BNN performs automated processing of broadcast news, including capture, annotation,
segmentation, summarization, and visualization of stories.
○ It integrates text, speech, and image processing technologies to allow users to search
video content based on:
■ Keywords
■ Named entities (people, locations, organizations)
■ Time intervals (e.g., specific news broadcast dates)
○ This approach significantly reduces the need for manual video annotation, which can
often be inconsistent or error-prone.
● GeoNODE System: This system focuses on topic detection and tracking for broadcast news and
newswire sources. It allows users to analyze geospatial and temporal contexts for news stories.
○ GeoNODE provides visual analytics by displaying data on a time line of stories related to
specific topics, as well as cartographic visualizations that highlight news mentions of
specific locations (e.g., countries or cities).
○ For example, in the GeoNODE map, the saturation of color indicates the frequency of
news mentions in different regions (e.g., darker colors indicate more mentions).
● Geospatial Search and Data Mining:
○ Users can search for documents that mention specific locations or geospatial trends.
○ The system also supports data mining for discovering correlations among named
entities across multiple news sources.
● GeoNODE Performance: In preliminary evaluations, GeoNODE identified over 80% of
human-defined topics and detected 83% of stories within those topics with a very low
misclassification error (0.2%).
● Integration of Multiple Data Sources: The future of systems like GeoNODE will rely on the ability
to extract and analyze information from a variety of multimedia sources, including text, audio,
and video.
● Machine Learning and Evaluation: As these systems evolve, they will increasingly depend on
machine learning techniques, multimedia corpora, and common evaluation tasks to improve
their performance and capabilities.