
Unit- V

Text Search Algorithms:


Introduction to text search techniques:
1. Concept:
○ A text streaming search system allows users to enter queries, which are then compared
to a database of text. As soon as an item matching a query is found, results can be
presented to the user.
2. System Components:
○ Term Detector: Identifies query terms in the text.
○ Query Resolver: Processes search statements, passes terms to the detector, and
determines if an item satisfies the query. It then communicates results to the user
interface.
○ User Interface: Updates the user with search status and retrieves matching items.
3. Search Process:
○ The system searches for occurrences of query terms in a text stream. It involves
detecting patterns (query terms) in the text.
○ Worst-case time complexity: O(n), where n is the length of the text, for the efficient
algorithms; brute-force methods require O(n*m), where m is the pattern length.
○ Hardware & Software: Multiple detectors may run in parallel for efficiency, either
searching the entire database or retrieving random items.
4. Index vs. Streaming Search:
○ Streaming Search: More efficient in terms of speed, since results are shown immediately
when found, and it requires no extra storage.
○ Index Search: Requires the whole query to be processed before results appear and has
storage overhead but can be more efficient in some cases like fuzzy searches.
○ Disadvantages of Streaming: Dependent on the I/O speed, may not handle all search
types as efficiently as indexing systems.
5. Finite State Automata:
○ Many text searchers use finite state automata (FSA), which are logical machines used to
recognize patterns in input strings. The FSA consists of:
■ I: Set of input symbols.
■ S: Set of states.
■ P: Productions, defining state transitions.
■ Initial state: The starting point.
■ Final state(s): Where the machine ends when a pattern is found.
○ Example: An FSA can be used to detect the string "CPU" in a text stream by transitioning
through states based on the input symbols received.
6. Transition Representation:
○ The states and transitions in an FSA can be represented by a table, where rows represent
current states, and columns represent the input symbols that trigger state transitions.
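As a sketch of how such a transition table drives the search, the following Python fragment (illustrative, not from the text) detects the string "CPU" in a character stream using exactly this row-per-state, column-per-symbol representation:

```python
def find_cpu(text):
    """Return the end positions where the pattern 'CPU' occurs in text."""
    # Rows of the transition table = current state, keys = input symbols.
    # States: 0 = start, 1 = 'C' seen, 2 = 'CP' seen, 3 = final ('CPU' found).
    transitions = [
        {'C': 1},          # state 0: waiting for 'C'
        {'P': 2, 'C': 1},  # state 1: 'C' seen
        {'U': 3, 'C': 1},  # state 2: 'CP' seen
    ]
    hits, state = [], 0
    for i, ch in enumerate(text):
        state = transitions[state].get(ch, 0)
        if state == 3:                 # final state reached: pattern detected
            hits.append(i)
            state = 0                  # reset and keep scanning the stream
    return hits
```

Note the extra 'C' entries in states 1 and 2: without them, an input like "CCPU" would fall back to the start state and miss the match.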

This system allows for real-time search and retrieval of text items without needing extra storage, but may
face performance challenges with I/O speed and certain types of queries.

Software text search algorithms:


The four major algorithms associated with software text search are:

1. Brute Force Approach:


○ This is the simplest algorithm. It attempts to match the search string against the text,
shifting the pattern one position after each mismatch and restarting the comparison.
○ The expected number of comparisons when searching an input text string of length n for
a pattern of length m is O(n * m).
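A minimal Python sketch (not from the text) of the brute-force shift-and-compare loop, showing where the O(n * m) worst case comes from:

```python
def brute_force_search(text, pattern):
    """Try every alignment of the pattern; O(n*m) comparisons worst case."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # each alignment of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # compare left-to-right
        if j == m:                      # all m characters matched
            matches.append(i)
    return matches
```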
2. Knuth-Morris-Pratt (KMP):
○ Requires O(n) time for searching after preprocessing the search string in O(m) time.
3. Boyer-Moore:
○ Typically the fastest of the algorithms in practice, requiring O(n + m) comparisons in the
best case and sublinear comparisons on average.
○ Both KMP and Boyer-Moore require O(m) preprocessing of the search string.
4. Shift-OR Algorithm:
○ A bit-parallel algorithm that encodes the match state in a machine word; for patterns no
longer than the word size it searches in O(n) time.
5. Rabin-Karp Algorithm:
○ Like Boyer-Moore and KMP, it requires O(n + m) comparisons in the best case; in the
worst case (many hash collisions) it degrades to O(n*m).

Of these, Boyer-Moore is considered the fastest.

1. Knuth-Morris-Pratt (KMP) Algorithm:


○ KMP improves on brute force by avoiding unnecessary comparisons. When a mismatch
occurs, the previously matched characters tell how far to skip in the input stream before
restarting the comparison.
○ A Shift Table is used to store how many positions to skip after a mismatch.
○ Example: Given an input stream and a pattern, the algorithm can skip positions based on
the already matched part of the pattern, making it more efficient.
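The shift table and skip logic described above can be sketched in Python as follows (an illustrative implementation, not the text's own):

```python
def kmp_search(text, pattern):
    """KMP: O(m) preprocessing of the pattern, O(n) search of the text."""
    m = len(pattern)
    if m == 0:
        return []
    # Shift (failure) table: fail[j] = length of the longest proper prefix
    # of pattern[:j+1] that is also a suffix of it.
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Scan the text; matched characters are never re-examined.
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # skip ahead using the shift table
        if ch == pattern[k]:
            k += 1
        if k == m:                      # full pattern matched
            matches.append(i - m + 1)
            k = fail[k - 1]
    return matches
```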
2. Boyer-Moore Algorithm:
○ Boyer-Moore enhances string search efficiency by comparing from the end of the
pattern to the start.
○ Shift Rules:
■ When a mismatch happens, the mismatched character in the input stream is aligned
with its rightmost occurrence in the pattern.
■ If the character doesn't exist in the pattern, the search pattern shifts by its full
length.
■ If there's a mismatch after some matched characters, the pattern shifts based on
the previously matched substring in the pattern.
○ This allows for larger skips compared to KMP and other algorithms, leading to faster
searching, especially when mismatches occur frequently.
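A simplified Python sketch of Boyer-Moore using only the first two shift rules above (the bad-character rule; the full algorithm adds a good-suffix table as well):

```python
def bm_search(text, pattern):
    """Boyer-Moore sketch, bad-character rule only.
    Comparisons start at the END of the pattern; on a mismatch the pattern
    shifts so the mismatched text character aligns with its rightmost
    occurrence in the pattern, or past it entirely if it is absent."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocessing: rightmost occurrence of each character in the pattern.
    last = {ch: j for j, ch in enumerate(pattern)}
    matches = []
    i = 0                               # current alignment of pattern in text
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                      # compare right-to-left
        if j < 0:
            matches.append(i)
            i += 1                      # conservative shift after a match
        else:
            # Align text[i+j] with its rightmost occurrence in the pattern
            # (last.get returns -1 if the character does not occur at all,
            # which shifts the pattern past it).
            i += max(1, j - last.get(text[i + j], -1))
    return matches
```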
3. Hashing and Rabin-Karp Algorithm:
○ This algorithm calculates a hash value for substrings of the text and compares these with
the hash value of the search pattern.
○ The hash function used is h(k) = k mod q, where q is a large prime number.
○ Although hashing doesn't guarantee uniqueness (collisions), it reduces the number of
comparisons by quickly finding potential matches, which are then validated by directly
comparing the text and pattern.
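The rolling-hash comparison can be sketched in Python as follows (illustrative; q is kept small here, whereas in practice it would be a large prime):

```python
def rabin_karp(text, pattern, q=101):
    """Rabin-Karp with rolling hash h(k) = k mod q.
    Collisions are possible, so every hash hit is verified by a direct
    character comparison before being reported as a match."""
    d = 256                             # alphabet size (radix)
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                # d^(m-1) mod q, for the rolling update
    p_hash = t_hash = 0
    for j in range(m):                  # initial hashes of pattern and window
        p_hash = (d * p_hash + ord(pattern[j])) % q
        t_hash = (d * t_hash + ord(text[j])) % q
    matches = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)           # verified match, not just a collision
        if i < n - m:                   # roll the window one character right
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
    return matches
```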
4. Finite State Machine (FSM) Approach:
○ This approach uses a finite state machine to process multiple query terms. Each state
transition occurs based on the current input symbol.
○ The FSM consists of:
■ GOTO function: Defines state transitions based on input symbols.
■ Failure function: Maps a state to another in case of a failure.
■ Output function: Indicates when a query term has been matched.
○ The FSM efficiently processes multiple queries at once, making it suitable for handling
complex pattern matching tasks.
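The GOTO, failure, and output functions can be sketched with an Aho-Corasick-style machine in Python (an illustrative implementation of the FSM approach, not the text's own code):

```python
from collections import deque

def build_machine(terms):
    """Build the GOTO, failure, and output functions for a set of query terms."""
    goto = [{}]                         # goto[state][symbol] -> next state
    output = [set()]                    # output[state] -> terms matched there
    for term in terms:                  # build the trie (GOTO function)
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(term)
    fail = [0] * len(goto)              # failure function, built breadth-first
    queue = deque(goto[0].values())     # depth-1 states fail back to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] |= output[fail[t]]   # inherit matches from the fallback
    return goto, fail, output

def search_terms(text, terms):
    """Scan the text once, reporting (end_position, term) for every hit."""
    goto, fail, output = build_machine(terms)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]         # follow the failure function
        state = goto[state].get(ch, 0)
        for term in output[state]:      # output function: terms matched here
            hits.append((i, term))
    return hits
```

A single pass over the input stream reports hits for all query terms at once, which is what makes this approach attractive for multi-term queries.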

5. Boyer-Moore and Preprocessing:


○ Boyer-Moore is fast in practice, but it requires significant preprocessing time to set up
tables (for shifts, etc.). Despite this, it outperforms other algorithms in many cases.

Aho-Corasick vs. KMP Algorithms:

● Both the Aho-Corasick and Knuth-Morris-Pratt (KMP) algorithms compare the same number of
characters.
● Aho-Corasick improves upon KMP by making state transitions independent of the number of
search terms; the search operation is linear with respect to the number of characters in the
input stream.
● The comparison count is proportional to T (the length of the text) multiplied by a constant w > 1,
a significant improvement over KMP (whose cost grows with the query size) and Boyer-Moore
(which handles only one search term at a time).
Baeza-Yates and Gonnet's Extension:

● This algorithm can handle "don’t care" symbols, complement symbols, and up to k mismatches.
● It uses a vector of m states, where m is the length of the search pattern, with each state
corresponding to a specific portion of the pattern matched to the input text.
● The algorithm provides a fuzzy search by determining the number of mismatches between the
search pattern and the text. If mismatches occur, the vector helps track the positions where
matches may happen, supporting fuzzy searches.

Shift-Add Algorithm:

● The Shift-Add algorithm utilizes this vector representation and performs a comparison by
updating the vector as it moves through the text.
● For each character in the pattern, a table T(x) is maintained that stores the status (match or
mismatch). When the vector value is zero, a perfect match is found.
● Don’t care symbols and complementary symbols can be included, making the algorithm flexible
for varied search types.
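The vector update can be sketched with the closely related bit-parallel Shift-OR formulation (a simplification of Shift-Add: exact matching only, and a cleared high bit rather than a zero vector signals a complete match). The table T(x) holds one precomputed mask per character, so each input character costs one shift and one OR over the whole state vector:

```python
def shift_or(text, pattern):
    """Bit-parallel Shift-OR sketch: one state bit per pattern position
    (0 = that prefix of the pattern still matches at this point)."""
    m = len(pattern)
    if m == 0:
        return []
    # T[x] has a 0 bit at position j iff pattern[j] == x.
    T = {}
    for j, ch in enumerate(pattern):
        T[ch] = T.get(ch, ~0) & ~(1 << j)
    all_ones = (1 << m) - 1
    state = all_ones                    # no prefix matched yet
    matches = []
    for i, ch in enumerate(text):
        # One shift and one OR update every pattern position at once.
        state = ((state << 1) | T.get(ch, ~0)) & all_ones
        if state & (1 << (m - 1)) == 0:     # bit m-1 clear: full match
            matches.append(i - m + 1)
    return matches
```

Because the whole update is a handful of word-level operations, this formulation maps naturally onto hardware, which is the implementation advantage noted below.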
Extensions by Wu and Manber:

● The Shift-Add algorithm was extended by Wu and Manber to handle insertions, deletions, and
positional mismatches as well.

Hardware Implementation:

● One of the advantages of the Shift-Add algorithm is its ease of implementation in hardware,
making it efficient for real-time applications.

Hardware text search systems:

Challenges with Software Text Search:

● Limitations: Software text search faces restrictions such as the ability to handle multiple search
terms simultaneously and issues with I/O speeds.
● Hardware Solution: To offload the resource-intensive searching, specialized hardware search
machines were developed. These machines perform searches independently of the main
processor and send the results to the main computer.

Advantages of Hardware-Based Text Search:

1. Scalability: Speed improves with the addition of more hardware devices (one per disk), and the
only limiting factor is the speed of data transfer from secondary storage (disks).
2. No Need for Indexing: Unlike traditional systems that need large indexes (often 70% the size of
the documents), hardware search machines do not require an index, allowing immediate
searches as new items arrive.
3. Deterministic Speed: While hardware searches may be slower than indexed searches, they
provide predictable search times, and results are available immediately as hits are found.

Types of Term Detectors in Hardware:

● Parallel Comparators: Each term is assigned to a dedicated comparison element. Text is
streamed serially into the detector, and matches are flagged for further processing.
● Cellular Structure: This approach involves multiple processing elements working together to
detect terms.
● Finite State Automata: State machines that manage term detection across multiple queries
simultaneously.

Examples of Hardware-Based Search Machines:

1. Paracel Searcher (formerly Fast Data Finder): A specialized hardware text search system with an
array of programmable processing cells. Each cell compares a single character in the query,
making it scalable based on the number of cells.
○ Fast Data Finder (FDF): The system uses cells interconnected in series, each handling a
single character comparison, and is used for complex searches, including Boolean logic,
proximity, and fuzzy matching.
2. GESCAN: Uses a Text Array Processor (TAP) that matches multiple terms and conditions in
parallel. It supports various search features like exact term matches, don’t-care symbols, Boolean
logic, and proximity.
3. Associative File Processor (AFP): An early hardware search unit capable of handling multiple
queries simultaneously.
4. Content Addressable Segment Sequential Memory (CASSM): Developed as a general-purpose
search device, this system can be used for string searching across a database.

Notable Features of Hardware Text Search Units:

● Speed: Dependent on I/O speeds, with no need for indexing.


● Parallelism: Multiple queries can be processed simultaneously by specialized search processors
(e.g., in the GESCAN and FDF systems).
● Boolean and Proximity Logic: Capable of complex search features like term counting, fuzzy
matching, and the use of variable-length don’t-cares.
Applications in Biological Research:

● The Fast Data Finder (FDF) has been adapted for genetic analysis, including DNA and protein
sequence matching. It is used in Smith-Waterman (S-W) dynamic programming for sequence
similarity searches and for identifying conserved regions in biological sequences.
● The Biology Tool Kit (BTK) integrates with the FDF to perform fuzzy matching for biological
sequence data.

Limitations:

● Expense: The cost and the need to stream entire databases for search have limited the
widespread adoption of hardware search systems.
● Database Size: The entire database must be streamed for a search, which can be
resource-intensive.
Multimedia Information Retrieval:
Spoken Language Audio Retrieval:
● Value of Speech Search: Just like text search, the ability to search audio sources such as
speeches, radio broadcasts, and conversations would be valuable for various applications,
including speaker verification, transcription, and command and control.
● Challenges in Speech Recognition:
○ Word Error Rates: Transcription of speech can be challenging due to high word error
rates, which can be as high as 50%, depending on factors like the speaker, the type of
speech (dictation vs. conversation), and environmental conditions. However, redundancy
in the source material can help offset these errors, still allowing effective retrieval.
○ Lexicon Size: While speech recognition systems often have lexicons of around 100,000
words, text systems typically contain much larger lexicons, sometimes over 500,000
words, which adds complexity to speech recognition.
○ Development Effort: A significant challenge is the effort needed to develop an
annotated corpus (e.g., a video mail corpus) to train and evaluate speech recognition
systems.

Comparative Evaluation of Speech vs. Text Retrieval:

● Performance: Research by Jones et al. (1997) compared speech retrieval to text retrieval. The
results showed:
○ Speaker-dependent systems: Retain around 95% of the performance of text-based
retrieval.
○ Speaker-independent systems: Achieve about 75% of the performance of text-based
retrieval.
● System Scalability: Scalability remains a challenge in speech retrieval, partly due to the size of
the lexicon and the complexity of developing annotated corpora for training.

Recent Efforts in Broadcast News Transcription:

● Rough’n’Ready Prototype:
○ Purpose: Developed by BBN, Rough’n’Ready provides information access to spoken
language from audio and video sources, especially broadcast news. It generates a
summarization of speech for easy browsing.
○ Technology: The transcription is created by the BYBLOS™ system, a large vocabulary
speech recognition system that uses a continuous-density Hidden Markov Model
(HMM).
○ Performance: BYBLOS runs at three times real-time speed, with a 60,000-word
dictionary. The system has reported a word error rate (WER) of 18.8% for broadcast
news transcription.
● Multilingual Efforts:
○ LIMSI North American Broadcast News System: Reported a 13.6% word error rate and
focused on multilingual information access.
○ Tokyo Institute of Technology & NHK: Joint research focused on transcription and topic
extraction from Japanese broadcast news. This project aims to improve accuracy by
modeling filled pauses, performing online incremental speaker adaptation, and using a
context-dependent language model.

Technological Approaches and Improvements:

● Filled Pauses: One area of focus in improving speech recognition systems is the handling of filled
pauses (e.g., "um," "uh") in natural speech.
● Speaker Adaptation: Improving speaker adaptation is crucial, as different speakers have different
speaking styles and accents. On-line incremental adaptation is a key strategy to improve
recognition over time.
● Language Models: Using context-dependent language models can significantly improve the
performance of speech recognition systems by considering the sequence and context of words,
including scripts such as Kanji (Chinese characters used in Japanese), Hiragana, and
Katakana (Japanese syllabaries).
Challenges in Multilingual Speech Processing:

● Multilingual Transcription: Efforts are being made to extend broadcast news transcription
systems to support multiple languages, including German, French, and Japanese.
● Contextual Models for Multilingual Support: Developing models that handle multiple languages
with different writing systems (e.g., Chinese characters, Japanese characters) poses an additional
challenge in improving the accuracy and performance of speech recognition systems across
languages.

Non-Speech Audio Retrieval:


● Purpose: SoundFisher is designed for content-based sound retrieval and classification, especially
useful in domains such as music production, movie/video production, and sound design.
● Sound Indexing:
○ The system uses directly measurable acoustic features (e.g., duration, loudness, pitch,
brightness) to index sounds. This is similar to how image indexing works using visual
feature vectors, enabling the user to search for sounds based on these features.
○ The features can be used to search within specified ranges, making the system flexible
and powerful for retrieving sounds of interest.
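The range-based search over acoustic feature vectors could be sketched as follows (a hypothetical illustration; the feature names, values, and query interface are invented for the example, not SoundFisher's actual API):

```python
# Toy "index" of sounds described by measurable acoustic features
# (all names and numbers below are illustrative).
sounds = [
    {"name": "bark1", "duration": 0.8, "loudness": 70, "pitch": 450},
    {"name": "purr1", "duration": 2.5, "loudness": 40, "pitch": 90},
    {"name": "bark2", "duration": 0.6, "loudness": 75, "pitch": 520},
]

def range_query(db, **ranges):
    """Return sounds whose features all fall within the given (lo, hi) ranges."""
    return [s["name"] for s in db
            if all(lo <= s[f] <= hi for f, (lo, hi) in ranges.items())]
```

For example, range_query(sounds, duration=(0.0, 1.0), pitch=(400, 600)) retrieves the two bark-like sounds while leaving the unspecified features unconstrained.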

User Queries and Retrieval:

● Content-Based Search: Users can search for sounds by their acoustic properties (e.g., pitch,
loudness, duration) or perceptual properties (e.g., “scratchy”).
○ Example: A user might search for "all AIFF encoded files with animal or human vocal
sounds that resemble barking" without specifying the exact duration or amplitude.
○ Query by Example: Users can also train the system by example, where the system learns
to associate perceptual properties (like "scratchiness" or "buzziness") with the sound
features.
○ Weighted Queries: The system supports complex weighted queries based on different
sound characteristics. For example, a query might specify a foreground sound with
metallic and plucked properties, and a specific pitch range.

Features and Applications:

● Training by Example: Users can train the system to recognize and retrieve sounds with indirectly
related perceptual properties, allowing for more flexible and nuanced searches.
● Performance Evaluation: The system was tested using a database of 400 sound files, including
sounds from nature, animals, instruments, and speech.
● Additional System Requirements:
○ Sound Displays: Visual representations of sound data may be necessary for better
understanding and refining searches.
○ Sound Synthesis: This refers to a query formulation or refinement tool that helps users
create or refine sound queries.
○ Sound Separation: This involves separating overlapping sound features or elements
within a given sound.
○ Matching Feature Trajectories Over Time: The system also supports tracking how
features (e.g., pitch, loudness) evolve over time in a sound, allowing for more dynamic
and sophisticated search queries.
Graph Retrieval:
● Purpose: SageBook provides a comprehensive system for querying, indexing, and retrieving data
graphics, which includes charts, tables, and other types of visual representations of data. It
allows users to search based on both the graphical elements (e.g., bars, lines) and the underlying
data represented by the graphic.
● Graphical Querying:
○ Users can formulate queries through a graphical direct-manipulation interface called
SageBrush. This allows them to select and arrange various components of a graphic,
such as:
■ Spaces (e.g., charts, tables),
■ Objects (e.g., marks, bars),
■ Object properties (e.g., color, size, shape, position).
○ The left side of Figure 10.3 (not shown here) illustrates how the user can design a query
by manipulating these elements.

Search and Retrieval Process:

● Matching Criteria:
○ The system performs both exact and similarity-based matching of the graphics. For a
successful match, the graphemes (graphical elements such as bars or lines) must not
only belong to the same class (e.g., bars, lines) but also match specific properties (e.g.,
color, shape, size).
○ The retrieved data-graphics are ranked based on their degree of similarity to the query.
For example, in a “close graphics matching strategy”, SageBook will prioritize results that
closely resemble the structure and properties of the query.
● Graphic Adaptation:
○ The system also allows users to manually adapt the retrieved graphics. For instance,
users can modify or eliminate certain elements that do not match the specifications of
the query.
● Grouping and Clustering:
○ To facilitate browsing large collections, SageBook includes data-graphic grouping
techniques based on both graphical and data properties, enabling users to efficiently
browse large collections of graphics.

Internal Representation:

● SageBook maintains an internal representation of the syntax and semantics of data-graphics,
which includes:
○ Spatial relationships between objects,
○ Relationships between data domains (e.g., interval, 2D coordinates),
○ Various graphic and data attributes.
● This internal representation aids in performing accurate and effective searches.

Search Strategies:

● SageBook offers multiple search strategies with varying levels of match relaxation:
○ For graphical properties, users can perform searches with different strategies (e.g., exact
match, partial match).
○ For data properties, there are also alternative search strategies to accommodate
different matching requirements.

Applications:

● Versatility: SageBook’s capability to search and retrieve graphics by content is valuable across
various domains, including:
○ Business Graphics: for financial charts, reports, and presentations.
○ Cartography: for terrain, elevation, and feature maps.
○ Architecture: for blueprints and designs.
○ Communications and Networking: for routers, links, and network diagrams.
○ Systems Engineering: for component and connection diagrams.
○ Military Campaign Planning: for strategic maps and force deployment visualizations.
● Real-World Relevance: The system’s ability to handle complex graphical elements, relationships,
and data attributes makes it applicable in a broad range of fields where visual representations of
data are crucial for analysis, planning, and decision-making.
Imagery retrieval:
● Problem: Traditional image retrieval systems rely heavily on metadata, such as captions or tags,
but these often do not fully capture the visual content of images. As a result, there has been
significant research into content-based retrieval, where images are indexed and searched based
on their visual features.
● Early Approaches:
○ Initial efforts focused on indexing visual features such as color, texture, and shape to
allow for retrieving similar images without needing manual indexing. Notable works
include Niblack and Jain’s algorithm development for automatic indexing of visual
features.
○ QBIC System (Query By Image Content):
■ QBIC, developed by Flickner et al. (1997), is an example of a content-based image
retrieval system that supports queries based on visual properties such as color,
shape, texture, and even sketches.
■ For instance, users could query a collection of US stamps by selecting the color
red or searching for stamps associated with the keyword "president". QBIC
would retrieve images that match these criteria, allowing for more intuitive and
visual-based searching.
○ Refining Queries: Users can refine their search by adding additional constraints. For
example, a query might be refined to include images that contain a red round object
with coarse texture and a green square.
○ Automated and Semi-Automated Object Identification: Since manual annotation of
images is cumbersome, automated tools (e.g., foreground/background models) were
developed to help identify objects in images, facilitating the indexing process.
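One simple way to compare such visual features is histogram intersection over color histograms, sketched below (an illustrative similarity measure only; QBIC's actual distance functions are more elaborate):

```python
def histogram_similarity(h1, h2):
    """Histogram intersection: score in [0, 1], where 1 means the query
    histogram h1 is fully contained in the candidate histogram h2."""
    total = sum(h1)
    if total == 0:
        return 0.0
    # Sum the overlap in each color bin, normalized by the query's mass.
    return sum(min(a, b) for a, b in zip(h1, h2)) / total
```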

Content-Based Video Retrieval:

● Researchers extended the concepts from image retrieval to video retrieval. Flickner et al. (1997)
explored shot detection and keyframe extraction, allowing for queries such as “find me all shots
panning left to right” based on the content of video shots. The system retrieves a list of
keyframes (representative frames) that can then be used to retrieve the associated video shot.

Face Recognition and Video Retrieval:

● Face Detection and Recognition:


○ Face detection refers to identifying faces in a scene.
○ Face recognition involves confirming that a specific face corresponds to a given
individual.
○ Face retrieval refers to finding the closest matching face from a database, based on a
given query.
● Example – US Immigration and Naturalization Service:
○ FaceIt®: This face recognition system is used by the US Immigration and Naturalization
Service to authenticate "fast lane" drivers at the US/Mexico border. The system retrieves
a driver’s registered image and compares it to a real-time image captured when the
driver passes through the checkpoint. Successful verification allows the vehicle to
proceed without delay, while failed matches prompt the car to be routed to an
inspection station.
○ Performance Measurement: Systems like FaceIt® can be evaluated using precision and
recall—terms commonly used in information retrieval to assess the system’s ability to
correctly identify and retrieve relevant results.
● Human Movement Tracking and Expression Recognition:
○ There is ongoing research into tracking human movements (such as heads, hands, and
feet) and recognizing facial expressions (e.g., smile, anger, surprise, disgust). This
research is linked to emotion recognition in human-computer interaction, which can
enhance the accuracy of content-based retrieval systems that deal with video or
audio-visual data.
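The precision and recall measures mentioned above reduce to a small computation (a generic sketch of the standard IR definitions, not tied to any particular face-recognition system):

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```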

Face Recognition in Video Retrieval:

● Informedia Digital Video Library: Developed by Wactlar et al. (2000), this system supports
content-based video retrieval by extracting information from both audio and video. It includes a
feature called named face that automatically associates a name with a face, enabling users to
search for faces by name or vice versa.
Video retrieval:

Content-Based Video Access:

● Personalcasts and Video Mail: The growing availability of video content (e.g., video mail, taped
meetings, surveillance video, broadcast television) has created a demand for more efficient
access methods. Content-based systems allow users to search video based on its content rather
than relying on manually added tags or metadata.

Broadcast News Navigator (BNN):

● BNN System: This system helps create personalcasts from broadcast news, enabling users to
search for and retrieve specific news stories from a large repository of video data.
○ BNN performs automated processing of broadcast news, including capture, annotation,
segmentation, summarization, and visualization of stories.
○ It integrates text, speech, and image processing technologies to allow users to search
video content based on:
■ Keywords
■ Named entities (people, locations, organizations)
■ Time intervals (e.g., specific news broadcast dates)
○ This approach significantly reduces the need for manual video annotation, which can
often be inconsistent or error-prone.

BNN Features and Results:

● User Query Page: Users can query video by:


○ Date range (e.g., a two-week period)
○ People and location tags (e.g., "George Bush" or "New York")
○ Keywords and concepts (e.g., "presidential primary")
● Named Entity Extraction: BNN incorporates the Alembic natural language information
extraction system to ensure that results are relevant and accurate. For example, a search for
"Bush" will only return stories related to George Bush and not other meanings of "Bush" (e.g.,
shrub or brush).
● Skimming and Story Analysis:
○ The system generates a “story skim”, which shows a keyframe and the most frequently
occurring named entities for each story. This helps users quickly understand the context
and relevance of each story.
○ Users can select a story to view more detailed content, such as closed captions or
transcribed text for a deeper understanding of the video content.
● User Performance Evaluation: Research by Merlino and Maybury (1999) showed that using BNN
helped users retrieve video content six times faster than using traditional keyword searches.
The automated segmentation of news broadcasts (e.g., by visual changes, speaker changes, or
topic shifts) contributes to this speed, allowing users to quickly find the stories they’re interested
in.

Geospatial News on Demand Environment (GeoNODE):

● GeoNODE System: This system focuses on topic detection and tracking for broadcast news and
newswire sources. It allows users to analyze geospatial and temporal contexts for news stories.
○ GeoNODE provides visual analytics by displaying data on a time line of stories related to
specific topics, as well as cartographic visualizations that highlight news mentions of
specific locations (e.g., countries or cities).
○ For example, in the GeoNODE map, the saturation of color indicates the frequency of
news mentions in different regions (e.g., darker colors indicate more mentions).
● Geospatial Search and Data Mining:
○ Users can search for documents that mention specific locations or geospatial trends.
○ The system also supports data mining for discovering correlations among named
entities across multiple news sources.
● GeoNODE Performance: In preliminary evaluations, GeoNODE identified over 80% of
human-defined topics and detected 83% of stories within those topics with a very low
misclassification error (0.2%).

Future of Multimedia Analysis:

● Integration of Multiple Data Sources: The future of systems like GeoNODE will rely on the ability
to extract and analyze information from a variety of multimedia sources, including text, audio,
and video.
● Machine Learning and Evaluation: As these systems evolve, they will increasingly depend on
machine learning techniques, multimedia corpora, and common evaluation tasks to improve
their performance and capabilities.
