Module 7 - Multimedia Information Retrieval
Uploaded by Aathmika Vijay

Module 7

MULTIMEDIA INFORMATION
RETRIEVAL

VIT - Dr.D.SARASWATHI 1
• Overview of Spoken Language Audio Retrieval
• Non-Speech Audio Retrieval
• Graph Retrieval
• Imagery Retrieval
• Video Retrieval

1. Overview of Spoken Language Audio Retrieval
• Spoken language audio retrieval combines speech recognition and text retrieval to let users search digitized audio-visual content that contains spoken language.

Basic Goal: Spoken Term Detection
• The fundamental objective is to identify the time spans in an
audio database where a specific query occurs. This process is
known as Spoken Term Detection.
• For instance, if a user searches for “US President,” the system
should locate relevant segments, including utterances
related to “Obama.”
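As a minimal illustration (the data and function names here are hypothetical), spoken term detection over a word-aligned ASR transcript can be sketched as an exact-match scan that returns time spans:

```python
# Toy spoken term detection: given word-level ASR output with timestamps,
# return the time spans where every word of the query appears in order.
def spoken_term_detection(transcript, query):
    """transcript: list of (word, start_sec, end_sec); query: string."""
    terms = query.lower().split()
    words = [w.lower() for w, _, _ in transcript]
    hits = []
    for i in range(len(words) - len(terms) + 1):
        if words[i:i + len(terms)] == terms:
            # Span runs from the start of the first matched word
            # to the end of the last matched word.
            hits.append((transcript[i][1], transcript[i + len(terms) - 1][2]))
    return hits

asr = [("the", 0.0, 0.2), ("us", 0.2, 0.5), ("president", 0.5, 1.1),
       ("spoke", 1.1, 1.5), ("today", 1.5, 1.9)]
print(spoken_term_detection(asr, "US President"))  # [(0.2, 1.1)]
```

A real system would also need to handle recognition errors and semantic matches (e.g. "Obama" for "US President"), which exact matching cannot capture.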

Speech Recognition Models
• The first step involves transcribing spoken content into text
using speech recognition techniques.
• Various models, such as RNN/LSTM and DNN, are employed
to achieve accurate transcription.

Text Retrieval
• Once transcribed, the spoken content becomes searchable.
• Text retrieval approaches are then used to search over the
transcriptions.
• The goal is to retrieve relevant content based on user
queries.
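Once transcripts exist, standard text-retrieval scoring applies. A minimal TF-IDF ranking sketch (toy transcripts, hypothetical helper name):

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents (e.g. ASR transcripts) by a simple TF-IDF score."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many transcripts each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    ranked = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query.lower().split() if t in df)
        ranked.append((score, i))
    return sorted(ranked, reverse=True)  # best-matching transcript first

transcripts = ["the president gave a speech today",
               "weather report for the weekend",
               "the president visited a school"]
print(tfidf_rank("president speech", transcripts))
```

Production systems would use an inverted index and a tuned weighting scheme (e.g. BM25) rather than scoring every document.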

Lattices and Beyond
• To handle the inherent errors in speech recognition, we work
with lattices—graphical representations of multiple
recognition hypotheses.
• Lattices allow us to explore different recognition paths and
improve retrieval accuracy.
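The idea of searching all lattice paths, not just the 1-best transcript, can be sketched with a toy lattice (structure and probabilities invented for illustration). The score below is the total probability of paths containing the query term:

```python
# Toy word lattice: node -> list of (next_node, word, arc_probability).
lattice = {
    0: [(1, "recognize", 0.6), (1, "wreck a nice", 0.4)],
    1: [(2, "speech", 0.7), (2, "beach", 0.3)],
    2: [],  # final node
}

def term_posterior(lattice, start, term):
    """Sum the probabilities of all lattice paths that contain the term."""
    total = 0.0
    stack = [(start, 1.0, False)]  # (node, path probability, term seen?)
    while stack:
        node, p, seen = stack.pop()
        if not lattice[node]:          # reached a final node
            total += p if seen else 0.0
            continue
        for nxt, word, ap in lattice[node]:
            stack.append((nxt, p * ap, seen or term in word.split()))
    return total

print(term_posterior(lattice, 0, "speech"))  # ≈ 0.7
```

Even if the 1-best path were "wreck a nice beach", the lattice still assigns substantial probability to paths containing "speech", which is why lattice-based retrieval is more robust than searching a single transcript.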

Semantic Retrieval
• Going beyond basic term detection, we aim for semantic
retrieval of spoken content.
• This involves understanding context, speaker intent, and
deeper meaning within the audio.

Spoken Document Retrieval
Challenges and Directions
• Despite advances, speech recognition always produces
errors.
• Researchers are exploring new directions to enhance
retrieval performance and address challenges.

2. Non-Speech Audio Retrieval
• While speech audio retrieval focuses on transcribing spoken
language, non-speech audio retrieval deals with identifying
and retrieving other types of audio content.

Importance of Non-Speech Audio Retrieval
• Beyond speech, audio databases contain various non-speech
sounds, including music, environmental noise, and sound
effects.
• Retrieving relevant non-speech audio is crucial in fields
like music, movie/video production, and multimedia
content.

SoundFisher: A User-Extensible System
• Thom Blum et al. (1997) introduced SoundFisher, a user-extensible sound classification and retrieval system.
• SoundFisher draws from multiple disciplines to classify and retrieve non-speech audio.
• It allows users to define custom sound classes and extend the system's capabilities.

Acoustic Features for Indexing
• Similar to image indexing, where visual feature vectors are
used, non-speech audio retrieval employs acoustic features.
• These features capture properties such as duration,
loudness, pitch, and spectral characteristics.
• Acoustic feature vectors enable efficient indexing and
matching of non-speech audio segments.
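A minimal sketch of such a feature vector, computed here with NumPy on a synthetic test tone (the feature set and function name are illustrative, not SoundFisher's actual features):

```python
import numpy as np

def acoustic_features(signal, sr):
    """Small acoustic feature vector: duration, loudness (RMS),
    zero-crossing rate (a rough pitch proxy), and spectral centroid."""
    duration = len(signal) / sr
    rms = float(np.sqrt(np.mean(signal ** 2)))
    # Zero-crossing rate: crossings per sample (rough brightness/pitch cue).
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    centroid = float(np.sum(freqs * spectrum) / np.sum(spectrum))
    return np.array([duration, rms, zcr, centroid])

sr = 16000
t = np.arange(sr) / sr                  # one second of audio
tone = np.sin(2 * np.pi * 440 * t)      # 440 Hz test tone
print(acoustic_features(tone, sr))      # centroid lands near 440 Hz
```

Segments are then indexed by these vectors, and retrieval reduces to nearest-neighbor search in the feature space.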

Applications

• Music Retrieval: Identifying music tracks, genres, and artists in audio databases.
• Environmental Sound Retrieval: Locating specific sounds (e.g., birdsong, traffic, waves) in large audio collections.
• Sound Effects Retrieval: Finding relevant sound effects for multimedia production.

Challenges
• Non-speech audio can be highly diverse, making
classification and retrieval complex.
• Handling variations in recording quality, background noise,
and context is essential.

• Non-speech audio retrieval enriches our understanding of audio content beyond spoken language.

3. Graph Retrieval
• Graph-based models play a crucial role in information
retrieval, especially when dealing with complex and
interconnected data.
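A classic graph-based retrieval model is PageRank, which ranks interlinked documents by link structure. A compact sketch (graph and values invented for illustration; assumes every node has at least one outgoing link):

```python
# Toy PageRank: rank nodes of a directed graph (e.g. linked documents)
# by the stationary distribution of a random surfer.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # "c" receives the most link mass
```

Here "c" ranks highest because it is linked from both "a" and "b"; the same machinery underlies ranking in many graph-structured retrieval settings.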

4. Imagery Retrieval
• There are many real-world applications in which content-based image retrieval (CBIR) plays an important role. Some examples are medicine, forensics, security, and remote sensing.

• Content-Based Image Retrieval (CBIR) is a way of retrieving images
from a database. In CBIR, a user specifies a query image and gets the
images in the database similar to the query image. To find the most
similar images, CBIR compares the content of the input image to the
database images.
• More specifically, CBIR compares visual features such as shapes, colors, texture, and spatial information, and measures the similarity between the query image and the database images with respect to those features.
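One of the simplest content features is a color histogram. A hedged NumPy sketch (synthetic images, illustrative function names) comparing a query image against candidates by histogram intersection:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel color histogram, flattened and normalized to sum to 1."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def histogram_intersection(h1, h2):
    """1.0 for identical normalized histograms, smaller = less similar."""
    return float(np.minimum(h1, h2).sum())

rng = np.random.default_rng(0)
query_img = rng.integers(0, 256, (32, 32, 3))   # stand-in for a real image
same = query_img.copy()
other = rng.integers(0, 256, (32, 32, 3))
hq = color_histogram(query_img)
print(histogram_intersection(hq, color_histogram(same)))   # 1.0
print(histogram_intersection(hq, color_histogram(other)))  # below 1.0
```

A real CBIR system would combine several such features (color, texture, shape) and rank the whole database by the resulting similarity scores.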

Text-Based Image Retrieval

Feature Extraction – Global

Feature Extraction – Local
• Local features describe visual patterns or structures identifiable in
small groups of pixels. For example, edges, points, and various image
patches.
• The descriptors used to extract local features consider the regions centered around the detected visual structures. These descriptors transform a local pixel neighborhood into a vector representation.
• One of the most widely used local descriptors is SIFT (Scale-Invariant Feature Transform). It consists of a keypoint detector and a descriptor, and it is invariant to image rotation and scale. However, it has drawbacks, such as its fixed-length vector encoding and its large memory footprint.
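Once local descriptors are extracted (SIFT's are 128-dimensional), images are compared by matching descriptors. A sketch of the standard nearest-neighbor matching with Lowe's ratio test, using synthetic descriptors in place of real SIFT output:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Ratio test: accept a match only if the nearest descriptor in
    desc_b is clearly closer than the second nearest."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]     # nearest and second nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches

rng = np.random.default_rng(1)
desc_b = rng.normal(size=(10, 128))      # descriptors from image B
# Image A shares three local structures with B (slightly noisy copies).
desc_a = desc_b[:3] + rng.normal(scale=0.01, size=(3, 128))
print(match_descriptors(desc_a, desc_b))  # [(0, 0), (1, 1), (2, 2)]
```

The number of surviving matches is then a natural similarity score between the two images.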
Deep Neural Networks

Similarity Measures
• Similarity measures quantify how similar a database image is
to our query image. The selection of the right similarity
measure has always been a challenging task.
• The structure of feature vectors drives the choice of the
similarity measure. There are two types of similarity
measures: distance measures and similarity metrics.

Distance
• A distance measure typically quantifies the dissimilarity of
two feature vectors. We calculate it as the distance between
two vectors in some metric space.
• Manhattan distance, Mahalanobis distance, and Histogram
Intersection Distance (HID) are some examples of distance
measure functions.
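Two of the named distances are simple to state directly (a minimal sketch on normalized histogram vectors):

```python
import numpy as np

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

def histogram_intersection_distance(h1, h2):
    """HID: 0 for identical normalized histograms; larger = more dissimilar."""
    return 1.0 - float(np.minimum(h1, h2).sum())

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.4, 0.3])
print(manhattan(x, y))                        # ≈ 0.2
print(histogram_intersection_distance(x, y))  # ≈ 0.1
```

Mahalanobis distance additionally weights coordinates by the inverse covariance of the feature distribution, which these two ignore.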

Similarity Metrics
• A similarity metric quantifies the similarity between two feature
vectors.
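The most common example is cosine similarity, which scores the angle between feature vectors rather than their magnitudes (a minimal sketch):

```python
import numpy as np

def cosine_similarity(x, y):
    """1.0 for vectors pointing the same way, 0.0 for orthogonal ones."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                         # ≈ 1.0
print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))    # ≈ 0.0
```

Because it ignores magnitude, cosine similarity is a natural fit for normalized feature or embedding vectors.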

5. Video Retrieval

• Video retrieval is a fascinating field that involves searching for relevant videos based on user queries.

Objective of Video Retrieval
• The primary goal of video retrieval is to select a video
that corresponds to a given text query. Typically,
videos are returned as a ranked list of candidates,
scored using document retrieval metrics.
• Given a text query and a pool of candidate videos, the
task is to identify the most relevant video content.
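With an embedding-based system, this ranking step reduces to cosine similarity between the query embedding and each candidate video embedding. A toy sketch (the embeddings here are invented; a real system would produce them with text and video encoders such as those listed later):

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q                    # cosine similarity per video
    return np.argsort(-scores), scores  # best-first ranking, raw scores

query = np.array([1.0, 0.0, 1.0])
videos = np.array([[0.9, 0.1, 1.1],   # close to the query
                   [0.0, 1.0, 0.0],   # unrelated
                   [1.0, 0.5, 0.8]])
order, scores = rank_videos(query, videos)
print(order)  # video 0 ranked first
```

The ranked list is then evaluated with standard document retrieval metrics such as recall@k.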

Methods and Techniques
• Video retrieval can be classified into two main categories:
• Text-Based Video Retrieval: In this approach, users input
representative keywords or a single image (or a group of
images) to search for desired videos.
• Content-Based Video Retrieval: Here, the query is based on the actual content of the videos, such as visual features, audio, and other modalities.

• Deep learning techniques play a crucial role in video-text retrieval.
Some notable methods include:
• ECO (Efficient Convolutional Network for Online Video
Understanding): A network architecture that considers long-term
content while enabling fast per-video processing
• Mixture of Embedding Experts: Learning a text-video embedding
from incomplete and heterogeneous data
• Frozen in Time: A joint video and image encoder for end-to-end
retrieval
• CLIP4Clip: Transferring knowledge from the CLIP model to video-
language retrieval in an end-to-end manner
• CoCa (Contrastive Captioners are Image-Text Foundation Models):
Applying contrastive loss between unimodal image and text
embeddings for video retrieval
Datasets and Benchmarks:
• Researchers evaluate video retrieval models on various
datasets. Some popular ones include:
• Kinetics, ActivityNet, MSR-VTT, MSVD, HowTo100M, Charades-
STA, and more
• Subtasks within video retrieval include:
• Video-Text Retrieval, Video Grounding, Video-Adverb Retrieval,
and Replay Grounding
