
Introduction to Multimodal Retrieval-Augmented Generation (RAG)

Mehdi Allahyari
TwoSetAI: https://www.youtube.com/@TwoSetAI

What is Multimodal RAG?

● Definition: RAG over multiple data types such as images, text, and tables.
● Why it is important: Enterprise (unstructured) data is often spread across multiple modalities, e.g. images or PDFs containing a mix of text, tables, charts, and diagrams.
● Goal: Improve retrieval accuracy and provide richer, context-aware responses.
● Applications: Virtual assistants, recommendation systems, content generation.

Multimodal Capabilities in RAG

● Multimodal: Involves more than one type of data (e.g., text, images, audio).
● Enhanced Context: Combining modalities provides a fuller understanding, helping models answer more complex queries.
● Example Use Case: Searching image and text databases to answer a visual question.

Why Is Multimodal RAG Challenging?

● Data Spread: Unstructured data is scattered across modalities (e.g., images, PDFs).
● Unique Modality Challenges: Each data type has specific retrieval requirements.
● Data Alignment: Combining text and image data meaningfully.
● Latency: Increased computational requirements.
● Scaling: Managing large, multimodal datasets.

Approaches to Multimodal RAG

● Unified Vector Space: Encode text and images together using models like CLIP.
● Primary Modality Grounding: Convert all modalities to a single, main modality.
● Separate Storage: Distinct stores per modality with multimodal re-ranking.

Embed all modalities into the same vector space

● Encode both text and images in the same vector space using, e.g., CLIP (see the sketch below).
● Largely reuse the same text-only RAG infrastructure.
● Swap the embedding model to accommodate the additional modality.
● Replace the LLM with a multimodal LLM (MLLM) for question answering.

Images: https://weaviate.io/blog/multimodal-rag#step1-create-a-multimodal-collection
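
A minimal sketch of this unified-vector-space approach, assuming the Hugging Face transformers CLIP implementation; the checkpoint name, example texts, and image path are illustrative, and in practice the normalized vectors would be written to one shared vector index rather than compared in memory.

# Minimal sketch: embed text and images into one shared vector space with CLIP.
# The checkpoint name, example texts, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a bar chart of quarterly revenue", "a network topology diagram"]
image = Image.open("slide_chart.png")  # hypothetical image file

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=texts, return_tensors="pt", padding=True)
    )
    image_emb = model.get_image_features(
        **processor(images=image, return_tensors="pt")
    )

# Normalize so cosine similarity is a plain dot product. Because text and image
# vectors share one space, a text query can retrieve image chunks (and vice
# versa) from a single vector index.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # similarity of the image to each text
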

Ground all modalities into one primary modality

Select a primary data type that aligns with the application's main purpose, then convert the other data types to fit within this chosen primary format.

Example: text-based Q&A over PDFs:

● Text Processing: Handle text as usual.
● Image Processing: Create text descriptions and metadata for images in preprocessing (see the sketch below).
● Storage: Save the original images for later reference.
● Inference: Retrieve data based on text descriptions and metadata, using LLMs and MLLMs as needed.
● Advantages: Metadata aids in answering factual questions and avoids complex re-ranking.
● Trade-offs: Higher preprocessing costs; may lose some image nuances.


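A minimal sketch of the image-preprocessing step in this grounding approach, assuming BLIP is used as the captioning model; the checkpoint name, file path, and record layout are illustrative assumptions rather than a prescribed pipeline.

# Sketch of the preprocessing step: ground images into the text modality by
# captioning them, then index the captions with the usual text-only RAG pipeline.
# BLIP is used here only as an example captioner; the model choice is an assumption.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def image_to_text_record(path: str) -> dict:
    """Return a text chunk plus metadata pointing back at the original image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption_ids = model.generate(**inputs, max_new_tokens=40)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)
    return {"text": caption, "metadata": {"source_image": path, "modality": "image"}}

# Each record can now be embedded and stored exactly like any other text chunk;
# the metadata lets the application surface the original image at answer time.
record = image_to_text_record("figures/revenue_chart.png")  # hypothetical path
print(record)
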
Store different modalities separately

● Maintain separate stores for the different modalities.
● Query them all to retrieve the top-N chunks from each, then have a dedicated multimodal re-ranker surface the most relevant chunks (see the sketch below).
● Simplifies the modeling process: no single model has to be aligned across multiple modalities. However, it adds complexity in the form of a re-ranker that must arrange the top M*N chunks (N from each of M modalities).

Image: https://newsletter.theaiedge.io/p/how-to-build-a-multimodal-rag-pipeline-8d6
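
A minimal sketch of the separate-stores retrieval flow; the stores and reranker objects, along with their search and score methods, are hypothetical placeholders standing in for real components such as two vector indexes and a multimodal cross-encoder.

# Sketch of the separate-stores approach: query M modality-specific stores for
# their top-N chunks each, then let one multimodal re-ranker order the combined
# M*N candidates. The `stores` and `reranker` objects are hypothetical
# placeholders (e.g. two vector indexes and a multimodal cross-encoder).
from dataclasses import dataclass
from typing import Any

@dataclass
class Chunk:
    content: Any        # raw text, or an image reference/caption for image chunks
    modality: str       # "text" or "image"
    score: float = 0.0  # filled in by the re-ranker

def retrieve_multimodal(query: str, stores: dict, reranker, top_n: int = 5, final_k: int = 5):
    # 1. Query every modality-specific store independently for its top-N chunks.
    candidates: list[Chunk] = []
    for store in stores.values():
        candidates.extend(store.search(query, limit=top_n))  # assumed store API

    # 2. Re-rank all M*N candidates with a single multimodal re-ranker.
    for chunk in candidates:
        chunk.score = reranker.score(query, chunk)  # assumed re-ranker API

    # 3. Keep only the globally most relevant chunks for the generator.
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:final_k]
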
How to find the right Vision Language Models?

A vision-language model (VLM) is an AI model designed to process and interpret both visual (image, video) and textual data, creating a unified understanding of both modalities.

To find the right vision-language model:
1. Task Requirements: Define your specific tasks, such as image captioning, visual question answering, or image-text retrieval, as different models excel at different tasks.
2. Model Capabilities: Models like CLIP are great for retrieval, while BLIP and Flamingo are better at generating descriptive captions or handling dialogue.

How to find the right Vision Language Models? (Cont'd)

Vision Arena is a leaderboard based solely on anonymous voting over model outputs and is updated continuously.

● Input: The user enters an image and a prompt.
● Anonymous Outputs: Outputs are generated by two different models and shown anonymously.
● Human Selection: Users pick their preferred output without knowing which model produced it.
● Leaderboard Ranking: Constructed purely from human preferences across many samples.

https://huggingface.co/spaces/WildVision/vision-arena

How to find the right Vision Language Models? (Cont'd)

The Open VLM Leaderboard ranks vision-language models by various performance metrics and overall average scores. It allows filtering by model size, licensing type (proprietary or open-source), and scores on different metrics, making it easier to compare models suited to specific needs and preferences.

https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

How to find the right Vision Language Models? (Cont'd)

VLMEvalKit is an open-source evaluation toolkit for large vision-language models (LVLMs) that runs a wide range of benchmarks.

https://github.com/open-compass/VLMEvalKit
