Introduction To Multimodal RAG
Retrieval-Augmented Generation (RAG)
Mehdi Allahyari
TwoSetAI https://www.youtube.com/@TwoSetAI
What is Multimodal RAG?
● Definition: RAG that retrieves over multiple data types, such as images, text, and tables.
● Why is it important? Enterprise (unstructured) data is often spread across multiple
modalities, e.g. images or PDFs containing a mix of text, tables, charts, and diagrams.
● Goal: Improve retrieval accuracy and provide richer, context-aware
responses.
● Applications: Virtual assistants, recommendation systems, content
generation.
Multimodal Capabilities in RAG
Image: https://weaviate.io/blog/multimodal-rag#step1-create-a-multimodal-collection
Ground all modalities into one primary modality
Select a primary data type that aligns with the application’s main purpose, then convert other data
types to fit within this chosen primary format.
Inference: Retrieve data based on text descriptions and metadata, using LLMs and multimodal LLMs (MLLMs) as needed.
Advantages: Metadata helps answer factual questions and avoids the need for complex re-ranking.
Image: https://newsletter.theaiedge.io/p/how-to-build-a-multimodal-rag-pipeline-8d6
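The sketch below illustrates this grounding-into-text approach: images are converted into the primary (text) modality via captions, the captions plus metadata are indexed, and retrieval runs over text. It is a minimal sketch, assuming a BLIP captioning checkpoint, a sentence-transformers embedder, and a brute-force in-memory index; the model names, file names, and query are illustrative, not from the original slides.

```python
# Sketch: ground images into the text modality by captioning them, then
# index the captions (plus metadata) and retrieve them with a text query.
# Model names, file names, and the brute-force cosine search are assumptions.
import numpy as np
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer

captioner_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_image(path: str) -> str:
    """Convert an image into the primary (text) modality via a caption."""
    inputs = captioner_proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=40)
    return captioner_proc.decode(out[0], skip_special_tokens=True)

# Build a tiny in-memory index: each entry stores the grounded text,
# its embedding, and metadata that can answer factual questions directly.
corpus = []
for path in ["chart.png", "diagram.png"]:  # placeholder file names
    text = caption_image(path)
    corpus.append({
        "text": text,
        "metadata": {"source": path, "modality": "image"},
        "vector": embedder.encode(text, normalize_embeddings=True),
    })

def retrieve(query: str, k: int = 2):
    """Retrieve grounded items by cosine similarity to the text query."""
    q = embedder.encode(query, normalize_embeddings=True)
    scored = sorted(corpus, key=lambda e: -float(np.dot(q, e["vector"])))
    return scored[:k]

for hit in retrieve("quarterly revenue trend"):  # placeholder query
    print(hit["metadata"]["source"], "->", hit["text"])
```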
How to find the right Vision Language Models?
A vision-language model (VLM) is an AI model designed to process and interpret
both visual (image, video) and textual data, creating a unified understanding of
the two modalities.
To find the right vision-language model:
1. Task Requirements: Define your specific tasks, such as image captioning,
visual question answering, or image-text retrieval, since different models excel
at different tasks.
2. Model Capabilities: Models like CLIP are well suited to retrieval (see the sketch below),
while BLIP and Flamingo are better at generating descriptive captions or handling dialogue.
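As a concrete illustration of the retrieval case in point 2, the sketch below uses CLIP through the Hugging Face transformers API to rank a few candidate texts against an image; the checkpoint name, image path, and candidate captions are illustrative assumptions, not part of the original slides.

```python
# Sketch: use CLIP (via Hugging Face transformers) to rank candidate texts
# against an image, the core operation behind image-text retrieval.
# The checkpoint, image path, and candidate captions are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # placeholder image
candidates = [
    "a bar chart of quarterly revenue",
    "a photo of a cat on a sofa",
    "a network architecture diagram",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax them for readability.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for text, p in sorted(zip(candidates, probs.tolist()), key=lambda x: -x[1]):
    print(f"{p:.3f}  {text}")
```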
How to find the right Vision Language Models? (Cont’d)
Vision Arena: a leaderboard based solely on
anonymous voting on model outputs, updated
continuously.
Input: User enters an image and a prompt.
Anonymous Outputs: Two different models generate
outputs, which are shown anonymously.
Human Selection: Users pick their preferred output
without knowing the model.
Leaderboard Ranking: Constructed purely based on
human preferences across multiple samples.
https://huggingface.co/spaces/WildVision/vision-arena
How to find the right Vision Language Models? (Cont’d)
Open VLM Leaderboard: ranks
vision-language models by per-benchmark
scores and their overall average. It supports
filtering by model size, license type
(proprietary or open-source), and individual
metrics, making it easier to compare models
for a specific use case.
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
How to find the right Vision Language Models? (Cont’d)
VLMEvalKit: an open-source toolkit for evaluating large
vision-language models (LVLMs) across a wide range of benchmarks.
https://github.com/open-compass/VLMEvalKit
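A minimal sketch of how an evaluation run is typically launched through the toolkit's run.py entry point, wrapped in Python here for consistency with the earlier examples; the script name, flags, and the benchmark/model identifiers are assumptions to verify against the repository's README.

```python
# Sketch: launch a VLMEvalKit benchmark run from Python.
# The script name (run.py), flags, and the benchmark/model identifiers
# below are assumptions; consult the repository README for the exact,
# currently supported names before running.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "MMBench_DEV_EN",  # assumed benchmark identifier
        "--model", "qwen_chat",      # assumed model identifier
        "--verbose",
    ],
    check=True,  # raise if the evaluation run fails
)
```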