Introduction To Multimodal RAG
Retrieval-Augmented Generation (RAG)
Mehdi Allahyari
TwoSetAI https://www.youtube.com/@TwoSetAI
What is Multimodal RAG?
● Definition: RAG that retrieves over multiple data types, such as images, text, and tables.
● Why is it important? Enterprise (unstructured) data is often spread across multiple
modalities, e.g. images or PDFs containing a mix of text, tables, charts, and diagrams.
● Goal: Improve retrieval accuracy and provide richer, context-aware
responses.
● Applications: Virtual assistants, recommendation systems, content
generation.
Multimodal Capabilities in RAG
Image: https://weaviate.io/blog/multimodal-rag#step1-create-a-multimodal-collection
Ground all modalities into one primary modality
Select a primary data type that aligns with the application’s main purpose, then convert other data
types to fit within this chosen primary format.
Inference: Retrieve data based on text descriptions and metadata, using LLMs and multimodal LLMs (MLLMs) as needed.
Advantages: Metadata helps answer factual questions and avoids the need for complex re-ranking.
Image: https://newsletter.theaiedge.io/p/how-to-build-a-multimodal-rag-pipeline-8d6
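The sketch below illustrates this grounding-into-text approach: images are converted into the primary (text) modality via captions, the captions plus metadata are indexed, and retrieval runs over text. It is a minimal sketch, assuming a BLIP captioning checkpoint, a sentence-transformers embedder, and a brute-force in-memory index; the model names, file names, and query are illustrative, not from the original slides.

```python
# Sketch: ground images into the text modality by captioning them, then
# index the captions (plus metadata) and retrieve them with a text query.
# Model names, file names, and the brute-force cosine search are assumptions.
import numpy as np
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer

captioner_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_image(path: str) -> str:
    """Convert an image into the primary (text) modality via a caption."""
    inputs = captioner_proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=40)
    return captioner_proc.decode(out[0], skip_special_tokens=True)

# Build a tiny in-memory index: each entry stores the grounded text,
# its embedding, and metadata that can answer factual questions directly.
corpus = []
for path in ["chart.png", "diagram.png"]:  # placeholder file names
    text = caption_image(path)
    corpus.append({
        "text": text,
        "metadata": {"source": path, "modality": "image"},
        "vector": embedder.encode(text, normalize_embeddings=True),
    })

def retrieve(query: str, k: int = 2):
    """Retrieve grounded items by cosine similarity to the text query."""
    q = embedder.encode(query, normalize_embeddings=True)
    scored = sorted(corpus, key=lambda e: -float(np.dot(q, e["vector"])))
    return scored[:k]

for hit in retrieve("quarterly revenue trend"):  # placeholder query
    print(hit["metadata"]["source"], "->", hit["text"])
```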
How to find the right Vision Language Models?
A vision-language model (VLM) is an AI model designed to process and interpret
both visual (image, video) and textual data, creating a unified understanding of
the two modalities.
To find the right vision-language model:
1. Task Requirements: Define your specific tasks, such as image captioning,
visual question answering, or image-text retrieval, since different models excel
at different tasks.
2. Model Capabilities: Models like CLIP are well suited to retrieval (see the sketch below),
while BLIP and Flamingo are better at generating descriptive captions or handling dialogue.
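As a concrete illustration of the retrieval case in point 2, the sketch below uses CLIP through the Hugging Face transformers API to rank a few candidate texts against an image; the checkpoint name, image path, and candidate captions are illustrative assumptions, not part of the original slides.

```python
# Sketch: use CLIP (via Hugging Face transformers) to rank candidate texts
# against an image, the core operation behind image-text retrieval.
# The checkpoint, image path, and candidate captions are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # placeholder image
candidates = [
    "a bar chart of quarterly revenue",
    "a photo of a cat on a sofa",
    "a network architecture diagram",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax them for readability.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for text, p in sorted(zip(candidates, probs.tolist()), key=lambda x: -x[1]):
    print(f"{p:.3f}  {text}")
```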
How to find the right Vision Language Models? (Cont’d)
Vision Arena: a leaderboard based solely on
anonymous voting on model outputs, updated
continuously.
Input: User enters an image and a prompt.
Anonymous Outputs: Two different models generate
outputs, which are shown anonymously.
Human Selection: Users pick their preferred output
without knowing the model.
Leaderboard Ranking: Constructed purely based on
human preferences across multiple samples.
https://huggingface.co/spaces/WildVision/vision-arena
How to find the right Vision Language Models? (Cont’d)
Open VLM Leaderboard: ranks
vision-language models by per-benchmark
scores and their overall average. It supports
filtering by model size, license type
(proprietary or open-source), and individual
metrics, making it easier to compare models
for a specific use case.
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
How to find the right Vision Language Models? (Cont’d)
VLMEvalKit: an open-source toolkit for evaluating large
vision-language models (LVLMs) across a wide range of benchmarks.
https://github.com/open-compass/VLMEvalKit
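A minimal sketch of how an evaluation run is typically launched through the toolkit's run.py entry point, wrapped in Python here for consistency with the earlier examples; the script name, flags, and the benchmark/model identifiers are assumptions to verify against the repository's README.

```python
# Sketch: launch a VLMEvalKit benchmark run from Python.
# The script name (run.py), flags, and the benchmark/model identifiers
# below are assumptions; consult the repository README for the exact,
# currently supported names before running.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "MMBench_DEV_EN",  # assumed benchmark identifier
        "--model", "qwen_chat",      # assumed model identifier
        "--verbose",
    ],
    check=True,  # raise if the evaluation run fails
)
```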