Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh1∗, Arkadeep Acharya1, Sriparna Saha1, Vinija Jain2,3, Aman Chadha2,3
1 Department of Computer Science and Engineering, IIT Patna, India
2 Stanford University, 3 Amazon AI
Abstract
The emergence of Large Language Models (LLMs) has profoundly altered the course of the AI revolution. Nevertheless, these LLMs
exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers
have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These
sophisticated models play a crucial role in addressing complex tasks like generating captions for images and responding to visual
questions. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification
organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal
inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs. This
classification is based on their respective capabilities and functionalities in processing and generating various modalities of data. We
meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its
strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We
also analyze the performance of VLMs on various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the
diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating
further breakthroughs and advancements.
Keywords: Visual Language Models, Multimodal Language Models, Large Language Models, Generative-AI, Benchmark Datasets
Figure 1: Taxonomy of Visual Language Models highlighting the input and output formats that the models are capable of handling. The tree is rooted at "Vision-Language Models" and branches into vision-language understanding models (split into those that handle only images, e.g., AlphaCLIP [76], MetaCLIP [93], GLIP [40], and those that handle videos, e.g., VideoCLIP [92], VideoMAE [80]), models for text generation with multimodal input (§2.2), e.g., GPT-4V, the LLaVA family [50, 52, 53, 54], Flamingo [3], BLIP [44], BLIP-2 [45], and a video-handling subgroup (e.g., Video-LLaMA [111], ChatBridge [113], Video-ChatGPT [59], VideoCoCa [96], VALOR [10], MACAW-LLM [58]), and models that both accept and produce multimodal inputs and outputs, e.g., CoDi [77], CoDi-2 [78], and Video-Poet [37].
Component | Trainable (example) | Frozen (example)
LLM Decoder | MiniGPT-V2 [14] | Fuyu [8], Qwen-VL [5]
Image Encoder | Qwen-VL [5] | MiniGPT-V2 [14]

Image-Text Fusion — some prominent examples are:
• Q-Former: BLIP-2 [45]
• Perceiver Resampler: Flamingo [3]
• Fully connected layers (MLP): LLaVA [52]

Figure 2: A high-level overview of the architecture of VLMs highlighting various design choices with corresponding examples.
ImageBind [28]: ImageBind learns a shared representation space by aligning embeddings from various modalities with image embeddings through diverse paired data sources. This facilitates zero-shot recognition across modalities, leveraging extensive web-scale image-text data and large-scale vision-language models such as CLIP. With minimal training required for deployment across different tasks and modalities, this approach proves highly adaptable. ImageBind utilizes a wealth of large-scale image-text pairs and naturally paired self-supervised data spanning multiple modalities (audio, depth, thermal, IMU) to achieve robust emergent zero-shot classification and retrieval capabilities. It surpasses specialized models on audio benchmarks and demonstrates versatility in compositional tasks. Additional enhancements include the incorporation of richer alignment data and the adaptation of embeddings tailored to specific tasks.
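The alignment described above is, at its core, a symmetric contrastive (InfoNCE) objective between image embeddings and embeddings of a paired modality. The snippet below is a minimal sketch of that idea in PyTorch; the encoders, dimensions, and temperature are illustrative assumptions, not ImageBind's actual components.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls paired (image, other-modality)
    embeddings together and pushes unpaired ones apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))              # matching pairs lie on the diagonal
    loss_i2o = F.cross_entropy(logits, targets)
    loss_o2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2o + loss_o2i)

# Toy usage: a frozen image tower and a trainable audio tower projected into a shared space.
batch, dim = 8, 512
image_emb = torch.randn(batch, dim)                        # e.g. from a frozen CLIP image encoder
audio_emb = torch.randn(batch, dim, requires_grad=True)    # e.g. from an audio encoder being aligned
loss = contrastive_alignment_loss(image_emb, audio_emb)
loss.backward()
```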
VideoCLIP [92]: VideoCLIP's objective is to pre-train a unified model capable of comprehending both video and text in a zero-shot manner, without dependence on labels for downstream tasks. Its approach involves employing a contrastive learning framework, integrating hard-retrieved negatives and overlapping positives during the pre-training phase for video-text comprehension. Noteworthy innovations include the incorporation of loosely temporally overlapping positive pairs and the utilization of a retrieval-based sampling technique for negative pairs. By leveraging contrastive loss and integrating overlapped video-text clips, VideoCLIP aims to enhance the association between different modalities. It is evaluated on various end tasks, showcasing state-of-the-art performance on video-language datasets like YouCook2 [117]. The approach demonstrates significant advancements in zero-shot video-text understanding, outperforming previous work and even supervised approaches in some cases.

VideoMAE [80]: VideoMAE is a self-supervised video pre-training method challenging the need for large-scale datasets. It adapts the masked autoencoder framework with a unique video tube masking strategy, achieving data efficiency on small datasets (3k-4k videos). It employs a Vision Transformer with joint space-time attention, demonstrating superior efficiency and effectiveness compared to traditional methods. VideoMAE performs strongly in downstream tasks like action detection and holds potential for improvements through dataset expansion and integration of additional data streams. The paper acknowledges energy consumption concerns during pre-training but emphasizes VideoMAE's practical value in scenarios with limited data availability.
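As a rough illustration of the tube-masking idea, the snippet below masks the same spatial patches in every frame of a clip, so the model cannot recover a masked patch by simply copying it from a neighbouring frame. The shapes and masking ratio are illustrative assumptions, not VideoMAE's exact configuration.

```python
import torch

def tube_mask(num_frames, num_patches, mask_ratio=0.9):
    """Sample one spatial mask and repeat it over time ("tube" masking)."""
    num_masked = int(num_patches * mask_ratio)
    order = torch.randperm(num_patches)
    spatial_mask = torch.zeros(num_patches, dtype=torch.bool)
    spatial_mask[order[:num_masked]] = True                   # True = masked patch
    return spatial_mask.unsqueeze(0).expand(num_frames, -1)   # same mask for every frame

# Example: a 16-frame clip, each frame split into 14x14 = 196 patches.
mask = tube_mask(num_frames=16, num_patches=196)
tokens = torch.randn(16, 196, 768)                            # patch embeddings
visible_tokens = tokens[~mask].reshape(16, -1, 768)           # only ~10% of patches reach the encoder
print(visible_tokens.shape)                                   # torch.Size([16, 20, 768])
```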
2.2. Text Generation with Multimodal Input

Model | Science-QA | VizWiz | Flickr30K | POPE | VQAv2 | GQA | LlavaBench | Chart-QA | MM-Vet | ViSiTBench
Table 1: Comparative analysis of multiple Visual Language Models on 10 benchmark datasets: Science-QA [56], VizWiz [23], Flickr30K [1], POPE [26], VQAv2 [31], GQA [35], LLaVaBench [52], Chart-QA [61], MM-Vet [104], and ViSiTBench [9]. V: Vicuna; img: only image; R/A: Random and Accuracy are reported; VQAS: VQA Score; EM: EM Accuracy; w: in-the-wild version; u: Ultra version; XB: the model is of X billion parameters.

GPT-4V [63]: GPT-4V marks a significant advancement, allowing users to instruct GPT-4 to analyze image inputs. OpenAI conducted extensive safety evaluations and preparations for GPT-4V, building on the safety work done for GPT-4. The training process involved predicting the next word in a document, utilizing a vast dataset of text and image data. GPT-4V inherits both text and vision capabilities, presenting novel features at their intersection. The system card outlines OpenAI's preparation, early access period, safety evaluations, red team assessments, and implemented mitigations before broad release.

LLaVA [50]: LLaVA is an open-source multimodal framework designed to enhance LLMs for understanding both language and images. It utilizes language-only GPT-4 to generate data for instruction-following tasks in a multimodal context. LLaVA integrates a vision encoder from CLIP with the LLM, enabling it to process visual information alongside language. The model undergoes pre-training on image-text pairs and fine-tuning for end-to-end multimodal understanding, resulting in a versatile multimodal chatbot. LLaVA demonstrates impressive multimodal chat abilities and achieves an 85.1% relative score compared to GPT-4 on a synthetic multimodal instruction-following dataset. Upon fine-tuning on the ScienceQA dataset, LLaVA and GPT-4 together achieve a new state-of-the-art accuracy of 92.53%.
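A common realisation of this design is a small projector that maps frozen CLIP image features into the LLM's token-embedding space, after which the projected visual tokens are simply prepended to the text embeddings. The sketch below illustrates that wiring with made-up dimensions; it is not the released LLaVA code.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.5 style: a small MLP; the original LLaVA used a single linear layer.
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim),
                                  nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, image_features):            # (B, num_patches, vision_dim)
        return self.proj(image_features)           # (B, num_patches, llm_dim)

# Toy usage: prepend projected visual tokens to the text token embeddings.
projector = VisionToLLMProjector()
image_features = torch.randn(1, 576, 1024)         # e.g. CLIP ViT patch features (assumed shape)
text_embeds = torch.randn(1, 32, 4096)              # embeddings of the tokenized instruction
visual_tokens = projector(image_features)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)   # fed to the (frozen or tuned) LLM
print(llm_input.shape)                               # torch.Size([1, 608, 4096])
```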
Flamingo [3]: Flamingo introduces novel architectural features to seamlessly integrate vision-only and language-only models. By incorporating interleaved cross-attention layers with frozen language-only self-attention layers, Flamingo excels in handling sequences of intermixed visual and textual data. It adopts a Perceiver-based architecture to convert input sequence data, like videos, into a fixed set of visual tokens. Leveraging large-scale multimodal web corpora with interleaved text and images, Flamingo demonstrates remarkable few-shot learning capabilities across various benchmarks, surpassing models fine-tuned on significantly more task-specific data. This showcases its adaptability and efficiency in rapidly adapting to diverse image and video understanding tasks with limited examples. The OpenFlamingo project is an ongoing effort dedicated to crafting an open-source rendition of DeepMind's Flamingo models. Across a spectrum of seven vision-language datasets, the performance of OpenFlamingo models consistently ranges between 80% and 89% when compared to the original Flamingo models. Med-Flamingo [62], a medical-focused multimodal few-shot learner based on OpenFlamingo-9B, achieves up to a 20% improvement in generative medical Visual Question Answering. It pioneers human evaluation in this context, involving clinicians in an interactive assessment, and enables applications like rationale generation.
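The interleaving described above can be pictured as inserting a cross-attention block, gated by a learnable tanh scalar initialised at zero, between the frozen self-attention layers of the language model, so that at initialisation the LLM's behaviour is unchanged. The following is a simplified sketch of such a gated cross-attention layer; the single attention module and the dimensions are simplifying assumptions, not Flamingo's actual implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens; zero-initialised tanh gates let the
    frozen LM start from its original behaviour and learn to use vision gradually."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 -> no visual influence at init
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x

block = GatedCrossAttentionBlock()
text = torch.randn(2, 16, 512)        # hidden states of the frozen LM
vision = torch.randn(2, 64, 512)      # e.g. the fixed set of tokens from the Perceiver resampler
print(block(text, vision).shape)      # torch.Size([2, 16, 512])
```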
PaLM-E [24]: PaLM-E emerges as an innovative Embodied Multimodal Language Model, meticulously crafted to navigate real-world scenarios by fusing language comprehension with continuous sensor inputs. This model is the result of a collaborative endeavor between TU Berlin and Google Research, signifying a pivotal advancement in the realm of multimodal AI. By integrating continuous sensor modalities from the real world into language models, this approach enables end-to-end training of multimodal sentences using a pre-trained large language model. It effectively addresses various embodied tasks such as robotic manipulation planning, visual question answering, and captioning. PaLM-E, the most extensive model boasting 562 billion parameters, exhibits cutting-edge performance across embodied reasoning tasks and visual-language domains like OK-VQA. Operating on multimodal sentences, it demonstrates the transfer of knowledge from visual-language domains to embodied reasoning tasks, underscoring its adaptability and scalability. PaLM-E faces limitations in relying on low-level language-conditioned policies for robotic tasks, prompting the proposal of self-supervised entity-centric labeling to enhance guidance in intricate tasks.

BLIP [41]: BLIP stands out as an innovative Vision-Language Pre-training framework, overcoming challenges associated with noisy training data. Central to BLIP is its Multimodal Mixture of Encoder-Decoder (MED) architecture, which incorporates Image-Text Contrastive (ITC), Language Modeling (LM), and Image-Text Matching (ITM) objectives during pre-training. Captioning and Filtering (CapFilt) enhances data quality, improving downstream task performance. BLIP, implemented in PyTorch and pre-trained on a diverse 14 million image dataset, demonstrated notable improvements in downstream tasks like image-text retrieval and captioning. Leveraging nucleus sampling and effective parameter sharing, BLIP outperformed existing models on standard datasets.

BLIP-2 [41]: BLIP-2 introduces an efficient strategy for vision-language pre-training, utilizing frozen image encoders alongside large language models. The Querying Transformer within BLIP-2 achieves top-tier performance in vision-language tasks while employing fewer parameters, effectively addressing challenges in interoperability among different modality embeddings. A novel addition brought by BLIP-2 is the Querying Transformer (Q-Former), which acts as a trainable link between static image encoders and a fixed LLM. Depicted in Figure 3, the Q-Former undergoes a two-stage pre-training process. Initially, it focuses on learning representations that bridge vision and language, enabling it to comprehend visual elements crucial to accompanying text. Later, the emphasis shifts to generative learning, connecting the Q-Former's output to the fixed LLM and refining its ability to generate visual representations interpretable by the LLM.
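Conceptually, the Q-Former is a small transformer whose fixed number of learnable query vectors cross-attend to the frozen image features, with the outputs then projected into the LLM's input space. The sketch below captures only that data flow; layer counts, dimensions, and the text branch of the real Q-Former are omitted and should be treated as assumptions.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """A handful of learnable queries distill frozen image features into a
    short sequence of soft prompts for a frozen LLM."""
    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)      # projection into the LLM embedding space

    def forward(self, image_features):              # (B, num_patches, dim) from a frozen ViT
        q = self.queries.expand(image_features.size(0), -1, -1)
        q, _ = self.cross_attn(q, image_features, image_features)
        q = self.self_block(q)
        return self.to_llm(q)                        # (B, num_queries, llm_dim)

qformer = TinyQFormer()
frozen_vit_features = torch.randn(4, 257, 768)       # e.g. CLS + patch tokens (assumed shape)
soft_prompts = qformer(frozen_vit_features)
print(soft_prompts.shape)                             # torch.Size([4, 32, 2560])
```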
InstructBLIP [22]: InstructBLIP employs instruction-aware visual feature extraction, enhancing its ability to extract informative features tailored to provided instructions. Achieving state-of-the-art zero-shot performance across 13 held-out datasets, InstructBLIP outperforms BLIP-2 and larger models like Flamingo. The models also excel in downstream tasks, demonstrating a 90.7% accuracy on ScienceQA IMG [56], and showcase qualitative advantages over concurrent multimodal models in diverse capabilities such as knowledge-grounded image description, visual scene understanding, and multi-turn visual conversation.

KOSMOS-1 [34]: KOSMOS-1 is a VLM from Microsoft. Trained on a web-scale multimodal corpus, KOSMOS-1 excels in language understanding and generation, OCR-free NLP, and various perception-language tasks, showcasing its capabilities in image captioning and visual question answering. Using a Transformer-based architecture, KOSMOS-1 aligns vision with large language models. Its training involves diverse multimodal corpora, including The Pile and Common Crawl, with language-only instruction tuning. Additionally, the model excels in chain-of-thought prompting, generating rationales before addressing complex question-answering tasks. Overall, KOSMOS-1 represents a significant advancement in the field of Multimodal Large Language Models, offering robust performance across a wide range of tasks.

KOSMOS-2 [66]: KOSMOS-2, again from Microsoft Research, advances traditional models by introducing capabilities for perceiving object descriptions, such as bounding boxes, and grounding text in the visual world. Utilizing a unique representation format for referring expressions, KOSMOS-2 links text spans to spatial locations in images. Employing a sophisticated image processing approach, the model combines vision encodings with location tokens to understand and relate specific image areas to textual descriptions. Constructed upon the foundational KOSMOS-1 architecture, this causal language model, based on Transformers, signifies a significant advancement towards Embodiment AI. It marks a pivotal stride towards the integration of language, visual perception, action, and world modeling, bringing them closer together in convergence for advancing artificial general intelligence.

MultiInstruct [94]: MultiInstruct presents a benchmark dataset for multimodal instruction tuning, featuring 62 tasks across 10 categories. Utilizing the OFA pre-trained multimodal language model, the study focuses on enhancing zero-shot performance on diverse tasks through large-scale text-only instruction datasets like Natural Instructions. Results show strong zero-shot performance and reduced model sensitivity to instruction variations. Comparative analysis of transfer learning strategies indicates improved robustness across multimodal tasks. Increasing task clusters during training enhances overall performance, supporting the effectiveness of MultiInstruct.

IDEFICS [38]: IDEFICS, an open-access reproduction of the closed-source vision-language model Flamingo by DeepMind, boasts 80 billion parameters and is available on HuggingFace. It performs well in image-text benchmarks, such as visual question answering and image captioning, utilizing in-context few-shot learning. IDEFICS has two versions: an 80-billion-parameter model and a 9-billion-parameter model.

PaLI [11]: PaLI, or Pathways Language and Image model, from Google Research, leverages large pre-trained encoder-decoder language models and vision transformers for joint language and vision modeling. The model achieves state-of-the-art results in various vision and language tasks across 100+ languages by utilizing a diverse multilingual dataset containing 10B images and texts. With a simple, modular, and scalable design, PaLI highlights the importance of joint scaling in vision and language components for effective training and performance.

Frozen [82]: Frozen, an innovative multimodal learning approach crafted by DeepMind, combines vision encoders trained on image captioning data with a frozen language model. This design empowers the model to swiftly adapt to new tasks in a few-shot setup, showcasing its efficacy across a spectrum of challenges such as visual question answering across diverse benchmark datasets. This approach trains the vision encoder by backpropagating the gradients through the frozen language model's self-attention layers. The system's notable limitation lies in its suboptimal performance on tasks learned with few shots compared to state-of-the-art models using full training sets, highlighting the potential for enhanced zero-shot and few-shot generalization through further improvements in accuracy and reduced seed requirements.
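The key trick in Frozen is that only the vision encoder, which produces a short visual prefix, receives weight updates, while the language model stays fixed; gradients still flow through the frozen LM to the vision side. A minimal way to express this in PyTorch is sketched below, with toy modules standing in for the real encoders (assumed shapes, not the paper's architectures).

```python
import torch
import torch.nn as nn

# Stand-ins for the real components.
vision_encoder = nn.Linear(2048, 512)          # trainable: maps image features to a visual prefix
language_model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 32000))

for p in language_model.parameters():          # freeze the LM: its weights never change...
    p.requires_grad = False

image_feat = torch.randn(4, 2048)
prefix = vision_encoder(image_feat)            # visual prefix conditioning the LM
logits = language_model(prefix)                # ...but gradients still flow through it
loss = logits.mean()
loss.backward()

print(vision_encoder.weight.grad is not None)  # True  -> vision encoder is updated
print(language_model[0].weight.grad is None)   # True  -> frozen LM accumulates no gradients
```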
Qwen-VL [4]: The Qwen-VL series, introduced as large vision-language models, encompasses Qwen-VL and Qwen-VL-Chat, demonstrating exceptional performance in tasks such as image captioning, question answering, visual localization, and versatile interactions. Qwen-VLs demonstrate outstanding performance across various vision-centric tasks, surpassing counterparts of similar scale. Their exceptional accuracy extends beyond traditional benchmarks like captioning and question-answering to include recent dialogue benchmarks. Trained on multilingual image-text data, with a significant portion in English and Chinese, Qwen-VLs naturally support multiple languages. They handle multiple images concurrently during training, enabling Qwen-VL-Chat to contextualize and analyze complex scenarios. With higher-resolution inputs and fine-grained training data, Qwen-VLs excel in fine-grained visual understanding, outperforming existing vision-language models in grounding, text comprehension, question answering, and dialogue tasks.

Fuyu-8B [8]: Fuyu-8B, a multi-modal text and image transformer developed by Adept AI, offers a simplified yet powerful solution tailored for digital agents. Its straightforward architecture and training process enhance comprehension, scalability, and deployment, making it ideal for various applications. Specifically designed for digital agents, Fuyu-8B seamlessly handles arbitrary image resolutions and excels in tasks like graph and diagram comprehension, UI-based queries, and rapid processing of large images within 100 milliseconds. Despite its optimization for Adept's use case, Fuyu-8B delivers impressive performance in standard image understanding benchmarks such as visual question-answering and natural image captioning. Architecturally, Fuyu adopts a vanilla decoder-only transformer, efficiently processing image patches by linear projection into the first layer. Its versatility in supporting diverse image resolutions is achieved by treating image tokens like text tokens, utilizing raster-scan order, and signaling line breaks for adaptability.
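Because Fuyu has no separate vision encoder, an image can be handled by slicing it into patches, linearly projecting each patch into the decoder's embedding space, and emitting the patches in raster-scan order with a special embedding marking the end of each row. The snippet below is an illustrative reconstruction of that preprocessing under assumed patch and hidden sizes; it is not Adept's implementation.

```python
import torch
import torch.nn as nn

def image_to_decoder_tokens(image, patch=30, hidden=4096):
    """Split an image into raster-ordered patches, project them linearly, and
    append a learned 'newline' embedding after every row of patches."""
    c, h, w = image.shape
    rows, cols = h // patch, w // patch
    proj = nn.Linear(c * patch * patch, hidden)          # the only "vision module" is a linear layer
    newline = nn.Parameter(torch.randn(1, hidden))       # signals the end of each patch row

    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)    # (c, rows, cols, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(rows, cols, -1)   # raster order, flattened pixels
    tokens = proj(patches)                                             # (rows, cols, hidden)
    with_newlines = [torch.cat([tokens[r], newline], dim=0) for r in range(rows)]
    return torch.cat(with_newlines, dim=0)               # (rows * (cols + 1), hidden)

img = torch.randn(3, 300, 450)                           # arbitrary resolutions are fine
seq = image_to_decoder_tokens(img)
print(seq.shape)                                         # torch.Size([160, 4096]): 10 rows x (15 patches + 1 newline)
```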
SPHINX [47]: SPHINX is a versatile VLM that integrates model weights, tuning tasks, and visual embeddings to enhance its capabilities. It unfreezes the large language model during pre-training to strengthen vision-language alignment and efficiently mixes weights from LLMs trained on real-world and synthetic data for robust understanding. By incorporating diverse tasks like region-level understanding and human pose estimation, SPHINX achieves mutual enhancement across different scenarios. It also extracts comprehensive visual embeddings from various sources, enriching language models with robust image representations. SPHINX demonstrates superior multi-modal understanding across applications and introduces an efficient strategy for capturing fine-grained details in high-resolution images, excelling in visual parsing and reasoning tasks.

Mirasol [67]: Mirasol, from Google DeepMind and Google Research, is a multimodal autoregressive model designed to handle both time-aligned modalities (audio, video) and a non-aligned modality (text). The architecture involves segmenting long video-audio sequences into manageable chunks, passing them through respective encoders, and using a Combiner to fuse video and audio features. Autoregressive training predicts sequential features, with a separate Transformer block integrating textual prompts through cross-modal attention. This enables enriched contextual understanding, showcasing a comprehensive approach to multimodal learning and generation. Pretrained on 12% of VTP, the model uniformly weighed losses in pretraining, with a tenfold emphasis on unaligned text loss during fine-tuning. Ablation studies underscore its ability to maintain content consistency and adapt to dynamic changes in video-audio sequences.

MiniGPT-4 [118]: MiniGPT-4 combines a frozen visual encoder (ViT + Q-Former from BLIP-2) with an LLM using a single trainable projection layer. Pretrained on aligned image-text pairs and fine-tuned on detailed image descriptions, MiniGPT-4 exhibits GPT-4-like capabilities without training vision or language modules separately. The finetuning process enhances language outputs, demonstrating diverse skills like meme interpretation, recipe generation, and poem composition. The model's architecture involves a vision encoder, linear projection layer, and large language model.

MiniGPT-v2 [14]: The model architecture of MiniGPT-v2 consists of a ViT visual backbone, a projection layer for dimension matching, and a large language model like LLaMA-2 [81] for the final generation. The ViT backbone is frozen during training, and four adjacent visual output tokens are concatenated and projected into LLaMA-2 space. Task-specific identifiers are incorporated during training using a three-stage strategy with weakly labeled image-text datasets and multi-modal instructional datasets. The model demonstrates superior performance in visual question-answering and visual grounding, outperforming other generalist models. The use of task identifier tokens enhances efficiency in multi-task learning, contributing to its state-of-the-art performance. Challenges include occasional hallucinations, emphasizing the need for more high-quality image-text-aligned data.

LLaVA-Plus [54]: LLaVA-Plus is a general-purpose multimodal assistant designed to enhance LMMs through visual instruction tuning. The model maintains a skill repository with diverse vision and vision-language pre-trained models, activating relevant tools in response to user inputs for various tasks. Trained on multimodal instruction-following data, LLaVA-Plus covers tool usage in visual understanding, generation, and external knowledge retrieval, surpassing its predecessor, LLaVA, in both existing and new capabilities. The training approach involves using GPT-4 to generate instruction data and integrating new tools through instruction tuning, enabling continuous enhancement. LLaVA-Plus demonstrates state-of-the-art performance on VisiT-Bench, a real-life multimodal task benchmark, excelling in tool use compared to other tool-augmented LLMs.

BakLLaVA [9]: BakLLaVA represents a Visual Language Model (VLM) crafted through a collaborative endeavor involving LAION, Ontocord, and Skunkworks AI. It harnesses the power of the Mistral 7B base, enhanced by the innovative LLaVA 1.5 architecture. When paired with the llama.cpp framework, BakLLaVA emerges as a swifter and more resource-efficient alternative to GPT-4 with Vision capabilities.

LLaVA-1.5 [50]: LLaVA-1.5 is a refined version of LLaVA, focusing on visual instruction tuning to enhance multimodal models. The paper outlines modifications to LLaVA, such as using CLIP-ViT-L-336px with an MLP projection and incorporating academic-task-oriented Visual Question Answering (VQA) data. Despite its advancements, limitations are acknowledged, such as prolonged training iterations due to the use of full image patches and challenges in processing multiple images and certain domain-specific tasks.

CogVLM [84]: CogVLM is an open-source vision-language foundation model developed by researchers from Tsinghua University. Its architecture comprises a Vision Transformer (ViT) encoder (e.g., EVA2-CLIP-E) for image processing, with output mapped into the text feature space using an MLP adapter. The model includes a pre-trained GPT-style language model and a visual expert module added to each layer, consisting of a QKV matrix and an MLP. CogVLM adopts a deep fusion approach, integrating visual and language features at multiple layers through the visual expert module, surpassing traditional shallow alignment methods. Alignment techniques involve pretraining on a vast dataset of 1.5 billion image-text pairs, employing image captioning loss and Referring Expression Comprehension (REC). Fine-tuning on various tasks, with a focus on free-form instructions, leads to the creation of a variant known as CogVLM-Chat.
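The visual expert can be read as a second set of attention (QKV) and MLP weights applied only to the positions holding image tokens, while text positions keep using the original, frozen language-model weights. The sketch below shows that routing for the QKV projection alone, with invented dimensions; it is a conceptual illustration rather than CogVLM's code.

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    """Route image positions through expert QKV weights and text positions
    through the original (frozen) language-model QKV weights."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)     # pretrained LM weights (kept frozen)
        self.image_qkv = nn.Linear(dim, 3 * dim)    # newly added visual expert (trainable)
        for p in self.text_qkv.parameters():
            p.requires_grad = False

    def forward(self, hidden, image_mask):           # hidden: (B, T, dim); image_mask: (B, T) bool
        qkv_text = self.text_qkv(hidden)
        qkv_image = self.image_qkv(hidden)
        return torch.where(image_mask.unsqueeze(-1), qkv_image, qkv_text)

layer = VisualExpertQKV()
hidden = torch.randn(2, 10, 512)
image_mask = torch.zeros(2, 10, dtype=torch.bool)
image_mask[:, :4] = True                              # the first 4 positions hold image tokens
print(layer(hidden, image_mask).shape)                # torch.Size([2, 10, 1536])
```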
FERRET [103]: FERRET is designed for spatial referring and grounding in images at different shapes and granularities. Ferret's distinct features include a Hybrid Region Representation, blending discrete coordinates and continuous visual features for diverse region inputs. It uses a Spatial-Aware Visual Sampler to handle various region shapes effectively and is trained on the Ground-and-Refer Instruction-Tuning (GRIT) dataset, which includes hierarchical spatial knowledge and hard negative samples. The architecture includes an image encoder, a spatial-aware visual sampler, and a Language Model. Ferret utilizes a pre-trained visual encoder (CLIP-ViT-L/14) and a Language Model's tokenizer for image and text embeddings. Training occurs on the GRIT dataset for three epochs, with the model randomly choosing between center points or bounding boxes to represent regions. In multimodal chatting tasks, Ferret significantly enhances performance by integrating refer-and-ground capabilities. Notably, Ferret mitigates the issue of object hallucination, a common challenge in multimodal models.

BARD [29]: BARD from Google utilizes a reinforcement learning framework to automate machine learning model design, architecture search, and hyperparameter tuning, making it accessible to users without extensive AI expertise. The system is positioned as a standalone experiment, with a focus on productivity, creativity, and curiosity enhancement. Users engage BARD for tasks such as writing resumes, creating workout routines, and planning itineraries. The model is pre-trained on diverse data sources, and responses are generated by considering context, classified against safety parameters, and re-ranked based on quality. Human feedback and evaluation, including fine-tuning and reinforcement learning on human feedback, are used to improve BARD. Limitations include potential inaccuracies, biases, persona attribution, false positives/negatives, and vulnerability to adversarial prompting. Google is committed to addressing these limitations and improving BARD responsibly over time.

LLaMA-VID [43]: LLaMA-VID introduces a novel dual-token strategy, incorporating context and content tokens, to efficiently encode each video frame. This approach enables the model to handle hour-long videos while mitigating computational complexity. LLaMA-VID employs a hybrid architecture, incorporating pre-trained models like Vicuna for text processing and a Vision Transformer for image embeddings in videos. The Q-Former introduces the context-attention token (Et) by computing attention between query-generated text embeddings (Q) and visual tokens (X); Et encapsulates relevant visual features. A content token (Ev) is obtained through mean pooling on visual tokens. Both tokens are integrated into the Vicuna decoder for generating text responses. LLaMA-VID's dual-token generation strategy, comprising context and content tokens, ensures adaptability to various settings, optimizing efficiency for videos while preserving detail for single images. LLaMA-VID is a video and image understanding model designed for efficiency, completing training in two days on 8xA100 GPUs. It uses EVA-G for visual encoding and QFormer for text decoding. The training set includes image and video caption pairs, with evaluations on diverse benchmarks. LLaMA-VID excels in zero-shot video QA benchmarks, achieving high accuracy with only two tokens per frame.
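The two tokens can be written down compactly: the context token Et aggregates visual tokens X with attention weights produced from text-query embeddings Q, while the content token Ev is the mean of the visual tokens. The following sketch mirrors that description with toy shapes; the real model wraps this in a Q-Former and additional projections.

```python
import torch

def dual_tokens(text_queries, visual_tokens):
    """Compute a context-attention token (Et) and a mean-pooled content token (Ev)
    for one frame, following the dual-token description above."""
    # text_queries Q: (num_queries, dim); visual_tokens X: (num_patches, dim)
    attn = torch.softmax(text_queries @ visual_tokens.t() / text_queries.size(-1) ** 0.5, dim=-1)
    context = attn @ visual_tokens                            # (num_queries, dim), text-conditioned summary
    context_token = context.mean(dim=0, keepdim=True)         # collapse queries -> Et, (1, dim)
    content_token = visual_tokens.mean(dim=0, keepdim=True)   # Ev, (1, dim)
    return torch.cat([context_token, content_token], dim=0)   # two tokens per frame

frame_tokens = dual_tokens(torch.randn(32, 768), torch.randn(256, 768))
print(frame_tokens.shape)                                     # torch.Size([2, 768])
```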
CoVLM [42]: CoVLM introduces a novel approach to enhance large language models' compositional reasoning by integrating vision-language communicative decoding. Utilizing communication tokens, the model dynamically composes visual entities and relationships, improving language generation through iterative communication with vision encoders and detection networks. CoVLM demonstrates strong performance across a range of tasks including visual reasoning, reading comprehension, and visual question answering. The model represents a noteworthy advancement in integrating vision and language models, with acknowledgment of potential future improvements in compositionality.

Emu2 [74]: Emu2 stands as a generative multimodal model boasting 37 billion parameters, showcasing exceptional contextual learning across varied multimodal sequences. It sets unprecedented benchmarks in tasks requiring rapid comprehension with limited examples. Employing a unified autoregressive objective, Emu2 seamlessly integrates visual embeddings and textual tokens. Its architecture includes a visual encoder, multimodal modeling, and a visual decoder, allowing coherent outputs across different modalities. Emu2 excels in vision-language tasks, instruction tuning, and controllable visual generation, showcasing state-of-the-art performance in image question answering, subject-driven generation, and zero-shot text-to-image generation. The paper acknowledges broader impact considerations and limitations, emphasizing responsible deployment in light of challenges such as hallucination, biases, and question-answering capabilities.

Video-LLaMA [111]: Video-LLaMA is tailored to comprehend both the visual and auditory aspects of videos. By merging pre-trained visual and audio encoders with static LLMs, the model adeptly tackles the complexities of capturing temporal shifts in visual contexts while seamlessly integrating audio-visual cues. Utilizing a Video Q-former for temporal information and an Audio Q-former for audio encoding, the framework aligns audio-visual data with textual information. Experimental results demonstrate Video-LLaMA's effectiveness in comprehending video content and generating meaningful responses in audio- and video-grounded conversations. However, the paper acknowledges limitations such as restricted perception capacities and challenges with long videos. Despite these, Video-LLaMA represents a notable advancement in audio-visual AI assistants, with the authors providing open-sourced resources for further development.

Video-ChatGPT [59]: Video-ChatGPT is a novel multimodal model enhancing video understanding by integrating a video-adapted visual encoder with a Large Language Model. The architecture leverages the CLIP ViT-L/14 visual encoder for spatiotemporal video representations and the Vicuna-v1.1 language model for comprehensive understanding. Notably, a dataset of 100,000 video-instruction pairs is created to fine-tune the model, focusing on temporal relationships and contextual understanding. The model exhibits competitive performance in correctness, detail orientation, contextual and temporal understanding, and consistency, surpassing contemporary models in zero-shot question-answering tasks. Qualitatively, Video-ChatGPT demonstrates proficiency in various video-based tasks but faces challenges in subtle temporal relationships and small visual details, indicating avenues for future improvement.

LaVIN [57]: LaVIN utilizes Mixture-of-Modality Adaptation (MMA) for cost-effective adaptation of LLMs to vision-language tasks. LaVIN, with lightweight adapters, achieves competitive performance and superior training efficiency in multimodal tasks like science question answering and dialogue. Remarkably, LaVIN requires only 1.4 training hours and 3.8M trainable parameters. Experimental results on the ScienceQA dataset demonstrate efficiency with comparable performance and reduced training time and storage costs. LaVIN represents a breakthrough in cost-effective adaptation but has limitations, including the potential for incorrect responses and challenges in identifying fine-grained details in images.

BEiT-3 [85]: BEiT-3 represents a pioneering multimodal foundation model, embodying substantial integration across language, vision, and multimodal pretraining domains. Distinguished by its remarkable proficiency in transfer learning across both vision-centric and vision-language tasks, BEiT-3 underscores the advancement of convergence through innovative enhancements in backbone architecture, pretraining methodologies, and scalable model design. Leveraging Multiway Transformers, this model is characterized by a modular architecture facilitating profound fusion capabilities and modality-specific encoding. With a unified backbone, BEiT-3 executes cohesive masked "language" modeling across images, English text, and image-text pairs, colloquially referred to as "parallel sentences". Empirical findings corroborate BEiT-3's attainment of state-of-the-art performance benchmarks across a spectrum of tasks encompassing object detection, semantic segmentation, image classification, visual reasoning, visual question answering, image captioning, and cross-modal retrieval.

mPLUG-2 [91]: mPLUG-2 leads the way by introducing a multi-module composition network, diverging from the traditional approach of sequence-to-sequence generation. This innovative design fosters modality collaboration while effectively addressing modality entanglement. The flexibility inherent in mPLUG-2 allows for the selective use of diverse modules across text, image, and video modalities for various understanding and generation tasks. Empirical evaluations demonstrate mPLUG-2's prowess, achieving state-of-the-art or competitive results across an extensive range of over 30 downstream tasks. From challenging multi-modal endeavors like image-text and video-text understanding to uni-modal tasks spanning text-only, image-only, and video-only domains, mPLUG-2 exhibits its versatility. A notable achievement of mPLUG-2 is its groundbreaking performance, achieving a top-1 accuracy of 48.0 and an 80.3 CIDEr score in video QA and captioning tasks. Impressively, these results were obtained with a significantly smaller model size and dataset scale. Furthermore, mPLUG-2 exhibits robust zero-shot transferability in both vision-language and video-language tasks, consolidating its forefront status in advancing multimodal pretraining methodologies.

X2-VLM [107]: X2-VLM is a versatile model with a flexible modular architecture that integrates image-text and video-text pre-training into a unified framework. It excels in image-text and video-text tasks across different scales, balancing performance and model scale. X2-VLM's modular design enhances transferability, allowing seamless use in various languages or domains. By substituting the text encoder, it outperforms state-of-the-art multilingual multimodal pre-trained models, demonstrating superior performance without requiring specific multilingual pre-training. This adaptability positions X2-VLM as a promising model in the field of multimodal pre-training.

Lyrics [55]: Lyrics introduces a novel approach to fine-tuning instruction and multimodal pretraining through cross-modal collaboration, built upon the foundational concepts of BLIP-2. It includes advanced techniques for refining visual inputs to extract specific visual characteristics, alongside modules designed for semantic segmentation, object detection, and image tagging. Within the Querying Transformer, visual features seamlessly merge with language inputs, enhanced by boundary boxes and tags derived from the visual refiner. A distinctive aspect of Lyrics is its two-stage training process, which aids in bridging the modality gap by aligning visual-language targets during pre-training. To extract valuable features from tangible objects, it employs a crucial technique called semantic-aware visual feature extraction. The effectiveness of this approach is evidenced by its robust performance across various visual-language benchmark tasks and datasets.

X-FM [109]: X-FM is a novel general foundation model equipped with one language encoder, one vision encoder, and one fusion encoder, featuring a unique training method. The proposed method incorporates two innovative techniques: halting gradients from vision-language training during language-encoder learning and leveraging vision-language training to guide vision-encoder learning. Extensive experiments on benchmark datasets demonstrate that X-FM outperforms existing general foundation models and performs competitively with or surpasses models tailored specifically for language, vision, or vision-language understanding. The paper acknowledges limitations, including substantial computational requirements, and aims to explore techniques for efficiency improvement and reduced environmental impact. The authors highlight their commitment to addressing efficiency challenges and reducing the carbon footprint in line with "green" deep learning initiatives. However, due to computational constraints, the study did not explore super-large models or pre-train large-sized models on extensive datasets, emphasizing scalability as an essential consideration for foundation models.

VALOR [10]: VALOR is a unified vision-audio-language cross-modality pretraining model designed for tri-modality understanding and generation. VALOR employs two pretraining tasks, Multimodal Grouping Alignment and Multimodal Grouping Captioning, showcasing good versatility and scalability. Two datasets, namely VALOR-1M and VALOR-32K, emerge as pivotal resources for the advancement of tri-modality pretraining research, aimed at benchmarking audiovisual-language retrieval and captioning. Upon completion of training on the VALOR-1M dataset alongside other vision-language datasets, VALOR establishes novel performance benchmarks across diverse downstream tasks. These tasks encompass retrieval scenarios incorporating vision, audio, and audiovisual inputs, as well as tasks such as captioning and question answering. The documentation delineates prospective avenues for future research, notably including the expansion of the VALOR-1M dataset through unsupervised methodologies, alongside the integration of vision and audio generation modeling within the overarching VALOR framework.
Prismer [51]: Prismer is a data- and parameter-efficient vision-language model that utilizes a frozen ensemble of domain experts, minimizing the need for extensive training data. By inheriting weights from pre-trained domain experts across various domains and keeping them frozen during training, Prismer efficiently adapts to different vision-language reasoning tasks. Despite its small-scale language model foundation, Prismer demonstrates competitive fine-tuned and few-shot learning performance, requiring significantly less training data than current state-of-the-art models. However, it lacks the ability for zero-shot in-context generalization and shows limitations in adapting to new experts or partial expert ensembles during inference, leading to performance drops. The paper discusses these limitations, including the absence of few-shot in-context prompting, challenges in adapting to new experts, and potential improvements in representing expert knowledge for enhanced reasoning performance in future iterations.

MM-REACT [98]: MM-REACT introduces a novel textual prompt design enabling language models to process multimodal information, including text descriptions, spatial coordinates, and file names for dense visual signals. The approach demonstrates its effectiveness in zero-shot experiments, showcasing its potential for advanced visual understanding across various scenarios. However, the paper identifies limitations, such as challenges in systematically evaluating performance due to the absence of annotated benchmarks for recognition capability in the wild. The integrated vision experts may introduce errors, and the system's success depends on the availability of necessary experts. Additionally, the number of experts is constrained by the context window of ChatGPT, and the conversion of visual signals to text words may not be optimal for certain tasks. Manual prompt engineering is required, and the authors suggest future research to automate this process for increased system development ease.

PICa [36]: PICa is a method utilizing image captions to prompt GPT-3 for knowledge-based Visual Question Answering (VQA). Leveraging GPT-3's knowledge retrieval and question-answering capabilities, the approach treats GPT-3 as an implicit and unstructured knowledge base, converting images into captions or tags for GPT-3 understanding. By adapting GPT-3 for VQA through a few-shot learning approach with in-context examples, PICa achieves notable performance, surpassing the supervised state of the art on the OK-VQA dataset with just 16 examples. The method is the first to use GPT-3 for multimodal tasks. However, a limitation is noted, as the image is abstracted as text, and captions may provide only a partial description, potentially missing crucial visual details necessary for detailed question answering, such as queries on specific visual attributes.
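In practice the PICa recipe reduces to prompt construction: each image is replaced by its caption (and optionally tags), a handful of solved caption/question/answer triples are placed in context, and the text-only LLM completes the answer for the new question. A hedged sketch of such a prompt builder is shown below; the captions, questions, header text, and the `call_llm` function are placeholders rather than the paper's exact prompt format.

```python
def build_pica_prompt(examples, caption, question):
    """Turn in-context (caption, question, answer) triples plus a new query
    into a single text prompt for a text-only LLM such as GPT-3."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {ex['caption']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}\n\n"
        for ex in examples
    )
    return header + shots + f"Context: {caption}\nQuestion: {question}\nAnswer:"

examples = [
    {"caption": "A man riding a wave on a surfboard.", "question": "What sport is this?", "answer": "surfing"},
    {"caption": "A plate with pasta and a fork.", "question": "What utensil is shown?", "answer": "fork"},
]
prompt = build_pica_prompt(examples, "A red double-decker bus on a city street.",
                           "In which country are such buses common?")
print(prompt)
# The prompt would then be sent to the LLM, e.g. answer = call_llm(prompt)  # call_llm is hypothetical
```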
PNP-VQA [79]: Plug-and-Play VQA (PNP-VQA) is a modular framework designed for zero-shot Visual Question Answering (VQA). Unlike existing approaches that demand extensive adaptation of pre-trained language models (PLMs) for vision, PNP-VQA eliminates the need for additional training of PLMs. Instead, it employs natural language and network interpretation as an intermediate representation to connect pre-trained models. The framework generates question-guided informative image captions, utilizing them as context for PLMs during question answering. PNP-VQA outperforms end-to-end trained baseline models and establishes new benchmarks by achieving state-of-the-art results on zero-shot VQAv2 and GQA datasets. Despite having 11 billion parameters, it surpasses the performance of an 80 billion-parameter model on VQAv2 and demonstrates a remarkable 9.1% improvement over a comparable model on GQA. This highlights its effectiveness across a range of parameter sizes for pre-trained language models (PLMs).

Img2LLM [32]: Img2LLM is designed for LLMs and facilitates zero-shot VQA without necessitating end-to-end training. The approach involves developing LLM-agnostic models that articulate image content through exemplar question-answer pairs, proving to be effective prompts for LLMs. Img2LLM boasts several advantages, achieving performance on par with or surpassing end-to-end trained methods, such as outperforming Flamingo by 5.6% on VQAv2 and exhibiting notable superiority on the challenging A-OKVQA dataset. Additionally, the flexibility of Img2LLM allows seamless integration with various LLMs for VQA tasks, eliminating the need for specialized, costly end-to-end fine-tuning. One caveat is the additional inference overhead incurred during image caption and question-answer pair generation, contributing to a 24.4% increase in computational time. However, this overhead can be mitigated by shortening prompts, trading a fraction of accuracy for speed, while Img2LLM avoids the resource-intensive end-to-end multimodal representation alignment seen in comparable models like Flamingo.

SimVLM [87]: SimVLM is a streamlined pretraining framework that embraces a minimalist approach. Unlike previous methods, SimVLM simplifies training complexities by leveraging large-scale weak supervision and undergoing end-to-end training with a singular prefix language modeling objective. Remarkably, without resorting to additional data or task-specific tailoring, the resultant model surpasses predecessors like OSCAR and VILLA, establishing new benchmarks in various vision-language tasks. Additionally, SimVLM demonstrates robust generalization and transfer capabilities, showcasing zero-shot behavior in tasks such as open-ended visual question answering and cross-modality transfer.
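The prefix language-modeling objective mentioned above amounts to conditioning on a prefix (image patches plus the first part of the text) and computing the autoregressive loss only on the remaining tokens. The snippet below shows that loss masking in isolation, with random logits standing in for a real model; it illustrates the objective rather than SimVLM itself.

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(logits, targets, prefix_len):
    """Cross-entropy over next-token predictions, ignoring positions inside the prefix."""
    # logits: (B, T, vocab); targets: (B, T) token ids
    losses = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")       # (B, T)
    suffix_mask = (torch.arange(targets.size(1)) >= prefix_len).expand_as(losses)     # train only on the suffix
    return losses[suffix_mask].mean()

batch, seq_len, vocab = 2, 20, 1000
logits = torch.randn(batch, seq_len, vocab)        # would come from the encoder-decoder model
targets = torch.randint(0, vocab, (batch, seq_len))
loss = prefix_lm_loss(logits, targets, prefix_len=8)   # the first 8 tokens (image + text prefix) carry no loss
print(float(loss))
```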
VideoCoCa [95]: VideoCoCa is an adaptation of the Contrastive Captioner (CoCa) model [95] for video-text tasks. Utilizing CoCa's generative and contrastive attentional pooling layers, VideoCoCa achieves state-of-the-art results in zero-shot video classification and text-to-video retrieval with minimal additional training. The model processes uniformly sampled frames through CoCa's image encoder, creating a tensor representing the entire video sequence. This tensor undergoes attention-pooling layers for both generative and contrastive modeling tasks. VideoCoCa demonstrates proficiency in various video-based tasks, including video reasoning and action recognition, but faces challenges in subtle temporal relationships. Various adaptation strategies and lightweight finetuning approaches were explored, with the attentional pooler method proving most effective. The model was tested on multiple datasets, exhibiting significant improvements over the CoCa baseline. VideoCoCa consistently outperforms CoCa across various scales and tasks, showcasing its robust performance in video-text modeling.

TinyGPT-V [105]: TinyGPT-V addresses the challenges posed by closed-source and computationally demanding multimodal models like GPT-4V. A notable achievement of this model is its impressive performance while utilizing minimal computational resources, requiring only 24GB for training and 8GB for inference. TinyGPT-V, after integrating Phi-2 and vision modules from CLIP, demonstrates competitive performance across various visual question answering and comprehension benchmark datasets when compared to larger models like LLaVA. The model's compact and efficient design, combining a small backbone with large model capabilities, marks a significant step towards practical, high-performance multimodal language models for diverse applications.

ChatBridge [113]: ChatBridge is a multimodal language model aiming to create versatile AI models capable of understanding diverse real-world modalities. The model utilizes language as a conduit, harnessing language-paired data to forge connections between diverse modalities. ChatBridge expands upon the zero-shot capabilities of large language models through a two-stage training procedure, aligning each modality with language and refining with a fresh multimodal instruction dataset. The model demonstrates strong results on zero-shot multimodal tasks, encompassing text, image, video, and audio. Nevertheless, there are constraints in effectively comprehending lengthy videos and audio, indicating a requirement for a more refined temporal modeling method. There is potential to expand the framework by incorporating supplementary modalities such as sketches and point clouds. While employing frozen modules helps mitigate computational constraints, it may also result in inadequate performance and introduce biases inherited from pre-trained models.

Macaw-LLM [58]: Macaw-LLM represents a pioneering multi-modal large language model, seamlessly blending visual, audio, and textual data. Its architecture includes a dedicated modality module for encoding multi-modal information, a cognitive module leveraging pre-trained LLMs, and an alignment module harmonizing disparate representations. This alignment facilitates the integration of multi-modal features with textual information, streamlining adaptation processes. Additionally, a comprehensive multi-modal instruction dataset has been curated to support multi-turn dialogue. However, the paper acknowledges certain limitations, particularly regarding the accuracy of the evaluation in fully capturing the capabilities of Macaw-LLM. The model is not optimized for multi-turn dialogues, and potential issues like hallucination, toxicity, and fairness are not evaluated due to the unavailability of suitable evaluation suites.

GPT4Tools [97]: GPT4Tools aims to enable open-source LLMs, such as LLaMA and OPT, to efficiently use multimodal tools. It addresses challenges posed by proprietary LLMs like ChatGPT and GPT-4, which often rely on inaccessible data and high computational costs. GPT4Tools creates instructional datasets that support large open-source models such as LLaMA in tackling visual challenges through LoRA optimization. The approach significantly improves tool invocation accuracy and enables zero-shot capacity for unseen tools. However, the explicit and fixed prompt approach reduces computational efficiency, prompting the exploration of implicit tool invocation methods. Despite limitations, GPT4Tools is considered a viable approach for equipping language models with multimodal tools.

PandaGPT [73]: PandaGPT is an approach enhancing large language models with visual and auditory instruction-following capabilities. PandaGPT excels in tasks like image description, video-inspired story writing, and answering audio-related questions. It seamlessly handles multimodal inputs, connecting visual and auditory information. By combining ImageBind's multimodal encoders and Vicuna's large language models, PandaGPT requires only aligned image-text pairs for training and exhibits emergent cross-modal behaviors for various data modalities. The paper suggests improvements, including using additional alignment data, exploring fine-grained feature extraction, generating richer multimedia content, creating new benchmarks, and addressing common language model deficiencies. Despite these considerations, PandaGPT represents a promising step toward building Artificial General Intelligence for holistic perception across diverse modalities.

mPLUG-Owl [100]: mPLUG-Owl introduces a unique training approach that empowers LLMs with multi-modal capabilities by modularizing the learning process into three key components: a foundational LLM, a visual knowledge module, and a visual abstractor module. Through a two-stage training methodology, this paradigm effectively aligns image and text data, harnessing the support of LLMs while preserving their generation capabilities. Experimental results demonstrate mPLUG-Owl's superior performance in instruction and visual understanding, multi-turn conversation, and knowledge reasoning. The model exhibits unexpected abilities like multi-image correlation and multilingual understanding but has limitations, including challenges in multi-image correlation, limited multilingual training, and mixed performance in OCR of complex scenes. The model also shows potential in vision-only document comprehension, with strengths in tasks like movie review writing and code generation but limitations in other applications, indicating further exploration opportunities in document understanding and downstream applications.

Ying-VLM [46]: Ying-VLM is trained on the M3IT dataset. Models trained with M3IT show success in following human instructions, providing engaging responses, and achieving strong performance on unseen videos and tasks in the Chinese language. The analysis indicates that increasing task numbers improves performance, and instruction diversity influences results. M3IT consists of 2.4 million instances, including meticulously crafted task instructions spanning forty different tasks.

BLIVA [33]: BLIVA is a novel multimodal large language model designed to handle text-rich visual questions, integrating query and patch embeddings. It outperforms existing VLMs like GPT-4 and Flamingo, showing significant improvements in OCR-VQA and Visual Spatial Reasoning benchmarks. BLIVA's architecture includes a Q-Former for instruction-aware visual features and a fully connected projection layer for additional visual information. It demonstrates an overall improvement of 17.72% on the multimodal LLM benchmark (MME) compared to InstructBLIP and performs well in real-world scenarios like processing YouTube thumbnail question-answer pairs.
Models | Overall | Exist. | Count | Pos. | Color | Poster | Cele. | Scene | Land | Art | OCR | Com. | Cal. | Trans. | Code
Sphinx | 1870.2 | 195 | 160 | 153.3 | 160 | 164.3 | 177.9 | 160 | 168.1 | 134 | 87.5 | 130 | 55 | 75 | 50
GPT-4V | 1926.6 | 190 | 160 | 95 | 150 | 192.2 | 0 | 151 | 138.3 | 148 | 185 | 142.1 | 130 | 75 | 170
Gemini | 1933.4 | 175 | 131.7 | 90 | 163.3 | 165 | 147.4 | 144.8 | 158.8 | 135.8 | 185 | 129.3 | 77.5 | 145 | 85
LLaVa | - | 50 | 50 | 50 | 55 | 50 | 48.82 | 50 | 50 | 49 | 50 | 57.14 | 50 | 57.5 | 50
MiniGPT-4 | - | 68.33 | 55 | 43.33 | 75 | 41.84 | 54.41 | 71.75 | 54 | 60.5 | 57.5 | 59.29 | 45 | 0 | 40
LaVIN | - | 185 | 88.33 | 63.33 | 75 | 79.79 | 47.35 | 136.75 | 93.5 | 87.25 | 107.5 | 87.14 | 65 | 47.5 | 50
InstructBLIP | - | 185 | 143.33 | 66.67 | 153.33 | 123.81 | 101.18 | 153 | 79.75 | 134.25 | 72.5 | 129.29 | 40 | 65 | 57.5
BLIP-2 | - | 160 | 135 | 73.33 | 148.33 | 141.84 | 105.59 | 145.25 | 138 | 136.5 | 110 | 110 | 40 | 65 | 75
mPLUG-OWL | - | 120 | 50 | 50 | 55 | 136.05 | 100.29 | 135.5 | 159.25 | 96.25 | 65 | 78.57 | 60 | 80 | 57.5
Qwen-VL-Chat | 1487.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | -
LLaVa-1.5V7B | 1510.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | -
LLaVa-1.5V13B | 1531.3 | - | - | - | - | - | - | - | - | - | - | - | - | - | -
LLaMA-VIDV13B | 1542.3 | - | - | - | - | - | - | - | - | - | - | - | - | - | -
LLaMA-VIDV7B | 1521.4 | - | - | - | - | - | - | - | - | - | - | - | - | - | -
Table 2: Comparative analysis of various VLMs on the MME Benchmark [27]. Columns Exist. through OCR are Perception subtasks; columns Com. through Code are Cognition subtasks. XB: the model is of X billion parameters.
time interaction scenarios like embodied agents.Significantly, model, meticulously crafted by Vikhyatk, emerges from the
it highlights how smaller language models can attain advanced fusion of SigLIP, Phi-1.5, and the expansive LLaVa training
levels of comprehension and engagement without sacrificing dataset. Representing a significant milestone in AI research,
resource efficiency. The training process involves two stages: this model is purposefully unveiled for scholarly exploration,
(1) feature alignment, where a pretrained vision encoder is con- underlining its exclusivity for non-commercial endeavors. This
nected to a language model using a subset of the LAION-CC- amalgamation of cutting-edge techniques and robust data sets
SBU dataset, and (2) visual instruction tuning, using a com- underscores a commitment to advancing the frontiers of artificial
bination of GPT-generated multimodal instruction-following intelligence, setting a new benchmark for computational prowess
data and VQA data to teach the model to follow multimodal and innovation in the field.
instructions. Shikra [15]: Shikra is a Multimodal Large Language Model
MoE-LLaVA [48]: MoE-LLaVA is a training strategy for Large Vision-Language Models. Known as MoE-tuning, this approach efficiently manages performance degradation in multi-modal learning and model sparsity by activating only the top-k experts during deployment via routers. Despite its architecture comprising 3 billion sparsely activated parameters, MoE-LLaVA achieves comparable or superior performance to state-of-the-art models while minimizing hallucinations in model outputs. Its architecture includes a visual encoder, a visual projection layer in the form of an MLP, a word embedding layer, and stacked LLM and MoE blocks. MoE-tuning comprises three stages: MLP training, training of all parameters excluding the vision encoder, and initialization of the experts in the MoE followed by training of the MoE layers only. Evaluation on various visual understanding datasets demonstrates MoE-LLaVA's efficiency and effectiveness, with extensive ablation studies and visualizations offering insights for future research in multi-modal learning systems.
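The top-k routing that MoE-tuning relies on can be summarized with a minimal sketch. The snippet below is a generic sparse mixture-of-experts layer, not MoE-LLaVA's exact implementation: a linear router scores the experts for each token, only the k highest-scoring experts run, and their outputs are combined with the renormalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Generic top-k mixture-of-experts feed-forward layer (illustrative)."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)   # one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); the router decides which experts process each token.
        gate_logits = self.router(x)                               # (T, E)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                       # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only a fraction of the experts run for any given token, which is how such a model keeps its active parameter count (around 3B for MoE-LLaVA) well below the total parameter count.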
Yi-VL [13]: Yi Vision Language (Yi-VL) is an open-source multimodal model based on the Yi Large Language Model series, excelling in content comprehension, recognition, and multi-round conversations about images. It leads recent benchmarks in both English and Chinese. Key features include multi-round text-image conversations, bilingual support, strong image comprehension, and a fine-grained input resolution of 448 × 448. Yi-VL utilizes the LLaVA architecture, comprising a vision transformer, a projection module, and a large language model. However, it has limitations such as supporting only visual question answering, accepting a single image input, and potential content-generation issues and object-identification inaccuracies in complex scenes. Additionally, it operates at a fixed resolution of 448 × 448, which may cause information loss for low-resolution images and prevents it from exploiting the extra detail in higher-resolution inputs.

Moondream [83]: Moondream is a 1.6-billion-parameter model built by Vikhyatk from SigLIP, Phi-1.5, and the LLaVA training dataset. It is released for research purposes and is restricted to non-commercial use.

Shikra [15]: Shikra is a multimodal large language model designed to bridge the gap in human-like referential abilities within dialogue. It can interpret spatial coordinates expressed in natural language, facilitated by a straightforward architecture consisting of a vision encoder, an alignment layer, and an LLM. Unlike many other models, Shikra requires no additional vocabularies or external plugins, allowing referential dialogue to be integrated effortlessly with diverse vision-language tasks. Its performance is notably strong across tasks such as REC (referring expression comprehension), PointQA, image captioning, and VQA, enabling functionalities like providing object coordinates and comparing user-pointed regions. However, it currently supports only English and is therefore less accessible to non-English speakers. Future work aims to make Shikra multilingual and to explore improved coordinate representations for dense object detection and segmentation tasks. Additionally, like most LLMs, Shikra may generate harmful or counterfactual responses.
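Shikra's central design choice is that box coordinates are just text: regions are written as normalized numbers inside the prompt and the response, so no new vocabulary or detection head is needed. The snippet below is a hypothetical illustration of that convention; the exact prompt format and rounding used in the paper may differ.

```python
def box_to_text(box, width, height, precision=3):
    """Render a pixel-space box [x1, y1, x2, y2] as normalized coordinates in plain text."""
    x1, y1, x2, y2 = box
    norm = [x1 / width, y1 / height, x2 / width, y2 / height]
    return "[" + ",".join(f"{v:.{precision}f}" for v in norm) + "]"

# Hypothetical referential-dialogue prompt: the region of interest is embedded
# directly in the question text, and the model is expected to answer with
# coordinates written in the same plain-text form.
region = box_to_text([120, 60, 340, 300], width=640, height=480)
question = f"What is the object in {region}, and where is the object closest to it?"
# e.g. "What is the object in [0.188,0.125,0.531,0.625], and where is the object closest to it?"
```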
BuboGPT [114]: BuboGPT stands out as a Visual Language Model (VLM) equipped with visual grounding capabilities, aimed at enhancing cross-modal interaction across the vision, audio, and language domains. It offers a detailed comprehension of visual elements and other modalities, enabling precise localization of objects in images during response generation. Employing an off-the-shelf visual grounding module based on SAM for entity extraction and mask correspondence in images, alongside a two-stage training approach and an extensive instruction dataset, BuboGPT strives for a comprehensive understanding of text-image-audio interactions. Despite facing challenges such as language hallucination and limited capacity in grounding QA, BuboGPT exhibits remarkable proficiency in understanding multiple modalities and in visual grounding tasks, signaling promising advancements in the realm of multi-modal language models.

ChatSpot [112]: ChatSpot has been introduced as a unified end-to-end multimodal large language model designed to enhance human-AI interaction. It supports diverse interactive forms such as mouse clicks, drag-and-drop, and drawing boxes, providing users with a flexible and seamless interactive experience.
The model is built on precise referring instructions, utilizing various reference representations, such as points and boxes, to focus on specific regions of interest. Additionally, a multi-grained vision-language instruction-following dataset is created for training ChatSpot. Experimental results demonstrate its robustness in region referring, showing minimal instances of region-referring hallucination even in the presence of box noise. This highlights ChatSpot's capability for precise region referencing and its potential to improve interactive accuracy and efficiency in multimodal large language models.

MiniGPT-5 [115]: MiniGPT-5 introduces an interleaved vision-and-language generation technique, utilizing "generative vokens" to harmonize image and text outputs. Its distinctive two-stage training strategy focuses on description-free multimodal generation, eliminating the need for comprehensive image descriptions. MiniGPT-5 enhances model integrity with classifier-free guidance, resulting in substantial improvements over baseline models such as Divter on the MMDialog dataset. It consistently produces superior multimodal results in human assessments conducted on the VIST dataset, showcasing its effectiveness across a range of benchmarks.
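For context, classifier-free guidance mixes conditional and unconditional denoising predictions at sampling time. In its standard, model-agnostic form (not MiniGPT-5's exact notation), with conditioning c (here derived from the generative vokens) and guidance scale s:

    ε̂(x_t, c) = ε(x_t, ∅) + s · (ε(x_t, c) − ε(x_t, ∅)),

so s = 1 recovers ordinary conditional sampling and larger s pushes generations to follow the conditioning more strongly.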
DRESS [12]: DRESS, a multimodal language model, leverages Natural Language Feedback (NLF) from large language models to enhance alignment through interactive engagement, effectively mitigating key limitations seen in current VLMs. NLF is classified into two categories, critique and refinement, aimed at closely aligning the model with human preferences and bolstering its proficiency in multi-turn conversations. Refinement NLF provides constructive suggestions to improve responses, while critique NLF aids in aligning VLM outputs with human preferences. DRESS is trained with reinforcement learning to handle the non-differentiable nature of NLF. Empirical findings demonstrate that DRESS generates more beneficial and benign outputs and adeptly learns from feedback during multi-turn interactions, surpassing state-of-the-art VLMs.

X-InstructBLIP [64]: X-InstructBLIP is a cross-modality framework built upon frozen large language models that integrates various modalities without extensive customization. High-quality instruction-tuning data is collected automatically, enabling fine-tuning for different modalities. The model performs comparably to leading-edge counterparts without extensive pre-training or customization. A novel evaluation task, Discriminative Cross-modal Reasoning (DisCRn), has been introduced to assess the model's cross-modal abilities across disparate input modalities. X-InstructBLIP demonstrates emergent cross-modal reasoning despite separate optimization of each modality and outperforms strong captioning baselines across all examined modalities in DisCRn. However, complexities and unanswered questions within each modality highlight challenges and opportunities for future exploration both across and within modalities.

VILA [49]: VILA, a visual language model family, emerges from an enhanced pre-training recipe that systematically augments LLMs towards VLMs. VILA consistently outperforms state-of-the-art models such as LLaVA-1.5 across the main benchmarks, showcasing superior performance without additional complexity. Notably, VILA's multi-modal pre-training unveils compelling properties such as multi-image reasoning, enhanced in-context learning, and improved world knowledge, marking significant advancements in visual language modeling.

2.3. Multimodal Output with Multimodal Input

Composable Diffusion (CoDi) [77]: The CoDi model adopts a multimodal approach using latent diffusion models for text, image, video, and audio. Text processing involves a variational autoencoder (VAE) with BERT and GPT-2, image tasks use a latent diffusion model (LDM) with a VAE, and audio tasks utilize an LDM with a VAE encoder-decoder over a mel-spectrogram representation. CoDi creates a shared multimodal space through cross-modal generation with Joint Multimodal Generation and cross-attention modules. Training involves individual diffusion models with aligned prompt encoders, and CoDi achieves any-to-any generation with a linear number of training objectives.

CoDi-2 [78]: CoDi-2 employs a multimodal encoder, ImageBind, with aligned encoders and a multilayer perceptron for modality projection. It integrates diffusion models (DMs) into a multimodal large language model (MLLM) for detailed, modality-interleaved generation. The fusion strategy involves projecting multimodal data into a feature sequence that is processed by the MLLM, with DMs used to improve generation quality. The alignment method leverages projections from aligned multimodal encoders to enable the MLLM to understand modality-interleaved input sequences, facilitating in-context learning and supporting multi-round interactive conversations.
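The "project-then-prepend" fusion that CoDi-2 (and, similarly, NExT-GPT below) relies on can be summarized in a few lines. This is a generic sketch under the assumption of an ImageBind-style encoder that returns one embedding per input; the names (ModalityProjector, build_interleaved_inputs) are placeholders, not the released APIs.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Map a frozen multimodal encoder's embedding into the LLM's token space (sketch)."""

    def __init__(self, enc_dim: int, llm_dim: int, num_query_tokens: int = 4):
        super().__init__()
        # A small MLP expands one encoder embedding into a few "soft tokens"
        # that the language model can attend to like ordinary word embeddings.
        self.num_query_tokens = num_query_tokens
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, llm_dim * num_query_tokens),
            nn.GELU(),
            nn.Linear(llm_dim * num_query_tokens, llm_dim * num_query_tokens),
        )

    def forward(self, modality_embedding: torch.Tensor) -> torch.Tensor:
        # modality_embedding: (batch, enc_dim) from an ImageBind-style encoder.
        tokens = self.mlp(modality_embedding)
        return tokens.view(-1, self.num_query_tokens,
                           tokens.shape[-1] // self.num_query_tokens)

def build_interleaved_inputs(text_embeds: torch.Tensor, modal_tokens: torch.Tensor) -> torch.Tensor:
    """Combine projected modality tokens with text embeddings for the MLLM.
    Here they are simply prepended; real systems insert them at marker positions."""
    return torch.cat([modal_tokens, text_embeds], dim=1)
```

On the output side, special signal tokens produced by the MLLM are handed to the diffusion decoders as conditioning, which is what enables modality-interleaved generation.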
Google Gemini [30]: Gemini models feature a Transformer-based architecture with deep fusion capabilities, excelling at integrating text, image, audio, and video modalities. They surpass GPT-4 on 30 out of 32 reported benchmarks and are trained on Google's TPU v4 and v5e accelerators for efficient scaling. The multimodal and multilingual training dataset prioritizes quality and safety, and the models undergo Reinforcement Learning from Human Feedback (RLHF). While specific details remain undisclosed, safety evaluations for bias and toxicity are a central part of Gemini's development, involving collaboration with external experts.

NExT-GPT [89]: NExT-GPT features three stages: Multimodal Encoding, LLM Understanding and Reasoning, and Multimodal Generation. It uses models such as ImageBind for encoding and Transformer-based layers for generation. At inference time, modality encoders transform the inputs, the LLM decides on the content, and diffusion decoders use signal tokens for synthesis. The system employs Multimodal Alignment Learning to align features and Modality-switching Instruction Tuning (MosIT) to improve the LLM's capabilities by aligning modality signal tokens with gold captions. The diverse MosIT dataset enhances the model's ability to handle various user interactions effectively.

VideoPoet [37]: VideoPoet is a language model designed for high-quality video synthesis with matching audio. The model employs a decoder-only Transformer architecture that processes multimodal inputs such as images, videos, text, and audio. Utilizing a two-stage training protocol, VideoPoet showcases state-of-the-art capabilities in zero-shot video generation.
Models                MSVD-QA          MSRVTT-QA        ActivityNet-QA
                      Accu.   Score    Accu.   Score    Accu.   Score
Video-LLaMA (7B)      51.6    2.5      29.6    1.8      12.4    1.1
LLaMA-Adapter (7B)    54.9    3.1      43.8    2.7      34.2    2.7
VideoChat (7B)        56.3    2.8      45.0    2.5      26.5    2.2
Video-ChatGPT (7B)    64.9    3.3      49.3    2.8      35.2    2.7
LLaMA-VID (7B)        69.7    3.7      57.7    3.2      47.4    3.3
LLaMA-VID (13B)       70.0    3.7      58.9    3.3      47.5    3.3
Table 3: Comparative analysis of leading models on zero-shot video QA datasets, from [43]. Results are reported with two tokens per frame.
VideoPoet excels in tasks such as text-to-video generation and video stylization. Notable features include a large language model backbone, custom spatial super-resolution, and scalability with model size. Human evaluations highlight VideoPoet's superiority in text fidelity, video quality, and motion interestingness. A responsible-AI analysis underscores considerations of fairness, and the model demonstrates capabilities in zero-shot editing, task chaining, and maintaining quality across multiple stages of video generation.
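Because VideoPoet treats video generation as ordinary next-token prediction, the core data-preparation step is packing discrete tokens from per-modality tokenizers into one sequence. The sketch below is a simplified illustration of that idea with hypothetical special tokens and tokenizer outputs; it is not the paper's actual vocabulary or layout.

```python
from typing import List

# Hypothetical special-token ids marking modality boundaries in the stream.
BOS, BOV, EOV, BOA, EOA = 0, 1, 2, 3, 4   # begin-of-sequence / video / audio markers

def pack_multimodal_sequence(text_ids: List[int],
                             video_ids: List[int],
                             audio_ids: List[int]) -> List[int]:
    """Concatenate text, video, and audio tokens into a single decoder-only stream.

    text_ids come from an ordinary text tokenizer, while video_ids and audio_ids
    are discrete codes produced by learned visual/audio tokenizers (codecs).
    The decoder-only transformer is trained with standard next-token prediction
    over this stream, so generating a video amounts to sampling the tokens
    between the video markers."""
    return ([BOS] + text_ids
            + [BOV] + video_ids + [EOV]
            + [BOA] + audio_ids + [EOA])

# Example: a text prompt followed by placeholder video/audio codes.
sequence = pack_multimodal_sequence(text_ids=[101, 734, 88],
                                    video_ids=[5001, 5002, 5003],
                                    audio_ids=[9001, 9002])
```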
3. Future Directions

Tradeoff between pre-training and modular structure: A great deal of research is under way to increase the understanding, controllability, and faithfulness of VLMs by introducing modularity in place of black-box pre-training.

Incorporating other modalities: Work is ongoing to incorporate finer-grained modalities such as gaze and gestures, inspired by [17], which is particularly important for the educational sector.

Fine-grained evaluation of VLMs: Work is also progressing on more fine-grained evaluation of VLMs along dimensions such as bias and fairness; DALL-Eval [18] and VP-Eval [19] are early efforts in this direction.

Causality and counterfactual capabilities in VLMs: A substantial body of work examines the causal and counterfactual capabilities of LLMs, which has inspired researchers to explore the same questions for VLMs. CM3 [2] is one of the earliest works in this domain, and interest in the topic is growing.

Continual learning/unlearning: There is a growing trend toward learning continually and efficiently without retraining from scratch in the VLM space; VQACL [108] and Decouple before Interact [68] are among the earliest works in this domain. Inspired by knowledge unlearning in LLMs [72], researchers are also exploring similar approaches for VLMs.

Efficiency in training: Efforts are concentrated on developing efficient multimodal models, with notable advances such as BLIP-2, which surpasses Flamingo-80B by 8.7% on zero-shot VQA-v2 while using 54 times fewer trainable parameters.

Multilingual grounding of VLMs: Following the recent surge in multilingual LLMs such as OpenHathi [71] and BharatGPT [20], there is growing momentum toward multilingual vision-language models; PALO [60] is the first notable work in this direction.

More domain-specific VLMs: Various domain-specific VLMs, exemplified by projects such as Med-Flamingo [62] and SkinGPT [116], have paved the way in their specialized fields, and further efforts are in progress to craft VLMs tailored to sectors such as education and agriculture.

4. Conclusion

This paper provides a comprehensive survey of the latest developments in the space of VLMs. We categorize VLMs according to their use cases and output-generation capabilities, offering concise insights into the architecture, strengths, and limitations of each model. Additionally, we highlight future directions in the field, informed by recent trends, to provide a roadmap for further exploration in this domain. We trust that this paper will serve as a valuable resource, offering guidance to researchers in the realms of Computer Vision and Natural Language Processing who are actively involved in multimodal learning.

5. Acknowledgements

Akash Ghosh and Sriparna Saha express their heartfelt gratitude to the SERB (Science and Engineering Research Board) POWER scheme (SPG/2021/003801) of the Department of Science and Engineering, Govt. of India, for providing the funding for carrying out this research.

References

[1] Bryan A. Plummer, et al. 2016. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. arXiv:1505.04870 [cs.CV]
[2] Armen Aghajanyan, et al. 2022. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520 (2022).
[3] Jean-Baptiste Alayrac, et al. 2022. Flamingo: a visual language model for few-shot learning. NeurIPS 35 (2022), 23716–23736.
[4] Jinze Bai, et al. 2023. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023).
[5] Jinze Bai, Shuai Bai, et al. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV]
[6] Max Bain, et al. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV. 1728–1738.
[7] Hangbo Bao, et al. 2022. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. arXiv:2111.02358 [cs.CV]
[8] Rohan Bavishi, et al. 2023. Introducing our Multimodal Models. https://www.adept.ai/blog/fuyu-8b
[9] Yonatan Bitton, et al. 2023. VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595 (2023).
[10] Sihan Chen, et al. 2023. VALOR: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345 (2023).
[11] Xi Chen, et al. 2022. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022).
[12] Yangyi Chen, et al. 2023. DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081 (2023).
[13] Bei Chen, et al. 2023. Better Bilingual Multimodal Model. https://huggingface.co/01-ai/Yi-VL-34B
[14] Jun Chen, Deyao Zhu, et al. 2023. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv:2310.09478 [cs.CV]
[15] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023).
[16] Xi Chen, Xiao Wang, et al. 2023. PaLI: A Jointly-Scaled Multilingual Language-Image Model. arXiv:2209.06794 [cs.CV]
[17] Yihua Cheng and Feng Lu. 2022. Gaze estimation using transformer. In ICPR. IEEE, 3341–3347.
[18] Jaemin Cho, et al. 2023. DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models. In ICCV. 3043–3054.
[19] Jaemin Cho, et al. 2023. Visual Programming for Text-to-Image Generation and Evaluation. arXiv preprint arXiv:2305.15328 (2023).
[20] CoRover.ai. 2023. BharatGPT. https://corover.ai/bharatgpt/
[21] Wenliang Dai, et al. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500 [cs.CV]
[22] Wenliang Dai, Junnan Li, et al. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500 (2023).
[23] Danna Gurari, et al. 2018. VizWiz Grand Challenge: Answering Visual Questions from Blind People. arXiv:1802.08218 [cs.CV]
[24] Danny Driess, et al. 2023. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023).
[25] Danny Driess, et al. 2023. PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378 [cs.LG]
[26] Zhiwen Fan, et al. 2023. POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference. arXiv:2305.15727 [cs.CV]
[27] Chaoyou Fu, et al. 2023. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv:2306.13394 [cs.CV]
[28] Rohit Girdhar, et al. 2023. ImageBind: One embedding space to bind them all. In CVPR. 15180–15190.
[29] Google. 2023. Google Bard. https://ai.google/static/documents/google-about-bard.pdf
[30] Gemini Team, Google. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[31] Yash Goyal, Tejas Khot, et al. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. arXiv:1612.00837 [cs.CV]
[32] Jiaxian Guo, et al. 2023. From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models. In CVPR. 10867–10877.
[33] Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2023. BLIVA: A simple multimodal LLM for better handling of text-rich visual questions. arXiv preprint arXiv:2308.09936 (2023).
[34] Shaohan Huang, et al. 2023. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023).
[35] Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506 [cs.CL]
[36] Woojeong Jin, et al. 2021. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484 (2021).
[37] Dan Kondratyuk, et al. 2023. VideoPoet: A Large Language Model for Zero-Shot Video Generation. arXiv preprint arXiv:2312.14125 (2023).
[38] Hugo Laurençon, et al. 2023. OBELISC: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527 (2023).
[39] Hugo Laurençon, Lucile Saulnier, et al. 2023. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. arXiv:2306.16527 [cs.IR]
[40] Liunian Harold Li, et al. 2022. Grounded language-image pre-training. In CVPR. 10965–10975.
[41] Junnan Li, et al. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
[42] Junyan Li, et al. 2023. CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding. arXiv preprint arXiv:2311.03354 (2023).
[43] Yanwei Li, et al. 2023. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. arXiv:2311.17043 [cs.CV]
[44] Junnan Li, Dongxu Li, et al. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv:2201.12086 [cs.CV]
[45] Junnan Li, Dongxu Li, et al. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597 [cs.CV]
[46] Lei Li, Yuwei Yin, et al. 2023. M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv:2306.04387 [cs.CV]
[47] Ziyi Lin, et al. 2023. SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023).
[48] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. arXiv preprint arXiv:2401.15947 (2024).
[49] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. 2023. VILA: On pre-training for visual language models. arXiv preprint arXiv:2312.07533 (2023).
[50] Haotian Liu, et al. 2023. Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023).
[51] Shikun Liu, et al. 2023. Prismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv:2303.02506 (2023).
[52] Shilong Liu, et al. 2023. LLaVA-Plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437 (2023).
[53] Haotian Liu, Chunyuan Li, et al. 2023. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 [cs.CV]
[54] Shilong Liu, Hao Cheng, et al. 2023. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. arXiv:2311.05437 [cs.CV]
[55] Junyu Lu, et al. 2023. Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects. arXiv abs/2312.05278 (2023). https://api.semanticscholar.org/CorpusID:266162357
[56] Pan Lu, et al. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS 35 (2022), 2507–2521.
[57] Gen Luo, et al. 2023. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023 (2023).
[58] Chenyang Lyu, et al. 2023. Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. arXiv preprint arXiv:2306.09093 (2023).
[59] Muhammad Maaz, Hanoona Rasheed, et al. 2023. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv:2306.05424 [cs.CV]
[60] Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, and Fahad S. Khan. 2024. PALO: A Polyglot Large Multimodal Model for 5B People. arXiv preprint arXiv:2402.14818 (2024).
[61] Ahmed Masry, et al. 2022. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. arXiv:2203.10244 [cs.CL]
[62] Michael Moor, et al. 2023. Med-Flamingo: A multimodal medical few-shot learner. In ML4H. PMLR, 353–367.
[63] OpenAI. 2023. GPT-4. https://openai.com/research/gpt-4
[64] Artemis Panagopoulou, et al. 2023. X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning. arXiv preprint arXiv:2311.18799 (2023).
[65] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv:2306.14824 [cs.CL]
[66] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824 (2023).
[67] AJ Piergiovanni, et al. 2023. Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities. arXiv preprint arXiv:2311.05698 (2023).
[68] Zi Qian, et al. 2023. Decouple before interact: Multi-modal prompt learning for continual visual question answering. In ICCV. 2953–2962.
[69] Alec Radford, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. PMLR, 8748–8763.
[70] Aditya Ramesh, et al. 2021. Zero-shot text-to-image generation. In ICML. PMLR, 8821–8831.
[71] Sarvam.ai. 2023. OpenHathi. https://www.sarvam.ai/blog/announcing-openhathi-series
[72] Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, and Weiqiang Zhang. 2023. Knowledge unlearning for LLMs: Tasks, methods, and challenges. arXiv preprint arXiv:2311.15766 (2023).
[73] Yixuan Su, et al. 2023. PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023).
[74] Quan Sun, et al. 2023. Generative Multimodal Models are In-Context Learners. arXiv preprint arXiv:2312.13286 (2023).
[75] Zeyi Sun, et al. 2023. Alpha-CLIP: A CLIP Model Focusing on Wherever You Want. arXiv e-prints (2023), arXiv:2312.03818.
[76] Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. 2023. Alpha-CLIP: A CLIP Model Focusing on Wherever You Want. arXiv:2312.03818 [cs.CV]
[77] Zineng Tang, Ziyi Yang, et al. 2023. Any-to-Any Generation via Composable Diffusion. arXiv:2305.11846 [cs.CV]
[78] Zineng Tang, Ziyi Yang, et al. 2023. CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation. arXiv:2311.18775 [cs.CV]
[79] Anthony Meng Huat Tiong, et al. 2022. Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773 (2022).
[80] Zhan Tong, et al. 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS 35 (2022), 10078–10093.
[81] Hugo Touvron, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[82] Maria Tsimpoukelli, et al. 2021. Multimodal few-shot learning with frozen language models. NeurIPS 34 (2021), 200–212.
[83] vikhyatk. 2024. MoonDream1. https://huggingface.co/vikhyatk
[84] Weihan Wang, et al. 2023. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023).
[85] Wenhui Wang, et al. 2022. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442 (2022).
[86] Xiao Wang, et al. 2023. Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research (2023), 1–36.
[87] Zirui Wang, et al. 2021. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904 (2021).
[88] Jiayang Wu, et al. 2023. Multimodal large language models: A survey. arXiv preprint arXiv:2311.13165 (2023).
[89] Shengqiong Wu, et al. 2023. NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519 (2023).
[90] Shengqiong Wu, Hao Fei, et al. 2023. NExT-GPT: Any-to-Any Multimodal LLM. arXiv:2309.05519 [cs.AI]
[91] Haiyang Xu, et al. 2023. mPLUG-2: A modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402 (2023).
[92] Hu Xu, et al. 2021. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021).
[93] Hu Xu, et al. 2023. Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023).
[94] Zhiyang Xu, et al. 2022. MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773 (2022).
[95] Shen Yan, et al. 2022. Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979 (2022).
[96] Shen Yan, Tao Zhu, et al. 2023. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. arXiv:2212.04979 [cs.CV]
[97] Rui Yang, et al. 2023. GPT4Tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752 (2023).
[98] Zhengyuan Yang, et al. 2023. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023).
[99] Zhengyuan Yang, Zhe Gan, et al. 2022. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA. arXiv:2109.05014 [cs.CV]
[100] Qinghao Ye, et al. 2023. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023).
[101] Qinghao Ye, Haiyang Xu, et al. 2023. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178 [cs.CL]
[102] Shukang Yin, et al. 2023. A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549 (2023).
[103] Haoxuan You, et al. 2023. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023).
[104] Weihao Yu, et al. 2023. MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv:2308.02490 [cs.AI]
[105] Zhengqing Yuan, et al. 2023. TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones. arXiv preprint arXiv:2312.16862 (2023).
[106] Zhengqing Yuan, Zhaoxu Li, and Lichao Sun. 2023. TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones. arXiv:2312.16862 [cs.CV]
[107] Yan Zeng, et al. 2023. X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks. arXiv:2211.12402 [cs.CV]
[108] Xi Zhang, et al. 2023. VQACL: A Novel Visual Question Answering Continual Learning Setting. In CVPR. 19102–19112.
[109] Xinsong Zhang, et al. 2023. Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks. arXiv preprint arXiv:2301.05065 (2023).
[110] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. 2024. MM-LLMs: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601 (2024).
[111] Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv:2306.02858 [cs.CL]
[112] Liang Zhao, et al. 2023. ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning. arXiv preprint arXiv:2307.09474 (2023).
[113] Zijia Zhao, et al. 2023. ChatBridge: Bridging modalities with large language model as a language catalyst. arXiv preprint arXiv:2305.16103 (2023).
[114] Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. 2023. BuboGPT: Enabling visual grounding in multi-modal LLMs. arXiv preprint arXiv:2307.08581 (2023).
[115] Kaizhi Zheng, et al. 2023. MiniGPT-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239 (2023).
[116] Juexiao Zhou, et al. 2023. SkinGPT: A Dermatology Diagnostic System with Vision Large Language Model. arXiv preprint arXiv:2304.10691 (2023).
[117] Luowei Zhou, et al. 2018. Towards automatic learning of procedures from web instructional videos. In AAAI, Vol. 32.
[118] Deyao Zhu, et al. 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).
[119] Deyao Zhu, Jun Chen, et al. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592 [cs.CV]
[120] Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. 2024. LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv preprint arXiv:2401.02330 (2024).