VILA-U Foundation Model
Abstract
VILA-U is a Unified foundation model that integrates Video, Image, Language
understanding and generation. Traditional visual language models (VLMs) use
separate modules for understanding and generating visual content, which can lead
to misalignment and increased complexity. In contrast, VILA-U employs a single
autoregressive next-token prediction framework for both tasks, eliminating the need
for additional components like diffusion models. This approach not only simplifies
the model but also achieves near state-of-the-art performance in visual language
understanding and generation. The success of VILA-U is attributed to two main
factors: the unified vision tower, which aligns discrete visual tokens with textual
inputs during pretraining and thereby enhances visual perception; and the finding
that autoregressive image generation can reach quality similar to diffusion models
when trained on a high-quality dataset. Together, these allow VILA-U to perform
comparably to more complex models within a fully token-based autoregressive framework.
1 Introduction
In recent years, large language models (LLMs) have demonstrated superior capabilities in various
language tasks. Their appealing properties like instruction following, zero-shot generalization, and
few-shot in-context learning motivate researchers to combine them with vision models to build visual
language models (VLMs) for multi-modal tasks. Many efforts [15, 51, 45] have been put into this
field, achieving remarkable performance on visual language understanding benchmarks. In these
works, visual inputs are projected onto the LLM's semantic space through a vision foundation model such as
CLIP [58], which bridges the two modalities via text-image alignment training objectives.
In addition to visual understanding, another essential research direction in combining visual and
language modalities is visual generation. There are two popular approaches for text-guided image
generation. One approach employs diffusion models [60], a powerful tool for various generation tasks.
The other line of work converts visual content into discrete tokens through vector quantization (VQ)
and then leverages autoregressive transformers for high-quality and diverse generation [21, 73, 33].
Witnessing the rapid advancements in both visual understanding and generation, an emerging trend
is to unify these techniques into a single multi-modal framework. There are two main approaches
to achieving such unification. Many VLMs [31, 41, 64, 63] maintain an understanding-oriented
framework and offload the generation task to an external diffusion model. This disjoint approach
adds complexity to infrastructure design. Available large-scale foundation model training pipelines
and deployment systems have already been highly optimized for language modeling with next-token
prediction. Designing a new stack to support diffusion models would incur significant engineering
costs. To circumvent such costs, it is desirable to design a single end-to-end autoregressive framework
for both image understanding and generation. A recent trend in VLMs [48, 75] is to adopt VQ
encoders that convert visual inputs into discrete tokens and treat them in the same next-token prediction
manner as language data. However, replacing continuous tokens with VQ tokens in VLMs usually
results in a severe performance drop in downstream visual perception tasks. Other works [52, 65]
have to make various architectural modifications and conduct multi-modal training from scratch,
which is computationally expensive.
In this work, we present VILA-U, an end-to-end autoregressive framework with a unified next-token
prediction objective for both visual and text inputs that can achieve competitive performance on
both visual language understanding and generation tasks, without the help of external components
like diffusion models. We identify two critical principles to unify vision and language modalities
effectively and efficiently. (1) Existing end-to-end autoregressive VLMs cannot achieve competitive
visual understanding performance because the discrete VQ tokens are trained solely on image
reconstruction loss and are not aligned with textual inputs. Therefore, it is crucial to introduce text
alignment during VQ vision tower pretraining to enhance perception capabilities. (2) Autoregressive
image generation can attain similar quality as diffusion models if trained on a high-quality data corpus
with sufficient size. Guided by these insights, VILA-U features a unified foundation vision tower that
converts visual inputs into discrete tokens through vector quantization and aligns these tokens with
textual inputs using contrastive learning. The multi-modal training of VILA-U takes advantage of a
unified next-token prediction objective for both visual and textual tokens on a relatively small but
high-quality image-text corpus.
We evaluate VILA-U on common visual language tasks, including image-language understanding,
video-language understanding, image generation and video generation. VILA-U significantly nar-
rows the gap in visual understanding performance between end-to-end autoregressive models and
continuous-token VLMs, while introducing competitive native visual generation capabilities.
2 Related Work
Large Language Models (LLMs). LLMs based on pre-trained large-scale transformers [68] have
drastically revolutionized the field of natural language processing. Featuring enormous model sizes and
pre-training corpora, LLMs have achieved remarkable performance on various linguistic tasks.
The development of open-source LLMs such as LLaMA [67], Mixtral [29] and Vicuna [13] has
further nourished research on how to adapt LLMs to complex language tasks. Besides their excellent
zero-shot generalizability to diverse domains, LLMs are commonly finetuned on custom datasets for
better performance on specific tasks. Instruction tuning [55, 14, 56] is also a key step toward better
outputs when applying LLMs. In this work, we adopt the LLaMA-2-7B [67] model as our base LLM.
Visual Language Models (VLMs). Combining computer vision and natural language processing
gives rise to VLMs in the LLM era. In VLMs, researchers leverage vision foundation models such
as CLIP [58], BLIP [38] and CoCa [74] to extract visual features, align them with text, and feed them
into LLMs to achieve cross-modality understanding between text and visual content. Building
upon such progress, many VLMs [3, 36, 51, 45] have been designed and trained on extensive
vision-language data, achieving remarkable performance on visual understanding and reasoning tasks.
VLMs can be divided into two types. (1) BLIP-style VLMs [4, 3, 39, 37, 16, 26] utilize a cross-attention
mechanism to fuse language and visual information and optionally apply perceivers [28] to downsample
visual tokens. (2) LLaVA-style VLMs [50, 20, 11, 1, 80, 72, 5, 2, 12, 49, 45, 79] convert visual
inputs to tokens (patches) and pass them through ViTs; the ViT outputs then go through MLP layers to be
aligned with the language space. In this work, we aim to develop a VLM with visual understanding
capacities comparable to prior works, while also possessing the new capacity of visual generation.
Unified Visual Language Models. Numerous efforts have been made to develop unified visual
language models capable of generating both text and visual content, including images and videos.
There are two mainstream methods to generate visual content in VLMs. Many works [64, 63, 31,
30, 41] combine VLMs with diffusion models like Stable Diffusion [60] for high-quality image
generation. Other works [48, 75, 52, 65, 70] adopt VQGAN-based vision encoders to convert visual
inputs into discrete tokens and make LLMs learn to predict them. In this work, we design our
framework based on the autoregressive next-token prediction method for visual generation and make
our VLM learn to generate visual content effectively and efficiently.
[Figure 1 diagram: multi-modal inputs (text, image, video) are encoded by a text encoder and a text-aligned vision encoder into a multi-modal token sequence; a generative multi-modal model performs next-token prediction, and the output tokens are decoded by the text decoder or vision decoder into multi-modal outputs (e.g., "The man is skating").]
Figure 1: An overview of our framework’s multi-modal training and inference process. Visual
inputs are tokenized into discrete tokens and concatenated with textual tokens to form a multi-modal
token sequence. All tokens are involved in our next-token prediction process, enabling a unified
training objective. During inference, the output tokens are decoded by our text detokenizer or vision
tower decoder to yield multi-modal content.
3 Methods
This work proposes a multi-modal framework that aims to unify visual and language modalities
efficiently and effectively. The key components enabling such unification are a unified foundation
vision tower that converts visual inputs into discrete tokens aligned with text, and a unified multi-
modal generative training procedure. An overview of the main multi-modal training and inference
process within our framework is depicted in Figure 1.
To support diverse visual understanding and generation tasks, we first build a unified foundation
vision tower to provide appropriate visual features. We propose to include a text-image contrastive loss
and a VQ-based image reconstruction loss in vision tower training, endowing the vision tower with both
text alignment and discrete tokenization abilities. As depicted in Figure 2, the features extracted
from images are first discretized through residual quantization. In one route, the discrete
visual features are fed into a decoder to reconstruct the image and compute the reconstruction loss;
in the other, we compute the image-text contrastive loss between the discrete visual features
and the textual features produced by a text encoder. With this training procedure, the vision tower
learns to extract discrete features suitable for both understanding and generation in our VLM.
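The sketch below illustrates how the two objectives could be combined in a single training step. It is a minimal illustration under assumed module interfaces (`vision_encoder`, `rq_quantizer`, `vision_decoder`, `text_encoder`), not the released implementation, and it omits auxiliary terms (e.g., perceptual or adversarial losses) that RQ-VAE-style decoders typically use.

```python
# Minimal sketch (assumed interfaces): VQ-based reconstruction loss plus
# CLIP-style text-image contrastive loss for the unified vision tower.
import torch
import torch.nn.functional as F

def vision_tower_loss(images, text_tokens, vision_encoder, rq_quantizer,
                      vision_decoder, text_encoder, temperature=0.07):
    feats = vision_encoder(images)                        # (B, N, C) continuous features
    quantized, codes, commit_loss = rq_quantizer(feats)   # residual quantization (assumed API)

    # Route 1: reconstruct the image from the discrete features.
    recon = vision_decoder(quantized)
    recon_loss = F.mse_loss(recon, images)

    # Route 2: contrastive alignment between pooled discrete visual features
    # and text features from the text encoder.
    img_emb = F.normalize(quantized.mean(dim=1), dim=-1)  # (B, C)
    txt_emb = F.normalize(text_encoder(text_tokens), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(images.size(0), device=images.device)
    contrastive_loss = 0.5 * (F.cross_entropy(logits, labels) +
                              F.cross_entropy(logits.t(), labels))

    return recon_loss + commit_loss + contrastive_loss
```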
[Figure 2 diagram: a pretrained vision encoder feeds a residual quantizer that produces discrete visual tokens; the tokens are passed to a vision decoder, with a reconstruction loss computed on the decoded images; the encoder, quantizer, and decoder are trainable.]
Figure 2: Overview of our unified foundation vision tower. Given input images, the features
extracted by the vision encoder are discretized using residual quantization. The discrete visual
features are then fed into the vision decoder to reconstruct the images and are simultaneously used for
text-image alignment. During this process, the reconstruction loss and the contrastive loss are computed
to update the vision tower, endowing it with the ability to produce discrete visual features aligned with text.
Discussion: Failed Training Recipes. We experiment with numerous training recipes and find none
to be as effective as our final approach. We list four alternative recipes and discuss their shortcomings
compared to our final recipe: (1) Load pre-trained CLIP weights into the text encoder only; (2) Load
pre-trained RQ-VAE weights for the vision encoder and decoder while training other parts from
scratch; (3) Freeze the vision encoder; (4) Make the text encoder trainable.
Recipes 1) and 2) fail due to the lack of pre-trained CLIP weights for the vision encoder. Training a
CLIP model from scratch typically requires numerous GPU days with a large global batch size (e.g.,
32k). However, VQ-based reconstruction training necessitates a relatively small global batch size
(e.g., 512) for steady improvement. With such a small batch size, training a text-aligned vision tower
from scratch would be prohibitively time-consuming and resource-intensive.
Recipe 3) fails because freezing the vision encoder prevents it from learning the low-level features
essential for reconstruction. In this case, the burden of reconstruction falls entirely on the vision
decoder, but it is impossible to reconstruct images well using only semantic features.
Recipe 4) fails because the quantized features are chaotic during the initial training steps, and the
contrastive loss disrupts the text encoder weights, slowing down the entire training process.
In contrast, our final training recipe initializes the vision encoder with pre-trained CLIP weights,
enabling it to retain learned semantic features rather than acquiring them from scratch. This allows
us to train with a small batch size while keeping the vision encoder trainable, facilitating the learning
of low-level features for reconstruction during training.
Residual Vector Quantization. Our visual features are discretely quantized, so their representation
ability depends heavily on the code size used in our quantizer. Since we want them to contain both
high-level and low-level features, we need more capacity in their feature space, making a
larger code size necessary for good performance in downstream tasks. However, too many codes
per image result in too many tokens for the LLM to produce during visual generation,
incurring substantial latency. To increase the feature capacity while maintaining a reasonable number
of tokens for the LLM, we adopt a residual vector quantization method following RQ-VAE [33]
to discretize a vector z into D discrete codes:
$$\mathrm{RQ}(z; \mathcal{C}, D) = (k_1, \cdots, k_D) \in [K]^D, \tag{2}$$
where $\mathcal{C}$ is the codebook, $K = |\mathcal{C}|$, and $k_d$ is the code of $z$ at depth $d$. Starting with $r_0 = z$, we
recursively perform vector quantization by
$$k_d = \mathcal{Q}(r_{d-1}, \mathcal{C}), \qquad r_d = r_{d-1} - e(k_d), \tag{3}$$
for each depth $d = 1, 2, \cdots, D$, where $e$ is the codebook embedding table and $\mathcal{Q}$ is the standard
vector quantization:
$$\mathcal{Q}(z; \mathcal{C}) = \arg\min_{k \in [K]} \| z - e(k) \|_2^2. \tag{4}$$
The quantized vector for $z$ is the sum over the depth dimension: $\hat{z} = \sum_{i=1}^{D} e(k_i)$. Intuitively, at each depth
we choose a code that reduces the remaining quantization error. Compared to standard vector quantization,
we thus have $D$ codes to quantize one vector, allowing a finer approximation and a larger feature
space. During multi-modal training and inference, the LLM only needs to predict the code embedding,
with codes at different depths sequentially produced by a depth transformer that takes the code embedding
as its initial input, as we will introduce in Section 3.2. With this residual quantization, we
enhance the representation capability of our vision tower while incurring little additional latency.
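The following is a compact sketch of the residual quantization procedure in Eqs. (2)-(4), written for a batch of vectors with an explicit codebook tensor. Shapes and the training-time details (commitment loss, straight-through estimator) are assumptions; the paper itself builds on the RQ-VAE implementation.

```python
# Sketch of residual vector quantization over a batch of N vectors of dim C.
import torch

def residual_quantize(z, codebook, depth):
    """z: (N, C); codebook: (K, C). Returns codes (N, depth) and hat_z (N, C)."""
    residual = z                                  # r_0 = z
    codes = []
    hat_z = torch.zeros_like(z)
    for _ in range(depth):
        dists = torch.cdist(residual, codebook)   # (N, K) Euclidean distances; argmin matches Eq. (4)
        k = dists.argmin(dim=-1)                  # code at the current depth
        e_k = codebook[k]                         # e(k_d)
        codes.append(k)
        hat_z = hat_z + e_k                       # hat_z = sum_d e(k_d)
        residual = residual - e_k                 # Eq. (3): r_d = r_{d-1} - e(k_d)
    return torch.stack(codes, dim=-1), hat_z
```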
Figure 1 presents an overview of our unified multi-modal pre-training process. Our vision tower
encoder processes visual inputs sequentially, generating a 1D token sequence. This sequence is then
concatenated with text tokens to form a multi-modal sequence. To distinguish between modalities
and enable visual content generation, we insert special tokens: <image_start> and <image_end>
at the start and end of image tokens, and <video_start> and <video_end> at the start and end
of video tokens. Video tokens are the direct concatenation of multi-frame image tokens.
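As an illustration of the sequence layout described above, the following sketch wraps visual tokens with the special tokens and concatenates them with text tokens. The token IDs and the `special` mapping are hypothetical; only the token names come from the paper.

```python
# Illustrative sketch of building a multi-modal token sequence.
def wrap_visual_tokens(visual_ids, modality, special):
    if modality == "image":
        return [special["<image_start>"], *visual_ids, special["<image_end>"]]
    if modality == "video":
        # video tokens are the direct concatenation of per-frame image tokens
        return [special["<video_start>"], *visual_ids, special["<video_end>"]]
    raise ValueError(f"unknown modality: {modality}")

def build_sequence(text_ids, visual_ids, modality, special, visual_first=True):
    visual = wrap_visual_tokens(visual_ids, modality, special)
    # e.g. [image, text] for understanding, [text, image] for generation
    return visual + text_ids if visual_first else text_ids + visual
```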
Pre-training data form. In terms of unified pre-training data, we leverage different concatenation
forms between text and visual tokens to facilitate both understanding and generation. We use [image,
text], [text, image], and [text, video] forms, with supervision loss added only on the latter
modality in each pair to avoid unconditional content generation and promote modality alignment.
We also employ an interleaved text and image concatenation form for enhanced understanding, with
supervision loss applied solely to the text. Notably, we exclude the [video, text] form during
pre-training for efficiency reasons, as we find incorporating it during supervised fine-tuning effectively
yields excellent video understanding ability.
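A hedged sketch of the supervision masking implied by these data forms is given below; the segment-based representation is an assumption made purely for illustration.

```python
# Sketch: which tokens receive the next-token prediction loss for each data form.
def supervision_mask(segments):
    """segments: list of (token_ids, modality) pairs in sequence order.
    Returns a 0/1 mask per token indicating whether loss is applied."""
    mask = []
    for idx, (ids, modality) in enumerate(segments):
        if len(segments) == 2:
            supervise = (idx == 1)            # pairs: loss only on the latter modality
        else:
            supervise = (modality == "text")  # interleaved data: loss on text only
        mask.extend([1 if supervise else 0] * len(ids))
    return mask
```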
Training Objective. Since both visual tokens and text tokens are discrete, we can train our LLM with
the general language modeling next-token prediction objective. However, due to the use of residual
quantization for visual tokens, the training objectives for text and visual tokens differ slightly. For
text tokens, the negative log-likelihood loss is calculated as
$$\mathcal{L}_{\text{text}} = -\sum_{i=1}^{T} \log P_\theta\left(y_i \mid y_{<i}\right), \tag{5}$$
where $T$ is the length of the multi-modal sequence and the sum runs only over positions $i$ where a text
token appears. For visual tokens, residual quantization introduces a depth-stacked structure of codes at
each visual position j. To address this, we leverage the depth transformer introduced in RQ-VAE
[33]. Specifically, given the code embedding hj generated by the LLM for visual tokens at position j,
the depth transformer autoregressively predicts D residual tokens (kj1 , ..., kjD ). During training, the
input of the depth transformer vjd at depth d is defined as the sum of the code embeddings of up to
depth d − 1 for d > 1 such that
$$v_{jd} = \sum_{d'=1}^{d-1} e(k_{jd'}), \tag{6}$$
and $v_{j1} = h_j$. Thus, the depth transformer predicts the next code for a finer estimation of the feature
$\hat{z}_j$ based on the previous estimations up to depth $d-1$. Then the negative log-likelihood loss for visual
tokens is
$$\mathcal{L}_{\text{visual}} = -\sum_{j=1}^{T}\sum_{d=1}^{D} \log P_\delta\left(k_{jd} \mid k_{j,<d}\right), \tag{7}$$
where $T$ is the length of the multi-modal sequence and the sum runs only over positions $j$ where a visual
token appears. During multi-modal pre-training, the weights of the depth transformer are randomly
initialized and updated together with the LLM.
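To make the combined objective concrete, here is a minimal sketch of how the text loss of Eq. (5) and the depth-wise visual loss of Eq. (7) could be computed in one pass. The tensor layouts, masking convention, and module APIs (`llm`, `lm_head`, `depth_transformer`) are assumptions for illustration, not the authors' code.

```python
# Sketch of the unified next-token prediction loss over a multi-modal sequence.
import torch
import torch.nn.functional as F

def multimodal_nll(llm, lm_head, depth_transformer,
                   input_embeds, token_ids, visual_codes, is_text):
    # input_embeds: (B, T, C); is_text: (B, T) bool
    # token_ids: (B, T) discrete text ids (arbitrary on visual positions)
    # visual_codes: (B, T, D) residual codes (arbitrary on text positions)
    hidden = llm(inputs_embeds=input_embeds)          # (B, T, C) hidden states (assumed API)

    # Eq. (5): cross-entropy over positions whose *target* is a text token.
    text_mask = is_text[:, 1:]
    text_logits = lm_head(hidden[:, :-1][text_mask])  # predict token i+1 from state at i
    text_loss = F.cross_entropy(text_logits, token_ids[:, 1:][text_mask])

    # Eq. (7): the depth transformer predicts D residual codes per visual
    # position, conditioned on the LLM code embedding h_j (teacher forcing).
    vis_mask = ~is_text[:, 1:]
    h_j = hidden[:, :-1][vis_mask]                    # (Nv, C)
    codes = visual_codes[:, 1:][vis_mask]             # (Nv, D)
    depth_logits = depth_transformer(h_j, codes)      # (Nv, D, K) assumed output
    visual_loss = F.cross_entropy(depth_logits.flatten(0, 1), codes.flatten())

    return text_loss + visual_loss
```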
4 Experiments
In this section, we introduce comprehensive experiments to evaluate the performance of our method
on various visual understanding and generation tasks. Firstly, we outline our experimental setup,
including the model architecture, training datasets, and evaluation benchmarks. Subsequently, we
evaluate the performance of our unified foundation vision tower. Then, we compare our method with
other popular VLMs on various visual understanding and generation benchmarks. Finally, we give
some qualitative results.
In our experiments, we employ LLaMA-2-7B [66] as our base language model. For the vision
tower, we choose SigLIP-Large-patch16-256 / SigLIP-SO400M-patch14-384 [77] as our vision
encoder architecture, and adopt the residual quantizer, depth transformer as well as the decoder
architecture from RQ-VAE [33]. The quantizer codebook size is 16384. All images and videos are
resized to a resolution of 256 × 256 / 384 × 384, with each image or video frame converted into a
16 × 16 × 4 / 27 × 27 × 16 code with the residual depth D = 4 / D = 16. We train our vision tower
on COYO-700M [6] and evaluate it for zero-shot classification and reconstruction performance on
ImageNet [18]. For visual understanding, we leverage 1M [image, text] data from ShareGPT4V
[10] and 6M interleaved text and image data from MMC4 [81]. For visual generation, we incorporate
15M high-quality [text, image] data curated from our internal dataset and 1M [text, video]
data from the OpenVid [54] dataset. Classifier-free guidance [25] is employed for visual generation
with a CFG value of 3.
For examining visual understanding ability, we evaluate our model on the widely adopted zero-shot
image-based visual-language benchmarks including VQA-v2 [24], GQA [27], TextVQA [62], POPE
[42], MME [23], SEED [34], MM-Vet [76] and video-based visual-language benchmarks including
ActivityNet [7], MSVD [8], MSRVTT [71], TGIF [43].
To evaluate the visual generation capability, we use MJHQ-30K [35] and GenAI-Bench [46] as our
benchmarks. The former adopts the FID between generated images and 30K high-quality images to
reflect the overall capability of image generation. The latter is a challenging benchmark that evaluates
text-to-image generation and reflects the comprehensive generative abilities of visual generation models. This
benchmark is divided into two categories of prompts: basic skills, which include attribute, scene, and
relation understanding in text inputs, and advanced skills, which encompass counting, differentiation,
comparison, and logical relation understanding in text inputs.
We present the commonly used metrics reconstruction FID (rFID) and Top-1 accuracy for zero-shot
image classification on ImageNet to measure the reconstruction and text alignment capabilities of the
unified foundation vision tower in Table 1. Our model achieves significantly better reconstruction
results than VQ-GAN. Our rFID is slightly inferior to that of RQ-VAE when using the same code
shape. This is expected as the introduction of contrastive loss during training, aimed at enhancing
image understanding, led to a decrease in reconstruction quality. For the text alignment capability,
our unified vision tower achieves a Top-1 accuracy of 73.3 / 78.0 under 256 / 384 resolution. This
demonstrates the strong text alignment capability of our unified vision tower. However, it is
worth noting that the rFID and Top-1 accuracy of the vision tower serve only as intermediate
indicators rather than being linearly correlated with the final performance of our whole multi-modal
framework. The performance on visual understanding and generation tasks presented in the following
sections is more important.
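For reference, zero-shot classification with a text-aligned vision tower is typically performed by matching pooled image embeddings against text embeddings of class-name prompts. The sketch below is a generic CLIP-style illustration with assumed module names and pre-tokenized prompts, not the exact evaluation code.

```python
# Sketch: zero-shot ImageNet classification with the text-aligned vision tower.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(images, class_prompt_tokens, vision_tower, text_encoder):
    # Pooled, quantized visual embedding for each image (assumed (B, N, C) output).
    img_emb = F.normalize(vision_tower(images).mean(dim=1), dim=-1)      # (B, C)
    # One text embedding per class, e.g. tokenized "a photo of a {class}".
    txt_emb = F.normalize(text_encoder(class_prompt_tokens), dim=-1)     # (K, C)
    return (img_emb @ txt_emb.t()).argmax(dim=-1)                        # predicted class ids
```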
Table 1: The reconstruction FID (rFID) and Top-1 accuracy for zero-shot image classification of our
unified vision tower on ImageNet.
Model Pretrained Weights Resolution Shape of Code rFID↓ Top-1 Accuracy↑
VQ-GAN [22] – 256 × 256 16 × 16 4.98 –
RQ-VAE [33] – 256 × 256 8×8×4 3.20 –
RQ-VAE [33] – 256 × 256 16 × 16 × 4 1.30 –
Ours SigLIP-Large 256 × 256 16 × 16 × 4 1.80 73.3
Ours SigLIP-SO400M 384 × 384 27 × 27 × 16 1.25 78.0
Table 2: Comparison with leading methods on image-based visual language benchmarks. Our
performance is close to leading VLMs, surpassing many methods by a large margin under the same
LLM size, even with a discrete visual token type. * indicates that images in the training split of these
datasets are observed during VLM training.
Method LLM Visual Token Res. VQAv2 GQA TextVQA POPE MME SEED MM-Vet
LLaVA-1.5 [51] Vicuna-1.5-7B Continuous 336 78.5∗ 62.0∗ 58.2 85.9 1510.7 58.6 30.5
VILA [45] LLaMA-2-7B Continuous 336 79.9∗ 62.3∗ 64.4 85.5 1533.0 61.1 34.9
Unified-IO 2 [52] 6.8B from scratch Continuous 384 79.4∗ – – 87.7 – 61.8 –
InstructBLIP [15] Vicuna-7B Continuous 224 – 49.2 50.1 – – 53.4 26.2
IDEFICS-9B [32] LLaMA-7B Continuous 224 50.9 38.4 25.9 – – – –
Emu [64] LLaMA-13B Continuous 224 52.0 – – – – – –
LaVIT [31] LLaMA-7B Continuous 224 66.0 46.8 – – – – –
DreamLLM [19] Vicuna-7B Continuous 224 72.9∗ – 41.8 – – – 36.6
Video-LaVIT [30] LLaMA-2-7B Continuous 224 80.2∗ 63.6∗ – – 1581.5 64.4 35.0
CM3Leon-7B [75] 7B from scratch Discrete 256 47.6 – – – – – –
LWM [48] LLaMA-2-7B Discrete 256 55.8 44.8 18.8 75.2 – – 9.6
Show-o [70] Phi-1.5B Discrete 256 59.3∗ 48.7∗ – 73.8 948.4 – –
Ours LLaMA-2-7B Discrete 256 75.3∗ 58.3∗ 48.3 83.9 1336.2 56.3 27.7
Ours LLaMA-2-7B Discrete 384 79.4∗ 60.8∗ 60.8 85.8 1401.8 59.0 33.5
Table 3: Comparison with leading methods on video-based visual language benchmarks. The
performance of our method is close to state-of-the-art VLMs, surpassing many methods under the
same LLM size, even with a discrete visual token type.
Method LLM Visual Token Res. MSVD-QA MSRVTT-QA TGIF-QA ActivityNet-QA
Unified-IO 2 [52] 6.8B from scratch Continuous 384 52.1 42.5 – –
Emu [64] LLaMA-13B Continuous 224 – 18.8 8.3 –
VideoChat [40] Vicuna-7B Continuous 224 56.3 45 34.4 –
Video-LLaMA [78] LLaMA-2-7B Continuous 224 51.6 29.6 – –
Video-ChatGPT [53] LLaMA-2-7B Continuous 224 64.9 49.3 51.4 35.2
Video-LLava [44] Vicuna-7B Continuous 224 70.7 59.2 70.0 45.3
Video-LaVIT [30] LLaMA-2-7B Continuous 224 73.5 59.5 – 50.2
LWM [48] LLaMA-2-7B Discrete 256 55.9 44.1 40.9 –
Ours LLaMA-2-7B Discrete 256 73.4 58.9 51.3 51.6
Ours LLaMA-2-7B Discrete 384 75.3 60.0 51.9 52.7
Visual Understanding Tasks. Table 2 and Table 3 summarize the comparison between our method
and other leading VLMs on the image-language and video-language benchmarks respectively. Com-
pared to the mainstream choice of continuous visual tokens produced by foundation models like
CLIP, the VQGAN-based discrete visual tokens have less alignment with text, thus harming VLMs’
performance on visual understanding tasks. With our unified foundation vision tower, our model
achieves performance close to leading VLMs even with discrete visual tokens.
Visual Generation Tasks. As shown in Table 4, VILA-U achieves a better FID than other autoregressive
methods and comparable performance with some diffusion-based methods. This result shows the feasibility
of our method for visual generation. Table 5 summarizes the quantitative results of our method and other
visual generation methods on GenAI-Bench. Although our method is inferior to diffusion-based visual
generation methods that have been trained on billions of image-text pairs, it achieves comparable
performance with SD v2.1 [61] and SD-XL [57] on advanced prompts even when trained with orders of
magnitude less data. This further shows that VILA-U can learn the correlation between visual and textual
modalities effectively and efficiently with our unified training framework.
Table 4: Comparison with other visual generation methods on the MJHQ-30K evaluation benchmark.
Method Type #Images FID↓
SD-XL [57] Diffusion 2000M 9.55
PixArt [9] Diffusion 25M 6.14
Playground v2.5 [35] Diffusion – 4.48
LWM [48] Autoregressive – 17.77
Show-o [70] Autoregressive 36M 15.18
Ours (256) Autoregressive 15M 12.81
Ours (384) Autoregressive 15M 7.69
Table 5: Comparison with other visual generation methods on GenAI-Bench [46]. The results show
that our method outperforms previous autoregressive visual generation methods. For advanced
prompts that require better text following ability to generate, our method can have a relatively small
performance gap with diffusion-based methods, even with much less training data and time.
Method Type #Training Images Attribute↑ Scene↑ Relation↑ (Spatial / Action / Part) Overall↑
SD v2.1 [60] Diffusion 2000M 0.80 0.79 0.76 0.77 0.80 0.78
SD-XL [57] Diffusion 2000M 0.84 0.84 0.82 0.83 0.89 0.83
Midjourney v6 [59] Diffusion – 0.88 0.87 0.87 0.87 0.91 0.87
DALL-E 3 [47] Diffusion – 0.91 0.90 0.92 0.89 0.91 0.90
LWM [48] Autoregressive – 0.63 0.62 0.65 0.63 0.70 0.63
Show-o [70] Autoregressive 36M 0.72 0.72 0.70 0.70 0.75 0.70
Ours (256) Autoregressive 15M 0.78 0.78 0.77 0.78 0.79 0.76
Ours (384) Autoregressive 15M 0.75 0.76 0.75 0.73 0.75 0.73
(a) VQAScores on basic prompts of GenAI-Bench
Method Type #Training Images Count↑ Differ↑ Compare↑ Logical↑ (Negate / Universal) Overall↑
SD v2.1 [60] Diffusion 2000M 0.68 0.70 0.68 0.54 0.64 0.62
SD-XL [57] Diffusion 2000M 0.71 0.73 0.69 0.50 0.66 0.63
Midjourney v6 [59] Diffusion – 0.78 0.78 0.79 0.50 0.76 0.69
DALL-E 3 [47] Diffusion – 0.82 0.78 0.82 0.48 0.80 0.70
LWM [48] Autoregressive – 0.59 0.58 0.54 0.49 0.52 0.53
Show-o [70] Autoregressive 36M 0.70 0.62 0.71 0.51 0.65 0.60
Ours (256) Autoregressive 15M 0.70 0.71 0.74 0.53 0.66 0.64
Ours (384) Autoregressive 15M 0.68 0.67 0.71 0.51 0.64 0.61
5 Ablation Study
In this section, we examine key design components of our framework and conduct ablation
experiments to show how they influence performance.
Prompt: <video>\n
Elaborate on the visual and narrative elements of the video in detail.
VILA-U(Ours): The video features a person wearing a blue t-shirt, khaki pants, and a cap, executing a series of breakdancing moves on
a concrete road. The background includes a clear sky, a bridge, and a train passing by. The dancer's movements are dynamic, with spins,
flips, and handstands. The lighting is bright, suggesting daytime, and the shadows indicate the sun is high. The dancer's attire is casual,
and the setting appears to be a quiet road with minimal traffic.
Figure 3: VILA-U can correctly caption videos and cover all the details, thanks to the text alignment
of our vision encoder.
Figure 4: VILA-U has good visual question answering capability. The images and questions are from the test split of the VQAv2 dataset.
Figure 5: VILA-U has good in-context learning capability. We feed two image-text pairs and a third image as the context to prompt the VLM.
We include contrastive loss in vision tower training, which endows it with the text alignment ability.
During our multi-modal training, such text alignment ability is crucial in enhancing modality fusion
and performance on downstream visual language tasks. We validate the importance of this alignment
by training the vision tower with and without the contrastive loss, evaluating its impact on visual
language understanding performance. For this ablation study, we randomly sample 25M data from
COYO-700M to train the vision tower. For multi-modal training, we use ShareGPT4V and MMC4
without text-image and text-video data. The results of the first two lines in Table 6 demonstrate the
crucial role of text alignment in achieving strong visual language understanding performance. Scaling
the dataset size from 25M to 700M further enhances performance, highlighting the importance of
learning text alignment on a large-scale dataset.
We conduct two experiments to demonstrate the influence of the contrastive loss on generation
performance. For efficiency, we conduct only text-to-image pretraining and utilize Sheared-LLaMA-1.3B
[69] instead of LLaMA-2-7B as the LLM. In the first experiment, we use the RQ-VAE as the vision
tower, which has an rFID of 1.30. In the second experiment, we employ our unified vision tower.
Results are shown in Table 7. Our Unified Vision Tower yielded slightly worse FID results than
the RQ-VAE on MJHQ-30K, possibly due to its inferior rFID resulting from the introduction of
contrastive loss.
Table 7: Impact of the contrastive loss on visual generation.
Vision Tower LLM Resolution rFID↓ FID↓
RQ-VAE [33] Sheared-LLaMA-1.3B 256 × 256 1.30 12.0
Ours Sheared-LLaMA-1.3B 256 × 256 1.80 13.2

Table 8: Impact of CFG.
CFG Value FID↓
1.0 14.1
2.0 13.0
3.0 12.8
5.0 13.2
We adopt classifier-free guidance during the visual content generation. We investigate the impact of
the CFG value on our 256-resolution model. Results presented in Table 8 indicate that a CFG value
of 3.0 yields the best FID score.
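For context, classifier-free guidance for autoregressive sampling is commonly implemented by mixing conditional and unconditional logits at each generation step. The sketch below is a generic illustration under an assumed model interface, not the exact implementation.

```python
# Sketch: classifier-free guidance for autoregressive visual token sampling.
import torch

@torch.no_grad()
def guided_logits(model, tokens, prompt, cfg_scale=3.0):
    cond = model(tokens, condition=prompt)       # prompt-conditioned logits (assumed API)
    uncond = model(tokens, condition=None)       # unconditional logits
    # cfg_scale = 3.0 gave the best FID in Table 8
    return uncond + cfg_scale * (cond - uncond)  # guided logits used for sampling
```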
6 Conclusion
We present VILA-U, a novel and unified visual language model that integrates video, image and
language understanding and generation tasks into one autoregressive next-token prediction framework.
Our method is not only more concise than most VLMs that leverage additional components like
diffusion models for unifying visual generation and understanding, but also demonstrates that
autoregressive methods can achieve comparable performance to state-of-the-art VLMs. Our success
is due to both a unified foundation vision tower that aligns discrete visual features with texts during
pre-training and a high-quality dataset suitable for visual understanding and generation training. We
believe VILA-U can serve as a general-purpose framework for diverse visual language tasks.
Limitations. There is still a performance gap in visual understanding ability between VILA-U and
state-of-the-art VLMs that leverage continuous visual features. Besides, the visual generation quality is
relatively low compared to state-of-the-art diffusion models. In future work, we will work to overcome
these limitations and build an advanced VLM that achieves state-of-the-art performance across all kinds
of visual language tasks.
References
[1] ADEPT AI. Fuyu-8B: A multimodal architecture for AI agents. https://www.adept.ai/blog/
fuyu-8b, 2023.
[2] Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. Jointly training large
autoregressive multimodal models. arXiv preprint arXiv:2309.15564, 2023.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc,
Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for
few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[4] Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe,
Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell
Wortsman, and Ludwig Schmidt. Openflamingo, March 2023.
[5] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han,
Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[6] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim.
Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[7] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A
large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on
computer vision and pattern recognition, pages 961–970, 2015.
[8] David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings
of the 49th annual meeting of the association for computational linguistics: human language technologies,
pages 190–200, 2011.
[9] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James
Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic
text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
[10] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin.
Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793,
2023.
[11] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme
Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and
language model. arXiv preprint arXiv:2305.18565, 2023.
[12] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang,
Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic
visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[13] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan
Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4
with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
[14] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.
Journal of Machine Learning Research, 25(70):1–53, 2024.
[15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang
Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with
instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[16] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Al-
bert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models
with instruction tuning. ArXiv, abs/2305.06500, 2023.
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
[19] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun,
Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv
preprint arXiv:2309.11499, 2023.
[20] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan
Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language
model. arXiv preprint arXiv:2303.03378, 2023.
[21] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image
synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
12873–12883, 2021.
[22] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image
synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
12873–12883, 2021.
[23] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng,
Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for
multimodal large language models, 2024.
[24] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA
matter: Elevating the role of image understanding in Visual Question Answering. In Conference on
Computer Vision and Pattern Recognition (CVPR), 2017.
[25] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
2022.
[26] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang,
Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint
arXiv:2312.08914, 2023.
[27] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and
compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 6700–6709, 2019.
[28] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver:
General perception with iterative attention. In International conference on machine learning, pages 4651–
4664. PMLR, 2021.
[29] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of
experts. arXiv:2401.04088, 2024.
[30] Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu,
Di Zhang, Yang Song, et al. Video-lavit: Unified video-language pre-training with decoupled visual-
motional tokenization. arXiv preprint arXiv:2402.03161, 2024.
[31] Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, CHEN Bin, Chengru Song,
Di ZHANG, Wenwu Ou, et al. Unified language-vision pretraining in llm with dynamic discrete visual
tokenization. In The Twelfth International Conference on Learning Representations, 2023.
[32] Hugo Laurençon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Sid-
dharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, et al. Introducing idefics: An open
reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics.
[33] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image
generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 11523–11532, 2022.
[34] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking
multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground
v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint
arXiv:2402.17245, 2024.
[36] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training
with frozen image encoders and large language models. In International conference on machine learning,
pages 19730–19742. PMLR, 2023.
[37] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training
with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[38] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation. In ICML, 2022.
[39] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation. In International Conference on Machine
Learning, pages 12888–12900. PMLR, 2022.
[40] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and
Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
[41] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng
Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models.
arXiv:2403.18814, 2023.
[42] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object
hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[43] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo.
Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 4641–4650, 2016.
[44] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual
representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
[45] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad
Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023.
[46] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and
Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint
arXiv:2404.01291, 2024.
[47] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and
Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint
arXiv:2404.01291, 2024.
[48] Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language
with ringattention. arXiv preprint arXiv:2402.08268, 2024.
[49] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction
tuning. arXiv preprint arXiv:2310.03744, 2023.
[50] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[51] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural
information processing systems, 36, 2024.
[52] Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and
Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language,
audio, and action. arXiv preprint arXiv:2312.17172, 2023.
[53] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards
detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424,
2023.
[54] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and
Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint
arXiv:2407.02371, 2024.
[55] OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, 2023.
[56] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[57] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna,
and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv
preprint arXiv:2307.01952, 2023.
[58] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from
natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR,
2021.
[59] Ar Mohesh Radhakrishnan. Is midjourney-ai the new anti-hero of architectural imagery & creativity? GSJ,
11(1):94–104, 2023.
[60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 10684–10695, 2022.
[61] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 10684–10695, 2022.
[62] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and
Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 8317–8326, 2019.
[63] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming
Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners.
arXiv preprint arXiv:2312.13286, 2023.
[64] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing
Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint
arXiv:2307.05222, 2023.
[65] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2024.
[66] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation
language models. arXiv preprint arXiv:2302.13971, 2023.
[67] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv:2302.13971,
2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc., 2017.
[69] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model
pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
[70] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao
Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify
multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.
[71] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video
question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th
ACM international conference on Multimedia, pages 1645–1653, 2017.
[72] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu,
Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with
multimodality. arXiv preprint arXiv:2304.14178, 2023.
[73] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu,
Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint
arXiv:2110.04627, 2021.
[74] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca:
Contrastive captioners are image-text foundation models, 2022.
[75] Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu,
Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models: Pretraining
and instruction tuning. arXiv preprint arXiv:2309.02591, 2(3), 2023.
[76] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and
Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint
arXiv:2308.02490, 2023.
[77] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image
pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages
11975–11986, 2023.
[78] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model
for video understanding. arXiv preprint arXiv:2306.02858, 2023.
[79] Pan Zhang, Xiaoyi Dong Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui
Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large
model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
[80] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing
vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,
2023.
[81] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu,
Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of
images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.