AlignGPT: Multi-Modal Large Language Models with Adaptive Alignment Capability
Fei Zhao*, Taotian Pang*, Chunhui Li, Zhen Wu†, Junjie Guo, Shangyu Xing, Xinyu Dai
National Key Laboratory for Novel Software Technology, Nanjing University
{zhaof,pangtt,lich,guojj,xsy}@smail.nju.edu.cn,{wuz,daixinyu}@nju.edu.cn
https://aligngpt-vl.github.io
* Equal contributions. † Corresponding author.

Abstract

Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: the pre-training phase and the instruction-tuning phase. Despite their success, there are shortcomings in the modeling of alignment capabilities within these models. Firstly, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, but in fact the degree of alignment between different image-text pairs is inconsistent. Secondly, the instructions currently used for finetuning incorporate a variety of tasks, and different tasks usually require different levels of alignment capabilities, but previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose a new multimodal large language model, AlignGPT. In the pre-training stage, instead of treating all image-text pairs equally, we divide them into different groups according to their degrees of alignment. Then, the model is trained to learn the representations of the different alignment levels. In the instruction-tuning phase, we adaptively combine these representations of alignment levels to meet the dynamic alignment needs of different tasks. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks.

Figure 1. AlignGPT achieves competitive performance on a broad range of vision-language tasks compared with other generalist models (radar chart over POPE, VizWiz, SQA-I, VQA-V2, MME, GQA, MMBench, SEED-I, LLaVA-W, MM-Vet, and MMBench-CN). To facilitate observation, we only show the performance of MiniGPT-v2 and AlignGPT.

1. Introduction

Multimodal Large Language Models (MLLMs) are considered a crucial step towards achieving Artificial General Intelligence (AGI) [1, 13, 30, 31]. The uniqueness of these models lies in their ability to integrate and understand various types of information, especially text and image data. In the pursuit of AGI, this cross-modal understanding and processing capability is essential, as it mimics how humans interact with the world and comprehend complex information through different senses, such as vision and language. The development of multimodal large language models not only advances the field of artificial intelligence but also provides machines with a way to process and understand information that is closer to human cognition.

Currently, MLLMs typically adhere to a unified training paradigm, which is divided into two key phases: the pre-training phase and the instruction-tuning phase [3, 5, 19, 25, 40, 41, 43, 46, 47]. The pre-training phase concentrates on aligning images with text, aiming to train the model to understand the relation between image contents and their respective textual descriptions. This alignment imbues the model with cross-modal comprehension abilities. The instruction-tuning phase further enhances its adaptability to specific tasks. This includes enabling the model to complete particular visual-language tasks based on given instructions, such as generating textual descriptions from images, answering questions related to images, or even performing complex reasoning based on both text and images. This training paradigm equips multimodal pre-trained models with not only fundamental cross-modal understanding but also the flexibility to adapt to diverse task demands.
Figure 2. Examples of image-text pairs in the pre-training dataset, where the numbers in each image (0.186, 0.220, 0.256, and 0.297 from left to right) represent the CLIP similarity.

Although current MLLMs have made great progress, the modeling of their alignment capabilities is still insufficient for the following reasons:

• The degree of alignment is inconsistent between different image-text pairs: During the pre-training phase, the model typically operates on a key assumption that all image-text pairs are consistently aligned. However, the degree of alignment in image-text pairs is not always uniform: in some image-text pairs, the text may describe the whole image (as shown in the rightmost pair of Fig. 2), while in others the text only describes a part of the image (as shown in the left three image-text pairs in Fig. 2). If these differences are not differentiated during the pre-training phase, it could lead to a misunderstanding of the image-text alignment relationships in the learning process.

• Different tasks require different levels of alignment capabilities: The instructions currently used for finetuning cover a variety of tasks. Some of them, like image captioning [42], rely more on global image-text alignment capabilities. In contrast, other tasks, such as visual question answering (VQA) [2], typically require the model to answer questions based on specific parts of the image, which necessitates not only global image-text alignment but also local image-text alignment capabilities. However, previous work has neglected these differentiated alignment requirements.

To effectively enhance the alignment capabilities, we propose a new multimodal large language model called AlignGPT. In the pre-training phase, we aim to make the model understand different degrees of the image-text alignment relation. Specifically, instead of treating all image-text pairs equally, we divide image-text pairs into different groups according to their degrees of alignment and give an extra group label to each pair. This process is achieved with the help of CLIP scores [34], where higher scores indicate higher degrees of alignment [17, 36]. For example, in Fig. 2, the degree of alignment of each image-text pair rises from left to right, and the CLIP score of each pair also increases. Subsequently, we utilize these group labels as control signals to make the model learn the representations of different alignment levels. During the instruction-tuning phase, the model is trained to dynamically combine the representations obtained by pre-training for the instructions of each task. In this process, we not only assign global alignment capabilities but also adaptively configure different local alignment capabilities according to the alignment needs of the instructions of each task. The broad range of tests conducted demonstrates that our model achieves competitive performance across 12 benchmarks, as shown in Fig. 1.

Our contribution can be summarized as follows:

• We propose a new multi-modal large language model AlignGPT to elevate and empower the alignment capabilities of MLLMs.

• We propose a novel alignment strategy that learns different alignment levels in the pre-training stage, and then adaptively combines these alignment levels in the instruction-tuning stage to meet the needs of alignment capabilities for different tasks.

• We conduct evaluations across multiple academic benchmarks and multimodal instruction-following benchmarks. Extensive experimental results show that our proposed AlignGPT achieves competitive performance. Further analysis verifies the effectiveness of the model.

2. Related Work

In this section, we review the existing studies on large language models and visual language models.
Large Language Models. In the field of natural language processing, BERT [11] and GPT-2 [33], as pioneering large pre-trained language models, marked a significant breakthrough in this technological direction. Their training on vast web text datasets demonstrated unprecedented language understanding and generation capabilities. Subsequently, the launch of GPT-3 [4] further accelerated the development of this field, with its large model parameters and extensive training datasets showcasing exceptional abilities in few-shot learning, significantly enhancing task adaptability and flexibility. Following this, the introduction of InstructGPT and ChatGPT [32] focused on optimizing the efficiency and naturalness of interactions between models and humans, where InstructGPT enhanced the capability to execute precise instructions, and ChatGPT improved the conversational experience, making these models more fluent in human-computer communication. Meanwhile, as large language model (LLM) technology continued to evolve, emerging models like LLaMA [39] and GLM [14] began to make their mark. To equip these models with the ability to respond to human instructions similar to ChatGPT, research teams finetune LLaMA and GLM using high-quality instruction datasets, thereby further enhancing their capability to follow instructions, with representative projects such as Alpaca [38], Vicuna [9], and ChatGLM [45].

Although these models have made significant progress in interacting with humans through language, we recognize that human understanding and processing of complex information relies not only on language but also critically on visual and other sensory inputs. This observation has driven us to further explore more comprehensive visual-language models in order to more accurately simulate complex interactions between humans and the real world.

Visual Language Models. In recent years, multimodal large language models (MLLMs) have garnered increasing attention. The core of MLLMs lies in their ability to achieve cross-modal understanding and generalization. Most current models, such as LLaVA [25], MiniGPT-4 [47], mPLUG-Owl [43], Qwen-VL [3], MiniGPT-v2 [5], NExT-GPT [41], InternLM-XComposer [46], CogVLM [40], and MM1 [29], utilize a standard training framework consisting of two primary phases: pre-training and instruction-tuning. In the pre-training phase, the model utilizes image caption data to establish a rich understanding of cross-modal semantic knowledge. This training phase enables the model to comprehend and capture the correlation between images and text, establishing a solid foundation for the subsequent stage. In the instruction-tuning phase, the model receives specific task instructions to optimize its performance on that task. Through this instruction-tuning phase, the model can further refine its understanding to execute specific tasks, enabling it to flexibly and accurately address various task requirements in practical applications.

Although current MLLMs have achieved promising results, they overlook two critical factors. First, the degree of alignment between different image-text pairs is inconsistent during the pre-training phase. Second, different tasks require different levels of alignment capabilities during the instruction-tuning phase. As a result, the modeling of alignment capabilities in these models remains inadequate. To address these limitations, we propose a new multimodal large language model, AlignGPT, to effectively enhance the alignment capabilities of MLLMs.

Figure 3. The distribution of CLIP similarity scores between images and texts in the pre-training dataset (histogram of sample counts over similarity scores from roughly 0.10 to 0.30).

3. Methodology

In this section, we initially present the fundamental structure of the visual-language model AlignGPT, followed by a demonstration of how to enhance the alignment capability of the model through our pre-training and instruction-tuning paradigms.

3.1. Architecture

AlignGPT consists of four components: a visual backbone, a linear projection layer, a large language model, and an alignment module. Fig. 4 provides an overview of the AlignGPT architecture and its training process. The following are the implementation details of these components:

Visual backbone. We utilize the pre-trained CLIP visual encoder ViT-L/14 [34] as our visual backbone. We train the model using an image resolution of 336×336.

Linear projection layer. We adopt a linear projection layer to map the representations of images from the vector space of the vision backbone to that of the language model.

Large language model. We choose the open-source model Vicuna [9] as our language model backbone, given its strong ability to follow instructions effectively in various language tasks.
Figure 4. Overview of the AlignGPT architecture and its training process (the large language model is kept frozen during pre-training; alignment levels range from weak to strong).
Alignment module. We propose to add alignment embeddings to the inputs of MLLMs to enrich their alignment capabilities. These alignment embeddings are positioned ahead of the image embeddings and text embeddings. In the subsequent sections, we will elaborate on the role of the alignment embeddings and the process to acquire them.
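To make the interplay of these four components concrete, below is a minimal sketch of how an AlignGPT-style input sequence could be assembled, with the alignment embedding prepended to the projected image features and the text embeddings. The class and parameter names (e.g., AlignGPTSketch, num_levels) are illustrative assumptions, and the frozen CLIP backbone and Vicuna are replaced by placeholders; this is not the released implementation.

```python
import torch
import torch.nn as nn

class AlignGPTSketch(nn.Module):
    """Illustrative composition of the four components described above."""

    def __init__(self, vision_dim=1024, hidden_dim=4096, num_levels=8):
        super().__init__()
        # Stand-ins for the frozen CLIP ViT-L/14 backbone and the LLM.
        self.visual_backbone = nn.Identity()        # -> (B, num_patches, vision_dim)
        self.projector = nn.Linear(vision_dim, hidden_dim)
        self.alignment_table = nn.Embedding(num_levels, hidden_dim)
        self.llm = nn.Identity()                    # Vicuna would go here.

    def forward(self, image_feats, text_embeds, align_level):
        # image_feats: (B, P, vision_dim); text_embeds: (B, T, hidden_dim)
        img_embeds = self.projector(self.visual_backbone(image_feats))
        # The alignment level acts as a special token placed in front of
        # the image and text embeddings.
        align_embed = self.alignment_table(align_level).unsqueeze(1)  # (B, 1, hidden_dim)
        inputs = torch.cat([align_embed, img_embeds, text_embeds], dim=1)
        return self.llm(inputs)

# Usage with dummy tensors (alignment levels indexed from 0 to num_levels-1):
model = AlignGPTSketch()
out = model(torch.randn(2, 576, 1024), torch.randn(2, 16, 4096), torch.tensor([7, 3]))
print(out.shape)  # torch.Size([2, 593, 4096])
```

Under this sketch, only the projection layer and the alignment table would be updated during pre-training, which mirrors the training recipe described later in Section 4.1.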
3.2. Alignment of Image and Text

In our methodology, we utilize the similarity scores generated by the CLIP [34] model to evaluate the degree of alignment between images and text. As shown in Fig. 2, we present four image-text pairs and their CLIP scores. From left to right, the degree of alignment of each image-text pair rises, i.e., the text description becomes more comprehensive. Correspondingly, the CLIP score of each pair increases. The rationale for adopting CLIP similarity scores lies in the fact that CLIP is pre-trained on a massive dataset of paired images and their corresponding textual descriptions, which enables it to effectively capture the relationship between visual and linguistic information. By employing contrastive learning techniques [8], CLIP minimizes the distance between representations of similar image-text pairs while maximizing the distance between those of different pairs. This training approach relies on 400 million data pairs, allowing the model to develop a nuanced understanding of image-text relationships.

Besides, we also demonstrate the CLIP similarity distribution of image-text pairs in the pre-training dataset in Fig. 3. The results indicate that the CLIP similarity distribution varies significantly, suggesting a substantial difference in the alignment between images and texts. By jointly observing Fig. 2 and Fig. 3, we find that pairs with lower scores correspond to texts that describe only partial regions of the image, indicating weaker alignment. In contrast, pairs with higher scores reflect texts that provide a more comprehensive description of the image, suggesting a stronger alignment between the text and the image [17, 36].

3.3. Alignment Level-aware Pre-training

As mentioned before, in the pre-training stage, the model usually assumes that all image-text pairs are uniformly aligned, and these pairs are used to train the model to comprehend the relations between images and their corresponding textual descriptions. However, in reality, the degree of alignment between these image-text pairs may vary considerably. Overlooking this difference could lead to a misunderstanding of the image-text alignment relations during the learning process.

Instead of treating all image-text pairs equally, we divide image-text pairs into different groups according to their degree of alignment and give each pair an extra group label. To achieve this, we leverage the similarity scores provided by CLIP: the higher the CLIP score, the stronger the alignment between image and text. Subsequently, we use these group labels as control signals to train the model, enabling it to understand the different alignment relations between different image-text pairs.

More precisely, we start by computing the CLIP similarity s for all training image-text pairs. Subsequently, we rank all image-text pairs based on their similarity scores.
Finally, we use a bucketing technique to divide them into N discrete alignment levels. The process can be expressed as:

$l = \mathrm{bucket}(s), \quad l \in \{1, 2, \ldots, N\}$,  (1)

where $\mathrm{bucket}(\cdot)$ denotes a bucketing function that assigns each pair into one of N equally spaced intervals and l is the alignment level (i.e., the group label) of an image-text pair. In this way, image-text pairs with lower CLIP similarity scores are assigned to buckets indicative of lower alignment levels, whereas those with higher CLIP similarity scores are grouped into buckets representing higher alignment levels.

Once the alignment level of each image-text pair is determined, we can regard it as a special token to express the alignment relation between the image and its textual description. This special token is placed ahead of the image and text tokens. During the pre-training phase, in addition to learning the mapping function in the linear projection layer, we also initialize this special token as an alignment embedding and continuously update its representation.
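As a concrete illustration of this grouping procedure and Eq. (1), the sketch below scores each image-text pair with an off-the-shelf CLIP model and assigns it to one of N equally spaced buckets. The checkpoint name and the choice to space the buckets over the observed score range are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's visual backbone is CLIP ViT-L/14 at 336px.
CKPT = "openai/clip-vit-large-patch14-336"
clip = CLIPModel.from_pretrained(CKPT).eval()
processor = CLIPProcessor.from_pretrained(CKPT)

@torch.no_grad()
def clip_similarity(images, captions):
    """Cosine similarity between each PIL image and its own caption."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)          # one score per image-text pair

def bucket(scores: torch.Tensor, num_levels: int = 8) -> torch.Tensor:
    """Eq. (1): map each score to one of N equally spaced intervals.
    Spacing the intervals over the observed score range is an assumption."""
    edges = torch.linspace(scores.min().item(), scores.max().item(),
                           num_levels + 1)
    # Compare against the interior edges only, giving {0, ..., N-1}; shift to {1, ..., N}.
    return torch.bucketize(scores, edges[1:-1]) + 1
```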
3.4. Adaptive Alignment-based Instruction-tuning

Currently, the instructions used for finetuning cover various tasks such as image captioning, visual question answering, and visual grounding. These tasks place different requirements on the alignment capabilities. For example, image captioning tasks mainly rely on global alignment between images and text, while VQA and visual grounding tasks require not only global alignment but also local alignment capabilities between images and text. To equip the model with adaptive alignment capability, we propose an adaptive alignment-based instruction-tuning paradigm, which dynamically combines the alignment embeddings to meet the alignment needs of each task.

To this end, we first clarify how to represent the global and local alignment capabilities between image-text pairs. As mentioned in Section 3.3, after the pre-training stage, we obtain N alignment embeddings $\{H_1, H_2, \ldots, H_N\}$ corresponding to the N discrete alignment levels $\{1, 2, \ldots, N\}$. Among them, $H_N$ represents the highest level of alignment, i.e., $H_N$ indicates that the text provides a very comprehensive description of an image. Here we regard it as the global alignment embedding. The embeddings below $H_N$ (i.e., $\{H_1, H_2, \ldots, H_{N-1}\}$) represent different degrees of alignment between the image and the text, which means the text only describes a part of the information of the image, from weak to strong. Thus, we regard them as local alignment embeddings of varying degrees.

Afterwards, we not only allocate global alignment capabilities to the instructions of each task, but also adaptively distribute varying degrees of local alignment capabilities based on the distinct alignment needs of each instruction. The reason behind this is that global alignment serves as the foundation for cross-modal understanding; only by mastering global alignment capabilities can a model truly focus on enhancing local alignment abilities. Specifically, in addition to the global alignment embedding, we assign different weights to the local alignment embeddings via a gate network. These weights are obtained based on the input instruction and image, as the input instruction greatly influences the visual regions the model should focus on. The gate network is implemented as follows:

$\alpha = \mathrm{softmax}(W(H_I \otimes H_T) + b)$,  (2)

where $H_I$ and $H_T$ denote the embeddings of the input instruction and the image, $W$ and $b$ are a weight matrix and bias, and $\alpha$ denotes the weights of the local alignment embeddings. Finally, we aggregate the global alignment embedding and the local alignment embeddings with varying weights to ensure a more precise fulfillment of the alignment requirements of each task's instructions:

$H_{\mathrm{align}} = H_N + \sum_{i=1}^{N-1} \alpha_i H_i$,  (3)

where $H_{\mathrm{align}}$ denotes the final alignment embedding for each instruction during the instruction-tuning stage.

In general, we can regard the alignment embeddings obtained in the pre-training phase as foundational components, each of which has a different alignment capability. During the instruction-tuning phase, we dynamically combine these components to meet the alignment needs of the instructions of different tasks.
4. Experiments

4.1. Experimental Settings

Datasets. For a fair comparison, we use the same pre-training and instruction datasets as LLaVA-1.5 [24]. They mainly include 558K caption pairs for modality alignment and 665K single- or multi-round conversations for instruction-tuning. Besides, we evaluate AlignGPT on a range of academic visual question answering (VQA) tasks and recent benchmarks designed specifically for MLLMs. This evaluation spans 12 benchmarks, including VQAv2 [16], GQA [18], VizWiz [18], SQA-I (ScienceQA-IMG) [28], TextVQA [37], POPE [23], MME [15], MMB (MMBench), MMB-CN (MMBench-Chinese) [26], SEED-I (SEED-Bench-IMG) [21], LLaVA-W (LLaVA-Bench-in-the-Wild) [25], and MM-Vet [44].

Implementation Details. We adopt a ViT [12] model pre-trained with CLIP [34] as the vision encoder to process visual inputs. On the language side, Vicuna [9] is utilized to handle multimodal features, ensuring a cohesive integration of text and visual data. In the pre-training phase, both the visual backbone and the large language model of AlignGPT remain frozen, with only the parameters of the linear projection layer and the alignment embeddings being trained. During the instruction-tuning phase, we freeze the alignment embeddings and the visual backbone, while adjusting the parameters of the linear projection layer, the large language model, and the gate network. The global batch sizes for the two phases are set to 256 and 128 respectively, with DeepSpeed [35] using the ZeRO-2 and ZeRO-3 strategies accordingly. Regarding our training methodology, we conduct a single epoch of optimization for all models using the AdamW [27] optimizer coupled with a cosine learning-rate schedule. Moreover, we initiate pre-training and instruction-tuning with learning rates of 1e-3 and 2e-5, respectively. The framework is trained on 8 A800 GPUs with 80GB memory.
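The two-stage recipe above can be summarized in a small configuration sketch; the values are taken from this section, while the dictionary layout and field names are illustrative rather than the authors' actual configuration format.

```python
# Hyperparameters of the two training stages as described in Section 4.1.
TRAINING_CONFIG = {
    "pretraining": {
        "trainable": ["linear_projection", "alignment_embeddings"],
        "frozen": ["visual_backbone", "llm"],
        "global_batch_size": 256,
        "deepspeed_stage": "ZeRO-2",
        "learning_rate": 1e-3,
        "epochs": 1,
    },
    "instruction_tuning": {
        "trainable": ["linear_projection", "llm", "gate_network"],
        "frozen": ["visual_backbone", "alignment_embeddings"],
        "global_batch_size": 128,
        "deepspeed_stage": "ZeRO-3",
        "learning_rate": 2e-5,
        "epochs": 1,
    },
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "hardware": "8x A800 80GB",
}
```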
Method LLM Resolution POPE MME MMB MMB-CN SEED-I LLaVA-W MM-Vet
BLIP-2 Vicuna-13B 224 85.3 1293.8 - - 46.4 38.1 22.4
InstructBLIP Vicuna-7B 224 - - 36.0 23.7 53.4 60.9 26.2
InstructBLIP Vicuna-13B 224 78.9 1212.8 - - - 58.2 25.6
Shikra Vicuna-13B 224 - - 58.8 - - - -
IDEFICS-9B LLaMA-7B 224 - - 48.2 25.2 - - -
IDEFICS-80B LLaMA-65B 224 - - 54.5 38.1 - - -
MiniGPT-v2 LLaMA2-7B 448 85.1♢ 1332.1♢ 43.1♢ 29.1♢ 52.3♢ - -
Qwen-VL Qwen-7B 448 - - 38.2 7.4 56.3 - -
Qwen-VL-Chat Qwen-7B 448 - 1487.5 60.6 56.7 58.2 - -
LLaVA-1.5 Vicuna-7B 336 85.9 1510.7 64.3 58.3 66.2 63.4 30.5
LLaVA-1.5 Vicuna-13B 336 85.9 1531.3 67.7 63.6 68.2 70.7 35.4
AlignGPT Vicuna-7B 336 86.0 1527.4 67.3 59.9 66.5 68.4 30.8
AlignGPT Vicuna-13B 336 86.2 1572.0 69.5 63.7 67.8 75.2 35.6
Table 1. Results on multimodal instruction-following benchmarks. For the baseline methods, the results on the SEED-I dataset are obtained from [7], and the other results are retrieved from [24].
Method LLM Resolution Pre-train Samples Finetune Samples VQAv2 GQA VizWiz SQA-I TextVQA
BLIP-2 Vicuna-13B 224 129M - 41.0 41.0 19.6 61.0 42.5
InstructBLIP Vicuna-7B 224 129M 1.2M - 49.2 34.5 60.5 50.1
InstructBLIP Vicuna-13B 224 129M 1.2M - 49.5 33.4 63.1 50.7
Shikra Vicuna-13B 224 600K 5.5M 77.4 - - - -
IDEFICS-9B LLaMA-7B 224 353M 1M 50.9 38.4 35.5 - 25.9
IDEFICS-80B LLaMA-65B 224 353M 1M 60.0 45.2 36.0 - 30.9
MiniGPT-v2 LLaMA2-7B 448 - - 74.6♢ 60.3 32.9 60.9♢ 28.0♢
Qwen-VL Qwen-7B 448 1.4B 50M 78.8 59.3 35.2 67.1 63.8
Qwen-VL-Chat Qwen-7B 448 1.4B 50M 78.2 57.5 38.9 68.2 61.5
LLaVA-1.5 Vicuna-7B 336 558K 665K 78.5 62.0 50.0 66.8 58.2
LLaVA-1.5 Vicuna-13B 336 558K 665K 80.0 63.3 53.6 71.6 61.3
AlignGPT Vicuna-7B 336 558K 665K 79.1 62.9 54.2 68.5 58.4
AlignGPT Vicuna-13B 336 558K 665K 80.0 63.6 56.4 70.3 60.2
Table 2. Performance comparison on multiple academic benchmarks. For the baselines, the results marked with ♢ are obtained by running the code released by the authors, and the other results are retrieved from [5, 24]. Best results are in bold.
4.2. Compared Methods

We chose a diverse set of representative MLLMs as our baselines, including BLIP-2 [22], InstructBLIP [10], Shikra [6], IDEFICS [20], MiniGPT-v2 [5], Qwen-VL [3], Qwen-VL-Chat [3], and LLaVA-1.5 [24].

5. Results and Analysis

5.1. Main Results

MLLM-oriented Multi-modal Benchmarks. We apply AlignGPT to seven recent popular multimodal benchmarks, as shown in Tab. 1. We discover that, apart from LLaVA-1.5-13B, AlignGPT-7B surpasses all previous multimodal models. This shows that our model has strong generalization ability. Additionally, compared to LLaVA-1.5-13B, AlignGPT-13B shows improvements on most datasets, particularly achieving good advancements on the MME, MMB, and LLaVA-W benchmarks. This further validates the efficacy of both global and local alignment capabilities.
Table 3. Performance of AlignGPT with different numbers of alignment levels N.
Method Alignment Level VQAv2 GQA VizWiz SQA-I TextVQA POPE MME MMB SEED-I
AlignGPT Number=4 79.0 62.9 52.3 68.7 58.3 86.2 1463.8 67.2 66.5
AlignGPT Number=6 79.0 62.7 51.2 68.9 58.3 85.8 1436.3 67.3 66.2
AlignGPT Number=8 79.1 62.9 54.2 68.5 58.4 86.0 1527.4 67.3 66.5
AlignGPT Number=10 79.1 62.6 53.0 67.8 58.4 86.2 1481.4 66.4 66.7
Table 4. Ablation on average, local, and global alignment capabilities (settings a-d).
Settings Average Local Global VQAv2 GQA VizWiz SQA-I TextVQA POPE MME MMB SEED-I
(a) ✘ ✔ ✘ 79.1 62.7 53.3 67.9 58.6 85.9 1467.1 66.9 66.3
(b) ✘ ✘ ✔ 79.1 62.9 52.6 68.3 58.4 85.9 1502.9 66.3 66.2
(c) ✔ ✘ ✔ 79.0 62.8 52.5 68.6 58.4 85.6 1492.5 67.0 66.0
(d) ✘ ✔ ✔ 79.1 62.9 54.2 68.5 58.4 86.0 1527.4 67.3 66.5
Visual Question Answering. We evaluate AlignGPT on five popular academic benchmarks, as detailed in Tab. 2. Despite using less training data, AlignGPT-7B demonstrates competitive performance, surpassing other generalist models, including InstructBLIP-13B, Shikra-13B, and IDEFICS-80B, on most datasets, except for LLaVA-1.5-13B. These results verify the rationality of the structural design of our model. Moreover, considering that AlignGPT utilizes the same training dataset as LLaVA-1.5, it is evident that AlignGPT-7B outperforms LLaVA-1.5-7B across all evaluation datasets, and AlignGPT-13B also surpasses LLaVA-1.5-13B on the majority of datasets. This demonstrates that our approach effectively enhances the alignment capabilities of multimodal large language models. The fly in the ointment is that AlignGPT-13B does not perform as well as Qwen-VL on the TextVQA dataset. This may stem from the fact that TextVQA is a text-centric QA task, as it requires identifying text in images to answer questions. AlignGPT is tailored to boost multimodal alignment and might not exhibit strong results in text-focused scenarios.

5.2. Ablation Study

Without loss of generality, we choose AlignGPT-7B for the ablation study to analyze the impact of various components.

Impact of Alignment Embedding Table. We design an experiment to validate the role of the alignment embedding table. In the fine-tuning phase, we use a randomly initialized alignment embedding table instead of the embedding table obtained during the pre-training phase. Since both methods have the same number of parameters, we can clearly assess whether the parameter number of the alignment embedding table influences the improvement in model performance. The experimental results are shown in Fig. 5. We find that the model using the randomly initialized alignment embedding table (referred to as AlignGPT (random)) shows a performance gap compared to AlignGPT. These results further confirm that the alignment information learned during pre-training is indeed the key factor in enhancing model performance, rather than the parameter number.

Figure 5. The performance comparison of AlignGPT (random) and AlignGPT in downstream datasets (radar chart).

Impact of Number of Alignment Levels. To investigate the effect of the number of alignment levels N on AlignGPT, we vary the value of N in the range of [4, 10] with a step size of 2. Tab. 3 shows the performance of AlignGPT with different N on nine datasets. AlignGPT already achieves good results at N = 4, and its performance remains stable as the number of alignment levels increases: it shows an initial upward trend and then flattens out. These observations indicate that AlignGPT can improve the alignment capabilities of multi-modal large language models with only a small number of alignment levels. Finally, according to this trend, we set N to 8.
Table 5. Impact of input image resolution on AlignGPT.
Method LLM Resolution VQAv2 GQA SQA-I TextVQA POPE MMB SEED-I
AlignGPT Vicuna-7B 336 79.1 62.9 68.5 58.4 86.0 67.3 66.5
AlignGPT Vicuna-7B 672 79.7 63.3 68.3 60.3 86.8 67.2 66.5
AlignGPT Vicuna-7B 1008 79.8 63.4 68.2 60.3 86.8 67.2 66.6
Table 6. Impact of different large language models on AlignGPT.
Method LLM Resolution VQAv2 GQA SQA-I MME MMB MMB-CN SEED-I
AlignGPT LLaMA2-7B-Chat 336 79.1 62.9 65.9 1500.8 66.6 57.9 66.4
AlignGPT Vicuna-v1.5-7B 336 79.1 62.9 68.5 1527.4 67.3 59.9 66.5
AlignGPT LLaMA3-8B-Base 336 79.6 63.1 70.4 1539.7 72.0 67.7 68.2
Impact of Local and Global Alignment. During the instruction-tuning phase, we assign global and local alignment capabilities to the instructions of each task. Among them, "Local" refers to the local alignment capabilities derived by assigning different weights to the various local alignment embeddings using a gate network, "Global" denotes the global alignment capabilities, and "Average" represents the local alignment capabilities obtained by assigning equal weights to each local alignment embedding. The performance of these four strategies (settings a-d) is presented in Tab. 4. As we can see, setting (a) and setting (b) demonstrate divergent performances in downstream tasks, which can be attributed to the different demands these tasks place on global and local alignment capabilities. It is worth noting that setting (a) and setting (b) perform worse than our final approach (setting d) on most datasets, which verifies the necessity of combining global and local alignment capabilities. Moreover, the performance of setting (c) is inferior to that of setting (d), which may be due to the dynamically changing demands for local alignment capabilities across different downstream tasks.

5.3. Discussion

Impact of different input image resolutions. Image resolution plays a crucial role in vision-language tasks, as higher resolutions help reduce image blurring and enhance the understanding of image-text alignment. To evaluate the impact of resolution changes on the performance of multimodal tasks, we increase the image resolution from 336 to 1008, with the resulting performance changes detailed in Tab. 5. The results show that higher image resolutions can improve model performance on most multimodal tasks. For example, the score for VQAv2 increases from 79.1 to 79.8, while the score for TextVQA rises from 58.4 to 60.3. Meanwhile, the POPE score improves by 0.8. These results highlight that appropriately increasing image resolution is an effective strategy for enhancing performance in studies of multimodal large language models.

Impact of different large language models. We also explore the impact of the large language model on the performance of AlignGPT, specifically testing three models: LLaMA-2-7B-Chat, Vicuna-v1.5-7B, and the latest LLaMA-3-8B-Base. The results are shown in Tab. 6. Initially, we observe that LLaMA-3-8B-Base achieves the best performance, followed by Vicuna-v1.5-7B, with LLaMA-2-7B-Chat performing the worst, which is reasonable given LLaMA-3-8B-Base's larger parameter size and richer training data. Besides, we observe that Vicuna-v1.5-7B achieves superior performance over LLaMA-2-7B-Chat on multimodal benchmarks such as MME, MMB, and SEED-I, while showing comparable results on VQA tasks. This advantage might be due to Vicuna-v1.5-7B undergoing supervised instruction-tuning with ShareGPT data, which contains background knowledge relevant to downstream tasks.

5.4. Qualitative Results

Figure 6 presents a comparative analysis of our model with MiniGPT-v2 [5] and LLaVA-1.5 [24]. When a user submits an image alongside the instruction "Can you see the bright blue feathers on the bird's wing?", MiniGPT-v2 and LLaVA-1.5 both return an incorrect answer "Yes". In contrast, our model produces the accurate result "No", thereby demonstrating that AlignGPT can effectively enhance the model's alignment capability. In Figure 7, we further demonstrate the responses of AlignGPT under different levels of alignment capability. We find that with lower alignment levels, the model may only focus on certain regions of the image, resulting in an undercount of the total number of apples; whereas with higher alignment levels, the model considers the entire image area, thus achieving an accurate estimate of the number of apples. This finding once again underscores the necessity of enhancing the alignment capability of MLLMs.
Figure 6. Qualitative comparison on the instruction "Can you see the bright blue feathers on the bird's wing?": MiniGPT-v2 answers "Yes, I can see the bright blue feathers on the bird's wing." and LLaVA-1.5 answers "Yes, the bird has bright blue feathers on its wing, which adds to its striking appearance as it flies through the air.", whereas AlignGPT answers "No, the bird in the image does not have bright blue feathers on its wing. It is a large bird of prey, possibly a hawk, with brown and white feathers."

Figure 7. Responses of AlignGPT to "Count the number of apples in the image" under different alignment levels: a lower alignment level yields "There are three apples in the image", while higher alignment levels yield "There are four apples in the image".
6. Conclusion

In this paper, we propose AlignGPT, a novel multimodal large language model designed to bolster the alignment capabilities of MLLMs. Our approach involves utilizing the alignment level of data as a control signal during pre-training to effectively handle the varying degrees of alignment in image-text pairs. Subsequently, in the instruction-tuning phase, we begin by exploiting these control signals to shape different levels of alignment capabilities. Continuing from this, we go beyond assigning global alignment capabilities to the instructions of each task; we also dynamically configure distinct local alignment capabilities based on the specific demands of each instruction. Results from numerous experiments indicate that our AlignGPT achieves better performance than other state-of-the-art MLLMs.

Limitations

The current study has two limitations: (1) This paper involves two modalities, i.e., text and image, while achieving AGI should also encompass video and audio, which requires further research and exploration; (2) We propose a new perspective to enhance the alignment capability of MLLMs. However, there may be other methods to achieve this goal, which merit consideration in the future.

References

[1] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV 2015, pages 2425-2433, 2015.
[3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966, 2023.
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. In NeurIPS 2020, 2020.
[5] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. CoRR, abs/2310.09478, 2023.
[6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. CoRR, abs/2306.15195, 2023.
[7] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. CoRR, abs/2311.12793, 2023.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML 2020, pages 1597-1607, 2020.
[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
[10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS 2023, 2023.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171-4186, 2019.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR 2021, 2021.
[13] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, et al. PaLM-E: An embodied multimodal language model. In ICML 2023, pages 8469-8488, 2023.
[14] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In ACL 2022, pages 320-335, 2022.
[15] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. CoRR, abs/2306.13394, 2023.
[16] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR 2017, pages 6325-6334, 2017.
[17] Chanda Grover, Indra Deep Mastan, and Debayan Gupta. ContextCLIP: Contextual alignment of image-text pairs on CLIP visual representations. In ICVGIP 2022, pages 51:1-51:10, 2022.
[18] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In CVPR 2018, pages 3608-3617, 2018.
[19] Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. BLIVA: A simple multimodal LLM for better handling of text-rich visual questions. In AAAI 2024, pages 2256-2264, 2024.
[20] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. In NeurIPS 2023, 2023.
[21] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. CoRR, abs/2307.16125, 2023.
[22] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML 2023, pages 19730-19742, 2023.
[23] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP 2023, pages 292-305, 2023.
[24] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. CoRR, abs/2310.03744, 2023.
[25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS 2023, 2023.
[26] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? CoRR, abs/2307.06281, 2023.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR 2019, 2019.
[28] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS 2022, 2022.
[29] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. MM1: Methods, analysis & insights from multimodal LLM pre-training. CoRR, abs/2403.09611, 2024.
[30] Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, et al. AnyMAL: An efficient and scalable any-modality augmented language model. CoRR, abs/2309.16058, 2023.
[31] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
[32] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al. Training language models to follow instructions with human feedback. In NeurIPS 2022, 2022.
[33] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. Learning transferable visual models from natural language supervision. In ICML 2021, pages 8748-8763, 2021.
[35] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC 2020, 2020.
[36] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. CoRR, abs/2111.02114, 2021.
[37] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR 2019, pages 8317-8326, 2019.
[38] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
[40] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, et al. CogVLM: Visual expert for pretrained language models. CoRR, abs/2311.03079, 2023.
[41] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM. CoRR, abs/2309.05519, 2023.
[42] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015, pages 2048-2057, 2015.
[43] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178, 2023.
[44] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. CoRR, abs/2308.02490, 2023.
[45] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, et al. GLM-130B: An open bilingual pre-trained model. In ICLR 2023, 2023.
[46] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, et al. InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. CoRR, abs/2309.15112, 2023.
[47] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. CoRR, abs/2304.10592, 2023.