CogAgent: A Visual Language Model for GUI Agents
Wenyi Hong1 * Weihan Wang1 * Qingsong Lv2 Jiazheng Xu1 * Wenmeng Yu2
Junhui Ji2 Yan Wang2 Zihan Wang1 * Yuxiao Dong1 Ming Ding2† Jie Tang1†
1 Tsinghua University    2 Zhipu AI
{hwy22@mails, jietang@}.tsinghua.edu.cn, [email protected]
Figure 1. Samples of visual agents generated by CogAgent. More samples are demonstrated in the Appendix.
… images share a different distribution from natural images. We thus construct a large-scale annotated dataset about GUIs and OCR for continual pre-training.

• High-Resolution vs. Compute. In GUIs, tiny icons and text are ubiquitous, and they are hard to recognize at the commonly used 224 × 224 resolution. However, increasing the resolution of input images results in significantly longer sequence lengths in language models. For example, a 1120 × 1120 image corresponds to a sequence of 6400 tokens if the patch size is 14, demanding excessive training and inference compute. To address this, we design a cross-attention branch that allows for a trade-off between the resolution and the hidden size within a proper computation budget. Specifically, we propose to combine the original large ViT [12] (4.4B parameters) used in CogVLM [38] and a new small high-resolution cross-module (with an image encoder of 0.30B parameters) to jointly model visual features.

Our experiments show that:

• CogAgent tops popular GUI understanding and decision-making benchmarks, including AITW [31] and Mind2Web [10]. To the best of our knowledge, this is the first time that a generalist VLM outperforms LLM-based methods that rely on extracted structured text.

• Though CogAgent focuses on GUIs, it achieves state-of-the-art generalist performance on nine visual question-answering benchmarks including VQAv2 [1], OK-VQA [23], TextVQA [34], ST-VQA [4], ChartQA [24], infoVQA [26], DocVQA [25], MM-Vet [41], and POPE [19].

• The separated design of high- and low-resolution branches in CogAgent significantly lowers the compute cost of consuming high-resolution images; e.g., the number of floating-point operations (FLOPs) for CogAgent-18B with 1120 × 1120 inputs is less than half that of CogVLM-17B with its default 490 × 490 inputs.

CogAgent is open-sourced at https://github.com/THUDM/CogVLM. It represents an effort to promote the future research and application of AI agents, facilitated by advanced VLMs.

2. Method

In this section, we first introduce the architecture of CogAgent, especially the novel high-resolution cross-module, and then illustrate the process of pre-training and alignment in detail.

2.1. Architecture

The architecture of CogAgent is depicted in Fig. 2. We build our model on a pre-trained VLM (the right side of the figure) and propose to add a cross-attention module to process high-resolution input (the left side of the figure). As our base VLM, we select CogVLM-17B [38], an open-sourced, state-of-the-art large vision-language model. Specifically, we employ EVA2-CLIP-E [35] as the encoder for low-resolution images (224 × 224 pixels), complemented by an MLP adapter that maps its output into the feature space of the visual-language decoder. The decoder, a pre-trained language model, is enhanced with the visual expert module introduced by Wang et al. [38] to facilitate a deep fusion of visual and language features. The decoder processes a combined input of the low-resolution image feature sequence and the text feature sequence, and autoregressively outputs the target text.

[Figure 2. Model architecture of CogAgent. A lightweight high-resolution image encoder feeds cross-attention modules (hidden size 1,024) at every layer of the visual language decoder (hidden size 4,096) of the original VLM; the decoder input is the concatenation of the low-resolution image feature (via the MLP adapter) and the text feature (via word embedding), and the output is the target text.]

Similar to most VLMs, the original CogVLM can only accommodate images of relatively low resolution (224 or 490), which hardly meets the demands of GUIs, where the screen resolution of computers or smartphones is typically 720p (1280 × 720 pixels) or higher. This is a common problem among VLMs; e.g., LLaVA [21] and PALI-X [8] are pre-trained at a low resolution of 224 × 224 on the general domain. The primary reason is that high-resolution images bring prohibitive time and memory overhead: VLMs usually concatenate the text and image feature sequences as input to the decoder, so the cost of the self-attention module is quadratic in the number of visual tokens (patches), which in turn is quadratic in the image's side length. There have been some initial attempts to reduce costs for high-resolution images. For instance, Qwen-VL [2] proposes a position-aware vision-language adapter to compress image features, but it only reduces the sequence length by a factor of four and supports a maximum resolution of 448 × 448. Kosmos-2.5 [22] adopts a Perceiver Resampler module to reduce the length of the image sequence. However, the resampled sequence is still long for self-attention in the large visual-language decoder (2,048 tokens), and the approach can only be applied to restricted text recognition tasks.

Therefore, we propose a novel high-resolution cross-module as a potent complement to the existing structure for enhancing understanding at high resolutions, which not only maintains efficiency when confronting high-resolution images, but also offers flexible adaptability to a variety of visual-language model architectures.

2.2. High-Resolution Cross-Module

The structural design of the high-resolution cross-module is mainly based on the following observations:
1. At a modest resolution such as 224 × 224, images can depict most objects and layouts effectively, yet such a resolution falls short in rendering text with clarity. Hence, our new high-resolution module should emphasize text-related features, which are vital for understanding GUIs.

2. While pre-trained VLMs in the general domain often need large hidden sizes (e.g., 4,096 in PALI-X and CogVLM, 5,120 in LLaVA), VLMs tailored for text-centered tasks like document OCR require smaller hidden sizes to achieve satisfying performance (e.g., 1,536 in Kosmos-2.5 and Pix2Struct [16]). This suggests that text-related features can be effectively captured using smaller hidden sizes.

As shown in Fig. 2, the high-resolution cross-module acts as a new branch for higher-resolution input, which accepts images of size 1120 × 1120 pixels in our implementation. Different from the original low-resolution input branch, the high-resolution cross-module adopts a much smaller pre-trained vision encoder (the visual encoder of EVA2-CLIP-L [35] in our implementation, 0.30B parameters) and uses cross-attention with a small hidden size to fuse high-resolution image features into every layer of the VLM decoder, thus reducing the computational cost. To be concrete, an input image is resized to 1120 × 1120 and 224 × 224 and fed into the high-resolution cross-module and the low-resolution branch respectively, then encoded into the image feature sequences X_hi and X_lo by the two distinct-sized image encoders in parallel. The visual language decoder retains its original computations, while the only change is to integrate a cross-attention between X_hi and the hidden states in every decoder layer.

Formally, suppose the input hidden states of the i-th attention layer in the decoder are X_in^i ∈ R^{B×(L_Ilo+L_T)×D_dec}, and the output hidden states of the cross-module's image encoder are X_hi ∈ R^{B×L_Ihi×D_hi}, where B is the batch size, L_Ilo, L_Ihi and L_T are the lengths of the low-resolution image, high-resolution image and text sequences, and D_dec and D_hi are the hidden sizes of the decoder and of the high-resolution encoder's output, respectively. Each layer's attention procedure can be formulated as

    X'_i = MSA(layernorm(X_in^i)) + X_in^i,            (1)
    X_out^i = MCA(layernorm(X'_i), X_hi) + X'_i,       (2)

where MSA and MCA denote the multi-head self-attention (with visual expert) and the multi-head cross-attention over the high-resolution features, respectively. Each decoder layer projects X_hi to keys and values of a small hidden size D_cross and uses W_Qcross^i ∈ R^{D_dec×D_cross} to get Q_cross^i = X'_i W_Qcross^i ∈ R^{(L_Ilo+L_T)×D_cross} in every decoder layer. With the residual connection in Eq. 2, the cross-attention with high-resolution images can be perceived as a complement to the features of low-resolution images, thereby effectively utilizing the model previously pre-trained at low resolution.
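To make the fusion concrete, here is a minimal PyTorch-style sketch (ours, not the released implementation) of one decoder layer augmented with the high-resolution cross-attention of Eq. (1)-(2). The module names and the placement of the query projection are assumptions, and the visual expert inside the self-attention as well as the feed-forward sub-layer are omitted for brevity.

    import torch
    import torch.nn as nn

    class HiResCrossAttention(nn.Module):
        """Cross-attention from decoder hidden states (queries) to
        high-resolution image features (keys/values), cf. MCA in Eq. (2)."""
        def __init__(self, d_dec=4096, d_hi=1024, d_cross=1024, n_heads=32):
            super().__init__()
            self.q_proj = nn.Linear(d_dec, d_cross, bias=False)    # map decoder width to the small D_cross
            self.attn = nn.MultiheadAttention(embed_dim=d_cross, num_heads=n_heads,
                                              kdim=d_hi, vdim=d_hi, batch_first=True)
            self.out_proj = nn.Linear(d_cross, d_dec, bias=False)  # back to decoder width for the residual

        def forward(self, x_dec, x_hi):
            # x_dec: (B, L_lo + L_T, d_dec); x_hi: (B, L_hi, d_hi)
            out, _ = self.attn(self.q_proj(x_dec), x_hi, x_hi, need_weights=False)
            return self.out_proj(out)

    class DecoderLayerWithHiRes(nn.Module):
        """One decoder layer: the original self-attention block (Eq. (1)),
        then the added high-resolution cross-attention with a residual (Eq. (2)).
        The visual expert and the feed-forward sub-layer are omitted here."""
        def __init__(self, d_dec=4096, d_hi=1024, d_cross=1024, n_heads=32):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_dec)
            self.self_attn = nn.MultiheadAttention(d_dec, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_dec)
            self.cross_attn = HiResCrossAttention(d_dec, d_hi, d_cross, n_heads)

        def forward(self, x, x_hi, attn_mask=None):
            h = self.ln1(x)
            x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]  # Eq. (1)
            x = x + self.cross_attn(self.ln2(x), x_hi)                                   # Eq. (2)
            return x

    # e.g.: x = torch.randn(1, 256 + 32, 4096); x_hi = torch.randn(1, 6400, 1024)
    # y = DecoderLayerWithHiRes()(x, x_hi)   # y.shape == x.shape

Because the cross-attention operates at the small hidden size d_cross, its cost grows only linearly with the 6400 high-resolution tokens instead of quadratically, which is the point of the design.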
Computational complexity. Let the number of attention heads be H_cross and H_dec in the cross-attention and the self-attention, and the dimension of each head be d_cross = D_cross / H_cross and d_dec = D_dec / H_dec. When using our high-resolution cross-module, the computational complexity of attention is

    T_improved = O((L_Ilo + L_T) · L_Ihi · H_cross · d_cross + (L_Ilo + L_T)^2 · H_dec · d_dec).    (3)

Note that d_cross and H_cross can be flexibly adjusted according to the computational budget and model performance. If the high-resolution cross-module is not used and low-resolution images are directly substituted with high-resolution ones, the computational complexity would be

    T_original = O((L_Ihi + L_T)^2 · H_dec · d_dec).    (4)

In our implementation, d_cross = 32, H_cross = 32, and we inherit d_dec = 128, H_dec = 32 from CogVLM-17B. Both the high- and low-resolution encoders patchify images with 14 × 14-pixel patches, thus L_Ihi = 6400 and L_Ilo = 256. Our method leads to at least a (L_Ihi + L_T) / (L_Ilo + L_T) = (6400 + L_T) / (256 + L_T) × acceleration, which is a stringent lower bound (refer to the Appendix for the detailed derivation), and reduces memory overhead at the same time.
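As a quick numerical illustration (ours, not from the paper), the sketch below plugs these implementation constants into Eq. (3) and Eq. (4) and evaluates both the resulting attention-cost ratio and the lower bound above, for an assumed text length of L_T = 256 tokens.

    def attention_cost_ratio(L_T, L_lo=256, L_hi=6400,
                             H_dec=32, d_dec=128, H_cross=32, d_cross=32):
        """Compare the attention cost with the cross-module, Eq. (3), against
        feeding the high-resolution sequence straight into the decoder, Eq. (4).
        Constant factors hidden by the O(.) notation are dropped."""
        t_improved = (L_lo + L_T) * L_hi * H_cross * d_cross \
                     + (L_lo + L_T) ** 2 * H_dec * d_dec
        t_original = (L_hi + L_T) ** 2 * H_dec * d_dec
        lower_bound = (L_hi + L_T) / (L_lo + L_T)  # the paper's stringent lower bound
        return t_original / t_improved, lower_bound

    # With L_T = 256: attention-cost ratio of roughly 41x, guaranteed lower bound = 13.0x.
    print(attention_cost_ratio(L_T=256))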
2.3. Pre-training

To enhance the model's ability to comprehend high-resolution images and to adapt it to GUI application scenarios, we focus our pre-training efforts on the following aspects: the capability to recognize texts of various sizes, orientations, and fonts in high-resolution images, the ability to ground text and objects in the image, and a specialized understanding capability for GUI imagery such as web pages. We divide our pre-training data into three parts based on the aforementioned aspects, with samples in the Appendix. All the pre-training data are derived from publicly available datasets. The construction methods are detailed below.

Text recognition. Our data includes (1) synthetic renderings with text from the language pre-training dataset (80M). This is similar to the Synthetic Document Generator in Kim et al. [15], with text of varying font, size, color and orientation, and diverse image backgrounds from LAION-2B [32]. (2) Optical character recognition (OCR) on natural images (18M). We collect natural images from COYO [6] and LAION-2B [32], employ Paddle-OCR [13] to extract the texts and their bounding boxes, and filter out images with no text boxes. (3) Academic documents (9M). We follow Nougat [5] to construct image-text pairs including text, formulas and tables from the LaTeX source released on arXiv. For (1) and (3), we apply the same data augmentation as Nougat. For (2), we additionally employ more aggressive rotation and flipping augmentation.

Visual grounding. It is imperative for GUI agents to possess the capability to accurately comprehend and locate diverse elements within images. We follow CogVLM [38] and use a constructed visual grounding dataset of 40M images with image-caption pairs sampled from LAION-115M [18], which associate entities in the caption with bounding boxes indicating their positions. The format of a bounding box is [[x0, y0, x1, y1]], where (x0, y0) and (x1, y1) represent the coordinates of the upper-left and lower-right corners, normalized to [000, 999]. If multiple objects are indicated by a single noun phrase, their boxes are separated by semicolons within the double square brackets.

GUI imagery. Our approach addresses the scarcity and limited relevance of GUI images in datasets like LAION and COYO, which predominantly feature natural images. GUI images, with their distinct elements such as input fields, hyperlinks, icons, and unique layout characteristics, require specialized handling. To boost the model's capability in interpreting GUI imagery, we conceptualize two GUI grounding tasks: (1) GUI Referring Expression Generation (REG), where the model is tasked with generating HTML code for the DOM (Document Object Model) element corresponding to a specified area in a screenshot, and (2) GUI Referring Expression Comprehension (REC), which involves producing bounding boxes for given DOM elements.

To facilitate robust training in GUI grounding, we construct the CCS400K (Common Crawl Screenshot 400K) dataset. This dataset is formed by extracting URLs from the latest Common Crawl data, followed by capturing 400,000 web page screenshots. Alongside these screenshots, we compile all visible DOM elements and their corresponding rendered boxes using Playwright¹, supplementing the dataset with 140 million REC and REG question-answer pairs. This rich dataset supports comprehensive training on the understanding of GUI elements. To mitigate the risk of overfitting, we employ a diverse range of screen resolutions for rendering, selected randomly from a list of commonly used resolutions across various devices. Additionally, to prevent the HTML code from becoming overly extensive and unwieldy, we perform necessary data cleaning by omitting redundant attributes in the DOM elements, following the method outlined in [16].

¹ https://playwright.dev

We also incorporate publicly available text-image datasets, including LAION-2B and COYO-700M (after removing broken URLs, NSFW images, and images with noisy captions and political bias), during pre-training.

We pre-train our CogAgent model for a total of 60,000 iterations with a batch size of 4,608 and a learning rate of 2e-5. We freeze all parameters except the newly added high-resolution cross-module for the first 20,000 steps, resulting in a total of 646M (3.5%) trainable parameters, and additionally unfreeze the visual expert in CogVLM for the next 40,000 steps. We warm up with curriculum learning, first training on easier text recognition (synthetic renderings and OCR on natural images) and image captioning, then sequentially incorporating harder text recognition (academic documents), grounding data and web page data, as we observed that this leads to faster convergence and more stable training in our preliminary experiments.

2.4. Multi-task Fine-tuning and Alignment

To enhance our model's performance on diverse tasks and ensure it aligns with free-form human instructions in the GUI setting, we further fine-tune our model on a broad range of tasks. We manually collected over two thousand screenshots from mobile phones and computers, each annotated by human annotators with screen elements, potential tasks, and methods of operation in question-answering format (details illustrated in the Appendix). We also utilize Mind2Web [10] and AITW [31], datasets focusing on web and Android behaviors that comprise tasks, sequences of actions and corresponding screenshots, and convert them into a natural-language question-and-answer format using GPT-4. Besides, we incorporate multiple publicly available visual question-answering (VQA) datasets covering a variety of tasks into our alignment dataset. We unfreeze all model parameters during this stage and train for 10k iterations with a batch size of 1,024 and a learning rate of 2e-5.
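The pre-training schedule of Sec. 2.3 and the alignment stage above differ mainly in which parameters are trainable. The following is a minimal PyTorch-style sketch (ours, not the released code) of such a staged freezing schedule; the substrings used to identify the cross-module and the visual expert ("hires_cross", "visual_expert") are assumed names for illustration.

    def set_trainable(model, stage):
        """Stage 1 (first 20k pre-training steps): only the high-resolution
        cross-module is trainable. Stage 2 (next 40k steps): additionally
        unfreeze the visual expert. Stage 3 (multi-task fine-tuning and
        alignment): all parameters are trainable."""
        for name, p in model.named_parameters():
            if stage == 3:
                p.requires_grad = True
            else:
                trainable = "hires_cross" in name                       # assumed module name
                if stage == 2:
                    trainable = trainable or "visual_expert" in name    # assumed module name
                p.requires_grad = trainable

    # Rebuild the optimizer over the currently trainable parameters when the stage changes, e.g.:
    # optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)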
Method                         VQAv2   OKVQA   OCRVQA  TextVQA  STVQA   ChartQA  InfoVQA  DocVQA
Task-specific fine-tuning models
Pix2Struct [16]                  -       -       -       -        -      58.6     40.0     76.6
BLIP-2 [18]                     82.2    59.3    72.7     -        -       -        -        -
PALI-X-55B [8]                  86.0    66.1    75.0    71.4     79.9    70.9     49.2     80.0
CogVLM (task-specific) [38]     84.7    64.7    74.5    69.7      -       -        -        -
Generalist models
UReader [40]                     -      57.6     -       -        -      59.3     42.2     65.4
Qwen-VL [2]                     79.5    58.6    75.7    63.8      -      65.7      -       65.1
Qwen-VL-chat [2]                78.2    56.6    70.5    61.5      -      66.3      -       62.6
LLaVA-1.5 [20]                  80.0     -       -      61.5      -       -        -        -
Fuyu-8B [3]                     74.2    60.6     -       -        -       -        -        -
CogVLM (generalist) [38]        83.4    58.9    74.1    68.1      -       -        -        -
CogAgent (Ours)                 83.7    61.2    75.0    76.1     80.5    68.4     44.5     81.6

Table 1. Performance on visual question answering benchmarks (General VQA: VQAv2, OKVQA; Text-rich VQA: the remaining benchmarks). Bold text indicates the best score among the generalist category, and underlined text represents the best score across both generalist and task-specific categories.
Method                      cross-task   cross-website   cross-domain   overall
Representations of screen inputs: HTML
GPT-3.5 [29] (few-shot)        18.6          17.4            16.2         17.4
GPT-4 [30]† (few-shot)         36.2          30.1            26.4         30.9
Flan-T5-XL [10]                52.0          38.9            39.6         43.5
LLaMA2-7B [37]                 52.7          47.1            50.3         50.1
LLaMA2-70B [37]                55.8          51.6            55.7         54.4
Representations of screen inputs: image
Qwen-VL [2]                    12.6          10.1             8.0         10.2
CogVLM [38]                    37.1          23.4            26.3         23.9
CogAgent (Ours)                62.3          54.0            59.4         58.2

Table 3. Performance on Mind2Web. † denotes element selection from top-10 element candidates, others from top-50, following Deng et al. [10]. Results for GPT-3.5 and GPT-4 are from Deng et al. [10].

Method                    GoogleApp   Install   WebShop   General   Single   Overall
Representations of screen inputs: textual description (OCR + icon)
GPT-3.5 [29] (few-shot)     10.47       4.38      8.42      5.93      9.39     7.72
LLaMA2-7B [37]†             30.99      35.18     19.92     28.56     27.35    28.40
Representations of screen inputs: image
Auto-UI (unified) [43]      71.37      76.89     70.26     68.24     84.58    74.27
CogAgent (Ours)             74.95      78.86     71.73     65.38     93.49    76.88

Table 4. Performance on the Android in the Wild (AITW) dataset. † represents models individually fine-tuned on each subset, while others are unified models across all subsets. The results of LLaMA2 and GPT-3.5 are from Zhan and Zhang [43].
… hallucinations compared to other models.

3.2. GUI Agent: Computer Interface

We evaluate CogAgent on Mind2Web, a dataset for web agents that includes over 2,000 open-ended tasks collected from 137 real-world websites across 31 domains. Given the task description, the current webpage snapshot and previous actions as inputs, agents are expected to predict the subsequent action. We follow the setting of Deng et al. [10] in our experiments and report the step success rate (step SR) metric.

Several language models have been evaluated on this benchmark. For instance, AgentTuning [42] and MindAct [10] evaluated LLaMA2-70B and Flan-T5-XL in a fine-tuned setting, and GPT-3.5 and GPT-4 in an in-context learning setting. However, limited by the input modality of language models, these models could only use heavily cleansed HTML as the representation of screen inputs. To the best of our knowledge, no visually-based web agents have been experimented with on this benchmark.

We fine-tune our model on the train set and evaluate on three out-of-domain subsets, i.e., cross-website, cross-domain, and cross-task. We additionally fine-tune LLaMA2-7B and LLaMA2-70B as baselines of fine-tuned LLMs, adopting the same HTML cleansing process as Deng et al. [10] to construct the HTML input. The results are presented in Tab. 3. Compared to other methods, our approach achieves significant performance improvements across all three subsets, surpassing LLaMA2-70B, which is nearly 4× the scale of CogAgent, by 11.6%, 4.7%, and 6.6%, respectively. This reflects not only the capability of our model but also the advantages of employing a visual agent in computer GUI scenarios.

3.3. GUI Agent: Smartphone Interface

To evaluate our model on diverse smartphone interfaces and tasks, we utilize the Android in the Wild (AITW) dataset [31], a large-scale dataset for Android device agents. It comprises 715k operation episodes covering varying Android versions and device types. Each episode in the dataset consists of a goal described in natural language, followed by a sequence of actions and corresponding screenshots. The training target is to predict the next action based on the given goal, historical actions, and the screenshot. For each action, models are required to predict the exact action type; for tap, swipe and type, models are further required to predict the position, direction, and content to be typed, respectively.
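To make this action space concrete, here is a small illustrative schema (ours, not the dataset's or the model's exact output format) of what a predicted action carries; the dataset defines additional action types beyond the three named above, which this sketch omits.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional, Tuple

    class ActionType(Enum):
        TAP = "tap"      # needs a position
        SWIPE = "swipe"  # needs a direction
        TYPE = "type"    # needs the text to enter
        # ... further action types defined by the dataset are omitted here

    @dataclass
    class PredictedAction:
        action_type: ActionType
        position: Optional[Tuple[float, float]] = None  # normalized (x, y), for TAP
        direction: Optional[str] = None                  # "up"/"down"/"left"/"right", for SWIPE
        text: Optional[str] = None                       # content to be typed, for TYPE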
We conduct comparisons with two kinds of baselines: language models using the textual description of UI elements provided by the original dataset (text OCR and icons) as the representation of screen inputs², and visual-language models using images as the screen inputs. We simultaneously fine-tune on all the subsets, yielding a unified model which is then evaluated on all test sets. As the GoogleApps subset is 10-100 times larger than the other subsets, we downsample it to 10% to avoid data imbalance.

Results are shown in Tab. 4. CogAgent achieves state-of-the-art performance compared to all previous methods. In comparison to language-based methods, our model surpasses both baselines by a large margin. In comparison to the visual-language baseline, Auto-UI, our model achieves a +2.61 improvement in overall performance. For instances of inaccuracies, we randomly sample hundreds of cases, and upon reassessment, more than 40% are determined to be correct (refer to the Appendix for details). This diversity arises from the multiple valid pathways inherent in mobile interactions, resulting in a range of acceptable responses.

² Some Android applications may have a View Hierarchy, which is more friendly to language-based agents, but most of them tend to be of poor quality or missing altogether. Therefore, as a large-scale, general-purpose dataset, AITW retained the results of OCR detection and icon detection as textual representations of screenshots.
4. Ablation Study

To thoroughly comprehend the impact of various components in the methodology, we conduct ablation studies on two aspects: model architecture and training data. The evaluation is conducted on diverse datasets, including multiple VQA datasets (STVQA, OCRVQA, DocVQA) and a web agent dataset (Mind2Web). For the VQA datasets, we fine-tune the model on the four datasets together for 3,000 iterations with a batch size of 1,280 and report the generalist score; for Mind2Web, models are fine-tuned for 2,400 iterations with a batch size of 128 using the top-10 setting. Training iterations are fewer than those in the main experiment, aiming to control variables within the constraints of a limited budget.

4.1. Model Architecture

To ascertain the efficacy of the high-resolution cross-module, we compare it with directly increasing the resolution under the original model architecture of CogVLM, and ablate on two perspectives: computational efficiency and model performance.

To measure computational overhead, we use floating-point operations (FLOPs) as the metric and conduct experiments at multiple resolutions, including 224, 490, 756, and 1120. From Fig. 3 we can see that, as the image resolution increases, models that use the high-resolution cross-module experience only a modest rise in computational overhead, demonstrating an almost linear relationship with the number of image patches. In contrast, using the original model structure, i.e., CogVLM, leads to a significant increase in the number of FLOPs at higher resolutions. Its FLOPs can even be more than 10 times higher compared to employing a cross-module at a resolution of 1120, which is the resolution utilized by CogAgent.

[Figure 3. Comparison of FLOPs during forward propagation for different model architectures and resolutions.]

We further compare model performance in Tab. 5, which indicates that the model with the high-resolution cross-module at a resolution of 756 requires only about half of the computational resources used by the original structure at a resolution of 490, while delivering significantly better performance. Additionally, the high-resolution cross-module allows further increasing the model's acceptable resolution within a limited computational budget, thereby yielding additional performance improvements.

high-res module  base res  cross res  STVQA  OCRVQA  DocVQA  Mind2Web  train time/it (s)  TFLOPs
no               224       -          48.0   70.2    28.6    34.6      2.36               7.77
no               490       -          68.1   74.5    57.6    40.7      6.43               29.14
yes              224       756        73.6   74.2    62.3    40.7      3.57               10.08
yes              224       1120       78.2   75.9    74.1    41.4      5.17               12.56

Table 5. Ablation study on model architecture. Training time is evaluated on an A800 with a batch size of 8. Models are pre-trained with Caption+OCR data.

4.2. Pre-train Data

We further conduct an ablation study on pre-training data, which is an integral part of training visual agents. Building upon the image-caption data commonly used in visual-language training, we sequentially add OCR data (denoted as Cap+OCR), as well as GUI and grounding data (denoted as All). The results in Tab. 6 indicate that each part of the data broadly contributes to enhanced performance. Notably, web and grounding data have a significant impact on the Mind2Web dataset, underscoring the importance of constructing domain-specific pre-train data for training GUI agents.

pre-train data  base res  cross res  STVQA  OCRVQA  DocVQA  Mind2Web
Cap             490       -          68.1   74.5    57.6    38.6
Cap+OCR         490       -          72.5   75.0    59.8    40.7
Cap+OCR         224       1120       78.2   75.9    74.1    41.4
All             224       1120       79.4   75.6    76.4    54.2

Table 6. Ablation study on pre-train data, with image captioning, OCR and the remaining pre-train data added sequentially.

5. Conclusion

We introduce CogAgent, a VLM-based GUI agent with enhanced pre-train data construction and an efficient architecture for high-resolution input. CogAgent achieves state-of-the-art performance on a wide range of VQA and GUI benchmarks, and will be open-sourced. CogAgent is an initial exploration of VLM-based GUI agents and still has some shortcomings, e.g., imprecise output coordinates and the inability to process multiple images, necessitating further research.

Acknowledgments

This work is supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2022ZD0118600, the Natural Science Foundation of China (NSFC) 62277033, and the New Cornerstone Science Foundation through the XPLORER PRIZE. It also received partial support from the National Engineering Laboratory for Cyberlearning and Intelligent Technology, the Beijing Key Lab of Networked Multimedia, the Daimler Greater China Ltd. - Tsinghua University Joint Institute for Sustainable Mobility, the Tsinghua University (Department of Computer Science and Technology) - Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things (JCIIOT), and a research fund from Zhipu AI.
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[3] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
[4] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019.
[5] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
[6] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[7] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.
[8] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.
[9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[10] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.
[11] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020.
[14] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
[15] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
[16] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[17] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.
[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[19] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[22] Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, et al. Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419, 2023.
[23] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
[24] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[25] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
[26] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[27] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019.
[28] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[29] OpenAI. Introducing chatgpt. 2022.
[30] OpenAI. Gpt-4 technical report, 2023.
[31] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023.
[32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[33] Significant-Gravitas. Autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2023.
[34] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[35] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
[36] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
[37] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[38] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
[39] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
[40] Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. arXiv preprint arXiv:2310.05126, 2023.
[41] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[42] Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023.
[43] Zhuosheng Zhan and Aston Zhang. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436, 2023.
[44] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.