CogAgent: A Visual Language Model for GUI Agents
Wenyi Hong1 * Weihan Wang1 * Qingsong Lv2 Jiazheng Xu1 * Wenmeng Yu2
Junhui Ji2 Yan Wang2 Zihan Wang1 * Yuxiao Dong1 Ming Ding2† Jie Tang1†
1 Tsinghua University    2 Zhipu AI
{hwy22@mails, jietang@}.tsinghua.edu.cn, [email protected]
Figure 1. Samples of visual agents generated by CogAgent. More samples are demonstrated in the Appendix.
… images share a different distribution from natural images. We thus construct a large-scale annotated dataset about GUIs and OCR for continual pre-training.

• High-Resolution vs. Compute. In GUIs, tiny icons and text are ubiquitous, and they are hard to recognize at the commonly used 224 × 224 resolution. However, increasing the resolution of input images results in significantly longer sequence lengths in language models. For example, a 1120 × 1120 image corresponds to a sequence of 6400 tokens if the patch size is 14, demanding excessive training and inference compute. To address this, we design a cross-attention branch that allows for a trade-off between the resolution and the hidden size within a proper computation budget. Specifically, we propose to combine the original large ViT [12] (4.4B parameters) used in CogVLM [38] and a new small high-resolution cross-module (with an image encoder of 0.30B parameters) to jointly model visual features.

Our experiments show that:

• CogAgent tops popular GUI understanding and decision-making benchmarks, including AITW [31] and Mind2Web [10]. To the best of our knowledge, this is the first time that a generalist VLM outperforms LLM-based methods that rely on extracted structured text.

• Though CogAgent focuses on GUIs, it achieves state-of-the-art generalist performance on nine visual question-answering benchmarks including VQAv2 [1], OK-VQA [23], TextVQA [34], ST-VQA [4], ChartQA [24], infoVQA [26], DocVQA [25], MM-Vet [41], and POPE [19].

• The separated design of high- and low-resolution branches in CogAgent significantly lowers the compute cost of consuming high-resolution images; e.g., the number of floating-point operations (FLOPs) for CogAgent-18B with 1120 × 1120 inputs is less than half that of CogVLM-17B with its default 490 × 490 inputs.

CogAgent is open-sourced at https://github.com/THUDM/CogVLM. It represents an effort to promote the future research and application of AI agents, facilitated by advanced VLMs.

2. Method

In this section, we first introduce the architecture of CogAgent, especially the novel high-resolution cross-module, and then illustrate the process of pre-training and alignment in detail.

2.1. Architecture

The architecture of CogAgent is depicted in Fig. 2. We build our model on a pre-trained VLM (the right side of the figure) and propose to add a cross-attention module to process high-resolution input (the left side of the figure). As our base VLM, we select CogVLM-17B [38], an open-sourced, state-of-the-art large vision-language model. Specifically, we employ EVA2-CLIP-E [35] as the encoder for low-resolution images (224 × 224 pixels), complemented by an MLP adapter that maps its output into the feature space of the visual-language decoder. The decoder, a pre-trained language model, is enhanced with the visual expert module introduced by Wang et al. [38] to facilitate a deep fusion of visual and language features. The decoder processes a combined input of the low-resolution image feature sequence and the text feature sequence, and autoregressively outputs the target text.

[Figure 2. Model architecture of CogAgent. A lightweight high-resolution image encoder feeds cross-attention modules (hidden size 1,024) at every layer of the visual language decoder (hidden size 4,096) of the original VLM; the decoder input is the concatenation of the low-resolution image feature (via the MLP adapter) and the text feature (via word embedding), and the output is the target text.]

Similar to most VLMs, the original CogVLM can only accommodate images of relatively low resolution (224 or 490), which hardly meets the demands of GUIs, where the screen resolution of computers or smartphones is typically 720p (1280 × 720 pixels) or higher. This is a common problem among VLMs; e.g., LLaVA [21] and PALI-X [8] are pre-trained at a low resolution of 224 × 224 on the general domain. The primary reason is that high-resolution images bring prohibitive time and memory overhead: VLMs usually concatenate the text and image feature sequences as input to the decoder, so the cost of the self-attention module is quadratic in the number of visual tokens (patches), which in turn is quadratic in the image's side length. There have been some initial attempts to reduce costs for high-resolution images. For instance, Qwen-VL [2] proposes a position-aware vision-language adapter to compress image features, but it only reduces the sequence length by a factor of four and supports a maximum resolution of 448 × 448. Kosmos-2.5 [22] adopts a Perceiver Resampler module to reduce the length of the image sequence. However, the resampled sequence is still long for self-attention in the large visual-language decoder (2,048 tokens), and the approach can only be applied to restricted text recognition tasks.

Therefore, we propose a novel high-resolution cross-module as a potent complement to the existing structure for enhancing understanding at high resolutions, which not only maintains efficiency when confronting high-resolution images, but also offers flexible adaptability to a variety of visual-language model architectures.

2.2. High-Resolution Cross-Module

The structural design of the high-resolution cross-module is mainly based on the following observations:
1. At a modest resolution such as 224 × 224, images can depict most objects and layouts effectively, yet such a resolution falls short in rendering text with clarity. Hence, our new high-resolution module should emphasize text-related features, which are vital for understanding GUIs.

2. While pre-trained VLMs in the general domain often need large hidden sizes (e.g., 4,096 in PALI-X and CogVLM, 5,120 in LLaVA), VLMs tailored for text-centered tasks like document OCR require smaller hidden sizes to achieve satisfying performance (e.g., 1,536 in Kosmos-2.5 and Pix2Struct [16]). This suggests that text-related features can be effectively captured using smaller hidden sizes.

As shown in Fig. 2, the high-resolution cross-module acts as a new branch for higher-resolution input, which accepts images of size 1120 × 1120 pixels in our implementation. Different from the original low-resolution input branch, the high-resolution cross-module adopts a much smaller pre-trained vision encoder (the visual encoder of EVA2-CLIP-L [35] in our implementation, 0.30B parameters) and uses cross-attention with a small hidden size to fuse high-resolution image features into every layer of the VLM decoder, thus reducing the computational cost. To be concrete, an input image is resized to 1120 × 1120 and 224 × 224 and fed into the high-resolution cross-module and the low-resolution branch respectively, then encoded into the image feature sequences X_hi and X_lo by the two distinct-sized image encoders in parallel. The visual language decoder retains its original computations, while the only change is to integrate a cross-attention between X_hi and the hidden states in every decoder layer.

Formally, suppose the input hidden states of the i-th attention layer in the decoder are X_in^i ∈ R^{B×(L_Ilo+L_T)×D_dec}, and the output hidden states of the cross-module's image encoder are X_hi ∈ R^{B×L_Ihi×D_hi}, where B is the batch size, L_Ilo, L_Ihi and L_T are the lengths of the low-resolution image, high-resolution image and text sequences, and D_dec and D_hi are the hidden sizes of the decoder and of the high-resolution encoder's output, respectively. Each layer's attention procedure can be formulated as

    X'_i = MSA(layernorm(X_in^i)) + X_in^i,            (1)
    X_out^i = MCA(layernorm(X'_i), X_hi) + X'_i,       (2)

where MSA and MCA denote the multi-head self-attention (with visual expert) and the multi-head cross-attention over the high-resolution features, respectively. Each decoder layer projects X_hi to keys and values of a small hidden size D_cross and uses W_Qcross^i ∈ R^{D_dec×D_cross} to get Q_cross^i = X'_i W_Qcross^i ∈ R^{(L_Ilo+L_T)×D_cross} in every decoder layer. With the residual connection in Eq. 2, the cross-attention with high-resolution images can be perceived as a complement to the features of low-resolution images, thereby effectively utilizing the model previously pre-trained at low resolution.
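To make the fusion concrete, here is a minimal PyTorch-style sketch (ours, not the released implementation) of one decoder layer augmented with the high-resolution cross-attention of Eq. (1)-(2). The module names and the placement of the query projection are assumptions, and the visual expert inside the self-attention as well as the feed-forward sub-layer are omitted for brevity.

    import torch
    import torch.nn as nn

    class HiResCrossAttention(nn.Module):
        """Cross-attention from decoder hidden states (queries) to
        high-resolution image features (keys/values), cf. MCA in Eq. (2)."""
        def __init__(self, d_dec=4096, d_hi=1024, d_cross=1024, n_heads=32):
            super().__init__()
            self.q_proj = nn.Linear(d_dec, d_cross, bias=False)    # map decoder width to the small D_cross
            self.attn = nn.MultiheadAttention(embed_dim=d_cross, num_heads=n_heads,
                                              kdim=d_hi, vdim=d_hi, batch_first=True)
            self.out_proj = nn.Linear(d_cross, d_dec, bias=False)  # back to decoder width for the residual

        def forward(self, x_dec, x_hi):
            # x_dec: (B, L_lo + L_T, d_dec); x_hi: (B, L_hi, d_hi)
            out, _ = self.attn(self.q_proj(x_dec), x_hi, x_hi, need_weights=False)
            return self.out_proj(out)

    class DecoderLayerWithHiRes(nn.Module):
        """One decoder layer: the original self-attention block (Eq. (1)),
        then the added high-resolution cross-attention with a residual (Eq. (2)).
        The visual expert and the feed-forward sub-layer are omitted here."""
        def __init__(self, d_dec=4096, d_hi=1024, d_cross=1024, n_heads=32):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_dec)
            self.self_attn = nn.MultiheadAttention(d_dec, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_dec)
            self.cross_attn = HiResCrossAttention(d_dec, d_hi, d_cross, n_heads)

        def forward(self, x, x_hi, attn_mask=None):
            h = self.ln1(x)
            x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]  # Eq. (1)
            x = x + self.cross_attn(self.ln2(x), x_hi)                                   # Eq. (2)
            return x

    # e.g.: x = torch.randn(1, 256 + 32, 4096); x_hi = torch.randn(1, 6400, 1024)
    # y = DecoderLayerWithHiRes()(x, x_hi)   # y.shape == x.shape

Because the cross-attention operates at the small hidden size d_cross, its cost grows only linearly with the 6400 high-resolution tokens instead of quadratically, which is the point of the design.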
Computational complexity. Let the number of attention heads be H_cross and H_dec in the cross-attention and the self-attention, and the dimension of each head be d_cross = D_cross / H_cross and d_dec = D_dec / H_dec. When using our high-resolution cross-module, the computational complexity of attention is

    T_improved = O((L_Ilo + L_T) · L_Ihi · H_cross · d_cross + (L_Ilo + L_T)^2 · H_dec · d_dec).    (3)

Note that d_cross and H_cross can be flexibly adjusted according to the computational budget and model performance. If the high-resolution cross-module is not used and low-resolution images are directly substituted with high-resolution ones, the computational complexity would be

    T_original = O((L_Ihi + L_T)^2 · H_dec · d_dec).    (4)

In our implementation, d_cross = 32, H_cross = 32, and we inherit d_dec = 128, H_dec = 32 from CogVLM-17B. Both the high- and low-resolution encoders patchify images with 14 × 14-pixel patches, thus L_Ihi = 6400 and L_Ilo = 256. Our method leads to at least a (L_Ihi + L_T) / (L_Ilo + L_T) = (6400 + L_T) / (256 + L_T) × acceleration, which is a stringent lower bound (refer to the Appendix for the detailed derivation), and reduces memory overhead at the same time.
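As a quick numerical illustration (ours, not from the paper), the sketch below plugs these implementation constants into Eq. (3) and Eq. (4) and evaluates both the resulting attention-cost ratio and the lower bound above, for an assumed text length of L_T = 256 tokens.

    def attention_cost_ratio(L_T, L_lo=256, L_hi=6400,
                             H_dec=32, d_dec=128, H_cross=32, d_cross=32):
        """Compare the attention cost with the cross-module, Eq. (3), against
        feeding the high-resolution sequence straight into the decoder, Eq. (4).
        Constant factors hidden by the O(.) notation are dropped."""
        t_improved = (L_lo + L_T) * L_hi * H_cross * d_cross \
                     + (L_lo + L_T) ** 2 * H_dec * d_dec
        t_original = (L_hi + L_T) ** 2 * H_dec * d_dec
        lower_bound = (L_hi + L_T) / (L_lo + L_T)  # the paper's stringent lower bound
        return t_original / t_improved, lower_bound

    # With L_T = 256: attention-cost ratio of roughly 41x, guaranteed lower bound = 13.0x.
    print(attention_cost_ratio(L_T=256))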
2.3. Pre-training

To enhance the model's ability to comprehend high-resolution images and to adapt it to GUI application scenarios, we focus our pre-training efforts on the following aspects: the capability to recognize texts of various sizes, orientations, and fonts in high-resolution images, the ability to ground text and objects in the image, and a specialized understanding capability for GUI imagery such as web pages. We divide our pre-training data into three parts based on the aforementioned aspects, with samples in the Appendix. All the pre-training data are derived from publicly available datasets. The construction methods are detailed below.

Text recognition. Our data includes (1) synthetic renderings with text from the language pre-training dataset (80M). This is similar to the Synthetic Document Generator in Kim et al. [15], with text of varying font, size, color and orientation, and diverse image backgrounds from LAION-2B [32]. (2) Optical character recognition (OCR) on natural images (18M). We collect natural images from COYO [6] and LAION-2B [32], employ Paddle-OCR [13] to extract the texts and their bounding boxes, and filter out images with no text boxes. (3) Academic documents (9M). We follow Nougat [5] to construct image-text pairs including text, formulas and tables from the LaTeX source released on arXiv. For (1) and (3), we apply the same data augmentation as Nougat. For (2), we additionally employ more aggressive rotation and flipping augmentation.

Visual grounding. It is imperative for GUI agents to possess the capability to accurately comprehend and locate diverse elements within images. We follow CogVLM [38] and use a constructed visual grounding dataset of 40M images with image-caption pairs sampled from LAION-115M [18], which associate entities in the caption with bounding boxes indicating their positions. The format of a bounding box is [[x0, y0, x1, y1]], where (x0, y0) and (x1, y1) represent the coordinates of the upper-left and lower-right corners, normalized to [000, 999]. If multiple objects are indicated by a single noun phrase, their boxes are separated by semicolons within the double square brackets.

GUI imagery. Our approach addresses the scarcity and limited relevance of GUI images in datasets like LAION and COYO, which predominantly feature natural images. GUI images, with their distinct elements such as input fields, hyperlinks, icons, and unique layout characteristics, require specialized handling. To boost the model's capability in interpreting GUI imagery, we conceptualize two GUI grounding tasks: (1) GUI Referring Expression Generation (REG), where the model is tasked with generating HTML code for the DOM (Document Object Model) element corresponding to a specified area in a screenshot, and (2) GUI Referring Expression Comprehension (REC), which involves producing bounding boxes for given DOM elements.

To facilitate robust training in GUI grounding, we construct the CCS400K (Common Crawl Screenshot 400K) dataset. This dataset is formed by extracting URLs from the latest Common Crawl data, followed by capturing 400,000 web page screenshots. Alongside these screenshots, we compile all visible DOM elements and their corresponding rendered boxes using Playwright¹, supplementing the dataset with 140 million REC and REG question-answer pairs. This rich dataset supports comprehensive training on the understanding of GUI elements. To mitigate the risk of overfitting, we employ a diverse range of screen resolutions for rendering, selected randomly from a list of commonly used resolutions across various devices. Additionally, to prevent the HTML code from becoming overly extensive and unwieldy, we perform necessary data cleaning by omitting redundant attributes in the DOM elements, following the method outlined in [16].

¹ https://playwright.dev

We also incorporate publicly available text-image datasets, including LAION-2B and COYO-700M (after removing broken URLs, NSFW images, and images with noisy captions and political bias), during pre-training.

We pre-train our CogAgent model for a total of 60,000 iterations with a batch size of 4,608 and a learning rate of 2e-5. We freeze all parameters except the newly added high-resolution cross-module for the first 20,000 steps, resulting in a total of 646M (3.5%) trainable parameters, and additionally unfreeze the visual expert in CogVLM for the next 40,000 steps. We warm up with curriculum learning, first training on easier text recognition (synthetic renderings and OCR on natural images) and image captioning, then sequentially incorporating harder text recognition (academic documents), grounding data and web page data, as we observed that this leads to faster convergence and more stable training in our preliminary experiments.

2.4. Multi-task Fine-tuning and Alignment

To enhance our model's performance on diverse tasks and ensure it aligns with free-form human instructions in the GUI setting, we further fine-tune our model on a broad range of tasks. We manually collected over two thousand screenshots from mobile phones and computers, each annotated by human annotators with screen elements, potential tasks, and methods of operation in question-answering format (details illustrated in the Appendix). We also utilize Mind2Web [10] and AITW [31], datasets focusing on web and Android behaviors that comprise tasks, sequences of actions and corresponding screenshots, and convert them into a natural-language question-and-answer format using GPT-4. Besides, we incorporate multiple publicly available visual question-answering (VQA) datasets covering a variety of tasks into our alignment dataset. We unfreeze all model parameters during this stage and train for 10k iterations with a batch size of 1,024 and a learning rate of 2e-5.
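The pre-training schedule of Sec. 2.3 and the alignment stage above differ mainly in which parameters are trainable. The following is a minimal PyTorch-style sketch (ours, not the released code) of such a staged freezing schedule; the substrings used to identify the cross-module and the visual expert ("hires_cross", "visual_expert") are assumed names for illustration.

    def set_trainable(model, stage):
        """Stage 1 (first 20k pre-training steps): only the high-resolution
        cross-module is trainable. Stage 2 (next 40k steps): additionally
        unfreeze the visual expert. Stage 3 (multi-task fine-tuning and
        alignment): all parameters are trainable."""
        for name, p in model.named_parameters():
            if stage == 3:
                p.requires_grad = True
            else:
                trainable = "hires_cross" in name                       # assumed module name
                if stage == 2:
                    trainable = trainable or "visual_expert" in name    # assumed module name
                p.requires_grad = trainable

    # Rebuild the optimizer over the currently trainable parameters when the stage changes, e.g.:
    # optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)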
Method                         VQAv2   OKVQA   OCRVQA  TextVQA  STVQA   ChartQA  InfoVQA  DocVQA
Task-specific fine-tuning models
Pix2Struct [16]                  -       -       -       -        -      58.6     40.0     76.6
BLIP-2 [18]                     82.2    59.3    72.7     -        -       -        -        -
PALI-X-55B [8]                  86.0    66.1    75.0    71.4     79.9    70.9     49.2     80.0
CogVLM (task-specific) [38]     84.7    64.7    74.5    69.7      -       -        -        -
Generalist models
UReader [40]                     -      57.6     -       -        -      59.3     42.2     65.4
Qwen-VL [2]                     79.5    58.6    75.7    63.8      -      65.7      -       65.1
Qwen-VL-chat [2]                78.2    56.6    70.5    61.5      -      66.3      -       62.6
LLaVA-1.5 [20]                  80.0     -       -      61.5      -       -        -        -
Fuyu-8B [3]                     74.2    60.6     -       -        -       -        -        -
CogVLM (generalist) [38]        83.4    58.9    74.1    68.1      -       -        -        -
CogAgent (Ours)                 83.7    61.2    75.0    76.1     80.5    68.4     44.5     81.6

Table 1. Performance on visual question answering benchmarks (General VQA: VQAv2, OKVQA; Text-rich VQA: the remaining benchmarks). Bold text indicates the best score among the generalist category, and underlined text represents the best score across both generalist and task-specific categories.
Method                      cross-task   cross-website   cross-domain   overall
Representations of screen inputs: HTML
GPT-3.5 [29] (few-shot)        18.6          17.4            16.2         17.4
GPT-4 [30]† (few-shot)         36.2          30.1            26.4         30.9
Flan-T5-XL [10]                52.0          38.9            39.6         43.5
LLaMA2-7B [37]                 52.7          47.1            50.3         50.1
LLaMA2-70B [37]                55.8          51.6            55.7         54.4
Representations of screen inputs: image
Qwen-VL [2]                    12.6          10.1             8.0         10.2
CogVLM [38]                    37.1          23.4            26.3         23.9
CogAgent (Ours)                62.3          54.0            59.4         58.2

Table 3. Performance on Mind2Web. † denotes element selection from top-10 element candidates, others from top-50, following Deng et al. [10]. Results for GPT-3.5 and GPT-4 are from Deng et al. [10].

Method                    GoogleApp   Install   WebShop   General   Single   Overall
Representations of screen inputs: textual description (OCR + icon)
GPT-3.5 [29] (few-shot)     10.47       4.38      8.42      5.93      9.39     7.72
LLaMA2-7B [37]†             30.99      35.18     19.92     28.56     27.35    28.40
Representations of screen inputs: image
Auto-UI (unified) [43]      71.37      76.89     70.26     68.24     84.58    74.27
CogAgent (Ours)             74.95      78.86     71.73     65.38     93.49    76.88

Table 4. Performance on the Android in the Wild (AITW) dataset. † represents models individually fine-tuned on each subset, while others are unified models across all subsets. The results of LLaMA2 and GPT-3.5 are from Zhan and Zhang [43].
… hallucinations compared to other models.

3.2. GUI Agent: Computer Interface

We evaluate CogAgent on Mind2Web, a dataset for web agents that includes over 2,000 open-ended tasks collected from 137 real-world websites across 31 domains. Given the task description, the current webpage snapshot and previous actions as inputs, agents are expected to predict the subsequent action. We follow the setting of Deng et al. [10] in our experiments and report the step success rate (step SR) metric.

Several language models have been evaluated on this benchmark. For instance, AgentTuning [42] and MindAct [10] evaluated LLaMA2-70B and Flan-T5-XL in a fine-tuned setting, and GPT-3.5 and GPT-4 in an in-context learning setting. However, limited by the input modality of language models, these models could only use heavily cleansed HTML as the representation of screen inputs. To the best of our knowledge, no visually-based web agents have been experimented with on this benchmark.

We fine-tune our model on the train set and evaluate on three out-of-domain subsets, i.e., cross-website, cross-domain, and cross-task. We additionally fine-tune LLaMA2-7B and LLaMA2-70B as baselines of fine-tuned LLMs, adopting the same HTML cleansing process as Deng et al. [10] to construct the HTML input. The results are presented in Tab. 3. Compared to other methods, our approach achieves significant performance improvements across all three subsets, surpassing LLaMA2-70B, which is nearly 4× the scale of CogAgent, by 11.6%, 4.7%, and 6.6%, respectively. This reflects not only the capability of our model but also the advantages of employing a visual agent in computer GUI scenarios.

3.3. GUI Agent: Smartphone Interface

To evaluate our model on diverse smartphone interfaces and tasks, we utilize the Android in the Wild (AITW) dataset [31], a large-scale dataset for Android device agents. It comprises 715k operation episodes covering varying Android versions and device types. Each episode in the dataset consists of a goal described in natural language, followed by a sequence of actions and corresponding screenshots. The training target is to predict the next action based on the given goal, historical actions, and the screenshot. For each action, models are required to predict the exact action type; for tap, swipe and type, models are further required to predict the position, direction, and content to be typed, respectively.
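To make this action space concrete, here is a small illustrative schema (ours, not the dataset's or the model's exact output format) of what a predicted action carries; the dataset defines additional action types beyond the three named above, which this sketch omits.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional, Tuple

    class ActionType(Enum):
        TAP = "tap"      # needs a position
        SWIPE = "swipe"  # needs a direction
        TYPE = "type"    # needs the text to enter
        # ... further action types defined by the dataset are omitted here

    @dataclass
    class PredictedAction:
        action_type: ActionType
        position: Optional[Tuple[float, float]] = None  # normalized (x, y), for TAP
        direction: Optional[str] = None                  # "up"/"down"/"left"/"right", for SWIPE
        text: Optional[str] = None                       # content to be typed, for TYPE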
We conduct comparisons with two kinds of baselines: language models using the textual description of UI elements provided by the original dataset (text OCR and icons) as the representation of screen inputs², and visual-language models using images as the screen inputs. We simultaneously fine-tune on all the subsets, yielding a unified model which is then evaluated on all test sets. As the GoogleApps subset is 10-100 times larger than the other subsets, we downsample it to 10% to avoid data imbalance.

Results are shown in Tab. 4. CogAgent achieves state-of-the-art performance compared to all previous methods. In comparison to language-based methods, our model surpasses both baselines by a large margin. In comparison to the visual-language baseline, Auto-UI, our model achieves a +2.61 improvement in overall performance. For instances of inaccuracies, we randomly sample hundreds of cases, and upon reassessment, more than 40% are determined to be correct (refer to the Appendix for details). This diversity arises from the multiple valid pathways inherent in mobile interactions, resulting in a range of acceptable responses.

² Some Android applications may have a View Hierarchy, which is more friendly to language-based agents, but most of them tend to be of poor quality or missing altogether. Therefore, as a large-scale, general-purpose dataset, AITW retained the results of OCR detection and icon detection as textual representations of screenshots.
4. Ablation Study

To thoroughly comprehend the impact of various components in the methodology, we conduct ablation studies on two aspects: model architecture and training data. The evaluation is conducted on diverse datasets, including multiple VQA datasets (STVQA, OCRVQA, DocVQA) and a web agent dataset (Mind2Web). For the VQA datasets, we fine-tune the model on the four datasets together for 3,000 iterations with a batch size of 1,280 and report the generalist score; for Mind2Web, models are fine-tuned for 2,400 iterations with a batch size of 128 using the top-10 setting. Training iterations are fewer than those in the main experiment, aiming to control variables within the constraints of a limited budget.

4.1. Model Architecture

To ascertain the efficacy of the high-resolution cross-module, we compare it with directly increasing the resolution under the original model architecture of CogVLM, and ablate on two perspectives: computational efficiency and model performance.

To measure computational overhead, we use floating-point operations (FLOPs) as the metric and conduct experiments at multiple resolutions, including 224, 490, 756, and 1120. From Fig. 3 we can see that, as the image resolution increases, models that use the high-resolution cross-module experience only a modest rise in computational overhead, demonstrating an almost linear relationship with the number of image patches. In contrast, using the original model structure, i.e., CogVLM, leads to a significant increase in the number of FLOPs at higher resolutions. Its FLOPs can even be more than 10 times higher compared to employing a cross-module at a resolution of 1120, which is the resolution utilized by CogAgent.

[Figure 3. Comparison of FLOPs during forward propagation for different model architectures and resolutions.]

We further compare model performance in Tab. 5, which indicates that the model with the high-resolution cross-module at a resolution of 756 requires only about half of the computational resources used by the original structure at a resolution of 490, while delivering significantly better performance. Additionally, the high-resolution cross-module allows further increasing the model's acceptable resolution within a limited computational budget, thereby yielding additional performance improvements.

high-res module  base res  cross res  STVQA  OCRVQA  DocVQA  Mind2Web  train time/it (s)  TFLOPs
no               224       -          48.0   70.2    28.6    34.6      2.36               7.77
no               490       -          68.1   74.5    57.6    40.7      6.43               29.14
yes              224       756        73.6   74.2    62.3    40.7      3.57               10.08
yes              224       1120       78.2   75.9    74.1    41.4      5.17               12.56

Table 5. Ablation study on model architecture. Training time is evaluated on an A800 with a batch size of 8. Models are pre-trained with Caption+OCR data.

4.2. Pre-train Data

We further conduct an ablation study on pre-training data, which is an integral part of training visual agents. Building upon the image-caption data commonly used in visual-language training, we sequentially add OCR data (denoted as Cap+OCR), as well as GUI and grounding data (denoted as All). The results in Tab. 6 indicate that each part of the data broadly contributes to enhanced performance. Notably, web and grounding data have a significant impact on the Mind2Web dataset, underscoring the importance of constructing domain-specific pre-train data for training GUI agents.

pre-train data  base res  cross res  STVQA  OCRVQA  DocVQA  Mind2Web
Cap             490       -          68.1   74.5    57.6    38.6
Cap+OCR         490       -          72.5   75.0    59.8    40.7
Cap+OCR         224       1120       78.2   75.9    74.1    41.4
All             224       1120       79.4   75.6    76.4    54.2

Table 6. Ablation study on pre-train data, with image captioning, OCR and the remaining pre-train data added sequentially.

5. Conclusion

We introduce CogAgent, a VLM-based GUI agent with enhanced pre-train data construction and an efficient architecture for high-resolution input. CogAgent achieves state-of-the-art performance on a wide range of VQA and GUI benchmarks, and will be open-sourced. CogAgent is an initial exploration of VLM-based GUI agents and still has some shortcomings, e.g., imprecise output coordinates and the inability to process multiple images, necessitating further research.

Acknowledgments

This work is supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2022ZD0118600, the Natural Science Foundation of China (NSFC) 62277033, and the New Cornerstone Science Foundation through the XPLORER PRIZE. It also received partial support from the National Engineering Laboratory for Cyberlearning and Intelligent Technology, the Beijing Key Lab of Networked Multimedia, the Daimler Greater China Ltd. - Tsinghua University Joint Institute for Sustainable Mobility, the Tsinghua University (Department of Computer Science and Technology) - Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things (JCIIOT), and a research fund from Zhipu AI.
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[3] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
[4] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019.
[5] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
[6] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[7] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.
[8] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.
[9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[10] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.
[11] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020.
[14] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
[15] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
[16] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[17] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.
[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[19] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[22] Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, et al. Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419, 2023.
[23] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
[24] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[25] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
[26] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[27] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019.
[28] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[29] OpenAI. Introducing chatgpt. 2022.
[30] OpenAI. Gpt-4 technical report, 2023.
[31] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023.
[32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[33] Significant-Gravitas. Autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2023.
[34] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[35] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
[36] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
[37] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[38] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
[39] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
[40] Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. arXiv preprint arXiv:2310.05126, 2023.
[41] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[42] Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023.
[43] Zhuosheng Zhan and Aston Zhang. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436, 2023.
[44] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.