Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Ruisheng Cao∗12 Fangyu Lei 1 Haoyuan Wu 1 Jixuan Chen 1 Yeqiao Fu 1 Hongcheng Gao 1
Xinzhuang Xiong 1 Hanchong Zhang 2 Yuchen Mao 1 Wenjing Hu 1 Tianbao Xie 1 Hongshen Xu 2
Danyang Zhang 12 Sida Wang Ruoxi Sun 3 Pengcheng Yin 4 Caiming Xiong 5 Ansong Ni 6
Abstract
Data science and engineering workflows often span multiple stages, from ware-
housing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision
language models (VLMs) advance in multimodal understanding and code genera-
tion, VLM-based agents could potentially automate these workflows by generating
SQL queries, Python code, and GUI operations. This automation can improve the
productivity of experts while democratizing access to large-scale data analysis. In
this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing
on professional data science and engineering workflows, featuring 494 real-world
tasks in authentic computer environments and incorporating 20 enterprise-level
professional applications. These tasks, derived from real-world use cases, evaluate
the ability of a multimodal agent to perform data-related tasks by writing code
and managing the GUI in enterprise data software systems. To balance realistic
simulation with evaluation simplicity, we devote significant effort to developing
automatic configurations for task setup and carefully crafting evaluation metrics
for each task. Furthermore, we supplement multimodal agents with comprehensive
documents of these enterprise data software systems. Our empirical evaluation
reveals that existing state-of-the-art LLM/VLM-based agents do not reliably auto-
mate full data workflows (14.0% success). Even with step-by-step guidance, these
agents still underperform in tasks that require fine-grained, knowledge-intensive
GUI actions (16.2%) and involve remote cloud-hosted workspaces (10.6%). We
hope that Spider2-V paves the way for autonomous multimodal agents to transform
the automation of data science and engineering workflow. Our code and data are
available at https://spider2-v.github.io.
1 Introduction
Data science and engineering pipelines usually rely on professional data software systems such
as BigQuery, dbt, and Airbyte to acquire, process, and orchestrate large-scale data. Utilizing
these enterprise systems involves writing SQL and Python code, as well as frequent and repetitive
graphical user interface (GUI) controls, which can be complex even for experienced data scien-
tists and engineers. With rapid advances in large language models (LLMs) and vision language
models (VLMs), LLM/VLM-based autonomous agents have the potential to automate these workflows [37, 32], enhancing productivity for data scientists and engineers [38, 16] while democratizing access to large-scale data [15, 40].
∗Work done while interning at the University of Hong Kong.
[Figure 1 graphic: workflow stages (data warehousing, data ingestion, data transformation, data visualization, data orchestration) illustrated with Task 2 — "Save top 20 dramatic movies since 2000 from Snowflake database IMDB into file 'top20movies.csv' on Desktop with detailed requirements in the opened .txt file" — which interleaves writing SQL with GUI control on the Snowflake UI (create a new worksheet), the Drive UI (copy the data link), a popup panel (fill in information), and across apps (rename the output file).]
Figure 1: Spider2-V is a multimodal agent benchmark spanning across complete data science and
engineering workflows (e.g., two task examples in the Figure above). It involves various professional
enterprise-level applications and includes intensive GUI controls apart from code writing throughout
the real-time multi-turn interaction with an executable computer environment.
Previous studies on data agents have focused mainly on daily-life data processing and analysis by generating
code or API calls [42, 9, 4], neglecting other crucial stages of data science and engineering (e.g., data
ingestion and integration) using enterprise applications (e.g., Snowflake, Airflow, and Dagster).
Additionally, to complete data workflows, data scientists and engineers often need to navigate multiple
professional data systems, combining code writing with intensive GUI controls, such as navigating
web pages and clicking buttons [5, 45]. However, there is currently no benchmark that integrates both
code generation and GUI controls for professional data science and engineering.
To address this gap, we propose Spider2-V, the first multimodal agent benchmark covering the entire
data science and engineering workflow, involving 494 real-world tasks in a real-time executable
computer environment and 20 professional enterprise data software applications. Spider2-V aims to evaluate a
multimodal agent’s ability to perform professional data-related tasks by writing code and managing the
GUI in enterprise data software systems, including data warehousing (e.g., BigQuery), data ingestion
and integration (e.g., Airbyte), data transformation (e.g., dbt), data analysis and visualization (e.g.,
Superset), and data orchestration (e.g., Dagster). These tasks are derived from real-world practices,
such as official tutorials on professional applications and open-source data engineering projects (with
two task examples presented in Figure 1). We also supplement retrieval-augmented agents with
official documentation and tutorials of these software systems to assess their capability to generalize
and learn from these resources.
Each task in Spider2-V is defined within an executable computer environment based on OSWorld [34],
which allows multimodal agents to simulate human actions (e.g., typing code or
clicking buttons) in a realistic setting. Specifically, a multimodal agent can observe real-time image-
style screenshots and text-style accessibility tree of professional data applications in the current
workflow and execute its predicted actions in dynamic multi-round interaction with the computer. This
environment is connected to the real-world Internet, allowing the inclusion of professional software
requiring authentic user accounts (e.g., Snowflake). To ensure reproducible and reliable experiments
with this enterprise data software, 10 authors with computer science backgrounds developed 170
automatic task setup configurations and 151 customized evaluation metrics in total.
We experiment with state-of-the-art LLMs and VLMs, including the closed-source GPT-4 series [21], Gemini-Pro-1.5 [26], Claude-3-Opus [2], and Qwen-Max [3], as well as the open-source representatives Mixtral-8x7B [11] and Llama-3-70B [20]. The results reveal that even the top-tier VLM (GPT-4V [1]) achieves only a 14.0% success rate. On the most challenging subset, with more than 15 action steps, its performance drops to 1.2%, and the success rate of open-source LLMs is below 2%. This indicates that existing LLMs and VLMs are still far from achieving full data workflow automation. Even when provided with an oracle step-by-step plan, the overall performance only increases to 16.2%. This observation reveals the poor action-grounding capability of multimodal agents (e.g., identifying the precise coordinates of elements in the currently focused application window). Furthermore, extensive analysis (§ 4.3) on Spider2-V demonstrates that several strategies markedly improve the final performance: enhancing the alignment between different observation modalities, introducing feedback on action execution, integrating retrieved document context, and enlarging the history trajectory length. These findings lay the groundwork for developing practical multimodal agents that can revolutionize the automation of data science and engineering workflows.
Generally, an autonomous data agent is modeled as a partially observable Markov decision process (POMDP). Given the current observation o_t ∈ O, which includes a natural language instruction and a screenshot, an accessibility tree (a11ytree), or their combination, an agent generates an executable action a_t ∈ A. This action can be clicking on a certain pixel of the screen (CLICK(560, 200)) or writing code through the keyboard (TYPE("ls -lh")). The execution of a_t results in a new state s_{t+1} ∈ S (e.g., the updated computer state) and a new partial observation o_{t+1} ∈ O. The a11ytree is a text-style representation of the desktop environment, which describes the status, position, and text content of each element (e.g., windows, buttons, and input boxes). The interaction loop repeats until an action that marks termination (DONE or FAIL) is generated or the agent reaches the maximum number of steps. See App. D for more details about the observation space and action space.
[Figure 2 graphic: examples of automatic task setup, including (a) File Transfer (local or cloud uploading into the VM), (b) Application Launch, and (c) Remote API Calls (e.g., resetting an empty cloud database or managing cloud resources via data API calls).]
To ensure that an agent starts from a consistent initial state, we invoke a series of function calls based
on a pre-stored virtual machine (VM) snapshot to reset the environment. These function calls vary
among tasks, and we summarize them into 5 universal categories with their functionalities (see Figure 2),
namely: 1) File Transfer: transfer files or project archives (either from local or cloud storage) into the
VM; 2) Application Launch: open software on the desktop, e.g., Visual Studio Code and Chromium;
3) Remote API Calls: invoke tool-specific API calls for professional applications, especially those
requiring authentic user accounts, to reset and configure cloud workspaces; 4) Script Execution:
execute a shell script in VM to set up the initial state, e.g., run a Docker container to start a localhost
webserver for Superset; 5) Playwright Automation: run web browser simulation with Playwright,
e.g., sign into an account or click a specific button and redirect to the target web page.
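To illustrate how these categories compose into one task's initialization, the sketch below lists setup calls for a hypothetical Airflow-style task. The field names, values, and URL are illustrative assumptions only; the actual .json configuration format is shown in Figure 14 and App. F.

# Hypothetical initial-state configuration (illustrative schema only; the
# real .json format is shown in Figure 14 / App. F).
task_setup = [
    {"type": "file_transfer",          # 1) copy an unfinished project archive into the VM
     "src": "https://example.com/unfinished_airflow_project.zip", "dest": "~/projects/"},
    {"type": "script_execution",       # 4) launch the local web server via Docker
     "command": "cd ~/projects/airflow_project && docker compose up -d"},
    {"type": "application_launch",     # 2) open relevant desktop software
     "apps": ["code", "chromium-browser"]},
    {"type": "playwright_automation",  # 5) browser simulation, e.g., auto-login
     "actions": [{"goto": "http://localhost:8080"},
                 {"fill": ["#username", "admin"]},
                 {"click": "#login-button"}]},
]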
2.3 Task-specific Evaluation
[Figure 3 graphic: three evaluation methods — (a) file-based comparison of the result file with the ground truth via compare_csv; (b) information-based validation, e.g., calling the Airbyte API (curl -X POST .../v1/connections/get with a connectionId payload) and checking that the connection's scheduleType is "cron" with cronExpression "0 0 18 * * ?" and cronTimeZone "UTC"; (c) execution-based verification, e.g., running `astro dev run dags list-runs --dag-id ${DAG_ID} --start-date ${now}`, extracting the run status, and checking that it equals "success".]
After the interaction terminates, we only have access to the open-ended resulting state of the computer.
Thus, to measure whether the goal of each task is accomplished, we write task-specific functions to
retrieve the desired result from the open-ended resulting state and return a success flag (0/1). In
total, Spider2-V contains 170 initial state configurations and 151 evaluation scripts. We classify all
evaluation methods into 3 generic categories, also shown in Figure 3:
a) File-based comparison: this method finds and copies the target files from VM to the host, and
resorts to file-type based metrics (e.g., .json, .csv, etc.) to compare the specified aspect of the
generated file with ground truth. Sometimes, the ground truth may be updated over time. In this
case, we will fetch the latest labels from the Internet during evaluation.
b) Information-based validation: this scheme is usually utilized to extract and check the desired
information from the computer. For example, in Figure 3(b), we want to confirm whether the time
schedule of the data transfer is correctly configured in Airbyte. We can invoke Airbyte APIs, or
use Playwright with Chromium, to retrieve and locate the target value.
c) Execution-based verification: to verify whether an expected goal is achieved, we may also need to
first execute a complicated shell script in the final VM. For example, in Figure 3(c), we manually
trigger the target Airflow DAG 2 and check its eventual status from the run logs.
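As a concrete illustration of method a), the sketch below compares a generated CSV file with its ground truth. The function name mirrors the compare_csv metric in Figure 3(a), but the exact comparison logic (here, strict cell-wise equality) is a simplifying assumption.

import csv

def compare_csv(result_path: str, gold_path: str) -> float:
    """File-based comparison (sketch): return 1.0 iff the generated CSV and
    the ground-truth CSV contain identical rows and cell values."""
    def load(path):
        with open(path, newline="") as f:
            return [tuple(cell.strip() for cell in row) for row in csv.reader(f)]
    try:
        return float(load(result_path) == load(gold_path))
    except FileNotFoundError:   # the agent never produced the target file
        return 0.0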
3 Benchmark Construction
In this section, we introduce the general annotation pipeline, document warehouse construction, and
dataset statistics for Spider2-V. For concrete examples, refer to App. F.
To construct tasks in different categories, we find that official tutorials of enterprise applications serve
as an excellent starting point. The 6-step annotation pipeline is illustrated in Figure 4(a), and we
elaborate on it with a concrete, real example, “Orchestrate dbt Core jobs with Airflow and Cosmos” 3 :
1) Collect tutorials: firstly, we find tutorials from official websites for each professional tool in
Figure 5. In total, 10 annotators collected 217 source URLs. Note that these tutorials may utilize
other professional software, e.g., MySQL. All involved professional tools are listed in App. B.
2) Learn tutorials: the annotator selects one tutorial, learns it, and reproduces it in the VM. Afterwards,
they summarize the key knowledge points from this tutorial. For example, in Figure 4(b), five key
steps for integrating a dbt project into an Airflow task are extracted.
2 A DAG in Airflow is defined as a collection of tasks to run, and DAG_ID is used to uniquely identify it.
3 The selected Airflow tutorial URL: https://www.astronomer.io/docs/learn/airflow-dbt
[Figure 4 graphic: the annotation pipeline (1. collect tutorials, ...) with key steps of the running example — i) configure the Astro project environment, ii) prepare a dbt project, iii) create an Airflow connection, ... — and the resulting abstract instruction: "I want to integrate an existing dbt project `jaffle-shop` into an Airflow DBT task group. The connection to PostgresSQL has been configured. Could you help me finish the remaining work? Name the target DAG `jaffle_shop_dag`, and schedule it to run at 10:00 a.m. everyday."]
Figure 4: General annotation pipeline with one selected demonstration from the official Airflow
tutorial: Orchestrate dbt Core jobs with Airflow and Cosmos.
3) Write instructions: since the chosen tutorial is extremely complicated, the annotator can select
a few key points to construct the task instruction. In Figure 4, we only select key steps iv)
and v) to write two versions of instructions, abstract and verbose, indicating different levels of
proficiency. Note that, to avoid potential data contamination and make the task more realistic,
we ask the annotator to introduce at least two modifications to the raw tutorial. In this example,
we a) replace the original “my_simple_dbt_project” with an open-source dbt project called
“jaffle-shop” 4 , and b) add one extra requirement on the time schedule (10:00 a.m. daily).
4) Write environment setup functions: the next step is to write initialization functions using
operations defined in § 2.2. In the example above, we need to: a) Upload an unfinished Airflow
project into the VM. b) Execute a Shell script to launch the web server (via Docker containers) for
Airflow under the project folder. c) Open all relevant applications on the desktop to simulate
real user scenarios. d) Use Playwright to auto-login to the default Airflow account.
5) Write task-specific evaluation functions: in this step, annotators are required to programmatically
obtain results from the open-ended states of the VM and assess whether the task is completed
using the methods in § 2.3. In this example, the evaluator a) manually runs the target Airflow
DAG and verifies that its final status is “success”; and b) uses the Airflow CLI to retrieve details
of the target DAG and compares its dbt sub-tasks, status, and schedule with the ground truth.
6) Cross-validate on VM: to ensure correctness, we go through strict cross-validation. Each
annotated task is sent to two other annotators to check: a) whether the chosen task reflects a
real-world use case; b) whether verbose instruction accurately fulfills the task and its requirements
in the abstract instruction; c) whether the environment can be reset to the same state in different
trials; d) whether the evaluation is robust when we exactly follow the verbose instruction or
modify some inconsequential steps; e) whether the evaluation score is 0 if we deliberately make
some mistakes (red-teaming). The task is preserved only if it withstands all these tests.
On average, the annotation of one task (including cross-validation) costs roughly 4 hours.
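For the running example, the evaluator of step 5a) could be sketched as below. It shells out to the same astro dev run dags list-runs command shown in Figure 3(c); the helper name, the flags of the locally installed CLI, and the column-based output parsing are assumptions that would need to match the installed Astro/Airflow versions.

import subprocess

def dag_run_succeeded(dag_id: str = "jaffle_shop_dag") -> bool:
    """Execution-based check (sketch): read the latest run status of the target
    DAG via the Astro CLI, mirroring the shell snippet in Figure 3(c).
    Flags and output parsing are illustrative assumptions."""
    proc = subprocess.run(
        ["astro", "dev", "run", "dags", "list-runs", "--dag-id", dag_id],
        capture_output=True, text=True, timeout=120,
    )
    for line in proc.stdout.splitlines():
        fields = line.split()
        # assume the run state appears in the third column, as parsed by
        # awk '{print $3}' in Figure 3(c)
        if len(fields) >= 3 and fields[0] == dag_id:
            return fields[2] == "success"
    return False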
Even senior data scientists query the official documentation of professional applications when completing
a complicated data engineering task. To compensate for the deficiencies of data agents in utilizing
enterprise professional software (e.g., unawareness of coding specifications or APIs), we build a document
warehouse for Spider2-V. Concretely, we recursively crawl the web pages from the root websites of
the professional applications in Figure 5. After pre-processing through heuristics (refer to App. C),
raw HTML web pages are converted into 3 different formats for retrieval, namely a) pure text, b)
markdown, and c) simplified HTML. Eventually, we obtain 11,231 documents in total.
4 URL of the open-source dbt project “jaffle-shop”: https://github.com/dbt-labs/jaffle-shop
[Figure 6 plots: frequency distributions of action steps (0–40+), instruction words (0–600+, for abstract vs. verbose instructions), and number of related applications (0–8) per task.]
Figure 6: Distribution of action steps, instruction length, and related applications per task.
Tasks We classify all 494 tasks in Spider2-V into 7 categories and 11 software sub-categories
with main statistics in Figure 5 and Table 1. Specifically, most tasks (280, 56.7%) involve both CLI and
GUI operations, and 34% of the examples require registering authentic software accounts. Since each
task is associated with a detailed, step-by-step tutorial (the verbose instruction), the entire task set can
be categorized into three distinct levels based on the number of actions in these instructions. The
proportion of easy, medium, and hard tasks is approximately 1 : 2 : 1. According to the rightmost
distribution depicted in Figure 6, most tasks necessitate the coordinated utilization of multiple
professional applications, thereby establishing Spider2-V as a particularly challenging benchmark.
Comparison with existing benchmarks In Table 2, we compare Spider2-V with other agent
benchmarks. Spider2-V incorporates generic computer control commands into the field of data
science and engineering and is distinguished by these salient features: 1) a real-time executable
environment. Instead of providing static input-output pairs, Spider2-V is equipped with a dynamic
computer desktop such that agents can proactively explore it; 2) multiple enterprise software. We
integrate 20 professional applications into the benchmark, which include not only tools installed on
local hosts but also cloud-based enterprise services; 3) intensive GUI operations. Unlike traditional
coding or data science domains, experienced data scientists frequently manipulate the UIs of these
professional applications to simplify the data workflow (e.g., enabling a specific function on a UI
page or visualizing the graph view of data inputs). In summary, Spider2-V focuses on the use of
professional enterprise software with visual interfaces in an interactive computer environment.
Table 2: Comparison with existing agent benchmarks. Columns include the research field (Field),
whether an executable environment is provided (Exec. Env.?), whether enterprise service is utilized
(Ent. Serv.?), whether GUI actions are supported (GUI Support?) and some other statistics.
[Table 2 columns: Benchmark, Field, Exec. Env?, Ent. Serv.?, GUI Support?, # Apps/Sites, # Exec.-based Eval. Func., # Tasks.]
Agent baselines The baseline method includes 3 techniques in zero-shot prompt learning: 1) Set-of-Mark (SoM, [36]): following OSWorld [34] and VisualWebArena [14], we adopt heuristic methods to retrieve the coordinates of visible elements from the a11ytree (a text-format observation type) and draw indexed bounding boxes for these elements on the screenshot. We further insert these indexes into the pruned a11ytree to enhance the alignment between the screenshot and the a11ytree. 2) Execution Feedback (EF, [28]): we append execution feedback messages for those actions that failed to be grounded in the environment due to unexpected errors. The two techniques mentioned above are elaborated in App. D.3.1. 3) Retrieval-Augmented Generation (RAG, [8]): we use the task instruction as the query, bge-large-en-v1.5 [33] as the embedding model, and the LlamaIndex [18] framework as the retriever to generate document context for each task example. Documents are pre-chunked into segments with a maximum length of 512 tokens and an overlap of 20 tokens. The top 4 segments are selected as additional context in the task prompt (detailed in App. G.3).
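The retrieval step can be sketched as follows. Our implementation relies on the LlamaIndex framework; the version below only reproduces the same chunk–embed–retrieve logic directly with the bge-large-en-v1.5 model through sentence-transformers, and its whitespace-based chunking is a simplification of tokenizer-based splitting.

from sentence_transformers import SentenceTransformer, util

def retrieve_context(instruction: str, documents: list[str],
                     top_k: int = 4, chunk_size: int = 512, overlap: int = 20) -> list[str]:
    """Minimal RAG sketch: chunk documents, embed chunks and the task
    instruction with bge-large-en-v1.5, and return the top-k chunks."""
    chunks = []
    step = max(chunk_size - overlap, 1)
    for doc in documents:
        words = doc.split()                       # simplification: whitespace "tokens"
        for i in range(0, len(words), step):
            chunks.append(" ".join(words[i:i + chunk_size]))
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    doc_emb = model.encode(chunks, normalize_embeddings=True)
    query_emb = model.encode(instruction, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity to the query
    best = scores.argsort(descending=True)[:top_k]
    return [chunks[int(i)] for i in best]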
LLMs and VLMs We experiment with state-of-the-art LLMs and VLMs, including open-source representatives such as Mixtral-8x7B [11] and Llama-3-70B [20], and closed-source models including Qwen-Max [3], Gemini-Pro-1.5 [26], Claude-3-Opus [2], and the GPT [1] family (GPT-4o and GPT-4V 5). For the two open-source LLMs and Qwen-Max, we use the pure text-format a11ytree as the observation type because they cannot process images. For the remaining 4 VLMs, which support vision input, we use aligned text and images (that is, Set-of-Mark) as the observation type in the main experiments. Unless otherwise specified, we set the temperature to 0.5 and top_p to 0.9, the history trajectory window size to 3, the maximum length of the a11ytree to 5000 tokens, and the maximum output tokens to 1500 per turn. Heuristically, we require the agent to complete each task within 15 interaction turns and one hour, which suffices for most tasks 6.
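These decoding settings map onto a chat-completion call roughly as in the sketch below (shown for GPT-4o via the OpenAI Python client); the prompt variables are placeholders, and other model providers expose analogous parameters.

from openai import OpenAI

client = OpenAI()                       # assumes OPENAI_API_KEY is set in the environment
system_prompt = "..."                   # placeholder: full system prompt in App. G.1
task_prompt = "..."                     # placeholder: task + observation + retrieved context

response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",          # GPT-4o version used in our experiments (footnote 5)
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": task_prompt}],
    temperature=0.5,                    # decoding settings from the main experiments
    top_p=0.9,
    max_tokens=1500,                    # maximum output tokens per turn
)
predicted_action = response.choices[0].message.content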
In Table 3, we compare the performance of different LLMs and VLMs. All results integrate the
techniques of both execution feedback (EF) and retrieval-augmented generation (RAG) from § 4.1.
Accordingly, we can summarize that:
1) Existing data agents are far from satisfactory in completing real-world data science and
engineering tasks. Even the state-of-the-art VLMs (GPT-4o and GPT-4V) perform poorly on
Spider2-V, achieving at best a 14.0% overall success rate.
5 We utilize the version gpt-4o-2024-05-13 for GPT-4o and gpt-4-1106-vision-preview for GPT-4V.
6 Although some tasks require more than 15 actions, we encourage the multimodal agent to predict multiple actions in one response in order to save the budget in the prompt design (see App. G.1.2).
Table 3: Success rates of baseline agents on Spider2-V grouped by 7 task categories (see Figure 5),
namely data warehousing (ware.), transformation (trans.), ingestion (inges.), visualization (visual.),
orchestration (orche.), traditional data processing (proc.), and IT service management (manag.). For
the first three LLMs, since they do not support visual information, we only utilize the text-based
a11ytree as the observation. For the remaining four VLMs, we adopt Set-of-Mark (see § 4.1).
LLM / VLM        Observation     Success Rate (%)
                                 ware.  trans.  inges.  visual.  orches.  proc.  serv.  Overall
Mixtral-8x7B     a11ytree         1.2    0.0     0.0     0.0      2.6     0.9    0.0     0.8
Llama-3-70B      a11ytree         2.4    0.0     0.0     2.5      3.9     2.8    0.0     2.0
Qwen-Max         a11ytree         1.2    0.0     0.0     0.0      2.6     0.0    0.0     0.6
Claude-3-Opus    Set-of-Mark      2.4    2.5    10.4    15.0     11.5     3.8   12.1     8.1
Gemini-Pro-1.5   Set-of-Mark      3.6    2.5    14.6    15.0     10.3     2.8   19.0     9.1
GPT-4o           Set-of-Mark      7.2    7.5    24.0    14.1     19.8    10.1   13.8    13.8
GPT-4V           Set-of-Mark     10.8   10.0    12.0    25.0     18.4     8.5   12.1    14.0
As for their strongest competitors, Gemini-Pro-1.5 [26] and Claude-3-Opus [2], they attain even worse
performance, below 10%. There is still ample room for improvement in future work.
2) Closed-source models are far superior to open-source ones. For the open-source
LLMs, the success rate is exceedingly low, with some categories approaching zero. On one hand,
it can be attributed to the fact that closed-source VLMs are pre-trained and fine-tuned on data of
higher quality. On the other hand, closed-source VLMs support inputs with longer contexts and
integrate both vision and text modalities (further analyzed in § 4.3).
3) Performances of data agents exhibit high variance, especially in categories “data ingestion”
and “data visualization”. The majority of these two partitions are pure GUI tasks, which means
agents mostly interact with the environment through time-dependent GUI operations. However, a
minor error in one intermediate step can be amplified, resulting in the entire sequence of actions
being wasted. Through error analysis on trajectories, we discover that once agents mispredict the
coordinates of the correct button, they will open the wrong window and become trapped in the
incorrect area, unable to return.
4) Across 7 data categories, the partitions “data warehousing” and “traditional data processing”
are extremely challenging. The reasons for this observation are two-fold: a) data warehousing
tasks mostly involve authentic user accounts (e.g., BigQuery and Snowflake). Compared to
other tasks, which can be accomplished on a local host, these dynamic real-world scenarios place an
extra burden on data agents, such as network connection delays and pop-up windows. Multimodal
agents need to deal with these unexpected situations in real-time interaction with the computer.
b) As for traditional data processing, the bottleneck is that spreadsheets in Excel contain many
cells, and it is particularly difficult for data agents to accurately locate the coordinates of cells.
For example, applying the same math formula to an entire column requires the multimodal agent to
first pinpoint the bottom-right corner of a specific cell, wait for the mouse cursor to become a cross,
then press and drag the mouse towards the target cell. This series of actions requires precise and
fine-grained GUI controls, which are difficult to perform.
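In the pyautogui action space, this fill-handle drag corresponds to a sequence roughly like the one below; all pixel coordinates are placeholders that the agent would first have to ground from the screenshot or a11ytree.

import pyautogui

# Sketch of the fill-handle drag described above; the coordinates are
# placeholder values, not taken from any real task.
cell_corner = (642, 318)    # bottom-right corner of the source cell
target_cell = (642, 560)    # last cell of the column to fill

pyautogui.moveTo(*cell_corner, duration=0.3)   # hover until the cursor becomes a cross
pyautogui.mouseDown()
pyautogui.moveTo(*target_cell, duration=1.0)   # drag the fill handle downwards
pyautogui.mouseUp()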
4.3 Analysis
In this section, we delve into the different factors that influence the eventual success rate and analyze
the underlying reasons. The following analyses are based on our agent baseline with the VLM GPT-4o
unless otherwise specified. First, we split the overall results into different subsets in Table 4.
1) Tasks with more inherent action steps are more difficult. Each task is associated with one
verbose task instruction, which gives step-by-step guidance on how to complete it. We count the
number of actions in the verbose instruction and split the entire task set into 3 difficulty levels:
≤ 5 steps (Easy), 5–15 steps (Medium), and > 15 steps (Hard). Not surprisingly, as the number
of intrinsic action steps increases, the average performance decreases significantly, and for the
extremely tough tasks, existing VLM-based data agents can hardly accomplish the goal.
Table 4: Success rate of GPT-4o with the agent baseline SoM+EF+RAG across different partitions.
Task Splits    Ratio (%)    SR (%)
Easy             19.8        38.8
Medium           62.8         9.7
Hard             17.4         1.2
w/o account      66.0        15.6
w/ account       34.0        10.6
CLI               5.7         7.1
GUI              37.7        20.1
CLI+GUI          56.7        10.6
Abstract         50          11.3
Verbose          50          16.2

Table 5: Ablation study on action space, observation types, and the 3 tricks in § 4.1 on a task subset.
Action Space    Observation Types        SR (%)
JSON dict       screenshot                 4.2
pyautogui       screenshot                 4.2
JSON dict       a11ytree                  10.5
pyautogui       a11ytree                  12.6
pyautogui       screenshot+a11ytree       11.4
pyautogui       w/ Set-of-Mark            15.6
pyautogui       w/ exec. feedback         13.6
pyautogui       w/ retrieval aug.         14.4
pyautogui       w/ all tricks             16.3
2) Tasks involving authentic user accounts are much more challenging. One salient feature of
Spider2-V is the integration of professional applications that require authentic user accounts. We
also split the entire task set accordingly (w/o or w/ account). Notably, data agents struggle to
complete tasks involving authentic user accounts (10.6% success rate). These tasks deal with
real-world scenarios and incorporate cloud-hosted enterprise services. Compared with web
servers launched locally in the VM (e.g., from Docker containers), the cloud web UIs
1) generally integrate more comprehensive functionalities or options in their menu panels, and
2) potentially suffer from contingencies, such as extended network response delays due to
bandwidth limitations or server overload. We conjecture that these two causes collectively contribute to
the inferior performance.
3) Incorporating GUI operations typically leads to improved performance. We split the task set
by interface. If a task can be completed with pure CLIs (e.g., a code editor or bash terminal),
we classify it as cli. If the task only requires the agent to manipulate the GUI (usually on the
Web page), we classify it into gui. For the remaining cases (cli+gui), an agent must write code
or scripts, and control the UI screen. We observe that pure gui tasks are much easier than cli
tasks. This conclusion can be explained by the following two reasons: 1) GUIs of professional
applications are designed to simplify the original coding task. Clicking buttons or typing values in
UIs avoids handling rigorous and complex coding specifications. 2) Both observation types,
namely the screenshot and a11ytree, are naturally designed for GUI tasks. For pure cli tasks,
data agents must perform extra actions to locate and switch to the target panel before writing code.
In Table 5, we analyze the influence of different combinations of action space, observation types,
and the 3 techniques described in § 4.1. The findings include: 1) Regarding the action space, pyautogui
code slightly outperforms the self-customized JSON dict (12.6% vs. 10.5%). This can be attributed
to the fact that, with the pyautogui action space, agents can also generate functional Python code
(e.g., file traversal) beyond the limited set of GUI control operations, which improves the
efficiency of action grounding. 2) As for observation types, a single screenshot leads to very
low performance (4.2%) on account of the agent's failure to pinpoint concrete elements.
When the a11ytree, which contains precise coordinates, is inserted into the observation, the agent's
capability of locating target pixels is markedly improved. 3) All 3 tricks we integrate into the
agent baseline (namely SoM, EF, and RAG) boost the eventual performance. Interestingly,
if we do not adopt Set-of-Mark (that is, enhancing the alignment between the two observation
modalities), the result of screenshot+a11ytree is even worse than that of pure
a11ytree. This emphasizes the significance of modality alignment when handling state observations.
A moderate temperature and a longer history window size improve performance. In Figure 7, we
investigate the influence of two hyper-parameters on a task subset: 1) the top-ranked performance is
achieved with a sampling temperature of 0.5; 2) as the history window size grows from 0 (no history,
only the current observation) to 3, the performance increases steadily. However, due to constraints
on input length and considerations of cost-effectiveness, we are unable to extend the history
trajectories any further. This also indicates that interaction efficiency is a serious issue and a
promising research direction.
Figure 7: Ablation study on hyper-parameters.
5 Related Work
Benchmarks for data science and engineering In the field of data science and engineering, several
recent works propose novel benchmarks to evaluate the capabilities of LLM agents in manipulating
Excel spreadsheets [16, 4], common data science libraries (e.g., SQL and pandas) [42, 15, 9, 40],
machine learning [10] or software engineering [16] projects. They are usually confined to a single
stage within the entire data pipeline, predominantly data processing and analysis, thus overlooking
other stages, such as data warehousing and orchestration, from a broader perspective. Besides, like
other coding-related datasets [38, 29, 41], they focus solely on the command-line interface, neglecting
the fact that enterprise software usually has rich graphical user interfaces (GUIs), and that data scientists
often combine code programming with intensive GUI operations to fulfill a data workflow. To this
end, Spider2-V is proposed as the first-of-its-kind multimodal agent benchmark in the field of data
science and engineering, which covers the entire data workflow and integrates visual interfaces.
Benchmarks for multimodal agents Existing works on GUI interaction mainly encompass web
navigation [27, 17, 39, 5, 14], mobile device [43, 44, 24, 25, 30], and computer desktop [34, 32, 7, 13].
One trend of recent advanced benchmarks is to provide an executable simulation environment. Multi-
modal agents can explore and interact with this platform through keyboard, mouse, gesture and
touch screen actions in a more realistic and complex scenario. However, previous literature mostly
focuses on daily life applications (e.g., Web browser and calendar) [35, 23] or workflows of non-
specialized business tasks [31]. Few works [6, 34, 31] investigate the capability of multimodal agents
to manipulate enterprise-level software. GUIs of professional applications often contain abundant
domain-specific terminologies (e.g., “materialization” in Dagster), which requires multimodal
agents to understand the specialized knowledge. Spider2-V incorporates 20 professional tools into
a real-time computer environment to test the proficiency of agents in data science and engineering.
Furthermore, we supplement a large volume of documents for retrieval to compensate for deficiencies
of agents in domain knowledge.
6 Conclusion
In this work, we propose Spider2-V, the first data science and engineering benchmark which integrates
enterprise professional applications and supports intensive GUI operations besides code writing across
the full data pipeline. It contains 494 tasks, involves 20 professional tools, and provides a real-time
executable computer environment. The most advanced VLM (GPT-4V) still performs poorly on
Spider2-V (achieving 14.0% success rate), rendering it a very challenging benchmark. Although
current multimodal agents are still far from automating data workflows, Spider2-V presents an easily
accessible benchmark and lays the foundation for future research.
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,
Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren,
Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu,
Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu,
Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang,
Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv
preprint arXiv:2309.16609, 2023.
[4] Yibin Chen, Yifu Yuan, Zeyu Zhang, Yan Zheng, Jinyi Liu, Fei Ni, and Jianye Hao. Sheetagent:
A generalist agent for spreadsheet reasoning and manipulation via large language models. arXiv
preprint arXiv:2403.03636, 2024.
[5] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and
Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070,
2023.
[6] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom
Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How
capable are web agents at solving common knowledge work tasks? In ICLR 2024 Workshop on
Large Language Model (LLM) Agents, 2024.
[7] Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu,
Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. Assistgui: Task-oriented desktop graphical
user interface automation. arXiv preprint arXiv:2312.13108, 2023.
[8] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun,
and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv
preprint arXiv:2312.10997, 2023.
[9] Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing
Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks.
arXiv preprint arXiv:2401.05507, 2024.
[10] Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Benchmarking large language models
as ai research agents. arXiv preprint arXiv:2310.03302, 2023.
[11] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[12] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik
Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint
arXiv:2310.06770, 2023.
[13] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem
Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal
generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553, 2024.
[14] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang,
Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena:
Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649,
2024.
[15] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-
tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark
for data science code generation. In International Conference on Machine Learning, pages
18319–18345. PMLR, 2023.
[16] Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and ZHAO-XIANG ZHANG. Sheetcopilot:
Bringing software productivity to the next level through large language models. Advances in
Neural Information Processing Systems, 36, 2024.
[17] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement
learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802,
2018.
[18] Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index.
Accessed: 2024-05-08.
[19] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with
multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024.
[20] Meta AI. Introducing meta Llama 3: The most capable openly available LLM to date, April
2024. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-04-18.
[21] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[22] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M.
Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021.
[23] Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei
Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint
arXiv:2304.08354, 2023.
[24] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android
in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088,
2023.
[25] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary-
beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert
Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic
benchmarking environment for autonomous agents, 2024.
[26] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-
baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv
preprint arXiv:2403.05530, 2024.
[27] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of
bits: An open-domain platform for web-based agents. In International Conference on Machine
Learning, pages 3135–3144. PMLR, 2017.
[28] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao.
Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information
Processing Systems, 36, 2024.
[29] Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, and Tao Yu.
Arks: Active retrieval in knowledge soup for code generation. arXiv preprint arXiv:2402.12317,
2024.
[30] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and
Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.
arXiv preprint arXiv:2401.16158, 2024.
[31] Michael Wornow, Avanika Narayan, Ben Viggiano, Ishan S Khare, Tathagat Verma, Tibor
Thompson, Miguel Angel Fuentes Hernandez, Sudharsan Sundar, Chloe Trujillo, Krrish Chawla,
et al. Do multimodal foundation models understand enterprise workflows? a benchmark for
business process management tasks. arXiv preprint arXiv:2406.13264, 2024.
[32] Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu
Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with
self-improvement. arXiv preprint arXiv:2402.07456, 2024.
[33] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources
to advance general chinese embedding, 2023.
[34] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing
Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio
Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal
agents for open-ended tasks in real computer environments, 2024.
[35] Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool
manipulation capability of open-source large language models, 2023.
[36] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark
prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441,
2023.
[37] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik
Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software
engineering. arXiv preprint arXiv:2405.15793, 2024.
[38] John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing
and benchmarking interactive coding with execution feedback. Advances in Neural Information
Processing Systems, 36, 2024.
[39] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable
real-world web interaction with grounded language agents. Advances in Neural Information
Processing Systems, 35:20744–20757, 2022.
[40] Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua
Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Oleksandr Polozov, and Charles
Sutton. Natural language to code generation in interactive data science notebooks. In Anna
Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
126–173, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.
18653/v1/2023.acl-long.9. URL https://aclanthology.org/2023.acl-long.9.
[41] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene
Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. Spider: A large-scale
human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
pages 3911–3921, Brussels, Belgium, October-November 2018. Association for Computational
Linguistics. doi: 10.18653/v1/D18-1425.
[42] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene
Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for
complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018.
[43] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang
Yu. Appagent: Multimodal agents as smartphone users. arXiv e-prints, pages arXiv–2312,
2023.
[44] Danyang Zhang, Lu Chen, and Kai Yu. Mobile-env: A universal platform for training and
evaluation of mobile interaction. arXiv preprint arXiv:2305.08144, 2023.
[45] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng,
Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building
autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
A Relevant URLs
[Figure 8 graphic: e.g., a data transformation task performed via the GUI — "Separate the logic of model 'customers' out into two staged models, 'stg_customers' and 'stg_orders'."]
Figure 8: Overview of Spider2-V, which includes task examples across the full data pipeline, an
executable computer environment, and a document warehouse for agent retrieval.
Github Repository The task examples, environment, documents, code and experiments are publicly
available in Github repository https://github.com/xlang-ai/Spider2-V under Apache-2.0
license. Both the environment and task examples will be maintained by the authors continuously.
Concretely, the environment code is adapted from the prior work OSWorld [34], which is released
under the Apache-2.0 license. A non-exhaustive list of artifacts (or task examples) used in Spider2-V
includes: 1) SheetCopilot [16], which is released under the GPL-3.0 license, 2) WorkArena [6],
which is distributed under Apache-2.0 license, and 3) official tutorials or guides on professional
applications (e.g., dbt, Airflow, Dagster, Superset, etc.). These tutorials are free to use and
publicly available. For those enterprise applications which require real accounts, namely BigQuery,
Snowflake, dbt-cloud and ServiceNow, we only exploit their sandbox functions or free-trials
without introducing any extra cost or privacy issues.
Table 8 lists the official documentation websites corresponding to different software. We crawled only
the English documentation from each official website and selected documents matching the version
installed in our testing environment for download. We used HTTrack 7 , a free and easy-to-use offline
browser utility, to download the HTML files to a local directory, building all directories recursively.
7 https://www.httrack.com/
Table 6: Summary of all applications in Spider2-V (label ♡ means a real account is needed).
Category: Data Warehousing
  BigQuery♡ — Fully-managed enterprise data warehouse service offered by Google Cloud Platform (GCP). It enables rapid processing and analysis of large datasets using SQL-like queries.
  Snowflake♡ — Cloud-based data warehousing and analytics platform for large-scale data storage and processing, providing services to load, store, query, and analyze datasets at scale.
  MySQL — High-performance and scalable relational database management system (RDBMS) that is widely used and suited for fast data retrieval.
We also retained the directory structure of each website, as we believe the path of each document
can, to some extent, represent the document’s purpose. For example, the HTML files under the path
“docs.getdbt.com/docs/deploy” are about deploying dbt in production or staging environments. This
crawling step resulted in a total of 21,239 HTML files.
We further filtered the crawled HTML pages based on two criteria: irrelevant content to software
usage and pages containing invalid content. For the former, we mainly judged whether the page
contained content related to software usage based on its path and manually confirmed it. For example,
pages under "author" on the website often relate to the website developer or development team
rather than software usage. Additionally, we removed category-type pages that only contained
navigation information. Furthermore, we filtered out pages based on the number of tokens obtained
by whitespace tokenization. We mainly removed pages with token counts less than 100, as we found
that these pages predominantly contained invalid information such as access failures, invalid links,
or webpage redirections. For example, the official website of Dagster contained numerous links to
unreleased versions of documents, all of which resulted in access failures. Therefore, after removal,
the number of valid pages corresponding to Dagster decreased from 10,065 to 332. Finally, we
obtained 11,231 filtered HTML files (see Table 8).
HTML files contain a significant amount of content unrelated to the actual content of the webpage,
such as “<script>” and “<style>” tags, tag attributes, and developer comments. These parts may
improve the aesthetics of the page but are irrelevant to the document-level information. Additionally,
they often occupy a large portion of the HTML file, making it excessively long for LLMs to take as input.
To perform Retrieval-Augmented Generation (RAG) more efficiently and to help models better
understand software documentation, we preprocessed these HTML files in three formats: plain text,
HTML, and Markdown. These three formats of data and the original HTML files will be released to
facilitate future research. The token statistics of all data formats are shown in Table 9. We describe
the preprocessing details below:
Plain Text: We used BeautifulSoup4 8 to extract the textual elements from the HTML DOM 9
tree and connected these elements using “\n”. This method allows us to obtain the HTML content
in the simplest manner, but losing the structural information of the HTML may affect the model’s
understanding of the webpage content.
Simplified HTML: We remove all sub-trees of the HTML DOM that do not contain textual
elements. We also filter out all headers, footers, copyrights, forms, and iFrames, and remove
all HTML tag attributes, since they mostly do not contain actual content or semantic information.
Additionally, when a node in the HTML DOM tree has only one child node, we remove that node
and directly connect its child to its parent. This effectively simplifies the structure and
depth of the HTML. The simplified HTML preserves both the structure and the content of
the original HTML with fewer tokens.
Markdown: We further used the markdownify 10 tool to convert the simplified HTML into Markdown
format. Markdown uses fewer tokens to represent structural information than HTML, striking a
good balance between the HTML and plain text formats. Moreover, since the plain text format includes
a substantial number of newline characters used to concatenate text elements, whereas parts of the
content in Markdown files are concatenated directly without these newlines, Markdown files end up
with a smaller average number of tokens than the plain text format.
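A condensed sketch of these three conversions is given below. It only drops a few obviously non-content tags; the full pipeline additionally applies the DOM simplification and filtering heuristics described above.

from bs4 import BeautifulSoup
from markdownify import markdownify as md

def convert_page(raw_html: str) -> dict:
    """Condensed sketch of producing the three documentation formats."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "header", "footer", "form", "iframe"]):
        tag.decompose()                              # drop non-content sub-trees
    plain_text = soup.get_text("\n", strip=True)     # a) plain text joined by "\n"
    simplified_html = str(soup)                      # b) (partially) simplified HTML
    markdown = md(simplified_html)                   # c) Markdown via markdownify
    return {"text": plain_text, "html": simplified_html, "markdown": markdown}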
Concrete examples of these three formats are detailed in the task prompts (see App. G.3). In our
pilot experiments (see Table 7), we compare the performance of different formats of retrieved
documents on a subset (130 task samples) of Spider2-V, and the pure text format outperforms the others.
8 https://beautiful-soup-4.readthedocs.io/en/latest/
9 The Document Object Model (DOM) is an interface that represents an HTML document as a tree structure, where each node is an object corresponding to a part of the document.
10 https://github.com/matthewwithanm/python-markdownify
Table 7: Performances with different formats of retrieved documents on a subset of Spider2-V.
RAG Format Success Rate (%)
Pure Text 16.92
Markdown Syntax 15.38
Simplified HTML 15.38
Table 8: Summary of software documentation. OrigPageNum: The number of all web pages we
crawled from the documentation website. FilteredPageNum: The number of web pages obtained after
filtering out irrelevant or invalid pages.
Software Documentation Website OrigPageNum FilteredPageNum
D.1 Overview
Spider2-V formalizes the interaction with an Ubuntu desktop as a partially observable Markov decision
process (POMDP) (S, O, A, T, R) with state space S, observation space O, action space A, state
transition function T : S × A → S, and reward function R : S × A → R. Given the current
observation o_t ∈ O from the desktop, the agent needs to predict the action a_{t+1} ∈ A for the next step.
An admissible action incurs a change in the latent state s_{t+1} ∈ S and yields the environment feedback
o_{t+1}. The interaction loop repeats until a special “DONE” or “FAIL” action is issued, at which point the task
episode ends and a reward r = R(s_T) ∈ {0, 1} is computed, with 1 indicating task success.
The executable computer environment (an Ubuntu operating system) is built upon virtual machines (VMs). By using the “snapshot” functionality of the VM, the localhost environment state can be
Table 9: Average number of page tokens of different documentation formats. We used TikToken, a
fast BPE tokenizer for use with OpenAI’s models, to calculate the token count for gpt-3.5-turbo.
Software OrigHTML PlainText SimpHTML Markdown
dbt/dbt-cloud 17954 1669 2963 1510
Dagster 131777 2615 4704 2290
Airflow 35011 2007 3885 1829
Airbyte 30124 2448 4328 2329
Superset 10798 1398 2389 1415
Metabase 33523 2288 4690 2333
Snowflake 105155 1750 3342 1595
Bigquery 103748 6245 11777 5718
Jupyter 224153 11240 19917 6743
Total 109119 4273 7789 3212
[Figure 9 graphic: the agent interacts with a controller that resets the environment from task metadata, grounds each predicted action into the VM (with a one-step delay), and finally evaluates the resulting state; the action spaces include (1) pyautogui code.]
Figure 9: Overview of the executable environment of Spider2-V and two types of action space.
completely recovered to a stored historical state. This snapshot, together with the task-specific setup
functions (see § 2.2), serves as the initial state s_0 ∈ S for each task. A core controller is responsible
for grounding the action a_t (see App. D.2) in the VM desktop and obtaining the observation o_t (see
App. D.3) from the resulting state of the VM. After the agent issues a special action “DONE” or “FAIL”,
the controller invokes the customized evaluation function for the current task (see § 2.3) and
reports the metric score. The entire procedure is shown in Figure 9(a).
For generic actions that support both CLI and GUI, we introduce two different action spaces:
pyautogui code This action space accepts arbitrary executable Python code. In particular, code
snippets that use the Python library “pyautogui” to control the mouse and keyboard are strongly
recommended. Generally, mouse-based actions (e.g., click and scroll) directly manipulate the GUI
screen, while keyboard-based actions (e.g., typewrite and hotkey) interact with CLIs such as the
bash terminal and code editors (e.g., Visual Studio Code).
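Typical actions in this space look like the snippets below; the coordinates, text, and shortcut are placeholders rather than actions from a real trajectory.

import pyautogui

pyautogui.click(x=560, y=200)                   # GUI: click a button at grounded coordinates
pyautogui.scroll(-5)                            # GUI: scroll down in the focused window
pyautogui.typewrite("ls -lh\n", interval=0.05)  # CLI: type a shell command and run it
pyautogui.hotkey("ctrl", "s")                   # keyboard shortcut, e.g., save the current file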
JSON dict Inspired by the “pyautogui” library, we summarize 7 actions to simplify the action
space. This small set can cover all CLI and GUI actions needed on the desktop. Each action
and its parameters are further encapsulated into a JSON dict to restrict the output format. The API
specification and use cases are formally described in the prompt messages (see App. G.1.2), and the
checklist of all 7 actions is presented in Figure 9(b).
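The dictionaries below merely illustrate the general shape of this action space; the exact 7 action names and parameters are those specified in App. G.1.2 and Figure 9(b), not the placeholder names used here.

# Illustrative only: action and parameter names are placeholders, not the
# exact 7-action schema defined in App. G.1.2 / Figure 9(b).
click_action = {"action_type": "CLICK", "x": 560, "y": 200}
type_action  = {"action_type": "TYPE", "text": "ls -lh\n"}
done_action  = {"action_type": "DONE"}   # special action marking task completion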
Figure 10: Two observation types: screenshot and accessibility tree (a11ytree).
With respect to observations, there are two widely used alternatives (see Figure 10): 1) an image-style
screenshot of the entire desktop, and 2) a text-format accessibility tree (a11ytree). The accessibility
tree, obtained from the Assistive Technology Service Provider Interface (AT-SPI) library 11, is a
text-format abstraction of the entire computer desktop which describes the name, type, status (e.g., a
menu bar is “selectable”), position (e.g., in Figure 10 (2), the attributes “coord” and “size” together
define the bounding rectangle), and text content embedded in each element (e.g., windows, panels,
buttons, and input boxes). We extract the a11ytree using the Python library pyatspi and convert it into
XML format. It functions similarly to the DOM (Document Object Model) tree for websites.
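Reading the raw tree with pyatspi can be sketched as below; the attribute and constant names follow the pyatspi2 bindings as we understand them, and the benchmark additionally prunes the tree and serializes it to XML.

import pyatspi

def dump_a11ytree(max_depth: int = 3):
    """Rough sketch of traversing the AT-SPI accessibility tree."""
    def visit(node, depth=0):
        if depth > max_depth:
            return
        try:
            ext = node.queryComponent().getExtents(pyatspi.DESKTOP_COORDS)
            coord, size = (ext.x, ext.y), (ext.width, ext.height)
        except Exception:        # not every node exposes the Component interface
            coord, size = None, None
        print("  " * depth, node.getRoleName(), repr(node.name), coord, size)
        for child in node:       # accessible objects are iterable over their children
            visit(child, depth + 1)
    for app in pyatspi.Registry.getDesktop(0):
        visit(app)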
Figure 11: Screenshot with bounding boxes.
Figure 12: Converted table of a11ytree.
Figure 13: Illustration of the aligned observation type set-of-mark (SoM).
Set-of-Mark (SoM) The original text-style accessibility tree (a11ytree) and the image-style screen-
shot are not aligned with each other. To compensate for this deficiency, we follow OSWorld [34] and
WebArena [45] in drawing bounding boxes around elements of interest in the screenshot and labeling these
elements with numeric indexes. The exact coordinates of these bounding boxes are extracted from
the a11ytree. Furthermore, we re-organize the a11ytree into a table (each leaf node of the a11ytree
is converted into one row) and insert an additional attribute/column “index” for each node in the tree. The
value of the “index” attribute is exactly the numeric label of the corresponding element in the screenshot.
The aligned screenshot and a11ytree (a.k.a. set-of-mark, SoM [36]) are illustrated in Figure 13.
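A minimal sketch of rendering such marks is given below, using Pillow to draw boxes and numeric labels on the screenshot; the element list, file names, and styling are hypothetical choices rather than the actual implementation.

# Sketch: draw set-of-mark (SoM) annotations on a screenshot (element data is hypothetical).
from PIL import Image, ImageDraw

elements = [  # (index, x, y, width, height) taken from the a11ytree
    (1, 0, 33, 70, 64),
    (2, 70, 74, 1430, 832),
]

screenshot = Image.open("screenshot.png").convert("RGB")
draw = ImageDraw.Draw(screenshot)
for index, x, y, w, h in elements:
    draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=2)
    draw.text((x + 3, y + 3), str(index), fill=(255, 0, 0))
screenshot.save("screenshot_som.png")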
11 https://docs.gtk.org/atspi2/
Examples of Execution Feedback Messages
Here are failed actions with their error messages in your last response:
# Action 1
import pyautogui
index_34 = (23, 43)
pyautogui.click(index_343)
# Execution error:
Traceback (most recent call last):
NameError: name 'index_343' is not defined
# Action 2
import pyautogui
import time
pyautogui.write('USE DATABASE IMDB\n\\n')
# Execution error:
File "<string>" line 3
pyautogui.write('USE DATABASE IMDB
^
SyntaxError: unterminated string literal
Execution Feedback We also incorporate another type of information into the observation, namely
the execution feedback of actions (see the messages above). We notice that some predicted actions may
be parsed erroneously or fail to execute. In such cases, the two observation types mentioned before
do not change at all, and the agent repeatedly attempts the same incorrect action. To inform
the agent of execution errors, we include this execution feedback as a third observation type.
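One simple way to produce such feedback is to execute each predicted snippet and capture any exception; the sketch below is illustrative and not the controller's exact implementation.

# Sketch: execute predicted pyautogui code and collect error feedback (illustrative only).
import traceback

def run_with_feedback(code_snippets: list[str]) -> list[str]:
    feedback = []
    for i, snippet in enumerate(code_snippets, start=1):
        try:
            exec(snippet, {})                     # each action is an independent code string
        except Exception:
            feedback.append(f"# Action {i}\n{snippet}\n# Execution error:\n"
                            + traceback.format_exc())
    return feedback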
Figure 14: The format of a simple task example (.json configuration file).
F Task Examples
In this part, we present diverse examples in Spider2-V.
Table 10 – continued from previous page (columns: Related App(s), Instruction, Screenshot After Initialization)
Related App(s): Metabase, Postgresql, Chromium
Instruction: Help me finish the metabase login setup with information shown in setup.json.
Screenshot After Initialization: (screenshot omitted)
G Prompts for Multi-modal Agents
The multimodal agent baseline involves complex prompt engineering. The following sections
introduce the system prompt, the task prompt, and the retrieved-context-augmented task prompt.
G.1 System Prompt
The entire system prompt consists of the environment prompt, the observation space prompt, the action space
prompt, and general tips. Different action/observation types have different prompts. In this section,
we introduce each one in turn and present the overall system prompt at the end.
G.1.1 Observation Space Prompt
Accessibility Tree Setting
After each action step, you will get a text-style observation, which
,→ is extracted and pruned from the accessibility tree based on
,→ AT-SPI library. The accessibility tree describes the elements
,→ (e.g., panels, icons, buttons, frames, windows, applications) on
,→ the computer desktop, as well as its embedded text content,
,→ status and positions. For simplicity, we prune the original tree
,→ and only extract useful information into a tabular format for you.
,→ Here is a quick glance on the observation:
TAG, NAME, POSITION (top-left x & y), SIZE (width & height), TEXT
menu, Visual Studio Code, (99, 0), (184, 27), ''
push-button, Chromium Web Browser, (0, 33), (70, 64), ''
terminal, Terminal, (70, 74), (1430, 832), '(base)
,→ user@ubuntu:~/projects/$'
And you will predict the next action of the computer based on the
,→ accessibility tree.
Screenshot + Accessibility Tree Setting
### Screenshot
After each action step, you will get an image-style observation, which
,→ is the screenshot of the computer screen. And you need to predict
,→ the next action on the computer based on this image. You can use
,→ this image to locate the elements on the screen or check the
,→ status of the computer, especially whether the previous action is
,→ successful or not.
TAG, NAME, POSITION (top-left x & y), SIZE (width & height), TEXT
menu, Visual Studio Code, (99, 0), (184, 27), ''
push-button, Chromium Web Browser, (0, 33), (70, 64), ''
terminal, Terminal, (70, 74), (1430, 832), '(base)
,→ user@ubuntu:~/projects/$'
SoM Setting
After each action step, you will get an image-style observation, which
,→ is the screenshot of the computer screen. For ease of locating
,→ positions of elements, we extend the original screenshot with
,→ index marks. That is, some salient elements which can be
,→ interacted with (e.g., a button or editable input box) are marked
,→ with line boundaries and numeric indexes. You can use this image
,→ to locate the elements on the screen or check the status of the
,→ computer, especially whether the previous action is successful or
,→ not.
, where `INDEX` indicates exactly the numeric label for each element
,→ marked in the screenshot. You can use this alignment information
,→ to simplify your predicted action. For example, you can use
,→ `pyautogui.click(index_2)` to represent clicking the CENTER of
,→ the element with index 2 on the screenshot. We will automatically
,→ perform the position calculation and substitution for you. `TAG`
,→ / `NAME` is the element type / name respectively. `POSITION` and
,→ `SIZE` together describe the square position of this element on
,→ the computer screen. For example, if you want to click one button,
,→ you can click any point in the square area defined by `POSITION`
,→ and `SIZE`. Assume that the position of this button is (100, 200),
,→ and the size is (40, 40), the CENTER of this button is (120, 220),
,→ which is calculated by x = 100 + 40 / 2 = 120, y = 200 + 40 / 2 =
,→ 220. `TEXT` refers to the text content embedded in the element,
,→ e.g., the bash terminal output or texts in an editable input box.
You can use the accessibility tree to accurately locate positions of
,→ useful elements on the screen and check the concrete textual
,→ contents of elements.
By combining the screenshot and accessibility tree, you should be
,→ intelligent to predict the next feasible and meaningful action.
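To illustrate the index substitution described in this prompt, a controller could rewrite index_N placeholders into concrete center coordinates before executing the predicted code; the sketch below is an assumption about how this could be done, not the exact mechanism used in Spider2-V.

# Sketch: replace index_N placeholders with element centers before execution (illustrative).
import re

def substitute_indexes(code: str, elements: dict[int, tuple[int, int, int, int]]) -> str:
    # elements maps index -> (x, y, width, height) taken from the SoM table
    def center(match: re.Match) -> str:
        x, y, w, h = elements[int(match.group(1))]
        return f"({x + w // 2}, {y + h // 2})"
    return re.sub(r"index_(\d+)", center, code)

print(substitute_indexes("pyautogui.click(index_2)", {2: (100, 200, 40, 40)}))
# -> pyautogui.click((120, 220))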
G.1.2 Action Space Prompt
For the action space prompt, we provide two choices: 1) pyautogui code, and 2) JSON dict.
pyautogui Code
JSON Dict (truncated)
Firstly, we use json dict to describe the types and parameters for
,→ each action we allowed (`required=true` means this argument must
,→ be provided). Then, we demonstrate use cases, and precautions.
ACTION_LIST = [
{
"action_type": "MOVE_TO",
"note": "move the cursor to a specified position (x, y)",
"parameters": {
"x": {
"type": float,
"range": [0, MAX_SCREEN_WIDTH],
"required": true,
},
"y": {
"type": float,
"range": [0, MAX_SCREEN_HEIGHT],
"required": true,
}
}
},
... more action dicts ...
]
### Precautions
1) The output action MUST BE CHOSEN and CAN ONLY BE CHOSEN from the
,→ action space (json dict) defined above, otherwise your action
,→ will be considered as invalid and you will get a penalty. For
,→ example, bash, sql, or python code WILL NOT be executed;
2) For each action dict, STRICTLY OBEY THE FORMAT, which must contain
,→ the `action_type` field and required parameters. Optional
,→ parameters will be set to default values if not provided. NEVER
,→ RETURN ME ANYTHING ELSE WHICH IS NOT DEFINED;
3) For efficiency, you CAN predict multiple actions in one response,
,→ but REMEMBER TO WRAP EACH ACTION DICT SEPARATELY using backticks
,→ ```json and ```.
G.1.3 Overall System Prompt
Here are the details of the action spaces (including usage and
,→ precautions) and observation spaces:
{{action_prompt}}
{{observation_prompt}}
Besides, here are some important tips for you to better complete the
,→ task:
1. My computer's password is 'password', feel free to use it when you
,→ need sudo rights.
2. The screen size for the running desktop is: ({screen_width},
,→ {screen_height}).
3. Some action may need time to reflect in the environment (e.g.,
,→ code execution and web page loading), please be patient and refer
,→ to the WAIT action.
4. Try to complete the task in as few steps as possible, we are on a
,→ tight budget.
5. Try to use the applications we opened for you as much as possible, e.g.,
,→ use the opened gnome-terminal instead of the embedded one in
,→ Visual Studio Code.
6. For critical actions (e.g., opening an application or clicking a
,→ button), ensure the action succeeds before predicting or
,→ proceeding to the next one. That is, DO NOT be greedy to predict
,→ all actions all at once in one response without confirming the
,→ observation of a significant action.
7. When you try to write codes or texts, please ensure you have
,→ focused on the right window or input panel. If the input panel
,→ already has some texts, be careful that you may need to clear or
,→ select them before overwriting.
8. DO NOT be stubborn to complete the task in one step. You can break
,→ down the task into several steps and complete them one by one.
9. DO NOT be stupid to repeat the same actions without any progress.
,→ If you find that the action is not effective in the observation,
,→ try another one.
10. RETURN ME ONLY THE ACTION DEFINED IN ACTION SPACES. NEVER EVER
,→ RETURN ME ANYTHING ELSE. THIS IS CRITICAL!!!
G.2 Task Prompt
The task instruction for Spider2-V has two forms. The abstract instruction describes the overall
goal of a task without a step-by-step solution, thus testing both planning and grounding abilities.
The verbose instruction provides a detailed tutorial-like solution to the task, primarily validating the
grounding ability.
You can exactly follow the detailed plan above or proactively tackle
,→ the task based on the real-time environment interaction by
,→ yourself.
G.3 Example of Retrieved Context Augmented Task Prompt
We also introduce a retrieval-augmented generation (RAG) setting, in which we collect and clean the official
documentation of the professional tools to serve as the retrieval corpus. We select the top-k
chunks (each chunk is a token sequence of at most 512 tokens; k depends on the input-length constraint)
and insert them into the prompt input. Below are three demonstrations of the retrieved context in different formats.
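A minimal sketch of this chunk-and-retrieve step is shown below (the first demonstration that follows uses the plain-text format); the TF-IDF scoring is an illustrative choice and not necessarily the retriever used in Spider2-V.

# Sketch: chunk documentation into <=512-token pieces and retrieve top-k chunks for a task
# (TF-IDF similarity is an illustrative choice, not necessarily the paper's retriever).
import tiktoken
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def chunk(text: str, max_tokens: int = 512) -> list[str]:
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def retrieve(instruction: str, docs: list[str], k: int = 3) -> list[str]:
    chunks = [c for d in docs for c in chunk(d)]
    tfidf = TfidfVectorizer().fit(chunks + [instruction])
    scores = cosine_similarity(tfidf.transform([instruction]), tfidf.transform(chunks))[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]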
We also retrieve relevant documentation from the web to help you with
,→ the task:
Documentation Source:
release-1-7-2.dagster.dagster-docs.io/integrations/dagstermill/using-
,→ notebooks-with-dagster.html
Documentation Title:
Using Jupyter notebooks with Papermill and Dagster Tutorial
Documentation Content:
The page will display the notebook asset in the Asset Graph.
If you click the notebook asset, a sidebar containing info about the
,→ asset will slide out from the right side of the page. In the
,→ Description
section of the panel is a View Source Notebook button:
This button allows you to view the notebook directly in the UI. When
,→ clicked, Dagster will render the notebook - referenced in the
notebook_path parameter - that'll be executed when the
,→ iris_kmeans_jupyter asset is materialized:
Click the Materialize button. To view the execution as it happens,
,→ click the View button in the alert that displays.
After the run completes successfully, you can view the executed
,→ notebook in the UI. Click the asset again and locate the View
,→ Notebook button in the Materialization in Last Run section of the
,→ sidebar:
Click the button to display the executed notebook - specifically, the
,→ notebook that was executed and written to a persistent location:
Step 5: Add an upstream asset #
While our iris-kmeans notebook asset now materializes successfully,
,→ there are still some improvements we can make. The beginning of
,→ the notebook fetches the Iris dataset, which means that every
,→ time the notebook is materialized, the data is re-fetched.
To address this, we can factor the Iris dataset into its own asset.
,→ This will allow us to:
Use the asset as input to additional notebooks.
This means all notebooks analyzing the Iris dataset will use the same
,→ source data, which we only have to fetch once.
Materialize notebooks without fetching data for each materialization.
Instead of making potentially expensive API calls, Dagster can fetch
,→ the data from the previous materialization of the Iris dataset
,→ and provide that data as input to the notebook.
Markdown Syntax Format
We also retrieve relevant documentation from the web to help you with
,→ the task:
Documentation Source:
release-1-7-2.dagster.dagster-docs.io/integrations/dagstermill/using-
notebooks-with-dagster.md
Documentation Title:
Using Jupyter notebooks with Papermill and Dagster Tutorial
Documentation Content:
When clicked, Dagster will render the notebook - referenced in the
,→ `notebook_path`parameter - that'll be executed when the
,→ `iris_kmeans_jupyter`asset is materialized:
After the run completes successfully, you can view the executed
,→ notebook in the UI. Click the asset again and locate the **View
,→ Notebook**button in the **Materialization in Last Run**section of
,→ the sidebar:
To address this, we can factor the Iris dataset into its own asset.
,→ This will allow us to:
Create the Iris dataset assetProvide the Iris dataset as input to the
,→ notebookModify the notebook
Simplified HTML Format
We also retrieve relevant documentation from the web to help you with
,→ the task:
Documentation Source:
release-1-7-2.dagster.dagster-docs.io/integrations/dagstermill/using-
notebooks-with-dagster.html
Documentation Title:
Using Jupyter notebooks with Papermill and Dagster Tutorial
Documentation Content:
If you execute these cells, several plots of the Iris dataset will be
,→ created:
<p>Next, we conduct our K-means analysis:</p>
<code>estimator
,→ <span>=</span>sklearn<span>.</span>cluster<span>.</span>KMeans
<span>(</span>n_clusters<span>=</span><span>3</span><span>)</span>
estimator<span>.</span>fit<span>(</span>iris<span>[</span>
<span>[</span><span>"Sepal length (cm)"</span><span>,</span>
<span>"Sepal width (cm)"</span><span>,</span>
<span>"Petal length (cm)"</span><span>,</span>
<span>"Petal width (cm)"</span>
<span>]</span><span>]</span><span>)</span>
</code>
<p>Lastly, we plot the results of the K-means analysis. From the
,→ plots, we can see that one species of Iris is separable from the
,→ other two, but a more sophisticated model will be required to
,→ distinguish the other two species:</p>
<p>Like many notebooks, this example does some fairly sophisticated
,→ work, including producing diagnostic plots and a statistical
,→ model. For now, this work is locked away in the
,→ <code>.ipynb</code>format, only reproducible using a complex
,→ Jupyter setup, and only programmatically accessible within the
,→ notebook context. We'll address this in the remainder of the
,→ tutorial.</p>
<h2>Step 2: Create a Dagster asset from the Jupyter
,→ Notebook<span>#</span></h2>
<p>By creating a Dagster asset from our notebook, we can integrate
,→ the notebook as part of our data platform. This enables us to
,→ make its contents more accessible to developers, stakeholders,
,→ and other assets in Dagster.</p>
<p>To create a Dagster asset from a Jupyter notebook, we can use the
,→ <code>define_dagstermill_asset</code>function.