0% found this document useful (0 votes)

77 views39 pages

OpenDevin Project

Uploaded by

Chris Forbes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

77 views39 pages

OpenDevin Project

Uploaded by

Chris Forbes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

OpenDevin: An Open Platform for

AI Software Developers as Generalist Agents

Xingyao Wang1,10 , Boxuan Li2 , Yufan Song2 , Frank F. Xu2 , Xiangru Tang3 ,
Mingchen Zhuge6 , Jiayi Pan4 , Yueqi Song2 , Bowen Li, Jaskirat Singh7 ,
Hoang H. Tran8 , Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian3 , Yanjun Shao3 ,
Niklas Muennighoff 5 , Yizhe Zhang, Binyuan Hui9 , Junyang Lin9 ,
arXiv:2407.16741v1 [cs.SE] 23 Jul 2024

Robert Brennan10 , Hao Peng1 , Heng Ji1 , Graham Neubig2,10

1
UIUC 2 CMU 3 Yale 4 UC Berkeley 5 Contextual AI 6 KAUST 7 ANU
8
HCMUT 9 Alibaba 10 All Hands AI
[email protected], [email protected]

Abstract
Software is one of the most powerful tools that we humans have at our disposal;
it allows a skilled programmer to interact with the world in complex and pro-
found ways. At the same time, thanks to improvements in large language models
(LLMs), there has also been a rapid development in AI agents that interact with
and affect change in their surrounding environments. In this paper, we introduce
OpenDevin, a platform for the development of powerful and flexible AI agents
that interact with the world in similar ways to those of a human developer: by
writing code, interacting with a command line, and browsing the web. We describe
how the platform allows for the implementation of new agents, safe interaction
with sandboxed environments for code execution, coordination between multiple
agents, and incorporation of evaluation benchmarks. Based on our currently in-
corporated benchmarks, we perform an evaluation of agents over 15 challenging
tasks, including software engineering (e.g., S WE - BENCH) and web browsing (e.g.,
W EBA RENA), among others. Released under the permissive MIT license, Open-
Devin is a community project spanning academia and industry with more than 1.3K
contributions from over 160 contributors and will improve going forward.
Code https://github.com/OpenDevin/OpenDevin
Benchmark https://hf.co/spaces/OpenDevin/evaluation
Slack https://bit.ly/OpenDevin-Slack

1 Introduction
Powered by large language models (LLMs; [5, 20, 40, 60]), user-facing AI systems (such as ChatGPT)
have become increasingly capable of performing complex tasks such as accurately responding to
user queries, solving math problems, and generating code. In particular, AI agents, systems that can
perceive and act upon the external environment, have recently received ever-increasing research focus.
They are moving towards performing complex tasks such as developing software [21], navigating
real-world websites [79], doing household chores [1], or even performing scientific research [4, 56].
As AI agents become capable of tackling complex problems, their development and evaluation have
also become challenging. There are numerous recent efforts in creating open-source frameworks
that facilitate the development of agents [7, 16, 67]. These agent frameworks generally include:
1) interfaces through which agents interact with the world (such as JSON-based function calls or
code execution), 2) environments in which agents operate, and 3) interaction mechanisms for

Preprint. Under review.

human-agent or agent-agent communication. These frameworks streamline and ease the development
process in various ways (Tab. 1, §C).
When designing AI agents, we can also consider how human interacts with the world. The most
powerful way in which humans currently interact with the world is through software – software
powers every aspect of our life, supporting everything from the logistics for basic needs to the
advancement of science, technology, and AI itself. Given the power of software, as well as the
existing tooling around its efficient development, use, and deployment, it provides the ideal interface
for AI agents to interact with the world in complex ways. However, building agents that can effectively
develop software comes with its own unique challenges. How can we enable agents to effectively
create and modify code in complex software systems? How can we provide them with tools to gather
information on-the-fly to debug problems or gather task-requisite information? How can we ensure
that development is safe and avoids negative side effects on the users’ systems?
In this paper, we introduce OpenDevin, a community-driven platform designed for the development
of generalist and specialist AI agents that interact with the world through software.1 It features:
(1) An interaction mechanism which allows user interfaces, agents, and environments to interact
through a event stream architecture that is powerful and flexible (§2.1).
(2) An environment that consists of a sandboxed operating system and a web browser that the
agents can utilize for their tasks (§2.2).
(3) An interface allowing the agent to interact with the environment in a manner similar to actual
software engineers (§2.3). We provide the capability for agents to (a) create complex software,
(b) execute the code, and (c) browse websites to collect information.
(4) Multi-agent delegation, allowing multiple specialized agents to work together (§2.4).
(5) Evaluation framework, facilitating the evaluation of agents across a wide range of tasks (§4).
Importantly, OpenDevin is not just a conceptual framework, but it also includes a comprehensive
and immediately usable implementation of agents, environments, and evaluations. As of this writing,
OpenDevin includes an agent hub with over 10 implemented agents (§3), including a strong generalist
agent implemented based on the CodeAct architecture [63], with additions for web browsing [52]
and code editing [72]. Interaction with users is implemented through a chat-based user interface that
visualizes the agent’s current actions and allows for real-time feedback (Fig. 1, §D). Furthermore, the
evaluation framework currently supports 15 benchmarks, which we use to evaluate our agents (§4).
Released under a permissive MIT license allowing commercial use, OpenDevin is poised to support
a diverse array of research and real-world applications across academia and industry. OpenDevin
has gained significant traction, with 28K GitHub stars and more than 1.3K contributions from over
160 contributors. We envision OpenDevin as a catalyst for future research innovations and diverse
applications driven by a broad community of practitioners.

2 OpenDevin Architecture

We describe, using OpenDevin, (1) how to define and implement an agent (§2.1), (2) how each action
execution leads to an observation (§2.2), (3) how to reliably manage and extend commonly used skills
for agents (§2.3), and (4) how to compose multiple agents together for task solving (§2.4). Fig. 3
provides an overview.

2.1 Agent Definition and Implementation

An agent can perceive the state of the environment (e.g., prior actions and observations) and produce
an action for execution while solving a user-specified task.
The State and Event Stream. In OpenDevin, the state is a data structure that encapsulates all relevant
information for the agent’s execution. A key component of this state is the event stream, which is a
chronological collection of past actions and observations, including the agent’s own actions and user
interactions (e.g., instructions, feedback). However, the state extends beyond just the event stream. It

1
While initially inspired by the AI software engineer Devin [9], OpenDevin has quickly evolved to support a
much wider range of applications beyond software engineering through diverse community contributions.

2
Figure 1: OpenDevin User Interface (UI, §D) allows users to view files, check executed bash
commands/Python code, observe the agent’s browser activity, and directly interact with the agent.

also incorporates auxiliary information for agent’s operation, such as the accumulative cost of LLM
calls, metadata to track multi-agent delegation (§2.4), and other execution-related parameters.
Actions. Inspired by CodeAct [63], Open- Figure 2: Minimal example of implementing an
Devin connects an agent with the environment agent in OpenDevin.
through a core set of general actions. Actions class MinimalAgent:
IPythonRunCellAction and CmdRunAction def reset(self) -> None:
self.system_message = "You are a helpful assistant ..."
enable the agent to execute arbitrary Python
def step(self, state: State):
code and bash commands inside the sandbox en- messages: list[dict[str, str]] = [
vironment (e.g., a securely isolated Linux oper- {'role': 'system', 'content': self.system_message}
]
ating system). BrowserInteractiveAction for prev_action, obs in state.history:
enables interaction with a web browser with action_message = get_action_message(prev_action)
messages.append(action_message)
a domain-specific language for browsing intro- obs_message = get_observation_message(obs)
duced by BrowserGym [12]. The action space messages.append(obs_message)

based on programming languages (PL) is power- # use llm to generate response (e.g., thought, action)
ful and flexible enough to perform any task with response = self.llm.do_completion(messages)

tools in different forms (e.g., Python function, # parse and execute action in the runtime
action = self.parse_response(response)
REST API, etc.) [63] while being reliable and if self.is_finish_command(action):
easy to maintain. return AgentFinishAction()
elif self.is_bash_command(action):
This design is also compatible with existing tool- return CmdRunAction(command=action.command)
elif self.is_python_code(action):
calling agents that require a list of pre-defined return IPythonRunCellAction(code=action.code)
elif self.is_browser_action(action):
tools [6]. That is, users can easily define tools return BrowseInteractiveAction(code=action.code)
using PL supported in OpenDevin primitive ac- else:
return MessageAction(content=action.message)
tions (e.g., write a Python function for calcula-
tor) and make those tools available to the agent through JSON-style function-calling experiences [49].
Moreover, the framework’s powerful PL-based primitives further make it possible for the agents to
create tools by themselves (e.g., by generating Python functions [75]) when directly relevant APIs
to complete the task are unavailable. Refer to §2.3 for how these core PL-based actions can be
composed into a diverse set of tools.
Observations. Observations describe environmental changes that the agent observes. It may or may
not be caused by the agent’s action: It can be either 1) natural language messages from the user

3
User Interface
Event Stream Agent Runtime
Command Line where each Action execution
where all Action-Observation(s) are tracked
We b UI leads to an Observation

ulti-turn
[6] CmdRunAction (source=agent)

Sandbox
M [1] Message (source=user)

IDE Plugins
User-Agent
Can you create a list of numbers from 1 to 10, and I have created a Python file `app.py` that will
[...] create a web page to display them at port 5000? display a list of numbers from 1 to 10 when you run
Interaction it. Let me run the server for you:

<execute_bash>

[2] IPythonRunCellAction (source=agent)

python3 app.py > server.log 2>&1 &

Interactive Python
Sure! Let me create a file first:

cat server.log
(IPython) Server
State (Event History)
<execute_ipython>

</execute_bash>
[Past Action(s) & Observation(s)] create_file('app.py')
</execute_ipython>

[7] CmdRunObservation

Agent
[3] IPythonRunCellObservation

[1] 124

[File: /workspace/app.py (1 lines total)]

* Serving Flask app 'app'

* Debug mode: off

where State is [File app.py created.]

WARNING: This is a development server. Do
use it in a production deployment. Use a
not
Bash Shell
turned into Action production WSGI server instead.

[4] IPythonRunCellAction (source=agent)

* Running on http://127.0.0.1:5000

I will write the Python code for starting a server.

Press CTRL+C to quit
The Event Stream connects to
Action <execute_ipython>

Computer Sandbox via SSH Protocol

EDITED_CODE=\"\"\"[Omitted for Brevity]\"\"\"
edit_file(start=1, end=1, content=EDITED_CODE) [8] BrowseInteractiveAction (source=agent)

</execute_ipython> The server is running on port 5000 with PID 124. Let
me browse to the page at http://127.0.0.1:5000

OpenDevin Abstraction <execute_browse>

[5] IPythonRunCellObservation

[... Omitted for Brevity ...]

goto("http://127.0.0.1:5000")

Event Stream: List[Action_1, 4|@app.route('/')

5|def index():

</execute_browse> Browser

Observation_1, Action_2, ...]

6| numbers = list(range(1, 11))

Playwright Chromium
Agent: State (History) -> Action
7| return str(numbers)
[9] BrowserOutputObservation

[... Omitted for Brevity ...] [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Runtime: Action -> Observation [File updated.]

Figure 3: OpenDevin consists of 3 main components: 1) Agent abstraction where community can
contribute different implementation of agents (§2.1) into an Agent Hub (§3); 2) Event stream for
tracking history of actions and observations; 3) Agent runtime to execute all agent actions into
observations (§2.2).

instructing agents to perform certain tasks, 2) the execution outcome of the agent’s previous action
(e.g., code execution result, an accessibility tree [35], a screenshot of the web page, etc.).
Implement a New Agent. The agent abstraction is designed to be simple yet powerful, allowing
users to create and customize agents for various tasks easily. The core of the agent abstraction lies in
the step function, which takes the current state as input and generates an appropriate action based
on the agent’s logic. Simplified example code for the agent abstraction is illustrated in Fig. 2. By
providing this abstraction, OpenDevin allows the users to focus on defining desired agent behavior
and logic without worrying about the low-level details of how actions are executed (§2.2).

2.2 Agent Runtime: How Execution of Actions Results in Observations

Agent Runtime provides a general environment that equips the agent with an action space comparable
to that of human software developers, enabling OpenDevin agents to tackle a wide range of software
development and web-based tasks, including complex software development workflows, data analysis
projects, web browsing tasks, and more. It allows the agent to access a bash terminal to run code and
command line tools, utilize a Jupyter notebook for writing and executing code on-the-fly, and interact
with a web browser for web-based tasks (e.g., information seeking).
Linux SSH Sandbox. For each task session, OpenDevin spins up a securely isolated docker container
sandbox, where all the bash commands from the agent are executed. OpenDevin connects to the
sandbox through SSH protocol, executes arbitrary commands from the agent, and returns the execution
results as observations to the agent. A configurable workspace directory containing files the user
wants the agent to work on is mounted into that secure sandbox for OpenDevin agents to access.
Jupyter IPython. The Linux sandbox also supports running an interactive Jupyter server, which can
be used by the agent for interactive python code execution [19] and debugging.
Web Browser. OpenDevin implements a Chromium browser based on Playwright [47]. It interfaces
with agents using a set of browser action primitives defined by BrowserGym [12, 52], such as
navigation, clicking, typing, scrolling. The full set of actions is detailed in §I. After executing these
actions, the browser runtime provides a rich set of observations about the current state of the browser,
including HTML, DOM, accessibility tree [35], screenshot, opened tabs, etc. These observations can
be also augmented with configurable attributes that could allow agents to better understand web page
observations, such as using a set-of-marks on screenshot [15, 71], visible element marking, focused
element, interactable element marking, in-viewport element filtering [79], etc.

4
Table 1: Comparison of different AI agent frameworks (§C). S WE refers to ‘software engineering’.
Standardized tool library: if framework contains reusable tools for different agent implementations
(§2.3); Built-in sandbox & code execution: if it supports sandboxed execution of arbitrary agent-
generated code; Built-in web browser: if it provides agents access to a fully functioning web
browser; Human-AI collaboration: if it enables multi-turn human-AI collaboration (e.g., human
can interrupt the agent during task execution and/or provide additional feedback and instructions);
AgentHub: if it hosts implementations of various agents (§3); Evaluation Framework: if it offers
systematic evaluation of implemented agents on challenging benchmarks (§4); Agent QC (Quality
Control): if the framework integrates tests (§E) to ensure overall framework software quality.
Graphic Standardized Built-in Sandbox Built-in Web Multi-agent Human-AI Evaluation Agent
Framework Domain AgentHub
User Interface Tool Library & Code Execution Browser Collaboration Collaboration Framework QC

AutoGPT [14] General " % % % % % " % "

LangChains [6] General % " %∗ %∗ % % " % %
MetaGPT [16] General % " % " " % " % "
AutoGen [67] General % " " " " " " " %
AutoAgents [7] General % % % % " % % % %
Agents [80] General % % % % " " % % %
Xagents [61] General " " % " " % " % %
OpenAgents [69] General " % " " % % " % %
GPTSwarm [83] General % " % % " " % % %

AutoCodeRover [78] SWE % % " % % % % % %

SWE-Agent [72] SWE % % " % % % % % %

OpenDevin General " " " " " " " " "
*
No native support. Third-party commercial options are available.

2.3 Agent Skills: The Extensible Agent-Computer Interface

SWE-Agent [72] highlights the importance of a carefully crafted Agent-Computer Interface (ACI,
i.e., specialized tools for particular tasks) in successfully solving complex tasks. However, creating,
maintaining, and distributing a wide array of tools can be a daunting engineering challenge, especially
when we want to make these tools available to different implementations of agents (§3). To tackle
these, we build an AgentSkills library, a toolbox designed to enhance the capabilities of agents,
offering utilities not readily available through basic bash commands or python code.
Easy to create and extend tools. AgentSkills is designed as a Python package consisting of
different utility functions (i.e., tools) that are automatically imported into the Jupyter IPython
environment (§2.2). The ease of defining a Python function as a tool lowers the barrier for community
members to contribute new tools to the skill library. The generality of Python packages also
allows different agent implementations to easily leverage these tools through one of our core action
IPythonRunCellAction (§2.1).
Rigorously tested and maintained. We follow best practices in software engineering and write
extensive unit tests for tools in AgentSkills to ensure their reliability and usability.
Inclusion criteria and philosophy. In the AgentSkills library, we do not aim to wrap every possible
Python package and re-teach agents their usage (e.g., LLM already knows pandas library that can
read CSV file, so we don’t need to re-create a tool that teaches the agent to read the same file format).
We only add a new skill when: (1) it is not readily achievable for LLM to write code directly (e.g.,
edit code and replace certain lines), and/or (2) it involves calling an external model (e.g., calling a
speech-to-text model, or model for code editing [51]).
Currently supported skills. AgentSkills library includes file editing utilities adapted from SWE-
Agent [72] like edit_file, which allows modifying an existing file from a specified line; scrolling
functions scroll_up and scroll_down for viewing a different part of files. It also contains tools
that support reading multi-modal documents, like parse_image and parse_pdf for extracting
information from images using vision-language models (e.g., GPT-4V) and reading text from PDFs,
respectively. A complete list of supported skills can be found in §H.

2.4 Agent Delegation: Cooperative Multi-agent Interaction

OpenDevin allows interactions between multiple agents as well. To this end, we use a special
action type AgentDelegateAction, which enables an agent to delegate a specific subtask to another
agent. For example, the generalist CodeActAgent, with limited support for web-browsing, can
use AgentDelegateAction to delegate web browsing tasks to the specialized BrowsingAgent to
perform more complex browsing activity (e.g., navigate the web, click buttons, submit forms, etc.).

5
3 AgentHub: A Hub of Community-Contributed Agents

Based on our agent abstraction (§2.1), OpenDevin supports a wide range of community-contributed
agent implementations for end users to choose from and act as baselines for different agent tasks.
CodeAct Agent. CodeActAgent is the default generalist agent based on the CodeAct framework
[63]. At each step, the agent can (1) converse to communicate with humans in natural language to ask
for clarification, confirmation, etc., or (2) to perform the task by executing code (a.k.a., CodeAct),
including executing bash commands, Python code, or browser-specific programming language (§2.2).
This general action space allows the agent (v1.5 and above) to perform various tasks, including
editing files, browsing the web, running programs, etc.
Browsing Agent. We implemented a generalist web agent called Browsing Agent, to serve as a
simple yet effective baseline for web agent tasks. The agent is similar to that in WebArena [79],
but with improved observations and actions, with only zero-shot prompting. At each step, the agent
prompts the LLM with the task description, browsing action space description, current observation
of the browser using accessibility tree, previous actions, and an action prediction example with
chain-of-thought reasoning. The expected response from the LLM will contain chain-of-thought
reasoning plus the predicted next actions, including the option to finish the task and convey the result
to the user. Full prompts are in §J. It can be extended to create more capable web agents, or called by
other agents through delegation (§2.4) to enable browsing capability.
GPTSwarm Agent. GPTSwarm [83] pioneers the use of optimizable graphs to construct agent
systems, unifying language agent frameworks through modularity. Each node represents a distinct
operation, while edges define collaboration and communication pathways. This design allows
automatic optimization of nodes and edges, driving advancements in creating multi-agent systems.
Micro Agent(s). In addition, OpenDevin enables the creation of micro agent, an agent specialized
towards a particular task. A micro agent re-uses most implementations from an existing generalist
agent (e.g., CodeAct Agent). It is designed to lower the barrier to agent development, where
community members can share specialized prompts that work well for their particular use cases.
Without programming, a user can create a micro agent by providing the agent’s name, description, the
schema for its inputs and outputs, and optionally a specialized prompt (e.g., example demonstrations
showing how to perform a particular task) that gear the generalist agent toward specific tasks, for
example, the CommitWriterAgent for generating git commit messages, and the TypoFixerAgent
for correcting typos across the entire code repository.

4 Evaluation

To systematically track progress in Table 2: Evaluation benchmarks in OpenDevin.

building generalist digital agents, as Category Benchmark Required Capability
listed in Tab. 2, we integrate 15 es- SWE-Bench [21] Fixing Github issues
tablished benchmarks into OpenDevin. HumanEvalFix [37] Fixing Bugs
BIRD [27] Text-to-SQL
These benchmarks cover software en- Software BioCoder [58] Bioinformatics coding
gineering, web browsing, and miscel- ML-Bench [57] Machine learning coding
Gorilla APIBench [46] Software API calling
laneous assistance. In this section, we ToolQA [81] Tool use
compare OpenDevin to open-source re- Web
WebArena [79] Goal planning & realistic browsing
MiniWoB++ [30] Short trajectory on synthetic web
producible baselines that do not perform
GAIA [34] Tool-use, browsing, multi-modality
manual prompt engineering specifically GPQA [50] Graduate-level Google-proof Q&A
based on the benchmark content. Please Misc. Assistance
AgentBench [31]
MINT [64]
Operating system interaction (bash)
Multi-turn math and code problems
note that we use ‘OD’ as shorthand for Entity Deduction Arena [77] State tracking & strategic planning
ProofWriter [55] Deductive Logic Reasoning
OpenDevin for the rest of this section
for brevity reasons.

4.1 Result Overview

In OpenDevin, our goal is to develop general digital agents capable of interacting with the world
through software interfaces (as exemplified by the code actions described in §2.1). We recognize that
a software agent should excel not only in code editing but also in web browsing and various auxiliary
tasks, such as answering questions about code repositories or conducting online research.

6
Table 3: Selected evaluation results for OpenDevin agents (§4). See Tab. 4 (software), Tab. 5 (web),
Tab. 6 (miscellaneous assistance) for full results across benchmarks.
Software (§4.2) Web (§4.3) Misc. (§4.4)
Agent Model SWE-Bench Lite WebArena GPQA GAIA
Software Engineering Agents
SWE-Agent [72] gpt-4-1106-preview 18.0 − − −
AutoCodeRover [78] gpt-4-0125-preview 19.0 − − −
Aider [13] gpt-4o & claude-3-opus 26.3 − − −
Moatless Tools [85] claude-3.5-sonnet 26.7 − − −
Agentless [68] gpt-4o 27.3 − − −
Web Browsing Agents
Lemur [70] Lemur-chat-70b − 5.3 − −
Patel et al. [45] Trained 72B w/ synthetic data − 9.4 − −
AutoWebGLM [24] Trained 7B w/ human/agent annotation − 18.2 − −
Auto Eval & Refine [42] GPT-4 + Reflexion w/ GPT-4V − 20.2 − −
WebArena Agent [79] gpt-4-turbo − 14.4 − −
Misc. Assistance Agents
AutoGPT [14] gpt-4-turbo − − − 13.2
Llama-2-70b-chat − − 28.1 −
Few-shot Prompting +
gpt-3.5-turbo-16k − − 29.6 −
Chain-of-Thought [50]
gpt-4 − 38.8 −
OpenDevin Agents
gpt-4o-mini-2024-07-18 6.3 8.3 − −
CodeActAgent v1.8 gpt-4o-2024-05-13 22.0 14.5 ∗
53.1 −
claude-3-5-sonnet 26.0 15.3 52.0 −
GPTSwarm v1.0 gpt-4o-2024-05-13 − − − 32.1
*
Numbers are reported from CodeActAgent v1.5.

Tab. 3 showcases a curated set of evaluation results. While OpenDevin agents may not achieve top
performance in every category, they are designed with generality in mind. Notably, the same CodeAct
agent, without any modifications to its system prompt, demonstrates competitive performance across
three major task categories: software development, web interaction, and miscellaneous tasks. This
is particularly significant when compared to the baseline agents, which are typically designed and
optimized for specific task categories.

4.2 Software Engineering

Next, we report results specifically for software engineering benchmarks in Tab. 4.

4.2.1 SWE-Bench
SWE-bench [21] is designed to assess agents’ abilities in solving real-world GitHub issues, such
as bug reports or feature requests. The agent interacts with the repository and attempts to fix the
issue provided through file editing and code execution. The agent-modified code repository is tested
against a test suite incorporating new tests added from human developers’ fixes for the same issue.
Each test instance accompanies a piece of “hint text” that consists of natural language suggestions
for how to solve the problem. Throughout this paper, we report all results without using hint text. A
canonical subset, SWE-bench Lite, is created to facilitate accessible and efficient testing. We default
to use this subset for testing for cost-saving consideration.2
Result. As shown in Tab. 4, our most recent version of CodeActAgent v1.8 generalist, us-
ing claude-3.5-sonnet, achieves a competitive resolve rate of 26% compared to other open-
source agents specialized for software development. We also evaluated CodeActAgent v1.8 using
gpt-4o-mini. While it only solves 6.3% of the problems, it does so at less than 1% of the cost
compared to other models.

4.2.2 HumanEvalFix
HumanEvalFix [37] tasks agents to fix a bug in a provided function with the help of provided test
cases. The bugs are created to ensure one or more test cases fail. We focus on the Python subset of
2
Running the complete set of 2294 instances costs $6.9k, using a conservative estimate of $3 per instance.

7
Table 4: OpenDevin Software Engineering evaluation results (§4.2).
Agent Model Success Rate (%) $ Avg. Cost
SWE-Bench Lite [21], 300 instances, w/o Hint
SWE-Agent [72] gpt-4-1106-preview 18.0 1.67
AutoCodeRover [78] gpt-4-0125-preview 19.0 −
Aider [13] gpt-4o & claude-3-opus 26.3 −
OD CodeActAgent v1.5, Generalist. gpt-4o-2024-05-13 16.7 1.50
gpt-4o-mini-2024-07-18 7.0 0.01
OD CodeActAgent v1.8, Generalist. gpt-4o-2024-05-13 22.0 1.72
claude-3-5-sonnet@20240620 26.0 1.10
HumanEvalFix [37], 164 instances
BLOOMZ-176B 16.6 −
OctoCoder-15B 30.4 −
Prompting, 0-shot [11, 32, 36, 37, 66, 84]
DeepSeekCoder-33B-Instruct 47.5 −
StarCoder2-15B 48.6 −
SWE-agent, 1-shot [72] gpt-4-turbo 87.7 −
gpt-3.5-turbo-16k-0613 20.1 0.11
OD CodeActAgent v1.5, Generalist, 0-shot.
gpt-4o-2024-05-13 79.3 0.14
BIRD [27], 300 instances
CodeLlama-7B-Instruct 18.3 -
Prompting, 0-shot
CodeQwen-7B-Chat 31.3 -
gpt-4-1106-preview 42.7 0.19
OD CodeActAgent v1.5, Generalist.
gpt-4o-2024-05-13 47.3 0.11
ML-Bench [57], 68 instances
gpt-3.5-turbo 11.0 -
prompting + BM25, 0-shot
gpt-4-1106-preview 22.1 -
gpt-4o-2024-05-13 26.2 -
SWE-Agent [72] gpt-4-1106-preview 42.6 1.91
Aider [13] gpt-4o 64.4 -
gpt-4o-2024-05-13 76.5 0.25
OD CodeActAgent v1.5, Generalist. gpt-4-1106-preview 58.8 1.22
gpt-3.5-turbo-16k-0613 13.2 0.12
BioCoder (Python) [57], 157 instances
gpt-3.5-turbo 11.0 -
prompting, 0-shot
gpt-4-1106-preview 12.7 -
OD CodeActAgent v1.5, Generalist. gpt-4o-2024-05-13 27.5 0.13
BioCoder (Java) [57], 50 instances
gpt-3.5-turbo 4.1 -
prompting, 0-shot
gpt-4-1106-preview 6.4 -
OD CodeActAgent v1.5, Generalist. gpt-4o-2024-05-13 44.0 0.11
Gorilla APIBench [46], 1775 instances
claude-v1 8.7 -
Prompting, 0-shot [2, 39, 41, 62] gpt-4-0314 21.2 -
gpt-3.5-turbo-0301 29.7 -
Gorilla, finetuned for API calls, 0-shot [46, 62] llama-7b 75.0 -
gpt-3.5-turbo-0125 21.6 0.002
OD CodeActAgent v1.5, Generalist.
gpt-4o-2024-05-13 36.4 0.04
ToolQA [81], 800 instances
ChatGPT + CoT 5.1 -
Prompting, 0-shot [23, 33, 39, 65] ChatGPT 5.6 -
Chameleon 10.6 -
gpt-3.5-turbo 36.8 -
ReAct, 0-shot [39, 73]
gpt-3 43.1 -
gpt-3.5-turbo-0125 2.3 0.03
OD CodeActAgent v1.5, Generalist.
gpt-4o-2024-05-13 47.2 0.91

the benchmark and allow models to solve the bugs by self-debug over multiple turns, incorporating
feedback from test execution. We follow the setup from Muennighoff et al. [37] using pass@k [8].
Results. In Tab. 4, OpenDevin CodeActAgent successfully fixes 79.3% of bugs in the Python split.
This is significantly better than all non-agentic approaches, almost doubling the performance of
StarCoder2-15B [28, 32]. While SWE-Agent achieves 87.7%, Yang et al. [72] provides the model
a full demonstration of a successful sample trajectory fixing one of the bugs in the test dataset

8
(“1-shot”), whereas our evaluation of OpenDevin is 0-shot. As HumanEvalFix has been created
by humans and all bugs carefully validated, achieving 100% on this benchmark is entirely feasible,
which we seek to do in future iterations of OpenDevin.

4.2.3 ML-Bench

ML-Bench [57] evaluates agents’ ability to solve machine learning tasks across 18 GitHub repositories.
The benchmark comprises 9,641 tasks spanning 169 diverse ML problems, requiring agents to
generate bash scripts or Python code in response to user instructions. In the sandbox environment,
agents can iteratively execute commands and receive feedback, allowing them to understand the
repository context and fulfill user requirements progressively. Following the setup from the original
paper, we perform agent evaluation on the quarter subset of ML-Bench.
Results. As shown in Table 4, OpenDevin agents with GPT-4o achieve the highest success rate of
76.47% on ML-Bench, outperforming SWE-Agent (42.64%). Performance drops with less capable
models. These results demonstrate the effectiveness of OpenDevin agent in complex ML tasks. We
notice that agents show potential in reducing hallucination and syntax errors compared to non-agent
approaches in the ML-LLM-Bench settings [57].

4.2.4 Gorilla APIBench

Gorilla APIBench [46] evaluates agents’ abilities to use APIs. it incorporates tasks on TorchHub,
TensorHub, and HuggingFace. During the evaluation, models are given a question related to API
usage, such as "identify an API capable of converting spoken language in a recording to text."
Correctness is evaluated based on whether the model’s API call is in the correct domain.
Results. As shown in Table 4, OpenDevin using GPT-4o, with a success rate of 36.4%, outperforms
baselines not specifically finetuned for API calling. While Gorilla shows higher performance on
APIBench, Patil et al. [46] finetune this model for API calling in particular.

4.2.5 ToolQA

ToolQA [81] evaluates agents’ abilities to use external tools. This benchmark includes tasks on
various topics like flight status, coffee price, Yelp data, and Airbnb data, requiring the use of various
tools such as text tools, database tools, math tools, graph tools, code tools, and system tools. It
features two levels: easy and hard. Easy questions focus more on single tool usage, while hard
questions emphasize reasoning. For evaluation, the easy subset is used to assess tool use capabilities.
Results. Compared to all baselines, OpenDevin with GPT-4o shows the highest performance. We
notice that agents perform better on tasks related to CSV and database tool usage but requires
improvements on math and calculator tool usage.

4.2.6 BioCoder

BioCoder [58] is a repository-level code generation benchmark that evaluates agents’ performance
on bioinformatics-related tasks, specifically the ability to retrieve and accurately utilize context. The
original prompts contain the relevant context of the code; however, in this study, we have removed
them to demonstrate the capability of OpenDevin to perform context retrieval, self-debugging, and
reasoning in multi-turn interactions. BioCoder consists of 157 Python and 50 Java functions, each
targeting a specific area in bioinformatics, such as proteomics, genomics, and other specialized
domains. The benchmark targets real-world code by generating code in existing repositories where
the relevant code has been masked out.
Results. Table 4 shows that OpenDevin, using GPT-4o, achieves a success rate of 44.0%. This out-
performs all prompting-based non-agent baselines, with GPT-4 alone only achieving 6.4%. BioCoder
proves to be a particularly challenging benchmark for non-agent methods, as they did not incorpo-
rate any repository-level retrieval methods, making these models lack access to crucial repo-level
information such as global variables and function declarations.

9
Table 5: OpenDevin Web Browsing Evaluation Results (§4.3).
Agent Model Success Rate (%) $ Avg. Cost
WebArena [79], 812 instances
Lemur [70] Lemur-chat-70b 5.3 −
Patel et al. [45] Trained 72B with self-improvement synthetic data 9.4 −
AutoWebGLM [24] Trained 7B with human/agent hybrid annotation 18.2 −
Auto Eval & Refine [42] GPT-4 + Reflexion w/ GPT-4V reward model 20.2 −
Llama3-chat-8b 3.3 −
Llama3-chat-70b 7.0 −
WebArena Agent [79]
gpt-3.5-turbo 6.2 −
gpt-4-turbo 14.4 −
gpt-3.5-turbo-0125 5.2 0.02
gpt-4o-mini-2024-07-18 8.5 0.01
OD BrowsingAgent v1.0
gpt-4o-2024-05-13 14.8 0.15
claude-3-5-sonnet-20240620 15.5 0.10
gpt-4o-mini-2024-07-18 8.3 −
OD CodeActAgent v1.8, Generalist.
gpt-4o-2024-05-13 14.5 −
via delegation to BrowsingAgent v1.0
claude-3-5-sonnet-20240620 15.3 −
MiniWoB++ [30], 125 environments
Workflow Guided Exploration [30] Trained specialist model with environment exploration 34.6 −
CC-NET [18] Trained specialist model with RL and human annotated BC 91.1 −
gpt-3.5-turbo-0125 27.2 0.01
OD BrowsingAgent v1.0
gpt-4o-2024-05-13 40.8 0.05
OD CodeActAgent v1.8, Generalist.
gpt-4o-2024-05-13 39.8 −
via delegation to BrowsingAgent v1.0

4.2.7 BIRD
BIRD [27] is a benchmark for text-to-SQL tasks (i.e., translate natural language into executable SQL)
aimed at realistic and large-scale database environments. We select 300 samples from the dev set to
integrate into OpenDevin and evaluate on execution accuracy. Additionally, we extend the setting by
allowing the agent to engage in multi-turn interactions to arrive at the final SQL query, enabling it to
correct historical results by observing the results of SQL execution.
Results. As shown in Table 4, OpenDevin with GPT-4o achieves an execution accuracy of 47.3% on
a subset of BIRD, showcasing the potential of OpenDevin as a SQL agent. The result outperforms
approaches utilizing prompting with code LLMs, such as CodeLlama-7B-Instruct (18.3%) and
CodeQwen-7B-Chat [3] (31.3%).

4.3 Web Browsing

We report evaluation results for web browsing benchmarks in Tab. 5.

4.3.1 WebArena
WebArena [79] is a self-hostable, execution-based web agent benchmark that allows agents to freely
choose which path to take in completing their given tasks. WebArena comprises 812 human-curated
task instructions across various domains, including shopping, forums, developer platforms, and
content management systems. Each task is paired with a handwritten test case that verifies agent
success, e.g., by checking the status of a web page element against a reference or the textual answer
returned by the agent.
Results. From Tab. 5, we can see that our BrowsingAgent achieves competitive performance among
agents that use LLMs with domain-general prompting techniques. Some agents (e.g., AutoWebGLM)
require manual effort tailored to the WebArena task domain. This showcases the performance trade-
off between a generalist vs. a domain-tailored specialist web agent, and we opt for a more general
browsing agent as a building block in OpenDevin.

4.3.2 MiniWoB
MiniWoB++ [30] is an interactive web benchmark, with built-in reward functions. The tasks are
synthetically initialized on 125 different minimalist web interfaces. Unlike WebArena, tasks are
easier without page changes, require fewer steps, and provide low-level step-by-step task directions.
Note that it contains a portion of environments that require vision capability to tackle successfully,

10
and many existing work choose to focus only on a subset of the tasks [22, 29, 53]. Still, we report the
performance on the full set and only include baselines that are evaluated on the full set.
Results. From Tab. 5 we see that our generalist BrowsingAgent finishes nearly half of the tasks
without any adaptation on the environment. However, due to the synthetic nature of MiniWoB++,
the state-of-the-art agents explicitly trained for the environments with reinforcement learning and/or
human behavior cloning have almost saturated the performance. This shows an even bigger trade-off
between generalist and specialist agents than on the more general WebArena benchmark.

4.4 Miscellaneous Assistance

Results for miscellaneous assistance benchmarks are reported in Tab. 6. In particular, we report
results for:

4.4.1 GAIA
GAIA [34] evaluates agents’ general task-solving skills, covering different real-world scenarios. It
requires various agent capabilities, including reasoning, multi-modal understanding, web browsing,
and coding. GAIA consists of 466 curated tasks across three levels. Setting up GAIA is traditionally
challenging due to the complexity of integrating various tools with the agent, but OpenDevin’s
infrastructure (e.g., runtime §2.2, tool library §2.3) simplifies the integration significantly.
Results. In our experiments, we achieved a score of 32.1 on the GAIA (level-1 val), significantly
improving over the original AutoGPT [14]. GAIA is sensitive to the support of multimodal input
and web navigation skills, suggesting further score improvements as OpenDevin’s infrastructure
improves.

4.4.2 GPQA
GPQA [50] evaluates agents’ ability for coordinated tool use when solving challenging graduate-level
problems. It consists of 448 curated and difficult multiple-choice questions in biology, physics, and
chemistry. Tool use (e.g., python) and web search are often useful to assist agents in answering these
questions since they provide accurate calculations that LLMs are often incapable of and access to
information outside of the LLM’s parametric knowledge base.
Results. Results are shown in Tab. 6 and 7. We observe that OpenDevin’s integrated for supporting
diverse tool use (e.g., python for calculations) as well as web-search (for searching relevant facts)
allows the resulting agent to better solve complex multi-step problems, surpassing the prior state-of-
the-art by 9.6% and 12.3% on the main and diamond subsets respectively on GPQA [50].

4.4.3 AgentBench
AgentBench [31] evaluates agents’ reasoning and decision-making abilities in a multi-turn, open-
ended generation setting. We selected the code-grounded operating system (OS) subset with 144
tasks. Agents from OpenDevin interact directly with the task-specific OS using bash commands in a
multi-turn manner, combining interaction and reasoning to automate task completion.
Results. In our experiments (Tab. 6), OpenDevin CodeActAgent v1.5 achieves a score of 57.6% on
the AgentBench using gpt-4o, outperforming the 42.4% baseline using gpt-4 from the original
paper. Interestingly, when employing weaker models such as gpt-3.5-turbo, OpenDevin agents
generally underperform compared to the original baseline agents. This finding suggests that generalist
agents, like those implemented in OpenDevin, require a certain threshold of foundation model
capability - particularly instruction following - to function effectively.

4.4.4 MINT
MINT [64] is a benchmark designed to evaluate agents’ ability to solve challenging tasks through
multi-turn interactions using tools and natural language feedback simulated by GPT-4. We use
coding and math subsets used in Eurus [76] for evaluation. We follow the same setting as the original
paper and allows the agent to interact up-to five iterations with two chances to propose solutions.
Results. As shown in Tab. 6), OpenDevin agents achieve comparable performance to the default
agent in the original benchmark, with a performance improvement in the math subset.

11
Table 6: OpenDevin miscellaneous assistance evaluation results (§4.4).
Agent Model Success Rate (%) $ Avg. Cost
GAIA [34], L1 validation set, 53 instances
AutoGPT [14] gpt-4-turbo 13.2 −
gpt-4-0125-preview 30.2 0.110
OD GPTSwarm v1.0
gpt-4o-2024-05-13 32.1 0.050
GPQA [50], diamond set, 198 instances (refer to §F, Tab. 7 for other subsets)
Expert human 81.3 −
Human [50]
Non-expert human 21.9 −
Llama-2-70b-chat 28.1 −
Few-shot Prompting + Chain-of-Thought [50] gpt-3.5-turbo-16k 29.6 −
gpt-4 38.8 −
gpt-3.5-turbo-16k 27.9 0.012
OD CodeActAgent v1.5 gpt-4 (azure:2024-02-15-preview) 51.8 0.501
gpt-4o-2024-05-13 53.1 0.054
OD CodeActAgent v1.8 claude-3-5-sonnet-20240620 52.0 0.065
AgentBench [31], OS (bash) subset, 144 instances
gpt-4 42.4 −
AgentBench Baseline Agent [31]
gpt-3.5-turbo 32.6 −
gpt-4o-2024-05-13 57.6 0.085
OD CodeActAgent v1.5, Generalist.
gpt-3.5-turbo-0125 11.8 0.006
MINT [64]: math subset, 225 instances
MINT Baseline Agent gpt-4-0613 65.8 −
gpt-4o-2024-05-13 77.3 0.070
OD CodeActAgent v1.5, Generalist.
gpt-3.5-turbo-16k-0613 33.8 0.048
MINT [64]: code subset, 136 instances
MINT Baseline Agent gpt-4-0613 59.6 −
gpt-4o-2024-05-13 50.0 0.087
OD CodeActAgent v1.5, Generalist.
gpt-3.5-turbo-16k-0613 5.2 0.030
ProofWriter [55], 600 instances
Few-shot Prompting + Chain-of-Thought [43] gpt4 68.1 −
Logic-LM [43] gpt4 + symbolic solver 79.6 −
OD CodeActAgent v1.5, Generalist. gpt-4o-2024-05-13 78.8 −
Entity Deduction Arena [77], 200 instances
Human - 21.0 −
gpt-4-0314 40.0 −
Zero-shot Prompting [77]
gpt-3.5-turbo-0613 27.0 −
gpt-4o-2024-05-13 38.0 −
OD CodeActAgent v1.5, Generalist.
gpt-3.5-turbo-16k-0613 24.0 −

4.4.5 ProofWriter

ProofWriter [55] is a synthetic dataset created to assess deductive reasoning abilities of LLMs. Same
as Logic-LM [43], we focus on the most challenging subset, which contains 600 instances requiring
5-hop reasoning. To minimize the impact of potential errors in semantic parsing, we use the logical
forms provided by Logic-LM.
Results. In Tab. 6, OpenDevin agent employs a symbolic solver to solve the task, achieving
performance comparable to the state-of-the-art neuro-symbolic model (i.e., Logic-LM) [43].

4.4.6 Entity Deduction Arena

Entity Deduction Arena (EDA) [77] evaluates agents’ ability to deduce unknown entities through
strategic questioning, akin to the 20 Questions game. This benchmark tests the agent’s state tracking,
strategic planning, and inductive reasoning capabilities over multi-turn conversations. We evaluate
two datasets “Things” and “Celebrities”, each comprising 100 instances, and report the average
success rate over these two datasets.
Results. Tab. 6 shows that CodeActAgent yields comparable performance comparing with the results
reported in the original paper [77].

12
5 Conclusion
We introduce OpenDevin, a community-driven platform that enables the development of agents that
interact with the world through software interfaces. By providing a powerful interaction mechanism,
a safe sandboxed environment, essential agent skills, multi-agent collaboration capabilities, and a
comprehensive evaluation framework, OpenDevin accelerates research innovations and real-world
applications of agentic AI systems. Despite challenges in developing safe and reliable agents (§A),
we are excited about our vibrant community and look forward to OpenDevin’s continued evolution.

Acknowledgments
We extend our heartfelt gratitude to all contributors who have made OpenDevin possible. Over 160
individuals have made more than 1,300 contributions3 , demonstrating the power of collaborative
development. We are deeply appreciative of everyone4 who has participated in discussions, raised
issues, suggested improvements, and contributed code or documentation.

References
[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David,
Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not
as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. 1

[2] https://www.anthropic.com/index/introducing-claude Anthropic. Claude, 2022. URL https:

//www.anthropic.com/index/introducing-claude. 8

[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,
Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren,
Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu,
Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu,
Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang,
Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv
preprint arXiv:2309.16609, 2023. 10

[4] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical
research with large language models. Nature, 624(7992):570–578, 2023. 1

[5] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen,
Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language
models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024. 1

[6] Harrison Chase. LangChain, October 2022. URL https://github.com/langchain-ai/langchain. 3,

5, 23

[7] Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, Jie Fu, and
Yemin Shi. Autoagents: A framework for automatic agent generation, 2024. 1, 5, 23

[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 8

[9] Cognition.ai. Introducing devin, the first ai software engineer. URL https://www.cognition.ai/
blog/introducing-devin. 2

[10] Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. Chatlaw: Open-source legal large
language model with integrated external knowledge bases, 2023. 23
3
A full list of contributors can be found at https://github.com/OpenDevin/OpenDevin/graphs/contributors.
4
Including our stargazers at https://github.com/OpenDevin/OpenDevin/stargazers

13
[11] DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu,
Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei
Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao,
Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi
Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang
Zhao, Chong Ruan, Fuli Luo, and Wenfeng Liang. Deepseek-coder-v2: Breaking the barrier of
closed-source models in code intelligence, 2024. URL https://arxiv.org/abs/2406.11931. 8
[12] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom
Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and
Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge
work tasks?, 2024. 3, 4
[13] Paul Gauthier. How aider scored sota 26.3% on swe bench lite | aider. https://aider.chat/2024/
05/22/swe-bench-lite.html. Accessed: 2024-06-05. 7, 8
[14] Significant Gravitas. Auto-gpt: An autonomous gpt-4 experiment, 2023. URL https://github.
com/Significant-Gravitas/Auto-GPT, 2023. 5, 7, 11, 12, 23
[15] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong
Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal
models. arXiv preprint arXiv:2401.13919, 2024. 4
[16] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang,
Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming
for a multi-agent collaborative framework. In The Twelfth International Conference on Learning
Representations, 2023. 1, 5, 23
[17] Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. Agentcoder:
Multi-agent-based code generation with iterative testing and optimisation, 2024. 23
[18] Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia,
Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A
data-driven approach for learning to control computers. In International Conference on Machine
Learning, pages 9466–9482. PMLR, 2022. 10
[19] IPython. Jupyter and the future of IPython — IPython. URL https://ipython.org. 4
[20] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. 1
[21] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and
Karthik R Narasimhan. SWE-bench: Can Language Models Resolve Real-world Github
Issues? In The Twelfth International Conference on Learning Representations, 2024. URL
https://openreview.net/forum?id=VTF8yNQM66. 1, 6, 7, 8, 22, 24
[22] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks.
Advances in Neural Information Processing Systems, 36, 2024. 11
[23] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. Advances in neural information processing systems,
35:22199–22213, 2022. 8
[24] Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu,
Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, et al. Autowebglm: Bootstrap and reinforce a
large language model-based web navigating agent. arXiv preprint arXiv:2404.03648, 2024. 7,
10
[25] Hareton K. N. Leung and Lee J. White. A study of integration testing and software regression at
the integration level. In Proceedings of the Conference on Software Maintenance, ICSM 1990,
San Diego, CA, USA, 26-29 November, 1990, pages 290–301. IEEE, 1990. doi: 10.1109/ICSM.
1990.131377. URL https://doi.org/10.1109/ICSM.1990.131377. 24

14
[26] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard
Ghanem. Camel: Communicative agents for" mind" exploration of large scale language model
society. arXiv preprint arXiv:2303.17760, 2023. 23

[27] Jinyang Li, Binyuan Hui, GE QU, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen
Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chang, Fei
Huang, Reynold Cheng, and Yongbin Li. Can LLM already serve as a database interface? a BIg
bench for large-scale database grounded text-to-SQLs. In Thirty-seventh Conference on Neural
Information Processing Systems Datasets and Benchmarks Track, 2023. 6, 8, 10, 22

[28] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao
Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii,
Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João
Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee,
Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang,
Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan
Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha
Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav
Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank
Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Dan-
ish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz
Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
Starcoder: may the source be with you!, 2023. URL https://arxiv.org/abs/2305.06161. 8

[29] Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. A zero-shot language agent for
computer control with structured reflection. arXiv preprint arXiv:2310.08740, 2023. 11

[30] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement
learning on web interfaces using workflow-guided exploration. In International Conference on
Learning Representations (ICLR), 2018. URL https://arxiv.org/abs/1802.08802. 6, 10, 22

[31] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding,
Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui
Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie
Tang. Agentbench: Evaluating llms as agents. arXiv preprint arXiv: 2308.03688, 2023. 6, 11,
12, 22

[32] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Noua-
mane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian,
Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov,
Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo,
Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan
Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang,
Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra,
Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen
Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Car-
olyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite,
Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro
von Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024. URL
https://arxiv.org/abs/2402.19173. 8

[33] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun
Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language
models. Advances in Neural Information Processing Systems, 36, 2024. 8

[34] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas
Scialom. GAIA: a benchmark for general AI assistants. CoRR, abs/2311.12983, 2023. doi:
10.48550/ARXIV.2311.12983. URL https://doi.org/10.48550/arXiv.2311.12983. 6, 11, 12, 22

[35] Mozilla. Accessibility tree - MDN Web Docs Glossary: Definitions of Web-related terms |
MDN. URL https://developer.mozilla.org/en-US/docs/Glossary/Accessibility_tree. 4

15
[36] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman,
Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru
Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai,
Albert Webson, Edward Raff, and Colin Raffel. Crosslingual generalization through multitask
finetuning, 2023. URL https://arxiv.org/abs/2211.01786. 8
[37] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo,
Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction
tuning code large language models, 2024. 6, 7, 8, 22
[38] Y Nakajima. Babyagi. URL https://github.com/yoheinakajima/babyagi, 2023. 23
[39] OpenAI. Chatgpt: May 2024 version. https://www.openai.com/chatgpt, 2024. Accessed:
2024-05-29. 8, 23
[40] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2024-05-15. 1
[41] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo-
rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat,
Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao,
Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman,
Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, An-
drew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis
Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester
Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory
Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve
Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus,
Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges,
Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan
Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei
Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke,
Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu,
Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang,
Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan,
Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan
Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros,
Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis,
Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike,
Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz
Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Man-
ning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob
McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David
Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie
Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély,
Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo
Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pan-
tuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov,
Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde
de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea
Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh,
Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez,
Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt,
David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh,
Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Kata-
rina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski
Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil
Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan
Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright,
Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila
Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens

16
Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu,
Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers,
Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk,
and Barret Zoph. Gpt-4 technical report, 2024. 8, 23

[42] Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Au-
tonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024.
7, 10, 23

[43] Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-lm: Empowering
large language models with symbolic solvers for faithful logical reasoning. arXiv preprint
arXiv:2305.12295, 2023. 12

[44] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and
Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 23

[45] Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris
Callison-Burch, and Sepp Hochreiter. Large language models can self-improve at web agent
tasks. arXiv preprint arXiv:2405.20309, 2024. 7, 10

[46] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language
model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023. 6, 8, 9, 22

[47] Playwright. Fast and reliable end-to-end testing for modern web apps | Playwright. URL
https://playwright.dev/. 4

[48] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao
Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software
development, 2023. 23

[49] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein,
Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master
16000+ real-world apis. CoRR, abs/2307.16789, 2023. doi: 10.48550/ARXIV.2307.16789.
URL https://doi.org/10.48550/arXiv.2307.16789. 3

[50] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien
Dirani, Julian Michael, and Samuel R Bowman. GPQA: A Graduate-Level Google-Proof Q&A
Benchmark. arXiv preprint arXiv:2311.12022, 2023. 6, 7, 11, 12, 22, 25

[51] Aman Sanger. Near-instant full-file edits. https://www.cursor.com/blog/instant-apply. Accessed:

2024-06-05. 5

[52] ServiceNow. BrowserGym: a Gym Environment for Web Task Automation. URL https:
//github.com/ServiceNow/BrowserGym. 2, 4

[53] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu,
Urvashi Khandelwal, Kenton Lee, and Kristina N Toutanova. From pixels to ui actions: Learning
to follow instructions via graphical user interfaces. Advances in Neural Information Processing
Systems, 36:34354–34370, 2023. 11

[54] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao.
Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information
Processing Systems, 36, 2024. 23

[55] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications,
proofs, and abductive statements over natural language. In Chengqing Zong, Fei Xia, Wenjie
Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 3621–3634, Online, August 2021. Association for Computational
Linguistics. doi: 10.18653/v1/2021.findings-acl.317. URL https://aclanthology.org/2021.
findings-acl.317. 6, 12, 22

17
[56] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng
Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. Prioritizing safeguarding over autonomy:
Risks of llm agents for science. arXiv preprint arXiv:2402.04247, 2024. 1
[57] Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng,
Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen,
Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu
Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating large language
models and agents for machine learning tasks on repository-level code, 2024. URL https:
//arxiv.org/abs/2311.09835. 6, 8, 9, 22
[58] Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, and Mark B Gerstein.
BioCoder: a benchmark for bioinformatics code generation with large language models. Bioin-
formatics, 40(Supplement_1):i266–i276, 06 2024. ISSN 1367-4811. 6, 9, 22
[59] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman
Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot
medical reasoning, 2024. 23
[60] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 1
[61] XAgent Team. Xagent: An autonomous agent for complex task solving, 2023. 5, 23
[62] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez,
Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation
language models. arXiv preprint arXiv:2302.13971, 2023. 8
[63] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji.
Executable Code Actions Elicit Better LLM Agents. In ICML, 2024. 2, 3, 6, 22
[64] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji.
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. In ICLR,
2024. 6, 11, 12, 22
[65] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.
Advances in neural information processing systems, 35:24824–24837, 2022. 8
[66] BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana
Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias
Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka
Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral,
Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu
Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon,
Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi,
Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen
Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien,
David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh,
Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada
Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac John-
son, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan
Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen,
Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy,
Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max
Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Moham-
mad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier
Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo,
Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui
Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan

18
Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas
Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev,
Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak
Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky,
Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud
Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey,
Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari,
Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng
Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala
Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked
Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason
Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff
Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Pey-
rounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette,
Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith,
Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia
Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan
Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli
Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova,
Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat,
Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der
Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana
Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov,
Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk
Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana San-
tos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour,
Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Fer-
randis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A.
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu,
Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Ne-
jadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim
Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu,
Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy,
Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber,
Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang,
Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla,
Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi
Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian
Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak,
Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas
Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa
Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo,
Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg,
Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha
Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick
Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline
Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra,
Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati,
Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh
Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye,
Mathilde Bras, Younes Belkada, and Thomas Wolf. Bloom: A 176b-parameter open-access
multilingual language model, 2023. URL https://arxiv.org/abs/2211.05100. 8

[67] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li,
Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via
multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023. 1, 5, 23

[68] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying
llm-based software engineering agents. arXiv preprint, 2024. 7

19
[69] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua,
Junning Zhao, Qian Liu, Che Liu, et al. Openagents: An open platform for language agents in
the wild. arXiv preprint arXiv:2310.10634, 2023. 5

[70] Yiheng Xu, Hongjin Su, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou,
Yitao Liu, Tianbao Xie, et al. Lemur: Harmonizing natural language and code for language
agents. arXiv preprint arXiv:2310.06830, 2023. 7, 10

[71] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark
prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441,
2023. 4

[72] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik
Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software
engineering, 2024. 2, 5, 7, 8, 23

[73] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan
Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International
Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_
vluYUL-X. 8

[74] Yining Ye, Xin Cong, Shizuo Tian, Jiannan Cao, Hao Wang, Yujia Qin, Yaxi Lu, Heyang
Yu, Huadong Wang, Yankai Lin, et al. Proagent: From robotic process automation to agentic
process automation. arXiv preprint arXiv:2311.10751, 2023. 23

[75] Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, and Heng Ji. CRAFT:
customizing llms by creating and retrieving from specialized toolsets. CoRR, abs/2309.17428,
2023. doi: 10.48550/ARXIV.2309.17428. URL https://doi.org/10.48550/arXiv.2309.17428. 3

[76] Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin
Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and
Maosong Sun. Advancing llm reasoning generalists with preference trees, 2024. 11

[77] Yizhe Zhang, Jiarui Lu, and Navdeep Jaitly. Probing the multi-turn planning capabilities of llms
via 20 question games. 2024. 6, 12, 22

[78] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Au-
tonomous program improvement, 2024. 5, 7, 8, 23

[79] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng,
Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for build-
ing autonomous agents. In The Twelfth International Conference on Learning Representations,
2023. 1, 4, 6, 7, 10, 22

[80] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu,
Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang,
Xiangru Tang, Ningyu Zhang, Huajun Chen, Peng Cui, and Mrinmaya Sachan. Agents: An
open-source framework for autonomous language agents, 2023. 5, 23

[81] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm
question answering with external tools. Advances in Neural Information Processing Systems,
36, 2024. 6, 8, 9, 22

[82] Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand
Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann,
Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint
arXiv:2305.17066, 2023. 23

[83] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jurgen
Schmidhuber. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823, 2024.
5, 6, 22, 23

20
[84] Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries,
Qian Liu, and Niklas Muennighoff. Astraios: Parameter-efficient instruction tuning code large
language models, 2024. URL https://arxiv.org/abs/2401.00788. 8
[85] Albert Örwall. Moatless tools. URL https://github.com/aorwall/moatless-tools. 7

21
Author Contributions
This work was an open-source collaborative effort across multiple institutions. We employed a
point-based system to determine contributions and award authorships, with technical contributions
tracked and measured in units of pull requests (PRs)5 . Xingyao Wang led the project, coordinating
overall development and paper writing efforts. Detailed contributions were as follows:
• Agent Development (§3): Xingyao Wang led the implementation of CodeAct [63] and Code-
ActSWE agents. Frank F. Xu led the development of web browsing agents [79]. Mingchen
Zhuge orchestrated the integration of the GPTSwarm agent [83]. Robert Brennan and Boxuan Li
lead the development of the Micro Agent.
• Architectural Development (Fig. 3): Robert Brennan initiated the architecture design. Boxuan
Li, Frank F. Xu, Xingyao Wang, Yufan Song, and Mingzhang Zheng further refined and expanded
the architecture. Boxuan Li implemented the initial version of integration tests (§E), maintained
the agentskills library (§2.3), managed configurations, and resolved resource leaks in evaluation.
Frank F. Xu developed the web browsing environment (§I) for both agent execution and evaluation
and integrated it with both agent and front-end user interfaces. Xingyao Wang authored the
initial code for the agentskills library and the Docker sandbox. Yufan Song implemented cost
tracking for evaluation, while Mingzhang Zheng developed an image-agnostic docker sandbox
for more stable SWE-Bench evaluation.
• Benchmarking, Integration, and Code Review: Boxuan Li and Yufan Song led benchmark
integration efforts, including coordination, evaluation, and code review. Yufan Song also helped
track PR contributions. Graham Neubig, Xingyao Wang, Mingzhang Zheng, Robert Brennan,
Hoang H. Tran, Frank F. Xu, Xiangru Tang, Fuqiang Li, and Yanjun Shao provided additional
support in integration and code reviews. Specific benchmark contributions included:
– SWE-Bench [21]: Bowen Li and Xingyao Wang
– WebArena [79] and MiniWob++ [30]: Frank F. Xu
– GAIA [34]: Jiayi Pan (integration) and Mingchen Zhuge (GPTSwarm evaluation)
– API-Bench [46] and ToolQA [81]: Yueqi Song
– HumanEvalFix [37]: Niklas Muennighoff and Xiangru Tang
– ProofWriter [55]: Ren Ma
– MINT [64]: Hoang H. Tran
– AgentBench [31]: Fuqiang Li
– BIRD [27]: Binyuan Hui
– GPQA [50]: Jaskirat Singh
– BioCoder [58]: Xiangru Tang and Bill Qian
– ML-Bench [57]: Xiangru Tang and Yanjun Shao
– Entity-Deduction-Arena [77]: Yizhe Zhang
• Advising: Graham Neubig advised the project, providing guidance, resources, and substantial
paper edits. Heng Ji and Hao Peng offered additional project advice and assisted with paper
writing. Junyang Lin contributed advisory support and sponsored resources.
All authors contributed to the result discussions and manuscript preparation.

A Limitations and Future Work

We are excited about the foundations our vibrant community has laid in OpenDevin and look forward
to its continued evolution. We identify several directions for future work:
Enhanced multi-modality support. While our current implementation already supports a wide
range of file formats through predefined agent skills, we are interested in enabling multi-modality
in a principled way through standard IPython and browser integration, such as viewing images and
videos using vision-language model through a browser or processing XLSX files with code.
Stronger agents. Current agents still struggle with complex tasks, and we are interested in building
better agents through both training and inference time techniques.
5
For more details, please refer to https://github.com/OpenDevin/OpenDevin/pull/1917.

22
Web browsing improvements. Due to the extensible nature of OpenDevin, orthogonal components
that could improve agents can be integrated easily. For example, thanks to OpenDevin’s extensible
architecture, Auto Eval & Refine [42], an agent retry-on-error strategy with Reflexion [54] prompts
and task completion reward models, will be integrated as an optional component attached to our
browsing agent.
Stable Runtime. Currently, OpenDevin’s default sandbox environment, SSH sandbox, relies on
pexpect library6 to pass commands and messages between the sandbox and agent controller. It
doesn’t suit other types of sandboxes (e.g. commercial cloud sandboxes) and is not stable enough.
In the future, we plan to use EventStream as the bridge for the agent controller and the sandbox to
communicate, which will be more stable, sandbox-agnostic, and enables better observability.
Automatic workflow generation. Currently, OpenDevin’s workflow still requires a substantial
handcrafted workload. We believe that graph-based frameworks such as GPTSwarm [83] and
LangGraph [6] could serve as alternative solutions for building agents. Particularly in GPTSwarm,
when agents are constructed using graphs, it becomes easier to incorporate various optimization
methods (e.g., reinforcement learning, meta-prompting). OpenDevin considers these methods to lay
the groundwork for promising solutions in automatic workflow generation in future versions.

B Ethics Statement

Most AI agents today are still research artifacts and lack the ability to perform complex, long-horizon
tasks in the real world reliably. However, as their performance continues to improve and they are
increasingly deployed in real world, they have the potential to boost productivity while also posing
security risks to society significantly. OpenDevin helps mitigate risks by:
(1) Enabling systematic evaluation of these agents, which can identify and address risks before they
are widely deployed.
(2) Facilitating human-agent interaction rather than allowing agents to operate autonomously without
oversight.
(3) More importantly, we hope OpenDevin allows researchers worldwide to access the best suites of
agents to conduct frontier safety research towards building safe and helpful agents.

C Related Work

The breakthroughs in large language models (LLMs) like ChatGPT [39] and GPT-4 [41] have
significantly enhanced the capabilities of autonomous agents across various domains [10, 44, 59,
74]. These advances have spurred a multitude of generalist agent proposals [14, 38, 67] aimed at
performing diverse user tasks and have gained attention from both developers and broader audiences.
Notable works such as Auto-GPT [14] harness LLMs for task completion by decomposing user
goals into executable steps. Multi-agent collaboration systems leverage LLMs for elements like role-
playing and task-solving capabilities [26, 61, 80, 82], with MetaGPT [16] emphasizing standardized
operating procedures, and AutoGen [67] providing a conversation framework for interactive systems.
AGENTS [80] and AutoAgents [7] offer new paradigms for customizable agent architecture, while
XAgent [61] and GPTSwarm [83] introduce complex management systems and optimizable graphs,
respectively, for enhanced agent operations.
Software development, a front-runner in applying LLM-based agents, has seen advancements in
frameworks for facilitating the development processes [16, 48]. Innovations such as ChatDev [48]
automate the software development lifecycle akin to the waterfall model, and AutoCodeRover [78]
addresses GitHub issues via code search and abstract syntax tree manipulation. AgentCoder [17]
iteratively refines code generation with integrated testing and feedback, while SWE-Agent [72]
integrates LLMs for automated Github issue fixing, streamlining software engineering.

6
https://pexpect.readthedocs.io/en/stable/

23
D Graphical User Interface
Besides running from the command line, OpenDevin features a rich graphical user interface that
visualizes the agent’s current actions (e.g., browsing the web, executing base commands or Python
code, etc.) and allows for real-time feedback from the user. Screenshots of the UI are shown in
Fig. 1. The user may interrupt the agent at any moment to provide additional feedback, comments, or
instruction while the agent is working. This user interface directly connects with the event streams
(§2.1) to control and visualize the agents and runtime, making it agent and runtime agnostic.

E Quality Control: Integration Tests for Agents

Integration tests [25] have long been used by software developers to ensure software quality. Unlike
large language models with simple input-output schema, agents are typically complex pieces of
software where minor errors can be easily introduced during the development process and hurt final
task performance. While running a full suite evaluation (§4) is the ultimate measure of performance
degradation, running them for every code changes can be prohibitively slow and expensive. 7 . In
OpenDevin, we pioneer an end-to-end agent test framework that tests prompt regression, actions, and
sandbox environments. It combines integration testing from software engineering and foundation
model mocking for deterministic behavior to prevent the accidental introduction of bugs during agent
development.
Defining an integration test. The integration test framework for OpenDevin is structured to validate
end-to-end functionality by automating task execution and result verification. Developers define
tasks and expected results; for instance, a task might involve correcting typos in a document named
"bad.txt". Upon task execution through OpenDevin, outputs are compared against a predefined "gold
file" to ensure accuracy.
Mocking LLM for deterministic behavior. Addressing the challenge of non-determinism in large
language models (LLMs) and the associated high costs, the framework intercepts all LLM calls
and supplies predefined responses based on exact prompt matches. This method not only ensures
consistency in test outcomes but also reduces operational costs by minimizing the reliance on real
LLMs.
Regenerate LLM responses on breaking changes. Prompt-response pairs are managed through
a script that generates and stores these pairs when new tests are introduced or existing prompts are
modified. For routine tests, the framework attempts to reuse existing LLM responses by slightly
adjusting the prompts. Substantial changes that affect task handling require regeneration of these
pairs using real LLMs.
Benefits of integration tests. The framework offers several advantages, including 1) Prompt regres-
sion testing: Stored prompt-response pairs facilitate change tracking and provide a reference for new
team members to understand LLM interactions, 2) Multi-platform support: Tests are automatically
scheduled for every pull request and commit on the main branch, running across multiple platforms,
environments, and agents, including Linux and Mac, and in local, SSH, and exec sandboxes, and
3) Comprehensive error detection: It captures errors in prompt generation, message passing, and
sandbox execution, thereby maintaining a high test coverage.

F Additional Results For GPQA Benchmark

We showcase more detailed results, including performance on other subsets for GPQA benchmark
in Tab. 7.

G In-context Demonstration for CodeActSWEAgent

The prompt is re-adopted from the SWE-agent’s released trajectory (https://github.com/princeton-nlp/
SWE-agent/tree/main/trajectories/demonstrations). The prompt can be found at https://github.com/
OpenDevin/OpenDevin/blob/main/agenthub/codeact_swe_agent/prompt.py.
7
Running a SWE-Bench Lite [21] evaluation with gpt-4o costs around 600 USD.

24
Table 7: Full Evaluation Results on the GPQA Benchmark [50] (§4.4).
Accuracy by subset (%)
Evaluation Method and Model Avg Cost ($)
Diamond Set Main Set Extended Set
Expert Human Validators 81.2 72.5 65.4 N/A
Non-Expert Human Validators 21.9 30.5 33.9 N/A
Few-Shot CoT Llama-2-70B-chat 28.1 29.1 30.4 N/A
Few-Shot CoT GPT-3.5-turbo-16k 29.6 28.0 28.2 N/A
Few-Shot CoT GPT-4 38.8 39.7 38.7 N/A
GPT-4 with search (backoff to CoT on abstention) 38.8 41.0 39.4 N/A
OpenDevin + CodeActAgent v1.5 + GPT3.5-turbo 27.9 23.4 26.1 0.012
OpenDevin + CodeActAgent v1.5 + GPT4-turbo 51.8 47.4 42.4 0.501
OpenDevin + CodeActAgent v1.5 + GPT4o 53.1 49.3 52.8 0.054

H Supported AgentSkills

As of OpenDevin v0.6, we support the following list of skills. Please refer to the source code for
the most up-to-date list of skills: https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/
runtime/plugins/agent_skills/agentskills.py

def open_file(path: str, line_number: Optional[int] = None) -> None:

"""
Opens the file at the given path in the editor. If line_number is
,→ provided, the window will be moved to include that line.

Args:
path: str: The path to the file to open.
line_number: Optional[int]: The line number to move to.
"""
pass

def goto_line(line_number: int) -> None:

"""
Moves the window to show the specified line number.

Args:
line_number: int: The line number to move to.
"""
pass

def scroll_down() -> None:

"""Moves the window down by 100 lines.

Args:
None
"""
pass

def scroll_up() -> None:

"""Moves the window up by 100 lines.

Args:
None
"""
pass

def create_file(filename: str) -> None:

25
"""Creates and opens a new file with the given name.

Args:
filename: str: The name of the file to create.
"""
pass

def edit_file(start: int, end: int, content: str) -> None:

"""Edit a file.

It replaces lines `start` through `end` (inclusive) with the given text
,→ `content` in the open file. Remember, the file must be open before
,→ editing.

Args:
start: int: The start line number. Must satisfy start >= 1.
end: int: The end line number. Must satisfy start <= end <= number
,→ of lines in the file.
content: str: The content to replace the lines with.
"""
pass

def search_dir(search_term: str, dir_path: str = './') -> None:

"""Searches for search_term in all files in dir. If dir is not provided,
,→ searches in the current directory.

Args:
search_term: str: The term to search for.
dir_path: Optional[str]: The path to the directory to search.
"""
pass

def search_file(search_term: str, file_path: Optional[str] = None) -> None:

"""Searches for search_term in file. If file is not provided, searches
,→ in the current open file.

Args:
search_term: str: The term to search for.
file_path: Optional[str]: The path to the file to search.
"""
pass

def find_file(file_name: str, dir_path: str = './') -> None:

"""Finds all files with the given name in the specified directory.

Args:
file_name: str: The name of the file to find.
dir_path: Optional[str]: The path to the directory to search.
"""
pass

def parse_pdf(file_path: str) -> None:

"""Parses the content of a PDF file and prints it.

Args:
file_path: str: The path to the file to open.
"""
pass

26
def parse_docx(file_path: str) -> None:
"""
Parses the content of a DOCX file and prints it.

Args:
file_path: str: The path to the file to open.
"""
pass

def parse_latex(file_path: str) -> None:

"""
Parses the content of a LaTex file and prints it.

Args:
file_path: str: The path to the file to open.
"""
pass

def parse_audio(file_path: str, model: str = 'whisper-1') -> None:

"""
Parses the content of an audio file and prints it.

Args:
file_path: str: The path to the audio file to transcribe.
model: Optional[str]: The audio model to use for transcription.
,→ Defaults to 'whisper-1'.
"""
pass

def parse_image(
file_path: str, task: str = 'Describe this image as detail as
,→ possible.'
) -> None:
"""
Parses the content of an image file and prints the description.

Args:
file_path: str: The path to the file to open.
task: Optional[str]: The task description for the API call.
,→ Defaults to 'Describe this image as detail as possible.'.
"""
pass

def parse_video(
file_path: str,
task: str = 'Describe this image as detail as possible.',
frame_interval: int = 30,
) -> None:
"""
Parses the content of an image file and prints the description.

Args:
file_path: str: The path to the video file to open.
task: Optional[str]: The task description for the API call.
,→ Defaults to 'Describe this image as detail as possible.'.
frame_interval: Optional[int]: The interval between frames to
,→ analyze. Defaults to 30.

27
"""
pass

def parse_pptx(file_path: str) -> None:

"""
Parses the content of a pptx file and prints it.

Args:
file_path: str: The path to the file to open.
"""
pass

I BrowserGym Actions

The following are all the supported actions defined in BrowserGym8 as of v0.3.4. The actions can
be categorized into several types and can be configured to use only a subset of the functionality.
There are agent control actions, navigation actions, page element-based actions, coordinate-based
actions, as well as tab-related actions. We use these actions from the BrowserGym library as our
main browsing action primitives.

def send_msg_to_user(text: str):

"""
Sends a message to the user.

Examples:
send_msg_to_user("Based on the results of my search, the city was
,→ built in 1751.")
"""
pass

def report_infeasible(reason: str):

"""
Notifies the user that their instructions are infeasible.

Examples:
report_infeasible("I cannot follow these instructions because there
,→ is no email field in this form.")
"""
pass

def noop(wait_ms: float = 1000):

"""
Do nothing, and optionally wait for the given time (in milliseconds).

Examples:
noop()
noop(500)
"""
pass

# https://playwright.dev/docs/input#text-input
def fill(bid: str, value: str):
"""

8
https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/action/functions.py

28
Fill out a form field. It focuses the element and triggers an input
,→ event with the entered text.
It works for <input>, <textarea> and [contenteditable] elements.

Examples:
fill('237', 'example value')
fill('45', "multi-line\\nexample")
fill('a12', "example with \\"quotes\\"")
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-check
def check(bid: str):
"""
Ensure a checkbox or radio element is checked.

Examples:
check('55')
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-uncheck
def uncheck(bid: str):
"""
Ensure a checkbox or radio element is unchecked.

Examples:
uncheck('a5289')
"""
pass

# https://playwright.dev/docs/input#select-options
def select_option(bid: str, options: str | list[str]):
"""
Select one or multiple options in a <select> element. You can specify
option value or label to select. Multiple options can be selected.

Examples:
select_option('a48', "blue")
select_option('c48', ["red", "green", "blue"])
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-click
def click(
bid: str,
button: Literal["left", "middle", "right"] = "left",
modifiers: list[Literal["Alt", "Control", "Meta", "Shift"]] = [],
):
"""
Click an element.

Examples:
click('a51')
click('b22', button="right")

29
click('48', button="middle", modifiers=["Shift"])
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-dblclick
def dblclick(
bid: str,
button: Literal["left", "middle", "right"] = "left",
modifiers: list[Literal["Alt", "Control", "Meta", "Shift"]] = [],
):
"""
Double click an element.

Examples:
dblclick('12')
dblclick('ca42', button="right")
dblclick('178', button="middle", modifiers=["Shift"])
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-hover
def hover(bid: str):
"""
Hover over an element.

Examples:
hover('b8')
"""
pass

# https://playwright.dev/python/docs/input#keys-and-shortcuts
def press(bid: str, key_comb: str):
"""
Focus the matching element and press a combination of keys. It accepts
the logical key names that are emitted in the keyboardEvent.key
,→ property
of the keyboard events: Backquote, Minus, Equal, Backslash, Backspace,
Tab, Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown,
,→ PageUp,
ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You
,→ can
alternatively specify a single character you'd like to produce such as
,→ "a"
or "#". Following modification shortcuts are also supported: Shift,
,→ Control,
Alt, Meta.

Examples:
press('88', 'Backspace')
press('a26', 'Control+a')
press('a61', 'Meta+Shift+t')
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-focus

30
def focus(bid: str):
"""
Focus the matching element.

Examples:
focus('b455')
"""
pass

# https://playwright.dev/python/docs/api/class-locator#locator-clear
def clear(bid: str):
"""
Clear the input field.

Examples:
clear('996')
"""
pass

# https://playwright.dev/python/docs/input#drag-and-drop
def drag_and_drop(from_bid: str, to_bid: str):
"""
Perform a drag & drop. Hover the element that will be dragged. Press
left mouse button. Move mouse to the element that will receive the
drop. Release left mouse button.

Examples:
drag_and_drop('56', '498')
"""
pass

# https://playwright.dev/python/docs/api/class-mouse#mouse-wheel
def scroll(delta_x: float, delta_y: float):
"""
Scroll horizontally and vertically. Amounts in pixels, positive for
,→ right or down scrolling, negative for left or up scrolling.
,→ Dispatches a wheel event.

Examples:
scroll(0, 200)
scroll(-50.2, -100.5)
"""
pass

# https://playwright.dev/python/docs/api/class-mouse#mouse-move
def mouse_move(x: float, y: float):
"""
Move the mouse to a location. Uses absolute client coordinates in
,→ pixels.
Dispatches a mousemove event.

Examples:
mouse_move(65.2, 158.5)
"""
pass

31
# https://playwright.dev/python/docs/api/class-mouse#mouse-up
def mouse_up(x: float, y: float, button: Literal["left", "middle", "right"]
,→ = "left"):
"""
Move the mouse to a location then release a mouse button. Dispatches
mousemove and mouseup events.

Examples:
mouse_up(250, 120)
mouse_up(47, 252, 'right')
"""
pass

# https://playwright.dev/python/docs/api/class-mouse#mouse-down
def mouse_down(x: float, y: float, button: Literal["left", "middle",
,→ "right"] = "left"):
"""
Move the mouse to a location then press and hold a mouse button.
,→ Dispatches
mousemove and mousedown events.

Examples:
mouse_down(140.2, 580.1)
mouse_down(458, 254.5, 'middle')
"""
pass

# https://playwright.dev/python/docs/api/class-mouse#mouse-click
def mouse_click(x: float, y: float, button: Literal["left", "middle",
,→ "right"] = "left"):
"""
Move the mouse to a location and click a mouse button. Dispatches
,→ mousemove,
mousedown and mouseup events.

Examples:
mouse_click(887.2, 68)
mouse_click(56, 712.56, 'right')
"""
pass

# https://playwright.dev/python/docs/api/class-mouse#mouse-dblclick
def mouse_dblclick(x: float, y: float, button: Literal["left", "middle",
,→ "right"] = "left"):
"""
Move the mouse to a location and double click a mouse button.
,→ Dispatches
mousemove, mousedown and mouseup events.

Examples:
mouse_dblclick(5, 236)
mouse_dblclick(87.5, 354, 'right')
"""
pass

32
def mouse_drag_and_drop(from_x: float, from_y: float, to_x: float, to_y:
,→ float):
"""
Drag and drop from a location to a location. Uses absolute client
coordinates in pixels. Dispatches mousemove, mousedown and mouseup
events.

Examples:
mouse_drag_and_drop(10.7, 325, 235.6, 24.54)
"""
pass

# https://playwright.dev/python/docs/api/class-keyboard#keyboard-press
def keyboard_press(key: str):
"""
Press a combination of keys. Accepts the logical key names that are
emitted in the keyboardEvent.key property of the keyboard events:
Backquote, Minus, Equal, Backslash, Backspace, Tab, Delete, Escape,
ArrowDown, End, Enter, Home, Insert, PageDown, PageUp, ArrowRight,
ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You can
alternatively specify a single character you'd like to produce such
as "a" or "#". Following modification shortcuts are also supported:
Shift, Control, Alt, Meta.

Examples:
keyboard_press('Backspace')
keyboard_press('Control+a')
keyboard_press('Meta+Shift+t')
page.keyboard.press("PageDown")
"""
pass

# https://playwright.dev/python/docs/api/class-keyboard#keyboard-up
def keyboard_up(key: str):
"""
Release a keyboard key. Dispatches a keyup event. Accepts the logical
key names that are emitted in the keyboardEvent.key property of the
keyboard events: Backquote, Minus, Equal, Backslash, Backspace, Tab,
Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp,
ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc.
You can alternatively specify a single character you'd like to produce
such as "a" or "#".

Examples:
keyboard_up('Shift')
keyboard_up('c')
"""
pass

# https://playwright.dev/python/docs/api/class-keyboard#keyboard-down
def keyboard_down(key: str):
"""
Press and holds a keyboard key. Dispatches a keydown event. Accepts the
logical key names that are emitted in the keyboardEvent.key property of

33
the keyboard events: Backquote, Minus, Equal, Backslash, Backspace,
,→ Tab,
Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp,
ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You
,→ can
alternatively specify a single character such as "a" or "#".

Examples:
keyboard_up('Shift')
keyboard_up('c')
"""
pass

# https://playwright.dev/python/docs/api/class-keyboard#keyboard-type
def keyboard_type(text: str):
"""
Types a string of text through the keyboard. Sends a keydown,
,→ keypress/input,
and keyup event for each character in the text. Modifier keys DO NOT
,→ affect
keyboard_type. Holding down Shift will not type the text in upper case.

Examples:
keyboard_type('Hello world!')
"""
pass

#
,→ https://playwright.dev/python/docs/api/class-keyboard#keyboard-insert-text
def keyboard_insert_text(text: str):
"""
Insert a string of text in the currently focused element. Dispatches
,→ only input
event, does not emit the keydown, keyup or keypress events. Modifier
,→ keys DO NOT
affect keyboard_insert_text. Holding down Shift will not type the text
,→ in upper
case.

Examples:
keyboard_insert_text('Hello world!')
"""
pass

# https://playwright.dev/python/docs/api/class-page#page-goto
def goto(url: str):
"""
Navigate to a url.

Examples:
goto('http://www.example.com')
"""
pass

# https://playwright.dev/python/docs/api/class-page#page-go-back

34
def go_back():
"""
Navigate to the previous page in history.

Examples:
go_back()
"""
pass

# https://playwright.dev/python/docs/api/class-page#page-go-forward
def go_forward():
"""
Navigate to the next page in history.

Examples:
go_forward()
"""
pass

#
,→ https://playwright.dev/python/docs/api/class-browsercontext#browser-context-new-page
def new_tab():
"""
Open a new tab. It will become the active one.

Examples:
new_tab()
"""
global page
# set the new page as the active page
page = page.context.new_page()
# trigger the callback that sets this page as active in browsergym
pass

# https://playwright.dev/python/docs/api/class-page#page-close
def tab_close():
"""
Close the current tab.

Examples:
tab_close()
"""
pass

# https://playwright.dev/python/docs/api/class-page#page-bring-to-front
def tab_focus(index: int):
"""
Bring tab to front (activate tab).

Examples:
tab_focus(2)
"""
pass

35
# https://playwright.dev/python/docs/input#upload-files
def upload_file(bid: str, file: str | list[str]):
"""
Click an element and wait for a "filechooser" event, then select one
or multiple input files for upload. Relative file paths are resolved
relative to the current working directory. An empty list clears the
selected files.

Examples:
upload_file("572", "my_receipt.pdf")
upload_file("63", ["/home/bob/Documents/image.jpg",
,→ "/home/bob/Documents/file.zip"])
"""
pass

# https://playwright.dev/python/docs/input#upload-files
def mouse_upload_file(x: float, y: float, file: str | list[str]):
"""
Click a location and wait for a "filechooser" event, then select one
or multiple input files for upload. Relative file paths are resolved
relative to the current working directory. An empty list clears the
selected files.

Examples:
mouse_upload_file(132.1, 547, "my_receipt.pdf")
mouse_upload_file(328, 812, ["/home/bob/Documents/image.jpg",
,→ "/home/bob/Documents/file.zip"])
"""
pass

J Browsing Agent Details

The following shows an example prompt containing all the information required for the current step
to make a prediction about the next browsing actions. Note that we also instruct the agent to predict
multiple actions in one turn if the agent thinks they are meant to be executed sequentially without any
feedback from the page. This could save turns for common workflows that consist of a sequence of
actions on the same page without any observation change, such as filling the username and password
and submit in a login page.

# Instructions
Review the current state of the page and all other information to find the
,→ best possible next action to accomplish your goal. Your answer will be
,→ interpreted and executed by a program, make sure to follow the
,→ formatting instructions.

# Goal:
Browse localhost:8000, and tell me the ultimate answer to life. Do not ask
,→ me for confirmation at any point.

# Action Space

16 different types of actions are available.

noop(wait_ms: float = 1000)

Examples:
noop()

36
noop(500)

send_msg_to_user(text: str)
Examples:
send_msg_to_user('Based on the results of my search, the city was
,→ built in 1751.')

scroll(delta_x: float, delta_y: float)

Examples:
scroll(0, 200)

scroll(-50.2, -100.5)

fill(bid: str, value: str)

Examples:
fill('237', 'example value')

fill('45', 'multi-line\nexample')

fill('a12', 'example with "quotes"')

select_option(bid: str, options: str | list[str])

Examples:
select_option('48', 'blue')

select_option('48', ['red', 'green', 'blue'])

click(bid: str, button: Literal['left', 'middle', 'right'] = 'left',

,→ modifiers: list[typing.Literal['Alt', 'Control', 'Meta', 'Shift']] =
,→ [])
Examples:
click('51')

click('b22', button='right')

click('48', button='middle', modifiers=['Shift'])

dblclick(bid: str, button: Literal['left', 'middle', 'right'] = 'left',

,→ modifiers: list[typing.Literal['Alt', 'Control', 'Meta', 'Shift']] =
,→ [])
Examples:
dblclick('12')

dblclick('ca42', button='right')

dblclick('178', button='middle', modifiers=['Shift'])

hover(bid: str)
Examples:
hover('b8')

press(bid: str, key_comb: str)

Examples:
press('88', 'Backspace')

press('a26', 'Control+a')

press('a61', 'Meta+Shift+t')

37
focus(bid: str)
Examples:
focus('b455')

clear(bid: str)
Examples:
clear('996')

drag_and_drop(from_bid: str, to_bid: str)

Examples:
drag_and_drop('56', '498')

upload_file(bid: str, file: str | list[str])

Examples:
upload_file('572', 'my_receipt.pdf')

upload_file('63', ['/home/bob/Documents/image.jpg',
,→ '/home/bob/Documents/file.zip'])

go_back()
Examples:
go_back()

go_forward()
Examples:
go_forward()

goto(url: str)
Examples:
goto('http://www.example.com')

Multiple actions can be provided at once. Example:

fill('a12', 'example with "quotes"')
click('51')
click('48', button='middle', modifiers=['Shift'])
Multiple actions are meant to be executed sequentially without any feedback
,→ from the page.
Don't execute multiple actions at once if you need feedback from the page.

# Current Accessibility Tree:

RootWebArea 'The Ultimate Answer', focused
[8] heading 'The Ultimate Answer'
[9] paragraph ''
StaticText 'Click the button to reveal the answer to life,
,→ the universe, and everything.'
[10] button 'Click me', clickable

# Previous Actions
goto('http://localhost:8000')

Here is an example with chain of thought of a valid action when clicking on

,→ a button:
"
In order to accomplish my goal I need to click on the button with bid 12
```click("12")```

And an example response to the above prompt is:

38
In order to accomplish my goal, I need to click on the button with bid 10
,→ to reveal the answer to life, the universe, and everything.
```click("10")```

For the evaluation on WebArena benchmark, since some of the tasks require checking for answer
exact match on the agent’s message back to the user, we add the following instruction to let the agent
reply with only a concise answer string when messaging the user to prevent the agent from failing the
test due to extra text:

Here is another example with chain of thought of a valid action when

,→ providing a concise answer to user:
"
In order to accomplish my goal I need to send the information asked back to
,→ the user. This page list the information of HP Inkjet Fax Machine,
,→ which is the product identified in the objective. Its price is $279.49.
,→ I will send a message back to user with the answer.
```send_msg_to_user("$279.49")```
"