
Agents for Software Development
Graham Neubig
My Profile
● Professor at CMU
● Chief Scientist at All Hands AI (building open-source coding agents)
● Maintainer of OpenHands
https://github.com/All-Hands-AI/OpenHands
● Software developer
"More and more major businesses and industries are being run on software and delivered as online services—from movies to agriculture to national defense. […] Over the next 10 years, I expect many more industries to be disrupted by software […]."
— Marc Andreessen, "Why Software is Eating the World" (2011)

If we gave everyone the ability to quickly write software to achieve their goals, what could they do?
What is Involved in Developing Software?
[Pie chart of how developers spend their time: Coding, Bug fixing, Testing, Documents/Reviews, Communication, and Other, in segments of 36%, 17%, 15%, 14%, 10%, and 8%]
Today was a Good Day: The Daily Life of Software Developers (Meyer et al. 2019)
How Can We Support Developers?
Development Copilots
• Work synchronously with the developer to ease writing code
• e.g. GitHub Copilot, Cursor
Development Agents
• For coding (e.g. SWE-Agent, Aider)
• For broader development (e.g. Devin, OpenHands)
Autonomous Issue Resolution
https://github.com/All-Hands-AI/OpenHands-resolver
How Promising?
• Code generation leads to large improvements in productivity (GitHub 2023)
Challenges in Coding Agents
• Defining the Environment
• Designing Observation/Action Spaces
• Code Generation (atomic actions)
• File Localization (exploration)
• Planning and Error Recovery
• Safety
Software Development Environments
Types of Environments
• Actual Environments:
  • Source Repositories: GitHub, GitLab
  • Task Management Software: Jira, Linear
  • Office Software: Google Docs, Microsoft Office
  • Communication Tools: Gmail, Slack
• Testing Environments:
  • Mostly focused on coding!
  • Developers do more, e.g. browse the web (next session)
Simple Coding (Chen et al. 2021, Austin et al. 2021)
• e.g. HumanEval/MBPP
• Examples of usage of the Python standard library
• Includes a docstring, some example inputs/outputs, and tests
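For a sense of the format, here is the style of problem involved (paraphrased from HumanEval's first task): the model sees the signature and docstring and must produce the body, which is then run against held-out tests.

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0], 0.5)
    True
    """
    # A correct completion the tests would accept:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```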
Broader Domains: CoNaLa/ODEX (Yin et al. 2018, Wang et al. 2022)
• CoNaLa: Broader data scraped from StackOverflow
• ODEX: Adds execution-based evaluation
• Wider variety of libraries
Data Science Notebooks: ARCADE (Yin et al. 2022)
• Data science notebooks (e.g. Jupyter) allow for incremental implementation
• Allows evaluation of code in context
Dataset: SWE-bench (Jimenez et al. 2023)
• Issues from GitHub + codebases -> pull requests
• Requires long-context understanding, precise implementation
Metric: Pass@k (Chen et al. 2021)
• Basic idea: "if we generate k samples, will at least one of them pass the unit tests?"
• Evaluating with only k samples results in high variance, so we generate n > k samples, count the c correct answers, and then calculate the expected value: pass@k = E[1 - C(n-c, k) / C(n, k)]
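The numerically stable implementation of this estimator given in Chen et al. (2021) is short enough to reproduce:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples that passed, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # Expand the ratio of binomial coefficients as a stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))   # 0.05
print(pass_at_k(n=200, c=10, k=10))  # much higher, ~0.41
```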
Metric: Lexical/Semantic Overlap
• Issues w/ execution-based evaluation:
  • Requires that code be easily executable (requires unit tests, and this is hard in large libraries)
  • Ignores stylistic considerations
• BLEU: considers text n-gram overlap with human code
• CodeBLEU: also considers syntax and semantic flow (Ren et al. 2020)
• CodeBERTScore: BERTScore with CodeBERT trained on lots of code (Zhou et al. 2023)
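As a toy illustration of n-gram overlap on code (using NLTK's BLEU with naive whitespace tokenization):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "for i in range ( 10 ) : print ( i )".split()
candidate = "for j in range ( 10 ) : print ( j )".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # high overlap despite the renamed variable
```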
An Aside: Dataset Leakage
• Leakage of evaluation datasets into training data is a big problem
• ARCADE shows that novel notebooks are harder than existing online notebooks
• LiveCodeBench (Jain et al. 2023) shows that some code LMs perform disproportionately well on HumanEval
Dataset: Design2Code (Si et al. 2024)
• Code generation from screenshots of web sites
• Also proposes a Design2Code model
Metric: Visual Similarity of Web Sites
• Design2Code evaluates with two metrics:
  • High-level visual similarity: similarity between visual embeddings of the generated sites
  • Low-level element similarity: recall of each individual element
Designing Observation/Action Spaces
Coding Agents Must:
• Understand repository structure
• Read in existing code
• Modify or produce code
• Run code and debug
Example: CodeAct (Wang et al. 2024)
• Interacts w/ the environment through code
• Can execute bash commands, Jupyter commands
• Faster resolution, higher success than direct tool use
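A minimal sketch of one turn of this loop (illustrative only; the real system supports richer action types and persistent sessions): the model's reply is parsed for an executable block, the block is run, and its output becomes the next observation.

```python
import re
import subprocess

def extract_action(llm_reply: str) -> str | None:
    """Pull the first fenced bash block out of the model's reply."""
    match = re.search(r"```bash\n(.*?)```", llm_reply, re.DOTALL)
    return match.group(1) if match else None

def execute_action(command: str) -> str:
    """Run the command; combined stdout/stderr becomes the observation."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

# One turn: act, observe, then feed the observation back to the model.
reply = "Let me check the tests.\n```bash\nls tests/\n```"
action = extract_action(reply)
if action is not None:
    observation = execute_action(action)
```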
Example: SWE-Agent (Yang+Jimenez et al. 2024)
• Defines specialized tools that make it possible to efficiently explore repositories and edit code
Example: OpenHands (Wang et al. 2024)
• Defines an "event stream" for coding, execution, and browsing actions/observations
• Implements SWE-agent style actions as "agent skills" that can be called
Code-based LLMs
Basic Method: Code-generating LM
• Feed instructions and/or input code to an LM
• Virtually all serious LMs are trained on code nowadays, but some are specialized
Code Data Example: The Stack v2
• Code pre-training dataset w/ license considerations
Method: Code Infilling (Fried et al. 2022)
• In code generation, we often want to fill in code
• Solution: train for infilling
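A sketch of how fill-in-the-middle training examples are typically constructed (the sentinel tokens below follow one common convention; exact token names vary by model):

```python
def make_fim_example(code: str, span_start: int, span_end: int) -> str:
    """Cut out a middle span and move it to the end, so a left-to-right
    LM learns to generate the missing piece given prefix and suffix."""
    prefix = code[:span_start]
    middle = code[span_start:span_end]
    suffix = code[span_end:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

snippet = "def add(a, b):\n    return a + b\n"
body_start = snippet.index("    return")
# At inference time the prompt stops at <fim_middle> and the model
# generates the masked span.
print(make_fim_example(snippet, body_start, len(snippet)))
```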
Method: Long-context Extension (see Lu et al. 2024)
• In LMs, it is standard to use RoPE, a method for encoding positional information
• It does not generalize well beyond the training data, but code is long!
• It has a frequency parameter θ_i = b^(-2i/d), typically with base b = 10000
• Position interpolation: multiply θ by a constant scaling factor (e.g. C_short/C_long)
• Neural tangent kernel (NTK) scaling: scale low-frequency components, but maintain high-frequency components
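A rough sketch of how the two schemes modify RoPE's frequencies (simplified; for instance, the exact exponent in NTK-aware base scaling is omitted):

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0,
                     scale: float = 1.0) -> np.ndarray:
    """Per-dimension rotation frequencies theta_i = base**(-2i/dim)."""
    i = np.arange(dim // 2)
    return scale * base ** (-2.0 * i / dim)

train_len, target_len = 4096, 16384

# Position interpolation: shrink all frequencies uniformly so positions
# beyond the training length map back into the trained range.
theta_pi = rope_frequencies(128, scale=train_len / target_len)

# NTK-aware scaling: raise the base instead, which compresses the
# low-frequency components while leaving high frequencies nearly intact.
theta_ntk = rope_frequencies(128, base=10000.0 * target_len / train_len)
```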
Lots of Available Information for Coding!
• Current code context
• Description of issue to fix
• Repo context
• Open tabs
Example: Copilot Prompting Strategy (Thakkar 2023)
• Extract prompt given current doc and cursor position
• Identify relative path and language
• Find the 20 most recently accessed files of the same language
• Include: text before, text after, similar files, imported files, metadata about language and path
• TL;DR: lots of prompt engineering to get the most useful context in the prompt, as in the sketch below
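A highly simplified sketch of this kind of prompt assembly (the structure and field names are illustrative, not Copilot's actual format):

```python
from dataclasses import dataclass

@dataclass
class EditorState:
    path: str              # relative path of the current file
    language: str          # detected language, e.g. "python"
    before: str            # text before the cursor
    after: str             # text after the cursor (used as the FIM suffix)
    similar_files: list[tuple[str, str]]  # (path, snippet) from recent tabs

def build_prompt(state: EditorState, budget: int = 6000) -> str:
    """Pack metadata and similar-file snippets above the local prefix,
    trimming to a character budget (real systems count tokens)."""
    parts = [f"# Path: {state.path}", f"# Language: {state.language}"]
    for path, snippet in state.similar_files:
        parts.append(f"# Snippet from similar file {path}:\n{snippet}")
    parts.append(state.before)
    prompt = "\n".join(parts)
    return prompt[-budget:]  # keep the most recent context if over budget
```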
File Localization
LLM-based Localization
• Finding the correct files given user intent:

What problem or use case are you trying to solve?
When in confirmation mode it's not possible to give instructions in between steps. You have to reject an action and it seems like it doesn't know that the action was rejected.

Describe the UX of the solution you'd like
The simplest would be to have a third option, confirm action and wait. This way the action is confirmed but before it tries to take the next step you are able to give some feedback. Also if it somehow knows the action was rejected that would be helpful as well so when you do reject an action it knows that action wasn't taken.

https://github.com/All-Hands-AI/OpenHands/issues/4259

• Which JavaScript file should I modify?
• Analogous to environment understanding / exploration problems in other agents
Solution 1: Offload to the User
• Experienced users familiar with prompting and the project can specify which files to use:

In .github/workflows/openhands-resolver.yml and .github/workflows/openhands-resolver-experimental.yml, we should check to make sure that all required environment variables are set before running any additional workflows. If all of the variables are not set, we can fail immediately with an error.

https://github.com/All-Hands-AI/openhands-resolver/issues/146
Solution 2: Prompt the Agent w/ Search Tools
• e.g. SWE-agent provides a tool for searching repositories
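A bare-bones version of such a search tool (a grep-style helper for illustration; SWE-agent's real tools are more capable):

```python
from pathlib import Path

def search_repo(root: str, term: str, max_hits: int = 20) -> list[str]:
    """Return 'path:line_no: line' hits for term across Python files."""
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue  # skip unreadable files
        for no, line in enumerate(lines, 1):
            if term in line:
                hits.append(f"{path}:{no}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits

print("\n".join(search_repo(".", "def extract_action")))
```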
Solution 3: A-priori Map the Repo
• Create a map of the repo and prompt the agent with it
• Aider's repomap creates a tree-structured map of the repo
• Agentless (Xia et al. 2024) does a hierarchical search for every issue
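A toy version of such a tree-structured map (Aider's real repomap additionally ranks files and symbols by importance; this just renders directory structure for the prompt):

```python
from pathlib import Path

def repo_map(root: str, max_depth: int = 3) -> str:
    """Render a simple indented tree of the repository for the prompt."""
    base = Path(root)
    lines = []
    for path in sorted(base.rglob("*")):
        rel = path.relative_to(base)
        depth = len(rel.parts) - 1
        if depth >= max_depth or any(p.startswith(".") for p in rel.parts):
            continue  # skip deep paths and hidden dirs like .git
        lines.append("  " * depth + rel.name + ("/" if path.is_dir() else ""))
    return "\n".join(lines)

print(repo_map("."))
```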
Solution 4: Retrieval-augmented Code Generation
• Retrieve similar code, and fill it in with a retrieval-augmented LM (Hayati et al. 2018)
• Particularly, in code there is also documentation, which can be retrieved (Zhou et al. 2022)
• Unsolved issue: when to perform RAG in an agent loop
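A minimal retrieve-then-generate sketch (assumes the rank_bm25 package; snippets retrieved by BM25 are simply prepended to the generation prompt):

```python
from rank_bm25 import BM25Okapi

# A toy corpus of code snippets and documentation strings.
corpus = [
    "def read_json(path): import json; return json.load(open(path))",
    "pandas.DataFrame.merge: join DataFrame objects database-style",
    "def retry(fn, n=3): call fn up to n times, re-raising the last error",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "how to join two dataframes"
top = bm25.get_top_n(query.lower().split(), corpus, n=1)

# The retrieved context is prepended to the LM prompt before generation.
prompt = "# Relevant context:\n" + "\n".join(top) + f"\n# Task: {query}\n"
print(prompt)
```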
Planning and Error Recovery
Hard-coded Task Completion Process
• e.g. Agentless (Xia et al. 2024) has a hard-coded progression of:
  • File Localization
  • Function Localization
  • Patch Generation
  • Patch Application
LLM-Generated Plans
• LLM-generated planning step, then one or more executors
• CodeR (Chen et al. 2024)
Planning and Revisiting
• CoAct goes back and fixes mistakes (Hou et al. 2024)
Fixing Based on Error Messages
• e.g. InterCode (Yang et al. 2023)
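A sketch of the basic repair loop this enables (the `generate` function is a hypothetical stand-in for any code LM call):

```python
import subprocess

def generate(prompt: str) -> str:
    """Placeholder for a call to a code-generating LM."""
    raise NotImplementedError

def solve_with_retries(task: str, max_tries: int = 3) -> str | None:
    prompt = task
    for _ in range(max_tries):
        code = generate(prompt)
        result = subprocess.run(["python", "-c", code],
                                capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code
        # Feed the error message back so the model can repair its code.
        prompt = (f"{task}\n\nPrevious attempt:\n{code}\n"
                  f"Error:\n{result.stderr}")
    return None
```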
Safety
Coding Models Can Cause Harm!
• By accident
  • The coding model accidentally pushes to your main branch
  • The coding model is told to "make the tests pass", so it deletes the tests
• Intentionally
  • Coding agents can be used for hacking (Yang et al. 2023)
Safety Mitigation 1: Sandboxing
● We can improve safety by limiting the execution environment
● e.g. OpenHands executes all actions in Docker sandboxes, as in the sketch below
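For instance, with the Docker SDK for Python, a locked-down container for untrusted commands might look like this (the specific limits are illustrative, not OpenHands' actual configuration):

```python
import docker  # assumes the Docker SDK for Python is installed

client = docker.from_env()

# Run an untrusted command in a constrained container: no network,
# capped memory/CPU, read-only filesystem except a small scratch mount.
output = client.containers.run(
    image="python:3.11-slim",
    command=["python", "-c", "print('hello from the sandbox')"],
    network_disabled=True,
    mem_limit="512m",
    nano_cpus=1_000_000_000,  # 1 CPU
    read_only=True,
    tmpfs={"/tmp": "rw,size=64m"},
    remove=True,
)
print(output.decode())
```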
Safety Mitigation 2: Credentialing
• The principle of least privilege
• Example: GitHub fine-grained access tokens
https://github.com/settings/tokens?type=beta
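As a sketch, an agent that only needs to read one repository's issues can be handed a token created at the URL above with only that permission; the token is then used as an ordinary bearer token against the GitHub REST API:

```python
import os
import requests

# Token scoped to a single repository with minimal permissions
# (e.g. only "Issues: read"), supplied via an environment variable.
headers = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
resp = requests.get(
    "https://api.github.com/repos/All-Hands-AI/OpenHands/issues/4259",
    headers=headers,
)
print(resp.status_code)
```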
Safety Mitigation 3: Post-hoc Auditing
• e.g. OpenHands security analyzer
[Diagram: each proposed Action is checked (OK / NO); approved actions execute and produce an Observation, rejected ones are blocked]
• Using LMs, analysis, or both, as in the sketch below
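One way to implement the check (a crude rule-based `risk_score` shown for illustration; an LM-based auditor would replace or complement it):

```python
import re

DANGEROUS = [r"\brm\s+-rf\b", r"git\s+push\s+.*--force", r"curl .*\|\s*sh"]

def risk_score(action: str) -> float:
    """Crude rule-based risk estimate for a proposed shell action."""
    return 1.0 if any(re.search(p, action) for p in DANGEROUS) else 0.0

def audit(action: str, threshold: float = 0.5) -> bool:
    """Return True if the action may proceed to execution."""
    return risk_score(action) < threshold

assert audit("ls -la src/")
assert not audit("rm -rf / --no-preserve-root")
```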


Conclusion
Summary
• Copilots already very useful, code agents getting there
• Current challenges: code LLMs, editing, localization, planning, safety
• Future directions:
  • Agentic training methods
  • Human-in-the-loop
  • Broader software tasks than coding
• Thanks! And you can try out agents yourself:
https://github.com/All-Hands-AI/OpenHands
Questions?
