AgentCoder (arXiv:2312.13010v2)
Abstract

Advances in natural language processing (NLP) have been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite these advances, challenges remain in balancing code snippet generation with effective test case generation and execution. To address these issues, this paper introduces Multiagent-Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent focuses on code generation and refinement based on the test executor agent's feedback. The test designer agent generates test cases for the generated code, and the test executor agent runs the code with the test cases and writes feedback to the programmer. This collaborative system ensures more effective code generation, surpassing the limitations of single-agent models and previous strategies. Our extensive experiments on 12 LLMs and 13 optimisation approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder achieves 77.4% and 89.1% pass@1 on HumanEval-ET and MBPP-ET with GPT-3.5, while the state of the art obtains only 69.5% and 63.0%.

1 Introduction

In recent years, natural language processing (NLP) has been dramatically transformed by transformer-based large language models (LLMs). These models, notably exemplified by the GPT-x series [Brown et al., 2020b; OpenAI, 2023] developed by OpenAI, have consistently set the benchmark for performance across a wide array of standard NLP tasks. One of the most pivotal applications of these LLMs is code generation for downstream tasks, where they play a vital role in aiding developers in creating software [Feng et al., 2020; Wang et al., 2021; Wang et al., 2023b; Nijkamp et al., 2023b; Nijkamp et al., 2023a; Li et al., 2023b]. Through extensive pretraining on substantial code-related datasets, such as publicly available data on GitHub, these code LLMs acquire an intricate contextual understanding that can be effectively applied to diverse code-related tasks.

Numerous recent efforts have been made to improve the effectiveness of code generation models by incorporating in-context learning and its variations [Dong et al., 2023b; Wei et al., 2022; Le et al., 2023; Huang et al., 2023; Zhang et al., 2023b; Chen et al., 2023b; Madaan et al., 2023], where an important optimization path is self-refinement. For example, Zhang et al. proposed Self-Edit to enhance the performance of LLMs in code generation. In particular, Self-Edit runs the code generation model's generated code against test cases that are manually written by developers. If the code fails to pass these test cases, Self-Edit prompts the code generation model to refine the function with its fault-aware code editor, using the provided error messages. Nevertheless, Self-Edit requires that developers write test cases to verify the correctness of the generated function. This requirement can be particularly demanding and challenging for users who lack expertise in the specific domain, which potentially impedes the effectiveness of the self-editing process.

To overcome this challenge, Huang et al. introduced CodeCoT, which adopts a step-by-step strategy for code generation, tasking the code generation model to generate both the function and the corresponding test cases. CodeCoT also establishes a connection with a terminal interface, instructing the code generation model to self-refine the code based on the error messages returned by the terminal. This approach not only reduces the burden on developers in terms of writing test cases but also ensures that the generated code undergoes software testing and refinement.

Although CodeCoT makes substantial strides in enhancing the effectiveness of code generation models, the tests and the code are generated within the same conversation; in other words, the code generation and test generation processes are not independent. This practice brings constraints that arise from the potential trade-off between excelling in code generation and maintaining the effectiveness of test case generation: as the model achieves high performance in generating code snippets, there may be a corresponding decrease in the effectiveness of test case generation [Chen et al., 2023a; Zhang et al., 2023a]. This trade-off occurs due to the model's limited resources and its focus on optimizing one aspect of the code generation process, which might inadvertently compromise the quality of other tasks [Chen et al., 2023a; Zhang et al., 2023a]. In addition, the tests generated immediately after the code in one conversation can be biased and affected by the code, losing objectivity and diversity in the testing (see Tab. 5).

In this paper, we address the above-mentioned problems by proposing Multiagent-Code Generation, namely AgentCoder. AgentCoder contains three different agents, i.e., the programmer agent, the test designer agent, and the test executor agent. The programmer agent interacts with advanced code generation models to create code based on the coding requirements. The test designer agent independently designs diverse and comprehensive test cases with code generation models, based only on the coding requirements. The test executor agent interacts with both the programmer agent and the test designer agent: it executes the tests from the test designer agent against the code generated by the programmer agent and then provides the test execution results to the programmer agent. Once the test executor agent obtains the feedback from the local environment (i.e., the local terminal), it checks whether the feedback contains error information (e.g., runtime errors and assertion errors). If the generated code passes all test cases, the test executor agent delivers the code snippets to the human developer. Otherwise, the test executor agent feeds the error information back to the programmer agent and requires it to fix the bugs reported in the feedback. The iteration continues until the feedback shows that all test cases pass the code snippets or the iteration budget is exhausted, at which point the code snippets are reported to the human developer even if the code is still buggy.

Our extensive experiments with 12 LLMs and 13 enhancement approaches demonstrate that AgentCoder significantly improves the effectiveness of existing code generation models, outperforming all baseline approaches. In particular, AgentCoder obtains an average of 91.5% and 84.1% pass@1 on all the datasets with GPT-4 and GPT-3.5, respectively, while the state-of-the-art approaches obtain 86.8% and 75.3%. On HumanEval-ET and MBPP-ET, AgentCoder obtains 77.4% and 89.1% pass@1 with GPT-3.5, while the state-of-the-art approaches obtain only 69.5% and 63.0%. The effectiveness of AgentCoder is fueled by the collaborative synergy among its agents. Within this agent system, the programmer agent excels in crafting high-quality code snippets, complemented by the test designer agent's expertise in designing varied, challenging, and objective test cases. The test executor agent plays a pivotal role by critically evaluating the code against these test cases, ensuring both functionality and reliability. Such collaboration fosters a dynamic feedback loop that facilitates successive enhancements. AgentCoder overcomes the constraints inherent in single-agent code generation models by allocating distinct tasks to different agents. This division not only balances the focus between code and test case generation but also supports a more objective testing process. Additionally, its modular design provides the flexibility and scalability crucial to adapting to technological advancements: agents within AgentCoder can be individually updated or replaced with more sophisticated models, maintaining the framework's technological edge. This adaptability positions AgentCoder as an effective and evolving solution in the ever-changing landscape of software development.

Our main contributions are as follows:

• Introduction of AgentCoder: We propose AgentCoder, a novel multi-agent framework for code generation that contains three distinct agents, i.e., the programmer agent, the test designer agent, and the test executor agent.

• Comprehensive Evaluation: We conduct an extensive evaluation with 12 LLMs and 13 LLM-based optimisation approaches, which demonstrates that AgentCoder outperforms all the baselines in code generation. In particular, AgentCoder obtains 77.4% and 89.1% pass@1 on HumanEval-ET and MBPP-ET with GPT-3.5, while the state of the art obtains only 69.5% and 63.0%.

• In-Depth Analysis and Ablation Studies: We conduct a deep analysis of our results and ablation studies, which demonstrate the contribution of the different agents, the effectiveness of the tests generated by the test designer agent, and the necessity of using separate agents for code generation and test case design.

• Modularity: The modular structure of our framework not only ensures adaptability and scalability but also facilitates future enhancements and integration with more advanced models, positioning AgentCoder as a resilient solution in the evolving landscape of code generation.

2 Related Work

2.1 Large Language Model for Code Generation

Large Language Models (LLMs) have been widely studied for code generation tasks. Various architectures have been explored in these models, some notable examples being CodeBERT [Feng et al., 2020], PLBART [Ahmad et al., 2021], and CodeGPT [Zan et al., 2022]. These models are pre-trained on code corpora to develop a deep understanding of code syntax, semantics, and idiomatic constructs. Some innovative approaches integrate structured representations to enhance their comprehension of the complexities in code. For example, GraphCodeBERT [Guo et al., 2020] incorporates graph-based representations, while CodeT5+ [Wang et al., 2023b] combines the encoder-decoder paradigm with the structural essence of code. These enhancements aim to give the models a more fine-grained understanding of code relationships and dependencies beyond purely syntactic patterns. A current trend is the construction of large-scale models (e.g., Codex [Chen et al., 2021b] and CodeGen [Nijkamp et al., 2023b]) with billions of parameters, which have demonstrated state-of-the-art performance in code generation tasks. Recently, foundation models (e.g., GPT-3.5-turbo and GPT-4) have also been used for code generation [Madaan et al., 2023; Huang et al., 2023] and likewise deliver state-of-the-art performance on code generation tasks.

2.2 Enhancing Code Generation through Prompt Engineering

Recent advances in code generation have been significantly influenced by the integration of few-shot learning techniques with LLMs. A notable contribution in this realm is the concept of self-refinement with few-shot prompting, as proposed by Madaan et al. This approach involves an LLM iteratively refining its own generated code, leading to significant improvement in code quality. Another approach is the Self-Debugging technique introduced by Chen et al., which involves testing the generated code against user-provided test cases. In scenarios where such test cases are unavailable, the model engages in direct debugging by explaining the code, thus addressing potential issues. Complementing these methods, Huang et al. introduced CodeCoT, employing a Self-Exam Chain of Thought (CoT) process. This technique guides the model to generate code alongside test cases, which is particularly useful when external test cases are not available. CodeCoT adds a layer of logical reasoning to the code generation process. However, it is important to note that while this method can identify syntax errors, functional errors may still go undetected because both the code and its test cases are generated by the same model. Building upon these concepts, Dong et al. proposed the Self-Collaboration model, which divides the LLMs into different roles: an analyst, a coder, and a tester. The tester is powered by an LLM that predicts whether the code is buggy. Such a practice may miss many bugs in the code because the code is not executed in the local environment.

2.3 Multi-agent Collaboration

A multi-agent system (MAS) is a framework in which multiple autonomous agents interact with each other. These agents, which can be program scripts, software bots, or robots, operate in a shared environment and can communicate, cooperate, compete, or negotiate with each other. Each agent in a multi-agent system has its own capabilities, goals, and perceptions, and works either independently or together with others to achieve complex goals or solve problems. The integration of LLMs within multi-agent collaboration systems represents a cutting-edge area of research in the deep learning community. For example, HuggingFace proposes HuggingGPT to solve complex AI tasks with HuggingFace models. Zhang et al. propose ProAgent to address robotic tasks by analyzing the current context, anticipating teammates' intentions, and formulating its strategies based on this reasoning. Chen et al. propose VisualGPT to utilize vision PLMs to address image captioning tasks.

3 Methodology

The framework of AgentCoder and its pipeline are illustrated in Fig. 1. The process begins by inputting the task/code generation requirement/description into the code generation agent (Agent#1: the programmer agent). Subsequently, the test case generator (Agent#2: the test designer agent) is tasked with generating test cases, which are used to evaluate the correctness of the code snippets produced by the programmer agent. The code snippets and test cases are collected by the test executor agent (Agent#3) and executed in the local environment (local terminal) to obtain feedback (i.e., whether the code passes all tests, and the error message if the code fails some tests). If the test executor agent finds that the code snippets pass all test cases, it returns the code to the user and finishes the iteration. Otherwise, the test executor agent returns the test execution error messages to the programmer agent. The iteration then continues, with the programmer agent regenerating code snippets to address the issues identified in the feedback and the test executor agent re-executing the new code and providing new feedback to the programmer agent, until the test executor agent finds that the code passes all the tests.

3.1 Programmer agent: code generation with Chain-of-Thought instruction

In our framework, the programmer agent is powered by LLMs and needs to handle two scenarios, i.e., code generation and code refinement. Specifically, as shown in Fig. 1, during the code generation stage, when the human developer requires the programmer agent to generate code snippets for a specific task, the programmer agent employs a Chain-of-Thought approach to simulate the typical programming process, methodically breaking down the task into smaller, manageable steps. The Chain-of-Thought process is instructed to contain four steps, i.e., problem understanding and clarification, algorithm and method selection, pseudocode creation, and code generation (a prompt and response example is shown in Appendix A.3, Figures 6 and 7). Taking the coding task "Check if in given list of numbers, are any two numbers closer to each other than given threshold" (shown in Figure 1) as an example, during the initial code generation the programmer agent first tries to understand and clarify the given task, in this case interpreting the requirement as identifying pairs of numbers in a list that are within a specified threshold of each other. The programmer agent then decides on an algorithm or method to solve the problem, which could involve choosing an efficient way to compare each pair of numbers in the list. Next, during pseudocode creation, the programmer agent develops a step-by-step guide or pseudocode for the solution, ensuring a logical flow of operations. Finally, in the code generation stage, the programmer agent translates the pseudocode into executable code.

Code snippets generated by the programmer agent can be incorrect, containing various types of errors (e.g., syntax and runtime errors) that lead to failed test cases from the test designer agent. Under such circumstances, the programmer agent takes the feedback from the other agents and refines the code snippets. The refinement process is iterative, with the programmer agent continuously enhancing the code based on feedback until the code successfully passes all test cases.
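To make the loop concrete, the following is a minimal sketch of the three-agent interaction described above, assuming a generic call_llm helper for the underlying model (e.g., GPT-3.5 or GPT-4). The function names, prompt wording, feedback format, and budget handling are illustrative assumptions, not the paper's implementation.

```python
import subprocess
import tempfile


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the underlying LLM."""
    raise NotImplementedError


def programmer_agent(requirement: str, feedback: str = "") -> str:
    """Generate (or refine) a code snippet using a Chain-of-Thought instruction."""
    prompt = (
        "**Role**: You are a software programmer.\n"
        "**Task**: Complete the function for the requirement below. "
        "Work through four steps: (1) problem understanding and clarification, "
        "(2) algorithm/method selection, (3) pseudocode creation, (4) code generation.\n"
        f"Requirement:\n{requirement}\n"
    )
    if feedback:
        prompt += f"\nThe previous code failed some tests:\n{feedback}\nPlease fix it.\n"
    return call_llm(prompt)


def test_designer_agent(requirement: str) -> str:
    """Independently generate basic, edge, and large-scale assert-based tests."""
    prompt = (
        "**Role**: You are a tester. Write assert-based test cases "
        "(basic, edge, and large-scale) for the following requirement, "
        "without seeing any candidate implementation:\n" + requirement
    )
    return call_llm(prompt)


def test_executor_agent(code: str, tests: str) -> tuple[bool, str]:
    """Run the code and the tests in the local environment; return (passed, error feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return False, "Execution timed out."
    return proc.returncode == 0, proc.stderr


def agentcoder(requirement: str, budget: int = 5) -> str:
    """Iterate programmer -> executor until all tests pass or the budget is exhausted."""
    tests = test_designer_agent(requirement)  # tests are generated once, independently of the code
    code = programmer_agent(requirement)
    for _ in range(budget):
        passed, feedback = test_executor_agent(code, tests)
        if passed:
            return code                       # ready-to-use code for the human developer
        code = programmer_agent(requirement, feedback)
    return code                               # returned even if still failing, per the budget rule
```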
[Figure 1: The AgentCoder pipeline. The human developer submits a code requirement ("I want to build a program that …", e.g., complete has_close_elements, which checks whether any two numbers in a given list are closer to each other than a given threshold). The programmer agent produces code snippets, the test designer agent produces tests (e.g., assert has_close_elements([1.0, 1.4, 3.0], 0.5) == True), and the test executor agent runs the programs in the local environment, returning error feedback to the programmer agent until the ready-to-use code is handed back to the developer.]
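The "Error Feedback" arrow in Figure 1 carries whatever error information the local run produces (e.g., a runtime error or an assertion error, as described in Section 3). The extracted text does not specify a concrete message format, so the payload below is only an assumed shape for illustration:

```python
# Hypothetical shape of the feedback the test executor agent relays to the
# programmer agent; the field names are illustrative, not from the paper.
feedback = {
    "passed": False,
    "failing_test": "assert has_close_elements([1.0, 1.4, 3.0], 0.5) == True",
    "error_type": "AssertionError",  # could also be a runtime error such as TypeError
    "request": "Fix the reported bug and regenerate the code snippet.",
}
```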
... of coding scenarios. The enhanced versions, HumanEval-ET and MBPP-ET, include more adequate test cases, making them more challenging and better suited for evaluating advanced models.

Table 2: Contribution of different agents in AgentCoder (pass@1, %).

Agents                        HumanEval       HumanEval-ET    MBPP            MBPP-ET
programmer agent only         61.0            52.4            47.9            35.0
programmer + test designer    64.0 (11.7%)    54.3 (27.2%)    62.3 (19.3%)    45.9 (24.7%)
programmer + test executor    64.6 (12.7%)    55.5 (30.0%)    69.3 (32.8%)    51.4 (39.7%)
AgentCoder                    79.9 (39.4%)    77.4 (81.3%)    89.9 (72.2%)    89.1 (142.1%)
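The table reports pass@1 scores. Under single-sample (greedy) decoding, pass@1 reduces to the fraction of benchmark tasks whose generated solution passes all of that task's test cases; a minimal sketch, assuming a per-task list of pass/fail outcomes, is shown below.

```python
def pass_at_1(per_task_passed):
    """pass@1 under single-sample decoding: the percentage of tasks whose one
    generated solution passes every test case for that task.
    per_task_passed: list of booleans, one entry per benchmark task (assumed input format).
    """
    if not per_task_passed:
        return 0.0
    return 100.0 * sum(per_task_passed) / len(per_task_passed)


# Example: 3 of 4 tasks solved -> 75.0
print(pass_at_1([True, True, False, True]))
```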
Figure 2: A case illustration of CodeCoT- and AgentCoder-generated code for a HumanEval task. CodeCoT omits the abs() function when checking whether the difference between two values is lower than the threshold, whereas AgentCoder employs abs() to handle negative differences.
Figure 3: A case illustration of CodeCoT- and AgentCoder-generated tests for HumanEval Task 1. CodeCoT only covers cases where the left value is lower than the right value, because its tests are generated together with its code, which likewise omits the abs() function; AgentCoder covers both scenarios (i.e., the left value being lower or larger than the right value) and groups its tests into basic, edge, and large-scale cases.
Figure 4: A case illustration of CodeCoT- and AgentCoder-generated code for the MBPP task "Write a python function to check whether the given array is monotonic or not." Both CodeCoT's and AgentCoder's code are correct; however, CodeCoT ignores edge cases (e.g., a list that contains no values).
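Only fragments of the two listings in Figure 4 survive extraction; combining them gives a sketch of the kind of solution the figure shows. The variable initialization and the final return statement below are completed from context and are not verbatim from the paper:

```python
def is_Monotonic(arr):
    """Check whether the given array is monotonic (entirely non-increasing
    or entirely non-decreasing).

    Args:
        arr (list): input list of integers
    Returns:
        bool: True if the array is monotonic, False otherwise
    """
    increasing = decreasing = True
    for i in range(len(arr) - 1):
        if arr[i] > arr[i + 1]:
            increasing = False
        if arr[i] < arr[i + 1]:
            decreasing = False
    # Monotonic if neither the non-decreasing nor the non-increasing property was broken.
    return increasing or decreasing
```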
Figure 5: A case illustration of CodeCoT- and AgentCoder-generated tests for the same MBPP task. CodeCoT does not consider a list that contains no values, and this scenario is also ignored in its generated code, whereas AgentCoder's edge cases cover it; AgentCoder additionally includes a large-scale test over an increasing sequence of 100,000 elements.
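Figure 5's test listings likewise survive only as fragments. Together with the caption they suggest a suite like the following; the basic and empty-list assertions are illustrative additions consistent with the caption, the large-scale case comes from the surviving fragment, and the function name is normalized to is_Monotonic to match Figure 4:

```python
# Basic case: a non-increasing sequence is monotonic
assert is_Monotonic([6, 5, 4, 4]) == True

# Edge case highlighted in the caption: a list with no values
assert is_Monotonic([]) == True

# Large-scale case (from the figure fragment): an increasing sequence of 100,000 elements
large_input = [i for i in range(1, 100001)]
assert is_Monotonic(large_input) == True  # should be considered monotonic
```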
AgentCoder: Programmer Prompt
**Role**: You are a software programmer.
**Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break
down the problem, create pseudocode, and then write the code in Python language. Ensure that your code is
efficient, readable, and well-commented.
For example:
**Instructions**:
1. **Understand and Clarify**: Make sure you understand the task.
2. **Algorithm/Method Selection**: Decide on the most efficient way.
3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode.
4. **Code Generation**: Translate your pseudocode into executable Python code.
2. **Algorithm/Method Selection**:
- A straightforward way to solve this would be to compare each number with every other number in the list.
- We can use a nested loop where each number is compared to every number that comes after it in the list.
- The absolute difference between each pair will be calculated and compared to the threshold.
3. **Pseudocode Creation**:
```
function has_close_elements(numbers, threshold):
loop through each number in the list (i)
for each number, loop through the remaining numbers in the list (j)
if the absolute difference between numbers[i] and numbers[j] is less than or equal to the threshold
return True
if no such pair is found, return False
```
4. **Code Generation**:
- Now, I'll translate the pseudocode into Python code.
```python
from typing import List
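
# (The listing is truncated at this point in the extraction. The completion
#  below follows the pseudocode in step 3 and is a reconstruction, not
#  necessarily the paper's verbatim example.)

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # Loop through each number in the list.
    for i in range(len(numbers)):
        # For each number, loop through the remaining numbers in the list.
        for j in range(i + 1, len(numbers)):
            # Compare the absolute difference with the threshold (<=, as in the pseudocode).
            if abs(numbers[i] - numbers[j]) <= threshold:
                return True
    # No such pair was found.
    return False
```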
AgentCoder: Test Designer Prompt
**Instructions**:
- Implement a comprehensive set of test cases following the guidelines above.
- Ensure each test case is well-documented with comments explaining the scenario it covers.
- Pay special attention to edge cases as they often reveal hidden bugs.
- For large-scale tests, focus on the function's efficiency and performance under heavy loads.
```python
# Test 1: Standard list with some close elements
assert has_close_elements([1.0, 2.5, 3.5, 5.0], 1.0) == True
```

```python
# Test 1: Empty list
assert has_close_elements([], 1.0) == False
```

For large-scale testing, I'll focus on the function's performance with a significantly large list. Due to the constraints of this platform, I'll conceptualize the test case:

```python
# Large Scale Test 1: List with 100,000 elements in a predictable pattern
large_list = [i * 0.1 for i in range(100000)]  # Creates a list [0, 0.1, 0.2, ..., 9999.9]
```
import json

from tqdm import tqdm


def preprocess_data(task):
    """Strip the Markdown code fence from a generated completion, keeping only the code body."""
    if "```py" in task["completion"]:
        task["completion"] = task["completion"][task["completion"].find("```py") + len("```py"):]
        task["completion"] = task["completion"][:task["completion"].find("```")]
    elif "```" in task["completion"]:
        task["completion"] = task["completion"][task["completion"].find("```") + 3:]
        task["completion"] = task["completion"][:task["completion"].find("```")]
    return task
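A quick, hypothetical illustration of what preprocess_data does with a completion wrapped in a Markdown fence:

# Hypothetical input: a completion wrapped in a ```py fence.
raw = "Here is the code:\n```py\ndef add(a, b):\n    return a + b\n```\nDone."
task = {"completion": raw}
print(preprocess_data(task)["completion"])
# prints the extracted body (surrounding newlines preserved):
# def add(a, b):
#     return a + b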
def test_report(dataset, lg):
    """Report pass rates for the current snippets (loop body elided in the excerpt)."""
    correct = 0
    for i in tqdm(range(len(dataset))):
        # (Loop body elided in the excerpt.)
        pass


def test_agent(dataset, lg):
    """Run each generated snippet against the agent-generated tests and record the outcome."""
    correct = 0
    for i in tqdm(range(len(dataset))):
        # process_humaneval_test and check_correctness are helpers from the
        # AgentCoder evaluation harness (not shown in this excerpt).
        dataset[i]["full_code"] = process_humaneval_test(
            dataset[i], dataset, example_test=False, language=lg, test_case=False)
        result = check_correctness(dataset[i]["task_id"], dataset[i], lg, 5, "./tmp")
        if result["passed"]:
            correct += 1
        dataset[i]["result"] = result["result"]
        dataset[i]["passed"] = result["passed"]
    print("============Start Agent Testing=================")
    print("test_agent", correct)
    return dataset


model_list = ["gpt-3.5-turbo", "palm-2-codechat-bison", "claude-instant-1", "gpt-4-1106-preview", "gpt-4"]
language = ["py"]
lg = language[0]  # only Python ("py") is evaluated in this excerpt
for model_name in model_list:
    print(f"=================={model_name}================")
    epoch = 5
    path = AgentCoderProgrammerSaveResultPath  # path to the programmer agent's saved completions (defined elsewhere)
    with open(path, "r") as f:
        dataset = json.load(f)
    for current_epoch in range(epoch):
        with open(f"./dataset/{model_name}_{current_epoch}.json", "w") as f:
            json.dump(dataset, f)
        test_report(dataset, lg)
        test_agent(dataset, lg)
        # call_completion queries the model again for refined completions for the next round.
        dataset = call_completion(dataset, model_name, lg)
**Instructions**:
1. **Understand and Clarify**: Make sure you understand the task. If necessary, write down what the function
should do.
2. **Algorithm/Method Selection**: Decide on the most efficient way to compare the numbers in the list to find if
any two are within the threshold.
3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. This should outline how you will
iterate through the list and compare the numbers.
4. **Code Generation**: Translate your pseudocode into executable Python code. Remember to test your function
with the provided examples and any additional cases you think are relevant.
# Run doctest
if __name__ == "__main__":
    import doctest
    doctest.testmod()