LLM-Based Multi-Agent Systems For Software Engineering: Literature Review, Vision and The Road Ahead
1 INTRODUCTION
Autonomous agents, defined as intelligent entities that autonomously perform specific tasks through environ-
mental perception, strategic self-planning, and action execution [6, 36, 98], have emerged as a rapidly expanding
research field since the 1990s [93]. Despite initial advancements, these early iterations often lacked the sophistication of human intelligence [127]. However, the recent advent of Large Language Models (LLMs) [65] has marked a turning point: LLMs have demonstrated cognitive abilities nearing human levels in planning and reasoning [3, 65], which aligns with the expectations for autonomous agents. As a result, there is an increased
research interest in integrating LLMs at the core of autonomous agents [90, 134, 148] (for short, we refer to them
as LLM-based agents in this paper).
Nevertheless, the application of a single LLM-based agent encounters limitations, since real-world problems
often span multiple domains, requiring expertise from various fields. In response to this challenge, developing
LLM-Based Multi-Agent (LMA) systems represents a pivotal evolution, aiming to boost performance via synergistic
collaboration. An LMA system harnesses the strengths of multiple specialized agents, each with unique skills and
responsibilities. These agents work in concert towards a common goal, engaging in collaborative activities like
debate and discussion. These collaborative mechanisms have been proven to be instrumental in encouraging
divergent thinking [80], enhancing factuality and reasoning [32], and ensuring thorough validation [146]. As a
result, LMA systems hold promise in addressing a wide range of complicated real-world scenarios across various
sectors [49, 131, 137], such as software engineering [48, 75, 90, 109].
The study of software engineering (SE) focuses on the entire lifecycle of software systems [63], including stages
like requirements elicitation [39], development [2], and quality assurance [126], among others. This multifaceted
discipline requires a broad spectrum of knowledge and skills to effectively tackle its inherent challenges in each
stage. Integrating LMA systems into software engineering introduces numerous benefits:
(1) Autonomous Problem-Solving: LMA systems can bring significant autonomy to SE tasks. They naturally decompose high-level requirements into sub-tasks and detailed implementations, mirroring agile and iterative methodologies [68] in which tasks are broken down and assigned to specialized teams or individuals. By automating this process, developers are freed to focus on strategic planning, design thinking, and innovation.
(2) Robustness and Fault Tolerance: LMA systems address robustness issues through cross-examination in
decision-making, akin to code reviews and automated testing frameworks, thus detecting and correcting
faults early in the development process. On their own, LLMs may produce unreliable outputs, known as
hallucination [151, 164], which can lead to bugs or system failure in software development. However, by
employing methods like debating, examining, or validating responses from multiple agents, LMA systems
ensure convergence on a single, more accurate, and robust solution. This enhances the system’s reliability
and aligns with best practices in software quality assurance.
(3) Scalability to Complex Systems: The growth in complexity of software systems, with increasing lines of
code, frameworks, and interdependencies, demands scalable solutions in project management and develop-
ment practices. LMA systems offer an effective scaling solution by incorporating additional agents for new
technologies and reallocating tasks among agents based on evolving project needs. LMA systems ensure
that complex projects, which may be overwhelming for individual developers or traditional teams, can be
managed effectively through distributed intelligence and collaborative agent frameworks.
Existing research has illuminated the critical roles of these collaborative agents in advancing toward the era of
Software Engineering 2.0 [90]. LMA systems are expected to significantly speed up software development, drive
innovation, and transform the current software engineering practices. This article aims to delve deeper into the
roles of LMA systems in shaping the future of software engineering. It spotlights the current progress, emerging
challenges, and the road ahead. We provide a systematic review of LMA applications in SE, complemented by
two case studies that assess current LMA systems’ capabilities and limitations. From this analysis, we identify
key research gaps and propose a comprehensive agenda structured in two phases: (1) enhancing individual agent
capabilities and (2) optimizing agent collaboration and synergy. This roadmap aims to guide the development
of autonomous, scalable, and trustworthy LMA systems, paving the way for the next generation of software
engineering.
To summarize, this study makes the following key contributions:
• We conduct a systematic review of 71 recent primary studies on the application of LMA systems in software
engineering.
• We perform two case studies to illustrate the current capabilities and limitations of LMA systems.
• We identify the key research gaps and propose a structured research agenda that outlines potential future
directions and opportunities to advance LMA systems for software engineering tasks.
2 PRELIMINARY
2.1 Autonomous Agent
An autonomous agent is a computational entity designed for independent and effective operation in dynamic
environments [98]. Its essential attributes are:
• Autonomy: Independently manages its actions and internal state without external controls.
• Perception: Detects the changes in the surrounding environment through sensory mechanisms.
• Intelligence and Goal-Driven: Aims for specific goals using domain-specific knowledge and problem-solving
abilities.
• Social Ability: Can interact with humans or other agents and manage social relationships to achieve goals.
• Learning Capabilities: Continuously adapts, learns, and integrates new knowledge and experiences.
1 https://openai.com/chatgpt/
2 https://claude.ai/
3 https://gemini.google.com/app
2.3.1 Orchestration Platform. The orchestration platform serves as the core infrastructure that manages interac-
tions and information flow among agents. It facilitates coordination, communication, planning, and learning,
ensuring efficient and coherent operation. The orchestration platform defines various key characteristics:
(1) Coordination Models: Defines how agents interact, such as cooperative (collaborating towards shared goals) [1],
competitive (pursuing individual goals that may conflict) [147], hierarchical (organized with leader-follower
relationships) [166], or mixed models.
(2) Communication Mechanisms: Determines how information flows between agents. It defines the organization of communication channels, including centralized (a central agent facilitates communication [4]),
decentralized (agents communicate directly [22]), or hierarchical (information flows through layers of author-
ity [166]). Moreover, it specifies the data exchanged among agents, often in text form. In software engineering
contexts, this may include code snippets, commit messages [169], forum posts [42–44], bug reports [12], or
vulnerability reports [55].
(3) Planning and Learning Styles: The orchestration platform specifies how planning and learning are conducted
within the multi-agent system. It determines how tasks are allocated and coordinated among agents. It includes
strategies like Centralized Planning, Decentralized Execution (CPDE) – planning is conducted centrally, but
agents execute tasks independently, or Decentralized Planning, Decentralized Execution (DPDE) – both planning
and execution are distributed among agents.
2.3.2 LLM-Based Agents. Each agent may have unique abilities and specialized roles, enhancing the system’s
ability to handle diverse tasks effectively. Agents can be:
(1) Predefined or Dynamically Generated: Agent profiles can be explicitly predefined [48] or dynamically
generated by LLMs [134], allowing for flexibility and adaptability.
(2) Homogeneous or Heterogeneous: Agents may have identical functions (homogeneous) or diverse functions
and expertise (heterogeneous).
Each LLM-based agent can be represented as a node $v_i$ in a graph $G(V, E)$, where edges $e_{i,j} \in E$ represent interactions between agents $v_i$ and $v_j$.
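To make this graph view concrete, the following minimal sketch (our own illustration in Python, not tied to any particular framework) models agents as nodes and their interactions as edges:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str                                   # node identifier, i.e., v_i
    role: str                                   # e.g., "programmer", "reviewer"
    skills: list[str] = field(default_factory=list)

@dataclass
class LMASystem:
    agents: dict[str, Agent] = field(default_factory=dict)   # vertex set V
    edges: set[frozenset[str]] = field(default_factory=set)  # interaction edges E

    def add_agent(self, agent: Agent) -> None:
        self.agents[agent.name] = agent

    def connect(self, a: str, b: str) -> None:
        # an edge e_{i,j}: agents a and b exchange messages during the workflow
        self.edges.add(frozenset((a, b)))

# Example: a programmer agent whose output is examined by a reviewer agent.
system = LMASystem()
system.add_agent(Agent("programmer", "development", ["python"]))
system.add_agent(Agent("reviewer", "quality assurance", ["code review"]))
system.connect("programmer", "reviewer")
```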
3 LITERATURE REVIEW
In this section, we review recent studies on LMA systems in software engineering, organizing these applications
across various stages of the software development lifecycle, including requirements engineering, code generation,
quality assurance, and software maintenance. We also examine studies on LMA systems for end-to-end software
development, covering multiple SDLC phases rather than isolated stages.
Search Strategy: We conduct a keyword-based search on the DBLP publication database [28] to match paper
titles. DBLP is a widely used resource in software engineering surveys [20, 23, 161], which indexes over 7.5
million publications across 1,800 journals and 6,700 academic conferences in computer science.
Our search included two sets of keywords: one set targeting LLM-based Multi-Agent Systems (called [agent
words]) and the other focusing on specific software engineering activities (called [SE words]). Papers may use
variations of the same keyword. For example, the term “vulnerability” may appear as “vulnerable” or “vulnera-
bilities.” To address this, we use truncated terms like “vulnerab” to capture all related forms. For LMA systems,
we used keywords: “Agent” OR “LLM” OR “Large Language Model” OR “Collaborat”. To ensure comprehensive
coverage of SE activities, we incorporated phase-specific keywords for each stage of the SDLC into our search
queries:
(1) Requirements Engineering: requirement, specification, stakeholder
(2) Code Generation: software, code, coding, program
(3) Quality Assurance: bug, fault, defect, fuzz, test, vulnerab, verificat, validat
documenter, performing nine actions to help generate high-quality requirements models and specifications.
Sami et al. [115] propose another LMA framework to generate, evaluate, and prioritize user stories through a
collaborative process involving four agents: product owner, developer, quality assurance (QA), and manager.
The product owner generates user stories and initiates prioritization. The QA agent assesses story quality and identifies risks, while the developer prioritizes based on technical feasibility. Finally, the manager synthesizes these inputs and finalizes prioritization after discussions with all agents.
Agent Forest [77] adopts a different paradigm from role specialization. Instead, it utilizes a sampling-and-voting framework, where multiple agents independently generate candidate outputs. Each output is then
evaluated based on its similarity to the others, with a cumulative similarity score calculated for each. The output
with the highest score—indicating the greatest consensus among the agents—is selected as the final solution.
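The following minimal sketch illustrates this sampling-and-voting step; it is our own simplification, using token overlap as a stand-in for whatever similarity measure an implementation might choose:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens; a placeholder similarity metric."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_by_consensus(candidates: list[str]) -> str:
    """Return the candidate with the highest cumulative similarity to all other candidates."""
    scores = [
        sum(similarity(c, o) for j, o in enumerate(candidates) if j != i)
        for i, c in enumerate(candidates)
    ]
    return candidates[scores.index(max(scores))]

# Three independently sampled answers; the two agreeing outputs outvote the outlier.
outputs = ["def add(a, b): return a + b",
           "def add(a, b): return a + b",
           "def add(a, b): return a - b"]
print(select_by_consensus(outputs))
```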
AutoCodeOver by improving program fixes through iterative searches and specification analysis based on inferred
code intent. ACFIX [160] targets access control vulnerabilities in smart contracts, focusing on Role-Based Access
Control. It mines common RBAC patterns from over 344,000 contracts to guide agents in generating patches.
DEI [159] resolves GitHub issues by using a meta-policy to select the best solution, integrating and re-ranking
patches generated by different agents for improved issue resolution. SWE-Search [7] consists of three agents: the
SWE-Agent for adaptive exploration, the Value Agent paired with a Monte Carlo tree search module for iterative
feedback and utility estimation, and the Discriminator Agent for collaborative decision-making through debate.
RepoUnderstander [92] constructs a knowledge graph for a full software repository and also uses Monte Carlo
tree search to assist in understanding complex dependencies.
Code Review. Rasheed et al. [111] developed an automated code review system that identifies bugs, detects
code smells, and provides optimization suggestions to improve code quality and support developer education.
This system uses four specialized agents focused on code review, bug detection, code smells, and optimization.
Similarly, CodeAgent [124] performs code reviews with sub-tasks such as vulnerability detection, consistency
checking, and format verification. A supervisory agent, QA-Checker, ensures the relevance and coherence of
interactions between agents during the review process.
Test Case Maintenance. Lemner et al. [74] propose two multi-agent architectures to predict which test cases
need maintenance after source code changes. These agents perform tasks including summarizing code changes,
identifying maintenance triggers, and localizing relevant test cases.
Figure: the Waterfall process model (Requirements, Design, Development, Testing, Maintenance) contrasted with the Agile process model (Sprint Backlog, Planning, Development, Review, Testing).
cycles. AgileGen enhances Agile practices with human-AI collaboration, integrating close user involvement to
ensure alignment between requirements and generated code. A notable feature of AgileGen is its use of the
Gherkin language to create testable requirements, bridging the gap between user needs and code implementation.
While most methods rely on predefined roles and fixed workflows for software development, a few works [78, 82, 135] investigate dynamic process models. Think-on-Process (ToP) [82] introduces a dynamic process generation
framework. Since software development processes can vary significantly depending on project requirements,
ToP moves beyond the limitations of static, one-size-fits-all workflows to enable more flexible and efficient
development practices. Given a software requirement, this framework leverages LLMs to create tailored process
instances based on their knowledge of software development. These instances act as blueprints to guide the
architecture of the LMA system, adapting to the specific and diverse needs of different projects. Similarly, in
MegaAgent [135], agent roles and tasks are not predefined but are generated and planned dynamically based
on project requirements. Both ToP and MegaAgent highlight the shift from rigid, static workflows to dynamic,
adaptive systems. These frameworks promise more efficient, flexible, and context-aware software development
practices, aligning processes with project-specific requirements and complexities.
Additionally, instead of focusing on the process model, several works [107, 108] explore leveraging experiences
from past software projects to enhance new software development efforts. Co-Learning [107] enhances agents’
software development abilities by utilizing insights gathered from historical communications. This framework
fosters cooperative learning between two agent roles—instructor and assistant—by extracting and applying
heuristics from their task execution histories. Building on this, Qian et al. [108] propose an iterative experience
refinement (IER) framework that enables agents to continuously adapt by acquiring, utilizing, and selectively
refining experiences from previous tasks, improving agents’ effectiveness and collaboration in dynamic software
development scenarios.
4 CASE STUDY
To demonstrate the practical effectiveness of LMA systems, we conduct two case studies. Specifically, we utilize
the state-of-the-art LMA framework, ChatDev [109], to autonomously develop two classic games: Snake and
Tetris. ChatDev structures the software development process into three phases: designing, coding, and testing.
ChatDev employs specialized roles, including CEO, CTO, programmer, reviewer, and tester. ChatDev’s agents are
powered by GPT-3.5-turbo4. The temperature setting controls the randomness and creativity of GPT-3.5's responses. Following the original ChatDev setting, we set the temperature of GPT-3.5-turbo to 0.2.
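For reference, the snippet below sketches how such a temperature value is passed to the underlying model through the OpenAI Python client; it is illustrative only, since ChatDev wraps these calls in its own orchestration layer:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0.2,  # low temperature, favoring deterministic over creative outputs
    messages=[
        {"role": "system", "content": "You are the programmer agent of a software company."},
        {"role": "user", "content": "Design and implement a grid-based snake game..."},
    ],
)
print(response.choices[0].message.content)
```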
“Design and implement a grid-based snake game displayed on the screen. Initialize the snake with a defined
starting position, length, and direction. Enable continuous movement controlled by arrow keys. Introduce
food that spawns randomly on the grid, ensuring it does not overlap with the snake. Trigger snake growth
when food is consumed, adding a new segment to its body. Implement a game-over condition for boundary or
body collisions, displaying a message and providing a restart option. Include a scoring system displayed in
the user interface, along with clear instructions. ”
While the first attempt to generate the Snake game was unsuccessful, we resubmitted the same prompt to
ChatDev, and the second attempt successfully produced a playable version. ChatDev also generated a detailed
manual that included information on dependencies, step-by-step instructions for running the game, and an
overview of its features. Figure 2 displays the graphical user interfaces (GUIs) of the generated Snake game,
showing the starting state, in-game state, and game-over state. The development process was consistently efficient,
taking an average of 76 seconds and costing $0.019. Upon playing the game, we confirmed that it fulfilled all the
requirements outlined in the prompt.
“Design and implement a Tetris game. Start with a randomly chosen piece dropping from the top. Allow
players to control the tetromino using arrow keys for movement (left, right, down) and rotation. Enable
automatic downward movement with an adjustable speed. Handle collisions with the boundaries and existing
pieces, locking the tetromino in place when it cannot move further. Check for complete rows after each
placement and remove them. End the game if new tetrominoes cannot spawn due to a full board, displaying a
game over message.”
4 https://platform.openai.com/docs/models/gp#gpt-3-5-turbo
During development, ChatDev faced challenges in producing functional gameplay across the first nine attempts.
Notice that the same prompt was used for each run. On the tenth attempt, ChatDev successfully produced a
Tetris game that met most of the prompt requirements, as shown in Figure 3. The figure illustrates the game’s key
states: the starting state, in-game states, and game-over state. However, the game still lacks the core functionality
to remove completed rows, as demonstrated in the third subplot of Figure 3. Overall, the development process
remained efficient, with an average time of 70 seconds and a cost of $0.020 per attempt.
Summary of Findings. From our case studies, current LMA systems demonstrate strong performance in
reasonably complex tasks like developing a Snake game. The generated Snake game meets all requirements in the
prompt within just a few iterations. The process was efficient and cost-effective, with an average completion
time of 76 seconds and a cost of $0.019 per attempt. These results emphasize the suitability of LMA systems
for moderately complex software engineering tasks. However, when tasked with more complex challenges like
developing a Tetris game, ChatDev successfully generates a playable Tetris game only on the tenth attempt. The
game still lacks the core functionality, i.e., removing completed rows. This highlights the limitations of current
LMA systems in handling more complex tasks that require deeper logical reasoning and abstraction. Nevertheless,
development remains efficient and cost-effective, averaging 70 seconds and $0.020 per run, making the system a
promising tool for rapid prototyping.
5 RESEARCH AGENDA
Previous research has laid the groundwork for the exploration of LMA systems in software engineering, yet this
domain remains in its nascent stages, with many critical challenges awaiting resolution. In this section, we outline
our perspective on these challenges and suggest research questions that could advance this burgeoning field. As
illustrated in Figure 4, we envision two phases for the development of LMA systems in software engineering. We
discuss each of these phases below and suggest a series of research questions that could form the basis of future
research projects.
5 https://www.linkedin.com/products/linkedin-talent-insights/
6 https://www.gartner.com/en/products/special-reports
7 https://survey.stackoverflow.co/2024/
(3) Value Addition Modeling: The next crucial step is value addition modeling [100], which evaluates the
potential advantages that LLM-based agents could bring to each prioritized role. This process involves
constructing detailed, data-driven models to analyze key performance indicators such as efficiency
improvements, cost reductions, quality enhancements, and the acceleration of innovation resulting from
the integration of agents. Pilot projects can be deployed to gather empirical data on these metrics when
LLM-based agents are applied to specific tasks. Important factors to consider include the automation
of repetitive tasks, the augmentation of human capabilities, and the inclusion of new functionalities
that were previously unattainable. It is important to note that the value added by LLM-based agents can
differ significantly across different domains; for example, roles in software development may prioritize
automation, whereas domains like systems architecture might see more value in LLMs augmenting complex
decision-making around resource allocation or performance optimization, where human expertise and
contextual understanding remain essential. By quantifying these value propositions, organizations can
allocate resources more strategically to roles where LLM-based agents are likely to yield the highest
return on investment.
The second step involves understanding the limitations of LLM-based agents relative to the demands of the
identified SE roles:
(1) Competency Mapping: Competency mapping [66] entails developing comprehensive competency frame-
works for each specialized role. These frameworks define the essential skills, knowledge areas, and
competencies required, encompassing both technical and soft skills. For instance, technical skills might
encompass proficiency in specific programming languages, tools, methodologies, and domain-specific
knowledge. For a machine learning engineer, this would include expertise in algorithms, data preprocessing,
model training, and tools such as TensorFlow8 or PyTorch9. Soft skills include problem-solving,
critical thinking, and collaboration. Clearly outlining these competencies creates a benchmark against
which the agents’ abilities can be measured.
(2) Performance Evaluation: The next phase is performance evaluation, which involves designing or selecting
tasks that closely replicate the real-world challenges associated with each role. These tasks should be
practical and scenario-based to accurately gauge the agents’ capabilities. They should assess a wide range
of competencies, from technical execution to critical thinking. For example, in evaluating a DevOps
engineer, the agent might be tasked with automating a deployment pipeline using tools like Jenkins10 or
Docker11 , or troubleshooting a continuous integration failure. Such tasks allow for a thorough assessment
of both technical and soft skills.
(3) Gap Analysis: This step compares the agents’ outputs with the expected outcomes for each task. Key areas where the agents underperform, such as misunderstanding domain-specific terminology, neglecting security best practices, or failing to optimize code, are identified and documented. This analysis emphasizes
both the agents’ strengths and weaknesses, offering valuable insights into recurring patterns of errors or
misconceptions.
(4) Expert Consultation and Iterative Refinement: To further refine the evaluation process, expert consultation
and iterative refinement are essential. By engaging with SE professionals who specialize in the assessed
roles, qualitative feedback on the agent’s performance can be obtained. These experts provide insights into
subtle nuances that may not be captured through quantitative metrics. For instance, while the agent’s code
8 https://www.tensorflow.org/
9 https://pytorch.org/
10 https://www.jenkins.io/
11 https://www.docker.com/
may work, it might not follow best practices or address scalability. This feedback helps refine evaluation
methods, update competency frameworks, and uncover deeper issues in the agent’s understanding.
The final step involves tailoring the LLM-based agents to effectively represent the identified SE roles through
specialized training and prompt engineering:
(1) Curating Specialized Training Data: At first, this involves creating training datasets that reflect the
unique requirements of each specific role. A comprehensive corpus should be built from a variety of
sources, including technical documentation such as API guides, technical manuals, and user guides to
provide in-depth knowledge of specific technologies. It is also important to incorporate academic and
industry research papers, case studies, and whitepapers to capture the latest developments, best practices,
and theoretical foundations. Additionally, discussions from forums and software Q&A sites like Stack
Overflow12 , Reddit13 , and specialized industry forums can provide practical problem-solving approaches
and real-world challenges faced by professionals.
(2) Fine-tuning the LLM: After preparing the data, the curated datasets are used to fine-tune the LLM-based agents. Advanced techniques like parameter-efficient fine-tuning (PEFT) [83] are often employed to optimize both efficiency and accuracy; a minimal sketch is given after this list.
(3) Designing Customized Prompts: A key step is designing prompts tailored to improve the agents’ role
adaptability. These prompts should clearly define the role, tasks, and goals to ensure the agent understands
the requirements. For instance, in a cybersecurity analyst role, the prompt should outline specific security
protocols, potential vulnerabilities, and compliance standards. Contextual instructions, including relevant
background, constraints, and examples, help the agent grasp task nuances. Creating a library of effective
prompts for various scenarios can also serve as reusable templates for future tasks.
(4) Continuous Learning and Adaptation: To keep agents aligned with industry developments, continuous
adaptation mechanisms are essential. Training data should be regularly updated, and models may be
retrained to incorporate new technologies, best practices, and trends in software engineering. Moni-
toring systems can track agent performance over time, enabling proactive adjustments and continuous
improvement. Additionally, agents should be guided to consistently reference the latest documentation
and standards to ensure their outputs remain relevant and accurate.
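As an illustration of step (2), the sketch below applies LoRA-based parameter-efficient fine-tuning via the Hugging Face peft library; the base model and the role-specific corpus are placeholders, not recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"            # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of updating all weights.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()           # typically well under 1% of the base parameters

# The curated role-specific corpus (API docs, papers, Q&A threads) would then be
# tokenized and passed to a standard transformers Trainer for supervised fine-tuning.
```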
While LMA roles may overlap with traditional software engineering roles, it is important to recognize that they
are not necessarily the same, as LMA roles often involve specialized, collaborative tasks suited for agent-based
systems. By systematically identifying key roles, assessing agent competencies, and enhancing their capabilities
through targeted fine-tuning, we aim to significantly improve the effectiveness of LLM-based agents in specialized
SE roles.
5.1.2 Advancing Prompts through Agent-oriented Programming Paradigms.
Effective prompts are crucial for the performance of LLM-based agents. However, creating such prompts is
challenging due to the need for a framework that is versatile, effective, and robust across diverse scenarios. Natural
language, while flexible, often contains ambiguities and inconsistencies that LLMs may misinterpret. Natural
language is inherently designed for human communication, which relies on shared
context and intuition that LLMs lack. In contrast, LLMs interpret text based on statistical patterns from large
datasets, which may lead to different interpretations than those intended for humans [121, 156]. This highlights
the need for a specialized prompting language that augments the cognitive functions of LLM-based agents
and treats LLMs as the primary audience. Such a language can minimize ambiguities and ensure clear instructions,
resulting in more reliable and accurate outputs.
12 https://stackoverflow.com/
13 https://www.reddit.com/
Current State. Multiple prompting frameworks have been released to facilitate the use of LLMs. For example,
DSPy [67] and Vieira [79] enable fully automated generation of prompts. AutoGen [145] and LangChain [97]
support retrieval-augmented generation (RAG) [38] and agent-based workflows. However, these frameworks are
still human-centered. They often prioritize human readability and developer convenience. As a result, there is a
lack of research on a language that treats LLMs as the primary audience for prompts.
Opportunities. Agent-oriented programming (AOP) [118] offers a promising foundation for this approach.
Just as Object-Oriented Programming (OOP) [141] organizes software around objects, AOP treats agents as fundamental
units, focusing on their reasoning, objectives, and interactions. An AOP-based prompting language could enable
the precise expression of complex tasks and constraints, allowing LLM-based agents to perform their roles with
greater efficiency and accuracy. Extending this concept to Multi-Agent-Oriented Programming (MAOP) [13, 14]
allows for the creation of systems where multiple LLM-based agents can collaborate, communicate, and adapt to
evolving contexts. By explicitly defining agent behaviors, communication patterns, and task hierarchies, we can
reduce ambiguity, mitigate hallucinations, and improve task execution in LMA systems.
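One possible shape such an AOP-style specification could take is sketched below as plain Python data structures; the field names and example values are hypothetical and imply no existing framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    role: str                                            # what the agent is
    goal: str                                            # what it must achieve
    beliefs: list[str] = field(default_factory=list)     # constraints it must respect
    can_message: list[str] = field(default_factory=list) # allowed communication partners

@dataclass
class TaskSpec:
    description: str
    owner: str
    depends_on: list[str] = field(default_factory=list)  # explicit task hierarchy

# Explicit roles, constraints, and channels replace ambiguous free-form prose.
reviewer = AgentSpec(
    role="security reviewer",
    goal="reject patches that introduce injection vulnerabilities",
    beliefs=["all user input is untrusted"],
    can_message=["programmer"],
)
review_task = TaskSpec(description="review the submitted patch",
                       owner="security reviewer",
                       depends_on=["implement the patch"])
```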
Further, such a prompting language must be expressive enough to handle diverse and complex tasks, yet simple enough for users to adopt easily. An overly simplified language may lack the expressive power needed to represent complex software engineering workflows, whereas an overly complex language introduces a steep learning curve through its syntax and hinders adoption, especially for users who require simpler interfaces for prompt creation and modification. Balancing functionality and usability will therefore be another key research question for its success.
Additionally, this process may involve tailoring prompts specifically for different LLM models and their versions,
as variations in model architectures, training data, and capabilities can affect how they interpret and respond
to prompts. What works effectively for one model may not perform as well for another, necessitating careful
adjustments. Current prompting languages lack mechanisms to easily adapt prompts across models, requiring
manual adjustments and experimentation to achieve consistent performance.
While AOP-based prompting may not be the final solution, it represents an important step toward devel-
oping an AI-oriented language with grammar tailored specifically for LLMs. This new approach could further
refine communication with LLM-based agents, reducing misinterpretation and significantly enhancing overall
performance.
Current State. Several LMA systems incorporate human-in-the-loop designs. For instance, AISD [162] involves
human input during requirement analysis and system validation, where users provide feedback on use cases,
system designs, and prototypes. Similarly, MARE [59] leverages human assessment to refine generated require-
ments and specifications. Although these works demonstrate the feasibility of human contributions, key research
questions, including optimizing human roles, enhancing feedback mechanisms, and identifying appropriate
intervention points, are still underexplored.
Opportunity. Developing role-specific guidelines that outline when and how human intervention should occur
is essential. These guidelines should assist in identifying critical decision points where human judgment is
indispensable, such as ethical considerations, conflict resolution, ambiguity handling, and creative problem-
solving. For example, ethical decisions necessitate human oversight to ensure alignment with societal norms and
values, and conflict resolution may require negotiation skills that LLM-based agents lack.
To facilitate seamless collaboration, designing intuitive user-friendly interfaces and interaction protocols is
essential [133]. Natural language interfaces and adaptive visualization techniques can make interactions more
accessible. These interfaces should efficiently present agent outputs in a digestible format and collect user feedback,
while also managing the cognitive load on human collaborators. It is important to note that these interfaces may
need to be tailored differently for each human role, as the needs of a project manager, a software developer, and a
quality assurance engineer will vary significantly.
Given the complexity of information generated during the agents’ workflows [60], designing such interfaces
poses challenges. For instance, presenting modifications suggested by an agent at varying levels of abstraction
ensures that each stakeholder can engage with the information at the right depth. A project manager might
focus on the broader implications, such as the high-level impact on project timelines or deliverables, whereas a
developer or architect might drill down into specific implementation details. Role-specific interfaces will be key
to ensuring each stakeholder can effectively collaborate with the agents and extract the necessary information in
a manner suited to their specific responsibilities.
Additionally, developing predictive models to determine the optimal human-to-agent ratio across different
project types and stages is a fundamental concern. These models must assess factors such as project complexity,
time constraints, project priorities, and the specific capabilities and limitations of both human participants and
LMA agents. By doing so, tasks can be allocated in a manner that fully harnesses both human ingenuity and
agent efficiency throughout the project. Machine learning techniques could also be leveraged to analyze historical
project data to predict effective collaboration strategies.
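As a toy illustration of such a predictive model, the sketch below fits a regressor on hypothetical historical project records; the feature names, data, and target ratios are invented for illustration only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features per past project:
# [complexity, deadline pressure, requirements ambiguity, safety-critical fraction]
X = np.array([
    [0.3, 0.2, 0.1, 0.0],
    [0.8, 0.9, 0.6, 0.4],
    [0.5, 0.4, 0.3, 0.1],
    [0.9, 0.7, 0.8, 0.9],
])
# Human-to-agent ratio that worked well on each of those projects (also invented).
y = np.array([0.1, 0.5, 0.2, 0.8])

model = GradientBoostingRegressor().fit(X, y)
new_project = np.array([[0.7, 0.6, 0.5, 0.2]])
print(f"suggested human-to-agent ratio: {model.predict(new_project)[0]:.2f}")
```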
(1) Participate in collaborative design: Agents should contribute ideas, propose design solutions, and converge
on a unified architecture that balances trade-offs.
(2) Delegate and coordinate tasks: Effective task division is crucial. Agents should assign responsibilities based
on expertise, manage dependencies, and adjust as the project evolves.
(3) Identify conflicts and negotiate: In collaborative settings, disagreements are inevitable. First, LLMs often
struggle to identify conflicts in real time unless explicitly guided to do so [1]. Therefore, agents should be
evaluated on their ability to recognize these conflicts—whether in logic, goals, or execution. Moreover,
agents should be tested on their ability to handle conflicts constructively. This includes proposing com-
promises, engaging in constructive negotiation, and ensuring that the team remains aligned with the
overarching objectives. Evaluations should focus on the agents’ capacity to balance competing priorities,
mitigate misunderstandings, and foster consensus, all while maintaining progress toward shared goals.
(4) Integrate components and perform peer reviews: Agents should seamlessly integrate their work, review
each other’s code for quality assurance, and provide constructive feedback.
(5) Proactive Clarification Request: Agents should not assume complete understanding when uncertainty
arises. Instead, they should preemptively ask for additional information or clarification to avoid potential
errors or misunderstandings. Evaluating agents on this ability ensures they are capable of identifying gaps
in their knowledge or instructions and can actively seek out the necessary context or data to complete
tasks effectively.
To develop such benchmarks, we need to create realistic project scenarios that require multi-agent collaboration
over extended periods. These scenarios should reflect common software development challenges, such as evolving
requirements and tight deadlines. Additionally, platforms or sandboxes must be built to provide controlled
environments where collaborative interactions between agents can be observed and measured. These platforms
should establish clear interaction rules, including languages, formats, and communication channels, to facilitate
effective information exchange.
Most importantly, comprehensive metrics must be developed to assess not just the final output, but also the
collaboration process itself. These metrics could measure communication efficiency, ambiguity resolution, conflict
management, adherence to best practices, and overall project success.
overhead and memory usage. From the outset, the system should be designed with scalability in mind, ensuring
that both software and hardware resources can expand efficiently as the number of agents increases.
Moreover, with more agents comes the risk of inconsistencies and conflicts in the shared information. A
centralized knowledge repository or shared blackboard system can ensure that all agents have access to consistent,
up-to-date information, acting as a single source of truth and minimizing the spread of misinformation. Robust
error handling mechanisms should also be implemented to detect and correct issues autonomously before they
escalate into significant failures.
Finally, as the number of agents grows, so do the rounds of discussion and decision-making, which can slow
down progress. To avoid this, decision-making hierarchies or consensus algorithms can streamline the process.
For example, only a subset of agents responsible for a specific module may need to reach a consensus, rather
than involving the entire agent network.
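A minimal sketch of such module-scoped consensus is shown below: only the agents owning a module vote, and a simple quorum decides (an illustration, not a full consensus protocol):

```python
from collections import Counter
from typing import Optional

def module_consensus(proposals: dict[str, str], owners: set[str],
                     quorum: float = 0.5) -> Optional[str]:
    """proposals maps agent name -> proposed decision; only module owners are counted."""
    votes = Counter(decision for agent, decision in proposals.items() if agent in owners)
    if not votes:
        return None
    decision, count = votes.most_common(1)[0]
    # Accept only if more than `quorum` of the owning agents agree.
    return decision if count > quorum * len(owners) else None

owners = {"db_agent", "api_agent", "cache_agent"}
proposals = {"db_agent": "use PostgreSQL", "api_agent": "use PostgreSQL",
             "cache_agent": "use SQLite", "ui_agent": "use MongoDB"}  # ui_agent is ignored
print(module_consensus(proposals, owners))  # -> use PostgreSQL
```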
5.2.4 Leveraging Industry Principles.
As LLM-based agents can closely mimic human developers in SE tasks, they can greatly benefit from adopting
established industry principles and management strategies. By emulating organizational frameworks used by
successful companies, LMA systems can improve their design and optimization processes. These industrial
mechanisms enable LMA systems to remain agile, efficient, and effective, even as project complexities grow.
Current State. As we described in Section 3, numerous works [5, 48, 101, 109] are designed using popular process
models such as Waterfall and Agile. For example, ChatDev [109] emulates a traditional Waterfall approach,
breaking tasks into distinct phases (e.g., requirement analysis, design, implementation, testing), with agents
dedicated to each phase. AgileCoder [101] incorporates the Agile methodologies, leveraging iterative development,
continuous feedback loops, and collaborative sprints.
Opportunities. However, current LMA systems often do not leverage more specialized and modern industry
practices, such as Value Stream Mapping, Design Thinking, or Model-Based Systems Engineering (MBSE).
Additionally, frameworks like Domain-Driven Design (DDD), Behavior-Driven Development (BDD), and Team
Topologies remain underutilized. These methodologies emphasize aligning development with business goals,
improving user-centric design, and optimizing team structures—key components that could further enhance the
efficiency, adaptability, and effectiveness of LMA systems.
Leadership and governance structures from industrial organizations provide valuable insights for designing
LMA systems. Project management tools and practices, essential for coordinating large development teams,
can be applied to LMA systems to enhance their operational efficiency. Using established project management
frameworks, LMA systems can monitor progress, allocate resources, and manage timelines effectively. Agents can
dynamically update task boards, report milestones, and adjust workloads in real-time based on project data. This
not only improves transparency but also allows for early detection of bottlenecks or delays, ensuring projects
stay on track.
Incorporating design patterns and software architecture best practices further strengthens LMA systems [70].
By adhering to these principles, agents can produce well-structured, maintainable code that is scalable and
reusable. This reduces technical debt and ensures that the solutions developed by LMA systems are easier to
integrate, maintain, and expand in the future.
5.2.5 Dynamic Adaptation.
In the context of software development, predicting the optimal configuration for LMA systems at the outset
is unrealistic due to the inherent complexity and variability of tasks [71]. The dynamic nature of software
requirements and the unpredictable challenges that arise during development necessitate systems that can adapt
on the fly [88], for example when project requirements shift suddenly or external dependencies cause unexpected delays. Therefore, LMA systems must be capable of dynamically adjusting their scale, strategies,
and structures throughout the development process.
Current State. Most existing LMA systems [48, 109] operate with static architectures characterized by fixed agent
roles and predefined communication patterns. Recent research efforts [89, 157] have introduced mechanisms for
adaptive agent team selection and task-specific collaboration strategies. These methods enable the selection of
suitable agent team configurations for specific tasks; however, they still fall short of true dynamic adaptation and
lack the capability to adjust to real-time changes. To the best of our knowledge, no previous work addresses the
need for on-the-fly adjustments in response to evolving project demands.
Opportunities. To minimize redundant work, LMA systems should continuously evaluate existing solutions [136],
identifying reusable elements for new requirements. By learning from each development cycle, the system can
recognize patterns of efficiency and inefficiency, enabling it to make informed decisions when handling similar
tasks in the future or adapting existing solutions to new requirements.
A key element of dynamic adaptation is the ability to automatically adjust the number of agents involved in a
project [41]. This includes not only scaling the number of agents up or down as needed but also generating new
agents with new specialized roles to meet emerging task requirements, ensuring both efficiency and responsiveness.
Additionally, the system can replicate agents in existing roles to manage increased workloads. Furthermore, LMA
systems can generate new agents that come equipped with contextual knowledge of the project—such as its
history, current state, and objectives—by accessing shared knowledge bases, project documentation, and recent
communications. This allows new agents to integrate smoothly and contribute effectively right from the start,
reducing onboarding time and minimizing disruptions.
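The sketch below illustrates one way this scaling could look: an agent pool spawns or retires role-specific agents as the backlog for each role changes, while holding a reference to the shared project context that new agents are expected to read on creation (names and thresholds are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ProjectContext:
    history: list[str] = field(default_factory=list)      # shared knowledge base
    objectives: list[str] = field(default_factory=list)

@dataclass
class AgentPool:
    context: ProjectContext                                # seeded into every new agent
    agents: dict[str, list[str]] = field(default_factory=dict)  # role -> agent ids
    max_tasks_per_agent: int = 5

    def rebalance(self, backlog: dict[str, int]) -> None:
        """Spawn or retire agents so each role can absorb its pending tasks."""
        for role, pending in backlog.items():
            needed = -(-pending // self.max_tasks_per_agent)  # ceiling division
            current = self.agents.setdefault(role, [])
            while len(current) < needed:                      # scale up
                current.append(f"{role}-{len(current) + 1}")
            del current[needed:]                              # scale down idle agents

pool = AgentPool(ProjectContext(objectives=["ship v1.0"]))
pool.rebalance({"tester": 12, "developer": 3})
print(pool.agents)  # {'tester': ['tester-1', 'tester-2', 'tester-3'], 'developer': ['developer-1']}
```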
Another key component is the dynamic redefinition of agent roles [61]. As the project evolves, certain roles
may become obsolete while new ones emerge. LMA systems should be capable of reassigning roles to agents or
modifying their responsibilities to better align with current project needs. This flexibility enhances the system’s
ability to adapt to changing requirements and priorities.
Dynamic adaptation also involves the reallocation of memory and computational resources. As agents are
added or removed and tasks shift in complexity, the system must efficiently distribute resources to where they
are most needed. This may include scaling computational power for agents handling intensive tasks or increasing
memory allocation for agents processing large datasets. Effective resource management ensures that the system
operates optimally without unnecessary strain on infrastructure.
Finally, the uncertainty of the software development process makes it challenging to define effective termination
conditions [119]. Relying solely on predefined criteria may result in infinite loops or premature task completion.
To address this, LMA systems must incorporate real-time monitoring and feedback loops to continuously evaluate
progress. Machine learning techniques can help predict optimal stopping points by analyzing historical data
and current performance metrics, allowing for informed adjustments to task completion criteria as the project
evolves.
the varied data access needs of the system. Traditional models like Role-Based Access Control (RBAC) [116] and
Attribute-Based Access Control (ABAC) [52] may need to be extended to handle the dynamic nature of multi-
agent systems effectively. Establishing protocols that allow agents to share insights derived from sensitive data,
without exposing the data itself, is critical. Advanced privacy-preserving techniques like Differential Privacy [34],
Secure Multi-Party Computation (SMPC) [40], Federated Learning [62], or Homomorphic Encryption [153] can
be leveraged to ensure that agents collaborate without compromising data privacy.
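As a small example of the first of these techniques, the Laplace mechanism from differential privacy lets an agent share a noisy aggregate (here, a count of vulnerable files) rather than the underlying records; this is a toy illustration, not a complete privacy analysis:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy; a counting query has sensitivity 1."""
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The security-analysis agent shares only the noisy statistic with the other agents.
vulnerable_files = 17
print(laplace_count(vulnerable_files, epsilon=0.5))
```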
Moreover, compliance with data protection laws such as the General Data Protection Regulation (GDPR) [130]
in the EU and the California Consumer Privacy Act (CCPA) [104] in the U.S. is crucial. LMA systems should
follow privacy-by-design principles, ensuring that data subjects’ rights are upheld, and that data processing
activities remain transparent and lawful. This includes implementing mechanisms for data minimization, consent
management, and honoring the right to be forgotten.
For non-sensitive data, integrated data storage solutions can reduce redundancy, improve data consistency,
and increase efficiency. This can be achieved through distributed databases accessible to authorized agents, along
with data synchronization mechanisms to ensure agents have up-to-date information in real time. Additionally,
using technologies like blockchain [167] and distributed ledgers [122] can enhance transparency, traceability,
and tamper-resistance in recording agent transactions and data access events, fostering greater trust among
collaborating entities.
6 DISCUSSION
6.1 A Comparison with the Mixture of Experts Paradigm
Another paradigm that has recently attracted much attention from both academia and industry is the Mixture
of Experts (MoE) paradigm [16, 170]. MoE organizes an LLM into multiple specialized components known as
“experts.” Each expert is designed to focus on specialized tasks. Further, a gating mechanism is employed to
dynamically activate the most relevant subset of experts based on the input. While MoE is promising, LMA
systems offer several distinct advantages:
One limitation of MoE is its high resource consumption. MoE models contain multiple experts within a single architecture, which substantially increases the total number of parameters. Furthermore, training MoE models is more resource-intensive and time-consuming than training standard LLMs, mainly because the gating mechanism must also be optimized to select the most relevant experts, which adds considerable overhead.
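To make the gating mechanism concrete, the sketch below shows the top-k routing step that a MoE layer performs internally; it is a generic illustration in PyTorch, not the implementation of any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Routes each input to its k highest-scoring experts and returns mixing weights."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor):
        logits = self.router(x)                        # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)         # normalize over the selected experts
        return topk_idx, weights                       # which experts to run, and their weights

gate = TopKGate(d_model=8, num_experts=4, k=2)
experts, weights = gate(torch.randn(3, 8))             # route a batch of 3 inputs
```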
Since specific experts are dynamically activated based on input, MoE can be viewed as a method to learn the
internal routing of LLMs. However, there is no interaction or communication between experts in MoE. In contrast, LMA systems are usually designed to resemble real-world collaborative workflows. Agents in LMA
systems can actively communicate with each other, exchange information, and iteratively refine the output based
on feedback from other agents. More importantly, LMA systems can also integrate external feedback from tools
such as compilers, static analyzers, or testing frameworks. LMA systems also facilitate seamless and continuous
human-in-the-loop collaboration, enabling human experts to intervene, validate outputs, and provide guidance at
any stage of the process. As a result, we consider LMA systems to be a more appropriate approach than MoE to
address the multifaceted challenges of software engineering.
search process by combining automated querying with forward and backward snowballing, aiming to identify
and include all pertinent studies.
REFERENCES
[1] Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz. 2023. LLM-Deliberation: Evaluating LLMs with interactive multi-agent negotiation games. arXiv preprint arXiv:2309.17234 (2023).
[2] Pekka Abrahamsson, Outi Salo, Jussi Ronkainen, and Juhani Warsta. 2017. Agile software development methods: Review and analysis.
arXiv preprint arXiv:1709.08439 (2017).
[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt,
Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[4] Saaket Agashe. 2023. LLM-Coordination: Developing Coordinating Agents with Large Language Models. University of California, Santa
Cruz.
[5] Samar Al-Saqqa, Samer Sawalha, and Hiba AbdelNabi. 2020. Agile software development: Methodologies and trends. International
Journal of Interactive Mobile Technologies 14, 11 (2020).
[6] Stefano V Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems.
Artificial Intelligence 258 (2018), 66–95.
[7] Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. 2024. SWE-Search: Enhancing
Software Agents with Monte Carlo Tree Search and Iterative Refinement. arXiv preprint arXiv:2410.20285 (2024).
[8] Daman Arora, Atharv Sonwane, Nalin Wadhwa, Abhav Mehrotra, Saiteja Utpala, Ramakrishna Bairi, Aditya Kanade, and Nagarajan
Natarajan. 2024. MASAI: Modular Architecture for Software-engineering AI Agents. arXiv preprint arXiv:2406.11638 (2024).
[9] Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, and Alexander Tessier. 2024. Elicitron: An LLM
Agent-Based Simulation Framework for Design Requirements Elicitation. arXiv preprint arXiv:2404.16045 (2024).
[10] Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. 2024. Explaining neural scaling laws. Proceedings of the
National Academy of Sciences 121, 27 (2024), e2311878121.
[11] Charles H Bennett, Gilles Brassard, and Jean-Marc Robert. 1988. Privacy amplification by public discussion. SIAM Journal on Computing 17, 2 (1988), 210–229.
[12] Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann. 2008. What makes a good
bug report?. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. 308–318.
[13] Olivier Boissier, Rafael H Bordini, Jomi Hubner, and Alessandro Ricci. 2020. Multi-agent oriented programming: programming multi-agent
systems using JaCaMo. MIT Press.
[14] Rafael H Bordini, Mehdi Dastani, Jürgen Dix, and Amal El Fallah Seghrouchni. 2009. Multi-agent programming. Springer.
[15] Frank J. Budinsky, Marilyn A. Finnie, John M. Vlissides, and Patsy S. Yu. 1996. Automatic code generation from design patterns. IBM
Systems Journal 35, 2 (1996), 151–171.
[16] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. Authorea
Preprints (2024).
[17] Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, et al. 2023.
Low-code LLM: Visual programming over LLMs. arXiv preprint arXiv:2304.08103 (2023).
[18] Chong Chen, Jianzhong Su, Jiachi Chen, Yanlin Wang, Tingting Bi, Jianxing Yu, Yanli Wang, Xingwei Lin, Ting Chen, and Zibin Zheng.
2023. When ChatGPT meets smart contract vulnerability detection: How far are we? ACM Transactions on Software Engineering and
Methodology (2023).
[19] Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem
Aliev, et al. 2024. CodeR: Issue Resolving with Multi-Agent and Task Graphs. arXiv preprint arXiv:2406.01304 (2024).
[20] Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan Hao, and Lu Zhang. 2020. A survey of compiler testing.
ACM Computing Surveys (CSUR) 53, 1 (2020), 1–36.
[21] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al.
2023. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on
Learning Representations.
[22] Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2023. Scalable multi-robot collaboration with large language
models: Centralized or decentralized systems? arXiv preprint arXiv:2309.15943 (2023).
[23] Zhenpeng Chen, Jie M Zhang, Max Hort, Mark Harman, and Federica Sarro. 2024. Fairness testing: A comprehensive survey and
analysis of trends. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–59.
[24] Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua
Zhao, et al. 2024. Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects. arXiv preprint
arXiv:2401.03428 (2024).
[25] Michael G Christel and Kyo C Kang. 1992. Issues in requirements elicitation.
[26] David Cohen, Mikael Lindvall, and Patricia Costa. 2004. An introduction to agile methods. Adv. Comput. 62, 03 (2004), 1–66.
[27] Kate Crawford and Jason Schultz. 2014. Big data and due process: Toward a framework to redress predictive privacy harms. BCL Rev.
55 (2014), 93.
[28] DBLP Computer Science Bibliography. 2024. DBLP: Computer Science Bibliography. https://dblp.org. Accessed: 2024-11-13.
[29] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass.
2023. PentestGPT: An LLM-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782 (2023).
[30] Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-
SIGACT-SIGART symposium on Principles of database systems. 202–210.
[31] Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-collaboration Code Generation via ChatGPT. arXiv preprint arXiv:2304.07590
(2023).
[32] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language
models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023).
[33] Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, Yifei Wang, Yufan Dang, Weize Chen, and Cheng Yang. 2024. Multi-Agent Software
Development through Cross-Team Collaboration. arXiv preprint arXiv:2406.08979 (2024).
[34] Cynthia Dwork. 2006. Differential privacy. In International colloquium on automata, languages, and programming. Springer, 1–12.
[35] Gang Fan, Xiaoheng Xie, Xunjin Zheng, Yinan Liang, and Peng Di. 2023. Static Code Analysis in the AI Era: An In-depth Exploration
of the Concept, Function, and Potential of Intelligent Code Analysis Agents. arXiv preprint arXiv:2310.08837 (2023).
[36] Stan Franklin and Art Graesser. 1996. Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents. In International
workshop on agent theories, architectures, and languages. Springer, 21–35.
[37] Michael Fu, Chakkrit Kla Tantithamthavorn, Van Nguyen, and Trung Le. 2023. ChatGPT for vulnerability detection, classification, and repair: How far are we? In 2023 30th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 632–636.
[38] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented
generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
[39] Joseph A Goguen and Charlotte Linde. 1993. Techniques for requirements elicitation. In [1993] Proceedings of the IEEE International
Symposium on Requirements Engineering. IEEE, 152–164.
[40] Oded Goldreich. 1998. Secure multi-party computation. Manuscript. Preliminary version 78, 110 (1998), 1–108.
[41] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large
language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680 (2024).
[42] Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, Jiakun Liu, Zhipeng Zhao, and David Lo. 2024. PTM4Tag+: Tag
Recommendation of Stack Overflow Posts with Pre-trained Models. arXiv preprint arXiv:2408.02311 (2024).
[43] Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, and David Lo. 2022. PTM4Tag: Sharpening tag recommendation of
Stack Overflow posts with pre-trained models. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension.
1–11.
[44] Junda He, Xin Zhou, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Clairine Irsan, and David Lo. 2024.
Representation learning for Stack Overflow posts: How far are we? ACM Transactions on Software Engineering and Methodology 33, 3
(2024), 1–24.
[45] Jack Herrington. 2003. Code generation in action. Manning Publications Co.
[46] Ann M Hickey and Alan M Davis. 2004. A unified model of requirements elicitation. Journal of Management Information Systems 20, 4
(2004), 65–84.
[47] Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. 2023. L2MAC: Large language model automatic computer for unbounded
code generation. arXiv preprint arXiv:2310.02003 (2023).
[48] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin,
Liyang Zhou, et al. 2023. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 (2023).
[49] John J Horton. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? Technical Report.
National Bureau of Economic Research.
[50] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large
language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology
(2023).
[51] Sihao Hu, Tiansheng Huang, Fatih İlhan, Selim Furkan Tekin, and Ling Liu. 2023. Large language model-powered smart contract
vulnerability detection: New perspectives. In 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems
and Applications (TPS-ISA). IEEE, 297–306.
[52] Vincent C Hu, D Richard Kuhn, David F Ferraiolo, and Jeffrey Voas. 2015. Attribute-based access control. Computer 48, 2 (2015), 85–88.
[53] Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, and Siheng Chen. 2024. Self-Evolving
Multi-Agent Collaboration Networks for Software Development. arXiv preprint arXiv:2410.16946 (2024).
[54] Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code Generation with
Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023).
[55] Nasif Imtiaz, Seaver Thorn, and Laurie Williams. 2021. A comparative study of vulnerability reporting by software composition analysis
tools. In Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–11.
[56] Yoichi Ishibashi and Yoshimasa Nishimura. 2024. Self-organized agents: A LLM multi-agent framework toward ultra large-scale code
generation and optimization. arXiv preprint arXiv:2404.02183 (2024).
[57] Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024. MapCoder: Multi-Agent Code Generation for Competitive
Problem Solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 4912–4944.
https://aclanthology.org/2024.acl-long.269
[58] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and
Ion Stoica. 2024. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. arXiv preprint
arXiv:2403.07974 (2024).
[59] Dongming Jin, Zhi Jin, Xiaohong Chen, and Chunhui Wang. 2024. MARE: Multi-Agents Collaboration Framework for Requirements
Engineering. arXiv preprint arXiv:2405.03256 (2024).
[60] Martin Josifoski, Lars Klein, Maxime Peyrard, Nicolas Baldwin, Yifei Li, Saibo Geng, Julian Paul Schnitzler, Yuxing Yao, Jiheng Wei,
Debjit Paul, et al. 2023. Flows: Building blocks of reasoning and collaborating AI. arXiv preprint arXiv:2308.01285 (2023).
[61] Denis Jouvin and Salima Hassas. 2002. Role delegation as multi-agent oriented dynamic composition. In Proceedings of Net Object Days
(NOD), AgeS workshop, Erfurt, Germany.
[62] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary
Charles, Graham Cormode, Rachel Cummings, et al. 2021. Advances and open problems in federated learning. Foundations and Trends®
in Machine Learning 14, 1–2 (2021), 1–210.
[63] Stephen H Kan. 2003. Metrics and models in software quality engineering. Addison-Wesley Professional.
[64] Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2023. Explainable automated debugging via large language model-driven
scientific debugging. arXiv preprint arXiv:2304.02195 (2023).
[65] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh,
Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274.
[93] Pattie Maes. 1993. Modeling adaptive autonomous agents. Artificial Life 1, 1–2 (1993), 135–162.
[94] Zhenyu Mao, Jialong Li, Munan Li, and Kenji Tei. 2024. Multi-role Consensus through LLMs Discussions for Vulnerability Detection.
arXiv preprint arXiv:2403.14274 (2024).
[95] Lina Markauskaite, Rebecca Marrone, Oleksandra Poquet, Simon Knight, Roberto Martinez-Maldonado, Sarah Howard, Jo Tondeur,
Maarten De Laat, Simon Buckingham Shum, Dragan Gašević, et al. 2022. Rethinking the entwinement between artificial intelligence
and human learning: What capabilities do learners need for a world with AI? Computers and Education: Artificial Intelligence 3 (2022),
100056.
[96] Noble Saji Mathews and Meiyappan Nagappan. 2024. Test-Driven Development for Code Generation. arXiv preprint arXiv:2402.13521
(2024).
[97] Vasilios Mavroudis. 2024. LangChain. (2024).
[98] Alfred R Mele. 2001. Autonomous agents: From self-control to autonomy. Oxford University Press, USA.
[99] Timothy Meline. 2006. Selecting studies for systemic review: Inclusion and exclusion criteria. Contemporary Issues in Communication
Science and Disorders 33, Spring (2006), 21–27.
[100] Emilia Mendes, Pilar Rodriguez, Vitor Freitas, Simon Baker, and Mohamed Amine Atoui. 2018. Towards improving decision making
and estimating the value of decisions in value-based software engineering: the VALUE framework. Software Quality Journal 26 (2018),
607–656.
[101] Minh Huynh Nguyen, Thang Phan Chau, Phong X Nguyen, and Nghi DQ Bui. 2024. AgileCoder: Dynamic Collaborative Agents for
Software Development based on Agile Methodology. arXiv preprint arXiv:2406.11912 (2024).
[102] Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Is Self-Repair a Silver Bullet
for Code Generation?. In The Twelfth International Conference on Learning Representations.
[103] Maria Paasivaara, Sandra Durasiewicz, and Casper Lassenius. 2008. Distributed agile development: Using scrum in a large project. In
2008 IEEE International Conference on Global Software Engineering. IEEE, 87–95.
[104] Stuart L Pardau. 2018. The California Consumer Privacy Act: Towards a European-style privacy regime in the United States. J. Tech. L. &
Pol’y 23 (2018), 68.
[105] Kai Petersen, Claes Wohlin, and Dejan Baca. 2009. The waterfall model in large-scale development. In Product-Focused Software Process
Improvement: 10th International Conference, PROFES 2009, Oulu, Finland, June 15-17, 2009. Proceedings 10. Springer, 386–400.
[106] Huy Nhat Phan, Tien N Nguyen, Phong X Nguyen, and Nghi DQ Bui. 2024. HyperAgent: Generalist software engineering agents to
solve coding tasks at scale. arXiv preprint arXiv:2409.16299 (2024).
[107] Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, YiFei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, Zhiyuan Liu,
and Maosong Sun. 2024. Experiential Co-Learning of Software-Developing Agents. In Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association
for Computational Linguistics, Bangkok, Thailand, 5628–5640. https://aclanthology.org/2024.acl-long.305
[108] Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, et al. 2024.
Iterative Experience Refinement of Software-Developing Agents. arXiv preprint arXiv:2405.04219 (2024).
[109] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu,
Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar
(Eds.). Association for Computational Linguistics, Bangkok, Thailand, 15174–15186. https://aclanthology.org/2024.acl-long.810
[110] Yihao Qin, Shangwen Wang, Yiling Lou, Jinhao Dong, Kaixin Wang, Xiaoling Li, and Xiaoguang Mao. 2024. AgentFL: Scaling LLM-based
Fault Localization to Project-Level Context. arXiv preprint arXiv:2403.16362 (2024).
[111] Zeeshan Rasheed, Malik Abdul Sami, Muhammad Waseem, Kai-Kristian Kemell, Xiaofeng Wang, Anh Nguyen, Kari Systä, and Pekka
Abrahamsson. 2024. AI-powered Code Review with LLMs: Early Results. arXiv preprint arXiv:2404.18496 (2024).
[112] Zeeshan Rasheed, Muhammad Waseem, Mika Saari, Kari Systä, and Pekka Abrahamsson. 2024. CodePori: Large scale model for
autonomous software development by using multi-agents. arXiv preprint arXiv:2402.01411 (2024).
[113] Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2024. SpecRover: Code Intent Extraction via LLMs. arXiv preprint
arXiv:2408.02232 (2024).
[114] Malik Abdul Sami, Muhammad Waseem, Zeeshan Rasheed, Mika Saari, Kari Systä, and Pekka Abrahamsson. 2024. Experimenting with
Multi-Agent Software Development: Towards a Unified Platform. arXiv preprint arXiv:2406.05381 (2024).
[115] Malik Abdul Sami, Muhammad Waseem, Zheying Zhang, Zeeshan Rasheed, Kari Systä, and Pekka Abrahamsson. 2024. AI based
Multiagent Approach for Requirements Elicitation and Analysis. arXiv preprint arXiv:2409.00038 (2024).
[116] Ravi S Sandhu. 1998. Role-based access control. In Advances in computers. Vol. 46. Elsevier, 237–286.
[117] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal
reinforcement learning. Advances in Neural Information Processing Systems 36 (2024).
[118] Yoav Shoham. 1993. Agent-oriented programming. Artificial Intelligence 60, 1 (1993), 51–92.
[119] Preston G Smith and Guy M Merritt. 2020. Proactive risk management: Controlling uncertainty in product development. Productivity
Press.
[120] Giriprasad Sridhara, Sourav Mazumdar, et al. 2023. ChatGPT: A study on its utility for ubiquitous software engineering tasks. arXiv
preprint arXiv:2305.16837 (2023).
[121] Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, and David Lo. 2024. AI Coders Are Among Us: Rethinking Programming Language
Grammar Towards Efficient Code Generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing
and Analysis. 1124–1136.
[122] Ali Sunyaev. 2020. Distributed ledger technology. In Internet Computing: Principles of Distributed Systems and Emerging Internet-Based
Technologies. Springer, 265–299.
[123] Maryam Taeb, Amanda Swearngin, Eldon Schoop, Ruijia Cheng, Yue Jiang, and Jeffrey Nichols. 2024. AXNav: Replaying accessibility
tests from natural language. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–16.
[124] Daniel Tang, Zhenghan Chen, Kisub Kim, Yewei Song, Haoye Tian, Saad Ezzini, Yongfeng Huang, Jacques Klein, and Tegawende F
Bissyande. 2024. Collaborative agents for software engineering. arXiv preprint arXiv:2402.02172 (2024).
[125] Wei Tao, Yucheng Zhou, Wenqiang Zhang, and Yu Cheng. 2024. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue
Resolution. arXiv preprint arXiv:2403.17927 (2024).
[126] Jeff Tian. 2005. Software quality engineering: testing, quality assurance, and quantifiable improvement. John Wiley & Sons.
[127] Rainer Unland. 2015. Software agent systems. In Industrial Agents. Elsevier, 3–22.
[128] Raymon Van Dinter, Bedir Tekinerdogan, and Cagatay Catal. 2021. Automation of systematic literature reviews: A systematic literature
review. Information and Software Technology 136 (2021), 106589.
[129] Axel Van Lamsweerde. 2000. Requirements engineering in the year 00: A research perspective. In Proceedings of the 22nd international
conference on Software engineering. 5–19.
[130] Paul Voigt and Axel Von dem Bussche. 2017. The EU General Data Protection Regulation (GDPR): A Practical Guide (1st ed.). Springer
International Publishing, Cham.
[131] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager:
An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023).
[132] Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, and Ge Yu. 2024. INTERVENOR: Prompting the Coding
Ability of Large Language Models with the Interactive Chain of Repair. In Findings of the Association for Computational Linguistics ACL
2024. 2081–2107.
[133] Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, and Shoufa Chen. 2024. MobileAgentBench:
An Efficient and User-Friendly Benchmark for Mobile LLM Agents. arXiv preprint arXiv:2406.08184 (2024).
[134] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al.
2023. A Survey on Large Language Model based Autonomous Agents. CoRR abs/2308.11432 (2023).
[135] Qian Wang, Tianyu Wang, Qinbin Li, Jingsheng Liang, and Bingsheng He. 2024. MegaAgent: A Practical Framework for Autonomous
Cooperation in Large-Scale LLM Agent Systems. arXiv preprint arXiv:2408.09955 (2024).
[136] Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, and Xuanjing Huang. 2024. Benchmark Self-Evolving: A Multi-Agent
Framework for Dynamic LLM Evaluation. arXiv preprint arXiv:2402.11443 (2024).
[137] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023. Describe, explain, plan and select: Interactive
planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560 (2023).
[138] Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2023. RCAgent: Cloud Root
Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. arXiv preprint arXiv:2310.16340 (2023).
[139] Zhitao Wang, Wei Wang, Zirao Li, Long Wang, Can Yi, Xinjie Xu, Luyang Cao, Hanjing Su, Shouzhi Chen, and Jun Zhou. 2024.
XUAT-Copilot: Multi-Agent Collaborative System for Automated User Acceptance Testing with Large Language Model. arXiv preprint
arXiv:2401.02705 (2024).
[140] Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan,
Zehao Ni, Man Zhang, et al. 2023. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models.
arXiv preprint arXiv:2310.00746 (2023).
[141] Peter Wegner. 1990. Concepts and paradigms of object-oriented programming. ACM SIGPLAN OOPS Messenger 1, 1 (1990), 7–87.
[142] Ratnadira Widyasari, David Lo, and Lizi Liao. 2024. Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse
LLMs and Validation Techniques. arXiv preprint arXiv:2409.01001 (2024).
[143] Claes Wohlin. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings
of the 18th international conference on evaluation and assessment in software engineering. 1–10.
[144] Michael Wooldridge. 2009. An introduction to multiagent systems. John Wiley & Sons.
[145] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang.
2023. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155 (2023).
[146] Yiran Wu, Feiran Jia, Shaokun Zhang, Qingyun Wu, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, and Chi Wang.
2023. An empirical study on challenging math problem solving with gpt-4. arXiv preprint arXiv:2306.01337 (2023).
[147] Zengqing Wu, Shuyuan Zheng, Qianying Liu, Xu Han, Brian Inhyuk Kwon, Makoto Onizuka, Shaojie Tang, Run Peng, and Chuan
Xiao. 2024. Shall We Talk: Exploring Spontaneous Collaborations of Competing LLM Agents. arXiv preprint arXiv:2402.12327 (2024).
[148] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al.
2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864 (2023).
[149] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal fuzzing with large
language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
[150] Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2023. White-box compiler
fuzzing empowered by large language models. arXiv preprint arXiv:2310.15991 (2023).
[151] Chengran Yang, Jiakun Liu, Bowen Xu, Christoph Treude, Yunbo Lyu, Ming Li, and David Lo. 2023. APIDocBooster: An Extract-
Then-Abstract Framework Leveraging Large Language Models for Augmenting API Documentation. arXiv preprint arXiv:2312.10934
(2023).
[152] Weiqing Yang, Hanbin Wang, Zhenghao Liu, Xinze Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Zhiyuan Liu, and Ge Yu. 2024.
Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement. arXiv preprint arXiv:2408.05006
(2024).
[153] Xun Yi, Russell Paulet, and Elisa Bertino. 2014. Homomorphic encryption. Springer.
[154] Juyeon Yoon, Robert Feldt, and Shin Yoo. 2024. Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents. In
2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 129–139.
[155] Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan, et al. 2024. CodeS:
Natural Language to Code Repository via Multi-Layer Sketch. arXiv preprint arXiv:2403.16443 (2024).
[156] Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. 2022. An extensive study on pre-trained
models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT international symposium on software testing
and analysis. 39–51.
[157] Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, and Dawei Cheng. 2024. G-Designer:
Architecting multi-agent communication topologies via graph neural networks. arXiv preprint arXiv:2410.11782 (2024).
[158] Huan Zhang, Wei Cheng, Yuhan Wu, and Wei Hu. 2024. A Pair Programming Framework for Code Generation via Multi-Plan
Exploration and Feedback-Driven Refinement. arXiv preprint arXiv:2409.05001 (2024).
[159] Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, et al. 2024.
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents. arXiv preprint arXiv:2408.07060 (2024).
[160] Lyuye Zhang, Kaixuan Li, Kairan Sun, Daoyuan Wu, Ye Liu, Haoye Tian, and Yang Liu. 2024. ACFix: Guiding LLMs with mined common
RBAC practices for context-aware repair of access control vulnerabilities in smart contracts. arXiv preprint arXiv:2403.06838 (2024).
[161] Li Zhang, Jia-Hao Tian, Jing Jiang, Yi-Jun Liu, Meng-Yuan Pu, and Tao Yue. 2018. Empirical research in software engineering—a
literature survey. Journal of Computer Science and Technology 33 (2018), 876–899.
[162] Simiao Zhang, Jiaping Wang, Guoliang Dong, Jun Sun, Yueling Zhang, and Geguang Pu. 2024. Experimenting a New Programming
Practice with LLMs. arXiv preprint arXiv:2401.01062 (2024).
[163] Sai Zhang, Zhenchang Xing, Ronghui Guo, Fangzhou Xu, Lei Chen, Zhaoyuan Zhang, Xiaowang Zhang, Zhiyong Feng, and
Zhiqiang Zhuang. 2024. Empowering Agile-Based Generative Software Development through Human-AI Teamwork. arXiv preprint
arXiv:2407.15568 (2024).
[164] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023.
Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
[165] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous program improvement. In
Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.
[166] Zhonghan Zhao, Kewei Chen, Dongxu Guo, Wenhao Chai, Tian Ye, Yanting Zhang, and Gaoang Wang. 2024. Hierarchical Auto-
Organizing System for Open-Ended Multi-Agent Navigation. arXiv preprint arXiv:2403.08282 (2024).
[167] Zibin Zheng, Shaoan Xie, Hong-Ning Dai, Xiangping Chen, and Huaimin Wang. 2018. Blockchain challenges and opportunities: A
survey. International Journal of Web and Grid Services 14, 4 (2018), 352–375.
[168] Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. 2024. Large Language Model for Vulnerability Detection and Repair: Literature
Review and the Road Ahead. arXiv preprint arXiv:2404.02525 (2024).
[169] Xin Zhou, Bowen Xu, DongGyun Han, Zhou Yang, Junda He, and David Lo. 2023. CCBERT: Self-Supervised Code Change Representation
Learning. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 182–193.
[170] Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. 2024. LLaMA-MoE: Building mixture-
of-experts from LLaMA with continual pre-training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language
Processing. 15913–15923.
[171] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda
He, Indraneil Paul, et al. 2024. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions.
arXiv preprint arXiv:2406.15877 (2024).