
DocAgent: A Multi-Agent System for Automated Code Documentation Generation

Dayu Yang* Antoine Simoulin† Xin Qian†


Xiaoyi Liu† Yuwei Cao† Zhaopu Teng† Grey Yang
Meta AI
{dayuyang,antoinesimoulin,xinqian,xiaoyiliu,yuweicao,zhaoputeng,glyang}@meta.com

* Corresponding Author. † Equal contribution.

Abstract

High-quality code documentation is crucial for software development, especially in the era of AI. However, generating it automatically using Large Language Models (LLMs) remains challenging, as existing approaches often produce incomplete, unhelpful, or factually incorrect outputs. We introduce DocAgent, a novel multi-agent collaborative system using topological code processing for incremental context building. Specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) then collaboratively generate documentation. We also propose a multi-faceted evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Comprehensive experiments show that DocAgent consistently and significantly outperforms baselines. Our ablation study confirms the vital role of the topological processing order. DocAgent offers a robust approach for reliable code documentation generation in complex and proprietary repositories. Our code (https://github.com/facebookresearch/DocAgent) and video (https://youtu.be/e9IjObGe9_I) are publicly available.

1 Introduction

High-quality code documentation is essential for effective software development (De Souza et al., 2005; Garousi et al., 2015; Chen and Huang, 2009), and has become increasingly important as AI models depend on accurate docstrings for code comprehension tasks (Zhou et al., 2022; Yang et al., 2024; Anthropic, 2025); we use "code documentation" and "docstring" interchangeably throughout the paper. However, creating and maintaining documentation is labor-intensive and prone to errors (McBurney et al., 2017; Parnas, 2010). Even top-starred open-source repositories on GitHub often exhibit low docstring coverage and quality (see Appendix C for details), leading to documentation that frequently lags behind code changes (Aghajani et al., 2019; Robillard, 2009; Uddin et al., 2021).

While LLM-based solutions—such as Fill-in-the-Middle (FIM) predictors (Roziere et al., 2023; GitHub, 2024) and chat agents (Meta, 2025; OpenAI, 2022)—offer automation, extensive studies (Dvivedi et al., 2024; Zhang et al., 2024; Zan et al., 2022; Zheng et al., 2024), along with our empirical analyses (§4), reveal three recurring limitations. First, these approaches often omit essential information (e.g., parameter or return-value descriptions). Second, they typically offer minimal context or rationale, limiting the usefulness of the generated documentation. Third, they sometimes hallucinate non-existent components, especially in large or proprietary repositories, undermining factual correctness (Zan et al., 2022; Ma et al., 2024; Abedu et al., 2024).

We identify three primary challenges that drive these shortcomings. (1) Context Identification and Retrieval: large, complex repositories make it non-trivial to pinpoint which files, dependencies, or external references are genuinely relevant for a given component. (2) Navigating Complex Dependencies: codebases often exhibit dependency chains that exceed typical LLM context limits, requiring strategic context management. (3) Robust and Scalable Evaluation: existing evaluation metrics like BLEU or ROUGE (Roy et al., 2021; Guelman et al., 2024) incompletely capture the multi-faceted goals of documentation, while human evaluation, though more reliable, is expensive and subjective (Luo et al., 2024).

To tackle these challenges, we introduce DocAgent, a multi-agent system that processes code in a topologically sorted order and leverages specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) to collaboratively generate documentation. This mimics human workflows and manages context effectively. We also propose an automatic and robust multi-faceted evaluation framework assessing Completeness, Helpfulness, and Truthfulness via deterministic checks and LLM-as-judge. Our main contributions are: 1) DocAgent, a multi-agent, topologically structured system for context-aware documentation generation. 2) A robust evaluation framework measuring completeness, helpfulness, and factual consistency of code documentation. 3) Comprehensive experiments on diverse repositories showing that DocAgent consistently outperforms state-of-the-art baselines.
Figure 1: Architecture of DocAgent: (1) The Navigator Module uses AST parsing for a Dependency DAG and
topological traversal. (2) The Multi-Agent framework uses specialized agents (Reader, Searcher, Writer, Verifier)
with tools for context-aware documentation generation.

2 Methodology

DocAgent operates in two stages to handle complex dependencies and ensure context relevance. First, the Navigator determines an optimal, dependency-aware processing order (§2.1). Second, a Multi-Agent System incrementally generates documentation, leveraging specialized agents for code analysis, information retrieval, drafting, and verification (§2.2). Figure 1 illustrates this architecture.

2.1 Navigator: Dependency-Aware Order

Generating accurate documentation often requires understanding a component's dependencies. However, naively including the full context of all direct and transitive dependencies can easily exceed the context window limit, especially in large, complex repositories. To address this, the Navigator module establishes a processing order that ensures components are documented only after their dependencies have been processed, thereby enabling incremental context building.

Dependency Graph Construction. DocAgent first performs static analysis on the entire target repository. It parses the Abstract Syntax Trees (ASTs) of all source files to identify code components (functions, methods, classes) and their interdependencies. These dependencies include function/method calls, class inheritance, attribute access, and module imports. These components and relationships are used to construct a directed graph where nodes represent code components and a directed edge from component A to component B (A → B) signifies that A depends on B (i.e., B must be understood to fully understand A). To enable topological sorting, cycles within the graph are detected using Tarjan's algorithm (Tarjan, 1972) and condensed into a single super node. This results in a Directed Acyclic Graph (DAG) representing the repository's dependency structure.

Topological Traversal for Hierarchical Generation. Using the DAG, the Navigator performs a topological sort to determine the documentation generation order. The traversal adheres to the "Dependencies First" principle: a component is processed only after all components it directly depends on have been documented (methods are documented before their enclosing class). This topological ordering ensures that, by the time the multi-agent system generates documentation for a given component, all of its dependencies have already been described. Therefore, each piece of code documentation only needs the information of its one-hop dependencies, eliminating the need to pull in an ever-growing chain of background information.
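To make the Navigator's two steps concrete, the following is a minimal, illustrative Python sketch (not the authors' implementation) that builds a call-dependency graph over top-level functions with the standard ast module, condenses cycles into super nodes with Tarjan's algorithm, and emits a dependencies-first processing order; the function and variable names are our own assumptions.

import ast
from pathlib import Path
from graphlib import TopologicalSorter

def build_dependency_graph(repo_root: str) -> dict[str, set[str]]:
    """Map each top-level function name to the top-level functions it calls."""
    functions: dict[str, ast.FunctionDef] = {}
    for path in Path(repo_root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in tree.body:
            if isinstance(node, ast.FunctionDef):
                functions[node.name] = node

    graph: dict[str, set[str]] = {name: set() for name in functions}
    for name, fn in functions.items():
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callee = node.func.id
                if callee in functions and callee != name:
                    graph[name].add(callee)        # edge A -> B: A depends on B
    return graph

def documentation_order(graph: dict[str, set[str]]) -> list[frozenset[str]]:
    """Condense cycles (Tarjan, 1972) into super nodes, then sort dependencies first."""
    index: dict[str, int] = {}
    low: dict[str, int] = {}
    on_stack: set[str] = set()
    stack: list[str] = []
    sccs: list[frozenset[str]] = []
    counter = 0

    def strongconnect(v: str) -> None:
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:                     # v roots a strongly connected component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(frozenset(component))

    for v in graph:
        if v not in index:
            strongconnect(v)

    scc_of = {v: scc for scc in sccs for v in scc}
    condensed: dict[frozenset[str], set[frozenset[str]]] = {scc: set() for scc in sccs}
    for v, deps in graph.items():
        for w in deps:
            if scc_of[v] != scc_of[w]:
                condensed[scc_of[v]].add(scc_of[w])

    # TopologicalSorter treats the mapped-to sets as predecessors, so each super
    # node is emitted only after everything it depends on: "dependencies first".
    return list(TopologicalSorter(condensed).static_order())

Under such an order, a helper called by many other functions would be documented first, so its freshly generated docstring is available as one-hop context when its callers are processed.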
Figure 2: Screenshot of the DocAgent live code documentation generation page.

2.2 Multi-Agent Documentation Generation

Following the Navigator's order, the multi-agent system generates documentation for each component using four specialized agents coordinated by an Orchestrator. The input is the focal component's source code, including newly generated documentation.

Reader. The Reader agent initiates the process by analyzing the focal component's code. Its primary goal is to determine the information required to generate comprehensive and helpful code documentation. It assesses the component's complexity, visibility (public/private), and implementation details to decide: If additional context is needed: simple, self-contained components might not require external information. What context is needed: this involves identifying specific internal dependencies (functions/classes it uses), usage contexts (where the component is called, revealing its purpose), or external concepts (algorithms, libraries, domain knowledge) referenced implicitly or explicitly.

The agent outputs structured XML requests for two types of information: (1) internal information about related code components, and (2) external knowledge about specialized algorithms or techniques.

The internal information request consists of dependencies and references. A dependency means the focal component calls other components defined in the repository; the Reader determines whether a dependent component is needed to provide the necessary context. A reference means the focal component is called somewhere else in the code repository, showing how it is used in real-world applications and therefore revealing the purpose of the focal code component. This is particularly important for public functions or APIs exposed to the users of the repository. External requests target information not directly present or inferable from the codebase itself, such as domain-specific knowledge or third-party library functionalities (see Appendix B).
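As an illustration, an information request from the Reader might look like the snippet below; the tag and attribute names are assumptions made for exposition, not the exact schema used by DocAgent.

<information_request>
  <internal>
    <dependency name="parse_config" reason="called to load settings; its contract affects the Args section"/>
    <reference name="cli.main" reason="call site that shows how the component is used in practice"/>
  </internal>
  <external>
    <concept name="exponential backoff" reason="retry policy implemented but not explained in the code"/>
  </external>
</information_request>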
Searcher. The Searcher agent is responsible for fulfilling the Reader's information requests using specialized tools. Internal Code Analysis Tool: this tool leverages static analysis capabilities to navigate the codebase. It can retrieve the source code and existing documentation of specified internal components, identify call sites for the focal component, trace dependencies using the pre-computed graph or on-the-fly analysis, and extract relevant structural information (e.g., class hierarchies, method signatures). External Knowledge Retrieval Tool: this tool interfaces with external knowledge sources via a generic retrieval API. It formulates queries based on the Reader's requests for external concepts and processes the results to extract pertinent explanations, definitions, or descriptions.

The Searcher consolidates the retrieved internal code information and external knowledge into a structured format, which serves as the context for the subsequent agents.

Like two human collaborators on a project, the Reader and Searcher talk to each other: after the Searcher sends the retrieved information back, the Reader reads the updated context together with the focal code component and checks whether the context is adequate for generating the documentation. If it is not, the Reader can send further information requests to the Searcher. Information requests and new information are exchanged back and forth between the Reader and Searcher until adequate information has been retrieved.
Writer. The Writer agent receives the focal component's code and the structured context compiled by the Searcher. Its task is to generate the code documentation. The generation process is guided by prompts that specify the desired structure and content based on the component type. Functions/Methods: typically require a summary, extended description, parameter descriptions (Args), return value description (Returns), raised exceptions (Raises), and potentially usage examples (especially for public-facing components). Classes: typically require a summary, extended description, initialization examples, constructor parameter descriptions (Args), and public attribute descriptions (Attributes).

The Writer synthesizes information from both the code and the provided context to produce a draft code documentation adhering to these requirements.
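For a public function, the draft the Writer is asked to produce would follow a structure roughly like the sketch below (Google-style sections); the function and its docstring are hypothetical examples for illustration, not output from the paper's experiments.

def load_dataset(path, split="train", cache=True):
    """Load a dataset split from disk into a list of records.

    Reads the file at `path`, selects the requested `split`, and optionally
    caches the parsed records so repeated calls avoid re-reading the file.

    Args:
        path (str): Location of the serialized dataset on disk.
        split (str): Which split to return, e.g. "train" or "test".
        cache (bool): If True, keep the parsed records in memory for reuse.

    Returns:
        list[dict]: One dictionary per example in the requested split.

    Raises:
        FileNotFoundError: If `path` does not exist.

    Example:
        >>> records = load_dataset("data/reviews.json", split="test")
    """
    ...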
Verifier. The Verifier takes the context, the code component, and the generated code documentation from the Writer as inputs, and evaluates the quality of the code documentation against predefined criteria: information value, detail level, and completeness. Upon evaluation, the Verifier either approves the documentation or provides specific improvement suggestions through structured feedback.

The Verifier can talk to the Writer if the issue can be addressed without additional context information, for example a formatting issue, which can easily be fixed by asking the Writer to rewrite. If the issue stems from a lack of information and additional context is needed, the Verifier can instead provide suggestions to the Reader, and the additional information is gathered through another Reader-Searcher cycle.
Orchestrator. An Orchestrator manages the agent workflow through an iterative process. The cycle begins with the Reader analyzing the focal component and requesting necessary context. The Searcher gathers this information, after which the Writer generates a docstring. The Verifier then evaluates the docstring quality, either approving it or returning it for revision. This process continues until satisfactory code documentation is generated or a maximum iteration limit is reached.
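In rough pseudocode, one Orchestrator cycle for a single component could look like the sketch below; the agent interfaces, method names, and stopping rule are our own assumptions, intended only to illustrate the control flow described above.

def generate_docstring(component, reader, searcher, writer, verifier, max_iters=3):
    """One Orchestrator cycle for a single component (illustrative control flow)."""
    context = {}
    docstring = None
    for _ in range(max_iters):
        requests = reader.analyze(component, context)       # what is still missing?
        if requests:
            context.update(searcher.fulfill(requests))       # internal + external lookup
        docstring = writer.draft(component, context)
        verdict = verifier.review(component, context, docstring)
        if verdict.approved:
            break
        if verdict.needs_more_context:                        # route feedback to the Reader
            context.update(searcher.fulfill(reader.refine(verdict)))
    return docstring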
Adaptive Context Management. To handle potentially large contexts retrieved by the Searcher, especially for complex components, the Orchestrator implements an adaptive context truncation mechanism. It monitors the total token count of the context provided to the Writer. If the context exceeds a configurable threshold (based on the underlying LLM's limits), the Orchestrator applies a targeted truncation strategy. It identifies the largest sections within the structured context (e.g., external knowledge snippets, specific dependency details) and selectively removes content from the end of these sections to reduce the token count while preserving the overall structure. This ensures that the context remains within operational limits, balancing contextual richness with model constraints.
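A minimal sketch of such a truncation policy is shown below; it assumes the context is a dict of named sections and uses a whitespace token count as a stand-in for the real tokenizer, so the names and trimming ratio are illustrative rather than the actual implementation.

def truncate_context(sections: dict[str, str], max_tokens: int) -> dict[str, str]:
    """Trim the largest context sections from the end until the budget fits."""
    def count(text: str) -> int:
        return len(text.split())              # placeholder for a real tokenizer

    sections = dict(sections)                 # avoid mutating the caller's dict
    total = sum(count(v) for v in sections.values())
    while total > max_tokens:
        # Pick the currently largest section and drop roughly its last 10%.
        name = max(sections, key=lambda k: count(sections[k]))
        words = sections[name].split()
        if not words:
            break                             # nothing left to trim
        keep = max(0, int(len(words) * 0.9) - 1)
        sections[name] = " ".join(words[:keep])
        total = sum(count(v) for v in sections.values())
    return sections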

Figure 3: Multi-facet Evaluation Framework for code documentation, assessing quality along three dimensions: (1) Completeness measures structural adherence to documentation conventions; (2) Helpfulness evaluates practical utility; and (3) Truthfulness verifies factual accuracy.

3 Evaluation Framework

Evaluating the quality of automatically generated code documentation is challenging. Traditional metrics commonly used in natural language generation, such as BLEU or ROUGE, cannot be used because of the lack of gold references (Roy et al., 2021; Guelman et al., 2024). Simple heuristics like documentation length are insufficient indicators of actual utility. While human evaluation provides the most accurate assessment (Luo et al., 2024), it is inherently subjective, expensive, and difficult to scale, rendering it impractical for large-scale experiments or continuous integration scenarios.

To overcome these limitations, we propose a comprehensive and scalable evaluation framework designed to systematically assess documentation quality along three crucial dimensions: Completeness, Helpfulness, and Truthfulness. This multi-faceted approach combines deterministic structural checks, LLM-based qualitative assessments, and fact-checking against the codebase itself, providing a holistic view of the generated documentation's value. Our methodology is informed by established software engineering best practices for documentation and addresses the specific shortcomings observed in existing LLM-based generation systems.
Figure 4: Screenshot of the DocAgent live evaluation framework.

3.1 Completeness

Completeness measures the extent to which the generated documentation adheres to standard structural conventions and includes the essential components expected for a given code element (e.g., function, class). High-quality code documentation typically includes not only a summary but also descriptions of parameters, return values, raised exceptions, and potentially usage examples, depending dynamically on the element's signature, body, and visibility.

To quantify completeness, we employ an automated checker based on Abstract Syntax Tree (AST) analysis and regular expressions. The process involves: AST Parsing: identifying code components (classes, functions, methods) and extracting their generated docstrings. Code Analysis: analyzing the code signature and body (e.g., presence of parameters, return statements, raise statements) and visibility (public/private) to determine the required documentation sections dynamically. For instance, a function without parameters does not require an "Args" section, while a public class method might benefit more from an "Example" section than a private helper function. Section Identification: detecting the presence of standard sections (e.g., Summary, Description, Args, Returns, Raises, Examples, and Attributes for classes) within the docstring using predefined patterns and structural cues. Scoring: calculating a completeness score for each docstring as the proportion of required sections that are present. This yields a normalized score between 0.0 and 1.0.

This deterministic approach provides an objective measure of structural adherence, indicating whether the documentation meets basic formal requirements.
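The following is a small, self-contained sketch of such a checker, assuming Google-style section headers; the required-section rules are simplified relative to whatever the full system uses.

import ast
import re

SECTION_PATTERNS = {
    "Args": re.compile(r"^\s*Args:", re.MULTILINE),
    "Returns": re.compile(r"^\s*Returns:", re.MULTILINE),
    "Raises": re.compile(r"^\s*Raises:", re.MULTILINE),
    "Example": re.compile(r"^\s*Examples?:", re.MULTILINE),
}

def completeness_score(func_source: str) -> float:
    """Fraction of required docstring sections that are actually present."""
    fn = ast.parse(func_source).body[0]
    assert isinstance(fn, ast.FunctionDef)
    doc = ast.get_docstring(fn) or ""

    required = ["Summary"]
    if fn.args.args or fn.args.kwonlyargs:           # has parameters -> Args needed
        required.append("Args")
    body_nodes = list(ast.walk(fn))
    if any(isinstance(n, ast.Return) and n.value is not None for n in body_nodes):
        required.append("Returns")
    if any(isinstance(n, ast.Raise) for n in body_nodes):
        required.append("Raises")
    if not fn.name.startswith("_"):                  # public -> usage example expected
        required.append("Example")

    present = {"Summary"} if doc.strip() else set()
    present |= {name for name, pat in SECTION_PATTERNS.items() if pat.search(doc)}
    return len(present & set(required)) / len(required)

print(completeness_score('def add(a, b):\n    """Add two numbers."""\n    return a + b'))
# -> 0.25: only the summary is present; Args, Returns, and Example are missing.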

3.2 Helpfulness

Helpfulness assesses the semantic quality and practical utility of the documentation content. A helpful docstring goes beyond merely restating code elements; it elucidates the purpose, usage context, design rationale, and potential constraints of the code. Key aspects include: Clarity and Conciseness: is the summary informative yet brief? Descriptive Depth: does the extended description provide sufficient context, explain the 'why' behind the code, or mention relevant scenarios or edge cases? Parameter/Attribute Utility: are descriptions for inputs and attributes meaningful, specifying expected types, value ranges, or constraints, rather than just echoing names? Guidance: does the documentation effectively guide a developer on when and how to use the component?

Assessing these qualitative aspects automatically is challenging. Inspired by recent work on evaluating complex generation tasks (Wang et al., 2024; Zhuge et al., 2024), we utilize an LLM-as-judge approach, carefully structured to enhance robustness and consistency. To mitigate potential biases and variability associated with LLM judgments, we implement a sophisticated framework. Component-Specific Evaluation: we decompose the evaluation by assessing distinct parts of the docstring separately (e.g., summary, main description, parameter descriptions) using tailored prompts for each. Structured Prompt Engineering: each prompt includes 1) Explicit Scoring Rubrics: detailed criteria for a 5-point Likert scale (1=Poor to 5=Excellent), defining expectations for each score level regarding clarity, depth, and utility. 2) Illustrative Examples: concrete examples of good and bad documentation snippets corresponding to different score levels, grounding the evaluation criteria. 3) Step-by-Step Instructions: guiding the LLM to analyze the code, compare the docstring against the rubric, consider the code's context, and justify its rating. 4) Standardized Output Format: requiring the LLM to provide structured output, including detailed reasoning, specific suggestions for improvement (if applicable), and the final numerical score. This facilitates analysis and consistency checking.

This structured approach allows for scalable assessment of semantic quality, moving beyond surface-level checks to gauge the documentation's actual value to a developer.
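As a rough illustration, a component-specific judge prompt for the summary part could be structured as sketched below; the exact wording, rubric text, and examples used in our experiments differ (see Appendix E), so this shows only the shape, not the real prompt.

SUMMARY_JUDGE_PROMPT = """\
You are evaluating only the SUMMARY line of a Python docstring.

Scoring rubric (1-5):
  5 - concise, accurate, states purpose and key behavior
  3 - accurate but generic; mostly restates the function name
  1 - missing, misleading, or unrelated to the code

Good example: "Parse a config file and return validated settings."
Bad example:  "This function does the thing."

Steps: read the code, read the summary, compare against the rubric,
then justify your rating before giving a score.

Code:
{code}

Summary under evaluation:
{summary}

Respond with: reasoning, suggestions (optional), score (integer 1-5).
"""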
3.3 Truthfulness

A critical dimension of documentation quality is its factual accuracy, or Truthfulness. Documentation, especially when generated by LLMs unfamiliar with a specific private codebase, can suffer from "hallucinations"—confidently referencing non-existent methods, parameters, or classes, or misrepresenting relationships between components. Such inaccuracies severely undermine trust and can mislead developers.

We evaluate Truthfulness by verifying whether entities mentioned in the generated documentation actually exist within the target repository and are referenced correctly. Our pipeline comprises three stages. Code Entity Extraction: an LLM is prompted to identify mentions of repository-specific code components (classes, functions, methods, attributes) within the generated docstring. The prompt specifically instructs the model to distinguish these from standard language keywords, built-in types (e.g., list, dict), and common external library components, focusing on internal references. Ground Truth Construction: we leverage the dependency graph constructed by the Navigator module (§2.1). This graph serves as the ground truth, containing a canonical representation of all code components and their locations within the repository. Verification: each extracted entity mention is cross-referenced against the dependency graph.

We quantify Truthfulness using the Existence Ratio: the proportion of unique repository-specific entities mentioned in the documentation that correspond to actual entities in the codebase:

    Existence Ratio = |Verified Entities| / |Extracted Entities|

A high ratio indicates that the documentation is well-grounded in the actual code structure, minimizing the risk of hallucinated references.
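Once entities have been extracted, the metric itself is a simple set computation; the sketch below assumes the extracted mentions and the graph's known component names are both available as strings.

def existence_ratio(extracted_entities: list[str], known_components: set[str]) -> float:
    """Share of unique extracted entity mentions that exist in the dependency graph."""
    mentioned = set(extracted_entities)
    if not mentioned:
        return 1.0                    # nothing claimed, so nothing hallucinated
    verified = mentioned & known_components
    return len(verified) / len(mentioned)

# e.g. existence_ratio(["Parser.parse", "load_config", "make_report"],
#                      {"Parser.parse", "load_config"}) -> 2/3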
Together, these three dimensions—Completeness, Helpfulness, and Truthfulness—provide a robust and nuanced framework for evaluating automatic code documentation systems, enabling quantitative comparisons and deeper insights into their strengths and weaknesses.

4 Experiment

4.1 Baselines

We compare DocAgent against two representative baseline systems commonly used for code documentation generation. FIM (Fill-in-the-middle): simulates inline code completion tools that predict documentation based on surrounding code. We use CodeLlama-13B (Roziere et al., 2023), an open model trained with FIM tasks (Bavarian et al., 2022); abbreviated as FIM-CL. Chat: represents generating documentation by providing the code snippet directly to a chat-based LLM. We test two leading models: GPT-4o mini (2024-07-18 version) (OpenAI, 2022) and CodeLlama-34B-instruct (Roziere et al., 2023); abbreviated as Chat-GPT and Chat-CL, respectively.

4.2 Experiment Setup

Data. We select a representative subset of Python repositories to ensure diversity in size, complexity, and domain. The dataset comprises modules, functions, methods, and classes with varying degrees of dependency density (details in Appendix D).

Systems. We evaluate two variants of our proposed system, differing only in the backbone LLM used by the agents: DA-GPT, DocAgent utilizing GPT-4o mini, and DA-CL, DocAgent utilizing CodeLlama-34B-instruct. The choice of backbone LLM is orthogonal to the DocAgent framework itself; we use GPT-4o-2024-08-06 universally for running the evaluation, for more robust results.

Statistical Significance. All claims of statistical significance are based on paired t-tests with a significance threshold of p < 0.05. Due to space limitations, we are unable to include the full prompts and detailed experimental setup in the paper; however, all configurations are available in our project's public release repository.

4.3 Experiment Results

We evaluate the systems using the framework proposed in Section 3, focusing on Completeness, Helpfulness, and Truthfulness.

4.3.1 Completeness

System     Overall   Function  Method    Class
DA-GPT     0.934†    0.945†    0.935†    0.914†
DA-CL      0.953†‡   0.985†‡   0.982†‡   0.816†‡
Chat-GPT   0.815     0.828     0.823     0.773
Chat-CL    0.724     0.726     0.744     0.667
FIM-CL     0.314     0.291     0.345     0.277

Table 1: Average Completeness Scores. †: significantly better than the corresponding Chat baseline. ‡: significantly better than the FIM baseline.
As shown in Table 1, both DocAgent variants significantly outperform their respective Chat counterparts. DocAgent (CodeLlama-34B) achieves an overall score of 0.953, representing a substantial improvement of 0.229 points over Chat. Similarly, DocAgent (GPT-4o mini) scores 0.934 overall, significantly higher than Chat at 0.815. These improvements are statistically significant across all component types. FIM performs poorly, achieving an overall completeness score of only 0.314. This highlights the effectiveness of DocAgent's structured, context-aware generation process compared to simply prompting an LLM with the code in isolation.

4.3.2 Helpfulness

System     Overall   Summary   Description  Parameters
DA-GPT     3.88†     4.32†     3.60†        2.71
DA-CL      2.35‡     2.36†‡    2.43‡        2.00
Chat-GPT   2.95      3.56      2.42         2.20
Chat-CL    2.16      2.04      2.37         1.80
FIM-CL     1.51      1.30      2.45         1.50

Table 2: Average Helpfulness Scores. †: significantly better than the corresponding Chat baseline. ‡: significantly better than FIM.

As shown in Table 2, DocAgent (GPT-4o mini) achieves the highest overall helpfulness score, significantly outperforming the corresponding Chat baseline and demonstrating its ability to generate clearer and more informative content by leveraging retrieved context. DocAgent (CodeLlama-34B) also shows an improvement over its Chat counterpart, producing significantly more helpful summaries. Furthermore, DocAgent (CodeLlama-34B) also significantly outperforms FIM. Across aspects, generating helpful parameter descriptions appears most challenging. DocAgent (GPT-4o mini) achieves the highest score even here, suggesting its structured approach aids in this difficult task, although room for improvement remains.

4.3.3 Truthfulness

System     Verified  Extracted  Existence Ratio (%)
DA-GPT     265       305        95.74%
DA-CL      354       600        88.17%
Chat-GPT   366       347        61.10%
Chat-CL    366       488        68.03%
FIM-CL     338       131        45.04%

Table 3: Truthfulness Analysis: Existence Ratio (%). Higher is better. Extracted = extracted entities; Verified = verified entities, as defined in §3.3.

The results in Table 3 demonstrate the superior factual accuracy of documentation generated by DocAgent. DocAgent (GPT-4o mini) achieves the highest Existence Ratio at 95.74%, indicating that the vast majority of its references to internal code components are correct. DocAgent (CodeLlama-34B) also performs strongly with a ratio of 88.17%. This contrasts sharply with the baselines. The Chat approaches exhibit significantly lower truthfulness, with Chat (GPT-4o mini) at 61.10% and Chat (CodeLlama-34B) at 68.03%. This suggests that simply providing the code snippet to a chat model often leads to inaccurate assumptions or hallucinations about the surrounding codebase context. FIM performs worst, with an Existence Ratio of only 45.04%, implying that nearly half of its references to repository entities might be incorrect. This low score highlights a significant risk of misleading developers when using FIM for documentation.

4.4 Ablation Study

To isolate the contribution of the dependency-aware processing order determined by the Navigator module (§2.1), we conducted an ablation study. We created variants of DocAgent (DA-Rand-GPT, DA-Rand-CL) that process components in a random order. Completeness was omitted from the ablation study because it depends on the code's structure, not on the Navigator's processing order.

4.4.1 Impact on Helpfulness

System        Overall       Summary       Description   Parameters
DA-GPT        3.88†         4.32†         3.60          2.71
DA-Rand-GPT   3.44 (-0.44)  3.62 (-0.70)  3.30 (-0.30)  2.20 (-0.51)
DA-CL         2.35†         2.36†         2.43          2.00
DA-Rand-CL    2.18 (-0.17)  1.88 (-0.48)  2.42 (-0.10)  2.00 (0.00)

Table 4: Ablation: Average Helpfulness Scores. †: full DocAgent significantly better than its Random variant.

The results in Table 4 demonstrate the benefit of the Navigator's topological sorting in improving Helpfulness. For both underlying LLMs, the full DocAgent achieved significantly higher overall helpfulness scores compared to its random-order counterpart. With GPT-4o mini, the full DocAgent scored 3.69 overall, significantly higher than DocAgent-Random's 3.44. The improvement was particularly pronounced in summary generation. Similarly, with CodeLlama-34B, the full DocAgent scored 2.39 overall, significantly outperforming DocAgent-Random's 2.18. Again, the summary scores showed a significant difference.
4.4.2 Impact on Truthfulness

We also evaluated the impact of removing the hierarchical generation order on factual accuracy (Truthfulness). Without the Navigator, the Searcher can still retrieve dependent code components. However, since the "Dependencies First" principle is not followed, these components are less likely to already have generated documentation available for context.

System        Verified    Extracted   Existence Ratio (%)
DA-GPT        187         224         94.64%
DA-Rand-GPT   164 (-23)   166 (-58)   86.75% (-7.89)
DA-CL         190         343         87.76%
DA-Rand-CL    188 (-2)    360 (+17)   83.06% (-4.70)

Table 5: Ablation: Truthfulness Analysis (Existence Ratio %), evaluated on 50 code components randomly sampled from the full dataset.

Table 5 demonstrates that the topological sort also improves truthfulness. Both full DocAgent variants achieve higher Existence Ratios than their random-order counterparts. The Existence Ratio of DocAgent (GPT-4o mini) drops from 94.64% to 86.75% without the sort, and the ratio of DocAgent (CodeLlama-34B) drops from 87.76% to 83.06%. Collectively, the ablation results confirm that the Navigator's dependency-aware topological ordering is a crucial component of DocAgent, significantly contributing to both the helpfulness and factual accuracy of the generated documentation by enabling effective incremental context management.

5 Conclusion

We addressed the challenge of automatically generating high-quality code documentation, a task where existing LLM-based methods often struggle with incompleteness, lack of helpfulness, and factual inaccuracies. We introduced DocAgent, a novel tool-integrated, multi-agent system that leverages a dependency-aware topological processing order determined by a Navigator module. This allows specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) to collaboratively generate documentation by incrementally building context from dependencies. We also proposed a robust and scalable evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Our experiments on diverse Python repositories demonstrate that DocAgent consistently and significantly outperforms FIM and Chat baselines, producing more complete, helpful, and factually accurate documentation. An ablation study confirmed the critical contribution of the topological processing order to both helpfulness and truthfulness. DocAgent represents a promising step towards reliable and useful automated code documentation generation for complex and proprietary software.

6 Ethics and Limitations

DocAgent, while advancing automated code documentation, has inherent limitations and ethical considerations. Technically, processing extremely large codebases may still challenge LLM context limits despite topological sorting and context management. Relying solely on static analysis restricts understanding of dynamic behavior, and the current Python focus requires effort for adaptation to other languages.

Ethically, the primary concern is factual accuracy; generated documentation, though improved, may still contain hallucinations or inaccuracies, potentially misleading developers. The underlying LLMs may propagate biases from their training data into the documentation. Over-reliance on such tools could potentially hinder developers' deep code comprehension skills. Applying DocAgent to proprietary code necessitates careful handling, especially regarding external queries, to avoid inadvertently leaking sensitive information. Finally, the computational resources required for LLM-driven multi-agent systems represent a notable cost and environmental consideration. Future work should address these limitations, focusing on robustness, bias mitigation, and deeper evaluation, while emphasizing that human oversight remains crucial in practical deployment.

References

Samuel Abedu, Ahmad Abdellatif, and Emad Shihab. 2024. LLM-based chatbots for mining software repositories: Challenges and opportunities. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 201–210.

Emad Aghajani, Csaba Nagy, Olga Lucero Vega-Márquez, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, and Michele Lanza. 2019. Software documentation issues unveiled. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 1199–1210. IEEE.

Wasi U Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In ACL.
Anthropic. 2025. Model context length increases with the new context protocol. Accessed: 2025-03-27.

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. 2022. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255.

Jie-Cherng Chen and Sun-Jen Huang. 2009. An empirical analysis of the impact of software development problem factors on software maintainability. Journal of Systems and Software, 82(6):981–992.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Qian Chen, Binyuan Tang, Yankai Zhang, Binhua Wang, Zhifang Zhang, and Qun Zhang. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2305.03047.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).

Colin B Clement, Andrew Terrell, Hanlin Mao, Joshua Dillon, Sameer Singh, and Dan Alistarh. 2020. PyMT5: Multi-mode translation of natural language and Python code with transformers. In EMNLP.

Sergio Cozzetti B De Souza, Nicolas Anquetil, and Káthia M de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, pages 68–75.

Shubhang Shekhar Dvivedi, Vyshnav Vijay, Sai Leela Rahul Pujari, Shoumik Lodh, and Dhruv Kumar. 2024. A comparative analysis of large language models for code documentation generation. In Proceedings of the 1st ACM International Conference on AI-Powered Software, pages 65–73.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, and Daxin Jiang. 2020. CodeBERT: A pre-trained model for programming and natural languages. In EMNLP.

Golara Garousi, Vahid Garousi-Yusifoğlu, Guenther Ruhe, Junji Zhi, Mahmoud Moussavi, and Brian Smith. 2015. Usage and usefulness of technical software documentation: An industrial case study. Information and Software Technology, 57:664–682.

GitHub. 2024. How GitHub Copilot is getting better at understanding your code. Accessed: 2025-03-27.

Liron Guelman, Alon Lavie, and Eran Yahav. 2024. Using large language models to document code: A first quantitative and qualitative assessment. arXiv preprint arXiv:2403.04264.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. 2021. GraphCodeBERT: Pre-training code representations with data flow. In ICLR.

Seungone Kim, Soobin Kim, Alice Oh, and Gunhee Han. 2023. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491.

Michael Krumdick, Jason Wei, Xinyang Chen, Shangbin Du, Shu Xu, Dale Schuurmans, and Ed H Chi. 2025. No free labels: Limitations of LLM-as-a-judge without human grounding. arXiv preprint arXiv:2503.05061.

Yukyung Lee, Wonjoon Cho, and Jinhyuk Kim. 2024. CheckEval: A reliable LLM-as-a-judge framework for evaluating text generation using checklists. arXiv preprint arXiv:2403.18771.

Raymond Li, Lewis Tunstall, Patrick von Platen, Jungtaek Kim, Teven Le Scao, Thomas Wolf, and Alexander M. Rush. 2023a. StarCoder: May the source be with you! Preprint, arXiv:2305.06161.

Xiang Li, Qinyuan Zhu, Yelong Cheng, Weizhu Xu, and Xi Liu. 2023b. CAMEL: Communicative agents for "mind" exploration. arXiv preprint arXiv:2303.17760.

Minqian Liu, Cheng Feng, Qing Lyu, Wenhao Zeng, Chao Zheng, Ruidan Zhang, and Steven C H Lin. 2023a. X-Eval: Generalizable multi-aspect text evaluation via augmented instruction tuning. arXiv preprint arXiv:2311.08788.

Yang Liu, Yao Fu, Yujie Xie, Xinyi Chen, Bo Pang, Chenyan Qian, Teng Ma, and Dragomir Radev. 2023b. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, and 1 others. 2024. RepoAgent: An LLM-powered open-source framework for repository-level code documentation generation. arXiv preprint arXiv:2402.16667.

Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2024. How to understand whole software repository? arXiv preprint arXiv:2406.01422.

Paul W McBurney, Siyuan Jiang, Marouane Kessentini, Nicholas A Kraft, Ameer Armaly, Mohamed Wiem Mkaouer, and Collin McMillan. 2017. Towards prioritizing documentation effort. IEEE Transactions on Software Engineering, 44(9):897–913.
Meta. 2025. Meta AI. https://ai.meta.com/meta-ai/. Accessed: 2025-03-27.

OpenAI. 2022. Introducing ChatGPT. Accessed: 2025-03-27.

David Lorge Parnas. 2010. Precise documentation: The key to better software. In The Future of Software Engineering, pages 125–148. Springer.

Yuzhang Qian, Zian Zhang, Liang Pan, Peng Wang, Shouyi Liu, Wayne Xin Zhao, and Ji-Rong Wen. 2023. ChatDev: Revolutionizing software development with AI-collaborative agents. arXiv preprint arXiv:2307.07924.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

Martin P Robillard. 2009. What makes APIs hard to learn? Answers from developers. IEEE Software, 26(6):27–34.

Rahul Roy, Saikat Chakraborty, Baishakhi Ray, and Miryung Kim. 2021. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 1344–1356.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, and 1 others. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Noah Shinn, Margaret Labash, and Stefano Ermon. 2023. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.

Robert Tarjan. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160.

Gias Uddin, Foutse Khomh, and Chanchal K Roy. 2021. Automatic API usage scenario documentation from technical Q&A sites. ACM Transactions on Software Engineering and Methodology (TOSEM), 30(3):1–45.

Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Qingsong Wen, Wei Ye, and 1 others. 2024. AutoSurvey: Large language models can automatically write surveys. Advances in Neural Information Processing Systems, 37:115119–115145.

Yue Wang, Shuo Ren, Daya Lu, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP.

Ziniu Wu, Cheng Liu, Jindong Zhang, Xinyun Li, Yewen Wang, Jimmy Xin, Lianmin Zhang, Eric Xing, Yuxin Lu, and Percy Liang. 2023. AutoGen: Enabling next-generation multi-agent communication with language models. arXiv preprint arXiv:2309.07864.

Dayu Yang, Tianyang Liu, Daoan Zhang, and 1 others. 2025. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in LLMs. arXiv preprint arXiv:2502.19411.

Guang Yang, Yu Zhou, Wei Cheng, Xiangyu Zhang, Xiang Chen, Terry Yue Zhuo, Ke Liu, Xin Zhou, David Lo, and Taolue Chen. 2024. Less is more: Docstring compression in code generation. arXiv preprint arXiv:2410.22793.

Shinn Yao, Jeffrey Zhao, Dian Yu, Kang Chen, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Yaqing Zan, Mingyu Ding, Bill Yuchen Lin, and Xiang Ren. 2022. When language model meets private library. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Kaiyu Zhang, Yifei Wang, Yue Yu, Yujie Li, Zihan Lin, Dongxu Zhang, Yichi Zhou, Yifei Xu, Ang Chen, Weiyi Zhang, and 1 others. 2024. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. arXiv preprint arXiv:2401.10650.

Shiyue Zhang, Binyi Li, Jason Wei, Aditi Raghunathan Vyas, and Percy Liang. 2023a. Themis: A flexible and interpretable NLG evaluation model. arXiv preprint arXiv:2309.12082.

Xiaoqing Zhang, Zhirui Wang, Lichao Yang, Wei Zhang, and Yong Zhang. 2023b. MapCoder: Map-reduce-style code generation with multi-agent collaboration. arXiv preprint arXiv:2307.15808.

Lianmin Zheng, Shangbin Du, Yuhui Lin, Yukuo Shao, Zi Lin, Zhen Liu, and 1 others. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

Zihan Zheng, Jiayi Zheng, Weiyan Liu, Yizhong Wang, Chen Liu, Xiang Lorraine Li, Mu Li, Wenhao Zhang, Diyi Huang, and Xiang Ren. 2024. How well do LLMs generate code for different application domains? arXiv preprint arXiv:2401.13727.

Shuyan Zhou, Uri Alon, Frank F Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. 2022. DocPrompting: Generating code by retrieving the docs. arXiv preprint arXiv:2207.05987.

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, and 1 others. 2024. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934.
A Related Work

LLM Agents. Recent advancements in LLM agents have enabled automating complex code-related tasks (Yang et al., 2025). Single-agent frameworks like ReAct (Yao et al., 2022) and Reflexion (Shinn et al., 2023) integrate action-reasoning and self-reflection. Multi-agent systems (CAMEL (Li et al., 2023b), AutoGen (Wu et al., 2023)) extend these ideas with role-specialized LLMs and structured communication to handle more complex problems. In software development, MapCoder (Zhang et al., 2023b), RGD (Chen et al., 2023), and ChatDev (Qian et al., 2023) leverage specialized agents for many downstream tasks, achieving state-of-the-art code generation. These insights on multi-agent coordination and workflow structuring underpin our DocAgent framework, which adopts a topologically-aware, tool-integrated multi-agent design.

Code Summarization. Pre-trained encoders such as CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021) introduced bi-modal and structure-aware learning, while the encoder-decoder models PLBART (Ahmad et al., 2021) and CodeT5 (Wang et al., 2021) unified code generation and summarization. PyMT5 (Clement et al., 2020) extended T5 for Python docstring translation with multi-mode support. Recently, LLMs (OpenAI Codex (Chen et al., 2021), StarCoder (Li et al., 2023a), CodeLlama (Roziere et al., 2023)) have demonstrated strong zero-shot summarization. However, they often lack repository-level context, dependency awareness, and collaboration—limitations our multi-agent, context-aware DocAgent aims to overcome.

B Why External Information is Needed

The external open-internet information request is necessary for writing documentation for some novel, newly proposed ideas, such as new evaluation methods, algorithms, model structures, or loss functions. For example, DPO (Rafailov et al., 2023) is a reinforcement learning algorithm proposed in 2023, while CodeLlama has a knowledge cutoff of September 2022. When using CodeLlama for documentation generation, without access to the mathematical intuition and description of DPO from the open internet, it would not be possible to write helpful documentation that describes the intuition behind an implementation of DPO.

C Scarcity of Code Documentation

We analyzed 164 top-starred Python repositories (created after January 1, 2025), encompassing 13,314 files and 115,943 documentable nodes (functions, classes, and methods). Of these nodes, only 27.28% contained any documentation, with 66.46% of repositories exhibiting less than 30% coverage (Figure 5). Furthermore, 62.25% of repositories averaged 30 words or fewer per documentation block (Figure 6), while only 3.98% exceeded an average of 100 words, illustrating the widespread brevity and overall scarcity of code documentation.

Figure 5: Distribution of repositories by code documentation coverage.

Figure 6: Distribution of repositories by average words per documentation.

D Data

We gathered 164 top-starred Python repositories from GitHub, each created after January 1, 2025, having more than 50 stars, and exceeding 50 KB in size. From this corpus, we selected 9 repositories reflecting the overall distribution in terms of lines of code and topological complexity. Figure 7 shows the selected repositories (red points) overlaid on the broader distribution. Eventually, we collected 366 code components (120 functions, 178 methods, and 68 classes) for evaluation, with a separate subset of 50 distinct code components (randomly sampled from the full set) used specifically for our truthfulness ablation study.

Figure 7: Distribution of repositories used for docstring generation evaluation.
E Robust LLM-as-judge

Assessing the qualitative aspects of Helpfulness automatically is inherently challenging due to subjectivity. We employ an LLM-as-judge approach, but incorporate rigorous mechanisms inspired by existing work to enhance reliability and consistency, mitigating known issues like positional bias or variability (Wang et al., 2024; Zhuge et al., 2024). Decomposed Evaluation: instead of a single holistic judgment, the LLM evaluates distinct parts of the docstring (e.g., summary, parameter descriptions, overall description) separately, using tailored prompts for each part (Liu et al., 2023a; Lee et al., 2024). Structured Prompting: each prompt provides the LLM with:

• Explicit Rubrics: detailed criteria defining expectations for different levels on a 5-point Likert scale (1=Poor to 5=Excellent) concerning clarity, detail, and utility specific to the docstring part being evaluated (Kim et al., 2023; Zhang et al., 2023a).

• Illustrative Examples: few-shot examples demonstrating good and bad documentation snippets corresponding to different score levels, grounding the rubric criteria (Zheng et al., 2023; Chiang and Lee, 2023).

• Chain-of-Thought Instructions: guiding the LLM to first analyze the code, then compare the corresponding docstring section against the rubric, justify its rating step-by-step, and identify specific strengths or weaknesses (Liu et al., 2023b; Zheng et al., 2023).

• Standardized Output Format: requiring the LLM to output its rating along with detailed justifications in a structured format (e.g., JSON), facilitating aggregation and analysis while ensuring the LLM adheres to the evaluation protocol (Liu et al., 2023b; Lee et al., 2024; Krumdick et al., 2025).

This structured LLM-as-judge approach aims to provide a scalable yet nuanced assessment of the documentation's practical value to developers.
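As an illustration of the standardized output format, the judge's response for a summary evaluation might look like the following; the field names are assumptions made for exposition and are not the exact schema used in our implementation.

{
  "component": "summary",
  "reasoning": "The summary states what the function returns but not why a caller would use it.",
  "suggestions": "Mention the caching behavior and the expected input format.",
  "score": 3
}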

F More System Screenshots

Figure 8 shows the configuration page before initiating the code documentation generation process. The page mainly consists of three parts: the target repository path, the LLM configuration, and flow control (for the Orchestrator).

Figure 8: Screenshot of the configuration page.

Figure 9 displays the window that appears after clicking the "Evaluate" button. Since using an LLM as a judge is costly (consuming approximately 500 tokens per evaluation), this feature is optional in the web UI. Only when the user clicks the "Evaluate" button is the evaluation triggered, after which the button changes to display the generated score.

Figure 9: Screenshot of the helpfulness evaluation window.
