Figure 1: Architecture of DocAgent: (1) the Navigator module uses AST parsing to build a dependency DAG and determine a topological traversal order; (2) the multi-agent framework uses specialized agents (Reader, Searcher, Writer, Verifier) with tools for context-aware documentation generation.
Figure 2: Screenshot of DocAgent live code documentation generation page.

This ordering ensures that, by the time a component is documented, all of its dependencies have already been described. Therefore, the documentation for each component only needs information about its one-hop dependencies, eliminating the need to pull in an ever-growing chain of background information.
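For illustration, this "dependencies first" ordering can be reproduced with standard-library tooling alone. The sketch below is a simplified stand-in for the Navigator, not the actual DocAgent implementation: it extracts call edges between module-level functions with ast and orders them with graphlib.

import ast
from graphlib import TopologicalSorter

def dependency_order(source: str) -> list[str]:
    """Return module-level function names so that callees precede callers."""
    tree = ast.parse(source)
    defs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = {}
    for name, node in defs.items():
        callees = {
            c.func.id
            for c in ast.walk(node)
            if isinstance(c, ast.Call)
            and isinstance(c.func, ast.Name)
            and c.func.id in defs          # keep repository-internal edges only
        }
        graph[name] = callees              # a function depends on its callees
    # TopologicalSorter emits dependencies before the components that use them.
    return list(TopologicalSorter(graph).static_order())

example = """
def helper(x): return x + 1
def api(x): return helper(x) * 2
"""
print(dependency_order(example))  # ['helper', 'api']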
2.2 Multi-Agent Documentation Generation

Following the Navigator's order, the multi-agent system generates documentation for each component using four specialized agents coordinated by an Orchestrator. The input is the focal component's source code, including any newly generated documentation.

Reader. The Reader agent initiates the process by analyzing the focal component's code. Its primary goal is to determine the information required to generate comprehensive and helpful code documentation. It assesses the component's complexity, visibility (public/private), and implementation details to decide: (1) whether additional context is needed (simple, self-contained components might not require external information), and (2) what context is needed, which involves identifying specific internal dependencies (functions/classes it uses), usage contexts (where the component is called, revealing its purpose), or external concepts (algorithms, libraries, domain knowledge) referenced implicitly or explicitly.

The agent outputs structured XML requests covering two types of information: (1) internal information about related code components, and (2) external knowledge about specialized algorithms or techniques (an illustrative request is sketched below).

The internal information request consists of dependencies and references. A dependency means the focal component calls other components defined in the repository; the Reader determines whether a dependent component is needed to provide the necessary context.

A reference means the focal component is called somewhere in the code repository, showing how it is used in real-world applications and therefore revealing the purpose of the focal code component. This is particularly important for public functions or APIs exposed to users of the repository.

External requests target information not directly present or inferable from the codebase itself, such as domain-specific knowledge or third-party library functionalities (see Appendix B).
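The excerpt does not specify the request schema; the snippet below is a purely hypothetical illustration of the two request types, with made-up tag and component names.

<information_request>
  <internal>
    <dependency name="TokenBucket.refill" reason="called in the hot path; behavior affects the Returns section"/>
    <reference name="RateLimiter.acquire" reason="call site showing intended public usage"/>
  </internal>
  <external>
    <query topic="token bucket algorithm" reason="summarize the underlying rate-limiting concept"/>
  </external>
</information_request>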
Searcher. The Searcher agent is responsible for fulfilling the Reader's information requests using specialized tools. Internal Code Analysis Tool: this tool leverages static analysis to navigate the codebase. It can retrieve the source code and existing documentation of specified internal components, identify call sites for the focal component, trace dependencies using the pre-computed graph or on-the-fly analysis, and extract relevant structural information (e.g., class hierarchies, method signatures). External Knowledge Retrieval Tool: this tool interfaces with external knowledge sources via a generic retrieval API. It formulates queries based on the Reader's requests for external concepts and processes the results to extract pertinent explanations, definitions, or descriptions.
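A minimal stand-in for the internal code analysis tool is sketched below, assuming static analysis over a single module; the real tool additionally resolves cross-file dependencies and call sites.

import ast

def lookup_component(source: str, name: str) -> dict:
    """Return the source segment and existing docstring of a named function or class."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) and node.name == name:
            return {
                "name": name,
                "docstring": ast.get_docstring(node),            # None if undocumented
                "source": ast.get_source_segment(source, node),  # exact code of the node
            }
    return {}

module = '''
def helper(x):
    """Add one."""
    return x + 1
'''
print(lookup_component(module, "helper")["docstring"])  # Add one.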
The Searcher consolidates the retrieved internal code information and external knowledge into a structured format, which serves as the context for the subsequent agents.

Much like two human collaborators on a project, the Reader and the Searcher talk to each other: after the Searcher sends the retrieved information back, the Reader re-reads the updated context together with the focal code component and judges whether the context is adequate for generating the documentation. If it is not, the Reader sends a further information request to the Searcher. Information requests and new information can thus be exchanged back and forth between the Reader and the Searcher until adequate information has been retrieved.

Writer. The Writer agent receives the focal component's code and the structured context compiled by the Searcher. Its task is to generate the code documentation. The generation process is guided by prompts that specify the desired structure and content based on the component type. Functions/Methods: typically require a summary, extended description, parameter descriptions (Args), return value description (Returns), raised exceptions (Raises), and potentially usage examples (especially for public-facing components). Classes: typically require a summary, extended description, initialization examples, constructor parameter descriptions (Args), and public attribute descriptions (Attributes).
The Writer synthesizes information from both the code and the provided context to produce a draft of the code documentation adhering to these requirements.
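As a concrete illustration of the target format (a hypothetical function, not taken from the evaluated repositories), a Google-style docstring covering the fields listed above might look like this:

def acquire_tokens(bucket: dict, tokens: int = 1) -> bool:
    """Acquire tokens from a rate-limiting token bucket.

    Checks whether `bucket` currently holds at least `tokens` tokens
    and, if so, consumes them.

    Args:
        bucket: Mutable mapping with a numeric "available" entry.
        tokens: Number of tokens to consume. Must be positive.

    Returns:
        True if the tokens were consumed, False otherwise.

    Raises:
        ValueError: If `tokens` is not positive.

    Example:
        >>> acquire_tokens({"available": 5}, 2)
        True
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    if bucket["available"] >= tokens:
        bucket["available"] -= tokens
        return True
    return False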
Verifier. The Verifier takes the context, the code component, and the generated code documentation from the Writer as inputs and evaluates the quality of the documentation against predefined criteria: information value, detail level, and completeness. Upon evaluation, the Verifier either approves the documentation or provides specific improvement suggestions through structured feedback.

The Verifier can talk directly to the Writer if the issue can be addressed without additional context information, for example a formatting issue, which is easily fixed by asking the Writer to rewrite.

If the issue stems from a lack of information and additional context is needed, the Verifier can instead pass a suggestion to the Reader, and the additional information is gathered through another Reader-Searcher cycle.
Orchestrator. An Orchestrator manages the agent workflow through an iterative process. The cycle begins with the Reader analyzing the focal component and requesting necessary context. The Searcher gathers this information, after which the Writer generates a docstring. The Verifier then evaluates the docstring quality, either approving it or returning it for revision. This process continues until satisfactory code documentation is generated or a maximum iteration limit is reached.
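This control flow can be summarized in a few lines; the agent callables and signatures below are illustrative stand-ins, not DocAgent's actual interfaces.

def orchestrate(component, reader, searcher, writer, verifier, max_iters=3):
    """Iterate Reader -> Searcher -> Writer -> Verifier until approval or the limit."""
    context = ""
    draft = ""
    for _ in range(max_iters):
        request = reader(component, context)          # what extra information is needed?
        if request:                                   # an empty request means the context is adequate
            context = searcher(request, context)
        draft = writer(component, context)
        approved, feedback = verifier(component, context, draft)
        if approved:
            return draft
        context += f"\n[verifier feedback] {feedback}"
    return draft                                      # best effort after max_iters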
Adaptive Context Management: To handle potentially large contexts retrieved by the Searcher, especially for complex components, the Orchestrator implements an adaptive context truncation mechanism. It monitors the total token count of the context provided to the Writer. If the context exceeds a configurable threshold (based on the underlying LLM's limits), the Orchestrator applies a targeted truncation strategy: it identifies the largest sections within the structured context (e.g., external knowledge snippets, specific dependency details) and selectively removes content from the end of these sections to reduce the token count while preserving the overall structure. This ensures that the context remains within operational limits, balancing contextual richness with model constraints.
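A minimal version of the idea is sketched below, assuming a whitespace token count and a dict-of-sections context; both are simplifications of whatever DocAgent actually uses.

def truncate_context(sections: dict[str, str], max_tokens: int) -> dict[str, str]:
    """Trim the largest sections from the end until the total fits the budget."""
    count = lambda text: len(text.split())               # crude stand-in for a tokenizer
    sections = dict(sections)
    while sum(count(v) for v in sections.values()) > max_tokens:
        largest = max(sections, key=lambda k: count(sections[k]))
        words = sections[largest].split()
        if len(words) <= 1:
            break                                        # nothing left to trim safely
        sections[largest] = " ".join(words[: int(len(words) * 0.8)])  # drop the tail 20%
    return sections

ctx = {"dependency code": "word " * 500, "external knowledge": "word " * 1500}
trimmed = truncate_context(ctx, max_tokens=1000)
print(sum(len(v.split()) for v in trimmed.values()))  # at most 1000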
Figure 3: Multi-facet Evaluation Framework of code documentation, assessing quality along three dimensions: (1) Completeness measures structural adherence to documentation conventions; (2) Helpfulness evaluates practical utility; and (3) Truthfulness verifies factual accuracy.

3 Evaluation Framework

Evaluating the quality of automatically generated code documentation is challenging. Traditional metrics commonly used in natural language generation, such as BLEU or ROUGE, cannot be used because of the lack of gold references (Roy et al., 2021; Guelman et al., 2024). Simple heuristics like documentation length are insufficient indicators of actual utility. While human evaluation provides the most accurate assessment (Luo et al., 2024), it is inherently subjective, expensive, and difficult to scale, rendering it impractical for large-scale experiments or continuous integration scenarios.

To overcome these limitations, we propose a comprehensive and scalable evaluation framework designed to systematically assess documentation quality along three crucial dimensions: Completeness, Helpfulness, and Truthfulness. This multi-faceted approach combines deterministic structural checks, LLM-based qualitative assessments, and fact-checking against the codebase itself, providing a holistic view of the generated documentation's value. Our methodology is informed by established software engineering best practices for documentation and addresses the specific shortcomings observed in existing LLM-based generation systems.
This deterministic approach provides an objective measure of structural adherence, indicating whether the documentation meets basic formal requirements.
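The excerpt does not list the exact rules, but a deterministic check of this kind can be as simple as testing for the expected docstring sections; the required section set below is an assumption used only for illustration.

import re

REQUIRED_FUNCTION_SECTIONS = ("Args", "Returns", "Raises", "Example")  # assumed section set

def completeness_score(docstring: str) -> float:
    """Fraction of expected structural elements present in a Google-style docstring."""
    has_summary = bool(docstring.strip().splitlines()[0].strip()) if docstring.strip() else False
    present = [has_summary] + [
        bool(re.search(rf"^\s*{section}:", docstring, flags=re.MULTILINE))
        for section in REQUIRED_FUNCTION_SECTIONS
    ]
    return sum(present) / len(present)

doc = """Add one to x.

Args:
    x: Input value.

Returns:
    The incremented value.
"""
print(round(completeness_score(doc), 2))  # 0.6: summary, Args, Returns present; Raises, Example missing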
3.2 Helpfulness

This structured approach allows for scalable assessment of semantic quality, moving beyond surface-level checks to gauge the documentation's actual value to a developer.
3.3 Truthfulness

A critical dimension of documentation quality is its factual accuracy, or Truthfulness. Documentation, especially when generated by LLMs unfamiliar with a specific private codebase, can suffer from "hallucinations": confidently referencing non-existent methods, parameters, or classes, or misrepresenting relationships between components. Such inaccuracies severely undermine trust and can mislead developers.

We evaluate Truthfulness by verifying whether entities mentioned in the generated documentation actually exist within the target repository and are referenced correctly. Our pipeline comprises three stages. Code Entity Extraction: an LLM is prompted to identify mentions of repository-specific code components (classes, functions, methods, attributes) within the generated docstring. The prompt specifically instructs the model to distinguish these from standard language keywords, built-in types (e.g., list, dict), and common external library components, focusing on internal references. Ground Truth Construction: we leverage the dependency graph constructed by the Navigator module (§2.1). This graph serves as the ground truth, containing a canonical representation of all code components and their locations within the repository. Verification: each extracted entity mention is cross-referenced against the dependency graph.

We quantify Truthfulness using the Existence Ratio: the proportion of unique repository-specific entities mentioned in the documentation that correspond to actual entities in the codebase:

Existence Ratio = |Verified Entities| / |Extracted Entities|

A high ratio indicates that the documentation is well-grounded in the actual code structure, minimizing the risk of hallucinated references.
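The verification step reduces to a set intersection once the entities have been extracted. In the sketch below, the repository's dependency graph is stood in for by a plain set of qualified names, and the LLM-based extraction is replaced by a naive dotted-identifier regex; both are simplifications.

import re

def existence_ratio(docstring: str, repo_entities: set[str]) -> float:
    """Share of repository-specific identifiers in a docstring that actually exist."""
    mentioned = set(re.findall(r"\b[A-Za-z_][\w.]*\.[A-Za-z_]\w*\b", docstring))
    if not mentioned:
        return 1.0
    verified = mentioned & repo_entities
    return len(verified) / len(mentioned)

graph = {"RateLimiter.acquire", "TokenBucket.refill"}                 # hypothetical entities
doc = "Calls TokenBucket.refill before delegating to RateLimiter.acqure."  # deliberate typo: unverifiable
print(existence_ratio(doc, graph))  # 0.5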
Together, these three dimensions (Completeness, Helpfulness, and Truthfulness) provide a robust and nuanced framework for evaluating automatic code documentation systems, enabling quantitative comparisons and deeper insights into their strengths and weaknesses.

4 Experiment

4.1 Baselines

We compare DocAgent against two representative baseline systems commonly used for code documentation generation. FIM (Fill-in-the-middle) simulates inline code completion tools that predict documentation based on surrounding code; we use CodeLlama-13B (Roziere et al., 2023), an open model trained with FIM tasks (Bavarian et al., 2022), abbreviated as FIM-CL. Chat represents generating documentation by providing the code snippet directly to a chat-based LLM; we test two leading models, GPT-4o mini (2024-07-18 version) (OpenAI, 2022) and CodeLlama-34B-instruct (Roziere et al., 2023), abbreviated as Chat-GPT and Chat-CL, respectively.

4.2 Experiment Setup

Data. We select a representative subset of Python repositories to ensure diversity in size, complexity, and domain. The dataset comprises modules, functions, methods, and classes with varying degrees of dependency density (details in Appendix D).

Systems. We evaluate two variants of our proposed system, differing only in the backbone LLM used by the agents: DA-GPT, DocAgent utilizing GPT-4o mini, and DA-CL, DocAgent utilizing CodeLlama-34B-instruct. The choice of backbone LLM is orthogonal to the DocAgent framework itself; we use GPT-4o-2024-08-06 universally for running the evaluation, for more robust results.

Statistical Significance. All claims of statistical significance are based on paired t-tests with a significance threshold of p < 0.05. Due to space limitations, we are unable to include the full prompts and detailed experimental setup in the paper; however, all configurations are available in our project's public release repository.
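For reference, a paired comparison of per-component scores of this kind can be run with scipy; the data below is illustrative, not the paper's actual scores.

from scipy.stats import ttest_rel

# Hypothetical per-component helpfulness scores for the same eight components.
docagent = [4.0, 3.5, 4.5, 3.0, 4.0, 4.5, 3.5, 4.0]
baseline = [3.0, 3.0, 3.5, 2.5, 3.0, 3.5, 3.0, 3.0]

stat, p_value = ttest_rel(docagent, baseline)
print(p_value < 0.05)  # significant at the paper's threshold if True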
4.3 Experiment Results

We evaluate the systems using the framework proposed in Section 3, focusing on Completeness, Helpfulness, and Truthfulness.

4.3.1 Completeness

System     Overall   Function   Method    Class
DA-GPT     0.934†    0.945†     0.935†    0.914†
DA-CL      0.953†‡   0.985†‡    0.982†‡   0.816†‡
Chat-GPT   0.815     0.828      0.823     0.773
Chat-CL    0.724     0.726      0.744     0.667
FIM-CL     0.314     0.291      0.345     0.277

Table 1: Average Completeness Scores. †: significantly better than the corresponding Chat baseline. ‡: significantly better than the FIM baseline.

As shown in Table 1, both DocAgent variants significantly outperform their respective Chat counterparts. DocAgent (CodeLlama-34B) achieves an
overall score of 0.953, representing a substantial improvement of 0.229 points over Chat. Similarly, DocAgent (GPT-4o mini) scores 0.934 overall, significantly higher than Chat at 0.815. These improvements are statistically significant across all component types. FIM performs poorly, achieving an overall completeness score of only 0.314. This highlights the effectiveness of DocAgent's structured, context-aware generation process compared to simply prompting an LLM with the code in isolation.

4.3.2 Helpfulness

As shown in Table 2, DocAgent (GPT-4o mini) achieves the highest overall helpfulness score, significantly outperforming the corresponding Chat baseline and demonstrating its ability to generate clearer and more informative content by leveraging retrieved context.

System     Overall   Summary   Description   Parameters
DA-GPT     3.88†     4.32†     3.60†         2.71
DA-CL      2.35‡     2.36†‡    2.43‡         2.00
Chat-GPT   2.95      3.56      2.42          2.20
Chat-CL    2.16      2.04      2.37          1.80
FIM-CL     1.51      1.30      2.45          1.50

Table 2: Average Helpfulness Scores. †: significantly better than the corresponding Chat baseline. ‡: significantly better than the FIM baseline.

DocAgent (CodeLlama-34B) also shows an improvement over its Chat counterpart, producing significantly more helpful summaries. Furthermore, DocAgent (CodeLlama-34B) also significantly outperforms FIM. Across aspects, generating helpful parameter descriptions appears most challenging. DocAgent (GPT-4o mini) achieves the highest score even here, suggesting its structured approach aids in this difficult task, although room for improvement remains.

4.3.3 Truthfulness

System     Verified   Extracted   Existence Ratio (%)
DA-GPT     265        305         95.74%
DA-CL      354        600         88.17%
Chat-GPT   366        347         61.10%
Chat-CL    366        488         68.03%
FIM-CL     338        131         45.04%

Table 3: Truthfulness Analysis: Existence Ratio (%). Higher is better. Extracted = extracted entities; Verified = verified entities (§3.3).

The results in Table 3 demonstrate the superior factual accuracy of documentation generated by DocAgent. DocAgent (GPT-4o mini) achieves the highest Existence Ratio at 95.74%, indicating that the vast majority of its references to internal code components are correct. DocAgent (CodeLlama-34B) also performs strongly with a ratio of 88.17%.

This contrasts sharply with the baselines. The Chat approaches exhibit significantly lower truthfulness, with Chat (GPT-4o mini) at 61.10% and Chat (CodeLlama-34B) at 68.03%. This suggests that simply providing the code snippet to a chat model often leads to inaccurate assumptions or hallucinations about the surrounding codebase context. FIM performs worst, with an Existence Ratio of only 45.04%, implying that nearly half of its references to repository entities might be incorrect. This low score highlights a significant risk of misleading developers when using FIM for documentation.

4.4 Ablation Study

To isolate the contribution of the dependency-aware processing order determined by the Navigator module (§2.1), we conducted an ablation study. We created variants of DocAgent (DA-Rand-GPT, DA-Rand-CL) that process components in a random order. (Completeness was omitted from the ablation study because it depends on the code's structure, not on the Navigator's processing order.)

4.4.1 Impact on Helpfulness

System        Overall        Summary        Description    Parameters
DA-GPT        3.88†          4.32†          3.60           2.71
DA-Rand-GPT   3.44 (-0.44)   3.62 (-0.70)   3.30 (-0.30)   2.20 (-0.51)
DA-CL         2.35†          2.36†          2.43           2.00
DA-Rand-CL    2.18 (-0.17)   1.88 (-0.48)   2.42 (-0.10)   2.00 (0.00)

Table 4: Ablation: Average Helpfulness Scores. †: DocAgent significantly better than its Random variant.

The results in Table 4 demonstrate the benefit of the Navigator's topological sorting in improving Helpfulness. For both underlying LLMs, the full DocAgent achieved significantly higher overall helpfulness scores compared to its random-order counterpart. With GPT-4o mini, the full DocAgent scored 3.69 overall, significantly higher than DocAgent-Random's 3.44. The improvement was particularly pronounced in summary generation. Similarly, with CodeLlama-34B, the full DocAgent scored 2.39 overall, significantly outperforming DocAgent-Random's 2.18. Again, the summary scores showed a significant difference.
4.4.2 Impact on Truthfulness

We also evaluated the impact of removing the hierarchical generation order on factual accuracy (Truthfulness). Without the Navigator, the Searcher can still retrieve dependent code components. However, since the 'Dependencies First' principle is not followed, these components are less likely to have already generated documentation available for context.

System        Verified    Extracted   Existence Ratio (%)
DA-GPT        187         224         94.64%
DA-Rand-GPT   164 (-23)   166 (-58)   86.75% (-7.89)
DA-CL         190         343         87.76%
DA-Rand-CL    188 (-2)    360 (+17)   83.06% (-4.70)

Table 5: Ablation: Truthfulness Analysis (Existence Ratio %), evaluated on 50 code components randomly sampled from the full dataset.

Table 5 demonstrates that the topological sort also improves truthfulness. Both full DocAgent variants achieve higher Existence Ratios than their random-order counterparts. The Existence Ratio of DocAgent (GPT-4o mini) drops from 94.64% to 86.75% without the sort, and the ratio of DocAgent (CodeLlama-34B) drops from 87.76% to 83.06%.

Collectively, the ablation results confirm that the Navigator's dependency-aware topological ordering is a crucial component of DocAgent, significantly contributing to both the helpfulness and factual accuracy of the generated documentation by enabling effective incremental context management.

5 Conclusion

We addressed the challenge of automatically generating high-quality code documentation, a task where existing LLM-based methods often struggle with incompleteness, lack of helpfulness, and factual inaccuracies. We introduced DocAgent, a novel tool-integrated, multi-agent system that leverages a dependency-aware topological processing order determined by a Navigator module. This allows specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) to collaboratively generate documentation by incrementally building context from dependencies. We also proposed a robust and scalable evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Our experiments on diverse Python repositories demonstrate that DocAgent consistently and significantly outperforms the FIM and Chat baselines, producing more complete, helpful, and factually accurate documentation. An ablation study confirmed the critical contribution of the topological processing order to both helpfulness and truthfulness. DocAgent represents a promising step towards reliable and useful automated code documentation generation for complex and proprietary software.

6 Ethics and Limitations

DocAgent, while advancing automated code documentation, has inherent limitations and ethical considerations. Technically, processing extremely large codebases may still challenge LLM context limits despite topological sorting and context management. Relying solely on static analysis restricts understanding of dynamic behavior, and the current Python focus requires effort for adaptation to other languages.

Ethically, the primary concern is factual accuracy; generated documentation, though improved, may still contain hallucinations or inaccuracies, potentially misleading developers. The underlying LLMs may propagate biases from their training data into the documentation. Over-reliance on such tools could potentially hinder developers' deep code comprehension skills. Applying DocAgent to proprietary code necessitates careful handling, especially regarding external queries, to avoid inadvertently leaking sensitive information. Finally, the computational resources required for LLM-driven multi-agent systems represent a notable cost and environmental consideration. Future work should address these limitations, focusing on robustness, bias mitigation, and deeper evaluation, while emphasizing that human oversight remains crucial in practical deployment.

References

Samuel Abedu, Ahmad Abdellatif, and Emad Shihab. 2024. Llm-based chatbots for mining software repositories: Challenges and opportunities. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 201–210.

Emad Aghajani, Csaba Nagy, Olga Lucero Vega-Márquez, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, and Michele Lanza. 2019. Software documentation issues unveiled. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 1199–1210. IEEE.

Wasi U Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In ACL.
Anthropic. 2025. Model context length increases with the new context protocol. Accessed: 2025-03-27.

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. 2022. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255.

Jie-Cherng Chen and Sun-Jen Huang. 2009. An empirical analysis of the impact of software development problem factors on software maintainability. Journal of Systems and Software, 82(6):981–992.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Qian Chen, Binyuan Tang, Yankai Zhang, Binhua Wang, Zhifang Zhang, and Qun Zhang. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2305.03047.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).

Colin B Clement, Andrew Terrell, Hanlin Mao, Joshua Dillon, Sameer Singh, and Dan Alistarh. 2020. Pymt5: Multi-mode translation of natural language and python code with transformers. In EMNLP.

Sergio Cozzetti B De Souza, Nicolas Anquetil, and Káthia M de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, pages 68–75.

Shubhang Shekhar Dvivedi, Vyshnav Vijay, Sai Leela Rahul Pujari, Shoumik Lodh, and Dhruv Kumar. 2024. A comparative analysis of large language models for code documentation generation. In Proceedings of the 1st ACM International Conference on AI-Powered Software, pages 65–73.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, and Daxin Jiang. 2020. Codebert: A pre-trained model for programming and natural languages. In EMNLP.

Golara Garousi, Vahid Garousi-Yusifoğlu, Guenther Ruhe, Junji Zhi, Mahmoud Moussavi, and Brian Smith. 2015. Usage and usefulness of technical software documentation: An industrial case study. Information and Software Technology, 57:664–682.

GitHub. 2024. How github copilot is getting better at understanding your code. Accessed: 2025-03-27.

Liron Guelman, Alon Lavie, and Eran Yahav. 2024. Using large language models to document code: A first quantitative and qualitative assessment. arXiv preprint arXiv:2403.04264.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. 2021. Graphcodebert: Pre-training code representations with data flow. In ICLR.

Seungone Kim, Soobin Kim, Alice Oh, and Gunhee Han. 2023. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491.

Michael Krumdick, Jason Wei, Xinyang Chen, Shangbin Du, Shu Xu, Dale Schuurmans, and Ed H Chi. 2025. No free labels: Limitations of llm-as-a-judge without human grounding. arXiv preprint arXiv:2503.05061.

Yukyung Lee, Wonjoon Cho, and Jinhyuk Kim. 2024. Checkeval: A reliable llm-as-a-judge framework for evaluating text generation using checklists. arXiv preprint arXiv:2403.18771.

Raymond Li, Lewis Tunstall, Patrick von Platen, Jungtaek Kim, Teven Le Scao, Thomas Wolf, and Alexander M. Rush. 2023a. Starcoder: May the source be with you! Preprint, arXiv:2305.06161.

Xiang Li, Qinyuan Zhu, Yelong Cheng, Weizhu Xu, and Xi Liu. 2023b. Camel: Communicative agents for "mind" exploration. arXiv preprint arXiv:2303.17760.

Minqian Liu, Cheng Feng, Qing Lyu, Wenhao Zeng, Chao Zheng, Ruidan Zhang, and Steven C H Lin. 2023a. X-eval: Generalizable multi-aspect text evaluation via augmented instruction tuning. arXiv preprint arXiv:2311.08788.

Yang Liu, Yao Fu, Yujie Xie, Xinyi Chen, Bo Pang, Chenyan Qian, Teng Ma, and Dragomir Radev. 2023b. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, and 1 others. 2024. Repoagent: An llm-powered open-source framework for repository-level code documentation generation. arXiv preprint arXiv:2402.16667.

Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2024. How to understand whole software repository? arXiv preprint arXiv:2406.01422.

Paul W McBurney, Siyuan Jiang, Marouane Kessentini, Nicholas A Kraft, Ameer Armaly, Mohamed Wiem Mkaouer, and Collin McMillan. 2017. Towards prioritizing documentation effort. IEEE Transactions on Software Engineering, 44(9):897–913.

Meta. 2025. Meta ai. https://ai.meta.com/meta-ai/. Accessed: 2025-03-27.

OpenAI. 2022. Introducing chatgpt. Accessed: 2025-03-27.

David Lorge Parnas. 2010. Precise documentation: The key to better software. In The Future of Software Engineering, pages 125–148. Springer.

Yuzhang Qian, Zian Zhang, Liang Pan, Peng Wang, Shouyi Liu, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Chatdev: Revolutionizing software development with ai-collaborative agents. arXiv preprint arXiv:2307.07924.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

Martin P Robillard. 2009. What makes apis hard to learn? answers from developers. IEEE Software, 26(6):27–34.

Rahul Roy, Saikat Chakraborty, Baishakhi Ray, and Miryung Kim. 2021. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 1344–1356.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, and 1 others. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Noah Shinn, Margaret Labash, and Stefano Ermon. 2023. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.

Robert Tarjan. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160.

Gias Uddin, Foutse Khomh, and Chanchal K Roy. 2021. Automatic api usage scenario documentation from technical q&a sites. ACM Transactions on Software Engineering and Methodology (TOSEM), 30(3):1–45.

Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Qingsong Wen, Wei Ye, and 1 others. 2024. Autosurvey: Large language models can automatically write surveys. Advances in Neural Information Processing Systems, 37:115119–115145.

Yue Wang, Shuo Ren, Daya Lu, Duyu Tang, Nan Duan, Ming Zhou, and Daxin Jiang. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP.

Ziniu Wu, Cheng Liu, Jindong Zhang, Xinyun Li, Yewen Wang, Jimmy Xin, Lianmin Zhang, Eric Xing, Yuxin Lu, and Percy Liang. 2023. Autogen: Enabling next-generation multi-agent communication with language models. arXiv preprint arXiv:2309.07864.

Dayu Yang, Tianyang Liu, Daoan Zhang, and 1 others. 2025. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms. arXiv preprint arXiv:2502.19411.

Guang Yang, Yu Zhou, Wei Cheng, Xiangyu Zhang, Xiang Chen, Terry Yue Zhuo, Ke Liu, Xin Zhou, David Lo, and Taolue Chen. 2024. Less is more: Docstring compression in code generation. arXiv preprint arXiv:2410.22793.

Shinn Yao, Jeffrey Zhao, Dian Yu, Kang Chen, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Yaqing Zan, Mingyu Ding, Bill Yuchen Lin, and Xiang Ren. 2022. When language model meets private library. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Kaiyu Zhang, Yifei Wang, Yue Yu, Yujie Li, Zihan Lin, Dongxu Zhang, Yichi Zhou, Yifei Xu, Ang Chen, Weiyi Zhang, and 1 others. 2024. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. arXiv preprint arXiv:2401.10650.

Shiyue Zhang, Binyi Li, Jason Wei, Aditi Raghunathan Vyas, and Percy Liang. 2023a. Themis: A flexible and interpretable nlg evaluation model. arXiv preprint arXiv:2309.12082.

Xiaoqing Zhang, Zhirui Wang, Lichao Yang, Wei Zhang, and Yong Zhang. 2023b. Mapcoder: Map-reduce-style code generation with multi-agent collaboration. arXiv preprint arXiv:2307.15808.

Lianmin Zheng, Shangbin Du, Yuhui Lin, Yukuo Shao, Zi Lin, Zhen Liu, and 1 others. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.

Zihan Zheng, Jiayi Zheng, Weiyan Liu, Yizhong Wang, Chen Liu, Xiang Lorraine Li, Mu Li, Wenhao Zhang, Diyi Huang, and Xiang Ren. 2024. How well do llms generate code for different application domains? arXiv preprint arXiv:2401.13727.

Shuyan Zhou, Uri Alon, Frank F Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. 2022. Docprompting: Generating code by retrieving the docs. arXiv preprint arXiv:2207.05987.

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, and 1 others. 2024. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934.
A Related Work

LLM Agent. Recent advancements in LLM agents have enabled automating complex code-related tasks (Yang et al., 2025). Single-agent frameworks like ReAct (Yao et al., 2022) and Reflexion (Shinn et al., 2023) integrate action-reasoning and self-reflection. Multi-agent systems (CAMEL (Li et al., 2023b), AutoGen (Wu et al., 2023)) extend these ideas with role-specialized LLMs and structured communication to handle more complex problems. In software development, MapCoder (Zhang et al., 2023b), RGD (Chen et al., 2023), and ChatDev (Qian et al., 2023) leverage specialized agents for many downstream tasks, achieving state-of-the-art code generation. These insights on multi-agent coordination and workflow structuring underpin our DocAgent framework, which adopts a topologically-aware, tool-integrated multi-agent design.

Code Summarization. Pre-trained encoders such as CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021) introduced bi-modal and structure-aware learning, while encoder-decoder models PLBART (Ahmad et al., 2021) and CodeT5 (Wang et al., 2021) unified code generation and summarization. PyMT5 (Clement et al., 2020) extended T5 for Python docstring translation with multi-mode support. Recently, LLMs (OpenAI Codex (Chen et al., 2021), StarCoder (Li et al., 2023a), CodeLlama (Roziere et al., 2023)) have demonstrated strong zero-shot summarization. However, they often lack repository-level context, dependency awareness, and collaboration: limitations that our multi-agent, context-aware DocAgent aims to overcome.

B Why External Information is Needed

The external, open-internet information request is necessary for writing documentation for novel, newly proposed ideas, such as new evaluation methods, algorithms, model structures, or loss functions. For example, DPO (Rafailov et al., 2023) is a reinforcement learning algorithm proposed in 2023, while CodeLlama has a knowledge cutoff of September 2022. When using CodeLlama for documentation generation, without access to the mathematical intuition and description of DPO from the open internet, the model cannot write helpful documentation that describes the intuition behind a DPO implementation.

C Scarcity of Code Documentation

We analyzed 164 top-starred Python repositories (created after January 1, 2025), encompassing 13,314 files and 115,943 documentable nodes (functions, classes, and methods). Of these nodes, only 27.28% contained any documentation, with 66.46% of repositories exhibiting less than 30% coverage (Figure 5). Furthermore, 62.25% of repositories averaged 30 words or fewer per documentation block (Figure 6), while only 3.98% exceeded an average of 100 words, illustrating the widespread brevity and overall scarcity of code documentation.

Figure 5: Distribution of repositories by code documentation coverage.

Figure 6: Distribution of repositories by average words per documentation block.
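A coverage count of this kind can be computed per module with the standard ast library; the sketch below is a simplified illustration of the counting, not the exact procedure used for the statistics above.

import ast

def doc_coverage(source: str) -> float:
    """Fraction of documentable nodes (functions, classes, methods) that have a docstring."""
    tree = ast.parse(source)
    nodes = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 0.0
    documented = sum(ast.get_docstring(n) is not None for n in nodes)
    return documented / len(nodes)

module = '''
class Greeter:
    """Say hello."""
    def greet(self, name):
        return f"hi {name}"
'''
print(doc_coverage(module))  # 0.5: the class is documented, the method is not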
D Data

We gathered 164 top-starred Python repositories from GitHub, each created after January 1, 2025, having more than 50 stars, and exceeding 50 KB in size. From this corpus, we selected 9 repositories reflecting the overall distribution in terms of lines of code and topological complexity. Figure 7 shows the selected repositories (red points) overlaid on the broader distribution. Eventually, we collected 366 code components (120 functions, 178 methods, and 68 classes) for evaluation, with a separate subset of 50 distinct code components (randomly sampled from the full set) used specifically for our truthfulness ablation study.

E Robust LLM-as-judge

Assessing the qualitative aspects of Helpfulness automatically is inherently challenging due to subjectivity.
This structured LLM-as-judge approach aims to
provide a scalable yet nuanced assessment of the
documentation’s practical value to developers.