Less is More: DocString Compression in Code Generation

Yang, Guang; Zhou, Yu; Cheng, Wei; Zhang, Xiangyu; Chen, Xiang; Zhuo, Terry Yue; Liu, Ke; Zhou, Xin; Lo, David; Chen, Taolue

arXiv:2410.22793 (cs)

[Submitted on 30 Oct 2024 (v1), last revised 31 Oct 2024 (this version, v2)]

Abstract: The widespread use of Large Language Models (LLMs) in software engineering has intensified the need for improved model and resource efficiency. In particular, for neural code generation, LLMs are used to translate function/method signatures and DocStrings into executable code. DocStrings, which capture user requirements for the code and are used as the prompt for LLMs, often contain redundant information. Recent advancements in prompt compression have shown promising results in Natural Language Processing (NLP), but their applicability to code generation remains uncertain. Our empirical study shows that state-of-the-art prompt compression methods achieve only about 10% reduction, as further reductions would cause significant performance degradation. We propose a novel compression method, ShortenDoc, dedicated to DocString compression for code generation. Our extensive experiments on six code generation datasets, five open-source LLMs (1B to 10B parameters), and one closed-source LLM (GPT-4o) confirm that ShortenDoc achieves 25-40% compression while preserving the quality of generated code, outperforming other baseline methods at similar compression levels. This work improves efficiency and reduces cost while maintaining the quality of the generated code, especially when calling third-party APIs, and can reduce token processing cost by 25-40%.
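
To make the reported compression figures concrete, the following is a minimal sketch of how a compression ratio could be measured for a single DocString. It is not the ShortenDoc method. It assumes the ratio is the fraction of tokens removed, uses whitespace splitting as a stand-in for a real LLM tokenizer, and the function names and example DocStrings are hypothetical.

```python
def token_count(text: str) -> int:
    """Count tokens by whitespace splitting (an assumed stand-in for an LLM tokenizer)."""
    return len(text.split())


def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of tokens removed (assumed definition): 1 - compressed / original."""
    original_tokens = token_count(original)
    if original_tokens == 0:
        raise ValueError("original DocString has no tokens")
    return 1.0 - token_count(compressed) / original_tokens


# Hypothetical DocString and a hand-shortened version (not produced by ShortenDoc).
original_doc = "Return the sum of all the even numbers that are in the given list of integers."
compressed_doc = "Return sum of even numbers in given list of integers."

ratio = compression_ratio(original_doc, compressed_doc)
print(f"tokens: {token_count(original_doc)} -> {token_count(compressed_doc)}")
print(f"compression ratio: {ratio:.1%}")
```

Under these assumptions, if an API is billed per prompt token, the DocString part of the prompt cost would fall by the same fraction. The function signature and any other prompt text are not compressed in this sketch.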

Comments:	UNDER REVIEW
Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2410.22793 [cs.SE]
	(or arXiv:2410.22793v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2410.22793

Submission history

From: Guang Yang
[v1] Wed, 30 Oct 2024 08:17:10 UTC (5,612 KB)
[v2] Thu, 31 Oct 2024 07:20:35 UTC (5,612 KB)
