PyBench: Evaluating LLM Agent on various real-world coding tasks

Zhang, Yaolun; Pan, Yinxu; Wang, Yudong; Cai, Jie

Computer Science > Software Engineering

arXiv:2407.16732 (cs)

[Submitted on 23 Jul 2024 (v1), last revised 3 Aug 2024 (this version, v2)]

Title:PyBench: Evaluating LLM Agent on various real-world coding tasks

Authors:Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai

View PDF HTML (experimental)

Abstract:The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing.
However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks.
To address this gap, we introduce \textbf{PyBench}, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations indicate that current open-source LLMs are struggling with these tasks. Hence, we conduct analysis and experiments on four kinds of datasets proving that comprehensive abilities are needed for PyBench. Our fine-tuned 8B size model: \textbf{PyLlama3} achieves an exciting performance on PyBench which surpasses many 33B and 70B size models. Our Benchmark, Training Dataset, and Model are available at: {this https URL}

Comments:	16 pages
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.16732 [cs.SE]
	(or arXiv:2407.16732v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2407.16732

Submission history

From: Yaolun Zhang [view email]
[v1] Tue, 23 Jul 2024 15:23:14 UTC (2,559 KB)
[v2] Sat, 3 Aug 2024 03:00:43 UTC (1,575 KB)

Computer Science > Software Engineering

Title:PyBench: Evaluating LLM Agent on various real-world coding tasks

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Software Engineering

Title:PyBench: Evaluating LLM Agent on various real-world coding tasks

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators