
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Zhaojian Yu1 Yilun Zhao2 Arman Cohan2 Xiao-Ping Zhang1


1 Tsinghua University    2 Yale University

arXiv:2412.21199v1 [cs.SE] 30 Dec 2024

github.com/CodeEval-Pro/CodeEval-Pro

Abstract

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.

Figure 1: The overview of self-invoking code generation in HumanEval Pro and MBPP Pro. Given a base problem and a related, more complex problem, models are required to solve the base problem and use its solution to address the complex problem.

Model Input (Prompt): You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions. Write a solution of python file to the following problems, the solution of the second problem requires single or multiple calls to the first solution.

Base Problem: Write a function to replace characters in a string.

Self-invoking Problem: Write a function to replace multiple characters in a string with their corresponding new characters. The function should take a string and a dictionary where keys are characters to be replaced and values are the new characters.

Model Output (solution.py):

def replace_char(str1, ch, newch):
    return str1.replace(ch, newch)

def replace_multiple_chars(str1, char_map):
    for ch, newch in char_map.items():
        str1 = replace_char(str1, ch, newch)
    return str1

Test:

assert replace_multiple_chars('python', {'p': 'b', 'y': 'i'}) == 'bithon'
1 Introduction

Large Language Models (LLMs) have demonstrated significant progress in various code-related tasks including code generation (Roziere et al., 2023; Zhang et al., 2023; Ni et al., 2024), program repair (Xia et al., 2022; Jin et al., 2023), and code translation (Zhu et al., 2022). Traditional human-annotated benchmarks such as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) have been widely adopted to evaluate the code generation abilities of LLMs, providing standardized evaluation protocols for assessing their performance on code-related tasks. However, these existing benchmarks primarily focus on isolated, single-function code generation, which represents only a subset of the challenges encountered in real-world software development scenarios.

To evaluate LLMs under more realistic problem-solving scenarios, BigCodeBench (Zhuo et al., 2024) presents a benchmark that comprises complex and practical problems requiring LLMs to use multiple function calls from diverse libraries. While BigCodeBench highlights the use of external function calls, it falls short in assessing LLMs' reasoning ability to generate and invoke their own generated functions in problem-solving. CRUXEval (Gu et al., 2024) assesses LLMs' code reasoning by predicting function inputs and outputs. However, direct input and output prediction does not involve explicit code generation. In practical software engineering contexts, developers must not only write code but also comprehend, modify, and utilize existing code to solve more complex problems. Hence, the ability to understand and subsequently leverage one's own generated code, namely self-invoking code generation (Figure 1), plays an important role in applying LLMs' reasoning capabilities to code generation, a role that current benchmarks fail to capture.

Therefore, we present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks, to evaluate LLMs on the self-invoking code generation task. As illustrated in Figure 1, HumanEval Pro and MBPP Pro extend beyond simple code generation by introducing self-invoking problems which require LLMs to solve the base problem and invoke their self-generated code to solve a more complex problem. By evaluating LLMs on the self-invoking code generation task, HumanEval Pro and MBPP Pro provide a useful and important probe to better understand the programming capabilities of LLMs. The capability of self-invoking code generation also enables LLMs to tackle difficult tasks with greater autonomy and effectiveness.

To obtain HumanEval Pro and MBPP Pro, we propose a general recipe for constructing self-invoking code generation benchmarks by building upon existing datasets. First, we use Deepseek-V2.5 (DeepSeek-AI, 2024) to generate self-invoking problems based on the original problems in HumanEval and MBPP. These problems are designed to be more complex than the base problems and closely related to them, ensuring progressive reasoning and coherent code invocation. Second, we generate the candidate solution and test inputs for each problem. Third, we execute the code of the candidate solution to generate outputs and use the assert command in Python to build test cases. In the third stage, human experts are assigned to manually review each problem and continuously modify and execute the code of the solutions to ensure that all canonical solutions correctly solve the problem and cover the test cases. To verify the reproducibility of our benchmark construction approach, we further construct BigCodeBench-Lite Pro, a new self-invoking problem set derived from BigCodeBench (Zhuo et al., 2024). On BigCodeBench-Lite Pro, LLMs show a consistent performance trend with HumanEval Pro and MBPP Pro, which emphasizes the generalizability of our construction pipeline. Therefore, our benchmark construction approach can also be extended to adapt other code generation benchmarks, particularly as the capabilities of LLMs advance and older benchmarks become obsolete.

Through extensive evaluation of various LLMs, we uncover a significant disparity between traditional code generation and self-invoking code generation capabilities. Our findings reveal that while frontier LLMs excel at generating individual code snippets, they often struggle to effectively utilize their own generated code for solving more complex problems. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro, demonstrating the challenges inherent in self-invoking code generation. From the comparison between instruction-tuned models and their base models, we found that instruction-tuned models are less efficient on self-invoking code generation than on the traditional code generation task. Furthermore, our detailed statistics of failure cases in HumanEval Pro and MBPP Pro also reflect the shortcomings of LLMs in self-invoking code generation, thereby providing complementary insights into the real-world coding capabilities of LLMs.

2 Related Work

Recent advances in LLMs have demonstrated remarkable capabilities in code generation and understanding. This section reviews the current landscape of code-related benchmarks and LLMs.

Benchmarks for Code Generation  The evaluation landscape for Code LLMs has evolved significantly. HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) serve as fundamental benchmarks, focusing on Python function completion tasks with test-driven evaluation. Several benchmarks have expanded code evaluation to encompass multiple programming languages (Zheng et al., 2023; Athiwaratkun et al., 2022), complex tasks like program repair (Haque et al., 2022; Jiang et al., 2023; Muennighoff et al., 2024; Xia et al., 2024), dynamic problem sets (Jain et al., 2024), and code reasoning through code summarization (Barone and Sennrich, 2017; Hasan et al., 2021) and simulated execution (Gu et al., 2024).
To evaluate LLMs in professional software engineering, benchmarks like SWE-Bench (Jimenez et al., 2023), EvoCodeBench (Li et al., 2024), RepoBench (Liu et al., 2023), and GoogleCodeRepo (Shrivastava et al., 2023) focus on real-world tasks, code evolution, and repository-level challenges. These benchmarks collectively drive the advancement of LLMs, providing valuable insights into their strengths and limitations. Our benchmarks introduce a novel self-invoking code generation task, which addresses gaps left by existing benchmarks. This addition provides a more holistic framework to evaluate LLMs on leveraging their reasoning capabilities in code generation. Moreover, our benchmark construction method could also push existing benchmarks forward to accommodate more complex and challenging code-related tasks.

LLMs for Code Generation  The development of foundation models specifically designed for code generation has seen significant progress. CodeX (Chen et al., 2021) pioneered this direction by fine-tuning GPT models on code-specific data. Subsequent models like CodeGeeX (Zheng et al., 2023) and CodeLLaMA (Roziere et al., 2023) further advanced the field by incorporating multilingual code understanding and generation capabilities. StarCoder (Li et al., 2023), DeepseekCoder (Zhu et al., 2024) and Qwen2.5-Coder (Hui et al., 2024) demonstrated the importance of high-quality code data curation and specialized architecture designs. Building upon these models, researchers have explored instruction-tuning approaches using GPT-4 or GPT-3.5 as teachers. Notable examples include WizardCoder (Luo et al., 2023), Magicoder (Wei et al., 2024), WaveCoder (Yu et al., 2024), OpenCodeInterpreter (Zheng et al., 2024), and ReflectionCoder (Ren et al., 2024). These models have achieved impressive performance on standard code generation benchmarks through enhanced data diversity and instruction complexity.

[Figure 2: pipeline diagram — (1) Self-invoking Problem Generation → (2) Solution Generation → (3) Test Cases Generation, with an executor loop over base problem, self-invoking problem, test inputs and test outputs: failed cases are manually checked and modified, passed cases are turned into assertions.]

Figure 2: The overview of benchmark construction. An example is shown in Figure 8. We summarize the entire benchmark construction process as follows: (1) Self-invoking Problem Generation: We use Deepseek-V2.5 to generate the self-invoking problems, as well as their candidate solutions and test inputs. (2) Solution Generation: We execute the generated solution with the test inputs in a controlled Python environment to obtain ground-truth outputs. (3) Test Cases Generation: We employ an iterative method involving Python execution checks and manual review to ensure that all test cases pass successfully. The final execution results are then used to construct complete test cases with the assert command.

3 Benchmark Construction

To facilitate a meaningful comparison between self-invoking code generation and traditional code generation, we have crafted two new benchmarks, HumanEval Pro and MBPP Pro. These benchmarks are extensions of the original HumanEval and MBPP, requiring the model to solve both the base problem and a more complex self-invoking problem. In addressing the self-invoking problems, LLMs are required to apply the solutions they have independently generated for the base problem. This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation. The benchmark construction process, illustrated in Figure 2, is discussed in detail in the following subsections.

3.1 Self-invoking Problem Generation

To ensure that all benchmarks are permissively licensed, we employ one of the state-of-the-art (SoTA) open-source models, DeepSeek-V2.5, to create new problems and solutions derived from the original HumanEval and MBPP datasets. Two main guidelines are established for self-invoking problem generation to rigorously evaluate LLMs.
1) Complexity Enhancement: The self-invoking problems should introduce additional programming challenges while preserving the core functionality of the original problems. This ensures that successful solutions require both an understanding of the original code and the ability to extend it appropriately. 2) Semantic Relevance: The self-invoking problems should maintain sufficient semantic similarity to their original counterparts to enable a meaningful self-invoking code generation process. Appendix F.1 presents the prompt for self-invoking problem generation.

3.2 Solution Generation

In the self-invoking problem generation process, the candidate solution and test inputs are generated simultaneously with the self-invoking problem. However, when dealing with self-invoking problems, these generated solutions are often flawed, which can lead to execution errors during the verification process, thereby highlighting a significant challenge in maintaining the accuracy and effectiveness of these test cases. Therefore, as shown in Figure 2, we propose a method to iteratively execute the solution code with the test inputs and obtain the expected outputs correctly. For the execution errors, the authors manually analyze these errors and modify the solutions to ensure that the final solution covers all the test cases comprehensively. The manual review process involves (1) identifying the root causes of the errors, (2) making necessary adjustments to the code or algorithm, and (3) re-evaluating the solution against the entire set of test cases to confirm its correctness and completeness. Table 1 shows that our rigorous verification process ensures the high quality of our benchmarks.

3.3 Test Cases Generation

After obtaining the self-invoking problem and its candidate solution, a critical challenge is ensuring the reliability of the test cases (with both test inputs and expected execution outputs) used to validate the generated solutions. Despite the apparent simplicity of using the same LLM context to generate both problems and test cases, CRUXEval (Gu et al., 2024) results show that even leading models like GPT-4 achieve only a 63.4% pass@1 rate in test output prediction. This suggests that using models like GPT-4 to directly generate test cases for problems will lead to many inaccurate evaluation results. Our iterative verification method effectively addresses this challenge. By combining Python execution checks with manual reviews, we ensure that all test cases accurately assess solution correctness and achieve a 100% pass@1 under correct implementation conditions. Furthermore, we categorize the common execution errors that occur during test case generation into four main types: variable type mismatches, index out of bounds, invalid input handling, and edge case failures. To obtain high-quality self-invoking problem solutions, we adopt remediation strategies including: (1) implementing input validation, (2) adding type checking, (3) handling edge cases explicitly, and (4) refining problem specifications when necessary. Beyond basic execution correctness, we also verify the self-invoking problems and solutions in the following aspects: (1) logical consistency between problem statements and test cases, (2) coverage of essential edge cases, and (3) alignment with the original problem objectives.
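To make the execute-then-assert step in Figure 2 concrete, the following is a minimal sketch of how a verified candidate solution and its test inputs can be turned into assert-based test cases. This is our illustration rather than the released pipeline; the field names (entry_point, test_inputs) and the positional-argument convention are assumptions.

def build_test_cases(solution_code: str, entry_point: str, test_inputs: list) -> list:
    # Execute the candidate solution; execution errors here are what trigger manual review.
    namespace = {}
    exec(solution_code, namespace)
    func = namespace[entry_point]
    cases = []
    for args in test_inputs:          # each item assumed to be a tuple of positional arguments
        expected = func(*args)        # ground-truth output produced by the canonical solution
        cases.append(f"assert {entry_point}({', '.join(map(repr, args))}) == {expected!r}")
    return cases

# Example: build_test_cases("def add(a, b):\n    return a + b", "add", [(1, 2)])
# returns ["assert add(1, 2) == 3"]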
Iteration | HumanEval Pro (%) | MBPP Pro (%)
Round 1 | 64.0 | 84.7
Round 2 | 98.8 | 99.7
Round 3 | 100.0 | 100.0

Table 1: Pass@1 (%) of candidate solutions across different iteration rounds of canonical solution and test case generation with human manual review.

4 Experiments

We present results of proprietary and open-source models on HumanEval Pro and MBPP Pro: Qwen2.5-Coder (Base and Instruct; 1.5B, 7B, 32B) (Hui et al., 2024), DeepseekCoder (Base and Instruct) (Guo et al., 2024), DeepseekCoder-V2 (DeepSeek-AI, 2024), Yi-Coder-9B (Base and Instruct) (01.AI, 2024), OpenCoder (Base and Instruct) (Huang et al., 2024), Magicoder-S-DS-6.7B (Wei et al., 2024), WaveCoder-Ultra-6.7B (Yu et al., 2024), Codestral-22B (Mistral, 2024), GPT-3.5 (Ouyang et al., 2022), GPT-4o (OpenAI, 2024a), Claude-3.5-sonnet (Anthropic, 2024) and o1-mini (OpenAI, 2024b). To facilitate reproducibility, the HuggingFace checkpoints of all open-source models and the API names of proprietary models are provided in Appendix C. Our prompts for evaluation are shown in Appendix F.2.

Following previous work (Chen et al., 2021), we use the pass@k score as the evaluation metric for HumanEval Pro and MBPP Pro. We use a greedy decoding strategy to generate solutions for all open-source models and set temperature=0.2 for all API models.
Model | Params | HumanEval (+) | HumanEval Pro 0-shot | HumanEval Pro 1-shot | MBPP (+) | MBPP Pro 0-shot | MBPP Pro 1-shot
Proprietary Models
o1-mini - 97.6 (90.2) 76.2 84.8 93.9 (78.3) 68.3 81.2
GPT-4o - 90.2 (86.0) 75.0 77.4 86.8 (72.5) 70.9 80.2
GPT-4-Turbo - 90.2 (86.6) 72.0 76.2 85.7 (73.3) 69.3 73.3
Claude-3.5-sonnet - 92.1 (86.0) 72.6 79.9 91.0 (74.6) 66.4 76.2
Open-source Models
Deepseek-V2.5 - 90.2 (83.5) 73.8 76.8 87.6 (74.1) 71.2 77.5
DeepseekCoder-V2-instruct 21/236B 90.2 (84.8) 77.4 82.3 89.4 (76.2) 71.4 76.5
Qwen2.5-Coder-1.5B-base 1.5B 43.9 (36.6) 37.2 39.6 69.2 (58.6) 48.4 51.3
Qwen2.5-Coder-1.5B-instruct 1.5B 70.7 (66.5) 33.5 37.8 69.2 (59.4) 42.1 43.7
DeepseekCoder-6.7B-base 6.7B 49.4 (39.6) 35.4 36.6 70.2 (51.6) 50.5 55.0
DeepseekCoder-6.7B-instruct 6.7B 78.6 (71.3) 55.5 61.6 74.9 (65.6) 57.1 58.2
Magicoder-S-DS-6.7B 6.7B 76.8 (70.7) 54.3 56.7 75.7 (64.4) 58.7 64.6
WaveCoder-Ultra-6.7B 6.7B 78.6 (69.5) 54.9 59.8 74.9 (63.5) 60.1 64.6
Qwen2.5-Coder-7B-base 7B 61.6 (53.0) 54.9 56.1 76.9 (62.9) 61.4 68.0
Qwen2.5-Coder-7B-instruct 7B 88.4 (84.1) 65.9 67.1 83.5 (71.7) 64.8 69.8
OpenCoder-8B-base 8B 66.5 (63.4) 39.0 42.1 79.9 (70.4) 52.4 53.7
OpenCoder-8B-instruct 8B 83.5 (78.7) 59.1 54.9 79.1 (69.0) 57.9 61.4
Yi-Coder-9B-base 9B 53.7 (46.3) 42.7 50.0 78.3 (64.6) 60.3 61.4
Yi-Coder-9B-chat 9B 85.4 (74.4) 59.8 64.0 81.5 (69.3) 64.8 71.7
Codestral-22B-v0.1 22B 81.1 (73.2) 59.1 65.9 78.2 (62.2) 63.8 71.2
DeepseekCoder-33B-base 33B 56.1 (47.6) 49.4 49.4 74.2 (60.7) 59.0 65.1
DeepseekCoder-33B-instruct 33B 79.3 (75.0) 56.7 62.8 80.4 (70.1) 64.0 68.3
Qwen2.5-Coder-32B-base 32B 65.9 (60.4) 61.6 67.1 83.0 (68.2) 67.7 73.3
Qwen2.5-Coder-32B-instruct 32B 92.7 (87.2) 70.1 80.5 90.2 (75.1) 69.8 77.5
LLaMA3-70B-instruct 70B 81.7 (72.0) 60.4 64.6 82.3 (69.0) 63.5 70.4

Table 2: Main results of different models on HumanEval Pro and MBPP Pro. More results are shown in Appendix A.

[Figure 3: bar chart titled "Model Performance on Various Benchmarks", comparing pass@1 (%) on HumanEval vs. HumanEval Pro and MBPP vs. MBPP Pro across the evaluated models.]

Figure 3: Performance comparison: HumanEval Pro (and MBPP Pro) vs. HumanEval (and MBPP).

For all previous benchmarks, we use the reported results whenever available; otherwise, we evaluate using the EvalPlus codebase (Liu et al., 2024).

Table 2 presents the pass@1 scores of HumanEval Pro and MBPP Pro alongside those of other relevant benchmarks, including HumanEval, HumanEval+, MBPP, and MBPP+ (Liu et al., 2024), highlighting the following salient observations: 1) Most LLMs have a 10% to 15% absolute performance drop on self-invoking code generation benchmarks. 2) Large open-source LLMs have comparable performance with proprietary LLMs on self-invoking benchmarks. Notably, DeepseekCoder-V2-instruct achieves 77.4% on HumanEval Pro, surpassing the scores of all proprietary LLMs. 3) Most instruction-tuned models show smaller improvements on self-invoking code generation benchmarks (e.g., HumanEval Pro) than on traditional benchmarks (e.g., HumanEval). For instance, Qwen2.5-Coder-32B-instruct has a 26.8% absolute improvement on HumanEval compared to Qwen2.5-Coder-32B-base (from 65.9% to 92.7%) but only 8.5% on HumanEval Pro (from 61.6% to 70.1%). Appendix A also presents the evaluation results for different k values with the sampling generation strategy. Section 5 provides detailed analysis of these results.

Figure 4: HumanEval (or MBPP) scores plotted against the results on HumanEval Pro and MBPP Pro (HumanEval+ and MBPP+). We present the comparison between base models and instruct models.

5 Analysis

Frontier LLMs still face challenges in self-invoking code generation. Table 2 and Figure 3 present the comparison between HumanEval Pro (or MBPP Pro) and HumanEval (or MBPP). As shown in Table 2, while 1-shot prompting improves model performance on HumanEval Pro and MBPP Pro, the pass@1 scores achieved on these datasets remain notably lower compared to their counterparts on the original HumanEval and MBPP benchmarks. This performance gap indicates that although current LLMs excel at direct code generation tasks, they struggle to maintain comparable performance when tasked with self-invoking code generation for complex problems. Notably, even the SoTA reasoning model o1-mini, which achieves an impressive 96.2% pass@1 on HumanEval, demonstrates significant performance degradation when tackling more complex problems, as evidenced by its lower 76.2% pass@1 score on HumanEval Pro under the zero-shot setting.
(a) Qwen2.5-Coder-7B-base  (b) Qwen2.5-Coder-32B-base  (c) Qwen2.5-Coder-7B-instruct  (d) Qwen2.5-Coder-32B-instruct

Figure 5: The confusion matrices of different models. We use (Failed, Passed) to indicate samples that fail in HumanEval Pro (or MBPP Pro) but pass in HumanEval (or MBPP).
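For reference, the quadrant counts visualized in Figure 5 can be tallied from per-problem pass/fail results in a few lines of Python. The sketch below is our illustration, not the authors' code, and assumes boolean pass lists are available for the base and Pro variants of each problem.

from collections import Counter

def confusion_counts(base_passed, pro_passed):
    # Keys follow the paper's (Pro result, base result) convention, so ("Failed", "Passed")
    # counts problems whose base version passes but whose self-invoking version fails.
    label = lambda ok: "Passed" if ok else "Failed"
    return Counter((label(p), label(b)) for b, p in zip(base_passed, pro_passed))

# Example with three problems: two solved in both forms, one failing only the Pro variant.
print(confusion_counts([True, True, True], [True, False, True]))
# Counter({('Passed', 'Passed'): 2, ('Failed', 'Passed'): 1})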

5.1 Base Model vs Instruct Model

Currently, the training of LLMs is typically divided into two stages: a pre-training stage that relies on self-supervised learning, and a subsequent supervised fine-tuning stage based on <instruction, response> pairs. Previous studies (Luo et al., 2023; Hui et al., 2024; Wei et al., 2024) have shown that the instruction-based supervised fine-tuning stage can significantly enhance the code generation capabilities of base models on traditional benchmarks. For example, as shown in Table 2, Qwen2.5-Coder-7B-instruct started from the Qwen2.5-Coder-7B base model and improved the HumanEval pass@1 score from 61.6% to 88.4%. It remains an open question whether these instruction-tuned models still show such significant improvements under a new problem-solving scenario. In this section, we explore this through our new benchmarks.

The instruction-tuned models demonstrate only marginal improvements compared to the base models on self-invoking code generation. In Figure 4, we plot the previously reported HumanEval (or MBPP) scores against the results on HumanEval Pro and MBPP Pro (HumanEval+ and MBPP+). From Figure 4, we have an interesting finding: when observing the correlation between HumanEval (or MBPP) and HumanEval Pro (or MBPP Pro), the orange dot (indicating the base model) is always to the upper left of the blue dot (indicating the instruction-tuned model). However, for the comparison between HumanEval (or MBPP) and HumanEval+ (or MBPP+), the blue dot is always distributed above the orange dot (even lying on a line for HumanEval vs HumanEval+). Overall, this suggests that while instruction-based fine-tuning significantly improves performance on simpler benchmarks like HumanEval(+) (or MBPP(+)), its efficiency diminishes for more complex self-invoking code generation tasks. On the other hand, base models like Qwen2.5-Coder-base and Deepseek-Coder-base have a higher

    Ratio = pass@k on HumanEval Pro (or MBPP Pro) / pass@k on HumanEval (or MBPP)    (1)

than instruct models, which indicates that they have elevated training potential on the self-invoking code generation task.
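To make Equation (1) concrete with the 0-shot numbers in Table 2: Qwen2.5-Coder-7B-base scores 54.9 on HumanEval Pro against 61.6 on HumanEval, a ratio of roughly 0.89, whereas Qwen2.5-Coder-7B-instruct scores 65.9 against 88.4, a ratio of roughly 0.75; the base model retains a larger fraction of its direct code generation performance on the self-invoking task.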

5.2 Confusion Matrix Correlation for Different Models

From Table 2, we observe that most LLMs have a score gap between direct code generation and self-invoking code generation tasks. To better understand the correlation and overlap between these two kinds of tasks, we compare the number of problems passed and failed in HumanEval Pro and MBPP Pro with their corresponding base problems in HumanEval and MBPP. Figure 5 presents an array of confusion matrices over problems, highlighting the following salient observations:
Error Type | Description | Examples
AssertionError | Failing to pass the test cases. | Examples in Appendix G.1
NameError | The code includes undefined variables. | Examples in Appendix G.2
ValueError | Unaware of the value of variables. | Examples in Appendix G.3
IndexError | Array out of bounds. | Examples in Appendix G.4
TypeError | Incorrect variable type usage. | Examples in Appendix G.5
Other Errors | KeyError, SyntaxError, ZeroDivisionError, IndentationError, etc. | –

Table 3: The execution error types and their descriptions in our evaluation results.

Model | CoT | HumanEval Pro | MBPP Pro
GPT-4o | ✘ | 75.0 | 70.9
GPT-4o | ✔ | 78.0 | 70.9
DeepseekV2.5 | ✘ | 73.8 | 71.2
DeepseekV2.5 | ✔ | 74.4 | 71.4
Qwen2.5-Coder-32B-ins | ✘ | 70.1 | 69.8
Qwen2.5-Coder-32B-ins | ✔ | 72.0 | 70.1
Qwen2.5-Coder-7B-ins | ✘ | 65.9 | 64.8
Qwen2.5-Coder-7B-ins | ✔ | 71.3 | 64.8

Table 4: Pass@1 (%) of selected models on HumanEval Pro and MBPP Pro with (✔) and without (✘) CoT prompting.

[Figure 6: bar chart "Error Counts by CoT and Direct Answer (GPT-4o)", comparing counts of AssertionError, NameError, ValueError, and IndexError.]

Figure 6: Error types of GPT-4o with and without CoT reasoning on HumanEval Pro.

Most LLMs are proficient in code generation tasks but struggle with generating code that can self-invoke effectively. Although some SoTA LLMs such as Qwen2.5-Coder-32B-instruct successfully solve 90% of base problems on the original HumanEval and MBPP benchmarks, over 25% of problems still fail on the more challenging HumanEval Pro and MBPP Pro benchmarks with self-invoking code generation (as shown in the top right of each subfigure in Figure 5). This suggests that the drop in the models' scores on HumanEval Pro and MBPP Pro is largely due to their lower accuracy in generating self-invoking code compared to direct code generation.

The instruction-tuned model does not significantly outperform the base model in the self-invoking code generation task. From the confusion matrices of the base model and the instruct model in Figure 5, we can observe a trend: the instruction-tuned model typically has a significantly higher number of (Passed, Passed) instances compared to the base model. However, for samples that pass the base problems but fail in HumanEval Pro and MBPP Pro, i.e., (Failed, Passed), the instruct model does not demonstrate notable improvement. This observation underscores our argument in Section 5.1: current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks.

5.3 Chain-of-Thought Prompting

To evaluate the impact of the model's reasoning ability, we evaluate the performance of GPT-4o, DeepseekV2.5, and Qwen2.5-Coder-instruct (7B and 32B) with and without Chain-of-Thought (CoT) prompting (Wei et al., 2022) on HumanEval Pro and MBPP Pro. The full prompt we use is shown in Appendix F.2. For CoT prompting, we use the greedy decoding strategy for generation to align with the earlier results. As shown in Table 4, after applying CoT, the pass@1 of the selected models on HumanEval Pro witnesses a significant improvement. Notably, the accuracy of GPT-4o increases from 75.0% to 78.0%. On MBPP Pro, although the models do not show a significant improvement, they still maintain their original performance level, indicating that CoT can enhance the accuracy of model-generated code to a notable degree.

CoT could help Code LLMs generate more reliable code when scheduling across multiple code-related problems. To further study which aspects of Code LLMs can be improved by CoT, we use Python to run the code generated by GPT-4o with and without CoT, and present the number of all error types that occurred in Figure 6. We have two main observations: (1) With CoT prompting, the AssertionError count decreases from 28 to 24. This indicates that CoT prompting enables the model to generate code that more frequently passes test cases. (2) The NameError count also decreases, which indicates that CoT prompting helps the model produce more self-contained code snippets and reduces the use of undefined variables. These findings highlight that CoT prompting could help LLMs generate more accurate and reliable solutions on the self-invoking code generation task.
[Figure 7: horizontal bar chart "Error Type Distribution Across Models", showing counts of AssertionError, NameError, ValueError, IndexError, TypeError, and other errors for each evaluated LLM.]

Figure 7: Statistics of error types across different LLMs on HumanEval Pro and MBPP Pro. We sum up all kinds of errors on the two benchmarks. Exact numbers are shown in Appendix H.

5.4 Error Analysis

To further understand the failure modes across different LLMs, we analyze the errors encountered in code generated by different LLMs for HumanEval Pro and MBPP Pro problems and categorize them by error type. The result is shown in Figure 7. Primarily, AssertionErrors constitute the main source of errors for all models on the self-invoking code generation task, which suggests that the majority of errors are still due to failing test cases. Secondly, NameErrors, which are often caused by undefined variables or functions, contribute significantly to the error rate. This suggests that despite the function information being provided in the prompt, many functions still fail to generate the correct function header, which may indicate that the LLM has issues with understanding or correctly utilizing the provided information. Finally, we also found that TypeErrors and ValueErrors accounted for a relatively small proportion of errors, which shows that LLMs still have some deficiencies in handling variable types and usage when generating self-invoking code.

Model | BCB-Lite (%) | BCB-Lite Pro (%)
GPT-4o | 64.9 | 52.6
GPT-4-Turbo | 61.4 | 52.6
Claude-3.5-sonnet | 73.7 | 50.9
DeepseekV2.5 | 80.7 | 50.9
Qwen2.5Coder-1.5B-base | 50.9 | 15.8
Qwen2.5Coder-1.5B-instruct | 50.9 | 10.5
OpenCoder-8B-base | 56.1 | 10.5
OpenCoder-8B-instruct | 75.4 | 22.8
DeepseekCoder-6.7B-base | 59.6 | 35.1
DeepseekCoder-6.7B-instruct | 56.1 | 35.1
WaveCoder-Ultra-6.7B | 61.4 | 26.3
Magicoder-S-DS-6.7B | 50.9 | 33.3
Yi-Coder-9B | 57.9 | 21.1
Yi-Coder-9B-Chat | 66.7 | 31.6
Qwen2.5Coder-7B-base | 59.6 | 38.6
Qwen2.5Coder-7B-instruct | 64.9 | 35.1
DeepseekCoder-33B-base | 71.9 | 38.6
DeepseekCoder-33B-instruct | 80.7 | 43.9
Qwen2.5Coder-32B-base | 68.4 | 49.1
Qwen2.5Coder-32B-instruct | 80.7 | 52.6
Codestral-22B | 78.9 | 54.4
QwQ-32B-preview | 86.0 | 59.6

Table 5: Passing rate (%) of LLMs on BigCodeBench (BCB)-Lite and BCB-Lite Pro. A dataset example of BCB-Lite Pro is shown in Appendix G.6.

6 Generalization Study of Self-invoking Code Generation

6.1 BigCodeBench-Lite Pro Benchmark

To study self-invoking code generation on a wider range of programming problems, we construct BigCodeBench-Lite Pro, a small self-invoking code generation benchmark derived from BigCodeBench (Zhuo et al., 2024). We first construct the BigCodeBench-Lite benchmark by selecting 57 problems with a solve rate between 50% and 70% from BigCodeBench (using the reported statistics in https://huggingface.co/datasets/bigcode/bigcodebench-solve-rate). For each example in BigCodeBench-Lite, we then curate the corresponding self-invoking problem as well as test cases, following the same procedure described in Section 3. After further filtering by human experts, BigCodeBench-Lite Pro contains 57 self-invoking programming problems from different topics.
6.2 Results Analysis

We evaluate a set of LLMs on BigCodeBench-Lite Pro. Table 5 presents the results (pass@1) of various proprietary and open-source LLMs, highlighting the following observations: (1) Although the base problems we selected have a solve rate between 50% and 70% on BigCodeBench, only a small number of models in Table 5 have a passing rate of more than 50% on BigCodeBench-Lite Pro. This highlights the difficulty of the self-invoking code generation task. (2) The instruction-tuned models still demonstrate only marginal improvements (and sometimes decreases) compared to base models, which also reinforces our argument in Section 5.1.

7 Conclusion

We present HumanEval Pro and MBPP Pro, as well as BigCodeBench-Lite Pro, a series of benchmarks to evaluate LLMs on the self-invoking code generation task, where the LLMs are required to solve a base problem and use its solution to address more complex problems. Through extensive evaluation of over 20 LLMs, we found that while these models have made significant progress in traditional code generation tasks, they still struggle with more complex self-invoking code generation tasks. Furthermore, we provide extensive comparison and analysis between existing instruct models and base models. HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies.

Limitations

In this paper, we present HumanEval Pro and MBPP Pro, a series of benchmarks that evaluate LLMs on the self-invoking code generation task. One limitation is that the programming language of our benchmarks only includes Python, due to the intrinsic limitation of the original HumanEval and MBPP. Secondly, although the models have shown shortcomings on the self-invoking problems, the diversity of the existing self-invoking problems in HumanEval Pro and MBPP Pro is still subject to the constraints of the original problems. Hence, future work should pay more attention to more diverse and multi-lingual self-invoking problem benchmarks.

References

01.AI. 2024. Meet yi-coder: A small but mighty llm for code.

Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.

Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A parallel corpus of python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

DeepSeek-AI. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. Preprint, arXiv:2405.04434.

Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. 2024. Cruxeval: A benchmark for code reasoning, understanding and execution. In Forty-first International Conference on Machine Learning.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. Deepseek-coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196.

Md Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou, and Chris Brown. 2022. Fixeval: Execution-based evaluation of program fixes for competitive programming problems.

Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, and Rifat Shahriyar. 2021. Codesc: A large code-description parallel dataset. arXiv preprint arXiv:2105.14220.

Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J Yang, JH Liu, Chenchen Zhang, Linzheng Chai, et al. 2024. Opencoder: The open cookbook for top-tier code large language models. arXiv preprint arXiv:2411.04905.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint.

Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of code language models on automated program repair. arXiv preprint arXiv:2302.05020.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770.

Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. Inferfix: End-to-end program repair with llms. arXiv preprint arXiv:2303.07263.

Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. 2024. Evocodebench: An evolving code generation benchmark with domain-specific evaluations. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36.

Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.

Mistral. 2024. Codestral.

Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2024. Octopack: Instruction tuning code large language models. In The Twelfth International Conference on Learning Representations.

Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, and Arman Cohan. 2024. L2CEval: Evaluating language-to-code generation capabilities of large language models. Transactions of the Association for Computational Linguistics, 12:1311–1329.

OpenAI. 2024a. Gpt-4o.

OpenAI. 2024b. Openai o1 system card.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Aojun Zhou, Junting Pan, and Hongsheng Li. 2024. Reflectioncoder: Learning from reflection sequence for enhanced one-off code generation. Preprint, arXiv:2405.17057.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning, pages 31693–31715. PMLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: Empowering code generation with oss-instruct. In Forty-first International Conference on Machine Learning.

Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. Top leaderboard ranking = top coding proficiency, always? evoeval: Evolving coding benchmarks via llm. arXiv preprint.

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2022. Practical program repair in the era of large pre-trained language models. arXiv preprint arXiv:2210.14179.

Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2024. Wavecoder: Widespread and versatile enhancement for code large language models by instruction tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5140–5153.

Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023. A survey on language models for code.
Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan
Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang,
Yang Li, et al. 2023. Codegeex: A pre-trained model
for code generation with multilingual evaluations on
humaneval-x. arXiv preprint arXiv:2303.17568.
Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu,
Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang
Yue. 2024. Opencodeinterpreter: Integrating code
generation with execution and refinement. arXiv
preprint arXiv:2402.14658.
Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravin-
dran, Sindhu Tipirneni, and Chandan K Reddy. 2022.
Xlcost: A benchmark dataset for cross-lingual code
intelligence. arXiv preprint arXiv:2206.08474.
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang,
Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo
Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2:
Breaking the barrier of closed-source models in code
intelligence. arXiv preprint arXiv:2406.11931.
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu,
Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani
Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.
2024. Bigcodebench: Benchmarking code genera-
tion with diverse function calls and complex instruc-
tions. arXiv preprint arXiv:2406.15877.
Appendix Contents

A Detailed Results
B Example in Benchmark Construction
C Model Information
D Comparison between HumanEval (Pro), MBPP (Pro) and BigCodeBench-Lite (Pro)
E Discussion about Self-invoking Problems and Solutions
F Prompts
    F.1 Prompts for Benchmark Construction
    F.2 Prompts for Evaluation
G Examples of Different Error Types
    G.1 Examples of AssertionError
    G.2 Examples of NameError
    G.3 Examples of ValueError
    G.4 Examples of IndexError
    G.5 Examples of TypeError
    G.6 An Example of BigCodeBench-Lite Pro
H Error Statistics across Different Models


A Detailed Results

Model HumanEval Pro (0-shot) MBPP Pro (0-shot)


LLaMA-3.1-8B-base 25.0 36.5
LLaMA-3.1-8B-instruct 45.7 53.7
LLaMA-3.1-70B-base 40.9 57.4
LLaMA-3.1-70B-instruct 60.4 63.8
Qwen-2.5-72B-base 62.2 65.3
Qwen-2.5-72B-instruct 68.9 68.8
QwQ-32B-preview 72.0 67.5
LLaMA-3.3-70B-instruct 67.1 64.6
Mistral-Large-instruct-2411 75.0 69.3

Table 6: Results of Other LLMs on HumanEval Pro and MBPP Pro (greedy decoding).

Model | HumanEval Pro (pass@1 / pass@5 / pass@10) | MBPP Pro (pass@1 / pass@5 / pass@10)
DeepseekCoder-6.7B-base 38.0 50.9 54.7 51.6 60.4 63.1
DeepseekCoder-6.7B-instruct 55.9 64.1 66.5 55.2 62.6 64.9
Magicoder-S-DS-6.7B 55.1 62.7 65.1 57.7 64.9 67.2
WaveCoder-Ultra-6.7B 55.7 61.4 63.0 58.2 64.4 66.3
DeepseekCoder-33B-base 49.4 60.8 65.2 59.1 67.2 69.3
DeepseekCoder-33B-instruct 59.1 68.6 71.3 63.4 70.6 72.9
Qwen2.5-Coder-7B-base 51.8 62.1 66.2 61.3 69.9 72.3
Qwen2.5-Coder-7B-instruct 65.7 72.5 75.0 64.2 70.5 72.6
OpenCoder-9B-base 44.5 56.2 59.9 54.8 62.9 65.0
OpenCoder-9B-instruct 59.8 68.5 70.8 58.1 63.7 65.1
Yi-Coder-9B-base 47.9 59.0 61.9 59.6 67.7 69.7
Yi-Coder-9B-chat 59.7 66.4 67.9 65.0 69.8 71.2
Codestral-22B 59.5 66.2 67.7 63.2 67.7 68.9
Qwen2.5-Coder-32B-base 62.4 70.3 72.2 67.6 75.0 76.9
Qwen2.5-Coder-32B-instruct 69.2 72.3 73.3 70.6 74.7 76.0
QwQ-32B-preview 70.9 77.7 79.5 67.0 73.0 74.5

Table 7: Results of different models on HumanEval Pro and MBPP Pro with sampling. We generate 20 samples for each problem with a random sampling strategy where temperature is set to 0.2 and top_p is set to 0.95.
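For reference, sampled results like those above are typically scored with the unbiased pass@k estimator of Chen et al. (2021); a minimal implementation is sketched below (our code following the standard formula, not necessarily the authors' exact script).

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct, budget of k attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem with 5 passing gives pass@1 = 0.25 and higher pass@5/10.
print([round(pass_at_k(20, 5, k), 3) for k in (1, 5, 10)])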

B Example in Benchmark Construction


Base Problem:

def eat(number, need, remaining):
    """
    You're a hungry rabbit, and you already have eaten a certain number of carrots,
    but now you need to eat more carrots to complete the day's meals.
    you should return an array of [ total number of eaten carrots after your meals,
    the number of carrots left after your meals ]
    if there are not enough remaining carrots, you will eat all remaining carrots, but will
    still be hungry.
    Example:
    * eat(5, 6, 10) -> [11, 4]
    * eat(4, 8, 9) -> [12, 1]
    * eat(1, 10, 10) -> [11, 0]
    * eat(2, 11, 5) -> [7, 0]
    Variables:
    @number : integer
        the number of carrots that you have eaten.
    @need : integer
        the number of carrots that you need to eat.
    @remaining : integer
        the number of remaining carrots thet exist in stock
    Constrain:
    * 0 <= number <= 1000
    * 0 <= need <= 1000
    * 0 <= remaining <= 1000
    Have fun :)
    """
    if(need <= remaining):
        return [ number + need , remaining-need ]
    else:
        return [ number + remaining , 0]

Self-invoking Problem:
You are a farmer who needs to feed a group of hungry rabbits. Each rabbit has a specific number of carrots it has already eaten and a specific number it still needs to eat. You have a limited number of carrots in stock. Write a function that takes in a list of rabbits, where each rabbit is represented by a tuple (number, need), and the total number of carrots in stock. The function should return the total number of carrots eaten by all rabbits and the number of carrots left in stock after feeding all the rabbits.

Canonical Solution:

def feed_rabbits(rabbits, stock):
    total_eaten = 0
    remaining_carrots = stock
    for rabbit in rabbits:
        number, need = rabbit
        eaten, remaining_carrots = eat(number, need, remaining_carrots)
        total_eaten += eaten - number
    return [total_eaten, remaining_carrots]

Test Cases:

assert feed_rabbits([(5, 6), (4, 8), (1, 10)], 25) == [24, 1]
assert feed_rabbits([(2, 11), (3, 5), (4, 7)], 20) == [20, 0]
assert feed_rabbits([(0, 5), (5, 5), (10, 5)], 30) == [15, 15]
assert feed_rabbits([(1, 10), (2, 11), (3, 12)], 50) == [33, 17]

Figure 8: An example of self-invoking problems in HumanEval Pro.

C Model Information

Model Name API Name


O1-mini o1-mini-2024-09-12
GPT-4o gpt-4o-2024-08-06
GPT-4-Turbo gpt-4-turbo-2024-04-09
Claude-3.5-sonnet claude-3-5-sonnet-20241022
Deepseek-V2.5 deepseek-chat

Model Name HuggingFace URL


DeepseekCoder-V2-instruct https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
Qwen2.5-Coder-1.5B-base https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B
Qwen2.5-Coder-1.5B-instruct https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct
DeepseekCoder-6.7B-base https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
DeepseekCoder-6.7B-instruct https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct
Magicoder-S-DS-6.7B https://huggingface.co/ise-uiuc/Magicoder-S-DS-6.7B
WaveCoder-Ultra-6.7B https://huggingface.co/microsoft/wavecoder-ultra-6.7b
Qwen2.5-Coder-7B-base https://huggingface.co/Qwen/Qwen2.5-Coder-7B
Qwen2.5-Coder-7B-instruct https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct
OpenCoder-8B-base https://huggingface.co/infly/OpenCoder-8B-Base
OpenCoder-8B-instruct https://huggingface.co/infly/OpenCoder-8B-Instruct
Yi-Coder-9B-base https://huggingface.co/01-ai/Yi-Coder-9B
Yi-Coder-9B-chat https://huggingface.co/01-ai/Yi-Coder-9B-Chat
Codestral-22B-v0.1 https://huggingface.co/mistralai/Codestral-22B-v0.1
DeepseekCoder-33B-base https://huggingface.co/deepseek-ai/deepseek-coder-33b-base
DeepseekCoder-33B-instruct https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct
Qwen2.5-Coder-32B-base https://huggingface.co/Qwen/Qwen2.5-Coder-32B
Qwen2.5-Coder-32B-instruct https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
LLaMA3-70B-instruct https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
QwQ-32B-Preview https://huggingface.co/Qwen/QwQ-32B-Preview
LLaMA3.1-8B-base https://huggingface.co/meta-llama/Llama-3.1-8B
LLaMA3.1-8B-instruct https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
LLaMA3.1-70B-base https://huggingface.co/meta-llama/Llama-3.1-70B
LLaMA3.1-70B-instruct https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
Qwen2.5-72B-base https://huggingface.co/Qwen/Qwen2.5-72B
Qwen2.5-72B-instruct https://huggingface.co/Qwen/Qwen2.5-72B-Instruct

Table 8: The corresponding API names and HuggingFace model URLs for the models evaluated in Table 2.
D Comparison between HumanEval (Pro), MBPP (Pro) and BigCodeBench-Lite (Pro)

[Figure 9: three bar charts of pass@1 (%) per model — "Model Performance on HumanEval and HumanEval Pro (0-shot)", "Model Performance on MBPP vs. MBPP Pro (0-shot)", and "Model Performance on BigCodeBench-Lite and BigCodeBench-Lite Pro (0-shot)".]

Figure 9: Comparison between the HumanEval family, MBPP family and BigCodeBench-Lite family.
E Discussion about Self-invoking Problems and Solutions
We analyze the complexity comparison between a base problem and its self-invoking counterpart by
examining the line count of their canonical solutions. The line count serves as a proxy for the complexity
of each problem. By comparing the number of lines required to solve the base problem with those
needed for the self-invoking version, we gain insight into how the introduction of self-invocation affects
the overall complexity. Generally, self-invoking problems, which often involve recursion or similar
constructs, may require more lines of code to handle additional logic and edge cases, thereby increasing
the complexity. This comparison helps in understanding the additional computational and conceptual
challenges introduced by self-invocation.
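The complexity proxy used here reduces to counting solution lines; a minimal helper (ours, assuming the canonical solution is available as a string) might look like the following.

def solution_line_count(canonical_solution: str) -> int:
    """Number of non-empty lines, used as a rough complexity proxy for Figure 10."""
    return sum(1 for line in canonical_solution.splitlines() if line.strip())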

[Figure 10: two panels titled "Complexity Comparison between Base Problem and Self-invoking Problem", plotting complexity (canonical-solution line count) against HumanEval Problem ID and MBPP Problem ID for base and self-invoking problems.]

Figure 10: Complexity comparison between base problem and self-invoking problem. We use the line count of the canonical solution for both the base problem and the self-invoking problem as a measure of the problem's complexity.
F Prompts
F.1 Prompts for Benchmark Construction
We set the prompt in our benchmark construction as follows:

Prompt for benchmark construction:


I'll give you a raw programming question and its solution, please generate a new problem that
requires multiple calls to the raw question to solve, and generate the solution in new_solution.
Please return with json format including 3 keys: 'new_problem', 'new_solution', 'test_input', I'll
use json.loads() to transform it to dict type.
To solve new_problems, new_solution should include the multiple function calls of raw question.
So new_problems will be not only a related problem but also a more complex problem than raw
problem.
raw problem:
{raw problem}
raw solution:
{raw solution}
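
As a usage illustration, the sketch below fills the construction prompt with one raw problem and parses the model's JSON reply with json.loads(), as the prompt requests; call_model is a placeholder for whichever LLM API is used and is not part of the paper's pipeline.

import json

# The construction prompt shown above; the bracketed placeholders are substituted literally below.
CONSTRUCTION_PROMPT = """I'll give you a raw programming question and its solution, please generate a new problem that requires multiple calls to the raw question to solve, and generate the solution in new_solution.
Please return with json format including 3 keys: 'new_problem', 'new_solution', 'test_input', I'll use json.loads() to transform it to dict type.
To solve new_problems, new_solution should include the multiple function calls of raw question. So new_problems will be not only a related problem but also a more complex problem than raw problem.
raw problem:
{raw problem}
raw solution:
{raw solution}"""

def build_self_invoking_problem(raw_problem: str, raw_solution: str, call_model) -> dict:
    # Substitute the placeholders literally ("{raw problem}" contains a space,
    # so str.format is not used here).
    prompt = (CONSTRUCTION_PROMPT
              .replace("{raw problem}", raw_problem)
              .replace("{raw solution}", raw_solution))
    reply = call_model(prompt)   # placeholder for the LLM API call
    record = json.loads(reply)   # the prompt asks for a JSON object with three keys
    assert {"new_problem", "new_solution", "test_input"} <= record.keys()
    return record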

F.2 Prompts for Evaluation


We set the prompt in our evaluation as follows:

Prompt of 0-shot: You are an exceptionally intelligent coding assistant that consistently delivers
accurate and reliable responses to user instructions. Write a solution of python file to the following
problems, the solution of the second problem requires single or multiple calls to the first solution
@@ Instruction
{base problem}
{self-invoking problem}
@@ Response

Prompt of 1-shot: You are an exceptionally intelligent coding assistant that consistently delivers
accurate and reliable responses to user instructions. Write a solution of python file to the following
problems, the solution of the second problem requires single or multiple calls to the first solution
@@ Instruction
{base problem}
{self-invoking problem}
{example}
@@ Response

Prompt with Chain-of-Thought:


You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable
responses to user instructions. Write a solution of python file to the following problems, the
solution of the second problem requires single or multiple calls to the first solution
@@ Instruction
{base problem}
{self-invoking problem}
@@ Response
Let’s think step by step.
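
For concreteness, the sketch below assembles the 0-shot prompt for one problem pair; the template mirrors the prompt text above, the field names are illustrative, and the chain-of-thought variant simply appends "Let's think step by step." after "@@ Response".

# Minimal sketch: assemble the evaluation prompt for one base/self-invoking problem pair.
ZERO_SHOT_TEMPLATE = (
    "You are an exceptionally intelligent coding assistant that consistently delivers "
    "accurate and reliable responses to user instructions. Write a solution of python file "
    "to the following problems, the solution of the second problem requires single or "
    "multiple calls to the first solution\n"
    "@@ Instruction\n"
    "{base_problem}\n"
    "{self_invoking_problem}\n"
    "@@ Response\n"
)

def build_evaluation_prompt(base_problem: str, self_invoking_problem: str, cot: bool = False) -> str:
    # Fill in the two problem statements; optionally append the chain-of-thought cue.
    prompt = ZERO_SHOT_TEMPLATE.format(
        base_problem=base_problem,
        self_invoking_problem=self_invoking_problem,
    )
    if cot:
        prompt += "Let's think step by step.\n"
    return prompt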

G Examples of Different Error Types


G.1 Examples of AssertionError
Listing 1: AssertionError in HumanEval/24-generated by Qwen2.5Coder-7B-base
Status: AssertionError
Traceback: assert largest_common_divisor([15, 30, 45]) == 15
Explanation: expected 15 but got 5.

def largest_divisor(n: int) -> int:
    """ For a given number n, find the largest number that divides n evenly, smaller than n
    >>> largest_divisor(15)
    5
    """
    for i in range(n - 1, 0, -1):
        if n % i == 0:
            return i
    return 1

def largest_common_divisor(numbers: list) -> int:
    """ For a given list of integers, find the largest number that divides each of the numbers
    in the list evenly, smaller than the smallest number in the list
    >>> largest_common_divisor([15, 30, 45])
    15
    """
    smallest = min(numbers)
    for i in range(smallest - 1, 0, -1):
        if all(n % i == 0 for n in numbers):
            return i
    return 1


assert largest_common_divisor([15, 30, 45]) == 15
assert largest_common_divisor([15, 30, 45]) == 15
assert largest_common_divisor([10, 20, 30]) == 10
assert largest_common_divisor([7, 14, 21]) == 7
assert largest_common_divisor([16, 32, 64]) == 16
assert largest_common_divisor([9, 18, 27]) == 9

G.2 Examples of NameError

Listing 2: NameError in HumanEval/14-generated by Qwen2.5Coder-7B-instruct


Status: NameError
Traceback: assert all_prefixes_for_list(['abc', 'def']) == [['a', 'ab', 'abc'], ['d', 'de', 'def']]
Explanation: The expected function name is 'all_prefixes_for_list', but 'all_prefixes_of_strings' is generated though this information is given in prompt.

from typing import List

def all_prefixes(string: str) -> List[str]:
    """ Return list of all prefixes from shortest to longest of the input string
    >>> all_prefixes('abc')
    ['a', 'ab', 'abc']
    """
    return [string[:i] for i in range(1, len(string) + 1)]

def all_prefixes_of_strings(strings: List[str]) -> List[List[str]]:
    """ Return list of lists where each sublist contains all prefixes of the corresponding
    string in the input list, sorted from shortest to longest. If the input list is empty,
    return an empty list.
    >>> all_prefixes_of_strings(['abc', 'def', 'ghi'])
    [['a', 'ab', 'abc'], ['d', 'de', 'def'], ['g', 'gh', 'ghi']]
    """
    return [all_prefixes(s) for s in strings]


assert all_prefixes_for_list(['abc', 'def']) == [['a', 'ab', 'abc'], ['d', 'de', 'def']]
assert all_prefixes_for_list(['']) == [[]]
assert all_prefixes_for_list([]) == []
assert all_prefixes_for_list(['single']) == [['s', 'si', 'sin', 'sing', 'singl', 'single']]

G.3 Examples of ValueError

Listing 3: ValueError in HumanEval/153-generated by O1-mini


Status: ValueError
Traceback:
>>> assert Strongest_Extensions({'Slices': ['SErviNGSliCes', 'Cheese', 'StuFfed'], 'my_class': ['AA', 'Be', 'CC']}) == ['Slices.SErviNGSliCes', 'my_class.AA']
>>> for class_name, extensions in classes_with_extensions
Explanation: ValueError: too many values to unpack (expected 2)

def Strongest_Extension(class_name, extensions):
    """
    Determines the strongest extension for a given class based on the strength calculation.

    The strength of an extension is calculated as the number of uppercase letters
    minus the number of lowercase letters in the extension's name.

    If multiple extensions have the same strength, the first one in the list is chosen.

    Parameters:
    - class_name (str): The name of the class.
    - extensions (List[str]): A list of extension names.

    Returns:
    - str: A string in the format "ClassName.StrongestExtensionName".

    Example:
    >>> Strongest_Extension('my_class', ['AA', 'Be', 'CC'])
    'my_class.AA'
    """
    if not extensions:
        return f"{class_name}.None"

    def calculate_strength(extension):
        cap = sum(1 for char in extension if char.isupper())
        sm = sum(1 for char in extension if char.islower())
        return cap - sm

    strongest_extension = extensions[0]
    max_strength = calculate_strength(strongest_extension)

    for ext in extensions[1:]:
        strength = calculate_strength(ext)
        if strength > max_strength:
            strongest_extension = ext
            max_strength = strength
    return f"{class_name}.{strongest_extension}"

def Strongest_Extensions(classes_with_extensions):
    """
    Determines the strongest extensions for a list of classes.

    For each class, it finds the strongest extension based on the strength calculation.
    If a class has no extensions, it is represented as "ClassName.None".

    Parameters:
    - classes_with_extensions (List[Tuple[str, List[str]]]):
      A list where each element is a tuple containing a class name and a list of its extensions.

    Returns:
    - List[str]: A list of strings in the format "ClassName.StrongestExtensionName".

    Example:
    >>> Strongest_Extensions([
    ...     ('my_class', ['AA', 'Be', 'CC']),
    ...     ('Slices', ['SErviNGSliCes', 'Cheese', 'StuFfed']),
    ...     ('EmptyClass', [])
    ... ])
    ['my_class.AA', 'Slices.SErviNGSliCes', 'EmptyClass.None']
    """
    result = []
    for class_name, extensions in classes_with_extensions:
        if extensions:
            strongest = Strongest_Extension(class_name, extensions)
        else:
            strongest = f"{class_name}.None"
        result.append(strongest)
    return result

assert Strongest_Extensions({'Slices': ['SErviNGSliCes', 'Cheese', 'StuFfed'], 'my_class': ['AA', 'Be', 'CC']}) == ['Slices.SErviNGSliCes', 'my_class.AA']
assert Strongest_Extensions({'Empty': [], 'Single': ['A']}) == ['Empty.None', 'Single.A']
assert Strongest_Extensions({'Mixed': ['aA', 'Bb', 'Cc'], 'AllCaps': ['AA', 'BB', 'CC']}) == ['Mixed.aA', 'AllCaps.AA']
assert Strongest_Extensions({'NoExtensions': []}) == ['NoExtensions.None']
assert Strongest_Extensions({}) == []  # Check for empty input

G.4 Examples of IndexError

Listing 4: IndexError in HumanEval/73-generated by GPT-4o


Status: IndexError
Traceback:
>>> assert smallest_change_2d([[1,2,3,5,4,7,9,6], [1,2,3,4,3,2,2], [1,2,3,2,1]]) == 7
>>> column = [arr[row][col] for row in range(rows)]
Explanation: IndexError: list index out of range

def smallest_change(arr):
    """
    Given an array arr of integers, find the minimum number of elements that
    need to be changed to make the array palindromic. A palindromic array is an array that
    is read the same backwards and forwards. In one change, you can change one element to any other element.
    """
    n = len(arr)
    changes = 0
    for i in range(n // 2):
        if arr[i] != arr[n - i - 1]:
            changes += 1
    return changes

def smallest_change_2d(arr):
    """
    Given a 2D array of integers, find the minimum number of elements that need to be changed to make each row
    and each column of the array palindromic. A palindromic row or column is a row or column that is read the same
    backwards and forwards. In one change, you can change one element to any other element.
    """
    if not arr or not arr[0]:
        return 0

    rows = len(arr)
    cols = len(arr[0])

    # Calculate changes needed for each row
    row_changes = sum(smallest_change(row) for row in arr)

    # Calculate changes needed for each column
    col_changes = 0
    for col in range(cols):
        column = [arr[row][col] for row in range(rows)]
        col_changes += smallest_change(column)

    return row_changes + col_changes

assert smallest_change_2d([[1,2,3,5,4,7,9,6], [1,2,3,4,3,2,2], [1,2,3,2,1]]) == 7
assert smallest_change_2d([[1,2,3,4], [4,3,2,1], [1,2,2,1], [1,2,3,4]]) == 8
assert smallest_change_2d([[1,2,3,2,1], [1,2,3,2,1], [1,2,3,2,1], [1,2,3,2,1]]) == 0

G.5 Examples of TypeError

Listing 5: TypeError in MBPP/164-generated by DeepseekCoder-V2-instruct


Status: TypeError
Traceback:
>>> assert total_and_average_volume([{'radius': 3, 'height': 5}, {'radius': 4, 'height': 6}]) == (147.6548547187203, 73.82742735936014)
>>> total_volume += volume_cone(r, h)
>>> return (math.pi * r**2 * h) / 3

Explanation: TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'

# Write a function to find the volume of a cone.
import math
def volume_cone(r, h):
    return (math.pi * r**2 * h) / 3

# Given a list of cones with their respective radii and heights, write a function to calculate the total volume of all cones and the average volume per cone.

def total_and_average_volume(cones):
    total_volume = 0
    for cone in cones:
        r, h = cone
        total_volume += volume_cone(r, h)
    average_volume = total_volume / len(cones)
    return total_volume, average_volume

assert total_and_average_volume([{'radius': 3, 'height': 5}, {'radius': 4, 'height': 6}]) == (147.6548547187203, 73.82742735936014)
assert total_and_average_volume([{'radius': 1, 'height': 2}]) == (2.0943951023931953, 2.0943951023931953)
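
The TypeError arises because the statement r, h = cone unpacks each cone dictionary into its keys (the strings 'radius' and 'height') rather than its numeric values, so r**2 is applied to a string. Purely for illustration (this corrected version is not part of the model's output or the benchmark), the loop would need to index the dictionary explicitly:

# Illustrative fix (hypothetical, not from the generated output): read the numeric
# values from each cone dictionary instead of unpacking its keys.
import math

def volume_cone(r, h):
    return (math.pi * r ** 2 * h) / 3

def total_and_average_volume_fixed(cones):
    total_volume = sum(volume_cone(c["radius"], c["height"]) for c in cones)
    return total_volume, total_volume / len(cones)

total, avg = total_and_average_volume_fixed([{"radius": 3, "height": 5}, {"radius": 4, "height": 6}])
print(total, avg)  # approximately 147.65 and 73.83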

G.6 An Example of BigCodeBench-Lite Pro

Listing 6: The outputs of GPT-4o on BigCodeBench-Lite Pro (BigCodeBench/355).


Traceback (most recent call last):
  File "evalpro/result/GPT-4o/bigcodebench_lite_pro/log/santized_results/case_20/gen_0.py", line 76, in <module>
    new_solution(points)
  File "evalpro/result/GPT-4o/bigcodebench_lite_pro/log/santized_results/case_20/gen_0.py", line 65, in new_solution
    voronoi_plot_2d(Voronoi(subset), ax=ax, show_vertices=False, line_colors=f'C{i}', point_size=2)
  File "_qhull.pyx", line 2677, in scipy.spatial._qhull.Voronoi.__init__
  File "_qhull.pyx", line 352, in scipy.spatial._qhull._Qhull.__init__
scipy.spatial._qhull.QhullError: QH6214 qhull input error: not enough points (1) to construct initial simplex (need 4)

import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt

def task_func(points, seed=0):
    """
    Calculate the Voronoi diagram for a number of points in 2D and plot it.
    Note: this function will raise errors when input is invalid, for example wrong type or shape.
    Jittering is applied prior to plotting.

    Parameters:
    - points (np.ndarray): A numpy ndarray of shape (n_points, 2) with the coordinates of the points.
    - seed (int): Random seed for reproducibility. Defaults to 0.

    Returns:
    tuple (vor, ax): A tuple containing:
        - vor (Voronoi): A Voronoi object representing the Voronoi diagram of the points.
        - ax (Axes): The axes of the plotted Voronoi diagram.
    """
    if points.shape[1] != 2:
        raise ValueError("Input points should have shape (n_points, 2)")

    np.random.seed(seed)
    jittered_points = points + np.random.normal(scale=0.01, size=points.shape)

    vor = Voronoi(jittered_points)

    fig, ax = plt.subplots()
    voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_colors='blue', point_size=2)
    plt.show()

    return vor, ax

def new_solution(points):
    """
    Given a set of points in 2D space, calculate and plot the Voronoi diagram for each subset of points obtained
    by dividing the original set into three equal parts. Additionally, overlay the Voronoi diagrams of these subsets
    on a single plot to visualize their combined effect.

    Parameters:
    - points (np.ndarray): A numpy ndarray of shape (n_points, 2) with the coordinates of the points.

    Returns:
    None
    """
    if len(points) < 3:
        raise ValueError("Need at least 3 points to divide into three subsets")

    # Sort points array for reproducibility
    points = np.array(sorted(points, key=lambda x: (x[0], x[1])))

    # Divide the points into three equal subsets
    n_points = len(points)
    part_size = n_points // 3

    subsets = [points[i * part_size:(i + 1) * part_size] for i in range(3)]

    # Create a plot for overlay
    fig, ax = plt.subplots()

    # Plot the Voronoi for each subset
    for i, subset in enumerate(subsets):
        # Note: Here `task_func` is assumed to return the `ax`, which will be reused for plotting
        voronoi_plot_2d(Voronoi(subset), ax=ax, show_vertices=False, line_colors=f'C{i}', point_size=2)

    plt.title("Overlay of Voronoi Diagrams for the Three Subsets")
    plt.show()

# Test case 1: Basic test with 9 points
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 3], [3, 4], [4, 4]])
new_solution(points)

# Test case 2: Test with exactly 3 points
points = np.array([[0, 0], [1, 1], [2, 2]])
new_solution(points)

# Test case 3: Test with random points ensuring at least 9 points
points = np.random.rand(9, 2)
new_solution(points)
H Error Statistics across Different Models

Model | Dataset | AssertionError | NameError | ValueError | IndexError | TypeError | OtherError | All
O1-mini | HumanEval Pro | 27 | 8 | 3 | 1 | 0 | 0 | 39
O1-mini | MBPP Pro | 89 | 15 | 6 | 2 | 4 | 4 | 120
O1-mini | All | 116 | 23 | 9 | 3 | 4 | 4 | 159
GPT-4o | HumanEval Pro | 28 | 11 | 2 | 1 | 0 | 0 | 41
GPT-4o | MBPP Pro | 82 | 17 | 4 | 1 | 5 | 1 | 110
GPT-4o | All | 110 | 28 | 6 | 2 | 5 | 1 | 151
DeepseekCoder-V2-instruct | HumanEval Pro | 26 | 7 | 1 | 1 | 1 | 1 | 37
DeepseekCoder-V2-instruct | MBPP Pro | 79 | 12 | 4 | 3 | 7 | 3 | 108
DeepseekCoder-V2-instruct | All | 105 | 19 | 5 | 4 | 8 | 4 | 145
DeepseekV2.5 | HumanEval Pro | 30 | 8 | 2 | 1 | 2 | 0 | 43
DeepseekV2.5 | MBPP Pro | 82 | 18 | 1 | 3 | 4 | 1 | 109
DeepseekV2.5 | All | 112 | 26 | 3 | 4 | 6 | 1 | 152
Qwen2.5-Coder-32B-instruct | HumanEval Pro | 32 | 12 | 2 | 2 | 1 | 1 | 50
Qwen2.5-Coder-32B-instruct | MBPP Pro | 89 | 16 | 3 | 1 | 4 | 1 | 114
Qwen2.5-Coder-32B-instruct | All | 121 | 28 | 5 | 3 | 5 | 2 | 164
Qwen2.5-Coder-7B-instruct | HumanEval Pro | 36 | 8 | 3 | 2 | 6 | 1 | 56
Qwen2.5-Coder-7B-instruct | MBPP Pro | 93 | 14 | 3 | 3 | 18 | 2 | 133
Qwen2.5-Coder-7B-instruct | All | 129 | 22 | 6 | 5 | 24 | 3 | 189
Claude-3.5-sonnet | HumanEval Pro | 30 | 11 | 1 | 1 | 0 | 2 | 45
Claude-3.5-sonnet | MBPP Pro | 87 | 28 | 3 | 1 | 6 | 2 | 127
Claude-3.5-sonnet | All | 117 | 39 | 4 | 2 | 6 | 4 | 172
LLaMa-3-70B-instruct | HumanEval Pro | 44 | 10 | 3 | 2 | 2 | 4 | 65
LLaMa-3-70B-instruct | MBPP Pro | 100 | 12 | 2 | 2 | 14 | 8 | 138
LLaMa-3-70B-instruct | All | 144 | 22 | 5 | 4 | 16 | 12 | 203
Codestral-22B | HumanEval Pro | 45 | 13 | 3 | 3 | 2 | 1 | 67
Codestral-22B | MBPP Pro | 102 | 16 | 3 | 1 | 12 | 3 | 137
Codestral-22B | All | 147 | 29 | 6 | 4 | 14 | 4 | 204
OpenCoder-8B-base | HumanEval Pro | 47 | 43 | 0 | 3 | 5 | 2 | 100
OpenCoder-8B-base | MBPP Pro | 114 | 43 | 2 | 2 | 14 | 6 | 181
OpenCoder-8B-base | All | 161 | 86 | 2 | 5 | 19 | 8 | 281
OpenCoder-8B-instruct | HumanEval Pro | 42 | 15 | 2 | 1 | 5 | 2 | 67
OpenCoder-8B-instruct | MBPP Pro | 118 | 22 | 3 | 1 | 11 | 4 | 159
OpenCoder-8B-instruct | All | 160 | 37 | 5 | 2 | 16 | 6 | 226
Qwen2.5Coder-1.5B-base | HumanEval Pro | 56 | 25 | 7 | 1 | 9 | 5 | 103
Qwen2.5Coder-1.5B-base | MBPP Pro | 117 | 37 | 3 | 4 | 14 | 21 | 196
Qwen2.5Coder-1.5B-base | All | 173 | 62 | 10 | 5 | 23 | 26 | 299
Qwen2.5Coder-7B-base | HumanEval Pro | 45 | 15 | 3 | 4 | 5 | 2 | 74
Qwen2.5Coder-7B-base | MBPP Pro | 99 | 21 | 1 | 3 | 16 | 6 | 146
Qwen2.5Coder-7B-base | All | 144 | 36 | 4 | 7 | 21 | 8 | 220
Qwen2.5Coder-32B-base | HumanEval Pro | 39 | 15 | 3 | 3 | 1 | 2 | 63
Qwen2.5Coder-32B-base | MBPP Pro | 90 | 17 | 2 | 2 | 7 | 4 | 122
Qwen2.5Coder-32B-base | All | 129 | 32 | 5 | 5 | 8 | 6 | 185
Yi-Coder-9B | HumanEval Pro | 48 | 31 | 2 | 5 | 3 | 5 | 94
Yi-Coder-9B | MBPP Pro | 92 | 37 | 1 | 3 | 12 | 5 | 150
Yi-Coder-9B | All | 140 | 68 | 3 | 8 | 15 | 10 | 244
Yi-Coder-9B-Chat | HumanEval Pro | 47 | 12 | 1 | 3 | 3 | 0 | 66
Yi-Coder-9B-Chat | MBPP Pro | 96 | 19 | 1 | 2 | 11 | 4 | 133
Yi-Coder-9B-Chat | All | 143 | 31 | 2 | 5 | 14 | 4 | 199
GPT-4-Turbo | HumanEval Pro | 33 | 8 | 3 | 1 | 1 | 0 | 46
GPT-4-Turbo | MBPP Pro | 91 | 18 | 1 | 1 | 5 | 0 | 116
GPT-4-Turbo | All | 124 | 26 | 4 | 2 | 6 | 0 | 162
DeepseekCoder-33B-base | HumanEval Pro | 55 | 16 | 2 | 2 | 3 | 5 | 83
DeepseekCoder-33B-base | MBPP Pro | 108 | 23 | 5 | 1 | 8 | 10 | 155
DeepseekCoder-33B-base | All | 163 | 39 | 7 | 3 | 11 | 15 | 238
DeepseekCoder-33B-instruct | HumanEval Pro | 49 | 14 | 2 | 2 | 4 | 0 | 71
DeepseekCoder-33B-instruct | MBPP Pro | 101 | 16 | 2 | 1 | 10 | 6 | 136
DeepseekCoder-33B-instruct | All | 150 | 30 | 4 | 3 | 14 | 6 | 207
DeepseekCoder-6.7B-base | HumanEval Pro | 59 | 24 | 4 | 4 | 6 | 9 | 106
DeepseekCoder-6.7B-base | MBPP Pro | 128 | 25 | 3 | 3 | 14 | 14 | 187
DeepseekCoder-6.7B-base | All | 187 | 49 | 7 | 7 | 20 | 23 | 293
DeepseekCoder-6.7B-instruct | HumanEval Pro | 46 | 15 | 4 | 4 | 2 | 2 | 73
DeepseekCoder-6.7B-instruct | MBPP Pro | 107 | 30 | 4 | 2 | 17 | 2 | 162
DeepseekCoder-6.7B-instruct | All | 153 | 45 | 8 | 6 | 19 | 4 | 235
Magicoder-S-DS | HumanEval Pro | 49 | 11 | 6 | 4 | 5 | 0 | 75
Magicoder-S-DS | MBPP Pro | 107 | 21 | 2 | 2 | 20 | 4 | 156
Magicoder-S-DS | All | 156 | 32 | 8 | 6 | 25 | 4 | 231
WaveCoder-Ultra-6.7B | HumanEval Pro | 51 | 12 | 2 | 3 | 4 | 2 | 74
WaveCoder-Ultra-6.7B | MBPP Pro | 113 | 20 | 2 | 4 | 8 | 4 | 151
WaveCoder-Ultra-6.7B | All | 164 | 32 | 4 | 7 | 12 | 6 | 225

Table 9: Error types of different models on HumanEval Pro and MBPP Pro.
