HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
https://github.com/CodeEval-Pro/CodeEval-Pro
Abstract

We introduce self-invoking code generation, a new task in which LLMs are given a base problem together with a related, more complex problem, and must solve the base problem and then invoke their own solution to address the complex one. This work makes three contributions. First, we propose a general recipe for constructing self-invoking benchmarks from existing datasets, yielding three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.

Model Input (Prompt): You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions. Write a solution of python file to the following problems, the solution of the second problem requires single or multiple calls to the first.

def replace_char(str1, ch, newch):
    return str1.replace(ch, newch)

def replace_multiple_chars(str1, char_map):
    for ch, newch in char_map.items():
        str1 = replace_char(str1, ch, newch)
    return str1

Test:

assert replace_multiple_chars('python', {'p': 'b', 'y': 'i'}) == 'bithon'

Figure 1: The overview of self-invoking code generation in HumanEval Pro and MBPP Pro. Given a base problem and a related, more complex problem, models are required to solve the base problem and use its solution to address the complex problem.

1 Introduction

Large Language Models (LLMs) have demonstrated significant progress in various code-related tasks including code generation (Roziere et al., 2023; Zhang et al., 2023; Ni et al., 2024), program repair (Xia et al., 2022; Jin et al., 2023), and code translation (Zhu et al., 2022). Traditional human-annotated benchmarks such as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) have been widely adopted to evaluate the code generation abilities of LLMs, providing standardized evaluation protocols for assessing their performance on code-related tasks. However, these existing benchmarks primarily focus on isolated, single-function code generation, which represents only a subset of the challenges encountered in real-world software development scenarios.
To evaluate LLMs under more realistic problem-solving scenarios, BigCodeBench (Zhuo et al., 2024) presents a benchmark of complex and practical problems that require LLMs to use multiple function calls from diverse libraries. While BigCodeBench highlights the use of external function calls, it falls short in assessing LLMs' reasoning ability to generate and invoke their own generated functions in problem-solving. CRUXEval (Gu et al., 2024) assesses LLMs' code reasoning by predicting function inputs and outputs. However, direct input and output prediction does not involve explicit code generation. In practical software engineering contexts, developers must not only write code but also comprehend, modify, and utilize existing code to solve more complex problems. Hence, the ability to understand and subsequently leverage one's own generated code, namely self-invoking code generation (Figure 1), plays an important role in applying LLMs' reasoning capabilities to code generation, a role that current benchmarks fail to capture.

Therefore, we present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks, to evaluate LLMs on the self-invoking code generation task. As illustrated in Figure 1, HumanEval Pro and MBPP Pro extend beyond simple code generation by introducing self-invoking problems, which require LLMs to solve the base problem and invoke their self-generated code to solve a more complex problem. By evaluating LLMs on the self-invoking code generation task, HumanEval Pro and MBPP Pro provide a useful and important probe for better understanding the programming capabilities of LLMs. The capability of self-invoking code generation also enables LLMs to tackle difficult tasks with greater autonomy and effectiveness.

To obtain HumanEval Pro and MBPP Pro, we propose a general recipe for constructing self-invoking code generation benchmarks by building upon existing datasets. First, we use DeepSeek-V2.5 (DeepSeek-AI, 2024) to generate self-invoking problems based on the original problems in HumanEval and MBPP. These problems are designed to be more complex than the base problems and closely related to them, ensuring progressive reasoning and coherent code invocation. Second, we generate the candidate solution and test inputs for each problem. Third, we execute the candidate solution code to generate outputs and use Python's assert command to build test cases. In this third stage, human experts manually review each problem and iteratively modify and execute the solution code to ensure that all canonical solutions correctly solve the problem and cover the test cases. To verify the reproducibility of our benchmark construction approach, we further construct BigCodeBench-Lite Pro, a new set of self-invoking problems derived from BigCodeBench (Zhuo et al., 2024). On BigCodeBench-Lite Pro, LLMs show a performance trend consistent with HumanEval Pro and MBPP Pro, which underscores the generalizability of our construction pipeline. Our benchmark construction approach can therefore also be extended to adapt other code generation benchmarks, particularly as the capabilities of LLMs advance and older benchmarks become obsolete.

Through extensive evaluation of various LLMs, we uncover a significant disparity between traditional code generation and self-invoking code generation capabilities. Our findings reveal that while frontier LLMs excel at generating individual code snippets, they often struggle to effectively utilize their own generated code to solve more complex problems. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro, demonstrating the challenges inherent in self-invoking code generation. From the comparison between instruction-tuned models and their base models, we find that instruction-tuned models are less effective on self-invoking code generation than on the traditional code generation task. Furthermore, our detailed statistics of failure cases in HumanEval Pro and MBPP Pro also reflect the shortcomings of LLMs in self-invoking code generation, thereby providing complementary insights into the real-world coding capabilities of LLMs.

2 Related Work

Recent advances in LLMs have demonstrated remarkable capabilities in code generation and understanding. This section reviews the current landscape of code-related benchmarks and LLMs.

Benchmarks for Code Generation. The evaluation landscape for Code LLMs has evolved significantly. HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) serve as fundamental benchmarks, focusing on Python function completion tasks with test-driven evaluation. Several benchmarks have expanded code evaluation to encompass multiple programming languages (Zheng et al., 2023; Athiwaratkun et al., 2022), complex tasks like program repair (Haque et al., 2022; Jiang et al., 2023; Muennighoff et al., 2024; Xia et al., 2024), dynamic problem sets (Jain et al., 2024), and code reasoning through code summarization (Barone and Sennrich, 2017; Hasan et al., 2021) and simulated execution (Gu et al., 2024).
Figure 2: The overview of benchmark construction (pipeline stages: (1) Self-invoking Problem Generation, (2) Solution Generation, (3) Test Cases Generation, with an executor loop of manual checks, modification, and assertion building). An example is shown in Figure 8. We summarize the entire benchmark construction process as follows: (1) Self-invoking Problem Generation: We use DeepSeek-V2.5 to generate the self-invoking problems, as well as their candidate solutions and test inputs. (2) Solution Generation: We execute the generated solution with the test inputs in a controlled Python environment to obtain ground-truth outputs. (3) Test Cases Generation: We employ an iterative method involving Python execution checks and manual review to ensure that all test cases pass successfully. The final execution results are then used to construct complete test cases with the assert command.
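A minimal, illustrative sketch of stages (2) and (3) above, assuming each generated item provides the candidate solution as a string, the name of its entry-point function, and a list of test-input argument tuples; the helper names here (build_test_cases, entry_point, test_inputs) are assumptions for this sketch, not our actual implementation:

def build_test_cases(candidate, entry_point, test_inputs):
    """Execute the candidate solution on each test input and turn the
    observed outputs into assert statements (stage 3 of Figure 2)."""
    namespace = {}
    exec(candidate, namespace)              # load the candidate solution (stage 2)
    test_cases = []
    for call_args in test_inputs:           # e.g. "('python', {'p': 'b', 'y': 'i'})"
        output = eval(f"{entry_point}{call_args}", namespace)
        test_cases.append(f"assert {entry_point}{call_args} == {output!r}")
    return test_cases

candidate = '''
def replace_char(str1, ch, newch):
    return str1.replace(ch, newch)

def replace_multiple_chars(str1, char_map):
    for ch, newch in char_map.items():
        str1 = replace_char(str1, ch, newch)
    return str1
'''

print(build_test_cases(candidate, "replace_multiple_chars",
                       ["('python', {'p': 'b', 'y': 'i'})"]))
# -> ["assert replace_multiple_chars('python', {'p': 'b', 'y': 'i'}) == 'bithon'"]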
To evaluate LLMs in professional software engineering, benchmarks like SWE-Bench (Jimenez et al., 2023), EvoCodeBench (Li et al., 2024), RepoBench (Liu et al., 2023), and GoogleCodeRepo (Shrivastava et al., 2023) focus on real-world tasks, code evolution, and repository-level challenges. These benchmarks collectively drive the advancement of LLMs, providing valuable insights into their strengths and limitations. Our benchmarks introduce a novel self-invoking code generation task, which addresses gaps left by existing benchmarks. This addition provides a more holistic framework for evaluating how LLMs apply their reasoning capabilities to code generation. Moreover, our benchmark construction method can also push existing benchmarks forward to accommodate more complex and challenging code-related tasks.

LLMs for Code Generation. The development of foundation models specifically designed for code generation has seen significant progress. Codex (Chen et al., 2021) pioneered this direction by fine-tuning GPT models on code-specific data. Subsequent models like CodeGeeX (Zheng et al., 2023) and CodeLlama (Roziere et al., 2023) further advanced the field by incorporating multilingual code understanding and generation capabilities. StarCoder (Li et al., 2023), DeepseekCoder (Zhu et al., 2024), and Qwen2.5-Coder (Hui et al., 2024) demonstrated the importance of high-quality code data curation and specialized architecture designs. Building upon these models, researchers have explored instruction-tuning approaches using GPT-4 or GPT-3.5 as teachers. Notable examples include WizardCoder (Luo et al., 2023), Magicoder (Wei et al., 2024), WaveCoder (Yu et al., 2024), OpenCodeInterpreter (Zheng et al., 2024), and ReflectionCoder (Ren et al., 2024). These models have achieved impressive performance on standard code generation benchmarks through enhanced data diversity and instruction complexity.

3 Benchmark Construction

To facilitate a meaningful comparison between self-invoking code generation and traditional code generation, we have crafted two new benchmarks, HumanEval Pro and MBPP Pro. These benchmarks are extensions of the original HumanEval and MBPP, requiring the model to solve both the base problem and a more complex self-invoking problem. In addressing the self-invoking problems, LLMs are required to apply the solutions they have independently generated for the base problem. This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation. The benchmark construction process, illustrated in Figure 2, is discussed in detail in the following subsections.

3.1 Self-invoking Problem Generation

To ensure that all benchmarks are permissively licensed, we employ one of the state-of-the-art (SoTA) open-source models, DeepSeek-V2.5, to create new problems and solutions derived from the original HumanEval and MBPP datasets. Two main guidelines are established for self-invoking problem generation to rigorously evaluate LLMs.
1) Complexity Enhancement: The self-invoking problems should introduce additional programming challenges while preserving the core functionality of the original problems. This ensures that successful solutions require both an understanding of the original code and the ability to extend it appropriately. 2) Semantic Relevance: The self-invoking problems should maintain sufficient semantic similarity to their original counterparts to enable a meaningful self-invoking code generation process. Appendix F.1 presents the prompt for self-invoking problem generation.

3.2 Solution Generation

In the self-invoking problem generation process, the candidate solution and test inputs are generated simultaneously with the self-invoking problem. However, when dealing with self-invoking problems, these generated solutions are often flawed, which can lead to execution errors during the verification process, highlighting a significant challenge in maintaining the accuracy and effectiveness of these test cases. Therefore, as shown in Figure 2, we propose a method to iteratively execute the solution code with the test inputs and obtain the expected outputs correctly. For the execution errors, the authors manually analyze the errors and modify the solutions to ensure that the final solution covers all the test cases comprehensively. The manual review process involves (1) identifying the root causes of the errors, (2) making necessary adjustments to the code or algorithm, and (3) re-evaluating the solution against the entire set of test cases to confirm its correctness and completeness. Table 1 shows that our rigorous verification process ensures the high quality of our benchmarks.

Table 1: Pass@1 (%) of candidate solutions across different iteration rounds of canonical solution and test case generation with human manual review.

Iteration | HumanEval Pro (%) | MBPP Pro (%)
Round 1 | 64.0 | 84.7
Round 2 | 98.8 | 99.7
Round 3 | 100.0 | 100.0

3.3 Test Cases Generation

After obtaining the self-invoking problem and its candidate solution, a critical challenge is ensuring the reliability of the test cases (with both test inputs and expected execution outputs) used to validate the generated solutions. Despite the apparent simplicity of using the same LLM context to generate both problems and test cases, CRUXEval (Gu et al., 2024) results show that even leading models like GPT-4 achieve only a 63.4% pass@1 rate in test output prediction. This suggests that using models like GPT-4 to directly generate test cases for problems would lead to many inaccurate evaluation results. Our iterative verification method effectively addresses this challenge. By combining Python execution checks with manual reviews, we ensure that all test cases accurately assess solution correctness and achieve a 100% pass@1 under correct implementation conditions. Furthermore, we categorize the common execution errors that occur during test case generation into four main types: variable type mismatches, index out of bounds, invalid input handling, and edge case failures. To obtain high-quality self-invoking problem solutions, we adopt several remediation strategies, including: (1) implementing input validation, (2) adding type checking, (3) handling edge cases explicitly, and (4) refining problem specifications when necessary. Beyond basic execution correctness, we also verify the self-invoking problems and solutions in the following aspects: (1) logical consistency between problem statements and test cases, (2) coverage of essential edge cases, and (3) alignment with the original problem objectives.

4 Experiments

We present results of proprietary models and open-source models on HumanEval Pro and MBPP Pro: Qwen2.5-Coder (Base and Instruct; 1.5B, 7B, 32B) (Hui et al., 2024), DeepseekCoder (Base and Instruct) (Guo et al., 2024), DeepseekCoder-V2 (DeepSeek-AI, 2024), Yi-Coder-9B (Base and Instruct) (01.AI, 2024), OpenCoder (Base and Instruct) (Huang et al., 2024), Magicoder-S-DS-6.7B (Wei et al., 2024), WaveCoder-Ultra-6.7B (Yu et al., 2024), Codestral-22B (Mistral, 2024), GPT-3.5 (Ouyang et al., 2022), GPT-4o (OpenAI, 2024a), Claude-3.5-sonnet (Anthropic, 2024), and o1-mini (OpenAI, 2024b). To facilitate reproducibility, the HuggingFace checkpoints of all open-source models and the API names of the proprietary models are provided in Appendix C. Our prompts for evaluation are shown in Appendix F.2.

Following previous work (Chen et al., 2021), we use the pass@k (Chen et al., 2021) score as the evaluation metric for HumanEval Pro and MBPP Pro.
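For reference, pass@k can be computed with the standard unbiased estimator of Chen et al. (2021); the snippet below is an illustrative sketch of the metric, not our exact evaluation code:

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
    computed in a numerically stable form from n samples with c correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem (as in Appendix A), 12 of which pass.
print(pass_at_k(n=20, c=12, k=1))   # 0.6
print(pass_at_k(n=20, c=12, k=5))   # ~0.996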
Model | Params | HumanEval (+) | HumanEval Pro (0-shot) | HumanEval Pro (1-shot) | MBPP (+) | MBPP Pro (0-shot) | MBPP Pro (1-shot)
Proprietary Models
o1-mini - 97.6 (90.2) 76.2 84.8 93.9 (78.3) 68.3 81.2
GPT-4o - 90.2 (86.0) 75.0 77.4 86.8 (72.5) 70.9 80.2
GPT-4-Turbo - 90.2 (86.6) 72.0 76.2 85.7 (73.3) 69.3 73.3
Claude-3.5-sonnet - 92.1 (86.0) 72.6 79.9 91.0 (74.6) 66.4 76.2
Open-source Models
Deepseek-V2.5 - 90.2 (83.5) 73.8 76.8 87.6 (74.1) 71.2 77.5
DeepseekCoder-V2-instruct 21/236B 90.2 (84.8) 77.4 82.3 89.4 (76.2) 71.4 76.5
Qwen2.5-Coder-1.5B-base 1.5B 43.9 (36.6) 37.2 39.6 69.2 (58.6) 48.4 51.3
Qwen2.5-Coder-1.5B-instruct 1.5B 70.7 (66.5) 33.5 37.8 69.2 (59.4) 42.1 43.7
DeepseekCoder-6.7B-base 6.7B 49.4 (39.6) 35.4 36.6 70.2 (51.6) 50.5 55.0
DeepseekCoder-6.7B-instruct 6.7B 78.6 (71.3) 55.5 61.6 74.9 (65.6) 57.1 58.2
Magicoder-S-DS-6.7B 6.7B 76.8 (70.7) 54.3 56.7 75.7 (64.4) 58.7 64.6
WaveCoder-Ultra-6.7B 6.7B 78.6 (69.5) 54.9 59.8 74.9 (63.5) 60.1 64.6
Qwen2.5-Coder-7B-base 7B 61.6 (53.0) 54.9 56.1 76.9 (62.9) 61.4 68.0
Qwen2.5-Coder-7B-instruct 7B 88.4 (84.1) 65.9 67.1 83.5 (71.7) 64.8 69.8
OpenCoder-8B-base 8B 66.5 (63.4) 39.0 42.1 79.9 (70.4) 52.4 53.7
OpenCoder-8B-instruct 8B 83.5 (78.7) 59.1 54.9 79.1 (69.0) 57.9 61.4
Yi-Coder-9B-base 9B 53.7 (46.3) 42.7 50.0 78.3 (64.6) 60.3 61.4
Yi-Coder-9B-chat 9B 85.4 (74.4) 59.8 64.0 81.5 (69.3) 64.8 71.7
Codestral-22B-v0.1 22B 81.1 (73.2) 59.1 65.9 78.2 (62.2) 63.8 71.2
DeepseekCoder-33B-base 33B 56.1 (47.6) 49.4 49.4 74.2 (60.7) 59.0 65.1
DeepseekCoder-33B-instruct 33B 79.3 (75.0) 56.7 62.8 80.4 (70.1) 64.0 68.3
Qwen2.5-Coder-32B-base 32B 65.9 (60.4) 61.6 67.1 83.0 (68.2) 67.7 73.3
Qwen2.5-Coder-32B-instruct 32B 92.7 (87.2) 70.1 80.5 90.2 (75.1) 69.8 77.5
LLaMA3-70B-instruct 70B 81.7 (72.0) 60.4 64.6 82.3 (69.0) 63.5 70.4
Table 2: Main results of different models on HumanEval Pro and MBPP Pro. More results are shown in Appendix A.
Figure 3: Performance comparison: HumanEval Pro (and MBPP Pro) vs. HumanEval (and MBPP), pass@1 (%) for each evaluated model.
We use a greedy decoding strategy to generate solutions for all open-source models and set temperature=0.2 for all API models. For all previous benchmarks, we use the reported results whenever available; otherwise, we evaluate using the EvalPlus codebase (Liu et al., 2024).

Table 2 presents the pass@1 scores of HumanEval Pro and MBPP Pro alongside those of other relevant benchmarks, including HumanEval, HumanEval+, MBPP, and MBPP+ (Liu et al., 2024), highlighting the following salient observations: 1) Most LLMs show a 10% to 15% absolute performance drop on the self-invoking code generation benchmarks. 2) Large open-source LLMs achieve performance comparable to proprietary LLMs on the self-invoking benchmarks. Notably, DeepseekCoder-V2-instruct achieves 77.4% on HumanEval Pro, surpassing the scores of all proprietary LLMs. 3) Most instruction-tuned models show smaller improvements on the self-invoking code generation benchmarks (e.g., HumanEval Pro) than on traditional benchmarks (e.g., HumanEval). For instance, Qwen2.5-Coder-32B-instruct achieves a 26.8% absolute improvement over Qwen2.5-Coder-32B-base on HumanEval (from 65.9% to 92.7%) but only 8.5% on HumanEval Pro (from 61.6% to 70.1%). Appendix A also presents the evaluation results for different k values with the sampling generation strategy. Section 5 provides a detailed analysis of these results.

Figure 4: HumanEval (or MBPP) scores against the results on HumanEval Pro and MBPP Pro (HumanEval+ and MBPP+). We present the comparison between base models and instruct models.

5 Analysis

Frontier LLMs still face challenges in self-invoking code generation. Table 2 and Figure 3 present the comparison between HumanEval Pro (or MBPP Pro) and HumanEval (or MBPP). As shown in Table 2, while 1-shot prompting improves model performance on HumanEval Pro and MBPP Pro, the pass@1 scores achieved on these datasets remain notably lower than their counterparts on the original HumanEval and MBPP benchmarks. This performance gap indicates that although current LLMs excel at direct code generation tasks, they struggle to maintain comparable performance when tasked with self-invoking code generation for complex problems. Notably, even the SoTA reasoning model o1-mini, which achieves an impressive 96.2% pass@1 on HumanEval, shows significant performance degradation when tackling more complex problems, as evidenced by its lower 76.2% pass@1 score on HumanEval Pro under the zero-shot setting.

5.1 Base Model vs. Instruct Model

Figure 5: The confusion matrix of different models. We use (Failed, Passed) to indicate samples that fail in HumanEval Pro (or MBPP Pro) but pass in HumanEval (or MBPP).

Currently, the training of LLMs is typically divided into two stages: a pre-training stage that relies on self-supervised learning, and a subsequent supervised fine-tuning stage based on <instruction, response> pairs. Previous studies (Luo et al., 2023; Hui et al., 2024; Wei et al., 2024) have shown that the instruction-based supervised fine-tuning stage can significantly enhance the code generation capabilities of base models on traditional benchmarks. For example, as shown in Table 2, Qwen2.5-Coder-7B-instruct started from the Qwen2.5-Coder-7B base model and improved the HumanEval pass@1 score from 61.6% to 88.4%. It remains an open question whether these instruction-tuned models still show such significant improvements in a new problem-solving scenario. In this section, we explore this question through our new benchmarks.

In Figure 4, one dot of each model pair always lies above the orange dot (the two even lie on a line for HumanEval vs. HumanEval+). Overall, this suggests that while instruction-based fine-tuning significantly improves performance on simpler benchmarks like HumanEval (+) (or MBPP (+)), its efficiency diminishes for more complex self-invoking code generation tasks. On the other hand, base models like Qwen2.5-Coder-base and Deepseek-Coder-base have a higher

    Ratio = pass@k on HumanEval Pro (or MBPP Pro) / pass@k on HumanEval (or MBPP)    (1)

than instruct models, which indicates that they have greater training potential on the self-invoking code generation task.
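As a worked example of Equation (1), the snippet below instantiates the ratio with pass@1 numbers taken from Table 2; the higher ratio of the base model illustrates the point above:

# Equation (1) instantiated with pass@1 numbers from Table 2.
scores = {
    "Qwen2.5-Coder-7B-base":     {"HumanEval": 61.6, "HumanEval Pro": 54.9},
    "Qwen2.5-Coder-7B-instruct": {"HumanEval": 88.4, "HumanEval Pro": 65.9},
}
for model, s in scores.items():
    ratio = s["HumanEval Pro"] / s["HumanEval"]
    print(f"{model}: Ratio = {ratio:.2f}")
# Qwen2.5-Coder-7B-base: Ratio = 0.89
# Qwen2.5-Coder-7B-instruct: Ratio = 0.75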
Table 3: The execution error types and their descriptions in our evaluation results.

Table 4: Pass@1 (%) of models on HumanEval Pro and MBPP Pro with (✔) and without (✘) CoT prompting.

Model | CoT | HumanEval Pro | MBPP Pro
GPT-4o | ✘ | 75.0 | 70.9
GPT-4o | ✔ | 78.0 | 70.9
DeepseekV2.5 | ✘ | 73.8 | 71.2
DeepseekV2.5 | ✔ | 74.4 | 71.4
Qwen2.5-Coder-32B-ins | ✘ | 70.1 | 69.8
Qwen2.5-Coder-32B-ins | ✔ | 72.0 | 70.1
Qwen2.5-Coder-7B-ins | ✘ | 65.9 | 64.8
Qwen2.5-Coder-7B-ins | ✔ | 71.3 | 64.8

Figure 6: Error types of GPT-4o with and without CoT reasoning on HumanEval Pro (error counts by CoT vs. direct answer; e.g., AssertionError: 28 direct vs. 24 with CoT).
Most LLMs are less accurate at generating code that can self-invoke effectively. Although some SoTA LLMs such as Qwen2.5-Coder-32B-instruct successfully solve 90% of the base problems on the original HumanEval and MBPP benchmarks, over 25% of problems still fail on the more challenging HumanEval Pro and MBPP Pro benchmarks with self-invoking code generation (as shown in the top right of each subfigure in Figure 5). This suggests that the drop in the models' scores on HumanEval Pro and MBPP Pro is largely due to their lower accuracy in generating self-invoking code compared to direct code generation.

The instruction-tuned model does not significantly outperform the base model on the self-invoking code generation task. From the confusion matrices of the base models and the instruct models in Figure 5, we can observe a trend: the instruction-tuned model typically has a significantly higher number of (Passed, Passed) instances compared to the base model. However, for samples that pass the base problems but fail in HumanEval Pro and MBPP Pro, i.e., (Failed, Passed), the instruct model does not demonstrate notable improvement. This observation underscores our argument in Section 5.1: current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks.

To evaluate the impact of the model's reasoning ability, we evaluated the performance of GPT-4o, DeepseekV2.5, and Qwen2.5-Coder-instruct (7B and 32B) with and without Chain-of-Thought (CoT) prompting (Wei et al., 2022) on HumanEval Pro and MBPP Pro. The full prompt we use is shown in Appendix F.2. For CoT prompting, we used the greedy decoding strategy for generation to align with the earlier results. As shown in Table 4, after applying CoT, the pass@1 of the selected models on HumanEval Pro witnesses a significant improvement. Notably, the accuracy of GPT-4o increases from 75.0% to 78.0%. On MBPP Pro, although the models do not show a significant improvement, they still maintain their original performance level, indicating that CoT can enhance the accuracy of model-generated code to a notable degree.

CoT can help Code LLMs generate more reliable code when scheduling across multiple code-related problems. To further study which aspects of Code LLMs can be improved by CoT, we use Python to run the code generated by GPT-4o with and without CoT, and present the number of all error types that occurred in Figure 6. We have two main observations: (1) With CoT prompting, the AssertionError count decreases from 28 to 24. This indicates that CoT prompting enables the model to generate code that more frequently passes test cases. (2) The NameError number de-
Pass@1 (%) on BigCodeBench-Lite and BigCodeBench-Lite Pro:

Model | BigCodeBench-Lite (%) | BCB-Lite Pro (%)
GPT-4o | 64.9 | 52.6
GPT-4-Turbo | 61.4 | 52.6
Claude-3.5-sonnet | 73.7 | 50.9
DeepseekV2.5 | 80.7 | 50.9
Qwen2.5Coder-1.5B-base | 50.9 | 15.8
Qwen2.5Coder-1.5B-instruct | 50.9 | 10.5
OpenCoder-8B-base | 56.1 | 10.5
OpenCoder-8B-instruct | 75.4 | 22.8
DeepseekCoder-6.7B-base | 59.6 | 35.1
DeepseekCoder-6.7B-instruct | 56.1 | 35.1
WaveCoder-Ultra-6.7B | 61.4 | 26.3
Magicoder-S-DS-6.7B | 50.9 | 33.3
Yi-Coder-9B | 57.9 | 21.1
Yi-Coder-9B-Chat | 66.7 | 31.6
Qwen2.5Coder-7B-base | 59.6 | 38.6
Qwen2.5Coder-7B-instruct | 64.9 | 35.1
DeepseekCoder-33B-base | 71.9 | 38.6
DeepseekCoder-33B-instruct | 80.7 | 43.9
Qwen2.5Coder-32B-base | 68.4 | 49.1
Qwen2.5Coder-32B-instruct | 80.7 | 52.6
Codestral-22B | 78.9 | 54.4
QwQ-32B-preview | 86.0 | 59.6

Figure 7: Statistics of error types across different LLMs on HumanEval Pro and MBPP Pro (AssertionError, NameError, ValueError, IndexError, TypeError, and other errors). We sum up all kinds of errors on the two benchmarks. Exact numbers are shown in Appendix H.
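A minimal sketch of how error-type statistics such as those in Figures 6 and 7 can be tallied; the data layout (pairs of generated solution and test code as strings) is an assumption for illustration, and a real harness would additionally sandbox execution:

from collections import Counter

def classify_errors(samples):
    """Run each (generated_solution, test_code) pair and tally the exception
    type raised, e.g. AssertionError, NameError, ValueError, IndexError."""
    counts = Counter()
    for solution, test in samples:
        namespace = {}
        try:
            exec(solution + "\n" + test, namespace)
        except Exception as err:
            counts[type(err).__name__] += 1
    return counts

samples = [
    ("def add(a, b):\n    return a - b",    # buggy generated solution
     "assert add(1, 2) == 3"),
]
print(classify_errors(samples))             # Counter({'AssertionError': 1})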
References

01.AI. 2024. Meet yi-coder: A small but mighty llm for code.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint.

Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of code language models on automated program repair. arXiv preprint arXiv:2302.05020.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770.

Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. Inferfix: End-to-end program repair with llms. arXiv preprint arXiv:2303.07263.

Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. 2024. Evocodebench: An evolving code generation benchmark with domain-specific evaluations. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36.

Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.

Mistral. 2024. Codestral.

Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2024. Octopack: Instruction tuning code large language models. In The Twelfth International Conference on Learning Representations.

Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, and Arman Cohan. 2024. L2CEval: Evaluating language-to-code generation capabilities of large language models. Transactions of the Association for Computational Linguistics, 12:1311–1329.

OpenAI. 2024a. Gpt-4o.

OpenAI. 2024b. Openai o1 system card.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Aojun Zhou, Junting Pan, and Hongsheng Li. 2024. Reflectioncoder: Learning from reflection sequence for enhanced one-off code generation. Preprint, arXiv:2405.17057.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning, pages 31693–31715. PMLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: Empowering code generation with oss-instruct. In Forty-first International Conference on Machine Learning.

Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. Top leaderboard ranking = top coding proficiency, always? Evoeval: Evolving coding benchmarks via llm. arXiv preprint.

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2022. Practical program repair in the era of large pre-trained language models. arXiv preprint arXiv:2210.14179.

Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2024. Wavecoder: Widespread and versatile enhancement for code large language models by instruction tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5140–5153.

Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023. A survey on language models for code.
Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan
Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang,
Yang Li, et al. 2023. Codegeex: A pre-trained model
for code generation with multilingual evaluations on
humaneval-x. arXiv preprint arXiv:2303.17568.
Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu,
Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang
Yue. 2024. Opencodeinterpreter: Integrating code
generation with execution and refinement. arXiv
preprint arXiv:2402.14658.
Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravin-
dran, Sindhu Tipirneni, and Chandan K Reddy. 2022.
Xlcost: A benchmark dataset for cross-lingual code
intelligence. arXiv preprint arXiv:2206.08474.
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang,
Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo
Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2:
Breaking the barrier of closed-source models in code
intelligence. arXiv preprint arXiv:2406.11931.
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu,
Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani
Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.
2024. Bigcodebench: Benchmarking code genera-
tion with diverse function calls and complex instruc-
tions. arXiv preprint arXiv:2406.15877.
Appendix Contents

A Detailed Results
C Model Information
D Comparison between HumanEval (Pro), MBPP (Pro) and BigCodeBench-Lite (Pro)
E Discussion about Self-invoking Problems and Solutions
F Prompts
F.1 Prompts for Benchmark Construction
F.2 Prompts for Evaluation
A Detailed Results

Table 6: Results of other LLMs on HumanEval Pro and MBPP Pro (greedy decoding).

Table 7: The results of different models on HumanEval Pro and MBPP Pro. We generate 20 samples for each problem with a random sampling strategy where temperature is set to 0.2 and top_p is set to 0.95.
C Model Information
Table 8: The corresponding API names and HuggingFace model URLs for the models evaluated in Table 2.
D Comparison between HumanEval (Pro), MBPP (Pro) and BigCodeBench-Lite (Pro)

Figure 9: Comparison between the HumanEval family, MBPP family, and BigCodeBench-Lite family. The panels show model performance on HumanEval and HumanEval Pro (0-shot) and the pass@1 (%) distributions for HumanEval, HumanEval Pro, MBPP, MBPP Pro, BigCodeBench-Lite, and BigCodeBench-Lite Pro.
E Discussion about Self-invoking Problems and Solutions
We analyze the complexity comparison between a base problem and its self-invoking counterpart by
examining the line count of their canonical solutions. The line count serves as a proxy for the complexity
of each problem. By comparing the number of lines required to solve the base problem with those
needed for the self-invoking version, we gain insight into how the introduction of self-invocation affects
the overall complexity. Generally, self-invoking problems, which often involve recursion or similar
constructs, may require more lines of code to handle additional logic and edge cases, thereby increasing
the complexity. This comparison helps in understanding the additional computational and conceptual
challenges introduced by self-invocation.
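A small sketch of this line-count measurement; the field names canonical_solution and pro_canonical_solution are hypothetical placeholders for however the two canonical solutions are stored:

def solution_lines(code):
    """Count non-empty, non-comment lines of a canonical solution."""
    return sum(1 for line in code.splitlines()
               if line.strip() and not line.strip().startswith("#"))

# `problems` is a hypothetical list of benchmark entries; the field names
# "canonical_solution" and "pro_canonical_solution" are illustrative only.
def complexity_pairs(problems):
    return [(solution_lines(p["canonical_solution"]),
             solution_lines(p["pro_canonical_solution"]))
            for p in problems]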
Figure 10: Complexity comparison between base problem and self-invoking problem. We use the line count of
the canonical solution for both the base problem and the self-invoking problem as a measure of the problem’s
complexity.
F Prompts
F.1 Prompts for Benchmark Construction
We set the prompt in our benchmark construction as follows:
Prompt of 0-shot: You are an exceptionally intelligent coding assistant that consistently delivers
accurate and reliable responses to user instructions. Write a solution of python file to the following
problems, the solution of the second problem requires single or multiple calls to the first
@@ Instruction
{base problem}
{self-invoking problem}
@@ Response
Prompt of 1-shot: You are an exceptionally intelligent coding assistant that consistently delivers
accurate and reliable responses to user instructions. Write a solution of python file to the following
problems, the solution of the second problem requires single or multiple calls to the first solution
@@ Instruction
{base problem}
{self-invoking problem}
{example}
@@ Response
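A small sketch of how the 0-shot prompt above can be assembled for one benchmark item; the function and variable names are illustrative only:

def build_zero_shot_prompt(instruction_header, base_problem, self_invoking_problem):
    # `instruction_header` stands for the instruction text of the 0-shot prompt
    # quoted above; the two problem arguments are the statements of one
    # benchmark item. All names here are illustrative.
    return (f"{instruction_header}\n"
            f"@@ Instruction\n"
            f"{base_problem}\n"
            f"{self_invoking_problem}\n"
            f"@@ Response\n")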
def all_prefixes_of_strings(strings: List[str]) -> List[List[str]]:
    """Return a list of lists where each sublist contains all prefixes
    of the corresponding string in the input list, sorted from
    shortest to longest. If the input list is empty, return an
    empty list.
    >>> all_prefixes_of_strings(['abc', 'def', 'ghi'])
    [['a', 'ab', 'abc'], ['d', 'de', 'def'], ['g', 'gh', 'ghi']]
    """
    # Invokes all_prefixes(s), the solution to the base problem
    # (its definition is omitted in this excerpt).
    return [all_prefixes(s) for s in strings]
    # ... (signature and docstring opening of the base-problem function
    #      Strongest_Extension omitted)
    Parameters:
    - class_name (str): The name of the class.
    - extensions (List[str]): A list of extension names.

    Returns:
    - str: A string in the format "ClassName.StrongestExtensionName".

    Example:
    >>> Strongest_Extension('my_class', ['AA', 'Be', 'CC'])
    'my_class.AA'
    """
    if not extensions:
        return f"{class_name}.None"
    # ... (remaining lines of Strongest_Extension omitted)

    # ... (signature and docstring opening of the self-invoking function
    #      Strongest_Extensions omitted)
    Parameters:
    - classes_with_extensions (List[Tuple[str, List[str]]]):
      A list where each element is a tuple containing a class name
      and a list of its extensions.

    Returns:
    - List[str]: A list of strings in the format "ClassName.StrongestExtensionName".

    Example:
    >>> Strongest_Extensions([
    ...     ('my_class', ['AA', 'Be', 'CC']),
    ...     ('Slices', ['SErviNGSliCes', 'Cheese', 'StuFfed']),
    ...     ('EmptyClass', [])
    ... ])
    ['my_class.AA', 'Slices.SErviNGSliCes', 'EmptyClass.None']
    """
    result = []
    for class_name, extensions in classes_with_extensions:
        if extensions:
            strongest = Strongest_Extension(class_name, extensions)
        else:
            strongest = f"{class_name}.None"
        result.append(strongest)
    return result
import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt

def task_func(points, seed=0):
    """
    Calculate the Voronoi diagram for a number of points in 2D and plot it.
    Note: this function will raise errors when input is invalid, for
    example wrong type or shape.
    Jittering is applied prior to plotting.

    Parameters:
    - points (np.ndarray): A numpy ndarray of shape (n_points, 2)
      with the coordinates of the points.
    - seed (int): Random seed for reproducibility. Defaults to 0.

    Returns:
    tuple (vor, ax): A tuple containing:
        - vor (Voronoi): A Voronoi object representing the Voronoi
          diagram of the points.
        - ax (Axes): The axes of the plotted Voronoi diagram.
    """
    if points.shape[1] != 2:
        raise ValueError("Input points should have shape (n_points, 2)")
    # ... (lines omitted: jittering, Voronoi computation, and plotting)
    return vor, ax

    # ... (signature and docstring opening of the self-invoking function omitted)
    Parameters:
    - points (np.ndarray): A numpy ndarray of shape (n_points, 2)
      with the coordinates of the points.

    Returns:
    None
    """
    if len(points) < 3:
        raise ValueError("Need at least 3 points to divide into three subsets")
    # ... (lines omitted: splitting the points into three subsets and
    #      overlaying their Voronoi diagrams)
    plt.title("Overlay of Voronoi Diagrams for the Three Subsets")
    plt.show()
Table 9: Error types of different models on HumanEval Pro and MBPP Pro. Columns: Model, Dataset, error counts by type (AssertionError, NameError, ValueError, IndexError, TypeError, OtherError), and All.