Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering?
GUOQING WANG, Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University, China
ZEYU SUN, National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, China
ZHIHAO GONG, Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University, China
SIXIANG YE, Beijing University of Chemical Technology, China
YIZHOU CHEN, Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University, China
YIFAN ZHAO, Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University, China
QINGYUAN LIANG, Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University, China
DAN HAO, Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University, China
Large Language Models (LLMs) have significantly advanced software engineering (SE) tasks, with prompt
engineering techniques enhancing their performance in code-related areas. However, the rapid development of
foundational LLMs such as the non-reasoning model GPT-4o and the reasoning model o1 raises questions about
the continued effectiveness of these prompt engineering techniques. This paper presents an extensive empirical
study that reevaluates various prompt engineering techniques within the context of these advanced LLMs.
Focusing on three representative SE tasks, i.e., code generation, code translation, and code summarization, we
assess whether prompt engineering techniques still yield improvements with advanced models, the actual
effectiveness of reasoning models compared to non-reasoning models, and whether the benefits of using
these advanced models justify their increased costs. Our findings reveal that prompt engineering techniques
developed for earlier LLMs may provide diminished benefits or even hinder performance when applied
to advanced models. In reasoning LLMs, the ability of sophisticated built-in reasoning reduces the impact
of complex prompts, sometimes making simple zero-shot prompting more effective. Furthermore, while
reasoning models outperform non-reasoning models in tasks requiring complex reasoning, they offer minimal
advantages in tasks that do not need reasoning and may incur unnecessary costs. Based on our study, we
provide practical guidance for practitioners on selecting appropriate prompt engineering techniques and
foundational LLMs, considering factors such as task requirements, operational costs, and environmental
impact. Our work contributes to a deeper understanding of effectively harnessing advanced LLMs in SE tasks,
informing future research and application development.
Additional Key Words and Phrases: Large Language Model
ACM Reference Format:
Guoqing Wang, Zeyu Sun, Zhihao Gong, Sixiang Ye, Yizhou Chen, Yifan Zhao, Qingyuan Liang, and Dan Hao.
2024. Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering?.
1, 1 (November 2024), 22 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 Introduction
Large Language Models (LLMs) [5, 8, 11, 12, 29] have achieved remarkable results across various
domains [10, 37, 41, 51, 55], demonstrating human-like intelligence, especially in natural language
processing (NLP) [28, 44, 58]. This success has prompted a growing number of Software Engineering
(SE) researchers to integrate LLMs into solving diverse SE tasks, yielding promising outcomes [10,
13, 19, 20, 25, 43, 52]. Despite these successes, significant challenges persist in effectively using
LLMs for task completion [10, 22, 41, 59].
Harnessing LLMs in SE to maximize their exceptional in-context learning and reasoning capabilities [15, 48, 49, 57] relies on the prompts used, the information provided, and the specific ways in which models are invoked [43, 53, 60]. In response, various strategies have emerged,
often termed “prompt engineering techniques”, which aim to optimize LLM performance beyond
simple model calls. These techniques1 include few-shot prompting [15], Chain-of-Thought (CoT)
prompting [49, 57], critique prompting [21, 27], expert prompting [53], and so on. While these
techniques have proven effective across different software development and maintenance tasks,
such as code generation, code understanding, software testing, and debugging, challenges like
hallucinations and inaccuracies remain [10, 13, 34, 60]. To address these limitations, researchers
have shifted focus to dynamic strategies [13, 16, 34, 56] involving continuous interaction with the
LLMs, task decomposition, and result verification. Building upon these concepts, recent approaches
employ prompt engineering techniques such as multi-agent systems [13, 27], iterative refinement
processes [27, 56, 60], and the integration of additional contextual information to refine the LLM’s
output [56, 60]. By leveraging these prompt engineering techniques, LLMs can deliver more reliable
results in complex SE tasks, paving the way for further advancements in SE research.
With the rapid advancements in large language model training [5, 8, 11, 12, 29], foundational
models are being iterated and updated at an accelerated pace [31–33]. More advanced foundational
models demonstrate improved understanding and generation capabilities. When OpenAI released
the GPT-4o model [31], its performance outperformed that of most prompt engineering techniques
developed for earlier foundational LLMs in coding tasks [2]. The subsequent o1, o1-mini models [32,
33] integrate CoT reasoning, allowing reasoning-based LLMs to autonomously decompose complex
problems into a series of simpler steps, thereby forming effective strategies for tackling intricate
logical issues. However, many prompt engineering techniques for code [13, 15, 53] were developed
based on the capabilities of the earlier model, ChatGPT-3.5 [29], as it was the only option available
at the time. This overlooks the enhancements offered by the more advanced GPT-4o [31] and the
reasoning capabilities of the o1 and o1-mini models [32, 33]. Moreover, OpenAI’s guidance indicates
that using complex prompts is not recommended for reasoning LLMs [32, 33]. Thus, this raises the
first question about ① the effectiveness of these prompt engineering techniques on the more advanced
models. Furthermore, while it is claimed that the reasoning LLMs, i.e., o1 and o1-mini, may provide
1 To avoid confusion, following previous work [43], the term "techniques" in this paper specifically refers to prompt
engineering techniques, while specific approaches based on these techniques will be referred to as “approaches”.
enhanced performance, ② what is its actual effectiveness compared to non-reasoning models, and
what are its respective advantages and disadvantages? Additionally, the reasoning LLMs typically
incur higher operational costs, both in terms of monetary expenditure and time efficiency [1, 32].
In addition to computational and token-based costs, the varying levels of carbon emissions are a
critical consideration. This raises the third question: ③ do the benefits of utilizing these advanced
models justify their increased costs?
To explore these questions, this paper presents the first extensive study aimed at revisiting and
re-evaluating a variety of prompt engineering techniques within the context of the GPT-4o and
o1-mini2 models. For Question ①, we investigate the effectiveness of these prompt engineering
techniques on more advanced models, assessing whether they still yield significant improvements
or if methodological adjustments are required for adaptation to new LLMs. For Question ②,
we examine the actual performance of the reasoning LLM, compared to non-reasoning models,
exploring its advantages and disadvantages, particularly in light of its capability to autonomously
decompose complex problems and form reasoning strategies. For Question ③, we provide practical
guidance to users on selecting appropriate prompt engineering techniques and foundational LLMs
based on our findings, particularly considering whether the benefits of utilizing these advanced
models justify their increased operational costs.
To comprehensively assess the effectiveness of prompt engineering strategies within advanced
LLMs, we deliberately select three representative code-related tasks [10, 25] based on their preva-
lence and significance in the SE field: code generations [13, 16, 60], code translation [3, 34, 56],
and code summarization [4, 15, 21, 43, 48, 53]. These tasks correspond to the common scenarios
of text-to-code, code-to-code, and code-to-text in SE research [26], respectively, and they encom-
pass a wide range of practical applications. To evaluate these approaches, we used three widely
recognized datasets: HumanEval [9] for code generation, CodeTrans [56] for code translation, and
CodeSearchNet [17] of CodeXGLUE [26] benchmark for code summarization. These datasets are
standard benchmarks in the field and provide a solid foundation for comparative analysis. For
each task, we identify and include several state-of-the-art approaches that utilize distinct prompt
engineering techniques, ensuring a diverse and representative evaluation. Specifically, we select
11 approaches [3, 4, 13, 15, 16, 21, 34, 43, 48, 53, 56, 60] that employ techniques such as few-shot
prompting, CoT prompting, critique prompting, and multi-agent collaboration. We replace the
underlying LLMs with the latest foundational models, including both non-reasoning and reasoning
models [31, 33], to evaluate their performance across these SE tasks. This comprehensive assess-
ment aims to provide new insights into the effectiveness of prompt engineering strategies within
advanced LLMs.
From this study, we uncover several notable findings, some of which are summarized as follows:
(1) Effectiveness of Prompt Engineering Techniques: When applied to more advanced
LLMs, the improvements achieved by prompt engineering techniques developed for earlier
LLMs are often less pronounced than those reported in prior studies. In some cases, these
techniques even introduce performance drawbacks. For the reasoning LLM, the LLM’s ability
to self-correct through internal reasoning makes most prompt engineering techniques less
effective than using a simple zero-shot prompt. The formulation of prompts itself is less
critical than effectively leveraging accurate information and addressing errors based on
reliable feedback, such as execution details in software testing.
(2) The actual effectiveness of reasoning models compared to non-reasoning models: For
tasks that need many steps of reasoning, the reasoning LLMs can achieve better performance
than non-reasoning LLMs. For tasks that do not require complex reasoning, performance
2 Due to the high cost of the o1-preview [1, 32], we utilize the o1-mini [33] as a replacement.
differences between reasoning models and non-reasoning models are minimal, and reasoning models may even underperform prompt engineering techniques built on non-reasoning LLMs. In addition, the outputs of reasoning LLMs are longer and less consistently formatted, which can be unnecessary and harder to post-process.
(3) Practical guidance to users on utilizing these advanced models: Given the additional
costs, time efficiency, and potential environmental impacts associated with reasoning mod-
els, non-reasoning models may be a more cost-effective option for tasks where advanced
reasoning capabilities are not essential. When the expected output is short, as in code summarization, non-reasoning LLMs are recommended. Furthermore, when utilizing reasoning LLMs, the expected output format and content should be constrained precisely in the prompt.
In summary, this paper makes the following contributions:
• To the best of our knowledge, this study is the first to empirically evaluate a diverse range of
prompt engineering techniques on more advanced LLMs within code-related tasks, specifi-
cally focusing on code generation, code translation, and code summarization. It includes an
evaluation of the non-reasoning and reasoning LLMs.
• We provide a detailed analysis of the performance of approaches leveraging various prompt
engineering techniques. We find that some prompt engineering techniques lose their effectiveness on advanced LLMs, and that reasoning LLMs cannot always outperform non-reasoning LLMs.
• Based on our findings, we offer insights and implications that can guide the adoption and
future development of LLMs in SE tasks. We also provide practical guidance to users when
selecting prompt engineering techniques and foundational LLMs, considering the monetary
expenditure, time costs, and potential impact on the environment.
problem content, aligning the model’s output with expert-level understanding and improving task
performance [43, 53].
In many code-related tasks, generating correct outputs often requires more than a single model
call, as the initial responses may contain errors [13, 60]. Critique prompting addresses this issue by
prompting LLMs to identify and correct errors in their responses [21]. This iterative process may
involve multiple rounds of corrections or coordination between agents. Furthermore, some tasks
benefit from the inclusion of domain knowledge. For instance, execution information can assist
in fixing bugs in code generation and code translation [16, 34, 56, 60]. Incorporating task-specific
information enhances the accuracy and reliability of LLMs in code-related tasks. Hence, many
recent LLM-based frameworks integrate multiple prompting strategies, such as AgentCoder [16]
and LDB [60] in code generation, and Unitrans [56] and LostinTransl [34] in code translation.
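To make the critique-and-refine pattern concrete, the sketch below shows a minimal generate-critique-revise loop. It is an illustration only, not the implementation of any cited approach; it assumes the openai Python client (v1+) with an API key available in the environment, and the prompt wording and number of rounds are arbitrary choices.

```python
# Minimal sketch of a generate-critique-revise loop (illustration only, not the
# implementation of any cited approach). Assumes the `openai` Python client (v1+)
# and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative choice of foundational model


def chat(messages):
    """One chat-completion call, returning the text of the first choice."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content


def critique_loop(task: str, rounds: int = 2) -> str:
    """Draft an answer, then repeatedly ask the model to find and fix its errors."""
    answer = chat([{"role": "user", "content": task}])
    for _ in range(rounds):
        critique = chat([
            {"role": "user", "content": task},
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "Find any errors in your previous answer."},
        ])
        answer = chat([
            {"role": "user", "content": task},
            {"role": "assistant", "content": answer},
            {"role": "user", "content": f"Revise your answer to address:\n{critique}"},
        ])
    return answer
```

Approaches such as AgentCoder and LDB extend this basic loop with multiple agent roles and real execution feedback rather than relying on the model's self-critique alone.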
3 Study Design
3.1 Research Questions
This study aims to address the following research questions:
RQ1: Are previous prompt engineering techniques effective for more advanced LLMs?
Existing studies on LLM-based code-related tasks in Software Engineering have predominantly
utilized earlier models like ChatGPT-3.5 and GPT-4, which are now outdated and no longer sup-
ported by OpenAI [29]. Recent evaluations [2] have shown that approaches based on these earlier
LLMs sometimes underperform compared to more advanced LLMs that are utilized without specific
prompt engineering techniques. Advanced models, trained with higher-quality data and more ro-
bust training strategies, may exhibit reduced sensitivity to prompt engineering techniques. Notably,
the o1 and o1-mini models incorporate CoT reasoning [32, 33], enabling them to autonomously
decompose complex problems into simpler steps and devise reasoning strategies for solving intri-
cate logical tasks [32]. This raises the question of whether prompt engineering techniques remain
necessary to enhance performance in such advanced models. Thus, this research question aims to
identify which types of prompt engineering techniques, if any, continue to be effective with more
advanced LLMs (including non-reasoning LLMs and reasoning LLMs).
RQ2: What are the advantages and limitations of the reasoning LLMs in SE tasks
compared to previous LLMs? Building on the results from RQ1, this question seeks to explore the
specific conditions under which the reasoning model outperforms non-reasoning LLMs in SE tasks.
Given the diversity of SE tasks and scenarios, it is essential to determine the types of data and tasks
where the reasoning models excel and where they may encounter difficulties. By investigating
these factors, we aim to provide a comprehensive understanding of the reasoning models’ strengths
and weaknesses in the SE domain.
RQ3: How should practitioners and researchers select prompt engineering techniques
and foundational LLMs, considering the monetary expenditure and time costs? OpenAI’s
pricing indicates that reasoning LLMs have significantly higher costs and longer API response times
compared to non-reasoning LLMs [1]. Moreover, due to the model’s autonomous reasoning phase,
which users cannot directly control, accurately estimating the total cost of using reasoning LLMs
becomes challenging. Therefore, this research question aims to evaluate the balance between cost
and effectiveness in SE tasks, providing recommendations on when and how the reasoning LLMs
should be used in practice. We aim to offer practical guidance for practitioners and researchers to
make informed decisions regarding the adoption of advanced LLMs in cost-sensitive contexts.
Table 1. The summary of the prompt engineering techniques utilized by each approach.
agents, forming a team that works together autonomously to tackle code generation tasks, reducing
the need for human intervention.
LDB. LDB is an LLM-based framework that refines generated programs by incorporating runtime
execution information. It segments code into basic blocks and tracks intermediate variable values
after each block during execution. This approach enables the model to focus on smaller units of
code, verify correctness against the task description incrementally, and efficiently identify errors
throughout the code execution flow.
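The idea of exposing intermediate runtime state to the model can be approximated with Python's built-in tracing hooks. The sketch below is only an illustration of this idea with a hypothetical buggy function; LDB itself segments programs into basic blocks rather than tracing individual lines.

```python
# Illustration only: capture intermediate local-variable values while a generated
# function runs, in the spirit of LDB's runtime tracking (the actual tool segments
# code into basic blocks rather than tracing line by line).
import sys


def trace_locals(func, *args):
    """Run func(*args) and record the local variables seen at each executed line."""
    snapshots = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            snapshots.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, snapshots


def buggy_sum_even(xs):          # hypothetical function under debugging
    total = 0
    for x in xs:
        if x % 2 == 1:           # bug: keeps odd numbers instead of even ones
            total += x
    return total


result, trace = trace_locals(buggy_sum_even, [1, 2, 3, 4])
# Feeding `trace` back to the LLM lets it see where `total` diverges from the
# values expected by the task description.
```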
B. Dataset. We conduct our evaluation on the HumanEval benchmark [9], a widely-used dataset
for assessing the code generation capabilities of LLMs. HumanEval consists of 164 programming
problems, each defined by a natural language prompt and paired with a reference solution. The
dataset includes unit tests to verify the correctness of the generated code, making it suitable for
evaluating various code generation approaches based on different prompt engineering techniques.
C. Metric. The primary metric for evaluating model performance is pass@k, which measures the
proportion of problems solved correctly within 𝑘 attempts [9]. This metric emphasizes the model’s
ability to generate syntactically and semantically accurate Python code. For our evaluation, we
specifically use pass@1, which calculates the success rate based on the model’s top-1 prediction,
aligning with recent studies [2, 13, 16, 60]. This metric provides a clear measure of the model’s
precision in generating accurate solutions within a single attempt, reflecting real-world use cases
where efficiency and correctness in the initial generation are critical.
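For reference, pass@k is computed with the unbiased estimator from the HumanEval paper [9]; with a single sample per problem, pass@1 reduces to the fraction of problems whose top-1 completion passes all unit tests. A minimal sketch:

```python
# Unbiased pass@k estimator from the HumanEval paper [9]; with one sample per
# problem, pass@1 is simply the fraction of problems whose generated solution
# passes all unit tests.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: total samples per problem, c: correct samples, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: a problem with 3 samples, 1 of which passes the tests.
print(pass_at_k(n=3, c=1, k=1))  # ~0.333
```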
A. Approaches. In our evaluation of the code translation task, we compare three state-of-the-art
approaches: Summarize-and-Generate (S&G) [3], Unitrans [56], and LostinTransl [34].
Summarize-and-Generate (S&G). Summarize-and-Generate (S&G) was initially proposed by Ahmad et al. [3] as a strategy for enhancing unsupervised program translation. In this paper, we adopt
the S&G paradigm by prompting large language models (LLMs) to first generate summaries of the
source code and subsequently produce the translated code as a kind of CoT prompting. This two-
step process leverages the summarization capabilities of LLMs to capture the semantic essence of
the original program, which we hypothesize may improve the quality of the generated translations.
Unitrans. UniTrans [56] first crafts a series of test cases for target programs with the assistance of
source programs. It then leverages these auto-generated test cases to augment the code translation
process and evaluate their correctness through execution. Subsequently, UniTrans iteratively repairs
incorrectly translated programs based on the outcomes of test case executions.
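The translate-execute-repair pattern underlying UniTrans can be sketched as follows; llm, generate_tests, and run_tests are hypothetical placeholders standing in for the actual implementation, and the prompt wording is illustrative.

```python
# Schematic sketch of the translate / execute / repair pattern described above.
# `llm`, `generate_tests`, and `run_tests` are hypothetical placeholders, not the
# actual UniTrans implementation.
def translate_with_repair(source_java: str, llm, generate_tests, run_tests,
                          max_rounds: int = 3) -> str:
    tests = generate_tests(source_java)          # auto-crafted test cases
    candidate = llm(f"Translate this Java code to Python:\n{source_java}")
    for _ in range(max_rounds):
        failures = run_tests(candidate, tests)   # execute the candidate translation
        if not failures:                         # all tests pass: accept
            return candidate
        candidate = llm(
            "The following Python translation fails some tests.\n"
            f"Code:\n{candidate}\n\nFailures:\n{failures}\n\n"
            "Fix the code so that all tests pass. Return only the corrected code."
        )
    return candidate
```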
LostinTransl. Similar to UniTrans, LostinTransl [34] introduces an iterative repair strategy for code
translation, hypothesizing that incorporating additional contextual information in prompts can
enhance translations generated by LLMs. To achieve this, they include elements such as incorrect
translations, error details, and expected behavior. Unlike UniTrans, LostinTransl assesses translation
accuracy and obtains execution feedback directly from the original dataset’s test cases.
B. Dataset. We use a refined version of the dataset initially released by Roziere et al. [40], which comprises 948 parallel code functions in C++, Java, and Python, sourced from the GeeksforGeeks platform. Yang et al. [56] identified notable errors and inconsistencies within the original dataset and subsequently conducted an extensive data-cleaning process, resulting in a curated version with 568 parallel code samples. To enhance the reliability of our evaluation, we adopt this refined dataset released by Yang et al. [56]. Considering the high computational cost of evaluation, we focus on the
code translation between two popular programming languages, i.e., Java and Python.
C. Metric. Following previous studies [56, 61], we adopt the execution-based metric, Computational Accuracy (CA), for evaluation. CA measures the proportion of successfully passed test cases,
providing a direct measure of the functional correctness of translated programs. We exclude static
metrics such as BLEU [35, 45] and CodeBLEU [38], as they focus on surface-level syntactic similarity
and do not effectively capture the semantic equivalence essential for accurate code translation.
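As a reference point, the sketch below computes CA under the per-program reading commonly used with this benchmark (a translation counts as correct when it passes all of its test cases); passes_all_tests is a hypothetical helper that executes a candidate against its test suite.

```python
# One common realization of CA: a translated program counts as correct when it
# passes all of its test cases; `passes_all_tests` is a hypothetical helper that
# executes a candidate translation against its test suite.
def computational_accuracy(translations, test_suites, passes_all_tests) -> float:
    passed = sum(
        1 for program, tests in zip(translations, test_suites)
        if passes_all_tests(program, tests)
    )
    return passed / len(translations)
```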
3.2.3 Code summarization.
A. Approaches. We compare five widely used prompt engineering techniques in code summariza-
tion: few-shot, Chain-of-Thought, critique, expert, and ASAP [4, 15, 21, 43, 48, 53].
Few-Shot. Few-shot prompting for code summarization was proposed by Gao et al. [15]. The example selection principle and the order of the examples3 remain the same as in their implementation.
Chain-of-Thought. We utilize the CoT steps in code summarization as previous work does [43,
48]. The technique first asks LLMs to answer five questions (i.e., the name of the function, the
input parameters that are being accepted by the function, the expected output or return value of
the function, the specific requirements or constraints for using this function, and the additional
dependencies or external requirements). Based on the response, the technique integrates the above
information and asks LLMs to generate comments.
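A minimal sketch of this two-step flow is shown below, assuming the openai Python client (v1+); the wording paraphrases the five questions and is not the exact prompt of the cited work.

```python
# Sketch of the two-step CoT prompt for code summarization (ask five questions,
# then ask for the comment). Assumes the `openai` Python client (v1+); the prompt
# wording paraphrases the cited work.
from openai import OpenAI

client = OpenAI()

QUESTIONS = (
    "For the function below, answer: (1) its name; (2) the input parameters it "
    "accepts; (3) its expected output or return value; (4) any specific "
    "requirements or constraints for using it; (5) any additional dependencies "
    "or external requirements.\n\n{code}"
)


def cot_summarize(code: str, model: str = "gpt-4o") -> str:
    analysis = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTIONS.format(code=code)}],
    ).choices[0].message.content
    prompt = (f"{analysis}\n\nBased on the information above, generate a short "
              f"one-sentence comment for this function:\n{code}")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```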
Critique. The critique prompting [21] improves the quality of LLMs’ answers by asking LLMs to
find errors in the previous answers and correct them.
Expert. The expert prompting [43, 53] first asks LLMs to generate a description of an expert who
can complete the instruction, and then the description serves as the system prompt for zero-shot
prompting. To generate the description of an expert, the prompt engineering technique employs
few-shot prompting to let LLMs generate a description of an expert who can “Generate a short
comment in one sentence for a function”.
ASAP. ASAP [4] leverages multiple prompt engineering techniques. It employs few-shot prompt-
ing to identify relevant examples based on BM25 [39]. Subsequently, it extracts semantic features
for each code sample (including the target function and the retrieved exemplars), including the
repository name, the fully qualified name of the target function, its signature, the Abstract Syntax
Tree (AST) tags of its identifiers, and its data flow graph to enhance the generation process.
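The retrieval step can be sketched with an off-the-shelf BM25 implementation (here the rank_bm25 package, our own choice for illustration); ASAP's semantic-feature extraction is omitted from the sketch.

```python
# Sketch of BM25-based exemplar retrieval for ASAP's few-shot step. The
# `rank_bm25` package is our own choice for illustration; ASAP additionally
# attaches semantic features (repository name, signature, AST tags, data flow)
# to each retrieved exemplar, which is omitted here.
from rank_bm25 import BM25Okapi


def retrieve_exemplars(target_code: str, corpus: list[str], n: int = 3) -> list[str]:
    tokenized_corpus = [doc.split() for doc in corpus]   # naive whitespace tokens
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(target_code.split(), corpus, n=n)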
B. Dataset. We perform our evaluation on the CodeSearchNet dataset [17] of CodeXGLUE [26],
a widely-used benchmark in many code-related tasks [10, 24, 25, 62], including code summariza-
tion [4, 15, 43]. The code summarization dataset encompasses six programming languages and
contains a test set of approximately 53,000 samples. Similar to code translation, to mitigate the high
computational cost of evaluation, we focus on two popular programming languages, i.e., Java and
Python, and randomly select 250 samples from each language for our experiments.
C. Metric. Previous research [4, 14, 43, 47] has shown that common automated evaluation methods
for code summarization, such as those based on summary-summary text similarity or semantic
similarity, often lack consistency with human evaluations. In contrast, GPT-based evaluation
methods have demonstrated a stronger correlation with human judgments. As a result, we adopt the
GPT-based evaluation method, following the approach of Sun et al. [43], which provides evaluation
code and metrics that better align with human assessments. The LLM rates each summary on a scale from 1 to 5, where a higher score indicates higher summary quality. This method allows us to more
accurately measure the quality of LLM-generated summaries by comparing them to human-like
reference summaries, ensuring a more reliable evaluation of model performance.
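A minimal sketch of such a GPT-based rating call is shown below; the rubric wording is paraphrased and the judge model is illustrative, not the exact evaluation setup released by Sun et al. [43].

```python
# Sketch of a GPT-based 1-5 rating; the rubric wording is paraphrased and the
# judge model is illustrative, not the exact setup of Sun et al. [43].
from openai import OpenAI

client = OpenAI()


def rate_summary(code: str, summary: str, model: str = "gpt-4o") -> int:
    prompt = (
        "Rate the quality of the following comment for the given function on a "
        "scale from 1 (poor) to 5 (excellent). Reply with a single integer.\n\n"
        f"Function:\n{code}\n\nComment:\n{summary}"
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return int(reply.strip())
```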
3 The number of examples is set to 4, following their findings.
Table 2. Pass@1 (%) of each approach on HumanEval (code generation) with GPT-4o and o1-mini.

Approach                       GPT-4o   o1-mini
Zero-shot                       90.4     93.9
CoT                             91.5     94.1
AgentCoder                      96.3     95.1
Self-collaboration              90.9     95.7
LDB                             94.5     96.3
AgentCoder-no-iter              87.8     89.6
Self-collaboration-no-iter      90.2     94.1
LDB-no-iter                     91.5     94.4
Finally, to assess the fundamental capabilities of LLMs, we use zero-shot prompting as a baseline
for all three tasks, which utilizes models without any specific prompting strategies. The zero-shot
prompt for code generation is defined as the description in HumanEval [9], which is the same as
the previous work’s setting [13, 16, 60]. For code translation, the zero-shot prompt is the same as
the prompt provided in the previous evaluation of Transcoder by Yang et al. [56]. The zero-shot
prompt for code summarization is provided by Sun et al. [43].
4 Results
In this section, we present the evaluation results to answer each RQ. We implement the selected
approaches based on the released code of their reproducible packages and only change the fun-
damental LLM to the more advanced non-reasoning and reasoning models. We select GPT-4o as the representative of non-reasoning models because of its excellent performance and affordable cost [1, 31]. For reasoning models, due to the high cost of o1-preview [1, 32], we utilize o1-mini [33] as a replacement. To account for the randomness of LLM responses, we run each approach three times and report the average results for each experiment.
Table 3. Computational Accuracy (CA) of each approach in code translation with GPT-4o and o1-mini.

                          Java2Python            Python2Java
Approach                GPT-4o   o1-mini       GPT-4o   o1-mini
zero-shot                0.947    0.943         0        0.085
1-shot                   0.944    0.954         0.776    0.803
S&G                      0.941    0.943         0.660    0.099
Unitrans                 0.949    0.908         0.817    0.682
LostinTransl             0.971    0.978         0.783    0.806
Unitrans-no-iter         0.941    0.900         0.783    0.548
LostinTransl-no-iter     0.942    0.956         0.782    0.799
Table 4. GPT-based evaluation scores (1-5) of each approach in code summarization with GPT-4o and o1-mini.

                   Java                  Python
Approach       GPT-4o   o1-mini       GPT-4o   o1-mini
Zero-shot       4.14     4.23          3.71     4.12
Few-shot        4.19     4.15          3.89     4.00
CoT             4.26     3.98          4.31     4.09
Critique        4.42     3.76          4.46     3.71
Expert          4.44     3.98          4.26     4.04
ASAP            4.34     4.11          4.41     4.08
4.1 RQ1: the effectiveness of previous prompt engineering techniques on more advanced LLMs
As shown in Table 2, for the non-reasoning model GPT-4o, CoT prompting achieves a pass@1 rate of 91.5%, showing only a modest improvement over the zero-shot baseline of 90.4%. This indicates that CoT prompting may not be essential for code generation when using more advanced LLMs. Similarly, in the code translation task, the performance of
S&G and Unitrans remains similar to that of the simple prompt when translating from Java to
Python. When translating from Python to Java, S&G performs even worse than the simple prompt,
highlighting that advanced non-reasoning LLMs like GPT-4o can directly understand and translate
code to the target language without extensive guidance. However, iterative and collaborative
prompting methods like Unitrans and LostinTransl outperform the simple prompt, indicating that
such strategies can still benefit code translation in non-reasoning LLMs. In code summarization, all
prompt engineering techniques outperform zero-shot prompting. Among these, critique prompting
appears to be the most effective technique for GPT-4o, achieving scores of 4.42 and 4.46 for Java
and Python, respectively.
For o1-mini, which has a built-in CoT strategy, the zero-shot baseline already achieves a high
success rate of 93.9% in code generation. Given that the CoT prompting performance is nearly
identical to the zero-shot baseline, we conduct a Wilcoxon signed-rank test [50]. The resulting
p-value exceeds 0.10, indicating no significant difference between the two results. However, other
selected approaches, i.e., AgentCoder, Self-collaboration, and LDB, can still improve performance,
although the extent of improvement becomes smaller compared to GPT-4o. We suspect that the
included execution information contributes to the enhancement of the performance. This result
suggests that providing explicit CoT prompts is unnecessary for o1-mini, as its built-in CoT
handles reasoning effectively.
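For reproducibility, the significance check mentioned above can be sketched with SciPy's Wilcoxon signed-rank test over paired per-problem outcomes; the arrays below are placeholders, not our measured results.

```python
# Sketch of the significance check: a Wilcoxon signed-rank test over paired
# per-problem outcomes (1 = solved, 0 = not solved). The arrays are placeholders,
# not our measurements; identical pairs are discarded by SciPy's default
# zero-handling.
from scipy.stats import wilcoxon

zero_shot = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
cot       = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

stat, p_value = wilcoxon(zero_shot, cot)
print(f"p = {p_value:.3f}")   # p > 0.10 would indicate no significant difference
```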
Moreover, in the code translation task, prompt engineering techniques have a limited impact
when utilizing o1-mini. Except for LostinTransl4, techniques like S&G and Unitrans negatively
4 Although LostinTransl and Unitrans are both iterative and collaborative approaches, LostinTransl iterates based on the
execution feedback from the evaluated dataset’s test cases.
affect performance compared to simple prompts. An even more pronounced trend is observed
in code summarization, where the basic zero-shot prompt not only competes with but typically
exceeds the results of more complex prompt engineering strategies when using o1-mini. This
indicates that in some contexts, especially in tasks utilizing reasoning LLMs like o1-mini,
simpler approaches may be more effective than more intricate prompt engineering techniques.
We also notice that while o1-mini demonstrates better performance in zero-shot scenarios
compared to GPT-4o, when enhanced with prompt engineering techniques, GPT-4o often
exceeds o1-mini’s performance. For example, with the AgentCoder technique, GPT-4o reaches
a pass@1 rate of 96.3% (the highest pass@1 in code generation), whereas o1-mini achieves 95.1%.
This trend is even more pronounced in tasks such as Java code summarization, where several
approaches on GPT-4o, including CoT, Critique, Expert, and ASAP, outperform all configurations
on o1-mini. This suggests that the sophisticated built-in reasoning capabilities of o1-mini yield
diminishing returns when further enhanced with prompt engineering, whereas GPT-4o, with a
lower baseline, benefits more significantly from the same techniques, often outperforming o1-mini
post-enhancement.
Finding. The results demonstrate that while prompt engineering can still enhance the per-
formance of non-reasoning models like GPT-4o, the benefits are significantly reduced. For
example, in code generation, GPT-4o’s pass@1 rate increases modestly from 90.4% (zero-shot)
to 96.3% with AgentCoder. In contrast, the reasoning model o1-mini achieves a high pass@1
rate of 93.9% with zero-shot prompting, and prompt engineering techniques offer minimal or no
improvement. This suggests that for reasoning models like o1-mini, prompt engineering may
have diminishing returns or even negative impacts. Additionally, tasks like code summarization
do not significantly benefit from using reasoning models.
In the results of code generation and code translation, the approaches that can outperform zero-
shot prompting on o1-mini are AgentCoder, Self-collaboration, LDB, and LostinTransl. We observe
that although the formulation of prompts in the approaches are different from each other, they all
utilize multi-iteration based on the execution information of test cases. Hence, we raise a question as
to whether the execution information contributes to the enhancement of the performance compared
to zero-shot prompting. To verify our hypothesis, we remove the feedback of the test execution
information and the iteration phase in code generation and code translation5. The results are listed in Tables 2 and 3, denoted as "<Approach>-no-iter".
As shown in these tables, for the non-reasoning model, GPT-4o, when removing the feedback
of the test execution information and the iteration phase, the performance of each approach is
similar to or lower than the performance of zero-shot prompting. This indicates that the useful part of each approach's prompt is the test execution information and the fix phase during the iteration, rather than the formulation of the prompts.
For o1-mini, when removing the feedback of the test execution information and the iteration
phase, the performance of each approach is also similar to or lower than that of simple prompting, such as zero-shot and 1-shot prompting. This shows that even without complex prompt descriptions, LLMs with built-in CoT can understand the problem and reason out the correct solution by themselves.
In code translation, we observe a significant performance decline for Unitrans when using o1-
mini. Specifically, the performance drops from 0.954 to 0.900 when translating from Java to Python
and from 0.682 to 0.548 for Python-to-Java. Unitrans initially generates a series of test cases for the
target program and uses these cases to enhance code translation. However, after a manual check,
5 There is no approach in code summarization utilizing execution information.
we find that this approach can be error-prone, as the test generation phase may not accurately
represent the original code, leading to incomplete or flawed test cases. This issue is amplified when
correctness is not verified and errors are corrected only during execution, potentially causing
o1-mini to reason inaccurately about the target code, resulting in incorrect translations.
A similar challenge arises in code summarization. ASAP, which aims to improve summarization
by extracting semantic features from each code sample, performs worse with o1-mini than with
simple zero-shot prompting. This suggests that, without execution feedback that can reflect the
ground truth, the supplemental information extracted from the input may not enhance performance
and can even hinder it. For advanced LLMs like o1-mini, which possess sophisticated internal
reasoning mechanisms, such information may be unnecessary for code summarization. Inputting
all information simultaneously, without considering its relevance, can disrupt the internal Chain-
of-Thought logic, leading to degraded performance.
Finding. Our results show that the specific wording of prompts has minimal impact on ad-
vanced models like GPT-4o and o1-mini. Performance gains are primarily due to real execution
feedback used during iteration. For example, in code generation, removing execution feedback
in AgentCoder reduces GPT-4o’s pass@1 rate from 96.3% to 87.8%. Conversely, providing in-
accurate information without actual execution feedback can mislead reasoning models and
degrade performance.
4.2 RQ2: the advantages and limitations of the reasoning LLMs in SE tasks
To address RQ2, we delve deeper into the performance of the reasoning LLMs. We aim to explain its
effectiveness and highlight differences compared to user-designed prompt engineering techniques.
Additionally, recognizing that o1-mini does not perform optimally across all code-related tasks, we
categorize the types of cases where the reasoning model tends to fail and summarize the underlying
reasons for these failures. This analysis helps us identify the model’s limitations and provides
insights into areas where prompt engineering or model refinement may be necessary.
Advantages. We find that for problems involving multiple steps of reasoning, LLMs with
built-in CoT generally outperform non-reasoning LLMs. To further verify our finding, we
extract the problem-solving steps that o1-mini displays on OpenAI's Chat Website [30] as a proxy for its CoT reasoning steps6. Specifically, for each task, we randomly choose 100 samples, 300 samples in total. The average number of CoT steps is 3.52 in code generation, 4.35 in code translation, and 1.38 in code summarization. Hence, we suspect that code generation and code translation require more reasoning steps from LLMs than code summarization. We then inspect how the effectiveness of GPT-4o and o1-mini differs with different CoT lengths. We filter the problems for which the number of o1-mini's CoT steps is greater than or equal to 5. For these problems, the performance of o1-mini is 16.67% better than that of GPT-4o. For the problems where the number of o1-mini's CoT steps is smaller than 5, o1-mini is only 2.89% better than GPT-4o. This indicates that non-reasoning models like GPT-4o struggle to conduct complex multi-step reasoning, which explains why o1-mini performs better than GPT-4o in code generation.
An example, HumanEval/129, illustrates the difference between GPT-4o and o1-mini. The
problem asks to find a path of length 𝑘 in an 𝑁 × 𝑁 grid where 𝑁 ≥ 2 (see Fig. 1 for details). For
the problem, o1-mini solves the problem using a dynamic programming (DP) algorithm, effectively
considering time complexity and the size of the search space through 10 reasoning steps in its CoT.
In contrast, GPT-4o employs a depth-first search (DFS) algorithm that ultimately results in a timeout
6 Since the details of o1-mini's CoT cannot be obtained, we can only use the abbreviated CoT displayed on the website as the basis of our statistics.
due to inefficiency. Even when advanced prompt engineering techniques like Self-collaboration and
LDB are applied to GPT-4o, the solution still times out. This example demonstrates that reasoning
LLMs like o1-mini can provide proper and efficient solutions based on the problem’s difficulty and
features—capabilities not present in GPT-4o without built-in CoT.
Another example, HumanEval/132 in Fig. 2, demonstrates the influence of prompt engineering
techniques on reasoning LLMs. The task requires writing a function is_nested that takes a string
of square brackets ([]) and determines whether it contains a valid nested pair. Only o1-mini with
zero-shot prompting generates the correct answer. It takes o1-mini 32 reasoning steps over 35.17
seconds, during which its understanding deepens as it repeatedly verifies and refines its solution.
If it identifies issues, it regenerates to correct them. However, when using prompt engineering
techniques like AgentCoder and Self-collaboration, o1-mini is hindered by the complex prompts,
leading to implementations that overlook certain special cases.
Analyzing the reasons behind these results, we find that o1-mini’s built-in CoT capability
allows it to break down complex problems into manageable steps and correct errors
iteratively, particularly in scenarios with longer CoT sequences. This adaptability enables the model
to match the depth of its reasoning to the complexity of the task. For problems that require extensive
reasoning, o1-mini leverages its CoT more effectively than models without such capabilities,
optimizing the reasoning process without overcomplicating simpler problems. This adaptability
and built-in error correction make it less reliant on external prompt engineering to achieve high
performance, especially in tasks that demand a deeper understanding and solution exploration.
Finding. Our analysis shows that for problems requiring multi-step reasoning, specifically
when the CoT length is 5 steps or more, reasoning LLMs like o1-mini outperform non-reasoning
LLMs by an average of 16.67%. Otherwise, the performance advantage decreases to 2.89%.
Further analysis suggests that this is probably because the built-in CoT capability allows the model to break down complex problems into manageable steps and correct errors iteratively.
Limitations. We further examine the limitations found in reasoning LLMs, specifically focusing
on o1-mini. Despite its advanced capabilities, we find that o1-mini often exhibits excessive
divergent thinking, leading to overextended reasoning in straightforward tasks. In our analysis,
where the CoT length is less than three steps, we observe that in 24% of cases where o1-mini
underperformed compared to GPT-4o, the issue stems from unnecessary and expansive reasoning.
This tendency to consider irrelevant factors complicates the model’s decision-making process and
negatively impacts its effectiveness in simpler tasks.
Additionally, o1-mini is less structured in its reply formats. Across multiple randomized
experiments, we identify a significant issue: with HumanEval’s default prompt, o1-mini occasionally
regenerates code segments that should remain unmodified. Specifically, nearly 40% of o1-mini’s
incorrect answers under zero-shot prompting are deemed incorrect because their output formats
cannot be processed by standard post-processing tools used in previous evaluations, compared
to 0% for GPT-4o. This indicates that o1-mini may modify parts of the input that are meant to be
fixed, a problem not encountered with GPT-4o.
In code summarization tasks, when provided with the same prompts, o1-mini’s responses consis-
tently contain reasoning descriptions rather than directly providing concise answers, as GPT-4o
does. This necessitates additional post-processing to extract the relevant comment parts from the
given code, adding extra steps to the workflow.
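In practice, this extra step can be a small heuristic post-processor such as the sketch below; it is our own illustration, not the tool used in our evaluation, and the patterns it matches are assumptions about typical reply shapes.

```python
# Our own heuristic sketch of the extra post-processing step: strip the reasoning
# prose that o1-mini tends to prepend and keep only the final comment. This is an
# illustration, not the post-processing tool used in our evaluation.
import re


def extract_comment(response: str) -> str:
    # Prefer an explicitly labeled comment if the model produced one.
    match = re.search(r"(?:Comment|Summary)\s*:\s*(.+)", response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    # Otherwise fall back to the last non-empty line of the reply.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```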
Furthermore, when provided with human-written prompts, reasoning models like o1-mini
may need additional time to interpret these instructions, potentially causing errors. This
mismatch between the model’s internal reasoning process and external prompts can diminish
the effectiveness of human-written prompts or even introduce negative impacts on performance.
We suspect that this difference in thinking and understanding results in human-crafted prompts
having limited or adverse effects on reasoning models. The autonomous ability of these models to
think, correct, and rethink allows them to handle complex problems more effectively without the
need for intricate prompts, highlighting the diminishing returns of traditional prompt engineering
techniques when applied to advanced reasoning LLMs like o1-mini.
Table 5. The cost of each approach in code generation: message tokens, reasoning tokens, and time cost (seconds).

                          GPT-4o                                        o1-mini
Approach             Message Token  Reasoning Token  Time Cost (s)   Message Token  Reasoning Token  Time Cost (s)
Zero-shot               369.21           0               6.55           919.71          636.10           9.62
CoT                     400.75           0              15.83          1269.16          539.71          13.19
AgentCoder              615.01           0               9.98          1747.54         1161.76          25.36
Self-collaboration      732.85           0              13.91          1858.53          890.54          17.57
LDB                     702.87           0              19.39          2625.74         1824.27          39.12
Table 6. The cost of each approach in code translation: message tokens, reasoning tokens, and time cost (seconds).

                                          GPT-4o                                        o1-mini
Direction      Approach     Message Token  Reasoning Token  Time Cost (s)   Message Token  Reasoning Token  Time Cost (s)
Java2Python    zero-shot       108.00           0               4.35           122.35          401.93           6.49
               1-shot           98.14           0               3.61           111.78          631.59           9.05
               S&G             259.04           0               9.32           294.94          669.38           8.31
               Unitrans        572.64           0              15.12           484.84         8835.72          78.50
               LostinTransl    639.79           0              18.53           673.85         5138.89          90.72
Python2Java    zero-shot       192.37           0               5.94           188.49          477.74           8.85
               1-shot          123.35           0               3.63           140.93          672.00           9.88
               S&G             346.52           0               7.73           418.05          708.38          10.80
               Unitrans        694.85           0              12.09           994.69         9507.75         174.92
               LostinTransl    681.40           0              14.41           698.93         3829.95          74.53
4.3 RQ3: the recommendations on how to select prompt engineering techniques and
foundational LLMs
In RQ1, we revisit the previous state-of-the-art prompt engineering techniques utilized in code-
related tasks. The results show the effectiveness of each prompt engineering technique in different
foundational LLMs (i.e., GPT-4o and o1-mini). In RQ2, we further investigate the advantages and
limitations of the reasoning model in code-related tasks compared to non-reasoning LLMs. Hence,
in RQ3, we want to offer practical guidance on how to select prompt engineering techniques and
foundational LLMs when taking the monetary and time costs into account.
4.3.1 The cost of prompt engineering techniques when utilizing different foundational LLMs. We
first analyze the computational and token-based costs of different approaches for code generation,
code translation, and code summarization, comparing them across two selected LLMs (i.e., GPT-4o
and o1-mini). We focus on three key metrics: message tokens, reasoning tokens, and time cost
(in seconds). Given the uncontrollable nature of reasoning time and CoT processes in reasoning
LLMs, we measure the cost by calculating the call time and token usage of the response phase and
reasoning phase separately. The evaluations are performed in a single-threaded environment on
the same machine to ensure consistency.
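A per-call measurement can be sketched as follows, assuming the openai Python client; reasoning tokens are read from the completion_tokens_details usage field reported for reasoning models, and message tokens are approximated as completion tokens minus reasoning tokens. Both the field name and this accounting are assumptions that may differ from our exact tooling.

```python
# Sketch of per-call cost measurement: wall-clock time plus token usage as
# reported by the API. The `completion_tokens_details.reasoning_tokens` field and
# the message-token accounting are assumptions about the reporting format, not a
# description of our exact tooling.
import time
from openai import OpenAI

client = OpenAI()


def timed_call(model: str, prompt: str):
    start = time.time()
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    elapsed = time.time() - start
    usage = resp.usage
    details = getattr(usage, "completion_tokens_details", None)
    reasoning_tokens = getattr(details, "reasoning_tokens", 0) or 0
    message_tokens = usage.completion_tokens - reasoning_tokens
    return message_tokens, reasoning_tokens, elapsed
```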
As shown in Tables 5, 6, and 7, when leveraging the same prompt engineering technique, GPT-4o costs fewer tokens and less time than o1-mini. When utilizing the same foundational LLM, simple prompting strategies, such as zero-shot and 1-shot prompting, cost the least among all prompt engineering techniques. Techniques with multiple iterations incur higher computational and token-based costs than techniques that issue a single model call.
Token usage. For o1-mini, responses are usually longer than GPT-4o's responses
even without considering reasoning tokens, as mentioned in RQ2. When considering reasoning
tokens, the cost of o1-mini increases rapidly. For Unitrans and LostinTransl, which contain multiple
Table 7. The cost of each approach in code summarization: message tokens, reasoning tokens, and time cost (seconds).

                                       GPT-4o                                        o1-mini
Language   Approach     Message Token  Reasoning Token  Time Cost (s)   Message Token  Reasoning Token  Time Cost (s)
Java       Zero-shot        24.96           0               1.27            93.42          209.66           3.06
           Few-shot         19.65           0               1.33            51.72          302.59           3.96
           CoT             405.44           0               6.80          1041.77          582.91          11.41
           Critique        193.05           0               5.63           393.36         1248.51          11.36
           Expert           26.58           0               1.33            94.80          215.81           4.63
           ASAP             34.28           0               1.31           267.78         1204.48           8.65
Python     Zero-shot        28.40           0               1.32           118.71          240.38           2.92
           Few-shot         16.56           0               1.38            44.69          344.58          10.81
           CoT             419.72           0               8.71          1166.48          646.66          13.23
           Critique        213.94           0               8.11           449.15         1385.22          18.95
           Expert           24.76           0               1.35           112.44          244.99           5.15
           ASAP             29.43           0               1.40           225.65          876.29          10.90
iterations, the number of reasoning tokens is 5.5 to 18.2 times the number of message tokens. The fact that more complex prompt engineering techniques based on o1-mini tend to generate a larger volume of tokens suggests a correlation between prompt complexity and token generation. Furthermore, such lengthy responses often add little value. For example, in code summarization, the average number of comment tokens for Java in CodeSearchNet [17] is 36.54, whereas the number of message tokens produced by o1-mini is much larger, often exceeding 1,000 when utilizing CoT prompting. This can lead to
significant redundancies, which may interfere with processing and user understanding.
Time cost. When examining time cost, zero-shot prompting remains the most efficient, requiring
the least amount of time. In comparison to approaches leveraging GPT-4o, approaches leveraging
o1-mini result in substantially increased time costs. For instance, in code translation, LostinTransl’s
time cost with GPT-4o is 2.4 to 4.3 times that of zero-shot prompting. However, when
o1-mini is used, LostinTransl’s time cost escalates dramatically, ranging from 8.4 to 14.0 times that
of zero-shot prompting. A similar trend is observed with Unitrans in Python-to-Java translation,
where the time cost relative to zero-shot prompting increases from 2.0 to 19.8 times, without any
corresponding improvement in performance. These observations suggest that advanced prompt
engineering techniques may inadvertently extend the logical reasoning process of LLMs with
built-in CoT, leading to longer reasoning times. This extended reasoning can result in significantly
higher computational costs, potentially outweighing the intended benefits of prompt complexity.
It is also noteworthy that when considering the performance of GPT-4o and o1-mini (as detailed
in RQ1), the increased computational overhead—whether in token cost or time—does
not necessarily translate into better performance. This observation indicates that more
complex prompts may inadvertently extend the reasoning process, leading to higher costs without
corresponding benefits. For example, in the code translation task, when using the S&G approach with GPT-4o and o1-mini, the performance actually decreases as the cost rises.
Finding. GPT-4o is more efficient than o1-mini in both token usage and processing time. For
example, in code translation, using LostinTransl raises GPT-4o's time cost to up to 4.3 times that of zero-shot prompting, while for o1-mini the time cost rises to up to 14 times. Additionally, o1-mini's reasoning tokens can be up to 18 times the message tokens with complex techniques like Unitrans. Moreover, this increased computational overhead does not necessarily translate
into better performance, indicating that more complex prompts may inadvertently prolong
reasoning, leading to greater costs without proportional benefits.
4.3.2 The suggestions of selecting prompt engineering techniques and foundational LLMs. We provide
practical guidance for both practitioners and researchers to make informed decisions regarding the
adoption of prompt engineering techniques and advanced LLMs when taking cost and effectiveness
into account. In addition to computational and token-based costs, the carbon footprint of different
prompt engineering techniques and foundational LLMs is a critical consideration. Each LLM call
involves significant computational processes that consume electrical energy. Depending on the
source of the electricity, this consumption can lead to varying levels of carbon emissions. Generally,
higher token usage, longer reasoning times, and increased call duration translate to greater energy
consumption [23, 36, 54], which have significant environmental implications [6, 18].
Foundational LLMs. Based on the findings from RQ2, we suggest selecting LLMs based on the
complexity of tasks. Task complexity can be preliminarily assessed by checking the number of CoT steps displayed on the OpenAI chat website. To assess task complexity accurately, we recommend sampling a few
representative data points directly from the task itself.
For tasks that do not require complex reasoning where the CoT length is less than 3 steps,
we recommend using GPT-4o due to its relatively lower carbon footprint. Furthermore, in
generating natural language for SE tasks, GPT-4o, combined with appropriate prompt techniques,
can outperform reasoning LLMs with built-in CoT capabilities7 . Additionally, users can select
foundational LLMs based on the expected output length. For tasks with concise expected outputs,
non-reasoning LLMs are more efficient choices.
For more complex tasks, where the number of o1-mini's CoT steps is greater than or equal to 5 and its deeper reasoning capabilities are necessary, we recommend utilizing LLMs
with built-in CoT (e.g., o1-mini). When utilizing LLMs with built-in CoT, we encourage the
use of simple prompting techniques, such as zero-shot and one-shot, which can reduce energy
consumption without substantially compromising performance. When the extra information can
be obtained, utilizing LLMs with built-in CoT can further enhance the performance through a
more comprehensive analysis than GPT-4o. Furthermore, when utilizing reasoning LLMs, users
are recommended to write prompts with restrictions to ensure that reasoning LLMs maintain the
original content of the input and avoid generating undesirable outputs.
Prompt Engineering Techniques. For advanced LLMs, the influence of the prompt format itself is small. Prompt engineering techniques that include more information tend to yield better performance. When generating code in SE tasks, techniques that incorporate execution feedback are more likely to produce correct code. When generating text in SE tasks, prompt engineering techniques with iteration and more extracted information can achieve better performance on GPT-4o, whereas we do not currently recommend applying prompt engineering techniques to LLMs with built-in CoT (e.g., o1-mini). When LLMs are not familiar with the expected output format, an example that guides the format is necessary for prompt engineering techniques.
Finding. For the selection of foundational LLMs in complex tasks requiring deep reasoning, the
built-in CoT capabilities of reasoning LLMs can enhance performance. However, it is advisable
to pair these capabilities with simpler prompting strategies, such as zero-shot or one-shot, to
minimize energy consumption and carbon emissions. Regarding the selection of prompt engineering techniques, those
that incorporate feedback or additional information can significantly improve code generation
and text outputs. In other cases, we recommend opting for GPT-4o, since the prompt engineering
techniques do not yield substantial benefits on reasoning LLMs.
5 Discussion
In light of our findings on the reasoning capabilities of reasoning models, future work could
explore several research directions to further harness and refine the strengths of reasoning LLMs
in code-related and other complex tasks in SE.
Optimizing prompt strategies for reasoning LLMs is an area ripe for exploration. Previous prompt
engineering techniques designed for non-reasoning LLMs may not fully utilize the autonomous
thinking and error-correction abilities of reasoning LLMs. Thus, devising adaptive prompt en-
gineering techniques that integrate these capabilities could unlock new levels of performance.
Researchers may experiment with minimalistic or constraint-based prompts that allow reasoning
models to leverage their CoT while maintaining focus and avoiding extraneous reasoning steps.
Another promising avenue is the dynamic control of CoT length. While o1 can adjust its reasoning
steps according to problem complexity, proper guidance could lead to more precise outputs that
balance detail and efficiency. Researchers could investigate adaptive mechanisms that limit or
expand CoT length based on predefined task characteristics, ensuring that reasoning LLMs do not
overcomplicate simple tasks or underperform on challenging ones. By controlling the length and
depth of reasoning, it may be possible to reduce computational costs while retaining high accuracy.
Additionally, ensuring o1’s outputs without unnecessary deviation or verbosity presents another
opportunity for improvement in specific tasks. Techniques that align o1’s output to task-specific
requirements could help reduce the model’s tendency toward excessive responses. This could be
achieved by developing new prompt engineering techniques that guide the model’s output to
remain concise and relevant, especially in tasks like code summarization [3, 43] or commit message
generation [46, 47] where preciseness and conciseness are both critical.
Furthermore, it would be beneficial to explore methods for making reasoning models more
cost-effective and environmentally sustainable. This could involve reducing token usage in CoT or
implementing methods for batch-processing similar reasoning paths to minimize redundant compu-
tations. As reasoning models continue to evolve, considering the trade-offs between performance,
cost, and environmental impact will be essential to their responsible deployment.
Overall, these future research directions, i.e., optimizing prompt strategies to enhance reasoning
LLMs, controlling reasoning depth, and aligning output to task requirements, could lead to a more
powerful and efficient use of reasoning LLMs in diverse SE applications.
6 Threats to Validity
Threats to internal validity mainly lie in implementing selected approaches of our experiments.
To mitigate this threat, we directly use the released code of these approaches [13, 16, 43, 56, 60].
Another threat may arise from the randomness of LLMs' responses. To alleviate this threat, we
run each approach three times and present the average results for each experiment. Additionally,
the variations in the tuning parameters of these models may also affect their performance, which we
controlled by adhering closely to the parameter settings recommended in their respective studies.
Threats to external validity lie in the datasets we use, which may influence the generalization of
our findings. To mitigate this risk, we employ HumanEval [9], Transcoder [56], and CodeSearchNet [17,
26], all of which are widely used datasets for their respective tasks. However, due to the high cost of
utilizing LLMs, our evaluation only involves three code-related tasks, i.e., code generation, translation,
and summarization, and two programming languages, i.e., Java and Python. The generalization of
our findings to other tasks and programming languages therefore remains uncertain. To partially
address this issue, we have conducted preliminary tests on related tasks not detailed in this paper,
which yield similar results that support the conclusions drawn in this study. However, these results
are not comprehensive enough to assert broad applicability. Future work will therefore involve more
extensive evaluations across various tasks and programming languages to thoroughly assess the
generalizability of our findings.
Threats to construct validity lie in the metrics we use. Among the three code-related tasks
evaluated in our experiment, we adopt the most widely used metrics for code generation and code
translation. For code summarization, common automated evaluation metrics often lack consistency
with human evaluations, a limitation reported in many code understanding studies [14, 42, 47]. Hence,
we follow the GPT-based evaluation method proposed by Sun et al. [43], which has demonstrated
a stronger correlation with human judgments.
7 Conclusion
In conclusion, this study provides a comprehensive evaluation of prompt engineering techniques
within the context of advanced large language models (including reasoning models and non-
reasoning models) for SE tasks, focusing on code generation, translation, and summarization. Our
results indicate that while prompt engineering has been essential in enhancing the effectiveness of
earlier LLMs, its benefits are often diminished or altered when applied to more advanced models
like GPT-4o and reasoning LLMs such as o1. Specifically, we find that reasoning models offer advantages
primarily in complex tasks requiring multi-step reasoning but may not justify their additional
costs and potential environmental impact in simpler tasks where non-reasoning models perform
comparably or even more effectively. Our findings suggest that adapting prompt engineering
techniques to advanced LLMs requires a shift in focus, emphasizing accurate information input and
response evaluation rather than complex prompt structures. For SE tasks that do not heavily rely
on reasoning, simpler prompt configurations with non-reasoning models can deliver high-quality
results with greater cost efficiency. Additionally, when using reasoning models for more intricate
tasks, careful management of output format and content length is advised to improve usability and
relevance. This study contributes valuable insights into the evolving landscape of LLM applications
in SE, underscoring the importance of adapting prompt engineering strategies in line with the
capabilities and limitations of current LLM advancements. Future research may further explore the
nuanced interplay between prompt complexity and model capabilities, providing deeper insights
into optimizing LLM deployment across a broader array of SE applications.
8 Data Availability
All data and results are available on our homepage [7].
References
[1] 2024. OpenAI API Pricing. https://openai.com/api/pricing/ Accessed: 2024-10-23.
[2] 2024. State-of-the-Art Code Generation on HumanEval. https://paperswithcode.com/sota/code-generation-on-
humaneval Accessed: 2024-10-28.
[3] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2023. Summarize and Generate to Back-translate:
Unsupervised Translation of Programming Languages. In Proceedings of the 17th Conference of the European Chapter of
the Association for Computational Linguistics. 1528–1542.
[4] Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl Barr. 2024. Automatic semantic augmentation of
language model prompts (for code summarization). In Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering. 1–13.
[5] Meta AI. 2023. LLaMA: Open and Efficient Foundation Language Models. https://ai.meta.com/llama/ Accessed:
2024-10-27.
[6] Anonymous. 2023. LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models. arXiv preprint
arXiv:2309.14393 (2023).
[7] Anonymous. 2024. The reproducible package of our empirical study. https://doi.org/10.5281/zenodo.14035224
[8] Anthropic. 2023. Claude AI Language Model. https://www.anthropic.com/claude Accessed: 2024-10-27.
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser,
Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak,
Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan
Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie
Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021.
Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]
[10] Xiangping Chen, Xing Hu, Yuan Huang, He Jiang, Weixing Ji, Yanjie Jiang, Yanyan Jiang, Bo Liu, Hui Liu, Xiaochen
Li, et al. 2024. Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities. arXiv preprint
arXiv:2410.13110 (2024).
[11] Google DeepMind. 2023. Gemini AI Language Model. https://www.deepmind.com/research/gemini Accessed:
2024-10-27.
[12] DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.
arXiv:2405.04434 [cs.CL]
[13] Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-collaboration code generation via chatgpt. ACM Transactions on
Software Engineering and Methodology 33, 7 (2024), 1–38.
[14] Aleksandra Eliseeva, Yaroslav Sokolov, Egor Bogomolov, Yaroslav Golubev, Danny Dig, and Timofey Bryksin. 2023.
From Commit Message Generation to History-Aware Commit Message Completion. In 2023 38th IEEE/ACM International
Conference on Automated Software Engineering (ASE). IEEE, 723–735.
[15] Shuzheng Gao, Xin-Cheng Wen, Cuiyun Gao, Wenxuan Wang, Hongyu Zhang, and Michael R Lyu. 2023. What makes
good in-context demonstrations for code intelligence tasks with llms?. In 2023 38th IEEE/ACM International Conference
on Automated Software Engineering (ASE). IEEE, 761–773.
[16] Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. 2023. Agentcoder: Multi-agent-
based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010 (2023).
[17] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet
challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[18] Akshaya Jagannadharao, Nicole Beckage, Dawn Nafus, and Scott Chamberlin. 2023. Timeshifting Strategies for
Carbon-Efficient Long-Running Large Language Model Training. Innovations in Systems and Software Engineering
(2023), 1–10.
[19] Sungmin Kang, Gabin An, and Shin Yoo. 2024. A quantitative and qualitative evaluation of LLM-based explainable
fault localization. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1424–1446.
[20] Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2023. Explainable automated debugging via large language
model-driven scientific debugging. arXiv preprint arXiv:2304.02195 (2023).
[21] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2024. Language models can solve computer tasks. Advances in
Neural Information Processing Systems 36 (2024).
[22] Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2023. Benchmarking
cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012 (2023).
[23] Nikita N Lazarev, Andrey N Zakharenko, Oleg A Korovin, Daria V Plosskaya, Vladimir S Dimitrov, Ivan V Akhripkin,
Ivan V Pavlov, Ivan V Oseledets, Ivan S Barsola, Alexander A Egorov, et al. 2022. eco2AI: Carbon Emissions Tracking
of Machine Learning Models as the First Step Towards Sustainable AI. Computational Mathematics and Modeling 33, 4
(2022), 1–17.
[24] Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, and Xiaoguang Mao. 2023. Cct5: A code-change-oriented
pre-trained model. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on
the Foundations of Software Engineering. 1509–1521.
[25] Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large
language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977 (2024).
[26] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain,
Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou,
Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning
Benchmark Dataset for Code Understanding and Generation. CoRR abs/2102.04664 (2021).
[27] Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2024.
Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications. arXiv
preprint arXiv:2403.16073 (2024).
[28] Yuhong Mo, Hao Qin, Yushan Dong, Ziyi Zhu, and Zhenglin Li. 2024. Large language model (llm) ai text generation
detection based on transformer deep learning algorithm. arXiv preprint arXiv:2405.06652 (2024).
[29] OpenAI. 2022. ChatGPT: A Large-Scale Chatbot Model. OpenAI Blog (2022). https://www.openai.com/blog/chatgpt
[30] OpenAI. 2024. ChatGPT. https://chat.openai.com/ Accessed: 2024-11-01.
[31] OpenAI. 2024. GPT-4o: Optimized GPT-4 Language Model. https://openai.com/index/hello-gpt-4o/ Accessed:
2024-10-26.
[32] OpenAI. 2024. Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/ Accessed:
2024-10-26.
[33] OpenAI. 2024. OpenAI o1-mini. https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
Accessed: 2024-10-26.
[34] Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris
Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in translation: A study of bugs introduced
by large language models while translating code. In Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering. 1–13.
[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of
machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
[36] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud
Texier, and Jeff Dean. 2021. Carbon Emissions and Large Neural Network Training. arXiv preprint arXiv:2104.10350
(2021).
[37] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald
Metzler, et al. 2023. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint
arXiv:2306.17563 (2023).
[38] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco,
and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297
(2020).
[39] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations
and Trends® in Information Retrieval 3, 4 (2009), 333–389.
[40] Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of
programming languages. Advances in neural information processing systems 33 (2020), 20601–20611.
[41] Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, and Bela Gipp.
2024. Can llms master math? investigating large language models on math stack exchange. In Proceedings of the 47th
International ACM SIGIR Conference on Research and Development in Information Retrieval. 2316–2320.
[42] Ensheng Shi, Yanlin Wang, Wei Tao, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun. 2022. RACE:
Retrieval-augmented Commit Message Generation. In Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5520–5530.
https://aclanthology.org/2022.emnlp-main.372
[43] Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu
Chen. 2024. Source Code Summarization in the Era of Large Language Models. In 2025 IEEE/ACM 47th International
Conference on Software Engineering (ICSE). IEEE Computer Society, 419–431.
[44] Xiaofei Sun, Xiaoya Li, Shengyu Zhang, Shuhe Wang, Fei Wu, Jiwei Li, Tianwei Zhang, and Guoyin Wang. 2023.
Sentiment analysis through llm negotiations. arXiv preprint arXiv:2311.01876 (2023).
[45] Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, and Wenqiang Zhang. 2021.
On the evaluation of commit message generation models: an experimental study. In 2021 IEEE International Conference
on Software Maintenance and Evolution (ICSME). IEEE, 126–136.
[46] Yingchen Tian, Yuxia Zhang, Klaas-Jan Stol, Lin Jiang, and Hui Liu. 2022. What makes a good commit message?. In
Proceedings of the 44th International Conference on Software Engineering. 2389–2401.
[47] Guoqing Wang, Zeyu Sun, Jinhao Dong, Yuxia Zhang, Mingxuan Zhu, Qingyuan Liang, and Dan Hao. [n. d.]. Is It
Hard to Generate Holistic Commit Message? ACM Transactions on Software Engineering and Methodology ([n. d.]).
[48] Yiming Wang, Zhuosheng Zhang, and Rui Wang. 2023. Element-aware Summarization with Large Language Mod-
els: Expert-aligned Evaluation and Chain-of-Thought Method. In The 61st Annual Meeting Of The Association For
Computational Linguistics.
[49] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing