Large Language Models for Code
Large Language Models (LLMs) have demonstrated remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code. We begin with understanding LLMs' limitations and challenges in automated code generation. Subsequently, we review various fine-tuning techniques designed to enhance both the performance and adaptability of LLMs in code generation tasks. We then review the existing metrics and benchmarks used to assess model performance under these fine-tuning techniques. Finally, we explore the applications of LLMs (e.g., CodeLlama, GitHub Copilot, ToolGen) in code generation tasks to illustrate their roles and functionalities. This survey provides a comprehensive overview of LLMs for code generation, helps researchers in diverse fields better understand the current state-of-the-art technologies, and shows the potential of effectively leveraging LLMs for code generation tasks.

Index Terms— Large Language Models (LLMs), Code Generation, Machine Learning, Artificial Intelligence (AI).

1 Introduction

In general, data mining requires users to have good coding skills and domain knowledge built on extensive training hours [1]. A competent data mining expert needs a foundational education in computer science or related fields along with practical experience in programming languages (PLs), such as Python and R [2]. Coding with a programming language like Python and its powerful libraries like Pandas can make these efforts easier by automating many preprocessing tasks [3]. However, data preprocessing, a critical step in data mining, is still very time consuming. The New York Times states that data preprocessing accounts for a large share of a data scientist's working time; with LLMs, in contrast, users need only several minutes to automatically generate code for data preprocessing. Figure 1 shows an example of using ChatGPT to automatically generate code for a data preprocessing task (replacing missing values). The accessibility, effectiveness, and efficiency of using LLMs in daily tasks show the huge potential of applying LLMs in automatic code generation. We write this survey in the hope of opening the possibilities of data mining for everyone.

Today, it is feasible to train a Generative AI (GenAI) model on vast amounts of rich and diverse data from various resources, such as code repositories, technical forums, and web data on coding topics [4]. For example, a descendant of GPT-3, OpenAI Codex, has been trained on billions of lines of source code, such as code in public GitHub repositories [5]. This comprehensive training on rich data enables LLMs to better understand the context of code comments and function names, as well as to better interpret variable names [4]. Since LLMs have also been integrated into Integrated Development Environments (IDEs) such as PyCharm and VSCode to help programmers develop their code, LLMs within these environments can comprehend this contextual information and provide suggestions for users [6]. Figure 2 shows an example of using GitHub Copilot to handle a data preprocessing task - splitting the dataset [7].

This survey outlines key aspects of LLMs in code generation by organizing them into four parts: (III) limitations and challenges, (IV) fine-tuning techniques, (V) evaluations, and (VI) applications. These four parts are supported by 38 references (including research papers and technical articles). Each part is further detailed by specific subtopics. Part III, Limits and Challenges of LLMs in automatic code generation, has four topics: (A) resource constraints, (B) syntactic and semantic errors, (C) biases, and (D) security risks. Subsequently, Part IV covers fine-tuning techniques for better code generation, including prompt engineering, feedback-based refinement, and fine-tuning on domain-specific datasets.
Figure 1 (user prompt to ChatGPT):

    Help me generate code for the following task: Remove rows with missing Product ID,
    replace missing Price values with the median, forward-fill missing Quantity values,
    and save the cleaned dataset as cleaned_sales_data.csv.
    In addition, record the time you take to generate this code.

Figure 2 (GitHub Copilot code completion). Given the comment:

    # Split the data into features and target variable

GitHub Copilot generates the following code:

    X = df.drop(['Date', 'Rented Bike Count'], axis=1)
    y = df['Rented Bike Count']
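The preprocessing task in Figure 1 maps onto a few lines of pandas. Below is a minimal sketch under stated assumptions: the raw file is named sales_data.csv and the columns are literally named Product ID, Price, and Quantity (these names come from the prompt, not from a real dataset).

    import pandas as pd

    # Assumed input file and column names, taken from the Figure 1 prompt.
    df = pd.read_csv("sales_data.csv")
    df = df.dropna(subset=["Product ID"])                    # remove rows with missing Product ID
    df["Price"] = df["Price"].fillna(df["Price"].median())   # replace missing prices with the median
    df["Quantity"] = df["Quantity"].ffill()                  # forward-fill missing quantities
    df.to_csv("cleaned_sales_data.csv", index=False)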
When a transformer model receives a user prompt to summarize an article, it analyzes the text and generates a concise summary that contains only the article's key points.

Figure 3: How transformer models work [10]

Training with an extensive and diverse dataset enables LLMs to handle many tasks in numerous areas, including healthcare and education [11]. For instance, the InteliHealth company develops personal health LLMs to generate recommendations on personalized health plans for patients based on their medical history combined with clinical data [12]. In education, an interactive textbook, CourseKata, which uses LLMs, shows the benefits of training on diverse datasets to meet educational needs [13]. Incorporating datasets from textbook materials, student responses, and assessments allows CourseKata to provide students with personalized feedback in real time: it generates personalized practice questions, offers detailed explanations, and adapts to each student's learning pace [13]. These examples highlight how training on large and diverse datasets enables LLMs to generalize to different domains.

Figure 4: Notable released LLMs timeline

Both deep learning architectures and the extensive, diverse datasets on which LLMs are trained lay the foundations for practical real-world applications in various industries, such as Google's BERT. These applications show the transformative potential of LLMs and the commitment of leading tech companies to advance their development and unlock new possibilities in various domains. For example, Google leverages BERT, with its ability to extract important information, summarize long texts, and optimize search results, to summarize texts with precision and conciseness [14]. In addition, Microsoft developed an LLM called Turing-NLG to enhance its system for identifying and extracting meaningful information from text (e.g., names, locations, and dates), allowing Microsoft to improve language understanding, deliver reliable, context-aware information, and strengthen applications in NLP, search engines, and information retrieval [14]. Furthermore, IBM uses the Watson NLU (Natural Language Understanding) model, which leverages LLMs to analyze and extract valuable insights from customer feedback, social media posts, and other sources, allowing it to make decisions based on this information [14]. These real-world applications illustrate how LLMs are revolutionizing industries and unlocking innovative solutions in diverse domains.

2.2 LLMs for Code Generation

To begin with, we introduce how LLMs generate code in terms of three stages: before data is fed into LLMs, during training, and after training. Before training, data preprocessing plays a crucial role in ensuring that datasets (e.g., open-source repositories from GitHub and Stack Overflow) are clean, standardized, and suitable for the chosen models, maximizing LLMs' ability to learn [15]. During training, LLMs build complicated internal representations of code to comprehend its meaning (semantics), structure (syntax), and the relationships among various code elements [16]. The code generation process itself consists of four steps: understanding the prompt, retrieving relevant code patterns, assembling code fragments, and generating code [16]. In the first step, LLMs analyze the given input by breaking it down to understand the intended functionality, the programming language, and any specified constraints. Secondly, LLMs match their internal representations against the prompt's requirements by finding relevant code patterns and structures. Subsequently, in the third step, they combine the retrieved code fragments and modify them to fit the prompt's context. Lastly, LLMs generate the final code output, often with several variations or suggestions for users to choose from.

With this transformative process of how LLMs generate code, these models have shown rapid development in just a few years. The journey started in 2021 with GitHub Copilot, which became one of the first widely available code generation tools [17]. From then on, numerous LLMs with code generation capability have been developed. By 2022, more advanced models like Replit Ghostwriter were
released, which allow users to perform tasks such as code completion, explanation, transformation, generation, and error detection with debugging [17]. In 2023, Bard was announced to support coding in more than 20 programming languages, such as C++, Go, Java, JavaScript, TypeScript, and Python [17]. These milestones illustrate the rapid advancements in LLMs with code generation capability that are changing the way users approach coding tasks.

To further demonstrate the impact of advancements in LLMs on code generation, OpenAI o1, also known as "Strawberry," shows a significant leap forward in coding performance. Compared to other top-performing models like Claude 3.5 Sonnet, GPT-4o, and Llama 405B in coding challenges using HumanEval benchmark data, OpenAI o1 achieves the highest performance rate of 92.4%, establishing itself as the best coding model according to Vellum [18]. OpenAI o1, one of the latest OpenAI models, presents its groundbreaking "thinking capacity" [19]. Unlike previous models that focus on the number of parameters, OpenAI o1 has the ability to "think" before responding by creating a long internal chain of thought, similar to how humans brainstorm to solve problems [20]. Using a large-scale reinforcement learning algorithm, OpenAI o1 can detect and fix its mistakes, break down complicated problems into simpler components, and attempt a new approach if the old one is not working [21]. OpenAI o1 outperforms previous models like GPT-4o by ranking in the 89th percentile on Codeforces and can solve PhD-level problems when evaluated on GPQA Diamond, a difficult intelligence benchmark [21].

3 Limits and Challenges of Using LLMs for Code Generation

The rapid development and huge potential of LLMs also raise several significant limitations and challenges. In this section, we discuss four areas: resource constraints, syntactic and semantic errors, biases, and security risks. Firstly, the training and deployment of LLMs require immense processing power and memory. For example, with one token defined as one unit of text (e.g., a word or a small piece of a sentence), the Llama 3.1 models were trained on over 15 trillion tokens, for which Llama 3.1-8B required 7 million GPU hours while Llama 3.1-405B required approximately 31 million GPU hours [22]. Secondly, LLM performance, such as accuracy and reliability, can be significantly impacted by syntactic and semantic errors, resulting in failures during execution or in programs with unexpectedly incorrect output [23]. For example, generated code can contain errors in "if" statements, which can lead to incorrect branching, such as skipping conditions or executing the wrong code parts [23]. Thirdly, a bias testing framework called Code Bias Score (CBS) revealed that 38.92% of GPT-4's generated code contained gender bias [24]. Finally, security risks may stem from training data that includes unsanitized open-source code containing vulnerabilities such as memory safety violations or SQL injection risks [25].

3.1 Resource Constraints

Chen et al. [26] point out that training LLMs for generating code is highly resource intensive, as these models require vast computational capacity and memory, especially on Graphics Processing Units (GPUs). For instance, StarCoder, an LLM trained on over 80 PLs from GitHub, is a 15B-parameter model trained on 1T tokens. Such models demonstrate significant computational demands, and recent trillion-parameter models have stretched the limits of current computational capacity by consuming extensive processing resources and memory. For example, CodeLlama, a coding model derived from Llama 2, is available in the following model sizes: 7B, 13B, 34B, and 70B parameters. All of these, except the 70B model, have been trained on 500B tokens of code and code-related data, whereas the 70B model required 1 trillion tokens, continuing to push the limits of current state-of-the-art computing resources.

Meanwhile, Chavan et al. [27] highlight the critical challenge of making LLMs faster and more memory-efficient, particularly in resource-constrained environments. For example, loading a LLaMA-70B model requires 140GB of VRAM, which excludes the additional memory needed for inference. This underscores the need for model compression when deploying in a resource-constrained environment. Because of this, quantization, which reduces memory usage by lowering the numerical precision of model weights, is introduced to handle these challenges. However, various methods such as GPTQ, AWQ, and OmniQuant demonstrate the trade-offs between memory efficiency and model accuracy. The study presents a performance comparison between these quantization methods in compressing the LLaMA2-7B model. For example, OmniQuant reduces memory usage by lowering the precision to 4-bit, achieving strong performance with a perplexity of 5.97. However, reducing precision further to 3-bit with OmniQuant increases the perplexity to 6.65, indicating a decline in output quality. Similarly, GPTQ with 3-bit precision reduces weight
memory to 2.87 GB but leads to a higher perplexity score (7.36) compared to the 4-bit (6.08) and 8-bit (5.86) configurations.
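As an illustration of the memory/precision trade-off discussed above, the sketch below loads a causal language model with 4-bit weight quantization using the Hugging Face transformers and bitsandbytes libraries. This is not the GPTQ, AWQ, or OmniQuant pipeline evaluated by Chavan et al.; the checkpoint name is only a placeholder.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Store weights in 4-bit precision; run matrix multiplications in fp16.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                    bnb_4bit_compute_dtype=torch.float16)

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 quantization_config=bnb_config,
                                                 device_map="auto")
    # The quantized model needs roughly a quarter of the fp16 weight memory,
    # at the cost of some accuracy (e.g., higher perplexity), as reported above.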
In contrast, Hassid et al. [28] examine the impact of computational constraints on LLM performance by comparing smaller models, such as Code Llama at 7B/8B/13B parameters, with larger models, such as the 70B model, under the same resource limits. Interestingly, the smaller 7B/8B/13B models demonstrated better results, showing 5 to 15% performance gains over the 70B model. In a "small budget regime" capped at 32 normalized FLOP units and 64 wall-time units, the study evaluated code generation benchmarks (HumanEval, MBPP) for the 7B/8B/13B model sizes against the 70B model. The evaluation tasks include HumanEval, with 164 function completions, and MBPP, with 500 code generation instructions. In addition, CodeLlama 7B/13B achieved a 60% score in only a quarter of the time required by larger models to achieve the same result on the HumanEval benchmark. This finding highlights the efficiency of smaller models and their ability to deliver competitive results with fewer resources in limited environments.

Table 2: Summary of Resource Constraints

Source             | Key Ideas                                | Challenges
Chen et al. [26]   | A 15B model trained on 1T tokens.        | Training inefficiency.
Chavan et al. [27] | Issues for faster and lighter LLMs.      | Quantization method trade-offs.
Hassid et al. [28] | 7B/13B models outperform the 70B model.  | Performance comparison between smaller and larger LLMs.

3.2 Syntactic and Semantic Errors

Wang et al. [29] provide an in-depth analysis of common syntactic and semantic errors in code generated by six prominent LLMs (CodeGen-16B, InCoder-1.3B, GPT-3.5, GPT-4, SantaCoder, and StarCoder) by evaluating 557 of their incorrect code snippets across 164 Python tasks from the HumanEval dataset. Syntactic errors, such as missing or incorrectly structured code blocks, were two common problems, with the six LLMs generating about 40% of these errors; this indicates that even widely used LLMs struggle with fundamental code structure. Simple errors, including "if" errors, incorrect function names, and incorrect function arguments, were also common and easy to correct. The most common semantic issues in the code generated by these six LLMs are incorrect logical flow and flawed conditional statements. In contrast to the other four models (CodeGen-16B, InCoder-1.3B, SantaCoder, and StarCoder), GPT-3.5 and GPT-4 show better performance in generating code with fewer missing steps.

Based on these findings, Dou et al. [30] conduct a study on both syntactic and semantic errors in LLM-generated code by analyzing seven different models, namely GPT-4, GPT-3.5, Claude-3, Llama-3, Phi-3, StarCoder-2, and DeepSeekCoder, on three benchmarks: HumanEval+, MBPP+, and APPS+. They observed that syntactic errors, including incorrect syntax structure, indentation errors, and missing library imports, are relatively rare, accounting for less than 10% of the total errors among all models. In contrast, semantic errors, such as misunderstanding task requirements, logic errors, hallucinations, and input/output format issues, constitute the largest category of errors. DeepSeekCoder, Llama-3, Phi-3, and GPT-3.5 have proportions of semantic errors that exceed 50% on the APPS+ benchmark, showing their struggles with intricate logic and conditional structures. Additionally, as the complexity of the benchmark grows, semantic errors increase proportionally, highlighting the challenges LLMs face in accurately interpreting and executing complex tasks.

Extending the scope to translation tasks, Pan et al. [31] categorize translation errors in LLM-generated code into 15 types, covering both syntactic and semantic issues. Syntactic errors, often involving misalignment with target language-specific requirements, and semantic errors, which affect the logical consistency of translated code, were both common. The study used real-world projects like Apache Commons CLI and Python Click to evaluate LLMs' effectiveness in code translation, categorizing translation errors and assessing the resulting syntactic and semantic issues across multiple benchmarks, such as HumanEval, MBPP, and APPS. In particular, 30.5% of translation errors resulted from syntactic and semantic misalignments between the source and target languages, and 24.3% of these errors involved unmet target language requirements. This study underscores the challenges LLMs face in code translation tasks, where nearly 80% of the issues arise from such discrepancies.

Finally, Liu et al. [32] analyze ChatGPT's correctness in code generation by examining the semantic and syntactic errors that impact code reliability and quality. The analysis covers 4,066 code snippets generated in Java and Python for 2,033 programming tasks, revealing that both types of errors affect the compilation and runtime
behavior of the generated programs. The results demonstrate that Illegal Index and Type Mismatch errors are the most common semantic errors in ChatGPT-generated code. Illegal Index errors account for 46.4% of the 97 runtime errors in Java, while Type Mismatch errors are more frequent in Python because of its dynamic typing system. Furthermore, for semantic errors, 1,930 snippets (47%) exhibited maintainability issues, such as inconsistent variable use and improper loop handling, affecting readability and reusability. This breakdown underlines that semantic and syntactic issues lead not only to runtime errors but also to a higher demand for manual correction to achieve functional code.

3.3 Biases

Wang et al. [33] explore multilingual bias in LLM code generation, including multi-NL bias and multi-PL bias. The paper studied multilingual bias using three popular LLMs, StarCoder, CodeLlama, and DeepSeek-Coder, evaluated with the Pass@k metric. For multi-NL bias, the results showed that LLMs exhibit a significant performance gap when generating code from instructions in different natural languages, such as English and Chinese, across different PLs (e.g., Python, Java, C++). Using Chinese instructions led to an average Pass@1 drop of 17.2% for base models and 14.3% for instruction-tuned models in Python, with CodeLlama-34B experiencing more severe bias as its Java code generation dropped by 37.8%. For multi-PL bias, the results showed varying LLM performance when generating code in different PLs: base models achieved the highest Pass@1 rate in Python, outperforming C++ and Java by 5.7% and 11.3%, respectively.

Expanding the discussion to social biases, Liu et al. [34] investigate the severity of these biases in LLM code generation. Their experiments were performed on different LLMs, such as Codex, InCoder, and CodeGen, with different sizes, evaluating social biases in code using three metrics: Code Bias Score (CBS), UnFairness Score (UFS), and the standard deviation of the frequency over all valid demographics (e.g., ethnicity, religion, and gender). The results revealed that models such as Codex and InCoder generated harmful code in which certain ethnicities or religions were associated with the derogatory term "disgusting," expressing prejudice against "Islam" and "Muslim." Furthermore, Codex, with over 100 billion parameters, achieved the highest code generation quality (Pass@1: 47.03%) but also demonstrated the most severe biases (CBS: 82.64%), highlighting a troubling trade-off between performance and fairness. Similarly, as the size of the CodeGen models increases from 350M to 6.1B parameters, their performance improves from 12.76% to 26.13% Pass@1, but the CBS bias escalates sharply from 9.36% to 62.65%.

3.4 Security Risks

Islam et al. [35] examine security vulnerabilities in LLM-generated code along three main technical dimensions: data quality, model design, and evaluation practices. LLMs show their disadvantage by producing 10% more vulnerable code than human developers. Data quality issues, including incorrect labeling and data leakage, as indicated by datasets such as MVD and Devign, were observed to trigger false positives or false negatives in vulnerability detection. In addition, models designed only for supervised fine-tuning, such as VulRepair, mostly generate non-functional code due to scarce syntax and functionality checks. Lastly, for evaluation, the common metrics used to evaluate these models, such as BLEU and Exact Match, are not indicative enough of the security and functionality of the generated code.

Based on this analysis, He [36] explores recent efforts to evaluate the security of LLM-generated code, from systematic testing to user studies. Initially, the author discusses a common security risk called "Out-of-Bounds Write" (CWE-787), which can allow attackers to exploit computer memory for criminal activities by writing malicious information. Recent efforts to assess the security of LLM-generated code include systematic evaluations using the Common Weakness Enumeration (CWE), focusing on how Copilot handles various vulnerabilities across different prompts, weaknesses, and programming domains. Copilot's responses to these diverse prompt and domain scenarios show that around 40% of the generated code is vulnerable from a security standpoint. In addition, a security-driven user study examined code written by student programmers with an LLM's assistance. The user study found that while LLM-assisted code generation introduced some vulnerabilities, the overall impact on security was small: AI-assisted students produced security-critical bugs about 10% more often than non-assisted students.

Furthermore, Black et al. [37] investigate security issues with LLM-generated code that arise from the challenge of balancing security and correctness, depending on prompting strategies, model selection, and the degree of randomness allowed in responses. CWE-22 (directory traversal) and CWE-190 (integer overflow) have been two of the common vulnerabilities used as benchmarks to evaluate generated programs. In CWE-22 (directory traversal), the task is to generate programs that write files to specified paths. The results show that GPT-3.5 generated code that
allowed filenames with "../", enabling unauthorized access to parent directories. In CWE-190 (integer overflow), the task required generating programs to handle numerical operations safely. The results show that Claude Opus initially used standard int types that failed to handle large numbers such as 2 * 9,999,999,999, resulting in incorrect output.
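To make the CWE-22 case concrete, the sketch below shows the kind of check that the vulnerable generated code omitted: resolving the requested path and refusing to write outside the intended base directory. It is an illustrative defensive pattern, not code from the study.

    import os

    def safe_write(base_dir: str, filename: str, data: str) -> None:
        """Write data under base_dir, rejecting '../'-style path traversal (CWE-22)."""
        base = os.path.realpath(base_dir)
        target = os.path.realpath(os.path.join(base, filename))
        if os.path.commonpath([base, target]) != base:
            raise ValueError(f"refusing to write outside {base_dir!r}: {filename!r}")
        with open(target, "w") as f:
            f.write(data)

    os.makedirs("uploads", exist_ok=True)
    safe_write("uploads", "report.txt", "ok")        # allowed
    # safe_write("uploads", "../etc/passwd", "...")  # raises ValueError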
Finally, Wang et al. [38] highlight the security risks of LLM-generated code that arise both from the training data and from the generation process. Firstly, training LLMs on unsanitized data from open-source platforms such as GitHub can lead to the risk of inadvertently embedding security vulnerabilities in generated code. For example, the 2022 Open Source Security and Risk Analysis (OSSRA) report highlights that 81% of the 2,049 codebases analyzed had at least one vulnerability, with 49% containing high-risk vulnerabilities. Therefore, these models are prone to propagating vulnerabilities during the code generation process, potentially resulting in flawed outputs that are highly susceptible to exploitation and malicious attacks. To illustrate this, Copilot generates insecure code in about 40% of cases, whereas for ChatGPT, only 5 of 21 generated programs were initially secure.

4 Fine-Tuning Techniques for Enhancing LLM Performance in Code Generation

To handle the limitations and challenges of code generation by LLMs, fine-tuning has become an important strategy to enhance LLMs' capabilities. Fine-tuning allows users to adapt pre-trained LLMs to specialized applications with significantly improved performance while preserving their general knowledge. For example, Google revealed that fine-tuning for sentiment analysis boosted the accuracy of LLMs by 10% [39]. Because of that, this section explores three fine-tuning techniques: prompt engineering, which optimizes LLM outputs by crafting effective input instructions; feedback refinement, which reduces errors by incorporating corrections; and domain-specific dataset tuning, which improves LLM performance in specialized areas. Together, these techniques mitigate specific weaknesses within LLMs and open the way to more effective and robust applications involving code generation.

Figure 5: LLM-based fine-tuning process

4.1 Fine-Tuning on Domain-Specific Datasets

Ma et al. [40] propose LLaMoCo, the first instruction-tuning framework designed to adapt LLMs for optimization code generation. The framework establishes a comprehensive instruction set that contains well-described problem prompts and effective optimization codes. It then proposes a new two-phase learning strategy that includes a contrastive learning-based warm-up procedure before the instruction-tuning phase to improve convergence behavior during model fine-tuning. The experimental results showed that LLaMoCo significantly improved the performance of LLMs: a CodeGen (350M) model fine-tuned with LLaMoCo demonstrated superior optimization performance compared to GPT-4 Turbo on both synthetic and realistic problem sets, showing less error (4.168% and 79.483%) and better performance (87.227% and 59.174%), respectively. In addition, LLaMoCo boosted the CodeLlama 7B model's performance from 29.717% to 81.843%.

Furthermore, Weyssow et al. [41] review four popular Parameter-Efficient Fine-Tuning (PEFT) techniques: LoRA, IA3, Prompt Tuning, and Prefix Tuning. These techniques fine-tune LLMs by updating only a subset of model parameters rather than all parameters, so the LLM focuses on task-specific data while maintaining good resource usage. The paper then compares these four techniques with in-context learning (ICL) and traditional full fine-tuning on code generation tasks using Python datasets like CoNaLa and CodeAlpaca. The results indicated that fine-tuned LLMs consistently perform significantly better with PEFT than with ICL, as LoRA fine-tuning improved EM@10 and CodeBLEU by 25.4% and 22.8% on CoNaLa (and by 150% and 29.8% on CodeAlpacaPy), respectively. QLoRA reduces memory usage further, allowing fine-tuning of LLMs with up to 34B parameters. This investigation emphasizes the potential of PEFT techniques for efficiently fine-tuning LLMs on task-specific data to generate code.
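To make the parameter-efficient idea concrete, here is a minimal LoRA setup with the Hugging Face peft library. The checkpoint name and the target_modules list are assumptions that depend on the model architecture; they are placeholders rather than the configuration used by Weyssow et al.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    base_model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")  # placeholder checkpoint

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                                   # rank of the low-rank update matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections; names vary by architecture
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of the weights are trainable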
Complementing these approaches, Tsai et al. [42] introduce a novel approach to fine-tuning LLMs for code generation by integrating data pruning methods. The paper explores the use of clustering
algorithms (KMeans, Agglomerative Clustering, HDBSCAN) and pruning metrics (a diversity metric and a density metric) to selectively reduce the size of the training data while maintaining the accuracy and functionality of the generated code, since there are significant redundancies in training data. The HumanEval(+) and MBPP(+) datasets are used to evaluate the pruning methods and highlight performance improvements. Surprisingly, the results show that pruning even a small portion of the training data can lead to performance improvements of up to 2.7% on HumanEval and 3.5% on MBPP. Remarkably, keeping only 1% of the data via pruning can result in a 4.1% improvement compared to the base model, achieving performance nearly equivalent to training with the entire dataset.
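A rough sketch of the cluster-then-prune idea is shown below, assuming code snippets are embedded with a simple TF-IDF vectorizer (the paper's actual embeddings, metrics, and clustering settings differ); the snippets closest to each KMeans centroid are kept as the pruned training set.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def prune_training_set(snippets, n_clusters=50, per_cluster=5, seed=0):
        """Return indices of a small, diverse subset of the training snippets."""
        X = TfidfVectorizer(max_features=4096).fit_transform(snippets)
        km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X)
        keep = []
        for c in range(n_clusters):
            members = np.where(km.labels_ == c)[0]
            if len(members) == 0:
                continue
            # Distance of each member to its cluster centroid; keep the closest ones.
            dist = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
            keep.extend(members[np.argsort(dist)[:per_cluster]].tolist())
        return sorted(keep)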
cess rate for CodeGen-2.7B improved from 39.8%
4.2 Feedback to 45.4% and from 17.3% to 20.0% for the smaller
model CodeGen-350M.
Mu et al. [43] present a novel framework - Clar-
ifyGPT - that can identify and clarify ambiguous
4.3 Prompt Engineering
user requirements to improve LLM-based code gen-
eration. ClarifyGPT can perform a code consis- Sun et al. [46] apply the “Chain-of-Thought”
tency check to detect ambiguity and generate tar- prompting technique to generate “solution plans”
geted clarifying questions to refine unclear input. for complex programming challenges to develop
Consequently, it generates the solution code from a framework called CodePLAN. This framework
the received response. Therefore, this framework is designed to infuse the reasoning capabilities of
plays an important role in improving the inter- LLMs in smaller models to enhance their code
pretability of the code generated by LLMs. Fur- generation performance. CodePLAN uses multi-
thermore, it helps users better understand gen- task learning to train smaller models on both code
erated code from interaction and provides more generation and solution plan generation simultane-
clarification of their intentions. Using two pub- ously. It uses backward reasoning and plan sam-
licly available benchmarks: MBPP-sanitized and pling strategies to improve solution plan quality.
MBPP-ET for evaluation, ClarifyGPT improved The higher quality of the solution plan may lead
the average performance of GPT-4 and ChatGPT to more accurate code generation outputs. The
from 68.02% to 75.75% and from 58.55% to 67.22%, framework considers LLMs as “teachers” to pro-
respectively. vide solution plans that distill into smaller models
Furthermore, Gehring et al. [44] discuss Re- considered “students”. This allows them to develop
inforcement Learning from Execution Feedback solution plans independently at inference time. Ex-
(RLEF), a method to improve LLMs in code syn- periments demonstrated that this approach signif-
thesis by using feedback from code execution to it- icantly improves the code generation abilities of
eratively refine outputs. The process includes three smaller models by more than 130% in performance
steps: generating code, receiving feedback from test using the pass@1 metric on the APPS benchmark.
cases, and updating the model through reinforce- Expanding on prompting techniques, Li et al.
ment learning using Proximal Policy Optimization [47] develop a novel approach named AceCoder to
(PPO). In experiments on competitive program- improve LLM’s performance in code generation. It
ming tasks such as those in CodeContests, models is designed to perform two major challenges of code
that are trained with RLEF achieved a solve rate of generation: requirement understanding and code
37.5% in the test set for the standalone Llama 3.1 implementation. This method performs code gen-
70B model. These significantly outperform the pre- eration in three steps: example retrieval, prompt
vious state-of-the-art AlphaCodium at 29%. The construction, and code generation. First, the re-
method also reduced samples by an order of mag- triever selects similar programs based on language
nitude compared to the RLEF approach. This ap- input, whereas the selector selects non-redundant
proach generalizes well to other benchmarks like programs based on prioritizing non-overlapping in-
HumanEval+ and MBPP+, where feedback is used formation. Second, the technique identifies a com-
for the grounding of the output of LLMs, especially bination of chosen examples, their preliminary ar-
on multi-turn code generation tasks. tifacts in the form of test cases, and input require-
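RLEF trains the model with PPO, which is beyond a short snippet, but the execution-feedback loop it builds on can be sketched as follows. llm_generate is a hypothetical stand-in for any code LLM call; the loop runs the candidate against its tests and feeds the error trace back into the next prompt.

    import subprocess
    import tempfile

    def run_tests(candidate_code: str, test_code: str) -> str:
        """Run the candidate together with its unit tests; return stderr ('' means success)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_code + "\n\n" + test_code)
            path = f.name
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        return proc.stderr

    def refine_with_execution_feedback(task: str, test_code: str, llm_generate, max_turns: int = 3) -> str:
        """llm_generate(prompt: str) -> str is a hypothetical code-generation call."""
        prompt = task
        code = llm_generate(prompt)
        for _ in range(max_turns):
            error = run_tests(code, test_code)
            if not error:
                return code  # all tests passed
            # Ground the next attempt in the observed execution feedback.
            prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nIt failed with:\n{error}\n\nFix the code."
            code = llm_generate(prompt)
        return code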
Finally, Wong et al. [45] introduce a new method that combines crowd-sourced computation and reinforcement learning from human feedback (cRLHF) to improve code generation in LLMs. It aims to maximize code quality using feedback from multiple users. Whereas the traditional method, RLHF, contains biases and misses important insights that limit LLMs' potential, cRLHF collects feedback data from different sources and uses Bayesian inference to align and combine the feedback into one belief that gives more objective assessments without complicated reward modeling. The framework fine-tunes LLMs using the aggregated feedback to improve code correctness and quality. The results show significant improvements in LLMs of different sizes when the cRLHF method is applied: in the HumanEval benchmark evaluation, the success rate for CodeGen-2.7B improved from 39.8% to 45.4%, and from 17.3% to 20.0% for the smaller CodeGen-350M model.

4.3 Prompt Engineering

Sun et al. [46] apply the "Chain-of-Thought" prompting technique to generate "solution plans" for complex programming challenges, developing a framework called CodePLAN. The framework is designed to infuse the reasoning capabilities of large LLMs into smaller models to enhance their code generation performance. CodePLAN uses multi-task learning to train smaller models on both code generation and solution plan generation simultaneously. It uses backward reasoning and plan sampling strategies to improve solution plan quality, and higher-quality solution plans can lead to more accurate code generation outputs. The framework treats LLMs as "teachers" that provide solution plans to be distilled into the smaller "student" models, allowing the students to develop solution plans independently at inference time. Experiments demonstrated that this approach significantly improves the code generation abilities of smaller models, by more than 130% in performance using the pass@1 metric on the APPS benchmark.
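A two-stage prompt in the spirit of CodePLAN's plan-then-code idea is sketched below. The wording is illustrative, not the framework's actual template, and llm_generate is again a hypothetical LLM call.

    task = "Given a list of intervals, merge all overlapping intervals."

    # Stage 1: ask for a natural-language solution plan (chain of thought), no code yet.
    plan_prompt = (
        "Think step by step and write a short numbered solution plan (no code) for this task:\n"
        f"{task}"
    )
    # plan = llm_generate(plan_prompt)

    # Stage 2: condition the code request on the plan produced in stage 1.
    code_prompt_template = (
        "Task: {task}\n"
        "Solution plan:\n{plan}\n"
        "Now write a Python function that follows the plan exactly."
    )
    # code = llm_generate(code_prompt_template.format(task=task, plan=plan))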
Expanding on prompting techniques, Li et al. [47] develop a novel approach named AceCoder to improve LLM performance in code generation. It is designed to address two major challenges of code generation: requirement understanding and code implementation. The method performs code generation in three steps: example retrieval, prompt construction, and code generation. First, a retriever selects similar programs based on the language input, and a selector then keeps non-redundant programs, prioritizing non-overlapping information. Second, the technique combines the chosen examples, their preliminary artifacts in the form of test cases, and the input requirements to construct a prompt. Finally, the LLM uses the constructed prompt to generate test cases that yield the final source code. AceCoder was evaluated on three LLMs (CodeGeeX, CodeGen, and InCoder) using three public benchmarks and the Pass@k metric. AceCoder surpassed state-of-the-art prompting techniques, improving Pass@1 by up to 56.4% on MBPP, 70.7% on MBJP, and 88.4% on MBJSP, and has proven to be effective across different LLM sizes and languages.

Lastly, Tony et al. [48] explore the impact of different prompting techniques on the security of the code generated by LLMs from NL instructions. These techniques were implemented in the GPT-3, GPT-3.5, and GPT-4 models. The authors investigated the techniques using a dataset of 150 NL prompts related to security-relevant coding tasks. The 15 explored prompting techniques are classified into 5 categories depending on their common characteristics: root techniques, refinement-based techniques, decomposition-based techniques, reasoning-based techniques, and priming techniques. For instance, refinement-based techniques focus on improving model outputs through iterative refinement, feedback loops, or self-assessment, including methods such as Recursive Criticism and Improvement (RCI), Self-refine, and Progressive Hint prompting. The results indicated that RCI performed best for both GPT-3.5 and GPT-4, while zero-shot prompting performed best among these techniques for GPT-3. The persona/memetic proxy technique yielded the poorest performance, generating the most security weaknesses across all models.

5 Evaluation Metrics and Benchmarks for Assessing LLM-Generated Code

As numerous LLMs with code generation capability, a crucial tool for programmers of all skill levels, have been developed, evaluating these models is essential to ensure their dependability and efficiency in meeting users' needs. While significant efforts have been dedicated to the performance evaluation of LLMs, many research questions remain unanswered, such as: "Are the evaluations and comparisons fair, and are the differences significant?" or "Do findings from performance evaluation truly reflect the usability of LLMs as practical programming tools?" [49]. This section discusses two key aspects of evaluation: performance metrics and benchmarks.

Before we discuss these aspects in depth, it is essential to clarify the terms "benchmarks" and "metrics." LLM evaluation metrics are criteria used to quantify the performance of LLM systems in aspects such as correctness of the answers, semantic similarity, and hallucination [50]. Benchmarks, on the other hand, are constructed from evaluation datasets and metrics, where test cases make up the evaluation dataset [51]. Figure 6 illustrates the LLM benchmark structure, which includes the integration of metrics within this evaluation framework.

Figure 6: LLMs System Benchmark [51]

5.1 Metrics

One common metric for evaluating code generation by LLMs is CodeBLEU [52]. Compared to the generally adopted BLEU metric for NL evaluation, which lacks the key syntactic and semantic characteristics of code, CodeBLEU was designed to incorporate both traditional n-gram matching and syntactic and semantic matching. Specifically, the weighted n-gram match weights different n-grams differently, the syntactic match plugs in AST information by aligning subtrees, and the semantic match measures the similarity of code based on an analysis of its data-flow structure. CodeBLEU combines these elements (the weighted n-gram match, the syntactic AST match, and the semantic data-flow match) into a comprehensive evaluation metric. The experiments covered three coding tasks: text-to-code (Java), code translation (from Java to C#), and code refinement (Java). The results demonstrate that CodeBLEU correlates better with human evaluation scores than traditional metrics like BLEU and perfect accuracy in all three tasks.
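The combination described above is a weighted sum. In the original CodeBLEU formulation it is written as follows, where the four weights are hyperparameters (equal weights of 0.25 are the common default):

    \text{CodeBLEU} = \alpha \cdot \text{BLEU} + \beta \cdot \text{BLEU}_{\text{weight}} + \gamma \cdot \text{Match}_{\text{ast}} + \delta \cdot \text{Match}_{\text{df}}, \qquad \alpha = \beta = \gamma = \delta = 0.25 \text{ (default)}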
Another popularly used metric for LLM-generated code is pass@k [56]. Pass@k, which aims to address the shortcomings of traditional text similarity metrics, is designed to assess the functional correctness of generated code samples. It gives the probability that at least one of the top k samples passes the unit tests. The metric has variations such as pass@1, pass@10, and pass@100: pass@1 gives the likelihood of correctness on the first attempt, while pass@10 and pass@100 assess the model's performance on much larger sample
sets to provide a comprehensive view of its ability to generate valid solutions. The following formula is used to calculate pass@k over all problems, with E denoting the expected value over problems, n the total number of generated samples, c the number of correct samples, and k the number of top samples to consider:

    pass@k := E_problems[ 1 − C(n − c, k) / C(n, k) ]
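For reference, the unbiased estimator above is usually computed in the numerically stable product form below (this is the standard implementation popularized with the HumanEval benchmark; the example numbers are made up):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """pass@k for one problem: n generated samples, c of which pass all unit tests."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 20 samples, 3 correct; probability that a random subset of 5 samples
    # contains at least one correct one.
    print(round(pass_at_k(n=20, c=3, k=5), 3))  # 0.601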
Yeo et al. [53] propose a new metric called pass-ratio@n, which measures precision at a finer granularity, based on the pass rate of the test cases. As LLMs can generate different solutions across inferences, and considering that n inferences are made, the average pass-ratio across the n solutions is used to mitigate bias. For each solution i (0 < i ≤ n), the pass-ratio is calculated by the following formula:

    pass-ratio_i = ( (# of test cases passed by code i) / (# of test cases) )^2

And pass-ratio@n is the mean pass-ratio of the n generated solutions:

    pass-ratio@n = ( Σ_{i=1}^{n} pass-ratio_i ) / n

The pass-ratio@n metric was tested on three coding problems, with five generated solutions per problem from LLM inference, and compared to pass@k. The results show that pass-ratio@n can provide more granular insight than the pass@k metrics: on a coding problem where none of the solutions passed all test cases, pass-ratio@5 gave a partial score of 61%, while pass@k recorded 0%.

In addition, Zhuo [54] introduces another new metric, ICE-Score, which evaluates code by instructing LLMs to act as assessors. The ICE-Score input includes two key components: (1) the task definition, evaluation criteria, and structured evaluation steps, and (2) the provided problem along with the generated code snippet to be assessed. Unlike traditional metrics such as BLEU or reference-based methods relying on human-written test suites, ICE-Score uses LLMs' capabilities to assess generated code on two aspects: usefulness and functional correctness. Given the task problems and their generated code, ICE-Score outputs the corresponding assessments; the output labels for the two aspects include, for example, "Nearly Useless" or "Totally Useless" for usefulness, and "Functionally Incorrect" or "Functionally Correct" for correctness.

5.2 Benchmarks

Table 3 summarizes several notable benchmarks, drawn from Symflower [55], which are discussed in this section.

Table 3: Benchmark Features Summary

Benchmarks   | Number of Tasks | PLs    | Release Date
HumanEval    | 164             | Python | July 2021
ClassEval    | 100             | Python | August 2023
SWE-bench    | 2,294           | Python | October 2023
BigCodeBench | 1,140           | Python | June 2024

The HumanEval [56] dataset is a benchmark designed to evaluate LLMs on code generation. It includes 164 programming challenges, where each problem contains "function signatures, docstrings, body, and unit tests" to evaluate functional correctness; on average, there are 7.7 tests per problem. Traditional metrics like BLEU can measure the similarity of texts, but they are not suitable for evaluating code generation because functional correctness is much more important. This was addressed by introducing the pass@k metric, which came with the HumanEval dataset and helps assess functional correctness. The metric measures the probability that at least one of the top k generated code samples passes the unit tests, providing a more practical evaluation of the generated code. HumanEval and pass@k have become critical factors in testing LLM coding capabilities and provide more meaningful and valuable test results.
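To illustrate the task format, below is a toy problem written in the HumanEval style (a hypothetical example, not an actual benchmark item): the model receives the signature and docstring and must produce the body, which is then checked by hidden unit tests.

    def running_maximum(numbers: list[int]) -> list[int]:
        """Return a list where element i is the maximum of numbers[: i + 1].
        >>> running_maximum([1, 3, 2, 5, 4])
        [1, 3, 3, 5, 5]
        """
        result, current = [], float("-inf")
        for x in numbers:
            current = max(current, x)
            result.append(current)
        return result

    # Hidden tests then check functional correctness, e.g.:
    assert running_maximum([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert running_maximum([]) == []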
two key components: (1) the task definition, evalu- In contrast, ClassEval [57] is a benchmark de-
ation criteria, and structured evaluation steps, and signed to evaluate LLMs on the more challenging
(2) the provided problem along with the gener- coding tasks, such as class-level code generation,
ated code snippet for assessment. Unlike tradi- unlike existing benchmarks like HumanEval that fo-
tional metrics such as BLEU, reference-based meth- cus only on simple scenarios, such as function-level
ods relying on human-written test suites, ICE-Score or statement-level. The benchmark consists of 100
uses LLMs’ capabilities to assess generated code on manually constructed Python coding tasks, includ-
two aspects: usefulness and functional correctness. ing 100 classes and 412 methods. The study exper-
By entering the task problems and their generated imented with the evaluation of 11 state-of-the-art
code, ICE-Score outputs the corresponding assess- LLMs in class-level generated code using three dif-
ments. The results outputted for both aspects in- ferent code generation strategies, including holistic
clude “Nearly Useless” or “Totally Useless”, and generation, incremental generation, and composi-
“Functional Incorrect” or “Functional Correct”. tional generation. First, existing LLMs perform sig-
nificantly worse in class-level code generation com-
5.2 Benchmarks pared to standalone method-level benchmarks like
HumanEval. Secondly, GPT-4 and GPT-3.5 consis-
Table 3 represents the information of several no- tently outperform other models; and models such
table benchmarks, which will be discussed in this as WizardCoder, Instruct-StarCoder, and Instruct-
benchmark section from Symflower [55]. CodeGen have similar performance. Lastly, the
best strategy for GPT-4 and GPT-3.5 is to generate code for the entire class, while a method-by-method strategy is a better choice for the other models.

Furthermore, SWE-bench [58] is a benchmark designed to evaluate LLMs' capabilities in real-world software engineering settings. As an evaluation framework, this benchmark comprises 2,294 tasks drawn from GitHub issues and their related pull requests across 12 well-known Python repositories. Solving issues in SWE-bench often requires comprehensively understanding and coordinating changes across various functions, classes, and even multiple files at the same time, requiring models to interact with execution environments, handle extensive contexts, and carry out complex reasoning beyond typical code generation tasks. The evaluation experiments revealed that both advanced proprietary models and the paper's fine-tuned model, SWE-Llama, can handle only the simplest issues; the best-performing model, Claude 2, can solve only 1.96% of the tasks with the BM25 retriever. SWE-bench thus reflects real-world coding environments, where solutions must be immediately applicable in open-source software development.

Finally, BigCodeBench [59] is a new benchmark designed to evaluate LLMs on tackling practical and complex programming tasks while ensuring no data contamination. Because real-world software development involves many library and function calls, concerns have been raised about HumanEval, which is a simpler benchmark and is affected by contamination and overfitting problems. BigCodeBench includes 1,140 function-level tasks that require LLMs to use various libraries and compose multiple function calls; on average, there are 5.6 test cases per task with a branch coverage of 99%. The benchmark tests performance using the Pass@1 metric with greedy decoding, measuring the percentage of tasks correctly solved by the first generated code against curated test cases. For the experiments, BigCodeBench ensures task quality through collaboration between GPT-4 and 20 human experts, refining the tasks and test cases in a sandbox environment. The tasks are further evaluated by other LLMs and cross-checked by 7 experts, with a resulting average human performance of 97%.

6 LLMs' Applications in Code Generation and Development

LLMs have transformed code generation in software development, as these models provide user assistance in a variety of coding tasks, such as code completion and code translation. In addition, each model has unique strengths, and developers must understand and leverage them effectively in their workflows. Selecting a suitable model for the task gives users the advantages of enhanced productivity, streamlined processes, fewer errors, and maximized potential. For example, OpenAI Codex benefits users whose workflow depends on GitHub due to the integration of Codex with GitHub Copilot [60]. Furthermore, for developers working on AWS, CodeWhisperer provides domain-specific insights and customized recommendations that position it as one of the top LLMs for cloud computing-focused development [60]. Moreover, tools have been designed to augment LLMs' capabilities; for example, CodeAgent, an LLM-based agent framework, was developed to assist LLMs in handling complicated programming tasks [61]. The following sections cover several aspects of the broader code generation topic, categorized into three groups of tasks: (A) code completion and code generation (foundational tasks), (B) code search and advanced code generation (advanced tasks), and (C) code translation and debugging (auxiliary tasks).

6.1 Code Completion and Code Generation

Xu [62] introduces an advanced code autocompletion tool, GitHub Copilot, born from the collaboration between GitHub's vast software development resources and OpenAI's groundbreaking AI development. GitHub Copilot uses deep learning models such as recurrent neural networks (RNNs) and Transformers, which help the model learn the syntactic and semantic structure of code as well as developers' coding habits. In addition, it is capable of generating context-sensitive code suggestions from training data spanning multiple open-source code libraries and developers' code contributions. GitHub Copilot analyzes the code context from developers, including the code currently being written (functions, classes, and methods), to understand developers' intent. Subsequently, based on the given inputs, it generates code suggestions in real time and can continuously modify them depending on what developers need. This allows developers to learn from the coding community's knowledge and experience, and it enhances the reusability, quality, and efficiency of the code. Furthermore, another advantage worth mentioning is that GitHub Copilot can save developers a lot of time and effort by reducing the manual code written for repetitive or boilerplate tasks.

Moreover, Meta AI [63] released a state-of-the-art LLM, Code Llama, for code generation, built on the Llama 2 architecture, which excels in code completion tasks. Three model variants are available:
Code Llama, the base model; Code Llama - Python, optimized for Python programming; and Code Llama - Instruct, fine-tuned for better understanding of and responses to NL instructions. Code Llama comes in four sizes of 7B, 13B, 34B, and 70B parameters; each model is trained on 500B tokens of code and code-related data, with the 70B model specifically trained on 1T tokens. The base and instruct models at 7B and 13B are trained with a fill-in-the-middle (FIM) capability, which enables the models to insert code within existing code and thus support tasks like code completion. Additionally, the smaller 7B and 13B models are quicker and more suitable for tasks requiring low latency, like real-time code completion. Code Llama was tested for its ability to complete code from given docstrings using the HumanEval benchmark. The results show that Code Llama surpassed open-source, code-focused LLMs and outperformed Llama 2, with the 34B model achieving 53.7% on HumanEval, matching ChatGPT in performance.
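Fill-in-the-middle prompting is usually expressed with special sentinel tokens that mark the prefix, the suffix, and the position where the model should write the missing middle. The sketch below only illustrates the idea; the exact token strings and spacing are an assumption and should be checked against the Code Llama model card.

    # Hypothetical infilling prompt layout (sentinel token names are an assumption).
    prefix = "def mean(xs):\n    total = 0\n"
    suffix = "\n    return total / len(xs)\n"
    fim_prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"

    # The model is expected to generate the missing middle, e.g.:
    #     for x in xs:
    #         total += x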
Lastly, Wang et al. [64] develop ToolGen to improve repository-level code generation by integrating autocompletion tools into the LLM generation process. It includes two phases: offline trigger insertion and model fine-tuning, and online tool-integrated code generation. ToolGen tackles dependency errors in code generation, such as undefined-variable and no-member errors, by using autocompletion tools to fill in repository-level dependencies. In the experiments, ToolGen was applied to three different LLMs (CodeGPT, CodeT5, and CodeLlama) and tested on two datasets, CodeSearchNet and CoderEval, to evaluate similarity-based and dependency-based effectiveness, and execution-based effectiveness, respectively. The results demonstrated improvements of 31.4% to 39.1% in Dependency Coverage and 44.9% to 57.7% in Static Validity Rate for the three LLMs, while maintaining competitive performance on the BLEU-4, CodeBLEU, Edit Similarity, and Exact Match metrics. In addition, ToolGen improved CodeT5 and Code Llama by 40% and 25%, respectively, and maintained the same pass rate for CodeGPT.

6.2 Code Search and Advanced Code Generation

Code search is an essential task in software development, allowing developers to efficiently create solutions to problems. An LLM-assisted tool that enhances this task is RepoRift. Jain et al. [65] introduce this advanced code search application, designed to improve code snippet retrieval using LLMs with Retrieval-Augmented Generation (RAG). It enhances user queries by injecting more context from GitHub repositories to address issues like ambiguity and the vocabulary mismatch problem. RepoRift utilizes a multi-stream ensemble architecture that refines the search results through multiple comparisons and generates the most relevant snippets. In an evaluation on the CodeSearchNet dataset, RepoRift significantly outperformed other code search methods, achieving success rates of 78.2% at Success@10 and 34.6% at Success@1. Furthermore, it delivers high accuracy with minimal preprocessing of the evaluation set and efficiently manages queries in different forms.

Extending code search capabilities, Feng et al. [66] present CodeBERT, a bimodal pre-trained model designed to understand and generate both NL and PL code, such as Python and Java. With a Transformer-based neural architecture and training on a hybrid objective function that includes the pre-training task of replaced token detection, CodeBERT can leverage both bimodal (NL-PL pairs) and unimodal (NL-only or PL-only) data. Based on this, CodeBERT shows strong potential in code search. For evaluation, CodeBERT is tested on a dataset for NL-PL probing, including NL code search in a zero-shot scenario, and compared with an NL-based pre-trained model, RoBERTa. In an experiment on the CodeSearchNet corpus, CodeBERT performed better and more consistently than RoBERTa. Moreover, on the documentation generation task across six PLs, CodeBERT outperformed RoBERTa, achieving a 1.3 BLEU score gain and state-of-the-art performance.
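A minimal sketch of embedding-based code search with the public microsoft/codebert-base checkpoint is shown below. Mean-pooling the last hidden states into a single vector is a simplification for illustration, not the exact retrieval setup used in the CodeBERT paper.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModel.from_pretrained("microsoft/codebert-base")

    def embed(text: str) -> torch.Tensor:
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=256)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
        return hidden.mean(dim=1).squeeze(0)            # crude sentence embedding

    query = "read a csv file into a dataframe"
    candidates = [
        "def load(path):\n    return pd.read_csv(path)",
        "def add(a, b):\n    return a + b",
    ]
    scores = [torch.cosine_similarity(embed(query), embed(c), dim=0).item() for c in candidates]
    print(max(zip(scores, candidates)))  # ideally the csv-loading snippet ranks first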
Switching to advanced code generation, Li et al. [67] develop a model named AlphaCode to handle competitive programming problems that require advanced problem-solving skills. It is initially pre-trained on selected GitHub code and fine-tuned on a curated dataset of competitive programming problems, CodeContests. The approach is to automatically generate millions of code samples, filter them according to their execution results, and cluster them, after which a small number of high-quality submissions are selected. For evaluation, using simulations on the Codeforces platform, AlphaCode's performance reached the top 54.3% among more than 5,000 human competitors. To improve the model, DeepMind developed a new dataset for training and evaluation, CodeContests, which combines data from multiple sources, ensures that the training data predate the evaluation problems, adds additional tests for accuracy, and evaluates submissions in a competitive programming-like setting. As a result, 34.2% of the long-held competitive problems from CodeContests were solved by the best model. Finally, for the model's strong and reliable performance, the paper identifies the following critical com-
ponents: a high-quality competitive programming dataset, efficient transformer models, and large-scale sampling.

6.3 Code Translation and Code Debugging

Hou and Ji [68] report in a study that GPT-4 is the top-performing LLM in generating programming code, outperforming other LLMs such as Gemini Ultra and Claude 2. It has been successful at various forms of programming tasks, including assisting in writing code, learning from coding error messages, and code translation. In a LeetCode and GeeksforGeeks coding contest between human programmers and LLMs, GPT-4's success rate reached over 90% on tasks that only about 20% of human participants could solve. This shows that GPT-4 has the ability to be a reliable coding assistant. Furthermore, using prompting strategies, GPT-4 demonstrated its ability to learn from past errors by salvaging over 60% of easy and medium tasks after failing on the first attempt. Finally, for the task of translating correct Python3 code into multiple different languages, it translated most of the tasks accurately. Surprisingly, in several medium tasks it even solved the programming task correctly despite being given incorrect original Python3 code, proving it a reliable tool for code translation.

Furthermore, Prenner et al. [69] investigate Codex's ability to detect and fix bugs, essential tasks for automated program repair. Codex, built on the GPT-3 architecture, has shown great potential in generating code from NL descriptions. In this paper, Codex's ability to fix software bugs was evaluated on the QuixBugs benchmark, which contains 40 bugs in both Java and Python, and its performance was compared with three APR approaches: CoCoNut, DeepDebug, and CURE. The results show that Codex performed surprisingly competitively, especially in Python, fixing 50% more bugs than in Java, despite not being trained on Automatic Program Repair (APR). Codex outperformed both CoCoNut and DeepDebug in Python and even outperformed CoCoNut in Java. Additionally, Codex's performance was also tested using different prompt strategies for bug localization and repair, revealing that prompts can significantly impact Codex's capability to fix bugs effectively.

Finally, a notable application that can greatly assist LLMs in code translation is Flourine. Flourine [70] is an end-to-end translation tool that validates translations through cross-language differential fuzzing, checking input-output equivalence between the original and translated code without requiring any test cases. Flourine implements a feedback strategy that provides input to the LLMs, allowing them to correct the identified counterexample. Experiments were carried out on 8,160 code translations of 408 code samples, four feedback strategies, and five LLMs, including GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. The benchmarks are collected from real-world GitHub projects, using C and Go as the source languages. The results revealed that the most successful LLM can correctly translate up to 47% of the benchmarks.
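The core check behind cross-language differential fuzzing can be sketched in a few lines: feed the same randomly generated inputs to the original and the translated program and report the first input on which their outputs diverge. run_original, run_translated, and gen_input are hypothetical callables (e.g., thin wrappers that invoke the compiled source program and the translated program); this is not Flourine's implementation.

    def differential_check(run_original, run_translated, gen_input, trials: int = 1000):
        """Return a counterexample input on which the two programs disagree, or None."""
        for _ in range(trials):
            x = gen_input()
            if run_original(x) != run_translated(x):
                return x  # counterexample: feed back to the LLM to repair the translation
        return None

    # Example usage with toy stand-ins for the two programs:
    import random
    cex = differential_check(
        run_original=lambda x: x * x,
        run_translated=lambda x: x * x if x < 100 else x,  # deliberately buggy "translation"
        gen_input=lambda: random.randint(0, 10_000),
    )
    print("counterexample:", cex)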
tool for code translation. to guide future development. Finally, we explore
Furthermore, Prenner et al. [69] investigate Codex's ability to detect and fix bugs, which are essential tasks for automated program repair. Codex, built on the GPT-3 architecture, has shown great potential in generating code from NL descriptions. In this paper, Codex's ability to fix software bugs was evaluated on the QuixBugs benchmark, which contains 40 bugs in both Java and Python, and its performance was compared with three APR approaches: CoCoNuT, DeepDebug, and CURE. The results show that Codex performed surprisingly competitively, especially in Python, where it fixed 50% more bugs than in Java, despite not being trained for Automatic Program Repair (APR). Codex outperformed both CoCoNuT and DeepDebug in Python and even outperformed CoCoNuT in Java. Additionally, Codex's performance was also tested using different prompt strategies for bug localization and repair, revealing that the prompt can significantly affect Codex's ability to fix bugs effectively.
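To illustrate what such prompt strategies can look like in practice, the snippet below builds two repair prompts for a small buggy function: one that simply asks for a fix and one that adds a localization hint. The buggy gcd function and the prompt wording are illustrative stand-ins and are not taken verbatim from the QuixBugs benchmark or from the prompts evaluated in [69].

```python
# A small buggy function in the spirit of the QuixBugs programs (illustrative only):
BUGGY_CODE = '''
def gcd(a, b):
    if b == 0:
        return a
    return gcd(a % b, b)   # bug: arguments are swapped; should be gcd(b, a % b)
'''


def repair_prompt(code, hint=None):
    """Build a bug-repair prompt; optionally add a bug-localization hint."""
    prompt = (
        "The following Python function contains a single bug.\n"
        f"{code}\n"
        "Return the corrected function only.\n"
    )
    if hint:
        prompt += f"Hint: the bug is located here: {hint}\n"
    return prompt


# Two prompt strategies to compare, e.g. by measuring how often each yields a correct fix:
plain_prompt = repair_prompt(BUGGY_CODE)
localized_prompt = repair_prompt(BUGGY_CODE, hint="the recursive call on the last line")
print(localized_prompt)
```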
Finally, a notable application that can greatly assist LLMs in code translation is Flourine [70], an end-to-end translation tool that validates translations through cross-language differential fuzzing, checking input-output equivalence between the original and translated code without requiring any test cases. Flourine implements a feedback strategy that provides input back to the LLM, allowing it to correct the identified counterexample. Experiments were carried out on 8,160 code translations of 408 code samples, four feedback strategies, and five LLMs, including GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. The benchmarks were collected from real-world GitHub projects, using C and Go as the source languages. The results revealed that the most successful LLM could correctly translate up to 47% of the benchmarks.
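The validation loop at the heart of this approach is straightforward to sketch. The code below is a minimal, single-language caricature of the idea, not Flourine's actual implementation [70]: the real tool fuzzes the original and translated programs across languages (e.g., C or Go against Rust), whereas here two Python callables stand in for them. The returned counterexample is exactly the kind of input a feedback strategy would hand back to the LLM.

```python
import random


def find_counterexample(original, translated, trials=1000, seed=0):
    """Differential fuzzing in miniature: run both versions on random inputs and
    return the first input where their outputs disagree, or None if they agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        expected, actual = original(x), translated(x)
        if expected != actual:
            # This disagreement is the counterexample that a feedback strategy
            # would hand back to the LLM together with the translated code.
            return {"input": x, "expected": expected, "actual": actual}
    return None  # no disagreement observed within the fuzzing budget


if __name__ == "__main__":
    def original(x):        # stands in for the source program
        return x // 2

    def translated(x):      # stands in for an LLM translation that silently
        return int(x / 2)   # changed floor division to truncation toward zero

    print(find_counterexample(original, translated))
```

Here the deliberately mistranslated division illustrates a typical cross-language pitfall: floor division in the source semantics versus truncation toward zero in the translation.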
7 Conclusion

This survey provides an overview of the recent landscape of LLMs for automatic code generation. To begin with, we point out the limits and challenges that LLMs face, such as resource constraints, syntactic and semantic errors, biases, and security risks, highlighting the factors that need to be mitigated. Subsequently, we discuss various fine-tuning techniques, including prompt engineering, reinforcement learning, and domain-specific dataset tuning, which are essential approaches for handling these issues and enhancing model performance and adaptability. We then examine the importance of evaluation metrics and benchmarks, as they are critical for assessing the effectiveness and reliability of the models, the techniques, and their generated code, and for guiding future development. Finally, we explore the significant potential of LLMs across many different coding tasks, including code generation, completion, search, debugging, and translation, which can significantly boost productivity and efficiency for users writing code.
References

[1] Discover Data Science, "How to become a Data Mining Specialist – A Complete Career Guide," Discover Data Science. Available: https://www.discoverdatascience.org/career-information/data-mining-specialist/?utm_source=chatgpt.com.
[2] edX, "Learn data mining with online courses and programs," edX. Available: https://www.edx.org/learn/data-mining?utm_source=chatgpt.com.
[3] O. Samuel, "How to Use Pandas for Data Cleaning and Preprocessing," freeCodeCamp, 2024. Available: https://www.freecodecamp.org/news/data-cleaning-and-preprocessing-with-pandasbdvhj/?utm_source=chatgpt.com.
[4] K. Ketan, "Large Language Models for Code Generation," Medium, 2023. Available: https://blog.fabrichq.ai/large-language-models-for-code-generation-f95f93fe7de4.
[5] OpenAI, "OpenAI Codex," OpenAI, 2021. Available: https://openai.com/index/openai-codex/.
[6] O. Mendelevitch, "Large Language Models for Code Generation - Part 1," Vectara, 2023. Available: https://www.vectara.com/blog/large-language-models-llms-for-code-generation-part-1.
[7] E. Anello, "How to Use GitHub Copilot: Use Cases and Best Practices," DataCamp, 2024. Available: https://www.datacamp.com/tutorial/github-copilot-a-complete-guide-for-beginners.
[8] IBM, "Large language models," IBM. Available: https://www.ibm.com/topics/large-language-models.
[9] AWS, "What is LLM (Large Language Model)?" AWS. Available: https://aws.amazon.com/what-is/large-language-model/#:~:text=Large%20language%20models%2C%20also%20known,decoder%20with%20self%2Dattention%20capabilities.
[10] Nvidia, "Large Language Models Explained," Nvidia. Available: https://www.nvidia.com/en-us/glossary/large-language-models/.
[11] Analytics Insight, "Exploring Large Language Models: Foundations and Applications," Analytics Insight, 2024. Available: https://www.analyticsinsight.net/llm/exploring-large-language-models-foundations-and-applications.
[12] L. Price, "Large language models: What is driving the hype behind LLMs in healthcare?" Nelson Advisors, 2023. Available: https://www.healthcare.digital/single-post/large-language-models-what-is-driving-the-hype-behind-llm-s-in-healthcare.
[13] J. D. Baierl, "Applications of Large Language Models in Education: Literature Review and Case Study," UCLA, 2023. Available: https://escholarship.org/uc/item/6kf0r28s.
[14] Daivi, "7 Top Large Language Model Use Cases And Applications," ProjectPro, 2024. Available: https://www.projectpro.io/article/large-language-model-use-cases-and-applications/887#mcetoc_1h6mcnr1022.
[15] A. Bleiweiss and N. Luo, "Mastering LLM Techniques: Data Preprocessing," Nvidia Technical Blog, 2024. Available: https://developer.nvidia.com/blog/mastering-llm-techniques-data-preprocessing/.
[16] AI Verse Info, "How does a Large Language Model (LLM) write Code," AI Verse Info, 2024. Available: https://aiverseinfo.com/how-llm-writes-code/?amp=1&fbclid=IwZXh0bgNhZW0CMTEAAR0CD-t2AHBiDS2dca56gLKHuXx6pb6AsAi2jiOyVZ96HDTumoYAoqTmJPU_aem_EzXu_6J4h8wi11f3wPAwrQ.
[17] M. Heller, "LLMs and the rise of the AI code generators," InfoWorld, 2023. Available: https://www.infoworld.com/article/2338500/llms-and-the-rise-of-the-ai-code-generators.html.
[18] Vellum, "LLM Leaderboard," Vellum. Available: https://www.vellum.ai/llm-leaderboard.
[19] AI/ML API, "GPT o1: Real-World Applications and Ultimate Prompt Guide," AI/ML API, 2024. Available: https://aimlapi.com/blog/gpt-o1-real-world-applications-and-ultimate-prompt-guide#:~:text=What%20Makes%20GPT%20AI%20o1,o1%20does%20the%20heavy%20lifting.
[20] V. Chhetri, "Why OpenAI's new AI model, code-named Strawberry, can be good and bad at the same time," Tech Funding News, 2024. Available: https://techfundingnews.com/why-openais-new-ai-model-code-named-strawberry-can-be-good-and-bad-at-the-same-time/.
[21] OpenAI, "Learning to Reason with LLMs," OpenAI. Available: https://openai.com/index/learning-to-reason-with-llms/.
[22] P. Schmid, O. Sanseviero, A. Bartolome, L. von Werra, D. Vila, V. Srivastav, M. Sun, and P. Cuenca, "Llama 3.1 - 405B, 70B & 8B with multilinguality and long context," Hugging Face, 2024. Available: https://huggingface.co/blog/llama31.
[23] D. Cleary, "Using LLMs for Code Generation: A Guide to Improving Accuracy and Addressing Common Issues," PromptHub, 2024. Available: https://www.prompthub.us/blog/using-llms-for-code-generation-a-guide-to-improving-accuracy-and-addressing-common-issues.
[24] D. Huang, Q. Bu, J. Zhang, X. Xie, J. Chen, and H. Cui, "Bias Testing and Mitigation in LLM-based Code Generation," arXiv preprint arXiv:2309.14345, 2023. Available: https://arxiv.org/abs/2309.14345.
[25] H. Hajipour, K. Hassler, T. Holz, L. Schönherr, and M. Fritz, "CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models," arXiv preprint arXiv:2302.04012, 2023. Available: https://arxiv.org/abs/2302.04012.
[26] L. Chen, N. K. Ahmed, A. Dutta, A. Bhattacharjee, S. Yu, Q. I. Mahmud, W. Abebe, H. Phan, A. Sarkar, B. Butler, N. Hasabnis, G. Oren, V. A. Vo, J. P. Munoz, T. L. Willke, T. Mattson, and A. Jannesari, "The Landscape and Challenges of HPC Research and LLMs," arXiv preprint arXiv:2402.02018, 2024. Available: https://arxiv.org/pdf/2402.02018.
[28] M. Hassid, T. Remez, J. Gehring, R. Schwartz, and Y. Adi, "The Larger the Better? Improved LLM Code-Generation via Budget Reallocation," arXiv preprint arXiv:2404.00725, 2024. Available: https://arxiv.org/html/2404.00725v1.
[29] Z. Wang, Z. Zhou, D. Song, Y. Huang, S. Chen, L. Ma, and T. Zhang, "Where Do Large Language Models Fail When Generating Code?" arXiv preprint arXiv:2406.08731, 2024. Available: https://arxiv.org/pdf/2406.08731.
[30] S. Dou, H. Jia, S. Wu, H. Zheng, W. Zhou, M. Wu, M. Chai, J. Fan, C. Huang, Y. Tao, Y. Liu, E. Zhou, M. Zhang, Y. Zhou, Y. Wu, R. Zheng, M. Wen, R. Weng, J. Wang, X. Cai, T. Gui, X. Qiu, Q. Zhang, and X. Huang, "What's Wrong with Your Code Generated by Large Language Models? An Extensive Study," arXiv preprint arXiv:2407.06153, 2024. Available: https://arxiv.org/html/2407.06153v1.
[31] R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, "Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code," arXiv preprint arXiv:2308.03109, 2023. Available: https://arxiv.org/abs/2308.03109.
[32] Y. Liu, T. Le-Cong, R. Widyasari, C. Tantithamthavorn, L. Li, X.-B. D. Le, and D. Lo, "Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues," ACM Journals, 2024. Available: https://dl.acm.org/doi/full/10.1145/3643674.
[33] C. Wang, Z. Li, C. Gao, W. Wang, T. Peng, H. Huang, Y. Deng, S. Wang, and M. R. Lyu, "Exploring Multi-Lingual Bias of Large Code Models in Code Generation," arXiv preprint arXiv:2404.19368, 2024. Available: https://arxiv.org/abs/2404.19368.
[34] Y. Liu, X. Chen, Y. Gao, Z. Su, F. Zhang, D. Zan, J.-G. Lou, P.-Y. Chen, and T.-Y. Ho, "Uncovering and Quantifying Social Biases in Code Generation," arXiv preprint arXiv:2305.15377, 2023. Available: https://arxiv.org/abs/2305.15377.
[36] He, "Large language models for code writing: Security assessment," Medium, 2023. Available: https://medium.com/@researchgraph/large-language-models-for-code-writing-security-assessment-f305f9f01ce9.
[37] G. S. Black, B. P. Rimal, and V. M. Vaidyan, "Balancing Security and Correctness in Code Generation: An Empirical Study on Commercial Large Language Models," IEEE Transactions on Emerging Topics in Computational Intelligence, 2024. Available: https://ieeexplore-ieee-org.ezproxy.lib.ou.edu/document/10658990.
[38] J. Wang, X. Luo, L. Cao, H. He, H. Huang, J. Xie, A. Jatowt, and Y. Cai, "Is Your AI-Generated Code Really Secure? Evaluating Large Language Models on Secure Code Generation with CodeSecEval," arXiv preprint arXiv:2407.02395, 2024. Available: https://arxiv.org/html/2407.02395v1.
[39] Turing, "Finetuning large language models: An in-depth guide," Turing. Available: https://www.turing.com/resources/finetuning-large-language-models.
[40] Z. Ma, H. Guo, J. Chen, G. Peng, Z. Cao, Y. Ma, and Y.-J. Gong, "LLaMoCo: Instruction Tuning of Large Language Models for Optimization Code Generation," arXiv preprint arXiv:2403.01131, 2024. Available: https://arxiv.org/pdf/2403.01131v1.
[41] M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui, "Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models," arXiv preprint arXiv:2308.10462, 2023. Available: https://arxiv.org/pdf/2308.10462.
[42] Y. Tsai, M. Liu, and H. Ren, "Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning," arXiv preprint arXiv:2407.05040, 2024. Available: https://arxiv.org/pdf/2407.05040.
[43] F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang, "ClarifyGPT: Empowering LLM-based Code Generation with Intention Clarification," arXiv preprint arXiv:2310.10996, 2023. Available: https://arxiv.org/pdf/2310.10996.
[44] J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve, "RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning," arXiv preprint arXiv:2410.02089, 2024. Available: https://arxiv.org/pdf/2410.02089.
[45] M. F. Wong and C. W. Tan, "Aligning Crowdsourced Human Feedback for Code Generation with Bayesian Inference," IEEE, 2024. Available: https://ieeecai.org/2024/wp-content/pdfs/540900a152/540900a152.pdf.
[46] Z. Sun, C. Lyu, B. Li, Y. Wan, H. Zhang, G. Li, and Z. Jin, "Enhancing Code Generation Performance of Smaller Models by Distilling the Reasoning Ability of LLMs," arXiv preprint arXiv:2403.13271, 2024. Available: https://arxiv.org/pdf/2403.13271v1.
[47] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, "AceCoder: Utilizing Existing Code to Enhance Code Generation," arXiv preprint arXiv:2303.17780, 2023. Available: https://arxiv.org/pdf/2303.17780.
[48] C. Tony, N. E. Díaz Ferreyra, M. Mutas, S. Dhiff, and R. Scandariato, "Prompting Techniques for Secure Code Generation: A Systematic Investigation," arXiv preprint arXiv:2407.07064, 2024. Available: https://arxiv.org/pdf/2407.07064.
[49] D. G. Paul, H. Zhu, and I. Bayley, "Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review," arXiv preprint arXiv:2406.12655, 2024. Available: https://arxiv.org/html/2406.12655v1.
[50] J. Ip, "LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide," Confident AI, 2024. Available: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation.
[51] J. Ip, "Evaluating LLM Systems: Essential Metrics, Benchmarks, and Best Practices," Confident AI, 2024. Available: https://www.confident-ai.com/blog/evaluating-llm-systems-metrics-benchmarks-and-best-practices.
[52] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis," arXiv preprint arXiv:2009.10297, 2020. Available: https://arxiv.org/pdf/2009.10297.
[53] S. Yeo, Y.-S. Ma, S. C. Kim, H. Jun, and T. Kim, "Framework for evaluating code generation ability of large language models," Wiley Online Library, 2024. Available: https://onlinelibrary.wiley.com/doi/10.4218/etrij.2023-0357.
[54] T. Y. Zhuo, "ICE-Score: Instructing Large Language Models to Evaluate Code," arXiv preprint arXiv:2304.14317, 2024. Available: https://arxiv.org/abs/2304.14317.
[55] Symflower, "Comparing LLM Benchmarks," Symflower, 2024. Available: https://symflower.com/en/company/blog/2024/comparing-llm-benchmarks/.
[56] Z. Wang, "HumanEval: Decoding the LLM Benchmark for Code Generation," Deepgram, 2023. Available: https://deepgram.com/learn/humaneval-llm-benchmark.
[57] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou, "ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation," arXiv preprint arXiv:2308.01861, 2023. Available: https://arxiv.org/pdf/2308.01861.
[58] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770, 2023. Available: https://arxiv.org/pdf/2310.06770.
[59] T. Y. Zhuo, J. Liu, Q. Liu, B. Hui, N. Muennighoff, D. Fried, H. de Vries, L. von Werra, and C. Fourrier, "BigCodeBench: The Next Generation of HumanEval," Hugging Face, 2024. Available: https://huggingface.co/blog/leaderboard-bigcodebench.
[60] Onome, "Top LLMs for Coding All Developers Should Know About," AutoGPT, 2024. Available: https://autogpt.net/top-llms-for-coding-all-developers-should-know-about/#:~:text=CodeQwen1.,working%20with%20diverse%20programming%20languages.
[61] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, "CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges," arXiv preprint arXiv:2401.07339, 2024. Available: https://arxiv.org/abs/2401.07339.
[62] H. Xu, "Github Copilot - A Groundbreaking Code Autocomplete Tool," Research Gate, 2023. Available: https://www.researchgate.net/publication/376406939_Github_Copilot_-_A_Groundbreaking_Code_Autocomplete_Tool.
[69] J. A. Prenner, H. Babii, and R. Robbes, "Can OpenAI's Codex Fix Bugs?" IEEE, 2022. Available: https://ieeexplore-ieee-org.ezproxy.lib.ou.edu/stamp/stamp.jsp?tp=&arnumber=9809175.
[70] H. F. Eniser, H. Zhang, C. David, M. Wang, M. Christakis, B. Paulsen, J. Dodds, and D. Kroening, "Towards Translating Real-World Code with LLMs: A Study of Translating to Rust," arXiv preprint arXiv:2405.11514, 2024. Available: https://arxiv.org/abs/2405.11514.