A Review on Code Generation with LLMs: Application and Evaluation

Abstract—Code generation is a longstanding subject in the field of computer science and software engineering, which aims at realizing an agent capable of writing code automatically in line with human intent. With the booming development of large language models (LLMs) in recent years, code generation techniques powered by LLMs with strong coding ability have caught many researchers' interest. In this study, we conduct a review of recent studies about code generation with LLMs, from the application of LLM-based code generation to the evaluation of LLM-generated code. We find that, with the powerful code understanding and writing ability LLMs provide, these novel techniques can be applied to manage various software engineering tasks and indeed boost the productivity of developers to a great extent. But we also find that, as an equally important subject, the evaluation receives less attention from researchers than the application. We summarize several limitations in existing studies on the evaluation of code generated by LLMs, such as the inadequate set of quality characteristics considered, and we argue that more effort is needed to narrow the gap between research on the evaluation and on the application.

Index Terms—large language models (LLMs), code generation, code completion, automatic program repair, code quality evaluation
I. INTRODUCTION

Code generation, sometimes also called program synthesis, is a fascinating topic for researchers in computer science and software engineering. A machine capable of automatically writing desired code according to human demand is the researchers' ultimate purpose. With consistent efforts over decades, they have made great progress, but there is still a great distance to fully automatic and practical code generation, especially generating code based on natural language descriptions. However, the emergence of large language models (LLMs) in recent years, the GPT series in particular, provides researchers with a new possibility to achieve this ultimate purpose. With massive numbers of parameters (up to 175B), terabyte-level training data, and well-designed architectures, LLMs are endowed with unprecedentedly powerful language understanding, processing, and generation capability. Coincidentally, in order to make code more understandable for humans, most programming languages, especially high-level ones, are designed based on natural languages. Therefore, it is intuitive to think that LLMs trained on massive amounts of existing code can become able to read and write code smoothly, and this exciting assumption has been proven true by researchers in recent years. With sufficient code and accompanying descriptions from human programmers in the training data, LLMs can generate code according to given natural language descriptions of the target code, incomplete code segments to be completed, or even buggy code snippets to be fixed. And once code thoroughly generated by LLMs is obtained, to what extent this code can be trusted becomes a new and considerable question for researchers.

In this study, we mainly review 20 existing studies about code generation with LLMs published in the past five years. We review these studies from two aspects: the application of code generation with LLMs and the evaluation of code generated by LLMs. In terms of application, we focus on three prevailing kinds: description to code, code completion, and automatic program repair. As for evaluation, we discuss related studies based on the quality characteristics of generated code each study concerns, which are accordingly divided into three parts: functional correctness, security, and others (quality characteristics considered less often by these studies). The studies we review are presented in Table I, categorized by the subsection in which each is discussed in this paper.

TABLE I
OVERVIEW OF STUDIES REVIEWED BY US

Section^a      Subsection^b                Reviewed Studies
APPLICATION    Description to Code         [12] [13] [14]
               Code Completion             [15] [16] [17]
               Automatic Program Repair    [19] [20] [21]
EVALUATION     Functional Correctness      [5] [25] [27] [28] [29]
               Security                    [30] [31] [32] [33]
               Others                      [34] [36]

^a Each refers to the corresponding section in this paper, where APPLICATION and EVALUATION indicate Sections III and IV respectively.
^b Each refers to the corresponding subsection in this paper.

By reviewing these studies, we find there is a significant gap between research on the application and on the evaluation of LLM-based code generation. Many pioneering researchers have been exploiting the potential of code generation with LLMs in actual applications and have obtained many impressive results.
B. Code Generation

Code generation, also known as program synthesis, meaning the automatic construction of software, or self-writing code [9], is a longstanding topic discussed in the fields of software engineering, computer science, programming languages, and artificial intelligence. Code generation aims at automatically generating code according to given descriptions or existing code.

For a long time, most research about code generation was limited to domain-specific languages (DSLs), and programmers had to write tedious formal specifications as descriptions for generation [10]. However, with the development of artificial intelligence, many machine learning and deep learning techniques have been applied to code generation, making it more powerful and flexible [11]. Since most programming languages, especially high-level ones, share many similarities with natural language, it is intuitive to utilize powerful language models in code generation, and this idea has been proven very effective by studies in recent years.

Fig. 1. The Heap-Sort C++ Function Completed by Copilot and the Test Output.

III. APPLICATION OF CODE GENERATION WITH LLMS

With surprising natural language processing ability, LLMs can not only read and write code but also handle documentation, comments, warnings, or error messages generated along with the development of software. Thus, code generation with LLMs has great potential for automatically managing various software engineering tasks. In recent years, researchers have been exploring this ability and have found many impressive applications. We discuss three prevailing applications here: Description to Code, Code Completion, and Automatic Program Repair, as shown in Fig. 2. Besides these three, there are other attractive applications, like code translation, test generation, documentation generation, etc.
Fig. 2. Three Prevailing Applications: Description to Code, Code Completion, and Automatic Program Repair.

A. Description to Code

Nothing would be more exciting for software developers than a machine that can generate code according to given descriptions written in natural language. Some researchers have been trying this with several state-of-the-art LLMs.

Finnie-Ansley et al. [12] applied Codex, mentioned above, to introductory programming. They used 23 questions from two introductory programming course tests as descriptions for Codex to generate Python code. They found Codex performed better than most student participants in these two tests and that its solutions include many variations. They thought LLMs with the ability to write code, like Codex, could be both an opportunity and a threat to introductory programming education.

Jiang et al. [13] developed a natural language code synthesis tool, GetLine, and conducted a user study with it. GetLine is backed by LaMDA and provides a user interface. Users can input natural language requests and select a target programming language, and GetLine then generates multiple outputs for users to choose from. Finally, the authors drew several useful implications for the design of future code synthesis tools from their user study.

Dong et al. [14] proposed a self-collaboration framework for LLMs to enhance their capability of solving coding problems. They asked three ChatGPT instances to play the analyst, coder, and tester roles along the software development process respectively, and coding problems were fed in as user requirements. These three roles then interact and collaborate by chatting to generate code. Through a comprehensive evaluation, they found the performance of this self-collaboration code generation is 30% higher than that of the naive direct approach.
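To make the division of roles concrete, the following is a minimal Python sketch of such a role-based loop; the `chat` callable, the role prompts, and the stopping condition are illustrative assumptions of ours, not the framework implemented in [14].

```python
from typing import Callable

def self_collaborate(requirement: str,
                     chat: Callable[[str, str], str],
                     max_rounds: int = 3) -> str:
    # `chat(role, message)` is a hypothetical, user-supplied wrapper around a chat LLM;
    # the role prompts and the stop signal below are illustrative assumptions.
    plan = chat("analyst", f"Decompose this requirement into subtasks:\n{requirement}")
    code = chat("coder", f"Implement this plan in Python:\n{plan}")
    for _ in range(max_rounds):
        report = chat("tester", f"Review and test this code and report any defects:\n{code}")
        if "no defects" in report.lower():   # assumed stop signal from the tester role
            break
        code = chat("coder", f"Revise the code according to this test report:\n{report}\n\n{code}")
    return code
```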
B. Code Completion

Code completion is an indispensable feature of integrated development environments (IDEs), which offers developers code suggestions according to the available contextual information. The ability of code completion stayed at the syntactic level for a long time, and the productivity gain it brought was limited. However, LLMs, designed for NLP, can naturally predict what users might type next based on existing code and comments, just like word prediction, lifting code completion to the semantic level. Copilot, the powerful code completion tool based on Codex, has attracted many researchers' and developers' attention.
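As an illustration of what completion at the semantic level means (our own example, not taken from the reviewed studies), an LLM-based assistant can turn an intent comment and a function signature into a complete body, whereas a purely syntactic engine would at best suggest identifier names:

```python
# Context given to the model: an intent comment plus a signature.
# Return the n largest values of a list, in descending order.
def top_n(values: list[int], n: int) -> list[int]:
    # A plausible LLM-suggested body:
    return sorted(values, reverse=True)[:n]

print(top_n([3, 1, 4, 1, 5, 9, 2, 6], 3))  # prints [9, 6, 5]
```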
Moroz et al. [15] studied the applications and shortcomings of Copilot. They found Copilot has many advantages as a programming assistant: for example, it can be a good assistant for skilled programmers as well as for novice developers. They also found some problems and gaps in its applications, like unchecked low-quality code. Overall, they suggested Copilot still has lots of room to grow, and more effort should be taken to make it safer, more reliable, and more effective.

Ziegler et al. [16] conducted a case study with Copilot to find out its impact on the productivity of users. Combining objective usage data and the subjective perceptions of developers, they found the acceptance rate of suggestions to be a good predictor of the productivity developers perceive, which can reflect users' perception to some extent. Moreover, they found the acceptance rate varies among developers, depending on their behavior.

Barke et al. [17] conducted a grounded theory analysis, aiming to understand how programmers interact with code-generating models like Copilot. They found users' interactions with Copilot can be classified into two modes: acceleration and exploration. Acceleration mode can greatly boost users' productivity, and exploration mode can help users handle unfamiliar tasks. Based on this finding, they also proposed some recommendations for users.

C. Automatic Program Repair

Because of the high cost of traditional software maintenance approaches, which occupy over half of the software development lifecycle [18], automatic program repair (APR) techniques have been studied by many commercial companies and academic institutions for years. The booming development of LLMs in recent years undoubtedly provides a new possibility for APR.

According to Xia and Zhang [19], template-based tools have the best performance among traditional APR methods; they fix bugs by matching specific buggy code patterns and applying corresponding patches. But in this way, the capability of template-based tools is limited by the coverage of their finite pattern base. To overcome this issue, some researchers apply machine learning techniques to APR, treating bug fixing as a neural machine translation task that translates buggy code into correct code. Thus, the emerging LLMs capable of handling various NLP tasks can be a promising solution.
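To illustrate this buggy-to-fixed framing, consider the following illustrative pair of our own (not drawn from the reviewed benchmarks); a repair model is expected to map the first function to the second:

```python
# Buggy input: off-by-one error, the last element is never added.
def sum_list_buggy(xs: list[int]) -> int:
    total = 0
    for i in range(len(xs) - 1):   # bug: should iterate over range(len(xs))
        total += xs[i]
    return total

# Expected repaired output.
def sum_list_fixed(xs: list[int]) -> int:
    total = 0
    for i in range(len(xs)):       # fix: include the last element
        total += xs[i]
    return total
```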
Kolak et al. [20] noticed the potential capability of language models in distinguishing bugs and patches, so they studied the performance of LLMs of different scales on program repair tasks. They selected three publicly available versions of PolyCoder and Codex, and then tested these four models on 80 buggy programs.
According to their results, they found that larger models are more successful patch generators and tend to generate patches more similar to those written by real developers.

Sobania et al. [21] evaluated and analyzed the performance of ChatGPT (GPT-3.5) in bug fixing. They selected all Python problems from the QuixBugs [22] benchmark and asked ChatGPT for a bug fix four times for each problem. Compared with previous approaches, ChatGPT notably outperformed traditional APR tools and is competitive with other LLM-based approaches, like Codex. Their results also show that, as a dialogue system, ChatGPT can outperform the state of the art when given further information like error messages.

Xia and Zhang [19] proposed a fully automated conversation-driven APR approach, CHATREPAIR, based on ChatGPT. This approach keeps asking ChatGPT to generate patches and giving detailed test results back until the correct patch is obtained. By evaluating CHATREPAIR against other APR tools on two widely studied benchmarks, they were surprised to find that, even as a conversational APR approach, CHATREPAIR obtained state-of-the-art performance.
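The conversational pattern behind such approaches can be sketched as a simple feedback loop. The sketch below is our own simplification under stated assumptions (a user-supplied `ask_llm` function and a `run_tests` runner that returns failure messages); it is not the CHATREPAIR implementation.

```python
from typing import Callable, Optional

def conversational_repair(buggy_code: str,
                          failing_tests: str,
                          ask_llm: Callable[[str], str],
                          run_tests: Callable[[str], list[str]],
                          max_attempts: int = 10) -> Optional[str]:
    """Repeatedly ask the model for a patch, feeding test failures back, until the tests pass."""
    prompt = (f"The following code fails these tests:\n{failing_tests}\n"
              f"Code:\n{buggy_code}\nPlease provide a fixed version.")
    for _ in range(max_attempts):
        candidate = ask_llm(prompt)           # hypothetical call to a chat LLM
        failures = run_tests(candidate)       # empty list means every test passes
        if not failures:
            return candidate                  # plausible patch found
        feedback = "\n".join(failures)
        prompt = (f"The previous fix still fails:\n{feedback}\n"
                  f"Code:\n{candidate}\nPlease try again.")
    return None                               # give up after max_attempts
```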
IV. EVALUATION OF CODE GENERATED BY LLMS

Software source code quality has been an important subject in the field of software engineering for decades, and research on it can be traced back to the 1960s [23]. Many researchers have proposed various metrics, methods, and models for source code quality evaluation in different situations, which is also called software quality evaluation at the source level. These evaluation approaches have become the basis of software source code quality assurance and improvement practices in today's industry.

When researchers noticed the great potential of LLMs in writing code, concerns about the quality of LLM-generated code began to emerge at the same time. If unreliable code generated by LLMs is directly introduced into software without careful checking, it may lead to disastrous outcomes like system breakdowns or private data leaks. Therefore, much research in recent years has focused on evaluating LLM-generated code from the perspective of software source code quality.

Software quality consists of multiple quality characteristics (sometimes also called quality factors or quality attributes), such as functional suitability, performance efficiency, compatibility, etc. according to ISO/IEC 25010 [24], along with many corresponding sub-characteristics. We therefore review the research on LLM-generated source code evaluation and present it according to the quality characteristics or sub-characteristics each study concerns. Specifically, we present the studies in three categories: Functional Correctness, Security, and Others.

A. Functional Correctness

As mentioned above, automatic code generation is a longstanding topic that has been studied for decades. In order to assess the performance of a newly proposed code generation approach and compare it with existing ones, a proper code evaluation metric (CEM) is indispensable for the development of code generation techniques, and there are mainly two prevailing types of CEMs: match-based and execution-based [25].

Since code generation can be viewed as a translation from description to code, the match-based evaluation metrics commonly used in machine translation can be helpful; they evaluate by calculating the similarity between generated sentences and reference sentences, such as BLEU [26] and CodeBLEU [27]. However, these match-based CEMs fall short in reflecting the functional correctness of generated code, because they are unable to decide the functional equivalence between generated code and reference code [25]. Therefore, execution-based CEMs were proposed by later researchers.
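A small example of our own illustrates this deficiency: the two functions below are functionally equivalent but share few tokens, so a match-based metric that takes the first as the reference would give the second a low score despite its correctness.

```python
# Reference solution.
def mean_ref(xs: list[float]) -> float:
    return sum(xs) / len(xs)

# Functionally equivalent candidate with little surface overlap;
# a match-based metric such as BLEU scores it low against mean_ref.
def mean_gen(values: list[float]) -> float:
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count

assert mean_ref([1.0, 2.0, 3.0]) == mean_gen([1.0, 2.0, 3.0]) == 2.0
```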
Currently, the two most commonly used execution-based CEMs for evaluating the functional correctness of code are AvgPassRatio [28] and Pass@k [29], which both depend on executing the generated code against a prepared test set. In order to evaluate the Codex models they developed, Chen et al. [5] released a new evaluation problem set, HumanEval, and calculated the Pass@k value.
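For reference, the definitions commonly used in this line of work can be written as follows (our rendering, intended to be consistent with [28] and [5]): AvgPassRatio averages, over the problem set P, the fraction of test cases T_p that the generated solution for problem p passes, while Pass@k estimates the probability that at least one of k sampled solutions is correct, where n >= k samples are drawn per problem and c of them pass all tests.

```latex
\mathrm{AvgPassRatio} \;=\; \frac{1}{|P|} \sum_{p \in P} \frac{1}{|T_p|} \sum_{t \in T_p}
    \mathbf{1}\big[\text{the code generated for } p \text{ passes test case } t\big],
\qquad
\mathrm{Pass@}k \;=\; \mathbb{E}_{p \in P}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right].
```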
Because of the need for repeated execution on test cases, Dong et al. [25] argued that the computation of execution-based CEMs is costly, slow, and insecure, while match-based CEMs are inaccurate, so they proposed a new LLM-based CEM, CodeScore, for functional correctness evaluation. They defined PassRatio based on AvgPassRatio and made CodeScore an execution-free alternative to PassRatio. In their evaluation with this LLM-based framework, CodeScore outperforms match-based CEMs in accuracy with respect to PassRatio and costs much less time than execution-based CEMs.

B. Security

The powerful code generation capability of LLMs relies on the massive amount of code snippets available in training data, which are mostly from open-source code repositories. However, it is likely that potential vulnerabilities or even malicious code snippets exist in this open-source code, which may leak into the output of LLMs and harm the security of the developed software. Many researchers share this worry and have evaluated the security of LLM-generated code from various aspects.

Pearce et al. [30] focused on the security of Copilot's code contributions. First, based on MITRE's Common Weakness Enumerations (CWEs), they created a prompt dataset containing various security-relevant scenarios. The security of the completed code was evaluated with an automated analysis tool and manual inspection. They found that, in about 44% of all scenarios, Copilot did generate code with a relevant weakness, and some of the weaknesses are introduced more frequently than others.
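As a typical example of the kind of weakness such scenarios probe (our own illustration of CWE-89, SQL injection, not a prompt from [30]), a completion that builds a query by string interpolation is vulnerable, while the parameterized form is not:

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # CWE-89: untrusted input is interpolated directly into the SQL statement.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, name: str):
    # Parameterized query: the driver handles escaping of the value.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```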
Sandoval et al. [31] conducted a user study to find out whether student programmers write more insecure code with the help of an LLM-based code assistant. Considering that real-world programming is mostly project-based, they designed a "shopping list" C program completion task.
They found that the number of severe security bugs produced by LLM-assisted participants was no more than 10% greater than that of the control group.

Asare et al. [32] conducted a comparative empirical analysis to find out whether Copilot is as bad as human developers at introducing vulnerabilities when writing code. Based on a dataset of C/C++ vulnerabilities from several real-world projects, they recreated the same scenarios for Copilot by deleting the bug- or patch-relevant code. They found Copilot introduced the same vulnerabilities as humans only one-third of the time, i.e., it is not as bad as human developers.

Khoury et al. [33] conducted experiments to evaluate the security of code generated by ChatGPT. They designed 21 problems across 5 programming languages, and each problem is prone to introducing a specific CWE vulnerability when solved. In conclusion, they found ChatGPT can frequently generate insecure code and that experienced programmers are still irreplaceable for producing sufficiently reliable code.

C. Others

There are also some researchers concerned with other aspects of source code quality besides functional correctness and security.

Understandability. Understandability reflects how easily programmers can fully understand the logic and function of a specific code snippet. Nguyen et al. [34] conducted an empirical study to evaluate the correctness and understandability of code generated by Copilot. For the evaluation of understandability, they calculated the cyclomatic complexity and cognitive complexity of the generated code, both of which are associated with understandability according to a previous study [35]. Overall, they thought Copilot could produce easily understandable code under their experimental circumstances, but their data may not be enough to draw a general conclusion.
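As a reminder of what these metrics capture (our own example, not code from [34]), cyclomatic complexity counts linearly independent paths, i.e., decision points plus one; the function below therefore has a cyclomatic complexity of 4, and its cognitive complexity is also low because there is no nesting:

```python
def classify_bmi(weight_kg: float, height_m: float) -> str:
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:      # decision 1
        return "underweight"
    elif bmi < 25.0:    # decision 2
        return "normal"
    elif bmi < 30.0:    # decision 3
        return "overweight"
    return "obese"      # cyclomatic complexity = 3 decisions + 1 = 4
```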
Maintainability and Readability. As defined by ISO/IEC 25010 [24], maintainability represents the degree of effectiveness and efficiency with which a product or system can be modified to improve it, correct it, or adapt it to changes in environment and requirements. Readability, which is similar to understandability, evaluates the complexity of code in the syntactic aspect, while understandability conducts the evaluation in the dynamic aspect [35]. Siddiq et al. [36] conducted an empirical study on code smells in the training code and the generated code of LLM-based code generation techniques, as well as the relation between them. Code smells are patterns that indicate lower maintainability and readability, and can include security issues, design-decision issues, and coding standard violations in source code. According to their results, they concluded that bad code patterns in training code do leak into the code generated by models, because they found the types of code smells in the generated code to be a subset of those in the training code.
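For illustration (our own example, not drawn from the dataset studied in [36]), a generated snippet that silently swallows every exception exhibits a classic smell that hurts maintainability; the second variant below avoids it by catching only the expected error:

```python
# Smell: a bare "except" hides every failure, including programming errors.
def parse_port(value: str) -> int:
    try:
        return int(value)
    except:                      # noqa: E722 (this is the smell under discussion)
        return 0

# Cleaner alternative: catch only the expected error and make the fallback explicit.
def parse_port_clean(value: str, default: int = 0) -> int:
    try:
        return int(value)
    except ValueError:
        return default
```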
V. DISCUSSION

With the rapid development of code generation with LLMs, many researchers have been exploring its possibilities in practice. As we reviewed in Section III, LLM-based code generation techniques have astonishing potential and various applications in managing software engineering tasks, and they can greatly boost the productivity of developers. However, as an equally critical subject, the evaluation of LLM-generated code fails to keep up with the application. By reviewing the related research in Section IV, we found the following limitations in existing studies on the evaluation of code generated by LLMs.

Inadequate quality characteristics for evaluation. According to ISO/IEC 25010 [24], there are plenty of quality characteristics and sub-characteristics for evaluating code from different aspects. However, most existing research only focuses on the functional correctness and security of LLM-generated code. We think that is because these two are truly the most elementary and concerning aspects of code, and also the most easily perceived by programmers and by users of the final software products. However, many other characteristics can also impact the overall quality of code to a great extent, like compatibility, maintainability, portability, etc.

Lack of a systematic and quantitative evaluation model. For traditional code, there are many studies proposing systematic quality [37] (or, similarly, trustworthiness [38]) evaluation models, which can produce a quantitative evaluation result about the quality of code. However, to the best of our knowledge, there is still no research applying these traditional methods to the evaluation of LLM-generated code. Though there may be some challenges to conquer, we believe these traditional models can bring many benefits to current research on LLM-generated code evaluation.

Ignoring human engagement when conducting the evaluation. According to Sandoval et al. [31], there are generally two kinds of software development modes with LLMs: the autopilot mode and the assisted mode. Due to the limited functional correctness and potential vulnerabilities of LLM-generated code, the assisted mode is the more practical way at the present stage, as in the practices of Copilot and ChatGPT. For the evaluation of code generated in this way, human engagement should be taken into consideration as well, because developers' prompt strategies, suggestion selections, and other behaviors do affect the final code. However, from our review, we found most research failed to consider the role of human programmers when conducting the evaluation.

Lack of specific research on evaluation. We found that in most existing research mentioning LLM-generated code quality evaluation, the initial motivation of the evaluation is to measure and compare the performance of LLMs in terms of code generation. For instance, functional correctness, which is used most by researchers, has become a golden metric to embody the code-writing ability of LLMs. Therefore, there is a lack of research focusing specifically on the evaluation of code generated by LLMs, and we think such research is much needed, because it is meaningful not only for developers of LLMs but also for users, as it can help programmers evaluate the trustworthiness of generated code.

In our future work, we will treat LLM-assisted coding as a brand-new software development approach and evaluate the trustworthiness of the resulting software. With this new development process, we have to make some necessary changes
to the traditional software trustworthiness model to adapt to it. We will focus not only on the generated source code but also on the whole LLM-assisted process, including the capability of the specific LLM, the interaction between human developers and models, and so on.

VI. CONCLUSION

In this study, we review recent research on code generation with LLMs from two aspects: the application of code generation with LLMs and the evaluation of code generated by LLMs. We find that, with the help of the recently emerging powerful LLMs, code generation techniques can successfully handle more complex tasks than before, and many researchers have proposed various novel applications of LLM-based code generation. However, research on the evaluation of LLM-generated code fails to keep pace with the application. By reviewing a limited number of related studies, we found some limitations at the current research stage, and we think more effort is needed to fill the gap in the future.

REFERENCES

[1] S. J. Russell et al., Artificial Intelligence: A Modern Approach, Fourth edition, Global edition, in Pearson Series in Artificial Intelligence. Harlow: Pearson, 2022.
[2] A. Radford and K. Narasimhan, "Improving Language Understanding by Generative Pre-Training," 2018.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language Models are Unsupervised Multitask Learners," 2019.
[4] T. B. Brown et al., "Language Models are Few-Shot Learners." arXiv, Jul. 22, 2020. doi: 10.48550/arXiv.2005.14165.
[5] M. Chen et al., "Evaluating Large Language Models Trained on Code." arXiv, Jul. 14, 2021. doi: 10.48550/arXiv.2107.03374.
[6] "GitHub Copilot · Your AI pair programmer," GitHub. https://github.com/features/copilot (accessed Jun. 20, 2023).
[7] "ChatGPT." https://openai.com/chatgpt (accessed Jun. 20, 2023).
[8] "GPT-4." https://openai.com/gpt-4 (accessed Jun. 20, 2023).
[9] C. David and D. Kroening, "Program synthesis: challenges and opportunities," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 375, no. 2104, p. 20150403, Sep. 2017, doi: 10.1098/rsta.2015.0403.
[10] B. Gu, B. Yu, X. G. Dong, X. F. Li, R. M. Zhong, and M. F. Yang, "Intelligent program synthesis techniques: Literature review," Ruan Jian Xue Bao/Journal of Software, vol. 32, no. 5, pp. 1373-1384, 2021 (in Chinese). http://www.jos.org.cn/1000-9825/6200.htm
[11] E. Dehaerne, B. Dey, S. Halder, S. De Gendt, and W. Meert, "Code Generation Using Machine Learning: A Systematic Review," IEEE Access, vol. 10, pp. 82434-82455, 2022, doi: 10.1109/ACCESS.2022.3196347.
[12] J. Finnie-Ansley, P. Denny, B. A. Becker, A. Luxton-Reilly, and J. Prather, "The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming," in Proceedings of the 24th Australasian Computing Education Conference (ACE '22), New York, NY, USA: Association for Computing Machinery, Feb. 2022, pp. 10-19. doi: 10.1145/3511861.3511863.
[13] E. Jiang et al., "Discovering the Syntax and Strategies of Natural Language Programming with Generative Language Models," in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22), New York, NY, USA: Association for Computing Machinery, Apr. 2022, pp. 1-19. doi: 10.1145/3491102.3501870.
[14] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration Code Generation via ChatGPT." arXiv, Apr. 15, 2023. Accessed: Apr. 21, 2023. [Online]. Available: http://arxiv.org/abs/2304.07590
[15] E. A. Moroz, V. O. Grizkevich, and I. M. Novozhilov, "The Potential of Artificial Intelligence as a Method of Software Developer's Productivity Improvement," in 2022 Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), Jan. 2022, pp. 386-390. doi: 10.1109/ElConRus54750.2022.9755659.
[16] A. Ziegler et al., "Productivity Assessment of Neural Code Completion." arXiv, May 13, 2022. doi: 10.48550/arXiv.2205.06537.
[17] S. Barke, M. B. James, and N. Polikarpova, "Grounded Copilot: How Programmers Interact with Code-Generating Models." arXiv, Oct. 31, 2022. doi: 10.48550/arXiv.2206.15000.
[18] J. J. Jiang, J. J. Chen, and Y. F. Xiong, "Survey of automatic program repair techniques," Ruan Jian Xue Bao/Journal of Software, vol. 32, no. 9, pp. 2665-2690, 2021 (in Chinese). http://www.jos.org.cn/1000-9825/6274.htm
[19] C. S. Xia and L. Zhang, "Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT." arXiv, Apr. 01, 2023. doi: 10.48550/arXiv.2304.00385.
[20] S. Kolak, R. Martins, C. L. Goues, and V. J. Hellendoorn, "Patch Generation with Language Models: Feasibility and Scaling Behavior," 2022.
[21] D. Sobania, M. Briesch, C. Hanna, and J. Petke, "An Analysis of the Automatic Bug Fixing Performance of ChatGPT." arXiv, Jan. 20, 2023. doi: 10.48550/arXiv.2301.08653.
[22] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, "QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge," in Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, 2017, pp. 55-56.
[23] Y. Shao, W. Liu, J. Ai, and C. Yang, "A Quantitative Measurement Method of Code Quality Evaluation Indicators based on Data Mining," in 2022 9th International Conference on Dependable Systems and Their Applications (DSA), Aug. 2022, pp. 659-669. doi: 10.1109/DSA56465.2022.00094.
[24] "ISO 25010." https://iso25000.com/index.php/en/iso-25000-standards/iso-25010 (accessed Jun. 24, 2023).
[25] Y. Dong, J. Ding, X. Jiang, Z. Li, G. Li, and Z. Jin, "CodeScore: Evaluating Code Generation by Learning Code Execution." arXiv, Jan. 21, 2023. doi: 10.48550/arXiv.2301.09043.
[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), USA: Association for Computational Linguistics, Jul. 2002, pp. 311-318. doi: 10.3115/1073083.1073135.
[27] S. Ren et al., "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis." arXiv, Sep. 27, 2020. doi: 10.48550/arXiv.2009.10297.
[28] D. Hendrycks et al., "Measuring Coding Challenge Competence With APPS." arXiv, Nov. 08, 2021. doi: 10.48550/arXiv.2105.09938.
[29] S. Kulal et al., "SPoC: Search-based Pseudocode to Code." arXiv, Jun. 11, 2019. doi: 10.48550/arXiv.1906.04908.
[30] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions," in 2022 IEEE Symposium on Security and Privacy (SP), May 2022, pp. 754-768. doi: 10.1109/SP46214.2022.9833571.
[31] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, "Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants," Aug. 2022.
[32] O. Asare, M. Nagappan, and N. Asokan, "Is GitHub's Copilot as Bad as Humans at Introducing Vulnerabilities in Code?" arXiv, Feb. 14, 2023. doi: 10.48550/arXiv.2204.04741.
[33] R. Khoury, A. R. Avila, J. Brunelle, and B. M. Camara, "How Secure is Code Generated by ChatGPT?" arXiv, Apr. 19, 2023. doi: 10.48550/arXiv.2304.09655.
[34] N. Nguyen and S. Nadi, "An empirical evaluation of GitHub Copilot's code suggestions," in Proceedings of the 19th International Conference on Mining Software Repositories (MSR '22), New York, NY, USA: Association for Computing Machinery, Oct. 2022, pp. 1-5. doi: 10.1145/3524842.3528470.
[35] C. E. C. Dantas and M. A. Maia, "Readability and Understandability Scores for Snippet Assessment: an Exploratory Study." Aug. 20, 2021. doi: 10.5281/zenodo.5224346.
[36] M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, "An Empirical Study of Code Smells in Transformer-based Code Generation Techniques," in 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM), Oct. 2022, pp. 71-82. doi: 10.1109/SCAM55253.2022.00014.
[37] M. Yan, X. Xia, X. Zhang, L. Xu, D. Yang, and S. Li, "Software quality assessment model: a systematic mapping study," Science China Information Sciences, vol. 62, no. 9, p. 191101, Jul. 2019, doi: 10.1007/s11432-018-9608-3.
[38] Y. Chen and H. Tao, Software Trustworthiness Measurement Evaluation and Enhancement Specification. Beijing: Science Press, 2019 (in Chinese).