A Review on Code Generation with LLMs: Application and Evaluation

Abstract—Code generation is a longstanding subject in the field of computer science and software engineering, which aims at realizing an agent capable of writing code automatically in line with human intent. With the booming development of large language models (LLMs) in recent years, code generation techniques powered by LLMs with strong coding ability have caught many researchers' interest. In this study, we conduct a review of recent studies about code generation with LLMs, from the application of LLM-based code generation to the evaluation of LLM-generated code. We find that, with the powerful code understanding and writing ability LLMs provide, these novel techniques can be applied to manage various software engineering tasks and indeed boost the productivity of developers to a great extent. But we also find that, as an equally important subject, the evaluation receives less attention from researchers than the application. We summarize several limitations in existing studies on the evaluation of code generated by LLMs, such as the inadequate set of quality characteristics considered, and we argue that more effort is needed to narrow the gap between research on the evaluation and on the application.

Index Terms—large language models (LLMs), code generation, code completion, automatic program repair, code quality evaluation
I. INTRODUCTION

Code generation, sometimes also called program synthesis, is a fascinating topic for researchers in computer science and software engineering. A machine capable of automatically writing desired code according to human demand is the researchers' ultimate purpose. With consistent efforts over decades, they have made great progress, but there is still a great distance to fully automatic and practical code generation, especially generating code based on natural language descriptions. However, the emergence of large language models (LLMs) in recent years, the GPT series in particular, provides researchers with a new possibility to achieve this ultimate purpose. With massive numbers of parameters (up to 175B), terabyte-level training data, and well-designed architectures, LLMs are endowed with unprecedentedly powerful language understanding, processing, and generation capability. Coincidentally, in order to make code more understandable for humans, most programming languages, especially high-level ones, are designed based on natural languages. Therefore, it is intuitive to think that LLMs trained on massive amounts of existing code can become able to read and write code smoothly, and this exciting assumption has been proven true by researchers in recent years. With sufficient code and accompanying descriptions from human programmers in the training data, LLMs can generate code according to given natural language descriptions of the target code, incomplete code segments to be completed, or even buggy code snippets to be fixed. And once code thoroughly generated by LLMs is obtained, to what extent this code can be trusted becomes a new and considerable question for researchers.

In this study, we mainly review 20 existing studies about code generation with LLMs published in the past five years. We review these studies from two aspects: the application of code generation with LLMs and the evaluation of code generated by LLMs. In terms of application, we focus on three prevailing kinds: description to code, code completion, and automatic program repair. As for evaluation, we discuss related studies based on the quality characteristics of generated code each study concerns, which are accordingly divided into three parts: functional correctness, security, and others (quality characteristics considered less often by these studies). The studies we review are presented in Table I, categorized by the subsection in which each is discussed in this paper.

TABLE I
OVERVIEW OF STUDIES REVIEWED BY US

Section^a      Subsection^b                Reviewed Studies
APPLICATION    Description to Code         [12] [13] [14]
               Code Completion             [15] [16] [17]
               Automatic Program Repair    [19] [20] [21]
EVALUATION     Functional Correctness      [5] [25] [27] [28] [29]
               Security                    [30] [31] [32] [33]
               Others                      [34] [36]

^a Each refers to the corresponding section in this paper, where APPLICATION and EVALUATION indicate Sections III and IV respectively.
^b Each refers to the corresponding subsection in this paper.

By reviewing these studies, we find there is a significant gap between research on the application and on the evaluation of LLM-based code generation. Many pioneering researchers have been exploiting the potential of code generation with LLMs in actual applications and have obtained many impressive results.
B. Code Generation

Code generation, also known as program synthesis, meaning the automatic construction of software, or self-writing code [9], is a longstanding topic discussed in the fields of software engineering, computer science, programming languages, and artificial intelligence. Code generation aims at automatically generating code according to given descriptions or existing code.

For a long time, most research about code generation was limited to domain-specific languages (DSLs), and programmers had to write tedious formal specifications as descriptions for generation [10]. However, with the development of artificial intelligence, many machine learning and deep learning techniques have been applied to code generation, making it more powerful and flexible [11]. Since most programming languages, especially high-level ones, share many similarities with natural language, it is intuitive to utilize powerful language models in code generation, and this idea has been proven very effective by studies in recent years.

Fig. 1. The Heap-Sort C++ Function Completed by Copilot and the Test Output.

III. APPLICATION OF CODE GENERATION WITH LLMS

With surprising natural language processing ability, LLMs can not only read and write code but also handle documentation, comments, warnings, or error messages generated along with the development of software. Thus, code generation with LLMs has great potential for automatically managing various software engineering tasks. In recent years, researchers have been exploring this ability and have found many impressive applications. We discuss three prevailing applications here: Description to Code, Code Completion, and Automatic Program Repair, as shown in Fig. 2. Besides these three, there are other attractive applications, like code translation, test generation, documentation generation, etc.
Fig. 2. Three Prevailing Applications: Description to Code, Code Completion, and Automatic Program Repair.

A. Description to Code

Nothing would be more exciting for software developers than a machine that can generate code according to given descriptions written in natural language. Some researchers have been trying this with several state-of-the-art LLMs.

Finnie-Ansley et al. [12] applied Codex, mentioned above, to introductory programming. They used 23 questions from two introductory programming course tests as descriptions for Codex to generate Python code. They found Codex performed better than most student participants in these two tests and that its solutions include many variations. They thought LLMs with the ability to write code, like Codex, could be both an opportunity and a threat to introductory programming education.

Jiang et al. [13] developed a natural language code synthesis tool, GetLine, and conducted a user study with it. GetLine is backed by LaMDA and provides a user interface. Users can input natural language requests and select a target programming language, and GetLine then generates multiple outputs for users to choose from. Finally, the authors drew several useful implications for the design of future code synthesis tools from their user study.

Dong et al. [14] proposed a self-collaboration framework for LLMs to enhance their capability of solving coding problems. They asked three ChatGPT instances to play the analyst, coder, and tester roles along the software development process respectively, and coding problems were fed in as user requirements. These three roles then interact and collaborate by chatting to generate code. Through a comprehensive evaluation, they found the performance of this self-collaboration code generation is 30% higher than that of the naive direct approach.
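To make the division of roles concrete, the following is a minimal Python sketch of such a role-based loop; the `chat` callable, the role prompts, and the stopping condition are illustrative assumptions of ours, not the framework implemented in [14].

```python
from typing import Callable

def self_collaborate(requirement: str,
                     chat: Callable[[str, str], str],
                     max_rounds: int = 3) -> str:
    # `chat(role, message)` is a hypothetical, user-supplied wrapper around a chat LLM;
    # the role prompts and the stop signal below are illustrative assumptions.
    plan = chat("analyst", f"Decompose this requirement into subtasks:\n{requirement}")
    code = chat("coder", f"Implement this plan in Python:\n{plan}")
    for _ in range(max_rounds):
        report = chat("tester", f"Review and test this code and report any defects:\n{code}")
        if "no defects" in report.lower():   # assumed stop signal from the tester role
            break
        code = chat("coder", f"Revise the code according to this test report:\n{report}\n\n{code}")
    return code
```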
B. Code Completion

Code completion is an indispensable feature of integrated development environments (IDEs), which offers developers code suggestions according to the available contextual information. The ability of code completion stayed at the syntactic level for a long time, and the productivity gain it brought was limited. However, LLMs, designed for NLP, can naturally predict what users might type next based on existing code and comments, just like word prediction, lifting code completion to the semantic level. Copilot, the powerful code completion tool based on Codex, has attracted many researchers' and developers' attention.
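As an illustration of what completion at the semantic level means (our own example, not taken from the reviewed studies), an LLM-based assistant can turn an intent comment and a function signature into a complete body, whereas a purely syntactic engine would at best suggest identifier names:

```python
# Context given to the model: an intent comment plus a signature.
# Return the n largest values of a list, in descending order.
def top_n(values: list[int], n: int) -> list[int]:
    # A plausible LLM-suggested body:
    return sorted(values, reverse=True)[:n]

print(top_n([3, 1, 4, 1, 5, 9, 2, 6], 3))  # prints [9, 6, 5]
```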
Moroz et al. [15] studied the applications and shortcomings of Copilot. They found Copilot has many advantages as a programming assistant: for example, it can be a good assistant for skilled programmers as well as for novice developers. They also found some problems and gaps in its applications, like unchecked low-quality code. Overall, they suggested Copilot still has lots of room to grow, and more effort should be taken to make it safer, more reliable, and more effective.

Ziegler et al. [16] conducted a case study with Copilot to find out its impact on the productivity of users. Combining objective usage data and the subjective perceptions of developers, they found the acceptance rate of suggestions to be a good predictor of the productivity developers perceive, which can reflect users' perception to some extent. Moreover, they found the acceptance rate varies among developers, depending on their behavior.

Barke et al. [17] conducted a grounded theory analysis, aiming to understand how programmers interact with code-generating models like Copilot. They found users' interactions with Copilot can be classified into two modes: acceleration and exploration. Acceleration mode can greatly boost users' productivity, and exploration mode can help users handle unfamiliar tasks. Based on this finding, they also proposed some recommendations for users.

C. Automatic Program Repair

Because of the high cost of traditional software maintenance approaches, which occupy over half of the software development lifecycle [18], automatic program repair (APR) techniques have been studied by many commercial companies and academic institutions for years. The booming development of LLMs in recent years undoubtedly provides a new possibility for APR.

According to Xia and Zhang [19], template-based tools have the best performance among traditional APR methods; they fix bugs by matching specific buggy code patterns and applying corresponding patches. But in this way, the capability of template-based tools is limited by the coverage of their finite pattern base. To overcome this issue, some researchers apply machine learning techniques to APR, treating bug fixing as a neural machine translation task that translates buggy code into correct code. Thus, the emerging LLMs capable of handling various NLP tasks can be a promising solution.
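To illustrate this buggy-to-fixed framing, consider the following illustrative pair of our own (not drawn from the reviewed benchmarks); a repair model is expected to map the first function to the second:

```python
# Buggy input: off-by-one error, the last element is never added.
def sum_list_buggy(xs: list[int]) -> int:
    total = 0
    for i in range(len(xs) - 1):   # bug: should iterate over range(len(xs))
        total += xs[i]
    return total

# Expected repaired output.
def sum_list_fixed(xs: list[int]) -> int:
    total = 0
    for i in range(len(xs)):       # fix: include the last element
        total += xs[i]
    return total
```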
Kolak et al. [20] noticed the potential capability of language models in distinguishing bugs and patches, so they studied the performance of LLMs of different scales on program repair tasks. They selected three publicly available versions of PolyCoder and Codex, and then tested these four models on 80 buggy programs.
According to their results, they found that larger models are more successful patch generators and tend to generate patches more similar to those written by real developers.

Sobania et al. [21] evaluated and analyzed the performance of ChatGPT (GPT-3.5) in bug fixing. They selected all Python problems from the QuixBugs [22] benchmark and asked ChatGPT for a bug fix four times for each problem. Compared with previous approaches, ChatGPT notably outperformed traditional APR tools and is competitive with other LLM-based approaches, like Codex. Their results also show that, as a dialogue system, ChatGPT can outperform the state of the art when given further information like error messages.

Xia and Zhang [19] proposed a fully automated conversation-driven APR approach, CHATREPAIR, based on ChatGPT. This approach keeps asking ChatGPT to generate patches and giving detailed test results back until the correct patch is obtained. By evaluating CHATREPAIR against other APR tools on two widely studied benchmarks, they were surprised to find that, even as a conversational APR approach, CHATREPAIR obtained state-of-the-art performance.
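The conversational pattern behind such approaches can be sketched as a simple feedback loop. The sketch below is our own simplification under stated assumptions (a user-supplied `ask_llm` function and a `run_tests` runner that returns failure messages); it is not the CHATREPAIR implementation.

```python
from typing import Callable, Optional

def conversational_repair(buggy_code: str,
                          failing_tests: str,
                          ask_llm: Callable[[str], str],
                          run_tests: Callable[[str], list[str]],
                          max_attempts: int = 10) -> Optional[str]:
    """Repeatedly ask the model for a patch, feeding test failures back, until the tests pass."""
    prompt = (f"The following code fails these tests:\n{failing_tests}\n"
              f"Code:\n{buggy_code}\nPlease provide a fixed version.")
    for _ in range(max_attempts):
        candidate = ask_llm(prompt)           # hypothetical call to a chat LLM
        failures = run_tests(candidate)       # empty list means every test passes
        if not failures:
            return candidate                  # plausible patch found
        feedback = "\n".join(failures)
        prompt = (f"The previous fix still fails:\n{feedback}\n"
                  f"Code:\n{candidate}\nPlease try again.")
    return None                               # give up after max_attempts
```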
IV. EVALUATION OF CODE GENERATED BY LLMS

Software source code quality has been an important subject in the field of software engineering for decades, and research on it can be traced back to the 1960s [23]. Many researchers have proposed various metrics, methods, and models for source code quality evaluation in different situations, which is also called software quality evaluation at the source level. These evaluation approaches have become the basis of software source code quality assurance and improvement practices in today's industry.

When researchers noticed the great potential of LLMs in writing code, concerns about the quality of LLM-generated code began to emerge at the same time. If unreliable code generated by LLMs is directly introduced into software without careful checking, it may lead to disastrous outcomes like system breakdowns or private data leaks. Therefore, much research in recent years has focused on evaluating LLM-generated code from the perspective of software source code quality.

Software quality consists of multiple quality characteristics (sometimes also called quality factors or quality attributes), such as functional suitability, performance efficiency, compatibility, etc. according to ISO/IEC 25010 [24], along with many corresponding sub-characteristics. We therefore review the research on LLM-generated source code evaluation and present it according to the quality characteristics or sub-characteristics each study concerns. Specifically, we present the studies in three categories: Functional Correctness, Security, and Others.

A. Functional Correctness

As mentioned above, automatic code generation is a longstanding topic that has been studied for decades. In order to assess the performance of a newly proposed code generation approach and compare it with existing ones, a proper code evaluation metric (CEM) is indispensable for the development of code generation techniques, and there are mainly two prevailing types of CEMs: match-based and execution-based [25].

Since code generation can be viewed as a translation from description to code, the match-based evaluation metrics commonly used in machine translation can be helpful; they evaluate by calculating the similarity between generated sentences and reference sentences, such as BLEU [26] and CodeBLEU [27]. However, these match-based CEMs fall short in reflecting the functional correctness of generated code, because they are unable to decide the functional equivalence between generated code and reference code [25]. Therefore, execution-based CEMs were proposed by later researchers.
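A small example of our own illustrates this deficiency: the two functions below are functionally equivalent but share few tokens, so a match-based metric that takes the first as the reference would give the second a low score despite its correctness.

```python
# Reference solution.
def mean_ref(xs: list[float]) -> float:
    return sum(xs) / len(xs)

# Functionally equivalent candidate with little surface overlap;
# a match-based metric such as BLEU scores it low against mean_ref.
def mean_gen(values: list[float]) -> float:
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count

assert mean_ref([1.0, 2.0, 3.0]) == mean_gen([1.0, 2.0, 3.0]) == 2.0
```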
Currently, the two most commonly used execution-based CEMs for evaluating the functional correctness of code are AvgPassRatio [28] and Pass@k [29], which both depend on executing the generated code against a prepared test set. In order to evaluate the Codex models they developed, Chen et al. [5] released a new evaluation problem set, HumanEval, and calculated the Pass@k value.
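For reference, the definitions commonly used in this line of work can be written as follows (our rendering, intended to be consistent with [28] and [5]): AvgPassRatio averages, over the problem set P, the fraction of test cases T_p that the generated solution for problem p passes, while Pass@k estimates the probability that at least one of k sampled solutions is correct, where n >= k samples are drawn per problem and c of them pass all tests.

```latex
\mathrm{AvgPassRatio} \;=\; \frac{1}{|P|} \sum_{p \in P} \frac{1}{|T_p|} \sum_{t \in T_p}
    \mathbf{1}\big[\text{the code generated for } p \text{ passes test case } t\big],
\qquad
\mathrm{Pass@}k \;=\; \mathbb{E}_{p \in P}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right].
```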
Because of the need for repeated execution on test cases, Dong et al. [25] argued that the computation of execution-based CEMs is costly, slow, and insecure, while match-based CEMs are inaccurate, so they proposed a new LLM-based CEM, CodeScore, for functional correctness evaluation. They defined PassRatio based on AvgPassRatio and made CodeScore an execution-free alternative to PassRatio. In their evaluation with this LLM-based framework, CodeScore outperforms match-based CEMs in accuracy with respect to PassRatio and costs much less time than execution-based CEMs.

B. Security

The powerful code generation capability of LLMs relies on the massive amount of code snippets available in training data, which are mostly from open-source code repositories. However, it is likely that potential vulnerabilities or even malicious code snippets exist in this open-source code, which may leak into the output of LLMs and harm the security of the developed software. Many researchers share this worry and have evaluated the security of LLM-generated code from various aspects.

Pearce et al. [30] focused on the security of Copilot's code contributions. First, based on MITRE's Common Weakness Enumerations (CWEs), they created a prompt dataset containing various security-relevant scenarios. The security of the completed code was evaluated with an automated analysis tool and manual inspection. They found that, in about 44% of all scenarios, Copilot did generate code with a relevant weakness, and some of the weaknesses are introduced more frequently than others.
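As a typical example of the kind of weakness such scenarios probe (our own illustration of CWE-89, SQL injection, not a prompt from [30]), a completion that builds a query by string interpolation is vulnerable, while the parameterized form is not:

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # CWE-89: untrusted input is interpolated directly into the SQL statement.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, name: str):
    # Parameterized query: the driver handles escaping of the value.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```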
Sandoval et al. [31] conducted a user study to find out whether student programmers write more insecure code with the help of an LLM-based code assistant. Considering that real-world programming is mostly project-based, they designed a "shopping list" C program completion task.
They found that the number of severe security bugs produced by LLM-assisted participants was no more than 10% greater than that of the control group.

Asare et al. [32] conducted a comparative empirical analysis to find out whether Copilot is as bad as human developers at introducing vulnerabilities when writing code. Based on a dataset of C/C++ vulnerabilities from several real-world projects, they recreated the same scenarios for Copilot by deleting the bug- or patch-relevant code. They found Copilot introduced the same vulnerabilities as humans only one-third of the time, i.e., it is not as bad as human developers.

Khoury et al. [33] conducted experiments to evaluate the security of code generated by ChatGPT. They designed 21 problems across 5 programming languages, and each problem is prone to introducing a specific CWE vulnerability when solved. In conclusion, they found ChatGPT can frequently generate insecure code and that experienced programmers are still irreplaceable for producing sufficiently reliable code.

C. Others

There are also some researchers concerned with other aspects of source code quality besides functional correctness and security.

Understandability. Understandability reflects how easily programmers can fully understand the logic and function of a specific code snippet. Nguyen et al. [34] conducted an empirical study to evaluate the correctness and understandability of code generated by Copilot. For the evaluation of understandability, they calculated the cyclomatic complexity and cognitive complexity of the generated code, both of which are associated with understandability according to a previous study [35]. Overall, they thought Copilot could produce easily understandable code under their experimental circumstances, but their data may not be enough to draw a general conclusion.
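As a reminder of what these metrics capture (our own example, not code from [34]), cyclomatic complexity counts linearly independent paths, i.e., decision points plus one; the function below therefore has a cyclomatic complexity of 4, and its cognitive complexity is also low because there is no nesting:

```python
def classify_bmi(weight_kg: float, height_m: float) -> str:
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:      # decision 1
        return "underweight"
    elif bmi < 25.0:    # decision 2
        return "normal"
    elif bmi < 30.0:    # decision 3
        return "overweight"
    return "obese"      # cyclomatic complexity = 3 decisions + 1 = 4
```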
Maintainability and Readability. As defined by ISO/IEC 25010 [24], maintainability represents the degree of effectiveness and efficiency with which a product or system can be modified to improve it, correct it, or adapt it to changes in environment and requirements. Readability, which is similar to understandability, evaluates the complexity of code in the syntactic aspect, while understandability conducts the evaluation in the dynamic aspect [35]. Siddiq et al. [36] conducted an empirical study on code smells in the training code and the generated code of LLM-based code generation techniques, as well as the relation between them. Code smells are patterns that indicate lower maintainability and readability, and can include security issues, design-decision issues, and coding standard violations in source code. According to their results, they concluded that bad code patterns in training code do leak into the code generated by models, because they found the types of code smells in the generated code to be a subset of those in the training code.
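For illustration (our own example, not drawn from the dataset studied in [36]), a generated snippet that silently swallows every exception exhibits a classic smell that hurts maintainability; the second variant below avoids it by catching only the expected error:

```python
# Smell: a bare "except" hides every failure, including programming errors.
def parse_port(value: str) -> int:
    try:
        return int(value)
    except:                      # noqa: E722 (this is the smell under discussion)
        return 0

# Cleaner alternative: catch only the expected error and make the fallback explicit.
def parse_port_clean(value: str, default: int = 0) -> int:
    try:
        return int(value)
    except ValueError:
        return default
```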
V. DISCUSSION

With the rapid development of code generation with LLMs, many researchers have been exploring its possibilities in practice. As we reviewed in Section III, LLM-based code generation techniques have astonishing potential and various applications in managing software engineering tasks, and they can greatly boost the productivity of developers. However, as an equally critical subject, the evaluation of LLM-generated code fails to keep up with the application. By reviewing the related research in Section IV, we found the following limitations in existing studies on the evaluation of code generated by LLMs.

Inadequate quality characteristics for evaluation. According to ISO/IEC 25010 [24], there are plenty of quality characteristics and sub-characteristics for evaluating code from different aspects. However, most existing research only focuses on the functional correctness and security of LLM-generated code. We think that is because these two are truly the most elementary and concerning aspects of code, and also the most easily perceived by programmers and by users of the final software products. However, many other characteristics can also impact the overall quality of code to a great extent, like compatibility, maintainability, portability, etc.

Lack of a systematic and quantitative evaluation model. For traditional code, there are many studies proposing systematic quality [37] (or, similarly, trustworthiness [38]) evaluation models, which can produce a quantitative evaluation result about the quality of code. However, to the best of our knowledge, there is still no research applying these traditional methods to the evaluation of LLM-generated code. Though there may be some challenges to conquer, we believe these traditional models can bring many benefits to current research on LLM-generated code evaluation.

Ignoring human engagement when conducting the evaluation. According to Sandoval et al. [31], there are generally two kinds of software development modes with LLMs: the autopilot mode and the assisted mode. Due to the limited functional correctness and potential vulnerabilities of LLM-generated code, the assisted mode is the more practical way at the present stage, as in the practices of Copilot and ChatGPT. For the evaluation of code generated in this way, human engagement should be taken into consideration as well, because developers' prompt strategies, suggestion selections, and other behaviors do affect the final code. However, from our review, we found most research failed to consider the role of human programmers when conducting the evaluation.

Lack of specific research on evaluation. We found that in most existing research mentioning LLM-generated code quality evaluation, the initial motivation of the evaluation is to measure and compare the performance of LLMs in terms of code generation. For instance, functional correctness, which is used most by researchers, has become a golden metric to embody the code-writing ability of LLMs. Therefore, there is a lack of research focusing specifically on the evaluation of code generated by LLMs, and we think such research is much needed, because it is meaningful not only for developers of LLMs but also for users, as it can help programmers evaluate the trustworthiness of generated code.

In our future work, we will treat LLM-assisted coding as a brand-new software development approach and evaluate the trustworthiness of the resulting software. With this new development process, we have to make some necessary changes
to the traditional software trustworthiness model to adapt to it. We will focus not only on the generated source code but also on the whole LLM-assisted process, including the capability of the specific LLM, the interaction between human developers and models, and so on.

VI. CONCLUSION

In this study, we review recent research on code generation with LLMs from two aspects: the application of code generation with LLMs and the evaluation of code generated by LLMs. We find that, with the help of the recently emerging powerful LLMs, code generation techniques can successfully handle more complex tasks than before, and many researchers have proposed various novel applications of LLM-based code generation. However, research on the evaluation of LLM-generated code fails to keep pace with the application. By reviewing a limited number of related studies, we found some limitations at the current research stage, and we think more effort is needed to fill the gap in the future.

REFERENCES

[1] S. J. Russell et al., Artificial Intelligence: A Modern Approach, Fourth edition, Global edition, in Pearson Series in Artificial Intelligence. Harlow: Pearson, 2022.
[2] A. Radford and K. Narasimhan, "Improving Language Understanding by Generative Pre-Training," 2018.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language Models are Unsupervised Multitask Learners," 2019.
[4] T. B. Brown et al., "Language Models are Few-Shot Learners." arXiv, Jul. 22, 2020. doi: 10.48550/arXiv.2005.14165.
[5] M. Chen et al., "Evaluating Large Language Models Trained on Code." arXiv, Jul. 14, 2021. doi: 10.48550/arXiv.2107.03374.
[6] "GitHub Copilot · Your AI pair programmer," GitHub. https://github.com/features/copilot (accessed Jun. 20, 2023).
[7] "ChatGPT." https://openai.com/chatgpt (accessed Jun. 20, 2023).
[8] "GPT-4." https://openai.com/gpt-4 (accessed Jun. 20, 2023).
[9] C. David and D. Kroening, "Program synthesis: challenges and opportunities," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 375, no. 2104, p. 20150403, Sep. 2017, doi: 10.1098/rsta.2015.0403.
[10] B. Gu, B. Yu, X. G. Dong, X. F. Li, R. M. Zhong, and M. F. Yang, "Intelligent program synthesis techniques: Literature review," Ruan Jian Xue Bao/Journal of Software, vol. 32, no. 5, pp. 1373-1384, 2021 (in Chinese). http://www.jos.org.cn/1000-9825/6200.htm
[11] E. Dehaerne, B. Dey, S. Halder, S. De Gendt, and W. Meert, "Code Generation Using Machine Learning: A Systematic Review," IEEE Access, vol. 10, pp. 82434-82455, 2022, doi: 10.1109/ACCESS.2022.3196347.
[12] J. Finnie-Ansley, P. Denny, B. A. Becker, A. Luxton-Reilly, and J. Prather, "The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming," in Proceedings of the 24th Australasian Computing Education Conference (ACE '22), New York, NY, USA: Association for Computing Machinery, Feb. 2022, pp. 10-19. doi: 10.1145/3511861.3511863.
[13] E. Jiang et al., "Discovering the Syntax and Strategies of Natural Language Programming with Generative Language Models," in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22), New York, NY, USA: Association for Computing Machinery, Apr. 2022, pp. 1-19. doi: 10.1145/3491102.3501870.
[14] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration Code Generation via ChatGPT." arXiv, Apr. 15, 2023. Accessed: Apr. 21, 2023. [Online]. Available: http://arxiv.org/abs/2304.07590
[15] E. A. Moroz, V. O. Grizkevich, and I. M. Novozhilov, "The Potential of Artificial Intelligence as a Method of Software Developer's Productivity Improvement," in 2022 Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), Jan. 2022, pp. 386-390. doi: 10.1109/ElConRus54750.2022.9755659.
[16] A. Ziegler et al., "Productivity Assessment of Neural Code Completion." arXiv, May 13, 2022. doi: 10.48550/arXiv.2205.06537.
[17] S. Barke, M. B. James, and N. Polikarpova, "Grounded Copilot: How Programmers Interact with Code-Generating Models." arXiv, Oct. 31, 2022. doi: 10.48550/arXiv.2206.15000.
[18] J. J. Jiang, J. J. Chen, and Y. F. Xiong, "Survey of automatic program repair techniques," Ruan Jian Xue Bao/Journal of Software, vol. 32, no. 9, pp. 2665-2690, 2021 (in Chinese). http://www.jos.org.cn/1000-9825/6274.htm
[19] C. S. Xia and L. Zhang, "Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT." arXiv, Apr. 01, 2023. doi: 10.48550/arXiv.2304.00385.
[20] S. Kolak, R. Martins, C. L. Goues, and V. J. Hellendoorn, "Patch Generation with Language Models: Feasibility and Scaling Behavior," 2022.
[21] D. Sobania, M. Briesch, C. Hanna, and J. Petke, "An Analysis of the Automatic Bug Fixing Performance of ChatGPT." arXiv, Jan. 20, 2023. doi: 10.48550/arXiv.2301.08653.
[22] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, "QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge," in Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, 2017, pp. 55-56.
[23] Y. Shao, W. Liu, J. Ai, and C. Yang, "A Quantitative Measurement Method of Code Quality Evaluation Indicators based on Data Mining," in 2022 9th International Conference on Dependable Systems and Their Applications (DSA), Aug. 2022, pp. 659-669. doi: 10.1109/DSA56465.2022.00094.
[24] "ISO 25010." https://iso25000.com/index.php/en/iso-25000-standards/iso-25010 (accessed Jun. 24, 2023).
[25] Y. Dong, J. Ding, X. Jiang, Z. Li, G. Li, and Z. Jin, "CodeScore: Evaluating Code Generation by Learning Code Execution." arXiv, Jan. 21, 2023. doi: 10.48550/arXiv.2301.09043.
[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), USA: Association for Computational Linguistics, Jul. 2002, pp. 311-318. doi: 10.3115/1073083.1073135.
[27] S. Ren et al., "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis." arXiv, Sep. 27, 2020. doi: 10.48550/arXiv.2009.10297.
[28] D. Hendrycks et al., "Measuring Coding Challenge Competence With APPS." arXiv, Nov. 08, 2021. doi: 10.48550/arXiv.2105.09938.
[29] S. Kulal et al., "SPoC: Search-based Pseudocode to Code." arXiv, Jun. 11, 2019. doi: 10.48550/arXiv.1906.04908.
[30] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions," in 2022 IEEE Symposium on Security and Privacy (SP), May 2022, pp. 754-768. doi: 10.1109/SP46214.2022.9833571.
[31] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, "Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants," Aug. 2022.
[32] O. Asare, M. Nagappan, and N. Asokan, "Is GitHub's Copilot as Bad as Humans at Introducing Vulnerabilities in Code?" arXiv, Feb. 14, 2023. doi: 10.48550/arXiv.2204.04741.
[33] R. Khoury, A. R. Avila, J. Brunelle, and B. M. Camara, "How Secure is Code Generated by ChatGPT?" arXiv, Apr. 19, 2023. doi: 10.48550/arXiv.2304.09655.
[34] N. Nguyen and S. Nadi, "An empirical evaluation of GitHub Copilot's code suggestions," in Proceedings of the 19th International Conference on Mining Software Repositories (MSR '22), New York, NY, USA: Association for Computing Machinery, Oct. 2022, pp. 1-5. doi: 10.1145/3524842.3528470.
[35] C. E. C. Dantas and M. A. Maia, "Readability and Understandability Scores for Snippet Assessment: an Exploratory Study." Aug. 20, 2021. doi: 10.5281/zenodo.5224346.
[36] M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, "An Empirical Study of Code Smells in Transformer-based Code Generation Techniques," in 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM), Oct. 2022, pp. 71-82. doi: 10.1109/SCAM55253.2022.00014.
[37] M. Yan, X. Xia, X. Zhang, L. Xu, D. Yang, and S. Li, "Software quality assessment model: a systematic mapping study," Science China Information Sciences, vol. 62, no. 9, p. 191101, Jul. 2019, doi: 10.1007/s11432-018-9608-3.
[38] Y. Chen and H. Tao, Software Trustworthiness Measurement Evaluation and Enhancement Specification. Beijing: Science Press, 2019 (in Chinese).