Article
Program Code Generation with Generative AIs
Baskhad Idrisov and Tim Schlippe *
Abstract: Our paper compares the correctness, efficiency, and maintainability of human-generated
and AI-generated program code. For that, we analyzed the computational resources of AI- and
human-generated program code using metrics such as time and space complexity as well as runtime
and memory usage. Additionally, we evaluated the maintainability using metrics such as lines of code,
cyclomatic complexity, Halstead complexity and maintainability index. For our experiments, we had
generative AIs produce program code in Java, Python, and C++ that solves problems defined on
the competition coding website leetcode.com. We selected six LeetCode problems of varying diffi-
culty, resulting in 18 program codes generated by each generative AI. GitHub Copilot, powered by
Codex (GPT-3.0), performed best, solving 9 of the 18 problems (50.0%), whereas CodeWhisperer did
not solve a single problem. BingAI Chat (GPT-4.0) generated correct program code for seven prob-
lems (38.9%), ChatGPT (GPT-3.5) and Code Llama (Llama 2) for four problems (22.2%) and StarCoder
and InstructCodeT5+ for only one problem (5.6%). Surprisingly, although ChatGPT generated only
four correct program codes, it was the only generative AI capable of providing a correct solution
to a coding problem of difficulty level hard. In summary, 26 AI-generated codes (20.6%) solve the
respective problem. For 11 of the incorrect AI-generated codes (8.7%), only minimal modifications are
necessary to solve the problem, which results in time savings between 8.9% and 71.3% in comparison
to programming the code from scratch.
Keywords: artificial intelligence; generative AIs; AI program code generation; program code
efficiency; program code maintainability
they may not always produce correct, efficient, or maintainable program codes. Therefore,
it is crucial to carefully consider various aspects when deciding whether or not to use
these tools [14].
While substantial research has focused on evaluation metrics for correctness like
pass@k [15–17], there is a noticeable gap: The extensive comparison of AI-generated and
human-generated program codes based on various metrics has largely been understud-
ied. Consequently, our study embarks on a comprehensive and in-depth exploration of
the coding capabilities of seven state-of-the-art generative AIs. Our goal is to evaluate
the AI-generated program codes based on the interaction of various metrics such as cy-
clomatic complexity, maintainability index, etc., concerning their correctness, efficiency and
maintainability. Moreover, we go one step further by comparing the AI-generated codes
with corresponding human-generated codes written by professional programmers.
Our contributions are as follows:
• We investigate and compare the correctness, efficiency and maintainability of AI-generated
program codes using varying evaluation metrics.
• We are the first to extensively compare AI-generated program codes to human-
generated program codes.
• We analyze the program codes that address problems of three difficulty levels—easy,
medium and hard.
• We produced a dataset of 126 AI-generated and 18 human-generated program codes—
the AI/Human-Generated Program Code Dataset—which we share with the research
community (https://github.com/Back3474/AI-Human-Generated-Program-Code-Dataset, accessed on 29 January 2024).
In the next section, we will provide an overview of related work. In Section 3, we will
present our experimental setup. Our experiments and results will be described in Section 4.
In Section 5, we will conclude our work and indicate possible future steps.
2. Related Work
In this section, we will look at existing work regarding automatic program code
generation and evaluation.
Codex (https://openai.com/blog/openai-codex, accessed on 29 January 2024) is an
AI model for program code generation based on GPT-3.0 and fine-tuned on program code
publicly available on GitHub [16]. Ref. [16] tested Codex’s capabilities in generating
Python code using natural language instructions found in in-code comments known as
docstrings. They also created HumanEval, a dataset of 164 hand-written coding problems in
natural language plus their unit tests to assess the functional correctness of program code.
One discovery from their research was that if Codex is asked to generate code for the
same problem several times, the probability that at least one generated code is correct increases.
Consequently, they used pass@k as an evaluation metric, where k is the number of program codes
generated per problem, and a problem counts as passed if at least one of these codes passes all of its unit tests. To
obtain an unbiased estimation of pass@k, ref. [16] applied additional adjustments to the
original calculation. Codex achieved a pass@1 of 28.8% in solving the provided problems.
They compared the Codex-generated code with program code generated by GPT-3.0 and
GPT-J [18], but GPT-3.0 demonstrated a pass@1 of 0% and GPT-J obtained 11.4%. With
pass@100, Codex even achieved 70.2%.
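To illustrate this adjustment, the unbiased pass@k estimator from [16] computes, per problem, the probability that at least one of k codes drawn from n generated samples (of which c pass all unit tests) is correct. The following Python sketch only illustrates this published formula; the sample counts in the example are hypothetical:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator from [16]: 1 - C(n - c, k) / C(n, k),
        # where n codes were sampled for a problem and c of them passed all unit tests.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical example: 200 samples per problem, 40 of them correct
    print(round(pass_at_k(200, 40, 1), 3))    # 0.2
    print(round(pass_at_k(200, 40, 100), 3))  # close to 1.0

Averaging this value over all problems yields the reported pass@k score.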
Ref. [19] evaluated the validity, correctness, and efficiency of program code generated
by GitHub (GH) Copilot using HumanEval. GH Copilot is an IDE extension that uses
Codex. Ref. [19] defined a code as valid if it complied with the syntax rules of a
given programming language. The correctness was computed as the fraction of a task's
existing unit tests that the generated code passed. The efficiency was
measured by determining the time and space complexities of the generated codes. Their
results demonstrate that Codex was able to generate valid code with a success rate of 91.5%.
Regarding code correctness, 28.7% were generated correctly, 51.2% were generated partially
correctly, and 20.1% were generated incorrectly.
Ref. [17] used HumanEval to evaluate the validity, correctness, security, reliability,
and maintainability of Python code generated by GH Copilot, Amazon CodeWhisperer
(https://aws.amazon.com/de/codewhisperer, accessed on 29 January 2024), and Chat-
GPT. They defined a valid code and a correct code as in [19]. To determine the security,
reliability, and maintainability, they used SonarQube (https://www.sonarqube.org, accessed
on 29 January 2024). Their calculation of a code’s security is based on the potential cyber-
security vulnerabilities of this code. The reliability is based on the number of bugs. The
maintainability is measured by counting present code smells. ChatGPT generated 93.3%
valid program codes, GH Copilot 91.5%, and CodeWhisperer 90.2%. ChatGPT passed
most unit tests, with 65.2% of its program codes being correct. GH Copilot reached 46.3%,
and CodeWhisperer 31.1%. However, when they evaluated newer versions of GH Copilot and
CodeWhisperer, correctness improved by 18% and 7%, respectively. Due to the small
number of generated code fragments, the numbers for security were not usable. Concern-
ing reliability, ChatGPT produced two bugs, GH Copilot three bugs and CodeWhisperer
one bug. CodeWhisperer produced the most maintainable code, ChatGPT the second most,
and GH Copilot the least maintainable.
Ref. [15] introduced a new framework for program code evaluation named EvalPlus.
Furthermore, they created HumanEval+, an extension of HumanEval using EvalPlus’ auto-
matic test input generator. HumanEval+ is 80 times larger than HumanEval, which enables more
comprehensive testing and analysis of AI-generated code. With HumanEval+, ref. [15] eval-
uated the functional correctness of program code generated by 26 different AI models which
are based on GPT-4 [20], Phind-CodeLlama [21], WizardCoder-CodeLlama [22], Chat-
GPT [23], Code Llama [24], StarCoder [25], CodeGen [26], CODET5+ [27], MISTRAL [28],
CodeGen2 [29], VICUNA [30], SantaCoder [31], INCODER [32], GPT-J [33], GPT-NEO [34],
PolyCoder [35], and StableLM [36] with pass@k. Looking at pass@1, the top five perform-
ers were GPT-4 (76.2%), Phind-CodeLlama (67.1%), WizardCoder-CodeLlama (64.6%),
ChatGPT (63.4%), and Code Llama (42.7%).
DeepMind developed an AI model for code generation named AlphaCode [37]. On the
coding competition website codeforces.com, the AI model achieved an average ranking in
the top 54.3% of more than 5000 participants for Python and C++ tasks. The ranking takes
runtime and memory usage into account.
Ref. [38] evaluated GH Copilot’s ability to produce solutions for 33 LeetCode prob-
lems using Python, Java, JavaScript, and C. They calculated the correctness by dividing
the passed test cases by the total number of test cases per problem. The numbers for
understandability, cyclomatic and cognitive complexity were retrieved using SonarQube, but
due to configuration problems, they were not able to analyze the understandability of the
codes generated in C. Concerning correctness, GH Copilot performed best in Java (57%)
and worst in JavaScript (27%). In terms of understandability, GH Copilot’s Python, Java and
JavaScript program codes had a cognitive complexity of 6 and a cyclomatic complexity of 5 on
average, with no statistically significant differences between the programming languages.
In comparison to the aforementioned related work, our focus is to evaluate the
Java, Python and C++ code produced by Codex (GPT-3.0), CodeWhisperer, BingAI Chat
(GPT-4.0), ChatGPT (GPT-3.5), Code Llama (Llama 2), StarCoder, and InstructCodeT5+.
For comparison, we also assess human-generated program code. We obtain the correctness,
efficiency and maintainability for our program codes by measuring time and space complex-
ity, runtime and memory usage, lines of code, cyclomatic complexity, Halstead complexity and
maintainability index. As a user usually does not generate code for the same problem sev-
eral times, we evaluate pass@k with k = 1, which is our metric for correctness. Finally, we
analyze which of the incorrect and non-executable program codes can easily be modified
manually and then used, quickly and with little effort, to solve the corresponding
coding problems.
3. Experimental Setup
In this section, we will provide a comprehensive overview of our experimental setup.
This includes an introduction to the state-of-the-art generative AIs that we employed in our
experiments, the coding problems we chose from LeetCode and the code quality metrics
we utilized for evaluation.
accessing via the same VS Code extension as for StarCoder [47]. The training data were
mostly deduplicated, publicly available code, with 8% from natural language datasets
related to program code. It covers many programming languages such as Python, C++,
Java, PHP, TypeScript, JavaScript, C#, and Bash [24].
3.1.6. CodeWhisperer
According to [49], Amazon’s “CodeWhisperer is a generative AI service powered by a
foundation model trained on various data sources, including Amazon and open-source
code”. No information regarding CodeWhisperer’s training, fine-tuning or number of
parameters is publicly available. Supported IDEs to access CodeWhisperer are Amazon
SageMaker Studio, JupyterLab, VS Code, all JetBrains IDEs, AWS Cloud9, AWS Lambda
and AWS Glue Studio. Supported programming languages are Java, Python, JavaScript,
TypeScript, C#, Ruby, Go, PHP, C++, C, Shell, Scala, Rust, Kotlin, and SQL [50].
the human-generated Java, Python and C++ program codes from LeetCode that solve our
coding problems and were rated best by the LeetCode community, i.e., were written by
programmers with the highest expertise level.
MI = MAX(0, (171 − 5.2 · ln(HV) − 0.23 · CC − 16.2 · ln(LOC)) · 100/171)
where HV is Halstead volume, CC is cyclomatic complexity and LOC is lines of code [59]. A
higher maintainability index indicates program code with a higher maintainability.
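For illustration, the maintainability index defined above can be sketched in Python as follows; the input values in the example are hypothetical:

    import math

    def maintainability_index(hv: float, cc: float, loc: int) -> float:
        # Normalized maintainability index (0-100) as defined above [59];
        # higher values indicate more maintainable program code.
        raw = 171 - 5.2 * math.log(hv) - 0.23 * cc - 16.2 * math.log(loc)
        return max(0.0, raw * 100 / 171)

    # Hypothetical example: Halstead volume 500, cyclomatic complexity 5, 40 lines of code
    print(round(maintainability_index(500, 5, 40), 2))  # ~45.5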
Figure 2. LeetCode Coding Problem Description (left) and Prompt to Instruct ChatGPT (right).
We observe that 122 of the 126 AI-generated program codes are executable. Four program
codes result in a compilation error, which is indicated with ϵ. Out of our 18 programming
tasks, GH Copilot was able to generate nine correct program codes (50%) in total (∑),
followed by Bing AI with seven correct program codes (39%). ChatGPT and Code Llama
each produced only four correct program codes (22%). CodeWhisperer did not produce a single
correct program code in any of the three programming languages. This results in 26 correct program codes,
which we will further analyze in Sections 4.2–4.8.
Looking at the difficulty levels shows that the generative AIs mainly generated correct
program code for easy and medium coding problems. Regarding the programming
languages, most correct Java (J) and C++ (C) codes were generated with GH Copilot. Most
correct Python (P) code was produced with Bing AI. The program code which was written
by human programmers (human) was always correct, independent of the programming
language and the difficulty of the coding problem.
In the hard task#6, where only ChatGPT produced a correct program code, ChatGPT
was even able to solve the problem with 10 Python (P) lines of code, which is 60% fewer
lines of code than human. BingAI generated 86% fewer lines of code to solve task#3 in Python
(P). BingAI and GH Copilot produced 36% fewer lines of code to solve task#2 in Java (J); 22%
fewer C++ (C) lines of code are required in BingAI’s program code for task#3. Furthermore,
10% fewer Java (J) lines of code are required in GH Copilot’s and Code Llama’s program code
for task#3.
Out of the 26 correct program codes, in seven cases (27%) a generative AI was able
to solve the coding problem with program code that has less cyclomatic complexity than
human. Only four AI-generated program codes (15%) were outperformed by human in this
evaluation metric. Fifteen program codes (58%) have the same cyclomatic complexity as
human. GH Copilot performed better than the other generative AIs, being able to generate
three program codes with less cyclomatic complexity than human.
In the medium task#3, where only Bing AI produced correct program code in Python (P),
Bing AI was even able to solve the problem with a cyclomatic complexity of 3, which is 67%
less than human. ChatGPT generated code with 50% less cyclomatic complexity to solve task#6
in Python (P). GH Copilot and Code Llama produced code with 33% less cyclomatic complexity
to solve task#3 in Java (J). Bing AI and GH Copilot also generated code with 33% lower
cyclomatic complexity to solve task#3 in C++ (C). Moreover, the C++ code (C) for task#4 has
25% less cyclomatic complexity than human.
Out of the 26 correct program codes, in one case (4%) a generative AI was able
to solve the coding problem with code that has lower time complexity than human. Thirteen
AI-generated program codes (50%) were outperformed by human in this evaluation
metric. Twelve program codes (46%) have the same time complexity as human. GH Copilot
performed better than the other generative AIs, being able to generate one program code
with lower time complexity than human and four program codes with equal time complexity.
ChatGPT, Bing AI, and Code Llama produced program code with the same time complexity as
human in two program codes each.
4.6. Runtime
Table 8 demonstrates the runtime of the 26 correct program codes plus the runtime of
our human-written reference program codes (human) on LeetCode in milliseconds. The
lower the runtime of a program code, the better. The runtime of the six correct program
codes labeled with “*” could not be measured since their execution resulted in a Time Limit
Exceeded error when executed on LeetCode.
Out of the 26 correct program codes, in six cases (23%) a generative AI was able to solve
the coding problem with code that executes with less runtime than human. Seventeen AI-
generated program codes (65%) were outperformed by human in this evaluation metric.
Three program codes (12%) took the same runtime as human. GH Copilot performed better
than the other generative AIs, being able to generate two program codes with less runtime
than human and one program code with the same runtime.
In easy task#1, where only GH Copilot produced the correct program code in Java (J),
GH Copilot was even able to solve the problem with a runtime of 4 milliseconds, whereas the
human-written code took 125% longer. InstructCodeT5+ generated code that took 55% less runtime to solve task#2
in Python (P). GH Copilot produced code that took 12% less runtime to solve task#4 in C++
(C). Bing AI, GH Copilot and Code Llama generated code that took the same runtime as human
to solve task#2 in Java (J).
Out of the 26 correct program codes, in 13 cases (50%) a generative AI was able
to produce code with a higher maintainability index than human. Thirteen AI-generated
program codes (50%) were outperformed by human in this evaluation metric. GH Copilot and
Bing AI performed better than the other generative AIs, each being able to generate four program
codes (15%) with a higher maintainability index than human. ChatGPT and Code Llama
each produced two program codes (8%) with a higher maintainability index than human.
Tmaintain = Tincorrect / (MI/100) − Tincorrect
where MI is the maintainability index between 0 and 100, based on [60]. To obtain MI in a
range of 0 to 1, we divided it by 100. This way, Tincorrect is extended with a factor that is
higher for less maintainable program code.
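A minimal sketch of this correction-effort term (with hypothetical input values) could look as follows:

    def t_maintain(t_incorrect: float, mi: float) -> float:
        # Correction effort: T_incorrect is scaled by the inverse of the
        # normalized maintainability index (MI/100), so a lower MI (less
        # maintainable code) yields a larger overhead.
        return t_incorrect / (mi / 100.0) - t_incorrect

    # Hypothetical example: 600 s estimated for the incorrect code, MI of 60
    print(round(t_maintain(600, 60), 1))  # 400.0 s of estimated correction effort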
Table 11 shows the MI, Tincorrect, Tcorrect, TTC as well as the relative difference between
Tcorrect and TTC (∆ Tcorrect–TTC (%)) of our 24 potentially correct program codes and their
corresponding correct program codes. We observe that for 11 program codes TTC < Tcorrect,
i.e., correcting the incorrect program code (TTC) takes less time than implementing the
correct program code from scratch (Tcorrect). With these 11 codes, between 8.85% and 71.31%
of time can be saved if the AI-generated program code is corrected and not programmed
from scratch.
AI  #  Lang.  MI  Tincorrect | Tcorrect (Estimated Time to Program in Seconds)  TTC  ∆ Tcorrect–TTC (%)
StarCoder 1 C 49.77 1001 | 1225 1234 −0.73
CodeWhisperer 1 J 57.55 713 | 1167 1693 −31.07
StarCoder 1 J 57.78 827 | 1464 1241 +17.97
ChatGPT (GPT-3.5) 3 J 49.73 1570 | 2984 3001 −0.57
StarCoder 3 P 57.41 1045 | 450 1370 −67.15
CodeWhisperer 2 C 58.46 355 | 361 258 +39.92
ChatGPT (GPT-3.5) 3 P 60.48 303 | 865 760 +13.82
ChatGPT (GPT-3.5) 2 J 52.97 656 | 500 738 −32.25
Code Llama (Llama 2) 1 P 64.87 492 | 406 352 +15.34
Bing AI Chat (GPT-4.0) 3 J 49.05 1910 | 1846 2048 −9.86
InstructCodeT5+ 3 C 56.69 757 | 829 650 +27.54
Code Llama (Llama 2) 4 J 53.78 538 | 574 498 +15.26
InstructCodeT5+ 2 C 64.83 187 | 356 279 +27.60
StarCoder 2 P 63.38 254 | 209 192 +8.85
ChatGPT (GPT-3.5) 1 P 56.59 697 | 577 655 −11.91
CodeWhisperer 2 J 57.89 459 | 355 438 −18.95
Code Llama (Llama 2) 3 P 63.98 364 | 275 294 −6.46
StarCoder 2 C 57.45 458 | 395 402 −1.74
InstructCodeT5+ 2 J 56.99 519 | 365 546 −33.15
StarCoder 1 P 57.89 682 | 628 550 +14.18
Bing AI Chat (GPT-4.0) 4 J 49.22 648 | 552 765 −27.84
Bing AI Chat (GPT-4.0) 4 P 57.16 450 | 436 454 −3.96
CodeWhisperer 2 P 63.55 204 | 209 122 +71.31
CodeWhisperer 3 J 56.93 605 | 752 605 +24.30
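The relative difference in the last column corresponds to (Tcorrect − TTC)/TTC, expressed in percent. A minimal sketch reproducing two rows of the table for illustration:

    def delta_tcorrect_ttc(t_correct: float, ttc: float) -> float:
        # Relative difference between Tcorrect and TTC in percent; positive values
        # mean that correcting the AI-generated code is estimated to be faster
        # than implementing the correct code from scratch.
        return (t_correct - ttc) / ttc * 100

    # StarCoder, task 1, C++: Tcorrect = 1225 s, TTC = 1234 s
    print(round(delta_tcorrect_ttc(1225, 1234), 2))  # -0.73
    # CodeWhisperer, task 2, Python: Tcorrect = 209 s, TTC = 122 s
    print(round(delta_tcorrect_ttc(209, 122), 2))    # 71.31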
a generative AI that delivers correct, efficient and maintainable program code in every case.
However, we have learned that AI-generated program code can have the potential to
speed up programming, even if the program code is incorrect because often only minor
modifications are needed to make it correct. For a quick and detailed evaluation of the
generated program codes, we used different evaluation metrics and introduced TTC, an
estimation of the time to correct incorrect program code.
In future work, we plan to analyze the quality of AI-generated program code in other
programming languages. For that, we will expand our AI/Human-Generated Program Code
Dataset to cover further programming languages and coding problems. To have a fair
comparison among the generative AIs, we applied a prompt engineering strategy that is
applicable to a wide range of generative AIs. However, in future work, we plan to investi-
gate the optimal prompting approach for each generative AI individually. Furthermore,
we are interested in investigating whether the interaction of different chatbots leveraging
different generative AIs helps to improve the final program code quality. For example, as in
a human programming team, the generative AIs could take on different roles, e.g., a chatbot
that develops the software architecture, a chatbot that is responsible for testing, a chatbot
that generates the code or different all-rounders that interact with each other. Since many
related works report pass@k, we could also have the program codes produced several times
for comparability and report pass@k. Since the development of generative AIs is rapid,
it makes sense to apply our experiments to new generative AIs soon. In this work, we
estimated the time for writing and correcting program code based on Halstead metrics. But a
comparison with the real time required by a representative set of programmers may also be
part of future work. We have provided insights into how state-of-the-art generative AI deals
with specific coding problems. Moving forward, it would be beneficial to comprehensively
investigate how generative AIs handle the generation of more complex programs or even
complete software solutions. This could include an analysis of their ability not only to
generate intricate algorithms, but also to manage large codebases, use frameworks and
libraries in reasonable contexts, and take best practices of software engineering into account.
Additionally, it would be interesting to explore how generative AIs could be integrated into
existing software development workflows, and whether they could contribute to increased
efficiency and productivity.
References
1. Pelau, C.; Dabija, D.C.; Ene, I. What Makes an AI Device Human-like? The Role of Interaction Quality, Empathy and Perceived
Psychological Anthropomorphic Characteristics in the Acceptance of Artificial Intelligence in the Service Industry. Comput. Hum.
Behav. 2021, 122, 106855. [CrossRef]
2. Dibitonto, M.; Leszczynska, K.; Tazzi, F.; Medaglia, C.M. Chatbot in a Campus Environment: Design of LiSA, a Virtual Assistant
to Help Students in Their University Life. In Proceedings of the Human-Computer Interaction; Interaction Technologies; Kurosu, M.,
Ed.; Springer: Cham, Switzerland, 2018; pp. 103–116.
3. Arteaga, D.; Arenas, J.J.; Paz, F.; Tupia, M.; Bruzza, M. Design of Information System Architecture for the Recommendation of
Tourist Sites in the City of Manta, Ecuador through a Chatbot. In Proceedings of the 2019 14th Iberian Conference on Information
Systems and Technologies (CISTI), Coimbra, Portugal, 19–22 June 2019; pp. 1–6.
4. Falala-Séchet, C.; Antoine, L.; Thiriez, I.; Bungener, C. Owlie: A Chatbot that Provides Emotional Support for Coping with
Psychological Difficulties. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, Paris, France,
2–5 July 2019.
5. Adiwardana, D.; Luong, M.T.; So, D.R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al.
Towards a Human-like Open-Domain Chatbot. arXiv 2020, arXiv:2001.09977.
6. Schaaff, K.; Reinig, C.; Schlippe, T. Exploring ChatGPT’s Empathic Abilities. arXiv 2023, arXiv:2308.03527.
7. Taecharungroj, V. “What Can ChatGPT Do?” Analyzing Early Reactions to the Innovative AI Chatbot on Twitter. Big Data Cogn.
Comput. 2023, 7, 35. [CrossRef]
8. Loh, E. ChatGPT and Generative AI Chatbots: Challenges and Opportunities for Science, Medicine and Medical Leaders. BMJ
Lead. 2023. [CrossRef]
9. Mollick, E. ChatGPT Is a Tipping Point for AI. Harvard Business Review, 14 December 2022.
10. Alizadehsani, Z.; Gomez, E.G.; Ghaemi, H.; González, S.R.; Jordan, J.; Fernández, A.; Pérez-Lancho, B. Modern Integrated
Development Environment (IDEs). In Proceedings of the Sustainable Smart Cities and Territories, Doha, Qatar, 27–29 April 2021;
Corchado, J.M., Trabelsi, S., Eds.; Springer: Cham, Switzerland, 2022; pp. 274–288.
11. Kaur, A.; Jadhav, A.; Kaur, M.; Akter, F. Evolution of Software Development Effort and Cost Estimation Techniques: Five Decades
Study Using Automated Text Mining Approach. Math. Probl. Eng. 2022, 2022, 5782587. [CrossRef]
12. Bluemke, I.; Malanowska, A. Software Testing Effort Estimation and Related Problems: A Systematic Literature Review. ACM
Comput. Surv. 2021, 54, 1–38. [CrossRef]
13. Butt, S.A.; Misra, S.; Piñeres-Espitia, G.; Ariza-Colpas, P.; Sharma, M.M. A Cost Estimating Method for Agile Software
Development. In Proceedings of the Computational Science and Its Applications— ICCSA 2021, Cagliari, Italy, 13–16 September
2021; Gervasi, O., Murgante, B., Misra, S., Garau, C., Blečić, I., Taniar, D., Apduhan, B.O., Rocha, A.M.A.C., Tarantino, E., Torre,
C.M., Eds.; Springer: Cham, Switzerland, 2021; pp. 231–245.
14. Zhang, B.; Liang, P.; Zhou, X.; Ahmad, A.; Waseem, M. Practices and Challenges of Using GitHub Copilot: An Empirical Study.
In Proceedings of the International Conferences on Software Engineering and Knowledge Engineering, San Francisco, CA, USA,
1–10 July 2023; KSIR Virtual Conference Center, USA, 2023. [CrossRef]
15. Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language
Models for Code Generation. arXiv 2023, arXiv:2305.01210v3.
16. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al.
Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
17. Yetiştiren, B.; Özsoy, I.; Ayerdem, M.; Tüzün, E. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical
Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv 2023, arXiv:2304.10778.
18. Wang, B.; Komatsuzaki, A. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. 2021. Available online: https://github.com/kingoflolz/mesh-transformer-jax/?tab=readme-ov-file#gpt-j-6b (accessed on 29 January 2024).
19. Yetistiren, B.; Ozsoy, I.; Tuzun, E. Assessing the Quality of GitHub Copilot’s Code Generation. In Proceedings of the 18th
International Conference on Predictive Models and Data Analytics in Software Engineering, Singapore, 17 November 2022.
20. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
21. Phind. 2023. Available online: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2 (accessed on 12 November 2023).
22. Luo, Z.; Xu, C.; Zhao, P.; Sun, Q.; Geng, X.; Hu, W.; Tao, C.; Ma, J.; Lin, Q.; Jiang, D. WizardCoder: Empowering Code Large
Language Models with Evol-Instruct. arXiv 2023, arXiv:2306.08568.
23. OpenAI. Introducing ChatGPT. 2022. Available online: https://openai.com/blog/chatgpt (accessed on 30 September 2023).
24. Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; et al. Code Llama: Open
Foundation Models for Code. arXiv 2023, arXiv:2308.12950.
25. Li, R.; Ben Allal, L.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. StarCoder: May
the Source be with You! arXiv 2023, arXiv:2305.06161.
26. Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. CodeGen: An Open Large Language Model
for Code with Multi-Turn Program Synthesis. arXiv 2023, arXiv:2203.13474.
27. Wang, Y.; Le, H.; Gotmare, A.D.; Bui, N.D.; Li, J.; Hoi, S.C. CodeT5+: Open Code Large Language Models for Code Understanding
and Generation. arXiv 2023, arXiv:2305.07922. [CrossRef]
28. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.;
Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
29. Nijkamp, E.; Hayashi, H.; Xiong, C.; Savarese, S.; Zhou, Y. CodeGen2: Lessons for Training LLMs on Programming and Natural
Languages. arXiv 2023, arXiv:2305.02309.
30. Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An
Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna (accessed on 29 January 2024).
31. Allal, L.B.; Li, R.; Kocetkov, D.; Mou, C.; Akiki, C.; Ferrandis, C.M.; Muennighoff, N.; Mishra, M.; Gu, A.; Dey, M.; et al.
SantaCoder: Don’t reach for the stars! arXiv 2023, arXiv:2301.03988
32. Fried, D.; Aghajanyan, A.; Lin, J.; Wang, S.; Wallace, E.; Shi, F.; Zhong, R.; tau Yih, W.; Zettlemoyer, L.; Lewis, M. InCoder: A
Generative Model for Code Infilling and Synthesis. arXiv 2023, arXiv:2204.05999.
33. Wang, B. Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX. 2021. Available
online: https://github.com/kingoflolz/mesh-transformer-jax (accessed on 29 January 2024).
34. Black, S.; Gao, L.; Wang, P.; Leahy, C.; Biderman, S.R. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow;
Zenodo; 2021. [CrossRef]
35. Xu, F.F.; Alon, U.; Neubig, G.; Hellendoorn, V.J. A Systematic Evaluation of Large Language Models of Code. In Proceedings of
the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2022), New York, NY, USA, 13 June 2022;
pp. 1–10. [CrossRef]
36. Stability-AI. StableLM: Stability AI Language Models. 2023. Available online: https://github.com/Stability-AI/StableLM
(accessed on 12 November 2023).
37. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Lago, A.D.; et al.
Competition-Level Code Generation with AlphaCode. Science 2022, 378, 1092–1097. [CrossRef]
38. Nguyen, N.; Nadi, S. An Empirical Evaluation of GitHub Copilot’s Code Suggestions. In Proceedings of the 2022 IEEE/ACM
19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 23–24 May 2022; pp. 1–5. [CrossRef]
39. OpenGenus IQ. GPT-3.5 Model Architecture. 2023. Available online: https://iq.opengenus.org/gpt-3-5-model/ (accessed on 30
September 2023).
40. Choudhry, S. Languages Supported by ChatGPT and How to Use It in Other Languages. 2023. Available online: https://www.mlyearning.org/languages-supported-by-chatgpt/ (accessed on 30 September 2023).
41. Patel, D.; Wong, G. GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE. 2023. Available online: https://github.com/llv22/gpt4_essay/blob/master/GPT-4-4.JPG (accessed on 30 September 2023).
42. Yalalov, D.; Myakin, D. GPT-4’s Leaked Details Shed Light on its Massive Scale and Impressive Architecture. Metaverse Post, 11
July 2023. Available online: https://mpost.io/gpt-4s-leaked-details-shed-light-on-its-massive-scale-and-impressive-architecture
(accessed on 29 January 2024).
43. OpenAI. GPT-4. OpenAI Research. 2023. Available online: https://openai.com/gpt-4 (accessed on 29 January 2024).
44. GitHub. GitHub Copilot. 2021. Available online: https://github.com/features/copilot/ (accessed on 2 October 2023).
45. Zaremba, W.; Brockman, G. OpenAI Codex. 2021. Available online: https://openai.com/blog/openai-codex/ (accessed on 2
October 2023).
46. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
47. Hugging Face. llm-Vscode. 2023. Available online: https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode (accessed on 2 October 2023).
48. Phillips, J. StarCoder. 2023. Available online: https://plugins.jetbrains.com/plugin/22090-starcoder/versions (accessed on 2
October 2023).
49. Amazon Web Services, Inc. Amazon CodeWhisperer FAQs. 2023. Available online: https://aws.amazon.com/de/codewhisperer/faqs/ (accessed on 3 October 2023).
50. Amazon Web Services, Inc. CodeWhisperer User Guide. 2023. Available online: https://docs.aws.amazon.com/pdfs/codewhisperer/latest/userguide/user-guide.pdf (accessed on 3 October 2023).
51. Hugging Face. Dataset Card for CodeSearchNet Corpus. 2023. Available online: https://huggingface.co/datasets/code_search_net (accessed on 3 October 2023).
52. Hugging Face. GitHub Code Dataset. 2023. Available online: https://huggingface.co/datasets/codeparrot/github-code
(accessed on 3 October 2023).
53. Chaudhary, S. Code Alpaca: An Instruction-following LLaMA Model Trained on Code Generation Instructions. 2023. Available
online: https://github.com/sahil280114/codealpaca (accessed on 3 October 2023).
54. LeetCode. LeetCode QuickStart Guide. 2023. Available online: https://support.leetcode.com/hc/en-us/articles/360012067053-LeetCode-QuickStart-Guide (accessed on 10 October 2023).
55. McCabe, T. A Complexity Measure. IEEE Trans. Softw. Eng. 1976, SE-2, 308–320. [CrossRef]
56. Cormen, T.; Leiserson, C.; Rivest, R.; Stein, C. Introduction to Algorithms, 4th ed.; MIT Press: Cambridge, MA, USA, 2022.
57. Baeldung. Understanding Space Complexity. Baeldung Comput. Sci. 2021. Available online: https://www.baeldung.com/cs/time-vs-space-complexity (accessed on 29 January 2024).
58. Halstead, M.H. Elements of Software Science; Elsevier: Amsterdam, The Netherlands, 1977; pp. xiv, 127.
59. Heričko, T.; Šumak, B. Exploring Maintainability Index Variants for Software Maintainability Measurement in Object-Oriented
Systems. Appl. Sci. 2023, 13, 2972. [CrossRef]
60. Microsoft. Visual Studio—Maintainability Index. 2021. Available online: https://docs.microsoft.com/en-us/visualstudio/code-quality/code-metrics-maintainability-index-range-and-meaning (accessed on 27 November 2023).