Programming with ChatGPT: How far can we go?
Artificial intelligence (AI) has made remarkable strides, giving rise to the development of large language models such as ChatGPT. The chatbot has garnered significant attention from academia, industry, and the general public, marking the beginning of a new era in AI applications. This work explores how well ChatGPT can write source code. To this end, we performed a series of experiments to assess the extent to which ChatGPT is capable of solving general programming problems. Our objective is to assess ChatGPT's capabilities in two different programming languages, namely C++ and Java, by providing it with a set of programming problems encompassing various types and difficulty levels. We focus on evaluating ChatGPT's performance in terms of code correctness, run-time efficiency, and memory usage. The experimental results show that, while ChatGPT is good at solving easy and medium programming problems written in C++ and Java, it encounters difficulties with more complicated tasks in the two languages. Compared to code written by humans, the code generated by ChatGPT is of lower quality with respect to runtime and memory usage.

Keywords: ChatGPT; Large language models; Programming
levels. However, its accuracy in generating correct code decreases when being prompted with more challenging problems. Similarly, concerning runtime performance and memory usage, ChatGPT obtains above-average results for problems of lower and medium difficulty, while its performance worsens when solving more complicated programming problems. The experimental results suggest that, while ChatGPT cannot produce source code as good as that of human programmers, it demonstrates promising potential as a programming assistant, marking the beginning of a new era in the utilization of machine learning and large language model based approaches for code generation.

The main contributions of our work are summarized as follows.

• By means of a dataset collected from LeetCode, we evaluate (i) how well ChatGPT can solve general programming problems; and (ii) how ChatGPT's solutions compare to those written by humans in terms of runtime and memory usage.
• The dataset and the paper corpus curated in this paper have been published online to allow for future research.3

3 https://github.com/hampusekedahl/ChatGPT_Experiment_Extended

The remainder of this paper is organized as follows. Section 2 introduces key concepts and terminology essential for understanding the context of our work. Section 3 details the pipeline that we built for evaluating the code generated by ChatGPT. In Section 4, we describe the research methodology, including the research goal and questions, the experimental design, and threats to validity. Section 5 presents a comprehensive summary of the results of our experiment, while Section 6 delves into a discussion of the implications of our results. Section 7 reviews related work. Finally, Section 8 concludes the work with final remarks and future work.

2. Background

In this section, we introduce key concepts and terminology essential for understanding the context of our work.

2.1. Natural language processing and ChatGPT

Natural Language Processing (NLP) is a sub-field of AI (Aker et al., 2019) that aims at enabling computers to understand, process, and generate human language. It encompasses various techniques, including statistical analysis and computational linguistics, to develop models capable of interpreting the meaning, emotional tone, and intent behind human language, hence facilitating effective interaction between computers and humans (Arnold et al., 1994). The field of NLP originated in the 1950s and 1960s, as researchers began exploring the possibility of using computers to understand human language (Nadkarni et al., 2011). Early NLP systems used a set of grammatical rules to analyze and create sentences, but they could not handle complex sentence structures and the nuances of human language. In the 1980s and 1990s, researchers started to explore statistical approaches for NLP, as described by Nadkarni et al. (2011). Models such as Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) were developed, which were more flexible and could handle a wider range of sentence structures. However, they still faced challenges with the nuances of language, such as detecting sarcasm. In the early 2000s, researchers applied ML techniques, such as Support Vector Machines (SVMs) and neural networks, to NLP (Nadkarni et al., 2011). These techniques needed large amounts of labeled data to train models and were able to capture more subtle patterns in language with greater accuracy. Today, NLP approaches include techniques such as Transformers and Recurrent Neural Networks (RNNs) (Gillioz et al., 2020). These approaches can process massive amounts of text and generate human-like responses, making applications such as chat-bots, language translation, and voice assistants possible (Natural Language Processing, 2023).

As NLP techniques have evolved, they have enabled the development of sophisticated chat-bots like ChatGPT. ChatGPT is developed by OpenAI and built upon the advanced GPT-3.5 and GPT-4 architectures, which allow the model to understand and process complex human language effectively, as well as tackle complex tasks such as programming (OpenAI, 2023). ChatGPT has been trained on a large amount of text and code, which includes various user questions and their appropriate responses. This extensive data-set has equipped ChatGPT with a broad knowledge base that encompasses various programming languages and concepts (Israelsen, 2023; OpenAI, 2023). Given these capabilities, we believe that ChatGPT is a suitable candidate for exploring the potential of chat-bots replacing human programmers.

2.2. LeetCode

LeetCode is an online platform offering a collection of programming problems and challenges that is widely used by programmers to practice and enhance their coding skills. LeetCode provides a reliable repository of programming challenges categorized based on their difficulty levels and topics. The difficulty levels are "easy", "medium", and "hard", while the topics span a wide range, including algorithms, databases, shell scripting, concurrency, and more. It is worth noting that our contribution does not include the categorization of problems based on their difficulty level or topic; readers who are interested in the specific categorization of problems can refer to the LeetCode platform for greater detail.4 Typically, each programming problem comes with a problem statement and sample inputs and outputs. LeetCode includes a built-in editor and compiler, which allows users to test their code against a set of predefined test cases for evaluating its correctness and efficiency. In addition, LeetCode tracks submission status, provides error messages, and places users in distinct performance percentiles based on their submission scores. In this work, we leverage LeetCode as a source of diverse programming problems and challenges for our experiment. We prompt ChatGPT with programming problems and challenges of varying topics and difficulty available on the platform. We then submit the solutions generated by ChatGPT to LeetCode to compare the accuracy and efficiency of ChatGPT-generated code against the code written by human programmers.

4 https://leetcode.com/discuss/interview-question?currentPage=1&orderBy=most_relevant&query=levels

2.3. Code quality

Avizienis et al. emphasized the significance of dependability and security in computing systems, which can only be achieved through the development of high-quality code (Avizienis et al., 2004). Poor code quality can lead to bugs, errors, and vulnerabilities, jeopardizing the reliability and security of a system (Avizienis et al., 2004). Error detection and recovery techniques also rely on properly functioning code. Thus, developing good code is crucial for dependable and secure computing systems (Avizienis et al., 2004). In software engineering, code quality is evaluated based on several attributes, including memory efficiency and runtime performance (Sharma et al., 2022). This work focuses on these two metrics. In this study, runtime performance refers to the total duration required for a solution to execute, and memory utilization refers to the overall amount of memory resources used by the solution during its execution. For the measurement and evaluation of runtime performance and memory utilization, we rely on LeetCode's publicly available data, which serves as a reliable benchmark for assessing these properties of the generated code.
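To make the two metrics concrete, the following Python sketch shows how runtime and peak memory could be measured locally for a single solution. This is only an illustration: in the study itself both figures come from LeetCode's submission feedback, and the solution function and input below are hypothetical examples, not part of our dataset.

    import time
    import tracemalloc

    def measure(solution, *args):
        """Return (result, elapsed seconds, peak allocated bytes) for one call."""
        tracemalloc.start()
        t0 = time.perf_counter()
        result = solution(*args)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        return result, elapsed, peak

    # Hypothetical example: a hash-map Two Sum solution measured on one input.
    def two_sum(nums, target):
        seen = {}
        for i, x in enumerate(nums):
            if target - x in seen:
                return [seen[target - x], i]
            seen[x] = i

    print(measure(two_sum, [2, 7, 11, 15], 9))  # -> ([0, 1], <seconds>, <bytes>)

Note that tracemalloc only tracks Python-level allocations; LeetCode's reported figures for C++ and Java solutions are produced by its own instrumented runtime, which is why we rely on the platform's data rather than local measurements.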
3. The proposed pipeline
In our study, we employed key tools for data collection and analysis, including the GPT-4.0 language model, LeetCode, Git and GitHub, Google Sheets, a Python GUI, and various Python frameworks. We accessed and executed the GPT-4.0 language model through the OpenAI website.5 To utilize the ChatGPT-4.0 API, we subscribed to a monthly plan that incurred a cost of $20. This subscription was crucial for conducting our experiments and generating the required solutions for our research. An overview of the proposed pipeline is shown in Fig. 1. Data was collected based on correctness, runtime, and memory usage using LeetCode's built-in code submission system. The system compiles and executes submitted code on a range of test cases for each programming question, and evaluates the output against the expected output for each test case.
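In our experiments the prompts were submitted through the ChatGPT interface; the snippet below is only a hedged sketch of how the same step could be scripted with the openai Python package as it existed at the time of the study (the pre-1.0 interface). The model name, key handling, and temperature are placeholder choices, not settings reported in this paper.

    import openai  # openai<1.0, the client interface available in 2023

    openai.api_key = "YOUR_API_KEY"  # placeholder

    def ask_chatgpt(prompt: str) -> str:
        """Send one standardized prompt and return the model's reply text."""
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduce variability across repeated runs
        )
        return response["choices"][0]["message"]["content"]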
We stored all data in an Excel spreadsheet and linked it with relevant code solutions and errors in GitHub6 by means of unique IDs. A dedicated GitHub repository was created to store the solutions and errors generated during each iteration of the experiment. Each experiment was assigned a Google Sheets document, and the data was organized in a structured format, facilitating efficient analysis. Essentially, the use of Google Sheets minimizes the potential for errors and inconsistencies, and ensures that the data remains well-organized and easily accessible for further analysis. The utilization of LeetCode is threefold, as follows:

• We leveraged the extensive collection of programming questions available on LeetCode to construct our dataset.
• LeetCode is used to collect statistical data on the runtime, memory usage, and acceptance rates of human-written solutions for programming tasks.
• Eventually, LeetCode provides a validation and benchmarking mechanism for the solutions generated by ChatGPT.

To ensure consistency and standardization in presenting the problems to ChatGPT, we developed a graphical user interface (GUI) tool – shown in Fig. 2 – using the PySimpleGUI library for generating standardized prompts. This GUI plays a crucial role in our experiments by facilitating prompt generation. The Python GUI offers a user-friendly interface, allowing us to seamlessly generate prompts. It accepts as input each question in its corresponding starting template, and automatically generates prompts in a specific format, described in Section 4.2. We could select programming problems from our data-set and generate prompts based on predefined templates. Additionally, the GUI offers the functionality to incorporate solutions based on errors encountered in previous iterations. This feature enhances the efficiency of our experiments, enabling us to iterate and improve the prompts based on the feedback received.

Fig. 2. The GUI for generating standardized prompts.

It is worth mentioning that the prompt generator is a graphical user interface tool developed as an effective means to generate standardized prompts. Essentially, the tool is conceived solely as a supporting element, helping users to generate prompts; it does not affect the final results.

Once a prompt was generated using the GUI tool, it was stored in a GitHub repository. The prompt was then passed as input to ChatGPT to produce an output response, which was then documented in the same GitHub repository for further reference and analysis. Subsequently, the output response was fed as input to LeetCode's built-in editor to submit it as a solution to the corresponding problem. If the solution generated by ChatGPT passed the test cases on LeetCode, the corresponding feedback, including metrics and statistics, was documented and collected using Google Sheets. In contrast, if an error occurred during the submission process, it was documented and passed back to the GUI tool. The error message was then formatted into our template for error occurrences, and the modified prompt was once again passed to ChatGPT for another iteration of generating a response. This process was repeated for up to a maximum of three iterations; if the solution still failed after the third iteration, it was counted as a negative score, to be used for computing the final results (a sketch of this loop is given at the end of this section).

When deciding on the prompt format for the experiments, our primary focus was to create clear and concise prompts. Clarity and conciseness were crucial to ensure that ChatGPT accurately understood our instructions and to minimize any potential misunderstandings. In addition, this helped us ensure that the responses produced were meaningful and relevant to our research aim.

For data analysis and visualization, we employed four Python libraries, i.e., matplotlib, numpy, seaborn, and pandas. These frameworks are used to analyze and visualize the data collected from our experiments, allowing for insightful interpretations and representations of the results. To analyze the collected data, we performed a descriptive statistical analysis.

5 https://openai.com/
6 https://github.com/
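As an illustration of the prompt generator described above, a minimal PySimpleGUI window might look as follows. The layout, element keys, and template string are simplified stand-ins, not the exact tool shown in Fig. 2.

    import PySimpleGUI as sg

    # Simplified stand-in for the study's prompt template (see Section 4.2).
    TEMPLATE = "Solve the following LeetCode problem in {language}.\n\n{problem}{error}"

    layout = [
        [sg.Text("Problem statement")],
        [sg.Multiline(key="-PROBLEM-", size=(80, 10))],
        [sg.Text("Error from the previous iteration (optional)")],
        [sg.Multiline(key="-ERROR-", size=(80, 4))],
        [sg.Combo(["C++", "Java"], default_value="C++", key="-LANG-"), sg.Button("Generate")],
        [sg.Multiline(key="-OUT-", size=(80, 10))],
    ]

    window = sg.Window("Prompt generator", layout)
    while True:
        event, values = window.read()
        if event == sg.WIN_CLOSED:
            break
        if event == "Generate":
            err = values["-ERROR-"].strip()
            error = f"\n\nPrevious attempt failed with:\n{err}" if err else ""
            prompt = TEMPLATE.format(language=values["-LANG-"],
                                     problem=values["-PROBLEM-"], error=error)
            window["-OUT-"].update(prompt)  # ready to copy into ChatGPT
    window.close()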
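The three-iteration protocol can likewise be summarized in Python. Here submit_to_leetcode stands in for the manual submission step performed through LeetCode's editor, and ERROR_TEMPLATE for our error-occurrence template; both names are hypothetical, and ask_chatgpt is the wrapper from the earlier sketch.

    from typing import Optional, Tuple

    MAX_ITERATIONS = 3

    def ask_chatgpt(prompt: str) -> str:
        ...  # see the earlier OpenAI sketch; returns the model's reply

    def submit_to_leetcode(code: str) -> Tuple[bool, str]:
        ...  # hypothetical stand-in for the manual LeetCode submission step

    # Simplified stand-in for the study's error-occurrence template.
    ERROR_TEMPLATE = ("{original}\n\nThe previous solution failed with the "
                      "following error:\n{error}\nPlease provide a corrected solution.")

    def solve_with_feedback(problem_prompt: str) -> Optional[int]:
        """Run the prompt-submit-repair loop for at most three iterations.

        Returns the 1-based iteration at which the solution was accepted, or
        None after the third failure (counted as a negative score).
        """
        prompt = problem_prompt
        for attempt in range(1, MAX_ITERATIONS + 1):
            code = ask_chatgpt(prompt)
            accepted, feedback = submit_to_leetcode(code)
            if accepted:
                return attempt  # runtime/memory feedback is recorded in Google Sheets
            prompt = ERROR_TEMPLATE.format(original=problem_prompt, error=feedback)
        return None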
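For the descriptive analysis step, a sketch of the kind of pandas/seaborn processing we mean is shown below. The file name and column names are hypothetical, since the actual data lives in per-experiment Google Sheets.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical CSV export of one experiment's Google Sheet;
    # assumed columns: difficulty, runtime_ms, memory_mb, accepted.
    df = pd.read_csv("experiment_results.csv")

    # Descriptive statistics per difficulty level.
    print(df.groupby("difficulty")[["runtime_ms", "memory_mb"]].describe())

    # Distribution of runtimes per difficulty level.
    sns.violinplot(data=df, x="difficulty", y="runtime_ms")
    plt.savefig("runtime_by_difficulty.png")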
Fig. 4. Example of prompt one for the Two Sum problem from LeetCode.
Table 1
RQ1: Success rate per iteration corresponding to each difficulty level (%).

Difficulty   1st iteration   2nd iteration   3rd iteration
Easy         96.67           100.00          100.00
Medium       96.67           100.00          100.00
Hard         76.67           93.33           93.33
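As a reading aid, and as an assumption on our part rather than something stated in the table, the reported percentages are consistent with 30 problems per difficulty level, e.g. 29/30 yields 96.67% and 23/30 yields 76.67%:

    # Assumption: 30 problems per difficulty level (not stated in Table 1 itself).
    for solved in (29, 30, 23, 28):
        print(f"{solved}/30 = {100 * solved / 30:.2f}%")  # 96.67, 100.00, 76.67, 93.33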
our work, we did not attempt to investigate ChatGPT's superiority over other systems or methods. Instead, the ultimate aim is to provide a reproducible assessment of ChatGPT's code generation capabilities.

ChatGPT has been built on top of GPT (Generative Pre-trained Transformer), which traces its roots back to the Transformer encoder-decoder architecture, consisting of two sides: one for encoding the input data, and the other for decoding the output data. In this way, we can think of querying ChatGPT to get source code as putting a sentence written in natural language in on one side (the encoder) and getting the source code out on the other (the decoder). The main focus of our work is to evaluate ChatGPT's code generation capabilities. Thus, we conducted a practical and empirical assessment of its utility as a code generation tool, rather than an exhaustive analysis of ChatGPT's underlying architecture and reasoning mechanisms. A comprehensive examination of its architecture and reasoning mechanisms could reveal additional insights into its strengths and weaknesses, especially in the context of general programming problems. This deserves an independent investigation, which we leave to a separate paper.

6.2. Threats to validity

In this section, we discuss the different types of validity threats that could affect the results of our study, along with the strategies adopted to mitigate them.

• Threats to external validity refer to the factors that could impact the generalization of the findings of an experiment to real-world applications (Wohlin et al., 2012). In our study, one such potential threat may be related to the chosen programming languages and to the problem set collected from LeetCode, i.e., tasks related to C++ and Java, two popular programming languages. For future work, we plan to extend our experiments with other languages, including Python. For the evaluation, we had a dataset of 240 programming exercises, and such a number might not be large enough to be generalizable. In fact, curating a dataset for the experiments involving both ChatGPT and LeetCode is a strenuous process, requiring a lot of time and effort, as the samples were first collected from ChatGPT and then uploaded to LeetCode to run further measurements. Altogether, this is a prolonged procedure, preventing us from increasing the number of code samples. To mitigate this threat, we attempted to cover three main levels of difficulty: Easy, Medium, and Hard. Moreover, the tasks spread over 10 different categories in programming, including 'Array', 'Dynamic Programming', 'Hash Table', 'Queue Stack', 'Binary Search', 'Greedy', 'Math', 'Sorting', and 'String'.
• Threats to internal validity concern the factors that can potentially affect the causal relationship between the independent variable and the outcome. In the experiments, we evaluated the solutions by ChatGPT and by humans using the same metrics, so as to make sure that we performed a fair comparison.
• Threats to conclusion validity are factors that could impact the ability to correctly draw conclusions about the relationship between the treatment and the outcome of an experiment. The chosen metrics may not capture all aspects of code, such as readability or scalability. Eventually, the data from LeetCode may not be representative of the overall performance of human programmers, which could also limit the validity of our conclusions.
• Threats to construct validity are related to the experimental design, as well as to social factors. In our study, one potential threat is that the programming tasks given to ChatGPT may not fully represent the complexity of tasks typically performed by human programmers, which could affect the validity of our evaluation. Additionally, the competition among participants may impact the performance of human programmers on the LeetCode programming tasks. As ChatGPT learns from previous prompts within a conversation, this could also impact the construct validity of our study.

7. Related work

To review related work, we conducted a systematic literature review. We started with an automatic search, in line with established practice in software engineering (Kitchenham & Brereton, 2013; Petersen et al., 2008), on the following four scientific databases and indexing systems: IEEE Xplore Digital Library,7 ACM Digital Library,8 Scopus,9 and Web of Science.10 Aiming at a reasonable trade-off between efficiency and the coverage of state-of-the-art studies, we adhered to the existing guidelines for systematic literature studies in software engineering. We queried the aforementioned databases using a concise and well-constructed search string, collecting as many studies as possible. Given the novelty of the topic, the following query string was utilized:

"ChatGPT" AND "programming"

The initial search produced a set of 27 peer-reviewed publications.11 From this set, we removed impurities and duplicates and obtained a new set of 22 publications. We used the guidelines by Ali and Petersen (2014) to define the following selection criteria for filtering the primary studies.

• Inclusion criteria. We considered studies that are: (i) subject to peer review; (ii) written in English; (iii) available as full-text; and (iv) focusing on ChatGPT and programming.
• Exclusion criteria. To ensure the inclusion of as many relevant studies as possible and minimize threats to validity, we excluded studies that are shorter than 4 pages.

Apart from the studies obtained by conducting the literature review, we also took into account some additional papers relevant to the topic under consideration, eventually obtaining the set of relevant papers reviewed below.

Jacques (2023) investigated the potential of using ChatGPT to help instructors in enhancing the critical thinking skills of novice programmers. Inspired by the historical developments in mathematical education, the author hypothesized that fluency and basic skills in programming are important regardless of the availability of advanced tools. Based on this hypothesis, Jacques proposed various tasks aimed at enhancing the critical thinking skills of novice programmers. One task involved instructing novices to create flowcharts that depict the logical structure of programs generated by ChatGPT. In addition, the author suggested a task where learners were prompted to extend or modify a program that was generated by ChatGPT. While the work utilized ChatGPT to generate programs, the primary focus was not on assessing the correctness, runtime performance, or memory usage of the generated solutions.

7 https://ieeexplore.ieee.org/Xplore/home.jsp
8 https://dl.acm.org/
9 https://www.scopus.com/
10 http://webofscience.com/
11 It is important to remark that we performed the automatic search in June 2023.
Similar to Jacques (2023), Kazemitabaar et al. tried to understand whether novice developers were able to understand the code generated by ChatGPT and to modify or extend the generated code (Kazemitabaar et al., 2023). In addition, the authors also studied whether using such tools would create a reliance on them, or would help learners write code even without such tools being present. To investigate these aspects, they developed a web-based application called Coding Steps. This application was specifically designed to facilitate the learning of basic Python programming. Coding Steps provides learners with a progressive set of programming tasks that introduce new concepts gradually. It offers a submission functionality that allows learners to submit their code to remote instructors for grading and feedback, which is provided through a dedicated grading dashboard. Additionally, Coding Steps incorporates code generation capabilities using the code-davinci-002 model from OpenAI's code completion API. This feature generates code to assist learners in completing their programming tasks. The study's results demonstrated that ChatGPT can be a valuable asset for computer science educators and students. Novice programmers who used the Coding Steps environment exhibited improved performance, increased efficiency, and reduced frustration when writing code. Importantly, the use of Coding Steps and ChatGPT did not negatively impact their ability to manually modify code or perform tasks without its support.

Yilmaz et al. investigated the effect of using ChatGPT on students' computational thinking skills, programming self-efficacy, and motivation towards the lesson (Yilmaz & Karaoglan Yilmaz, 2023). They conducted the research on 45 undergraduate students who took a university-level programming course, using an experimental design with a pretest-posttest control group. Students were randomly divided into an experimental group that could use ChatGPT during the weekly programming practices, and a control group that could not use it. Their findings revealed that the experimental group students' computational thinking skills, programming self-efficacy, and motivation for the lesson were significantly higher than those of the control group students. Hence, it can be said that ChatGPT may be useful in programming training.

In a broader context, Lertbanjongngam et al. compared the performance of the code generation tool AlphaCode12 to that of human programmers (Lertbanjongngam et al., 2022). The study discovered that AlphaCode was capable of generating code that was often similar to human-generated code, and that performed equally or worse in terms of execution time and memory utilization. While the authors measured the execution time themselves to evaluate solutions (Lertbanjongngam et al., 2022), we relied on the publicly available data provided by the LeetCode platform, which serves as a reliable benchmark for assessing the accuracy and efficiency of the generated code. Additionally, the authors used a popular competitive programming platform, namely Codeforces,13 to retrieve code produced by humans. Similar to Lertbanjongngam et al. (2022), we compared the performance of AI-generated code to that of human programmers. However, we did not focus on competitive programming, which is an important distinction. Competitive programming platforms like Codeforces and technical interview preparation platforms like LeetCode serve different purposes and require different skills and approaches. For example, Codeforces places more emphasis on time and memory limits (Mirzayanov, 2012), whereas LeetCode focuses on optimizing solutions and improving performance relative to other users.

Finnie-Ansley et al. investigated the accuracy of an AI model designed for code writing, named Codex, on a range of first-year college-level programming problems (Finnie-Ansley et al., 2022). Their study found that Codex outperformed most first-year programming students by ranking in the top quartile of a typical first-year programming exam. Their study evaluated the accuracy of Codex on 23 basic computer science programming problems, where the authors presented the assignments exactly as they were given to the students. They also assessed Codex through variations of the problem wording for the Rainfall Problem. Unlike Finnie-Ansley et al. (2022), we prompted ChatGPT with programming problems of varying difficulty levels and a custom-designed question, and compared ChatGPT's performance to public data available on LeetCode, rather than using student-generated data. Moreover, we used predetermined prompts, whereas Finnie-Ansley et al. used randomly varied problem wordings (Finnie-Ansley et al., 2022).

12 https://alphacode.deepmind.com/
13 https://codeforces.com/

8. Conclusions and future work

In this paper, we have evaluated the potential of ChatGPT in replacing humans as programmers. To achieve this, we started with a systematic literature review to gain an understanding of the current state of the art in the use of ChatGPT for programming tasks. We then designed and conducted experiments to assess the extent to which ChatGPT is capable of solving general programming problems. The objective was to assess ChatGPT's capabilities in two different programming languages, namely C++ and Java. To evaluate the quality of the generated code, we leveraged LeetCode's automated submission system, which assessed and provided feedback on code correctness and on performance with respect to runtime and memory usage.

Future work may encompass different directions. One direction may investigate the impact of dedicated training and its potential to enhance performance. Building upon the insights gained from our experiment, we observed that a more detailed prompt yielded more effective results in terms of finding solutions. We suppose that the architecture, the reasoning mechanisms, as well as the training of ChatGPT have a certain effect on general programming problems. This, however, needs to be investigated with concrete empirical evidence in a separate paper, and thus we consider it future work. A further possible direction is to explore the use of programming languages other than C++ and Java, and of a larger and more complex data-set of problems, to gain a more comprehensive understanding of ChatGPT's capabilities. Finally, in our future work, we will also explore additional code quality metrics beyond runtime and memory usage, as well as revisit the experiment with a focus on utilizing different parameters to assess code quality.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The work in this paper has been supported by the Swedish Knowledge Foundation (KKS) through the Modev project, by the Excellence in Production Research (XPRES) Framework, and by the Swedish Governmental Agency for Innovation Systems (VINNOVA) through the iSecure project. The authors would like to thank the anonymous reviewers for their valuable comments.

References

Aker, A., Ceausu, A., Feng, Y., Gaizauskas, R. J., Hunsicker, S., Ion, R., Irimia, E., Stefanescu, D., & Tufis, D. (2019). Mapping and aligning units from comparable corpora. In I. Skadina, R. J. Gaizauskas, B. Babych, N. Ljubesic, D. Tufis, & A. Vasiljevs (Eds.), Using Comparable Corpora for under-Resourced Areas of Machine Translation, Theory and Applications of Natural Language Processing (pp. 141-188). Springer. http://dx.doi.org/10.1007/978-3-319-99004-0_5.

Ali, N. B., & Petersen, K. (2014). Evaluating strategies for study selection in systematic literature studies. In Proceedings of the 8th ACM/IEEE international symposium on empirical software engineering and measurement (pp. 1-4).

Arnold, D., Balkan, L., Humphreys, R., Meijer, S., & Sadler, L. (1994). Machine Translation: An Introductory Guide. London: NCC Blackwell. URL http://www.essex.ac.uk/linguistics/external/clmt/MTbook/PostScript/.

Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33.

Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., & Prather, J. (2022). The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. In Proceedings of the 24th Australasian Computing Education Conference, ACE '22 (pp. 10-19). New York, NY, USA: Association for Computing Machinery.
Fisher, M. J., & Marshall, A. P. (2009). Understanding descriptive statistics. Australian Critical Care, 22(2), 93-97.

Fuscaldo, D. (2023). How chatbots can help grow your small business. URL https://www.businessnewsdaily.com/16018-chatbots-for-growth.html.

Gillioz, A., Casas, J., Mugellini, E., & Khaled, O. A. (2020). Overview of the transformer-based models for NLP tasks. In 2020 15th Conference on Computer Science and Information Systems, FedCSIS (pp. 179-183).

Hintze, J. L., & Nelson, R. D. (1998). Violin plots: A box plot-density trace synergism. The American Statistician, 52(2), 181-184. http://dx.doi.org/10.1080/00031305.1998.10480559. URL https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1998.10480559.

Israelsen, A. (2023). How to use ChatGPT to write code. URL https://www.pluralsight.com/blog/software-development/how-use-chatgpt-programming-coding.

Jacques, L. (2023). Teaching CS-101 at the dawn of ChatGPT. ACM Inroads, 14(2), 40-46. http://dx.doi.org/10.1145/3595634.

Kazemitabaar, M., Chow, J., Ma, C. K. T., Ericson, B. J., Weintrop, D., & Grossman, T. (2023). Studying the effect of AI code generators on supporting novice learners in introductory programming. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23. New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/3544548.3580919.

Kitchenham, B., & Brereton, P. (2013). A systematic review of systematic review process research in software engineering. Information and Software Technology, 55(12), 2049-2075. http://dx.doi.org/10.1016/j.infsof.2013.07.010.

Lertbanjongngam, S., Chinthanet, B., Ishio, T., Kula, R. G., Leelaprute, P., Manaskasemsak, B., Rungsawang, A., & Matsumoto, K. (2022). An empirical evaluation of competitive programming AI: A case study of AlphaCode. arXiv preprint arXiv:2208.08603.

Littman, M. L., Ajunwa, I., Berger, G., Boutilier, C., Currie, M., Doshi-Velez, F., Hadfield, G., Horowitz, M. C., Isbell, C., Kitano, H., Levy, K., Lyons, T., Mitchell, M., Shah, J., Sloman, S., Vallor, S., & Walsh, T. (2021). Gathering strength, gathering storms: The one hundred year study on artificial intelligence (AI100) 2021 study. SQ2: What are the most important advances in AI? Technical report, Stanford University. URL https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq2-what-are-most-important-advances-ai.

Marr, B. (2023). How ChatGPT and natural language technology might affect your job if you are a computer programmer. URL http://bit.ly/44fpw95.

Mirzayanov, M. (2012). Codeforces contest rules. URL https://codeforces.com/blog/entry/4088.

Müller, V. C. (2021). Ethics of artificial intelligence and robotics. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2021 ed.). Metaphysics Research Lab, Stanford University.

Nadkarni, P. M., Ohno-Machado, L., & Chapman, W. W. (2011). Natural language processing: An introduction. Journal of the American Medical Informatics Association, 18(5), 544-551.

Natural Language Processing (2023). URL https://www.ibm.com/topics/natural-language-processing.

OpenAI (2022). ChatGPT. URL https://openai.com/blog/chatgpt.

OpenAI (2023). GPT-4 research. URL https://openai.com/research/gpt-4.

Petersen, K., Feldt, R., Mujtaba, S., & Mattsson, M. (2008). Systematic mapping studies in software engineering. In G. Visaggio, M. T. Baldassarre, S. G. Linkman, & M. Turner (Eds.), 12th International Conference on Evaluation and Assessment in Software Engineering. Italy: University of Bari, 26-27 2008. Workshops in Computing. BCS. URL http://ewic.bcs.org/content/ConWebDoc/19543.

Phillips, T. (2022). Is there a shortage of developers? Developer shortage statistics in 2022. URL https://codesubmit.io/blog/shortage-of-developers/.

Sein, M. K., Henfridsson, O., Purao, S., Rossi, M., & Lindgren, R. (2011). Action design research. MIS Quarterly, 37-56.

Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2022). A survey on machine learning techniques for source code analysis. arXiv preprint arXiv:2110.09610.

Sun, W., Fang, C., You, Y., Miao, Y., Liu, Y., Li, Y., Deng, G., Huang, S., Chen, Y., Zhang, Q., Qian, H., Liu, Y., & Chen, Z. (2023). Automatic code summarization via ChatGPT: How far are we?

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in Software Engineering (1st ed.). Berlin, Heidelberg: Springer. http://dx.doi.org/10.1007/978-3-642-29044-2.

Yilmaz, R., & Karaoglan Yilmaz, F. G. (2023). The effect of generative artificial intelligence (AI)-based tool use on students' computational thinking skills, programming self-efficacy and motivation. Computers and Education: Artificial Intelligence, 4, Article 100147. http://dx.doi.org/10.1016/j.caeai.2023.100147.