
Machine Learning with Applications 15 (2024) 100526

Programming with ChatGPT: How far can we go?

Alessio Bucaioni a, Hampus Ekedahl a, Vilma Helander a, Phuong T. Nguyen b,∗

a Mälardalen University, Västerås, Sweden
b Department of Information Engineering, Computer Science and Mathematics, University of L'Aquila, L'Aquila, Italy

ARTICLE INFO

Keywords: ChatGPT; Large language models; Programming

ABSTRACT

Artificial intelligence (AI) has made remarkable strides, giving rise to the development of large language models such as ChatGPT. The chatbot has garnered significant attention from academia, industry, and the general public, marking the beginning of a new era in AI applications. This work explores how well ChatGPT can write source code. To this end, we performed a series of experiments to assess the extent to which ChatGPT is capable of solving general programming problems. Our objective is to assess ChatGPT's capabilities in two different programming languages, namely C++ and Java, by providing it with a set of programming problems encompassing various types and difficulty levels. We focus on evaluating ChatGPT's performance in terms of code correctness, runtime efficiency, and memory usage. The experimental results show that, while ChatGPT is good at solving easy and medium programming problems written in C++ and Java, it encounters some difficulties with more complicated tasks in the two languages. Compared to code written by humans, the code generated by ChatGPT is of lower quality with respect to runtime and memory usage.

1. Introduction

Artificial Intelligence (AI) is a rapidly advancing field that gains momentum in several domains, including finance, healthcare, entertainment, and transportation (Littman et al., 2021), to name but a few. Notably, AI has shown remarkable advancements in medical diagnosis, speech recognition, game playing, and autonomous vehicles (Littman et al., 2021). Natural Language Processing (NLP) is a sub-field of AI (Aker et al., 2019), playing a pivotal role in the advancement of various applications, including machine translation (Arnold et al., 1994). Among these applications, chatbots have witnessed a surge in popularity and find utility in several domains, including customer service, financial applications, and mental health counseling (Arnold et al., 1994; Fuscaldo, 2023). Code generation is one area of significant interest in chatbots, as they have the potential to generate code based on natural language inputs.

ChatGPT,1 a chatbot developed by OpenAI, sparked significant attention following its release in November 2022, primarily due to its advanced natural language processing capabilities and proficiency in code generation (OpenAI, 2022). The potential of ChatGPT to enhance the software development process holds profound implications for the future of software engineering and programming roles, particularly considering the ongoing growth in demand for skilled programmers (Marr, 2023; Phillips, 2022). Nevertheless, concerns remain regarding the reliability of chatbots in code generation, necessitating further investigation into the ethical implications associated with their use (Müller, 2021). Under these circumstances, it is crucial to delve into the exploration and evaluation of chatbots' capabilities in code generation.

The main objective of this work is to investigate the ability of ChatGPT in writing code, compared to human programmers. We conducted a series of experiments where ChatGPT was prompted with a set of 240 programming problems curated from the well-known coding website LeetCode.2 In our experiments, we targeted two different programming languages, i.e., C++ and Java, and used publicly available data provided by LeetCode, which serves as a reliable benchmark for assessing the accuracy and efficiency of the generated code. The set of programming problems was selected so as to encompass various types of programming tasks as well as difficulty levels. The code generated by ChatGPT was submitted to LeetCode, and the results were compared with those achieved by human programmers. To assess the accuracy, we measured the number of attempts required to successfully complete the given programming problems. We studied the efficiency by considering both runtime (in milliseconds) and memory usage (in megabytes).

Through the experiments, we see that ChatGPT exhibits proficiency in solving programming problems at lower and medium difficulty

∗ Corresponding author.
E-mail addresses: [email protected] (A. Bucaioni), [email protected] (P.T. Nguyen).
1 https://chat.openai.com/chat
2 https://leetcode.com/

https://doi.org/10.1016/j.mlwa.2024.100526
Received 15 July 2023; Received in revised form 27 December 2023; Accepted 6 January 2024; Available online 8 January 2024
2666-8270/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

levels. However, its accuracy in generating correct code decreases when being prompted with more challenging problems. Similarly, concerning runtime performance and memory usage, ChatGPT obtains above-average results for problems of lower and medium difficulty, while its performance worsens when solving more complicated programming problems. The experimental results suggest that, while ChatGPT cannot produce source code as good as that written by human programmers, it demonstrates a promising potential as a programming assistant, marking the beginning of a new era in the utilization of machine learning and large language modeling based approaches for code generation.

The main contributions of our work are summarized as follows.

• By means of a dataset collected from LeetCode, we evaluate (i) how well ChatGPT can solve general programming problems; and (ii) how ChatGPT's solutions compare to those by humans in terms of runtime and memory usage.
• The dataset and the paper corpus curated in this paper have been published online to allow for future research.3

The remainder of this paper is organized as follows. Section 2 introduces key concepts and terminology essential for understanding the context of our work. Section 3 details the pipeline that we built for evaluating the code generated by ChatGPT. In Section 4, we describe the research methodology, including the research goal and questions, experimental design, and threats to validity. Section 5 presents a comprehensive summary of the results of our experiment, while Section 6 delves into a discussion of the implications of our results. Section 7 reviews related work. Finally, Section 8 concludes the work with final remarks and future work.

2. Background

In this section, we introduce key concepts and terminology essential for understanding the context of our work.

2.1. Natural language processing and ChatGPT

Natural Language Processing (NLP) is a sub-field of AI (Aker et al., 2019) that aims at enabling computers to understand, process, and generate human language. It encompasses various techniques, including statistical analysis and computational linguistics, to develop models capable of interpreting the meaning, emotional tone, and intent behind human language, hence facilitating effective interaction between computers and humans (Arnold et al., 1994). The field of NLP originated in the 1950s and 1960s, as researchers began exploring the possibility of using computers to understand human language (Nadkarni et al., 2011). Early NLP systems used a set of grammatical rules to analyze and create sentences, but they could not handle complex sentence structures and the nuances of human language. In the 1980s and 1990s, researchers started to explore statistical approaches for NLP, as described by Nadkarni et al. (2011). Models such as Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) were developed, which were more flexible and could handle a wider range of sentence structures. However, they still faced challenges with the nuances of language, such as detecting sarcasm. In the early 2000s, researchers applied ML techniques, such as Support Vector Machines (SVMs) and neural networks, to NLP (Nadkarni et al., 2011). These techniques needed large amounts of labeled data to train models and were able to capture more subtle patterns in language with greater accuracy. Today, NLP approaches include techniques such as Transformers and Recurrent Neural Networks (RNNs) (Gillioz et al., 2020). These approaches can process massive amounts of text and generate human-like responses, making applications such as chatbots, language translation, and voice assistants possible (Natural Language Processing, 2023).

As NLP techniques evolve, they have enabled the development of sophisticated chatbots like ChatGPT. ChatGPT is developed by OpenAI and built upon the advanced GPT-3.5 and GPT-4 architectures, which allow the model to understand and process complex human language effectively, as well as tackle complex tasks such as programming (OpenAI, 2023). ChatGPT has been trained on a large amount of text and code, which includes various user questions and their appropriate responses. In addition, the extensive dataset has equipped ChatGPT with a broad knowledge base that encompasses various programming languages and concepts (Israelsen, 2023; OpenAI, 2023). Given these capabilities, we believe that ChatGPT is a suitable candidate for exploring the potential of chatbots as a replacement for human programmers.

2.2. LeetCode

LeetCode is an online platform offering a collection of programming problems and challenges that is widely used by programmers to practice and enhance their coding skills. LeetCode provides a reliable repository of programming challenges categorized based on their difficulty levels and topics. These difficulty levels encompass "easy", "medium", and "hard", while the topics span a wide range, including algorithms, databases, shell scripting, concurrency, and more. It is worth noting that our contribution does not include the categorization of problems based on their difficulty level or topic; readers who are interested in the specific categorization of problems can refer to the LeetCode platform for greater details.4 Typically, each programming problem comes with a problem statement and sample inputs and outputs. LeetCode includes a built-in editor and compiler, which allows users to test their code against a set of predefined test cases for evaluating its correctness and efficiency. In addition, LeetCode tracks submission status, provides error messages, and places users in distinct performance percentiles based on their submission scores. In this work, we leverage LeetCode as a source of diverse programming problems and challenges for our experiment. We prompt ChatGPT with programming problems of varying topics and difficulty available on the platform. We then submit the solutions generated by ChatGPT to LeetCode to compare the accuracy and efficiency of ChatGPT-generated code against code written by human programmers.

2.3. Code quality

Avizienis et al. emphasized the significance of dependability and security in computing systems, which can only be achieved through the development of high-quality code (Avizienis et al., 2004). Poor code quality can lead to bugs, errors, and vulnerabilities, jeopardizing the reliability and security of a system (Avizienis et al., 2004). Error detection and recovery techniques also rely on properly functioning code. Thus, developing good code is crucial for dependable and secure computing systems (Avizienis et al., 2004). In software engineering, code quality is evaluated based on several attributes, including memory efficiency and runtime performance (Sharma et al., 2022). This work focuses on both of these metrics. In this study, runtime performance refers to the total duration required for a solution to execute. Memory utilization refers to the overall amount of memory resources used by the solution during its execution. For the measurement and evaluation of runtime performance and memory utilization, we rely on LeetCode's publicly available data, which serves as a reliable benchmark for assessing these properties of the generated code.

3 https://github.com/hampusekedahl/ChatGPT_Experiment_Extended
4 https://leetcode.com/discuss/interview-question?currentPage=1&orderBy=most_relevant&query=levels
Fig. 1. The pipeline.

3. The pipeline for evaluating code generated by ChatGPT

In our study, we employed key tools for data collection and analysis,
including the GPT-4.0 language model, LeetCode, Git and GitHub,
Google Sheets, a Python GUI, and various Python frameworks. We
accessed and executed the GPT-4.0 language model through the OpenAI
website.5 To utilize the ChatGPT-4.0 API, we subscribed to a monthly
plan that incurred a cost of $20. This subscription was crucial for con-
ducting our experiments and generating the required solutions for our
research. An overview of the proposed pipeline is shown in Fig. 1. Data
was collected based on correctness, runtime, and memory usage using
LeetCode’s built-in code submission system. The system compiles and
executes submitted code on a range of test cases for each programming
question, and evaluates the output against the expected output for each
test case.
We stored all data in an Excel spreadsheet and linked it with relevant code solutions and errors in GitHub6 by means of unique IDs. A dedicated GitHub repository was created to store the solutions and errors generated during each iteration of the experiment. Each experiment was assigned a Google Sheets document, and the data was organized in a structured format, facilitating efficient analysis. Essentially, the use of Google Sheets minimizes the potential for errors and inconsistencies, and ensures that the data remains well-organized and easily accessible for further analysis. The utilization of LeetCode is threefold, as follows:

• We leveraged the extensive collection of programming questions available on LeetCode to construct our dataset.
• LeetCode is used to collect statistical data on the runtime, memory usage, and acceptance rates of human-generated solutions for programming tasks.
• Eventually, LeetCode provides a validation and benchmarking mechanism for the solutions generated by ChatGPT.

To ensure consistency and standardization in presenting the problems to ChatGPT, we developed a graphical user interface (GUI) tool – shown in Fig. 2 – using the PySimpleGUI library for generating standardized prompts. This GUI plays a crucial role in our experiments by facilitating prompt generation. The Python GUI offers a user-friendly interface, allowing us to seamlessly generate prompts. It accepts as input each question in its corresponding starting template, and automatically generates prompts in a specific format, described in Section 4.2. We could select programming problems from our dataset and generate prompts based on predefined templates. Additionally, the GUI offers the functionality to incorporate solutions based on errors encountered in previous iterations. This feature enhances the efficiency of our experiments, enabling us to iterate and improve the prompts based on the feedback received.

Fig. 2. The GUI for generating standardized prompts.

It is worth mentioning that the prompt generator is a graphical user interface tool developed as an effective means to generate standardized prompts. Essentially, the tool is conceived solely as a supporting element, helping users to generate prompts; it does not itself influence the final results.

Once the prompt was generated using the GUI tool, it was stored in a GitHub repository. The prompt was then passed as input to ChatGPT to produce an output response, which was then documented in the same GitHub repository for further reference and analysis. Subsequently, the output response was fed as input to LeetCode's built-in editor to submit it as a solution to the corresponding problem. If the solution generated by ChatGPT passed the test cases on LeetCode, the corresponding feedback, including metrics and statistics, was documented and collected using Google Sheets. In contrast, if an error occurred during the submission process, it was documented and passed back to the GUI tool. The error message was then formatted into our template for error occurrences, and the modified prompt was once again passed to ChatGPT for another iteration of generating a response. This process was repeated for up to a maximum of three iterations. After the third iteration, if the solution still failed, this was counted as a negative score, to be used for computing the final results.
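For illustration, this loop can be summarized in the following Python sketch. It is a minimal reconstruction under stated assumptions, not the actual tooling: ask_chatgpt and submit_to_leetcode are hypothetical placeholders for the manual steps performed through the ChatGPT and LeetCode web interfaces.

MAX_ITERATIONS = 3

def solve_with_feedback(problem, ask_chatgpt, submit_to_leetcode):
    """Prompt ChatGPT, submit the response to LeetCode, and retry on failure."""
    prompt = f"Solve this programming problem with C++\n\n{problem['statement']}"
    for attempt in range(1, MAX_ITERATIONS + 1):
        code = ask_chatgpt(prompt)  # hypothetical stand-in for the chat session
        result = submit_to_leetcode(problem['id'], code)  # hypothetical submission
        if result['accepted']:
            return {'attempt': attempt,
                    'runtime_ms': result['runtime_ms'],
                    'memory_mb': result['memory_mb']}
        if result['error_type'] == 'wrong_answer':
            # Follow-up prompt built from the template described in Section 4.2
            prompt = (f'With input "{result["input"]}" output was '
                      f'"{result["output"]}" but expected result was '
                      f'"{result["expected"]}"')
        else:
            # For runtime errors, the complete error message becomes the next prompt
            prompt = result['error_message']
    return None  # still failing after three iterations: counted as a negative score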
When deciding on the prompt format for the experiments, our primary focus was to create clear and concise prompts. Clarity and conciseness were crucial to ensure that ChatGPT accurately understood our instructions and to minimize any potential misunderstandings. In addition, this helped us ensure that the responses produced were meaningful and relevant to our research aim.

For data analysis and visualization, we employed four Python libraries, i.e., matplotlib, numpy, seaborn, and pandas. These frameworks are used to analyze and visualize the data collected from our experiments, allowing for insightful interpretations and representations of the results. To analyze the collected data, we performed a descriptive statistical analysis following existing work (Fisher & Marshall, 2009). Our analysis primarily focuses on key metrics, including runtime, memory usage, and success rate per iteration.
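As an illustration, this descriptive analysis can be reproduced with a few lines of pandas. The sketch below assumes a hypothetical CSV layout for the collected measurements; the file and column names are ours, not the study's.

import pandas as pd

# Assumed layout: one row per problem, with its difficulty and measured metrics
df = pd.read_csv("results.csv")

# Mean, standard deviation, quartiles, etc. of runtime and memory per difficulty
print(df.groupby("difficulty")[["runtime_ms", "memory_mb"]].describe())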

5 https://openai.com/
6 https://github.com/

4. Research methodology and study material

In this section, we describe the Research Questions (RQs) and the research methods, including the experimental design, employed in our study, which evaluated the possibility of ChatGPT writing correct programs.

4.1. Research questions

The goal of our work is to understand whether ChatGPT can replace humans in programming tasks. We break down this goal into the following RQs that provide unique and complementary answers to this investigation.

• RQ1: How well can ChatGPT solve general programming problems? We perform an empirical evaluation using ChatGPT and LeetCode to determine the extent to which ChatGPT is capable of solving general programming problems.
• RQ2: How do ChatGPT's solutions compare to those by humans in terms of runtime and memory usage? We look for insights into the performance of ChatGPT's solutions compared to those created by human programmers in terms of runtime and memory usage. This can be used to evaluate the feasibility and effectiveness of using ChatGPT as a tool for automating programming tasks.

To answer the aforementioned RQs, we used an iterative research process that draws upon the Action Design Research (ADR) process (Sein et al., 2011). We complemented the ADR process with several empirical research methods, including a systematic literature review and experiments (Wohlin et al., 2012). In particular, we employ the former to gain an understanding of the current state of the art on ChatGPT for programming tasks, and the latter for answering RQ1 and RQ2. In the following sections, we provide details on the experiment processes.

Fig. 3. The experiment process.

4.2. Quantitative evaluation

This section first describes the settings and tools used in the experimental approach, and then delves into the design and execution of the experiments used to answer RQ1 and RQ2.

The overall experimental approach is depicted in Fig. 3. A diverse and versatile dataset was curated by selecting 120 programming problems for each language, i.e., C++ and Java, from LeetCode, encompassing nine distinct categories and three difficulty levels, aiming to reflect a natural learning progression for computer science students. A random selection process was applied to choose questions from each category and level of difficulty. We designed our experiment to ensure comprehensiveness and representativeness across various programming concepts and difficulty levels. The categories are array, string, sorting, math, hash table, binary search, dynamic programming, greedy, and stack and queue, covering easy, medium, and hard tasks.

With RQ1, we assess the ability of ChatGPT to solve a programming problem using C++ and Java. To this end, we chose a straightforward prompt: "Solve this programming problem with C++", or "Solve this programming problem with Java". This prompt focuses solely on solving the problem without any specific requirements regarding, e.g., runtime performance or memory utilization. By keeping the prompt generic, we wanted to evaluate ChatGPT's problem-solving capabilities without biasing it towards any particular performance metric. An example of a prompt for this task is shown in Fig. 4, and the corresponding code generated by ChatGPT is in Fig. 5.

With RQ2, we further evaluated ChatGPT's runtime and memory utilization. To this end, we provided specific instructions regarding performance expectations by modifying the prompt format for RQ2 to include a clear directive for optimizing runtime and memory utilization, i.e., "Solve this programming problem with C++ as runtime and memory efficient as possible". By explicitly stating the performance requirements, we ensured that ChatGPT would focus on generating solutions that were not only correct but also efficient in terms of runtime and memory usage. The source code generated by ChatGPT was then downloaded to our local computers and uploaded to LeetCode, which performs further analysis. In particular, thanks to its internal design, LeetCode is able to measure both the running time and memory consumption of any input code snippet.

When a solution was not accepted as correct, the next iteration was based on the error message from the previous solution. The format for prompts generated from error messages was also straightforward and followed this structure: With input "x" output was "y" but expected result was "z". If the error was a runtime error, the complete error message was used as the prompt to be fed to ChatGPT. Given the time constraints of our work, it was necessary to establish a threshold for the maximum number of iterations allowed for prompt one and prompt two. We determined that a maximum of three iterations would be allocated to each research question. This decision was based on our investigation into the model's behavior when faced with programming problems. During this investigation, we noticed that the majority of problems were typically resolved within the first attempt. However, we also identified a consistent pattern: if the model failed to produce a solution after three consecutive attempts, we deemed the question either unsolvable or the process overly demanding. By setting this threshold, we were able to efficiently allocate our time and resources while examining the limitations of the model and making necessary adjustments to our approach.
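For illustration, the two prompt formats and the error-feedback template can be expressed as simple string templates. The following sketch is ours: the helper names are hypothetical, and only the quoted wording comes from the study.

def prompt_rq1(statement: str, language: str) -> str:
    """Generic prompt used for RQ1 (no performance requirements)."""
    return f"Solve this programming problem with {language}\n\n{statement}"

def prompt_rq2(statement: str, language: str) -> str:
    """Performance-directed prompt used for RQ2."""
    return (f"Solve this programming problem with {language} as runtime "
            f"and memory efficient as possible\n\n{statement}")

def prompt_from_wrong_answer(given_input: str, output: str, expected: str) -> str:
    """Follow-up prompt built from a wrong-answer report."""
    return (f'With input "{given_input}" output was "{output}" '
            f'but expected result was "{expected}"')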
5. Results

In this section, we present the findings of our study, organized and aligned according to the identified research questions.
Fig. 4. Example of prompt one of the Two Sum problem from LeetCode.

Fig. 5. An example of a code solution generated by ChatGPT.

5.1. RQ1: How well can ChatGPT solve general programming problems?

In this research question, we evaluate the capability of ChatGPT in solving various programming problems following the three levels of difficulty defined by LeetCode. To this end, we used success rate to study how successful ChatGPT is across different attempts. We measured the cumulative success rate across each of the three iterative attempts. We found out that 90% of the solutions passed the tests of LeetCode on the first iteration. On the second iteration, the success rate increases to 97.78%. However, by the third iteration, ChatGPT was incapable of solving further programming problems, i.e., the success rate remains the same as in the second iteration, 97.78%. This suggests that while ChatGPT had an initial success rate of 90% on the first iteration, and further solved some problems on the second iteration, increasing the success rate to 97.78%, ChatGPT was unable to solve any of the remaining unsolved problems during the third iteration.

Table 1
RQ1: Success rate per iteration corresponding to each difficulty level (%).

           Easy                      Medium                    Hard
           1st     2nd     3rd       1st     2nd     3rd       1st     2nd     3rd
           96.67   100.00  100.00    96.67   100.00  100.00    76.67   93.33   93.33

In our examination of ChatGPT's problem-solving abilities per level of difficulty, we utilized the difficulty of the programming problems and the corresponding cumulative success rate, as shown in Table 1. The table illustrates the success rate across iterative attempts, segmented by the problem's difficulty level. It is evident that ChatGPT obtains a higher success rate on easy and medium programming problems compared to that obtained on hard problems. For the Easy category, we get 96.67%, 100%, and 100% as the success rate by the first, second, and third attempt, respectively. This is also the result obtained by the Medium category. This demonstrates that ChatGPT obtains a comparable outcome for the first two categories, i.e., Easy and Medium. However, with hard programming problems, ChatGPT suffers a worse performance compared to the previous two categories. In particular, by the first iteration, only 76.67% of the programming problems are solved successfully, and the remaining are not. Even by the second and third attempts, though the overall performance is improved, not all the programming problems get passed, as demonstrated by a success rate of 93.33% for both cases. This essentially means that the level of difficulty of the programming problems poses a challenge for ChatGPT, i.e., it becomes less effective with complex tasks. For Java tasks, we witnessed a similar trend in the obtained success rate; thus, the results are not shown for the sake of clarity.

Answer to RQ1. The experimental results suggest that while ChatGPT performs well with easy and medium programming problems, it encounters difficulties when solving more complicated tasks. In this respect, we conclude that ChatGPT needs to be further enhanced, so as to boost its ability in programming.
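For illustration, the cumulative success rates reported in Table 1 can be derived from per-problem outcomes as in the following sketch. The record layout is an assumption, and the sample values are demonstration data only.

import pandas as pd

# Assumed records: the attempt (1-3) on which each problem was solved,
# or None if all three attempts failed (demo values only)
records = pd.DataFrame({
    "difficulty": ["easy", "easy", "hard", "hard"],
    "accepted_on_attempt": [1, 2, 1, None],
})

# Cumulative success rate (%) by the i-th attempt, per difficulty level
for i in (1, 2, 3):
    solved = records["accepted_on_attempt"] <= i  # None (NaN) compares as False
    print(f"by attempt {i}:")
    print(solved.groupby(records["difficulty"]).mean() * 100)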
Fig. 6. RQ2 : Memory usage for C++ and Java tasks.

5.2. RQ 2 : How do ChatGPT’s solutions compare to those by humans in


terms of runtime and memory usage?

In this research question, we further evaluate the performance of


ChatGPT on programming problems in comparison to human program-
mers, focusing on two performance characteristics, i.e., runtime and
memory usage. The runtime for C++ solutions is quite small, especially
for those that are classified as easy. By most of the easy programming
problems, ChatGPT requires a tiny fraction of time to finish, i.e., less
than 100 ms. When ChatGPT deals with medium programming tasks,
it needs a bit more time to complete, nearly reaching 500 ms in some
extreme cases. With hard programming problems, the execution of
ChatGPT may last for several milliseconds. Apart from some outliers
starting from 1000 ms, and spanning up to 3000 ms, most of the
tasks require almost 500 ms to complete. For Java tasks, we can
see that there is a similar trend in the execution time according to
the level of difficulty, compared to that of tasks written in C++. In
Fig. 7. RQ2 : Percentile runtime.
general, ChatGPT requires more runtime when it comes to dealing with
complicated programming problems.
The memory usage for the tasks written in C++ and Java is depicted
in Fig. 6. To display the results, we make use of violin boxplots, a When the code generated by ChatGPT was submitted to LeetCode,
combination of boxplot and density traces to yield a more informative it was evaluated by LeetCode, which eventually produced a report
indication of the distribution, as well as the magnitude of the den- concerning the percentile of runtime and memory usage, in comparison
sity (Hintze & Nelson, 1998). The memory usage for tasks written in with the code written and submitted by several humans through the
C++ varies according to the level of difficulty, as shown in Fig. 6(a). platform. These metrics are used as a means to compare the overall
In particular, with Easy tasks, ChatGPT consumes a small amount of performance of code written by ChatGPT and that of humans.
Fig. 7 shows the percentile of runtime for the solutions generated
memory, i.e., less than 30MB. With respect to an increase in the level of
by ChatGPT, in comparison with those by humans. By most of the
difficulty, ChatGPT needs additional memory, i.e., up to 100 MB. When
programming tasks, ChatGPT ranks from the 25% up to the 100%
the tasks become harder, the memory usage increases accordingly.
percentile. For easy and medium programming problems, ChatGPT
Looking at Fig. 6(b), we witness a similar trend for the memory with
obtains an encouraging performance with respect to those implemented
Java tasks. With easy programming problems, ChatGPT usually finishes
by human developers. Especially by the medium tasks, by most of the
after using less than 45 MB, and it needs a bit more memory with
solutions, ChatGPT ranks in the 75% percentile, i.e., compared to code
medium tasks, i.e., ranging from 45 MB to 60 MB. Interestingly, for by humans. Looking at the average scores, we can see that ChatGPT
more difficult tasks, the runtime is not increased too much, i.e., apart is ranked in the 58th, 59th, and 56th for easy, medium, and hard
from outliers, most of the tasks require less than 60MB of memory. problems, respectively.
The results in Fig. 6 show that ChatGPT is faster and more memory With respect to the Java programming problems, as shown in
efficient when running on Java tasks, compared to running on C++ Fig. 7(b), ChatGPT yields a promising performance with easy tasks, i.e.,
tasks. We assume that this happens possibly due to the fact that most of the solutions produced by ChatGPT reside in the upper part of
ChatGPT has been trained with several Java snippets, which have been the corresponding diagram, close to the 100% percentile. However, by
optimized to save execution time and memory. This, however, is a pure medium and hard tasks, the percentile by the solutions of ChatGPT is
assumption, and in order to reach a solid conclusion we need empirical considerably low compared to that for solutions written by humans, i.e.,
evidence, which can only be obtained from additional evaluations. This most of the points are distributed across a low threshold. These rankings
is out of scope of this paper, and we consider the issue as our future suggest that with respect to timing efficiency, the code generated by ChatGPT
work. does not earn a good merit compared to that by real developers.

6
A. Bucaioni et al. Machine Learning with Applications 15 (2024) 100526

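Violin plots like those in Fig. 6 can be produced with seaborn; the following minimal sketch assumes a hypothetical CSV layout for the collected measurements.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed columns: language (C++ or Java), difficulty, memory_mb
df = pd.read_csv("results.csv")

# A violin plot combines a boxplot with a density trace (Hintze & Nelson, 1998)
sns.violinplot(data=df, x="difficulty", y="memory_mb", hue="language")
plt.ylabel("Memory usage (MB)")
plt.show()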
Fig. 8. RQ2: Percentile memory.

In Fig. 8, we show the distribution of percentiles of memory for the solutions generated by ChatGPT, in comparison with those implemented by humans. For C++ tasks, ChatGPT's percentiles are distributed in a wide range of values, from 10% to 100%, with most of the instances residing between the 60% and 87% thresholds. When considering the mean values, ChatGPT is ranked in the 58th percentile for easy problems, the 47th percentile for medium problems, and the 56th percentile for hard problems. These results suggest that ChatGPT obtains a comparable outcome with respect to the majority of human programmers on easy and medium problems. However, when it comes to more complex problems, ChatGPT does not get a better result compared to a significant proportion of human programmers.

When we compared the results of RQ1 and RQ2, we made an interesting observation regarding the impact of context information on ChatGPT's performance in terms of runtime and memory utilization. ChatGPT's overall performance in terms of runtime and memory utilization was negatively affected when using prompt two. However, despite the decrease in performance metrics, we observed that adding more contextual information actually improved ChatGPT's ability to solve a problem on the first iteration.

Answer to RQ2. Despite getting an encouraging performance with respect to timing and memory usage, ChatGPT is far from optimal compared to human developers in the submitted solutions for programming problems written in C++ and Java.

6. Discussion

The primary objective of our work was to investigate the potential of ChatGPT as a replacement for human programmers in programming tasks. This section discusses the implications and the threats to validity of our findings.

6.1. Implications

While the metrics we selected offer valuable insights into the capabilities of ChatGPT in programming tasks, code quality encompasses various aspects that extend beyond the metrics we focused on. Aspects such as readability and scalability are also crucial for assessing the quality of code. Although our study has limitations in capturing these aspects, it provides valuable insights into ChatGPT's ability to generate accurate and efficient solutions to a diverse range of programming problems.

The conclusion drawn from these results may seem counterintuitive at first. It suggests that when prioritizing performance, providing a simple prompt to ChatGPT may be more effective. However, if the primary goal is to increase the chances of solving the programming problem, providing a more detailed and explicit prompt is recommended. This implies that there is a trade-off between performance and problem-solving capabilities. The relation between ChatGPT's performance and more contextual information raises important questions about how to effectively prompt ChatGPT and prioritize the user's needs. In real-world programming scenarios, developers must consider multiple factors when determining the best solution, and the ability of ChatGPT to handle such complexities becomes crucial.

A possible extension is to investigate the ability of ChatGPT on code summarization tasks, i.e., explaining a program first by reading the source code, and then giving a summary written in natural language. This has been recently studied by Sun et al. in their research (Sun et al., 2023), and we suppose that further attention should be paid in order to explore how ChatGPT can summarize code written in different languages.

It is worth remarking that our study focuses on small programming problems, and does not consider the performance of ChatGPT on larger and more comprehensive programming problems. The question of how ChatGPT performs with such problems is indeed interesting and warrants further investigation. We expect that larger and more complex problems may pose challenges for ChatGPT's performance. Future research could delve into the performance of ChatGPT on larger problems, explore methods to improve its performance in more complex scenarios, and investigate how developers can effectively leverage ChatGPT while considering various real-world constraints and requirements.

Our study suggests that ChatGPT has the potential to partially replace humans in programming tasks. For instance, ChatGPT can be valuable as a programming assistant, particularly in automating repetitive tasks and providing assistance with simpler problems. However, when it comes to more challenging and complex programming problems frequently encountered in real-world scenarios, our study suggests that ChatGPT's performance is still limited. Indeed, as AI technologies, including chatbots like ChatGPT, continue to evolve rapidly, their impact on software engineering and other fields is expected to grow. However, along with their potential benefits, the increasing use of AI raises ethical concerns that must be carefully considered.

Considering the time limit and the nature of our process, alternative choices could have yielded different results. If we had opted for a larger number of iterations for each research question, it might have allowed the model to find a better solution, potentially leading to a higher success rate in solving the programming problems. However, this would have required a greater investment of time and resources. On the other hand, if we had set a lower maximum number of iterations, the model's ability to tackle challenging problems could have been compromised, potentially limiting the insights we could gather regarding its limitations. Moreover, our decision to establish the threshold based on the observation that most problems were resolved within the initial attempt meant that we prioritized efficiency and time management.

It is evident that different choices in terms of the maximum number of iterations and prompt format could have influenced the outcomes and implications of our study. The data from our study suggest that the majority of the problems were resolved by the second iteration, and subsequent iterations did not exhibit a significantly higher success rate. Employing prompts that were more specific and explanatory could have yielded different outcomes, not necessarily better ones.

The evaluation shows that ChatGPT performs well with easy and medium tasks; however, when the exercises are at the hard level (as classified by LeetCode), it encounters some difficulties in fulfilling the tasks, and thus it does not pass the tests. This implies that ChatGPT has some limitations with respect to hard tasks. Such limitations happen possibly due to the training, i.e., ChatGPT might not have been well trained with this type of task. Moreover, we assume that if ChatGPT had been trained with data from LeetCode, then the code generated by ChatGPT would be successful when being tested with LeetCode. In
our work, we did not attempt to investigate ChatGPT's superiority over other systems or methods. Instead, the ultimate aim is to provide a reproducible assessment of ChatGPT's code generation capabilities.

ChatGPT has been built on top of GPT (Generative Pre-trained Transformer), which traces its roots back to the Transformer encoder–decoder architecture, consisting of two sides, one for encoding the input data and the other for decoding the output data. In this way, we can think of querying ChatGPT to get source code as putting a sentence written in natural language on one side (the encoder) and getting the source code on the other (the decoder). The main focus of our work is to evaluate ChatGPT's code generation capabilities. Thus, we conducted a practical and empirical assessment of its utility as a code generation tool, rather than an exhaustive analysis of ChatGPT's underlying architecture and reasoning mechanisms. A comprehensive examination of its architecture and reasoning mechanisms could reveal additional insights into its strengths and weaknesses, especially in the context of general programming problems. This deserves an independent investigation, which can be carried out in another paper.

6.2. Threats to validity

In this section, we discuss the different types of validity threats that could affect the results of our study, along with the strategies adopted to mitigate them.

• Threats to external validity refer to the factors that could impact the generalization of the findings of an experiment to real-world applications (Wohlin et al., 2012). In our study, one potential threat may be related to the chosen programming languages and to the problem set collected from LeetCode, i.e., tasks related to C++ and Java, two popular programming languages. For future work, we plan to extend our experiments with other languages, including Python. For the evaluation, we had a dataset of 240 programming exercises, and such a number might not be large enough to be generalizable. In fact, curating a dataset for the experiments by involving both ChatGPT and LeetCode is a strenuous process, requiring a lot of time and effort, as the samples first were collected from ChatGPT and then uploaded to LeetCode to run further measurements. Altogether, this is a prolonged procedure, preventing us from increasing the number of code samples. To mitigate this threat, we attempted to cover the three main levels of difficulty, i.e., Easy, Medium, and Hard. Moreover, the tasks spread over nine different categories in programming: 'Array', 'Dynamic Programming', 'Hash Table', 'Queue Stack', 'Binary Search', 'Greedy', 'Math', 'Sorting', and 'String'.
• Threats to internal validity concern the factors that can potentially affect the causal relationship between the independent variable and the outcome. In the experiments, we evaluated the solutions by ChatGPT and humans using the same metrics. This is to make sure that we performed a fair comparison.
• Threats to conclusion validity are factors that could impact the ability to correctly draw conclusions about the relationship between the treatment and the outcome of an experiment. The chosen metrics may not capture all aspects of code, such as readability or scalability. Eventually, the data from LeetCode may not be representative of the overall performance of human programmers, which could also limit the validity of our conclusions.
• Threats to construct validity are related to the experimental design, as well as to social factors. In our study, one potential threat is that the programming tasks given to ChatGPT may not fully represent the complexity of tasks typically performed by human programmers, which could affect the validity of our evaluation. Additionally, the competition among participants may impact the performance of human programmers on the LeetCode programming tasks. As ChatGPT learns from previous prompts within a conversation, this could also impact the construct validity of our study.

7. Related work

To review related work, we conducted a systematic literature review. We started with an automatic search in software engineering (Kitchenham & Brereton, 2013; Petersen et al., 2008) on the following four scientific databases and indexing systems: IEEE Xplore Digital Library,7 ACM Digital Library,8 Scopus,9 and Web of Science.10 Aiming at a reasonable trade-off between efficiency and the coverage of state-of-the-art studies, we adhered to the existing guidelines for systematic literature studies in software engineering. We queried the aforementioned databases using a concise and well-constructed search string, collecting as many studies as possible. Given the novelty of the topic, the following query string was utilized:

"ChatGPT" AND "programming"

The initial search produced a set of 27 peer-reviewed publications.11 From this set, we removed impurities and duplicates and obtained a new set of 22 publications. We used the guidelines by Ali and Petersen (2014) to define the following selection criteria for filtering the primary studies.

• Inclusion criteria. We considered studies that are: (i) subject to peer review; (ii) written in English; (iii) available as full-text; and (iv) focusing on ChatGPT and programming.
• Exclusion criteria. To ensure the inclusion of as many relevant studies as possible and minimize threats to validity, we excluded studies that are shorter than 4 pages.

7 https://ieeexplore.ieee.org/Xplore/home.jsp
8 https://dl.acm.org/
9 https://www.scopus.com/
10 http://webofscience.com/
11 It is important to remark that we performed the automatic search in June 2023.

Apart from the studies obtained by conducting the literature review, we also took into account some additional papers that are relevant to the topic under consideration, and eventually got a set of relevant papers to be reviewed as follows.

Jacques (2023) investigated the potential of using ChatGPT to help instructors in enhancing the critical thinking skills of novice programmers. Inspired by the historical developments in mathematical education, the author hypothesized that fluency and basic skills in programming are important regardless of the availability of advanced tools. Based on this hypothesis, Jacques proposed various tasks aimed at enhancing the critical thinking skills of novice programmers. One task involved instructing novices to create flowcharts that depict the logical structure of programs generated by ChatGPT. In addition, the author suggested a task where learners were prompted to extend or modify a program that was generated by ChatGPT. While the work utilized ChatGPT to generate programs, the primary focus was not on assessing the correctness, runtime performance, or memory usage of the generated solutions.

Similar to Jacques (2023), Kazemitabaar et al. tried to understand whether novice developers were able to understand the code generated by ChatGPT and to modify or extend the generated code (Kazemitabaar et al., 2023). In addition, the authors also studied whether using such tools would form a reliance, or help learners write code without such tools being present. To investigate these aspects, they developed a web-based application called Coding Steps. This application was specifically designed to facilitate the learning of basic Python programming. Coding Steps provides learners with a progressive set of programming tasks that introduce new concepts gradually. It offers a submission functionality that allows learners to submit their code to remote instructors for grading and feedback, which is provided through a dedicated
grading dashboard. Additionally, Coding Steps incorporates code generation capabilities using the code-davinci-002 model from OpenAI's code completion API. This feature generates code to assist learners in completing their programming tasks. The study's results demonstrated that ChatGPT can be a valuable asset for computer science educators and students. Novice programmers who used the Coding Steps environment exhibited improved performance, increased efficiency, and reduced frustration when writing code. Importantly, the use of Coding Steps and ChatGPT did not negatively impact their ability to manually modify code or perform tasks without its support.

Yilmaz et al. investigated the effect of using ChatGPT on students' computational thinking skills, programming self-efficacy, and motivation towards the lesson (Yilmaz & Karaoglan Yilmaz, 2023). They conducted the research on 45 undergraduate students who took a university-level programming course, using an experimental design with a pretest–posttest control group. Students were randomly divided into the experimental group, which could use ChatGPT during the weekly programming practices, and the control group, which could not use it. Their findings revealed that the experimental group students' computational thinking skills, programming self-efficacy, and motivation for the lesson were significantly higher than those of the control group students. Hence, it can be said that ChatGPT may be useful in programming training.

In a broader context, Lertbanjongngam et al. compared the performance of the code generation tool AlphaCode12 to that of human programmers (Lertbanjongngam et al., 2022). The study discovered that AlphaCode was capable of generating code that was often similar to human-generated code, and that performed equally or worse in terms of execution time and memory utilization. While the authors measured the execution time themselves to evaluate solutions (Lertbanjongngam et al., 2022), we relied on the publicly available data provided by the LeetCode platform, which serves as a reliable benchmark for assessing the accuracy and efficiency of the generated code. Additionally, the authors used a popular competitive programming platform, namely Codeforces,13 to retrieve code produced by humans. Similar to Lertbanjongngam et al. (2022), we compared the performance of AI-generated code to that of human programmers. However, we did not focus on competitive programming, which is an important distinction. Competitive programming platforms like Codeforces and technical interview preparation platforms like LeetCode serve different purposes and require different skills and approaches. For example, Codeforces places more emphasis on time and memory limits (Mirzayanov, 2012), whereas LeetCode focuses on optimizing solutions and improving performance relative to other users.

12 https://alphacode.deepmind.com/
13 https://codeforces.com/

Finnie-Ansley et al. investigated the accuracy of an AI model designed for code writing named Codex on a range of first-year college-level programming problems (Finnie-Ansley et al., 2022). Their study found that Codex outperformed most first-year programming students by ranking in the top quartile of a typical first-year programming exam. Their study evaluated the accuracy of Codex on 23 basic computer science programming problems, where the authors presented the assignments exactly as they were given to the students. They also assessed Codex through variations of the problem wording for the Rainfall Problem. Unlike Finnie-Ansley et al. (2022), we prompted ChatGPT with programming problems of varying difficulty levels and a custom-designed question, and compared ChatGPT's performance to public data available on LeetCode, rather than using student-generated data. Moreover, we used predetermined prompts, whereas Finnie-Ansley et al. used randomly varied problem wordings (Finnie-Ansley et al., 2022).

8. Conclusions and future work

In this paper, we have evaluated the potential of ChatGPT in replacing humans as programmers. To achieve this, we have started with a systematic literature review to gain an understanding of the current state of the art in the use of ChatGPT for programming tasks. Then we have designed and conducted experiments to assess the extent to which ChatGPT is capable of solving general programming problems. The objective is to assess ChatGPT's capabilities in two different programming languages, namely C++ and Java. To evaluate the quality of the generated code, we leveraged LeetCode's automated submission system, which assessed and provided feedback on the code correctness and performance with respect to runtime and memory usage.

Future work may encompass different directions. One direction may investigate the impact of dedicated training and its potential to enhance performance. Building upon the insights gained from our experiment, we observed that a more detailed prompt yielded more effective results in terms of finding solutions. We suppose that the architecture, the reasoning mechanisms, as well as the training of ChatGPT have a certain effect on general programming problems. This, however, needs to be investigated in a separate paper with concrete empirical evidence, and thus we consider it as future work. A further direction is to explore the use of programming languages other than C++ and Java, and of a larger and more complex dataset of problems, to gain a more comprehensive understanding of ChatGPT's capabilities. Finally, in our future work, we will also explore additional code quality metrics beyond runtime and memory usage, as well as revisit the experiment with a focus on utilizing different parameters to assess code quality.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The work in this paper has been supported by the Swedish Knowledge Foundation (KKS) through the Modev project, by the Excellence in Production Research (XPRES) Framework, and by the Swedish Governmental Agency for Innovation Systems (VINNOVA) through the iSecure project. The authors would like to thank the anonymous reviewers for their valuable comments.

References

Aker, A., Ceausu, A., Feng, Y., Gaizauskas, R. J., Hunsicker, S., Ion, R., Irimia, E., Stefanescu, D., & Tufis, D. (2019). Mapping and aligning units from comparable corpora. In I. Skadina, R. J. Gaizauskas, B. Babych, N. Ljubesic, D. Tufis, & A. Vasiljevs (Eds.), Using comparable corpora for under-resourced areas of machine translation (pp. 141–188). Springer. http://dx.doi.org/10.1007/978-3-319-99004-0_5.

Ali, N. B., & Petersen, K. (2014). Evaluating strategies for study selection in systematic literature studies. In Proceedings of the 8th ACM/IEEE international symposium on empirical software engineering and measurement (pp. 1–4).

Arnold, D., Balkan, L., Humphreys, R., Meijer, S., & Sadler, L. (1994). Machine translation: An introductory guide. London: NCC Blackwell. URL http://www.essex.ac.uk/linguistics/external/clmt/MTbook/PostScript/.

Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33.

Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., & Prather, J. (2022). The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. In Proceedings of the 24th Australasian Computing Education Conference (ACE '22) (pp. 10–19). New York, NY, USA: Association for Computing Machinery.
Fisher, M. J., & Marshall, A. P. (2009). Understanding descriptive statistics. Australian Critical Care, 22(2), 93–97.

Fuscaldo, D. (2023). How chatbots can help grow your small business. URL https://www.businessnewsdaily.com/16018-chatbots-for-growth.html.

Gillioz, A., Casas, J., Mugellini, E., & Khaled, O. A. (2020). Overview of the Transformer-based models for NLP tasks. In 2020 15th Conference on Computer Science and Information Systems (FedCSIS) (pp. 179–183).

Hintze, J. L., & Nelson, R. D. (1998). Violin plots: A box plot-density trace synergism. The American Statistician, 52(2), 181–184. http://dx.doi.org/10.1080/00031305.1998.10480559.

Israelsen, A. (2023). How to use ChatGPT to write code. URL https://www.pluralsight.com/blog/software-development/how-use-chatgpt-programming-coding.

Jacques, L. (2023). Teaching CS-101 at the dawn of ChatGPT. ACM Inroads, 14(2), 40–46. http://dx.doi.org/10.1145/3595634.

Kazemitabaar, M., Chow, J., Ma, C. K. T., Ericson, B. J., Weintrop, D., & Grossman, T. (2023). Studying the effect of AI code generators on supporting novice learners in introductory programming. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/3544548.3580919.

Kitchenham, B., & Brereton, P. (2013). A systematic review of systematic review process research in software engineering. Information and Software Technology, 55(12), 2049–2075. http://dx.doi.org/10.1016/j.infsof.2013.07.010.

Lertbanjongngam, S., Chinthanet, B., Ishio, T., Kula, R. G., Leelaprute, P., Manaskasemsak, B., Rungsawang, A., & Matsumoto, K. (2022). An empirical evaluation of competitive programming AI: A case study of AlphaCode. arXiv preprint arXiv:2208.08603.

Littman, M. L., Ajunwa, I., Berger, G., Boutilier, C., Currie, M., Doshi-Velez, F., Hadfield, G., Horowitz, M. C., Isbell, C., Kitano, H., Levy, K., Lyons, T., Mitchell, M., Shah, J., Sloman, S., Vallor, S., & Walsh, T. (2021). Gathering strength, gathering storms: The one hundred year study on artificial intelligence (AI100) 2021 study. Technical report, Stanford University. URL https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq2-what-are-most-important-advances-ai.

Marr, B. (2023). How ChatGPT and natural language technology might affect your job if you are a computer programmer. URL http://bit.ly/44fpw95.

Mirzayanov, M. (2012). Codeforces contest rules. URL https://codeforces.com/blog/entry/4088.

Müller, V. C. (2021). Ethics of artificial intelligence and robotics. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2021 ed.). Metaphysics Research Lab, Stanford University.

Nadkarni, P. M., Ohno-Machado, L., & Chapman, W. W. (2011). Natural language processing: An introduction. Journal of the American Medical Informatics Association, 18(5), 544–551.

Natural Language Processing (2023). URL https://www.ibm.com/topics/natural-language-processing.

OpenAI (2022). ChatGPT. URL https://openai.com/blog/chatgpt.

OpenAI (2023). GPT-4 research. URL https://openai.com/research/gpt-4.

Petersen, K., Feldt, R., Mujtaba, S., & Mattsson, M. (2008). Systematic mapping studies in software engineering. In G. Visaggio, M. T. Baldassarre, S. G. Linkman, & M. Turner (Eds.), 12th International Conference on Evaluation and Assessment in Software Engineering, Workshops in Computing. BCS. URL http://ewic.bcs.org/content/ConWebDoc/19543.

Phillips, T. (2022). Is there a shortage of developers? Developer shortage statistics in 2022. URL https://codesubmit.io/blog/shortage-of-developers/.

Sein, M. K., Henfridsson, O., Purao, S., Rossi, M., & Lindgren, R. (2011). Action design research. MIS Quarterly, 37–56.

Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., Vats, I., Moazen, H., & Sarro, F. (2022). A survey on machine learning techniques for source code analysis. arXiv preprint arXiv:2110.09610.

Sun, W., Fang, C., You, Y., Miao, Y., Liu, Y., Li, Y., Deng, G., Huang, S., Chen, Y., Zhang, Q., Qian, H., Liu, Y., & Chen, Z. (2023). Automatic code summarization via ChatGPT: How far are we?

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in software engineering (1st ed.). Berlin, Heidelberg: Springer. http://dx.doi.org/10.1007/978-3-642-29044-2.

Yilmaz, R., & Karaoglan Yilmaz, F. G. (2023). The effect of generative artificial intelligence (AI)-based tool use on students' computational thinking skills, programming self-efficacy and motivation. Computers and Education: Artificial Intelligence, 4, Article 100147. http://dx.doi.org/10.1016/j.caeai.2023.100147.
