Article
Computer Science Education in ChatGPT Era: Experiences
from an Experiment in a Programming Course for
Novice Programmers
Tomaž Kosar 1, Dragana Ostojić 1, Yu David Liu 2 and Marjan Mernik 1,*
1 Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška cesta 46,
2000 Maribor, Slovenia; [email protected] (T.K.); [email protected] (D.O.)
2 Department of Computer Science, State University of New York at Binghamton (SUNY), 4400 Vestal Parkway
East, Binghamton, NY 13902, USA; [email protected]
* Correspondence: [email protected]
Abstract: The use of large language models with chatbots like ChatGPT has become increasingly
popular among students, especially in Computer Science education. However, significant debates
exist in the education community on the role of ChatGPT in learning. Therefore, it is critical to understand the potential impact of ChatGPT on the learning, engagement, and overall success of students in classrooms. In this empirical study, we report on a controlled experiment with 182 participants in
a first-year undergraduate course on object-oriented programming. Our differential study divided
students into two groups, one using ChatGPT and the other not using it for practical programming
assignments. The study results showed that the students’ performance was not influenced by ChatGPT usage (no statistically significant difference between the groups, with a p-value of 0.730), nor were the grading results of practical assignments (p-value 0.760) and midterm exams (p-value 0.856). Our findings from
the controlled experiment suggest that it is safe for novice programmers to use ChatGPT if specific
measures and adjustments are adopted in the education process.
interactive and engaging learning experience, and increase their interest and motivation [8,9].
Computer Science students can ask questions about programming code and receive immediate answers, making learning programming more efficient [10]. Additionally, ChatGPT can
generate several different examples to explain complex programming concepts [7].
On the flip side, some believe its use is risky [11,12]. An LLM is limited by the knowledge it was trained on [7], and may therefore answer complex questions inaccurately [13]. Furthermore, code debugging and interpretation require a deep understanding of the code under consideration. An educator can provide a step-by-step explanation of the code, while current LLMs are still limited in this respect, as shown in [14]. Another serious drawback of using ChatGPT is that it could discourage students from developing skills, e.g., reasoning [15]. If students rely excessively on ChatGPT to provide programming code, they may not develop the skills required to solve problems on their own. Excessive use and cheating are some of Computer Science educators’ major concerns regarding ChatGPT usage, especially for novice programmers (first-year students). In a nutshell, over-reliant students may fail to develop important skills, such as critical thinking, creativity, decision-making [16], and problem solving, one of the essential capabilities for software developers [17]. Some universities even decided to block access to the ChatGPT website on school grounds [18].
However, LLM technologies are probably here to stay. We believe that, rather than
avoiding these technologies, we need to embrace LLMs and modernize education [19].
To understand how LLMs and ChatGPT influence the learning process [16], there is a
need for experimental studies [10,20–23]. We have to test common beliefs empirically and rigorously, such as the belief that students will use LLMs without hesitation for plagiarism [24], or that using LLMs will negatively affect their critical thinking and problem-solving skills [16]. In this paper, we report our experience in ChatGPT-assisted
learning in Programming II, a course in the second semester of the first year of the Computer
Science and Information Technologies undergraduate program at the University of Maribor,
Slovenia. Our experiment was motivated by the following questions:
• Does the use of ChatGPT affect performance on practical assignments and midterm
exam results?
• Does the use of ChatGPT affect the overall student performance in the introductory
programming course?
• What impact does ChatGPT usage have on the course final grade?
• For what purpose did students use ChatGPT during the course on Programming II?
• Is ChatGPT useful for learning programming at all, according to students’ opinions?
In this context, we performed a controlled experiment [25] using ChatGPT for practical
assignments in the first-year undergraduate study of Computer Science. We formed two
groups, one using ChatGPT and the other not using it for practical assignments. Several
adjustments were made for the execution of this year’s introductory course on object-
oriented programming.
Our results from the controlled experiment show that ChatGPT usage influenced neither overall course performance nor the results on practical assignments and midterm exams. We believe a main contributor to this outcome is the set of adjustments we made to the course in (1) constructing assignments, (2) defending assignments, and (3) midterm exams. Those actions encouraged participants not to rely solely on the
use of ChatGPT. For example, all our assignments were designed carefully to minimize the
chance of ChatGPT answering the questions directly. As another highlight, we introduced an
evaluation process, where assignment grading was not based solely on the code submitted
to the original assignment questions; instead, grades were given in the lab session based on
an extended version of the assignment, through an interactive defense process involving the
students and the teaching assistants. Overall, we believe ChatGPT should be incorporated into
future education, and it must be embraced with adjustments in course evaluation to promote learning.
The rest of the paper is organized as follows: Section 2 discusses the background on ChatGPT, Section 3 describes related work, and Section 4 presents the
experiment design. Section 5 presents the results and data analysis. Section 6 discusses the threats to the validity of our controlled experiment, and, lastly, Section 7 summarizes the key findings from our empirical study.
2. Background
LLMs [2] represent a transformative technology in the field of natural language processing (NLP) [26], bringing new linguistic capabilities and opportunities for diverse applications. These models are designed with vast neural architectures and supported by extensive training data [27]. LLMs empower applications to understand, generate, and manipulate human languages in ways that were previously impossible. The main feature of LLMs is that they generate text similar to human language. One of the most well-known LLMs is the Generative Pre-trained Transformer (GPT-3) [28], based on the transformer architecture [29], which improved NLP significantly.
Chatbots [30] are computer programs designed to simulate conversations in text (or
voice [31]) over the Internet. They are programmed to understand natural languages and
respond to user questions in a way that imitates human-to-human conversations. Chatbots
can be categorized into two main types: rule-based and machine learning (sometimes
referred to as AI-powered) chatbots [30]. Rule-based chatbots operate on predefined rules.
They follow instructions and can provide responses based on specific keywords or phrases.
Rule-based chatbots are limited in their capabilities, and may struggle with complex
questions. On the other hand, AI-powered chatbots use advanced technologies, such as
LLMs. They are capable of understanding context and learning from interactions and
responses. Both types are often used in applications such as customer support, healthcare,
information retrieval assistance, virtual assistance, education, marketing, etc.
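To make this distinction concrete, the following minimal sketch (our own illustration, not taken from the cited literature) contrasts a rule-based bot keyed to predefined keywords with an LLM-backed bot that passes the whole conversation context to a model; llm_complete() is a hypothetical stand-in for a real LLM call, stubbed here so the example runs.

```python
# A minimal sketch contrasting the two chatbot types; llm_complete() is a
# hypothetical stand-in for a real LLM API call, stubbed here for runnability.
def rule_based_reply(message: str) -> str:
    """Rule-based chatbot: replies keyed to predefined keywords."""
    rules = {
        "hours": "Our support desk is open 9:00-17:00 CET.",
        "price": "Pricing information is listed on our website.",
    }
    for keyword, reply in rules.items():
        if keyword in message.lower():
            return reply
    return "Sorry, I did not understand. Could you rephrase?"  # fallback

def llm_complete(prompt: str) -> str:
    """Stub standing in for an LLM call (e.g., an API request)."""
    return "(context-aware, LLM-generated answer would appear here)"

def llm_reply(message: str, history: list[str]) -> str:
    """AI-powered chatbot: the whole conversation is sent to the model,
    so the reply can take earlier turns into account (context awareness)."""
    prompt = "\n".join(history + [f"User: {message}", "Assistant:"])
    return llm_complete(prompt)

print(rule_based_reply("What are your opening hours?"))
print(llm_reply("And on weekends?", ["User: What are your opening hours?"]))
```

The rule-based bot can only match keywords and falls back to a canned apology, while the LLM-backed bot receives the full conversation history, which is what enables the context awareness described above.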
Chatbots, empowered by LLMs, represent a significant milestone in the evolution
of human–computer interaction. These intelligent agents have gone beyond traditional
chatbots to engage users in natural, context-aware conversations. LLM-powered chatbots
have an understanding of linguistic variations, making interactions feel more human-
like and personalized. One such system is ChatGPT [3]. ChatGPT is the most popular
chatbot supported by the LLM GPT-3, developed by OpenAI [4] and available publicly. It
is proficient in mimicking human-like communication with the users. GPT-3 models are
trained on extensive text data (approximately 175 billion trainable parameters and 570 GB
of text [32]). During our experiment (from February to June 2023), we used ChatGPT with GPT-3.5; although GPT-4 was already available (from March 2023), it was not free to use.
Prompts [33] refer to the input provided to the chatbot to generate responses. Prompts
are the human instructions or questions users provide while interacting with the chatbot.
There are different types of prompts: text-based, voice-based, task-driven, informational,
conversational, and programming prompts. In the latter, programmers can send specific programming prompts, including code snippets, and chatbots can respond in context using this input. Hence, programmers (and other users) can modify and fine-tune prompts through a process called prompt engineering, which better instructs the LLM to provide more accurate and complex solutions. In this process, programmers can use
prompt patterns [34], which are similar to software patterns, reusable prompts to solve
common problems in LLM interaction. One such prompt pattern is the domain-specific
language (DSL) [35,36] creation pattern.
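As an illustration of such refinement, the sketch below (a hypothetical example of ours, not taken from the study or from the prompt pattern catalog [34]) contrasts a naive programming prompt with an engineered one that states a role, the language, the task context, and the expected output format.

```python
# A hypothetical illustration of prompt engineering for a programming prompt;
# the code snippet and both prompt texts are our own examples.
code_snippet = """int sum = 0;
for (int i = 0; i <= n; i++) sum += arr[i];  // suspected off-by-one bug
"""

# Naive prompt: little context, likely to produce a generic answer.
naive_prompt = f"Fix this code:\n{code_snippet}"

# Engineered prompt: role, language, intent, and output format are explicit.
engineered_prompt = (
    "You are a C++ tutor helping a novice programmer.\n"
    "The loop below should sum the first n elements of arr:\n"
    f"{code_snippet}"
    "Explain the bug in one sentence, then show the corrected loop."
)

print(engineered_prompt)
```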
3. Related Work
The recent popularity of ChatGPT has brought much attention to its benefits (e.g.,
AI pair programming) or drawbacks (e.g., cheating) in different fields, as well as what
impact that chatbot has on higher education [12] in general. Studies on its capabilities and limitations emerged almost as soon as ChatGPT was publicly released [37]. Our study contributes to this field, and we summarize these studies in this section.
One of the most closely related empirical studies involving ChatGPT in learning programming is reported by [23]. Similar to our study, theirs involved undergraduate students
code completion (given code), and code analysis (given inputs/outputs, etc.). Their study
also differed from ours in topics (embedded systems vs. object-oriented programming),
participant experience (senior vs. novice), and experience duration (four quizzes vs. a
whole course). The findings from [23] concluded that the ChatGPT group performed
better answering code analysis and theoretical questions but faced problems with code
completion and questions that involved images. In writing complete code, the results were inconsistent. The author concluded that ChatGPT alone is currently insufficient for Computer Engineering programs, and that learning the related topics is still essential.
An interesting empirical study from mathematics education [19] explored the potential impact of ChatGPT on their students. The study also focused on skills essential for Computer Science students—how the use of ChatGPT can affect critical thinking, problem-solving, and group work skills. After the assignments, participants rated the effect on these three skills on a five-point Likert scale (from one, “no effect”, to five, “it will affect a lot”). The average results on critical thinking (2.38), problem-solving (2.39), and group work (2.97) indicate that participants perceive ChatGPT as having a small-to-moderate effect on the acquisition of the aforementioned skills. Group work appears to be the most affected skill. It would
be interesting to see the same results for Computer Science students. It might be an exciting
set of feedback questions for the replication study [38]. Instead of conducting a feedback
study, assessment instruments can validate students’ problem-solving skills [39] in, for
example, object-oriented programming (OOP). The research findings of this study conclude
that the integration of ChatGPT into education introduces new challenges, necessitating
the adjustment of teaching strategies and methodologies to develop critical skills among
engineers [19]. We followed the advice and adjusted practical assignments in our course,
Programming II.
which was asked not to use ChatGPT. An additional measure we agreed upon was to
remove students from Group II who reported using ChatGPT for practical assignments in
the feedback questionnaire, as this would compromise the results of this group.
Practical Assignments

| Week | Topic | Mandatory | Optional |
|------|-------|-----------|----------|
| 1 | Programming I repetition | Fuel Consumption | Disarium number |
| 2 | Basic classes | Exercise | Fuel Log |
| 3 | Class variables and methods | Time | Text Utility |
| 4 | Aggregation and composition | Exercise Tracker | Mail Box |
| 5 | Inheritance | Strength Exercise | Bank |
| 6 | Midterm exam I | | |
| 7 | Abstract class | Graph | Graphic Layout |
| 8 | Template function | Vector Util | Vector Util |
| 9 | Template class | Linear Queue | Linked List |
| 10 | Additional help | | |
| 11 | Operator Overloading | Smart Pointer | Smart Pointer |
| 12 | C++11 and C++14 | Exercise Tracker | Printer |
| 13 | Exceptions, File streams | Sensor Hub | Log |
| 14 | Final practical assignments’ defense | | |
| 15 | Midterm exam II | | |
• Description
Some assignments were provided with minimal text; supplemental information was given in a UML diagram.
• Input/output
The student receives the exact input of their program and/or the exact output the program must produce, and needs to follow these specifications precisely (a small hypothetical sketch follows this list).
• Main
For some assignments, participants receive the main program together with the assignment description.
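For illustration, a minimal sketch of what such an input/output specification might look like is given below; the task and output format are hypothetical examples of ours (the course itself used C++, and Python is used here purely for brevity).

```python
# Hypothetical "Input/output" style assignment: read two integers from
# standard input, one per line, and print exactly "SUM: <a+b>"; the output
# format must be matched verbatim, as described above.
import sys

def main() -> None:
    a = int(sys.stdin.readline())
    b = int(sys.stdin.readline())
    print(f"SUM: {a + b}")  # exact output format prescribed by the assignment

if __name__ == "__main__":
    main()
```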
comprehended the assignments in Programming II. The second part of the questionnaire focused on ChatGPT (consistency of usage/non-usage, purpose of use, etc.). In this paper, we report on a subset of statistics from the feedback questionnaire most relevant to understanding the main study’s results. The complete set of questions and answers of our
background and feedback questionnaires are available at https://github.com/tomazkosar/DifferentialStudyChatGPT (accessed on 12 January 2024).
4.5. Hypotheses
Our experiment was aimed at confirming or rejecting three hypotheses: one on midterm exams, one on lab work, and one on overall results, each stated in a null and an alternative form, giving six statements:
• H1null There is no significant difference in the lab work scores of participants using ChatGPT and those not using it.
• H1alt There is a significant difference in the lab work scores of participants using ChatGPT and those not using it.
• H2null There is no significant difference in the midterm exam results of participants using ChatGPT for lab work and those not using it.
• H2alt There is a significant difference in the midterm exam results of participants using ChatGPT for lab work and those not using it.
• H3null There is no significant difference in the final grades of participants using ChatGPT for lab work and those not using it.
• H3alt There is a significant difference in the final grades of participants using ChatGPT for lab work and those not using it.
These hypotheses were tested statistically, and the results are presented in the next section.
5. Results
This section compares the participants’ performance in Programming II in the ChatGPT treatment group (Group I) vs. the control group without ChatGPT (Group II). To understand the outcome of our controlled experiment, this section also presents a study of the background and feedback questionnaires. The results of the feedback study affected the number of participants in the groups: as explained in the feedback subsection, we eliminated eight students from Group II due to their usage of ChatGPT. Their inclusion would have affected the results and represented a threat to the validity of our study.
All the observations were tested statistically with α = 0.05 as the threshold for judging significance [45]. The Shapiro–Wilk test of normal distribution was performed for all the data. If the data were not normally distributed, we performed the non-parametric Mann–Whitney test for two independent samples; if the data were normally distributed, we performed the parametric Independent Samples t-test.
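The following sketch shows how this test-selection procedure can be expressed with SciPy; the data are hypothetical, and the helper compare_groups() is our own illustration, not part of the study’s tooling.

```python
# A minimal sketch of the described procedure, assuming two score arrays
# (hypothetical data, not the study's); requires scipy.
from scipy import stats

ALPHA = 0.05  # significance threshold used in the study

def compare_groups(group_a, group_b, alpha=ALPHA):
    """Shapiro-Wilk normality check, then t-test or Mann-Whitney U test."""
    # Shapiro-Wilk: null hypothesis is that the sample is normally distributed.
    normal = all(stats.shapiro(g).pvalue > alpha for g in (group_a, group_b))
    if normal:
        # Parametric test for two independent, normally distributed samples.
        result = stats.ttest_ind(group_a, group_b)
        test = "Independent Samples t-test"
    else:
        # Non-parametric alternative for two independent samples.
        result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
        test = "Mann-Whitney U test"
    return test, result.pvalue, result.pvalue < alpha

# Hypothetical example: lab-work percentages of two groups.
group_i = [65.2, 70.1, 58.4, 72.0, 63.3, 66.8, 61.5]
group_ii = [66.7, 68.2, 60.1, 71.4, 64.0, 67.3, 62.2]
print(compare_groups(group_i, group_ii))
```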
5.2. Comparison
Table 4 shows the results of both groups’ performance in lab work. The average
lab work success of Group I, which used ChatGPT, was 65.27%, whilst the average score
of Group II (no ChatGPT) was only slightly better, 66.72%. Results around 66% are due
to participants’ decisions to finish just mandatory assignments; only a small number
of students decided to work on optional assignments. Note that the mandatory and
optional weekly assignments are complementary—usually, optional assignments cover
advanced topics. Surprisingly, Table 4 shows that the lab work results of Group I were slightly worse than those of Group II, although the difference was not statistically significant. Hence, we can conclude that using an LLM is not a decisive factor if the right actions are taken before the execution of the course. These results are discussed
further in the section on threats to validity, where concerns are provided regarding our
controlled experiment.
Table 5 compares the performance (by percentage) of the first, second, and overall
(average) groups in the midterm exams. Group I (ChatGPT) and Group II (no ChatGPT)
solved the same exams. From Table 5 it can be observed that Group I (ChatGPT) performed
slightly better than Group II (no ChatGPT) in terms of average success (mean) on the first
midterm. However, the difference was small, and not statistically significant. In both
groups, the results of the second midterm exam were approximately 10% worse compared
to the first midterm exam. We believe these results are connected with the advanced topics
in the second part of the semester in the course of Programming II; this is a common pattern
observed every year. In the second midterm exam, the results were opposite to the first
midterm exam—Group II (no ChatGPT) outperformed treatment Group I (ChatGPT) by
around 2%; still, the difference was not statistically significant. The same observation holds for the overall midterm results (the average of the first and second midterms)—we could not confirm statistically significant differences between the two groups, although Group II (no ChatGPT) performed slightly better (66.58% vs. 65.96%). Before the experiment, we assumed that Group I (ChatGPT) would have significantly worse results than Group II; this assumption proved wrong. As described earlier, both groups took paper-based midterm exams.
The comparison of the overall results was similar. Table 6 shows that Group I’s average overall success was 65.93%, while Group II achieved a slightly higher average score of 66.61%. Note that the overall grade was composed of 50% midterm exams and 50% practical assignments; students could also receive bonus points for extra tasks (usually no more than 5%). Table 6 reveals no statistically
significant difference between the overall success of Group I and Group II, as determined
by the Mann–Whitney test.
Table 6. Comparison of course final achievements between the groups (Mann–Whitney test).
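The grade composition just described can be expressed as a small sketch; the helper function and the cap at 100% are our assumptions, not the course’s exact grading code.

```python
# A minimal sketch of the described grade breakdown: 50% midterm exams,
# 50% practical assignments, plus bonus points (usually no more than 5%).
# The cap at 100% is our assumption, not a documented course rule.
def overall_grade(midterm_pct: float, practical_pct: float,
                  bonus_pct: float = 0.0) -> float:
    """Combine component percentages into the overall course score."""
    base = 0.5 * midterm_pct + 0.5 * practical_pct
    return min(base + bonus_pct, 100.0)

# Hypothetical student: 66% on midterms, 65% on lab work, 3% bonus.
print(overall_grade(66.0, 65.0, 3.0))  # -> 68.5
```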
These results (see Tables 4–6, again) mean that we could not reject any of the three null hypotheses, confirming that, in our study, ChatGPT had no measurable influence on midterm exams, practical assignments, or final results.
Table 7. Participants’ opinion on course complexity between the groups (Mann–Whitney test).
groups weekly if they were using ChatGPT and for what purpose. The results showed that
both groups followed our suggestions.
However, some students from Group II did not follow the instructions, as indicated in Figure 3, and used ChatGPT for almost every practical assignment. We decided to eliminate these eight students from the background study and feedback results, since they compromised the group, and their inclusion would distort the statistical results, as explained earlier in this section. This is why the number of participants in Group II is slightly smaller than in Group I.
Figure 3. Number of participants in Group II that used ChatGPT regularly for practical assignments.
Figure 4. Number of participants in Group I that used ChatGPT regularly for practical assignments.
6. Threats to Validity
This section discusses the construct, internal, and external validity threats [46] of our
controlled experiment.
utilizing ChatGPT and similar LLMs. In our view, the autonomy granted to participants in
Group I to decide whether to leverage LLMs when encountering challenges or learning new
concepts—instead of forcing all participants in that group to use ChatGPT frequently—is
a feature consistent with realistic classroom learning. Our results affirm the potential to
permit and facilitate the use of LLMs in the subsequent executions of the Programming II
course if the assessment stays the same or similar.
may differ for sub-categories of students taking Programming II, such as those with significant prior programming experience, or those who have previously taken certain courses.
The findings of this study stem from an experiment conducted at a single university,
prompting considerations regarding the generalizability of the results. Our outcomes may
be subject to influence from factors such as demographic characteristics, cultural nuances,
and the scale of the institution (specifically, the number of computer science students).
To address this limitation, we make the data from our experiment available and encourage replications. Engaging in multi-institutional and multinational studies could provide a more
comprehensive understanding of ChatGPT’s impact on the learning experiences of novice
programmers in computer science education, yielding more precise and robust results.
7. Conclusions
ChatGPT has proven to be a valuable tool for many different purposes, like providing
instant feedback and explanations. However, many skeptics emphasize that ChatGPT should not substitute for learning and understanding in classrooms. We must exchange opinions and experiences when a transformative and disruptive technology enters the education process. Our study was motivated by this high-level goal.
This paper presents a controlled experiment that analyzes whether ChatGPT usage for
practical assignments in a Computer Science course influences the outcome of learning. We
formed two groups of first-year students, one that was encouraged to use ChatGPT and
the other that was discouraged. The experiment evaluated a set of common hypotheses
regarding the results from lab work, midterm exams, and overall performance.
The main findings suggest the following:
• Comparing the participants’ success in practical assignments between groups using ChatGPT and
others not using it, we found that the results were not statistically different (see Table 4, again).
We prepared assignments and lab sessions in a way that minimized the likelihood that ChatGPT could help participants blindly, without learning. Our results confirm that our efforts were successful.
• Comparing the participants’ success in midterm exams between groups using ChatGPT and others
not using it, we found that the results were also not statistically different (see Table 5, again).
Although Group I was using ChatGPT, our adjustments probably ensured enough learning effort by that treatment group. Therefore, their results were comparable to those of the control group, which was discouraged from using ChatGPT.
• Comparing the participants’ overall success in a course on Programming II between groups
using ChatGPT and others not using it, we found that the results were also not statistically
different (see Table 6, again).
This means that our specific execution of the course (with all the introduced adjustments) allows using ChatGPT as an additional learning aid.
Our results also indicate that participants believe ChatGPT impacted the final grade positively (Figure 5), but the results do not confirm this for the lab work (Table 4), the midterm exams (Table 5), or the final achievements (Table 6). Participants also reported positive learning experiences (e.g., program understanding; see Figure 7). In addition, we found that ChatGPT was used for different purposes (code optimization, comparison, etc., as indicated in Figure 6). The participants strongly confirmed that they will most likely continue to use ChatGPT (Figure 9).
Future Work
Our study results and ChatGPT-oriented adjustments must be taken with caution in the future. Improvements in large language models will likely require revisions of these adjustments (specifically for practical assignments). Because the adjustments are tailored to the current state of AI technology, ChatGPT’s ever-evolving nature may erode their relevance swiftly, necessitating periodic updates and reassessments to keep them robust and practical. We wish to emphasize the importance of assignment defenses and the accompanying discussion with students. For courses where interactive assignment
defenses are not used as a key form of evaluation, the adoption of ChatGPT may need to
be considered carefully. For example, if teaching assistants were only to test the correctness
of the code submitted by students without any interactive communication, the results of
the evaluation may be different.
This study needs additional replications [38,49]. Different problems (applications) could be applied to the lab work, or a different programming language could be used, to name a few possibilities for strengthening the validity of our conclusions. In addition, we would like to compare the results of midterm exams taken with IDE support; it would be interesting to see how the use of development tools affects these results. We are also interested in applying our experiment design and specific adjustments in the introductory programming (CS1) course of the Computer Science program, as ChatGPT is successful with basic programming concepts and providing solutions. An empirical study of the essential skills obtained (critical thinking, problem-solving, and group work skills) [19] is also necessary to understand ChatGPT’s potential impact on future Computer Science engineers. As discussed in the section on threats to validity, broadening the perspective in replicated studies to involve more institutions and conducting multi-institutional and multinational studies has the potential to yield a deeper comprehension of the integration of large language models in education, leading to more precise and robust outcomes than those presented in this study.
Our future research endeavors in empirical studies with students and ChatGPT should address the limitations of traditional comparative performance metrics (also used in this empirical study). By enriching our research with qualitative assessments, we could uncover
profound insights into cognitive engagement and pedagogical interactions driven by AI
technology. These metrics could offer a more comprehensive understanding of its impact
on teaching and learning processes.
Besides the research directions highlighted above, there exists a multitude of directions associated with the integration of AI technology into pedagogical processes that warrant further investigation. It is essential to delve deeper into the identified risks associated with
integrating ChatGPT into educational settings. These risks include the potential unreliability of generated data, students’ reliance on technology, and the potential impact on
students’ cognitive abilities and interpersonal communication skills. Exploring these risks
comprehensively is crucial for informing educators about the challenges and limitations
of incorporating AI technology in pedagogy. Another intriguing direction for research
would involve examining the positive impacts of ChatGPT. Future experiments aimed at
investigating the potential educational benefits of large language models could yield crucial
insights into their overall impact on learning. On the other hand, we must study the benefits
not only for students but also for educators. These include educators’ abilities to automate
various tedious tasks, such as assessment preparation, monitoring academic performance,
generating reports, etc. This technology can act as a digital assistant for educators, assisting
with generating additional demonstration examples and visual aids for instructional materials. Understanding how educators utilize these positive features can provide insights
into optimizing ChatGPT’s role in educational environments. Furthermore, future research
should address the limitations of ChatGPT in answering questions. Examining students’
reactions when ChatGPT fails to answer questions correctly is essential for understanding
their perceptions and experiences with AI technology in learning contexts. This insight can
guide the development of interventions to support students’ interaction with ChatGPT and
mitigate potential frustrations or challenges they may encounter. These and many more
topics hold great relevance for the education community and merit thorough exploration.
Author Contributions: Conceptualization, T.K.; methodology, T.K.; software, T.K. and D.O.; validation, M.M. and Y.D.L.; investigation, T.K., D.O., M.M. and Y.D.L.; writing—original draft preparation,
T.K., D.O., M.M. and Y.D.L.; writing—review and editing, T.K., D.O., M.M. and Y.D.L. All authors
have read and agreed to the published version of the manuscript.
Funding: The first, second and fourth authors acknowledge the financial support from the Slovenian
Research Agency (Research Core Funding No. P2-0041). The third author acknowledges the financial
support from the Fulbright Scholar Program.
Institutional Review Board Statement: Ethical review and approval were waived for this study
because the tests had the form of a midterm exam.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The data presented in this study are available in https://github.com/tomazkosar/DifferentialStudyChatGPT, accessed on 12 January 2024.
Acknowledgments: The authors wish to thank the whole team of the Programming Methodologies
Laboratory at the University of Maribor, Faculty of Electrical Engineering and Computer Science, for
their help and fruitful discussions during the execution of the controlled experiment.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Stokel-Walker, C.; Van Noorden, R. What ChatGPT and generative AI mean for science. Nature 2023, 614, 214–216. [CrossRef]
2. MacNeil, S.; Tran, A.; Mogil, D.; Bernstein, S.; Ross, E.; Huang, Z. Generating diverse code explanations using the GPT-3
large language model. In Proceedings of the 2022 ACM Conference on International Computing Education Research, Virtual,
7–11 August 2022; Volume 2, pp. 37–39.
3. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018.
Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 24 September 2023).
4. OpenAI. ChatGPT. 2023. Available online: https://chat.openai.com/ (accessed on 24 September 2023).
5. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al.
Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374.
6. Tian, H.; Lu, W.; Li, T.O.; Tang, X.; Cheung, S.C.; Klein, J.; Bissyandé, T.F. Is ChatGPT the Ultimate Programming Assistant–How
far is it? arXiv 2023, arXiv:2304.11938.
7. Rahman, M.M.; Watanobe, Y. ChatGPT for education and research: Opportunities, threats, and strategies. Appl. Sci. 2023, 13, 5783.
[CrossRef]
8. Shoufan, A. Exploring Students’ Perceptions of ChatGPT: Thematic Analysis and Follow-Up Survey. IEEE Access 2023,
11, 38805–38818. [CrossRef]
9. Muñoz, S.A.S.; Gayoso, G.G.; Huambo, A.C.; Tapia, R.D.C.; Incaluque, J.L.; Aguila, O.E.P.; Cajamarca, J.C.R.; Acevedo, J.E.R.;
Rivera, H.V.H.; Arias-Gonzáles, J.L. Examining the Impacts of ChatGPT on Student Motivation and Engagement. Soc. Space 2023,
23, 1–27.
10. Qureshi, B. Exploring the use of ChatGPT as a tool for learning and assessment in undergraduate computer science curriculum:
Opportunities and challenges. arXiv 2023, arXiv:2304.11214.
11. Milano, S.; McGrane, J.A.; Leonelli, S. Large language models challenge the future of higher education. Nat. Mach. Intell. 2023,
5, 333–334. [CrossRef]
12. Dempere, J.; Modugu, K.; Hesham, A.; Ramasamy, L.K. The Impact of ChatGPT on Higher Education. Front. Educ. 2023, 8, 1206936.
[CrossRef]
13. DeFranco, J.F.; Kshetri, N.; Voas, J. Are We Writing for Bots or Humans? Computer 2023, 56, 13–14. [CrossRef]
14. Cao, J.; Li, M.; Wen, M.; Cheung, S.C. A study on prompt design, advantages and limitations of ChatGPT for deep learning
program repair. arXiv 2023, arXiv:2304.08191.
15. Qin, C.; Zhang, A.; Zhang, Z.; Chen, J.; Yasunaga, M.; Yang, D. Is ChatGPT a general-purpose natural language processing task
solver? arXiv 2023, arXiv:2302.06476.
16. Dwivedi, Y.K.; Kshetri, N.; Hughes, L.; Slade, E.L.; Jeyaraj, A.; Kar, A.K.; Baabdullah, A.M.; Koohang, A.; Raghavan, V.; Ahuja, M.;
et al. “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative
conversational AI for research, practice and policy. Int. J. Inf. Manag. 2023, 71, 102642. [CrossRef]
17. Winslow, L.E. Programming pedagogy—A psychological view. ACM SIGCSE Bull. 1996, 28, 17–22. [CrossRef]
18. Lukpat, A. ChatGPT Banned in New York City Public Schools over Concerns about Cheating, Learning Development.
2023. Available online: https://www.wsj.com/articles/chatgpt-banned-in-new-york-city-public-schools-over-concerns-about-cheating-learning-development-11673024059 (accessed on 24 September 2023).
19. Sánchez-Ruiz, L.M.; Moll-López, S.; Nuñez-Pérez, A.; Moraño-Fernández, J.A.; Vega-Fleitas, E. ChatGPT Challenges Blended
Learning Methodologies in Engineering Education: A Case Study in Mathematics. Appl. Sci. 2023, 13, 6039. [CrossRef]
20. Susnjak, T. ChatGPT: The end of online exam integrity? arXiv 2022, arXiv:2212.09292.
21. Yilmaz, R.; Yilmaz, F.G.K. Augmented intelligence in programming learning: Examining student views on the use of ChatGPT
for programming learning. Comput. Hum. Behav. Artif. Hum. 2023, 1, 100005. [CrossRef]
22. Geng, C.; Yihan, Z.; Pientka, B.; Si, X. Can ChatGPT Pass An Introductory Level Functional Language Programming Course?
arXiv 2023, arXiv:2305.02230.
23. Shoufan, A. Can Students without Prior Knowledge Use ChatGPT to Answer Test Questions? An Empirical Study. ACM Trans.
Comput. Educ. 2023, 23, 45. [CrossRef]
24. King, M.R.; ChatGPT. A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cell. Mol. Bioeng.
2023, 16, 1–2. [CrossRef]
25. Wohlin, C.; Runeson, P.; Höst, M.; Ohlsson, M.C.; Regnell, B.; Wesslén, A. Experimentation in Software Engineering; Springer Science
& Business Media: Berlin/Heidelberg, Germany, 2012.
26. Chowdhary, K.R. Natural language processing. In Fundamentals of Artificial Intelligence; Springer: New Delhi, India, 2020; pp. 603–649.
27. King, M.R. The future of AI in medicine: A perspective from a Chatbot. Ann. Biomed. Eng. 2023, 51, 291–295. [CrossRef]
28. Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694. [CrossRef]
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
30. Adamopoulou, E.; Moussiades, L. Chatbots: History, technology, and applications. Mach. Learn. Appl. 2020, 2, 100006. [CrossRef]
31. Jeon, J.; Lee, S.; Choe, H. Beyond ChatGPT: A conceptual framework and systematic review of speech-recognition chatbots for
language learning. Comput. Educ. 2023, 206, 104898. [CrossRef]
32. Hughes, A. ChatGPT: Everything You Need to Know about OpenAI’s GPT-4 Tool. 2023. Available online: https://www.sciencefocus.com/future-technology/gpt-3 (accessed on 26 September 2023).
33. White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A prompt pattern
catalog to enhance prompt engineering with ChatGPT. arXiv 2023, arXiv:2302.11382.
34. White, J.; Hays, S.; Fu, Q.; Spencer-Smith, J.; Schmidt, D.C. ChatGPT Prompt Patterns for Improving Code Quality, Refactoring,
Requirements Elicitation, and Software Design. arXiv 2023, arXiv:2303.07839.
35. Giner-Miguelez, J.; Gómez, A.; Cabot, J. A domain-specific language for describing machine learning datasets. J. Comput. Lang.
2023, 76, 101209. [CrossRef]
36. de la Vega, A.; García-Saiz, D.; Zorrilla, M.; Sánchez, P. Lavoisier: A DSL for increasing the level of abstraction of data selection
and formatting in data mining. J. Comput. Lang. 2020, 60, 100987. [CrossRef]
37. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.;
Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ.
Differ. 2023, 103, 102274. [CrossRef]
38. Kosar, T.; Gaberc, S.; Carver, J.C.; Mernik, M. Program comprehension of domain-specific and general-purpose languages:
Replication of a family of experiments using integrated development environments. Empir. Softw. Eng. 2018, 23, 2734–2763.
[CrossRef]
39. Sonnleitner, P.; Brunner, M.; Greiff, S.; Funke, J.; Keller, U.; Martin, R.; Hazotte, C.; Mayer, H.; Latour, T. The Genetics Lab.
Acceptance and psychometric characteristics of a computer-based microworld to assess complex problem solving. Psychol. Test
Assess. Model. 2012, 54, 54–72.
40. Ouh, E.L.; Gan, B.K.S.; Shim, K.J.; Wlodkowski, S. ChatGPT, Can You Generate Solutions for my Coding Exercises? An Evaluation
on its Effectiveness in an undergraduate Java Programming Course. arXiv 2023, arXiv:2305.13680.
41. Moradi Dakhel, A.; Majdinasab, V.; Nikanjam, A.; Khomh, F.; Desmarais, M.C.; Jiang, Z.M.J. GitHub Copilot AI pair programmer:
Asset or Liability? J. Syst. Softw. 2023, 203, 111734. [CrossRef]
42. Imai, S. Is GitHub Copilot a Substitute for Human Pair-Programming? An Empirical Study. In Proceedings of the ACM/IEEE
44th International Conference on Software Engineering: Companion Proceedings (ICSE ’22), Pittsburgh, PA, USA, 21–29 May
2022; pp. 319–321.
43. Asare, O.; Nagappan, M.; Asokan, N. Is GitHub’s Copilot as bad as humans at introducing vulnerabilities in code? Empir. Softw.
Eng. 2023, 28, 129. [CrossRef]
44. Likert, R. A technique for the measurement of attitudes. Arch. Psychol. 1932, 22, 55.
45. Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures, 5th ed.; Chapman and Hall/CRC: New York, NY,
USA, 2011.
46. Feldt, R.; Magazinius, A. Validity Threats in Empirical Software Engineering Research—An Initial Survey. In Proceedings of
the 22nd International Conference on Software Engineering & Knowledge Engineering (SEKE’2010), Redwood City, CA, USA,
1–3 July 2010; Knowledge Systems Institute Graduate School: Skokie, IL, USA, 2010; pp. 374–379.
47. Ralph, P.; Tempero, E. Construct Validity in Software Engineering Research and Software Metrics. In Proceedings of the 22nd
International Conference on Evaluation and Assessment in Software Engineering 2018 (EASE’18), Christchurch, New Zealand,
28–29 June 2018; pp. 13–23.
48. Sjøberg, D.I.K.; Bergersen, G.R. Construct Validity in Software Engineering. IEEE Trans. Softw. Eng. 2023, 49, 1374–1396.
[CrossRef]
49. Shull, F.J.; Carver, J.C.; Vegas, S.; Juristo, N. The role of replications in empirical software engineering. Empir. Softw. Eng. 2008,
13, 211–218. [CrossRef]
50. Carver, J.C. Towards reporting guidelines for experimental replications: A proposal. In Proceedings of the 1st International
Workshop on Replication in Empirical Software Engineering, Cape Town, South Africa, 2–8 May 2010; pp. 1–4.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.