
Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study

Wenhan Lyu, Yimeng Wang, Tingting (Rachel) Chung, Yifan Sun, and Yixuan Zhang
William & Mary, Williamsburg, VA, USA

ABSTRACT
The integration of AI assistants, especially through the development of Large Language Models (LLMs), into computer science education has sparked significant debate, highlighting both their potential to augment student learning and the risks associated with their misuse. An emerging body of work has looked into using LLMs in education, primarily focusing on evaluating the performance of existing models or conducting short-term human subject studies. However, very little work has examined the impacts of LLM-powered assistants on students in entry-level programming courses, particularly in real-world contexts and over extended periods. To address this research gap, we conducted a semester-long, between-subjects study with 50 students using CodeTutor, an LLM-powered assistant developed by our research team. Our study results show that students who used CodeTutor (the "CodeTutor group" as the experimental group) achieved statistically significant improvements in their final scores compared to peers who did not use the tool (the "control group"). Within the CodeTutor group, those without prior experience with LLM-powered tools demonstrated significantly greater performance gains than their counterparts. We also found that students expressed positive feedback regarding CodeTutor's capability to comprehend their queries and assist in learning programming language syntax. However, they had concerns about CodeTutor's limited role in developing critical thinking skills. Over the course of the semester, students' agreement with CodeTutor's suggestions decreased, with a growing preference for support from traditional human teaching assistants. Our findings also show that students turned to CodeTutor for different tasks, including programming task completion, syntax comprehension, and debugging, particularly seeking help for programming assignments. Our analysis further reveals that the quality of user prompts was significantly correlated with CodeTutor's response effectiveness. Building upon these results, we discuss the implications of our findings for the need to integrate Generative AI literacy into curricula to foster critical thinking skills, and turn to examining the temporal dynamics of user engagement with LLM-powered tools. We further discuss the discrepancy between the anticipated functions of tools and students' actual capabilities, which sheds light on the need for tailored strategies to improve educational outcomes.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI).

KEYWORDS
Field study, Large Language Models, Tutoring

ACM Reference Format:
Wenhan Lyu, Yimeng Wang, Tingting (Rachel) Chung, Yifan Sun, and Yixuan Zhang. 2024. Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study. In Proceedings of the Tenth ACM Conference on Learning @ Scale (L@S '24), July 18–20, 2024, Atlanta, GA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
L@S '24, July 18–20, 2024, Atlanta, GA, USA
© 2024 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Recent advancements in Generative AI and Large Language Models (LLMs), exemplified by GitHub Copilot [15] and ChatGPT [32], have demonstrated their capacity to tackle complex problems with human-like proficiency. These innovations raise significant concerns within the educational domain, particularly as students might misuse these tools, thereby compromising the quality of education and breaching academic integrity norms [36]. Specifically, entry-level computer science education is directly affected by the progress in LLMs [58]. LLMs' capability in handling programming tasks means they can complete many assignments typically given in introductory courses, thus becoming highly appealing to students looking for easy solutions.
Despite these challenges, LLM-powered tools offer great opportunities to enrich computer science education [23]. When used ethically and appropriately, they can serve as powerful educational resources. For instance, LLMs can provide students instant feedback on their coding assignments or generate diverse examples of code that help demonstrate programming concepts [35]. Moreover, as Generative AIs are becoming popular in production environments, familiarizing students with these technologies is increasingly becoming a crucial aspect of computer science education.

The unique challenges posed by LLMs stem from the difficulty of detecting the use of AI tools [54, 57]. Traditional approaches, such as plagiarism detection software, fall short in determining the originality of student submissions [28]. Given the challenges in identifying LLM usage and recognizing the potential advantages of these technologies, we consider integrating LLMs into computer science education inevitable. Yet, although students have already started using such tools, the impact of LLMs on computer science education remains largely unknown. Indeed, a growing body of research has begun to explore the application of LLMs within educational settings, primarily focusing on assessing the capabilities of current models with existing datasets or previous assignments from students [18, 27]. However, there is still a research gap in understanding how students interact with LLM-powered tools in introductory programming classes, particularly regarding their engagement in genuine learning settings over extended periods. Furthermore, while previous studies have shown individual differences in intelligent tutoring systems [22], research into how these differences apply to LLM tools is lacking. Investigating these variations is important for tailoring educational strategies to diverse student needs. In short, understanding these nuanced attitudes towards and interactions with LLM-powered tools in CS education over extended periods is crucial for identifying the evolving challenges and opportunities LLMs introduce.
To address this research gap, we asked the following research questions (RQs) in this work:
RQ1. Does the integration of LLM-powered tools in introductory programming courses enhance or impair students' learning outcomes, compared to traditional teaching methods? How are individual differences associated with students' learning outcomes using LLM-powered tools?
RQ2. What are students' attitudes towards LLM-powered tools, how do they change over time, and which factors might influence these attitudes?
RQ3. How do students engage with LLM-powered tools, and how do these tools respond to their programming needs?
We believe that addressing these research questions is critical for enabling researchers to make informed decisions about incorporating LLMs into their courses and guiding students on the optimal and responsible use of LLM-powered tools. To answer these questions, we conducted a longitudinal, between-subjects field study with 50 students over the course of the fall semester, from September to December 2023, using a web-based tool we developed called CodeTutor.
The contributions of this work are: 1) We conducted a semester-long longitudinal field study to assess the effectiveness of an LLM-powered tool (CodeTutor) on students' learning outcomes in an introductory programming course. By comparing the performance of students who used CodeTutor against those who did not, our study contributes new empirical evidence regarding the role of LLM-powered tools in the programming learning experience; 2) We characterized patterns of student engagement with CodeTutor and analyzed the ways in which it can meet students' learning needs. Through the analysis of conversational interactions and feedback loops between students and the tool, we contributed new knowledge regarding how CodeTutor facilitates or impedes learning; and 3) We offered insights and outlined design implications for future research.

2 RELATED WORK
2.1 Intelligent Tutoring Systems
Using computerized tools to assist educational purposes is not a new idea. As early as the 1950s, the concept of using computers to assist learning had already emerged [29]. Once the factor of intelligence was brought into consideration, these systems started evolving into Intelligent Tutoring Systems (ITS) [46]. ITS leverage artificial intelligence to provide personalized learning experiences in computer science education, adapting instruction and feedback to individual student needs [3, 14]. These systems have enhanced student engagement, comprehension, and problem-solving skills by offering tailored support and immediate feedback, similar to one-on-one tutoring [10, 52]. Research has demonstrated that ITS can significantly improve understanding of complex concepts in programming courses compared to traditional teaching methods, leading to higher student satisfaction due to the personalized learning environment [9, 42]. The Internet also empowered ITS to offer more interactivity and adaptivity [5–7], paving the way for a later boost from natural language processing techniques [13, 19].
However, prior work has shown that as the granularity of tutoring decreases, its effectiveness increases [52]. Significant limitations of ITS include the complexity and cost of building them, their inability to answer questions and tasks outside their programmed domains, and the difficulty of developing them so that they can be used productively by individuals without expertise [16]. Even though the Generalized Intelligent Framework for Tutoring (GIFT) framework [47] was proposed and has evolved for developing ITS for use at scale, those limitations mostly remain unresolved.

2.2 Large Language Models in CS Education
The release of ChatGPT and other Generative AI applications brought LLMs into the public view and attracted enormous attention [1, 48]. LLMs offer researchers and users the flexibility to employ a single tool across various tasks [53], such as medical research [8, 49], finance [55], and education [21]. Adopting LLM-powered tools in educational settings is facilitated by their broad accessibility and cost-free nature [56]. Recent studies have looked into the potential of AI assistants to enhance student learning by helping with students' problem-solving [2, 25, 37] and generating computer science content [11, 43]. Current research on the use of LLMs in education has primarily looked into their performance and capabilities [40] compared to humans, such as generating code for programming tasks [24, 39], answering general inquiries [38, 44], addressing textbook questions [20], and answering exam questions [12].
Despite the growing interest in examining the capabilities of LLMs in education, very few empirical studies have examined the emerging concerns regarding their impact. Therefore, there is an urgent need for research into the long-term effects of LLMs in CS education and the development of strategies to counteract potential negative consequences. One exceptional work was conducted by Liffiton et al. [26], who developed a tool called CodeHelp for assisting students with their debugging needs in an undergraduate course over 12 weeks.

Their follow-up study [45] categorized the message history in their tool and found a positive relationship between tool usage and course performance. However, their study specifically focused on debugging issues and did not compare the outcomes with those achieved through traditional TA methods.
Furthermore, prior research has demonstrated that individual differences, such as gender, race, and prior experiences with technologies, significantly influence the effectiveness of traditional intelligent tutoring systems [22]. However, work that examines how these individual differences affect interactions with and perceptions of LLM-powered tools in educational settings is sparse. Given the increasing integration of LLMs in programming courses and beyond, understanding the role of demographic and individual variability is crucial for developing inclusive and effective educational tools that suit diverse students' needs.
Our work seeks to address these research gaps by conducting a field study that evaluates the use of LLM-powered tools for an extended period of time. In particular, our study not only aims to evaluate the practicality of LLMs in programming learning contexts, but also intends to contribute to a more nuanced understanding of their long-term implications for learning and teaching methodologies.

3 METHOD
In this section, we describe the design of CodeTutor (subsection 3.1), an overview of our participants (subsection 3.2), our study procedure and data collection (subsection 3.3), and our quantitative and qualitative data analysis (subsection 3.4). The source code of CodeTutor, the pre-test questions, and the data analysis code are available on osf.io/e3zgh.

3.1 Design of CodeTutor
We developed CodeTutor, a browser-based web application built with TypeScript and front-end frameworks (e.g., SolidJS, Astro, and libraries such as Zag), for a responsive and interactive user interface. CodeTutor integrates the OpenAI API, which provides access to the GPT-3.5 model offered by OpenAI. The main interface is shown in Figure 1.

[Figure 1 appears here. Interface annotations: 1 Conversation History; 2 Main Conversation; 3 Conversation-level feedback, triggered when users are inactive for 10 minutes, end the conversation, or click the feedback button; 4 Message-level feedback, triggered when users click the upvote or downvote button; additional controls for light/dark mode and deleting messages.]
Figure 1: CodeTutor is a web application that leverages the OpenAI API, featuring four main components: 1 Conversation History that lists different conversation threads, 2 Main Conversation that shows an ongoing dialogue with CodeTutor, 3 Conversation-level Feedback module that allows users to elaborate on their attitudes towards CodeTutor by providing ratings on 1) comprehension, 2) critical thinking, 3) syntax mastery, 4) independent learning, and 5) TA replacement likelihood, and to provide specific comments, and 4 Message-level Feedback that offers options for users to give detailed feedback on individual messages or responses from CodeTutor.

Login. Students log in to CodeTutor using their email addresses, with a randomly generated unique identifier (UID) that tracks their activities anonymously.
User Interface. The CodeTutor interface features a navigation sidebar and a central chat area. The sidebar enables easy navigation, with a button for starting new conversations and a chronological listing of existing ones for quick access.
User Feedback Structure. Feedback is important in CodeTutor in order to understand user engagement and students' attitudes towards it. CodeTutor provides two feedback mechanisms: 1) conversation-level and 2) message-level feedback.
Data Storage. CodeTutor stores data locally in the user's browser with IndexedDB and only uploads essential information to our secure server for research purposes, where a unique ID for anonymous tracking identifies each conversation. To protect privacy, CodeTutor cannot read stored data back from our server.
API Usage. OpenAI offered only limited configuration ability for their API at the time we started our experiment. We therefore carefully crafted the system role text in our implementation to specify that the model should answer questions as a teaching assistant in an entry-level Python class, keeping the answers from the OpenAI API consistent even if the length of a conversation exceeds its token limit.
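The sketch below illustrates this system-role approach with the OpenAI Python client; the role wording, model settings, and function names are illustrative assumptions rather than CodeTutor's actual implementation (which is written in TypeScript).

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Illustrative system role text; CodeTutor's actual wording is in its released source.
SYSTEM_ROLE = (
    "You are a teaching assistant for an entry-level Python programming course. "
    "Answer as a patient human TA would, prefer simple and readable beginner-level "
    "code, and stay within the scope of introductory Python."
)

def ask_codetutor(history: list[dict], user_message: str) -> str:
    # The system role is re-sent with every request so the TA persona persists
    # even when older turns are dropped to stay within the model's token limit.
    messages = (
        [{"role": "system", "content": SYSTEM_ROLE}]
        + history
        + [{"role": "user", "content": user_message}]
    )
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content
```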
3.2 Participants
Upon approval from our institution's Institutional Review Board (IRB), we conducted a field study with 50 participants. The field study took place in the Computer Science Department of a 4-year university in the United States. Our criteria for participation were that participants needed to be 18 years or older, be able to speak and write in English, and be registered as entry-level undergraduate computer science students at our institution. Table 1 presents an overview of our participants' demographic information.

Table 1: Overview of Participant Characteristics

Characteristics                        Options                        Number of participants
Gender                                 Man                            25
                                       Woman                          22
                                       Non-binary                     1
                                       Prefer not to say              2
Major                                  Computer Science               18
                                       Data Science                   9
                                       Biology                        5
                                       Mathematics                    4
                                       Economics                      3
                                       Others                         10
                                       Not reported                   1
Year of Study                          Freshman                       37
                                       Junior                         6
                                       Sophomore                      5
                                       Senior                         1
                                       Not reported                   1
Race                                   White                          26
                                       Asian                          17
                                       Multiracial                    3
                                       African American or Black      1
                                       Not reported                   3
Ethnicity                              Latino/Hispanic                3
Prior Experience with LLM tools        Only ChatGPT                   28
                                       ChatGPT and other tools        11
                                       Never used                     11

3.3 Study Procedure & Data Collection
Our field study lasted from September 27 (after the course add-drop period) to December 11, 2023 (when the final exam was due). Below, we describe each component of our study.
3.3.1 Pre-test. Participants were initially requested to provide their consent to participate, being informed about the study's objectives, procedures, and their rights as participants, including the right to withdraw at any time without penalty. Following the consent process, the pre-test assessment was administered to evaluate students' existing knowledge of Python programming, providing a baseline for subsequent analysis.

This pre-test included three sections with Python questions, with a total of 22 questions that varied in difficulty for an evaluation of participant skills. The first section featured eight questions (Questions 1-8, for example, "What is the output of the following code: print(3+4)?"), the second section included seven questions of medium difficulty (Questions 9-15, for example, "If I wanted a function to return the product of two numbers a and b, what should the return statement look like?"), and the third section presented seven challenging questions (Questions 16-22, for example, "What will be the output of the following code? [Multiple lines of code]"). The total score of the three sections was 100 points. Pre-test submissions were graded by our researchers with Computer Science backgrounds, using predetermined scoring criteria.
This pre-test also asked about participants' prior experience with LLMs, specifically asking, "Which of the following Large Language Model AI tools have you used before? Please select all that apply." Participants were also asked to provide demographic information, including their major (or intended major), gender, and race/ethnicity. Participants were assured that all demographic information would remain anonymous and be used solely for research purposes.

3.3.2 Control vs. Experimental Group. Participants were divided into two groups: the control group, which used traditional learning methods and had access to human teaching assistants (TAs) for additional support outside class hours, and the experimental group, which used CodeTutor as their primary educational tool beyond class hours, alongside access to standard learning materials and human TAs. Using LLM-based tools other than CodeTutor in this course was prohibited.
To divide participants into a control group and an experimental group, we initially sorted the entire sample based on their previous engagement with LLM-powered tools, resulting in two groups: those who had used any LLM-powered tools before (Used Before) and those who had not (Never Used). Within the Used Before category, we split the participants into two subsets, Used Before Subset A and Used Before Subset B, based on the overall pre-test result distribution to ensure both subsets were representative of the wider group. The same process was applied to the Never Used group, generating two additional subsets: Never Used Subset A and Never Used Subset B. The experimental group was then formed by combining Used Before Subset A with Never Used Subset A, while the control group consists of the combination of Used Before Subset B and Never Used Subset B. This method ensured the experimental and control groups were balanced regarding prior experience with chatbots and their pre-test performance (see Figure 2).
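A minimal sketch of this stratified assignment is shown below; the data structure, field names, and pairing heuristic are our own assumptions, not the study's released code.

```python
import random

def assign_groups(students, seed=0):
    """students: list of dicts with 'id', 'used_llm_before' (bool), and 'pretest_score'."""
    rng = random.Random(seed)
    experimental, control = [], []
    for used_before in (True, False):
        # One stratum per prior-experience level, sorted by pre-test score.
        stratum = sorted(
            (s for s in students if s["used_llm_before"] == used_before),
            key=lambda s: s["pretest_score"],
        )
        # Split consecutive score-sorted pairs between the two groups so both
        # subsets track the stratum's pre-test score distribution.
        for i in range(0, len(stratum), 2):
            pair = stratum[i:i + 2]
            rng.shuffle(pair)
            experimental.append(pair[0])
            if len(pair) > 1:
                control.append(pair[1])
    return experimental, control
```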
Following their group assignments, students in the experimental group were sent detailed instructions via email on how to access and use CodeTutor. In the field study, participants were not mandated to adhere to a specific frequency of engagement with CodeTutor; instead, they were encouraged to utilize the tool at their own pace. This approach allowed for a naturalistic observation of how students integrate LLM-powered educational resources into their learning processes, without imposing additional constraints that could influence their study habits or the study's outcomes.

[Figure 2 appears here: distribution of pre-test correct answer counts for the control group (n = 25) and the experimental group (n = 25), with group means of 9.44 and 8.68; t_Student(48) = 0.61, p = 0.55, g_Hedges = 0.17, CI 95% [−0.38, 0.71], n_obs = 50.]
Figure 2: Parametric pairwise comparison (ANOVA) reveals no significant difference in correct answer count of pre-test in the control and experimental groups.

3.3.3 Student Evaluation. At the end of the semester, students' final grades were used as a primary measure to assess their learning outcomes and the impact of CodeTutor interventions. While acknowledging that final grades are influenced by various factors, they offer a standardized measure of overall academic success, enabling an assessment of CodeTutor's role in improving student learning outcomes.
Final grades were determined by a weighted average that includes several components for each student: labs (practical mini-projects), assignments (individual coding tasks, such as array summation), mid-terms, and a final exam (comprising questions similar to those in the pre-test). Note that a student's final grade can surpass 100 if bonus points are awarded throughout the semester. Access to CodeTutor was restricted during mid-terms and final exams, categorizing the assessment components into two groups: CodeTutor-Allowed (labs and assignments) and CodeTutor-Not-Allowed (mid-terms and final exams). This categorization facilitates an analysis of CodeTutor's impact on student performance by examining potential dependencies on the tool and the improvement of learning outcomes in its absence.

3.4 Data Analysis
3.4.1 Quantitative Data Analysis. We examined the students' scores, interaction behaviors, and attitudes towards using CodeTutor through multiple statistical analyses.
First, we calculated descriptive statistics for all variables, including frequencies with percentages for categorical variables and means and standard deviations for continuous variables. To examine the variation in students' scores before and after the intervention (i.e., the use of CodeTutor), we conducted paired t-tests for both the experimental and control groups. Multiple regression analyses with family-wise p-value adjustment were used to examine the effects of CodeTutor on score improvement, taking into account students' past experiences using LLM-powered tools and demographic variables, such as major, gender, and race. We then investigated the impact of CodeTutor accessibility on academic performance with the ANOVA method. Moreover, we conducted a chi-squared test to explore the relationship between the quality of students' prompts and CodeTutor's performance. To understand students' attitudes towards CodeTutor, we calculated Spearman's correlation matrix for continuous variables, given the characteristics of our data, which are non-normal and exhibit unequal variance. Furthermore, to examine differences between questions, we used the Kruskal-Wallis Rank Sum Test (using the R package stats [41]) and then performed post-hoc tests using Dunn's test (using the R package FSA [30]) in cases where significant differences were found. To investigate the importance of time on students' attitudes towards CodeTutor, we introduced a linear mixed effects (LME) model (using the R package lme4 [4]). We considered statistical significance at a level of p < 0.05 in most cases, except in multiple regression analyses, where we used p < 0.1 and reported effect sizes large enough to indicate the relationship between variables.
3.4.2 Qualitative Data Analysis. We also analyzed the conversational history between users and CodeTutor. Specifically, we used the General Inductive Approach [50] to guide our thematic analysis of the conversational data. The first author conducted a close reading of the data to gain a preliminary understanding of the conversational data and then labeled text segments to formulate categories, which served as the basis for constructing low-level codes to capture specific elements of the user-CodeTutor interactions. Similar low-level codes were then clustered together to form high-level themes. During the analysis, the research team engaged in ongoing discussions to refine and clarify emerging themes.

4 RESULTS
In this section, we examine the impact of CodeTutor on student academic performance (subsection 4.1, answering RQ1), analyze students' attitudes towards learning with CodeTutor (subsection 4.2, answering RQ2), and characterize their engagement patterns in entry-level programming courses (subsection 4.3, answering RQ3).

4.1 RQ1: Learning Outcomes with CodeTutor
4.1.1 Comparative Analysis of Score Improvements. Overall, students in the experimental group exhibited a greater average improvement in scores, as illustrated by comparing their pre-test and final scores to those in the control group. Specifically, the average increase for the experimental group was 12.50, whereas the control group showed an average decrease of 3.17 when comparing final scores to pre-test scores.
We conducted paired t-tests for both the experimental and control groups to determine if the observed improvements were statistically significant, starting with the premise that there were no differences in pre-test scores between these two groups. Our null hypothesis assumed that the true mean difference between pre-test and final scores was zero. For the control group, the null hypothesis could not be rejected, suggesting that the differences between pre-test and final scores were not statistically significant (t = -0.879, p = 0.394). Conversely, participants in the experimental group demonstrated significant improvement from the pre-test to final scores, indicating a statistically significant enhancement in their scores (t = -2.847, p = 0.009).
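The paired t-tests reported above can be reproduced with a few lines of Python; the file and column names below are hypothetical placeholders (the study's released analysis code uses R).

```python
import pandas as pd
from scipy import stats

# Hypothetical layout: one row per student with group, pre-test, and final scores.
scores = pd.read_csv("scores.csv")  # columns: student_id, group, pretest, final

for group_name, group_df in scores.groupby("group"):
    # Paired t-test of pre-test vs. final score within each group.
    t_stat, p_value = stats.ttest_rel(group_df["pretest"], group_df["final"])
    print(f"{group_name}: t = {t_stat:.3f}, p = {p_value:.3f}")
```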

Furthermore, when examining the improvement in CodeTutor-Not-Allowed components, the experimental group exhibited an average increase of 7.33, whereas the control group showed no significant change. By conducting a paired t-test comparing the pre-test and final exam scores (during which the use of CodeTutor was not permitted), we observed that students in the experimental group demonstrated a statistically significant improvement (t = -2.405, p = 0.026). This result suggests that students who had used CodeTutor exhibited more substantial improvement even when CodeTutor was unavailable.

[Figure 3 appears here: scores on CodeTutor-Allowed components (n = 21, mean = 102.29) vs. CodeTutor-Not-Allowed components (n = 21, mean = 93.40) in the experimental group; t_Student(40) = 2.31, p = 0.03, g_Hedges = 0.69, CI 95% [0.07, 1.30], n_obs = 42.]
Figure 3: Parametric pairwise comparison (ANOVA) reveals a significantly higher mean score in the "CodeTutor-Allowed" group compared to the "CodeTutor-Not-Allowed" group.

4.1.2 Effect of CodeTutor Accessibility on Academic Performance. By constructing the CodeTutor-Allowed and CodeTutor-Not-Allowed categories, we examined the correlation between CodeTutor's accessibility and student academic performance. Using the ANOVA technique on the data from the experimental group, Figure 3 reveals that the mean score for the CodeTutor-Allowed category stands at 102.29, in contrast to the CodeTutor-Not-Allowed components, which have a mean score of 93.40. The statistical analysis shows a significant difference between the two categories (t = 2.31, p = 0.03), suggesting that the allowance of CodeTutor correlates with higher student scores.

4.1.3 Correlation Between Student Demographics and Final Scores in the Experimental Group. Subsequently, we evaluated demographic factors to determine whether specific student groups, particularly those with prior tech experience, experienced greater benefits from CodeTutor. Table 2 shows the results of multiple regression models examining how students' final scores in the experimental group are associated with their LLM history, major, gender, and race. Students who had never used any LLM-powered tools showed a significantly greater increase in final score (β = 18.877, p = 0.032) than students who had used such tools before.
Moreover, differences in final scores among various majors within the experimental group were statistically significant, indicating that major plays a substantial role in final scores in the experimental group. Students majoring in data science (β = 14.532, p = 0.073), mathematics (β = 17.692, p = 0.057), and biology (β = 16.257, p = 0.057) exhibited a significant positive correlation with final scores compared to those majoring in computer science, suggesting that these majors achieved higher final scores. In terms of gender, no significant effects were observed, indicating no difference between genders in final scores. Additionally, no significant differences were noted across races in final scores.

Table 2: Multiple regression models explaining respondents' final scores in the experimental group. (Significance level: † p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001).

                                                 Estimate   Std. Error   t value   Pr(>|t|)
Const                                              93.683        3.877    24.166   0.000 ***
Prior Experience with LLM tools (Reference: Used before)
  Never used                                       18.877        5.054     3.735   0.032 *
Major (Reference: Computer Science)
  Data Science                                     14.532        5.662     2.567   0.073 †
  Mathematics                                      17.692        5.852     3.023   0.057 †
  Biology                                          16.257        5.662     2.871   0.057 †
  Economics                                         1.362        4.799     0.284   0.784
  Others                                          -13.004        6.022    -2.160   0.115
Gender (Reference: Female)
  Male                                              5.917        3.845     1.539   0.223
Race (Reference: White)
  Asian                                            -7.831        3.933    -1.991   0.128
  African American or Black                         8.099        7.107     1.140   0.322
  Others                                            6.102        5.416     1.127   0.322
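A Python sketch of such a regression with a family-wise p-value adjustment is shown below; the file name, column names, and the choice of the Holm method are assumptions for illustration (the paper does not specify the adjustment method, and its analysis was done in R).

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Hypothetical layout: one row per experimental-group student.
exp = pd.read_csv("experimental_group.csv")  # final_score, prior_llm, major, gender, race

model = smf.ols(
    "final_score ~ C(prior_llm, Treatment('Used before'))"
    " + C(major, Treatment('Computer Science'))"
    " + C(gender, Treatment('Female'))"
    " + C(race, Treatment('White'))",
    data=exp,
).fit()

# Family-wise adjustment of the coefficient p-values (Holm shown as one option).
adjusted_p = multipletests(model.pvalues.drop("Intercept"), method="holm")[1]
print(model.params)
print(adjusted_p)
```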
Summary of results of RQ1: Collectively, our findings suggest that students in the experimental group achieved significant score improvements with CodeTutor. In particular, those who had never used LLM-powered tools before achieved even greater improvements, and students majoring in data science, mathematics, and biology surpassed their computer science counterparts. Moreover, students exhibited higher scores when permitted to use CodeTutor.

4.2 RQ2: Students' Attitudes towards CodeTutor
4.2.1 Descriptive Analysis. In terms of students' attitudes towards CodeTutor (see Figure 1, component 3, for the specific questions), we found that a small portion of students (8%) strongly disagreed or disagreed that CodeTutor accurately understood what they intended to ask, while most (67%) agreed or strongly agreed. In addition, 35% strongly disagreed or disagreed that CodeTutor helped them think critically, while 19% agreed or strongly agreed. Furthermore, 13% of students disagreed that CodeTutor improved their understanding of programming syntax, with a larger proportion of individuals agreeing (33%) or strongly agreeing (25%). Nearly half of the students (42%) agreed or strongly agreed that CodeTutor helped them build their own understanding, while fewer (17%) strongly disagreed or disagreed. Finally, regarding the potential of CodeTutor to substitute for a human teaching assistant, 20% of the students strongly disagreed or disagreed with this notion, while 42% of them agreed or strongly agreed. Figure 4 shows the distribution of students' responses across these five questions.

Table 3: Linear Mixed-Effects Model of Student Attitudes over time. (Significance level: † p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001). Over time, students exhibit a significant decline in their agreement with CodeTutor's comprehension and its replacement of human teaching assistants.

                    Comprehension       Critical Thinking   Syntax Mastery      Independent Learning   TA Replacement
                    β (Std. Error)      β (Std. Error)      β (Std. Error)      β (Std. Error)         β (Std. Error)
Const               4.700 (0.297)***    2.690 (0.247)***    3.760 (0.262)***    3.044 (0.218)***       3.964 (0.330)***
Time                -0.114 (0.039)**    0.040 (0.037)       -0.018 (0.041)      0.054 (0.036)          -0.099 (0.051)†

[Figure 4 appears here: stacked distributions of responses (Strongly Disagree / Disagree / Neutral / Agree / Strongly Agree) per question. Comprehension: 2% / 6% / 25% / 21% / 46%; Critical Thinking: 6% / 29% / 46% / 15% / 4%; Syntax Mastery: 13% / 29% / 33% / 25% (remaining category not legible); Independent Learning: 2% / 15% / 40% / 37% / 6%; TA Replacement: 8% / 12% / 38% / 13% / 29%.]
Figure 4: Participants' attitudes toward CodeTutor, in terms of comprehension, critical thinking, syntax mastery, independent learning, and TA replacement (see Figure 1 for detailed questions).

[Figure 5 appears here: correlation matrix of the five attitude metrics. Coefficients: Critical Thinking with Comprehension 0.26; Syntax Mastery with Comprehension 0.46 and with Critical Thinking 0.22; Independent Learning with Comprehension 0.13, with Critical Thinking 0.23, and with Syntax Mastery 0.5; TA Replacement with Comprehension 0.24, with Critical Thinking -0.15, and with Syntax Mastery and Independent Learning 0.]
Figure 5: A correlation matrix heatmap visualizing the relationship between different metrics. The blue color indicates positive correlations, while pink represents negative correlations. Correlation coefficients are displayed inside each cell.

4.2.2 Exploring Relationships in Student Attitudes Toward CodeTutor. Figure 5 reveals key relationships among students' attitudes on CodeTutor. The moderate positive correlation between Comprehension and Syntax Mastery suggests that proficiency in one is associated with higher performance in the other. Critical Thinking is slightly positively correlated with Comprehension and Independent Learning but slightly negatively correlated with TA Replacement. Furthermore, Syntax Mastery strongly correlates with Independent Learning, indicating a close relationship between mastering programming syntax and self-directed learning outcomes. In addition, TA Replacement has minimal to no significant correlations with the other variables, suggesting its effects vary independently of these educational aspects.
To further explore the relationships among students' attitudes across questions, we present the results of multiple comparisons across the five questions. Specifically, our results show that respondents' attitudes significantly differ across questions (χ² = 32.99, p < 0.05). Our post-hoc tests (see Figure 6) further reveal that students were significantly less in agreement about CodeTutor's assistance in fostering critical thinking compared to its ability to understand, help in learning syntax, and serve as a replacement for a teaching assistant. Moreover, our findings suggest that respondents were significantly more in agreement with CodeTutor's effectiveness in comprehension than with its ability to improve students' understanding of programming syntax.

[Figure 6 appears here: Dunn's pairwise comparisons across the five questions (n = 48 each); χ²_Kruskal-Wallis(4) = 32.99, p = 1.20e−06, ε²_ordinal = 0.14, CI 95% [0.09, 1.00], n_obs = 240; median ratings of 4.00 for two questions and 3.00 for the remaining three; significant Holm-adjusted pairwise p-values include 0.03, 0.01, 5.32e−04, and 5.34e−07.]
Figure 6: Non-parametric pairwise comparison test (Dunn's test): Differences in agreement levels across different questions. We can see that students predominantly favored CodeTutor for its comprehension and syntax support rather than its ability to foster critical thinking. Additionally, there was a stronger consensus on CodeTutor's proficiency in understanding queries compared to its effectiveness in enhancing programming syntax.
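The omnibus comparison can be sketched in Python as below; the long-format file and column names are hypothetical, and the paper's actual Kruskal-Wallis and Dunn's tests were run with the R packages stats and FSA.

```python
import pandas as pd
from scipy import stats

# Hypothetical long format: one row per (student, question) conversation-level rating (1-5).
ratings = pd.read_csv("attitude_ratings.csv")  # columns: student_id, question, rating

groups = [g["rating"].to_numpy() for _, g in ratings.groupby("question")]
h_stat, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.2e}")
# Post-hoc pairwise comparisons (Dunn's test with Holm adjustment) would follow,
# e.g., via the scikit-posthocs package or, as in the paper, the R package FSA.
```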
We then conducted a linear mixed effects (LME) model to explore time's influence on students' attitudes toward CodeTutor:

QuestionIndicator_{it} = β_0 + b_{0i} + (β_1 + b_{1i}) t + ε_{it}

where β_0 and β_1 are unknown fixed-effect parameters; b_{0i} and b_{1i} are the unknown student-specific random intercept and slope, respectively, which are assumed to have a bivariate normal distribution with mean zero and covariance matrix D; QuestionIndicator_{it} is the response of student i at time t; and ε_{it} is the residual error for student i at time t, with a normal distribution N(0, σ²), which is assumed to be independent of the random effects. From Table 3, we can see that students' attitudes toward CodeTutor show a significant decrease in Comprehension (β = -0.114, p < 0.01), which indicates that students increasingly disagreed with CodeTutor's understanding accuracy over time. Moreover, there is a weakly significant decrease in TA Replacement (β = -0.099, p < 0.1) with increasing time. This shows a slight tendency for students to consider more human TA help over time. Also, students showed no significant change over time in Critical Thinking, Syntax Mastery, and Independent Learning.
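A Python equivalent of this per-question model is sketched below using statsmodels; the data layout and column names are assumptions, and the published analysis fits the model with lme4 in R.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long format: one conversation-level rating per row.
feedback = pd.read_csv("feedback.csv")  # columns: student_id, time, rating (e.g., Comprehension)

# Random intercept and random slope on time for each student, mirroring the
# model definition above (fixed effect: time).
model = smf.mixedlm(
    "rating ~ time",
    data=feedback,
    groups=feedback["student_id"],
    re_formula="~time",
).fit()
print(model.summary())
```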
L@S ’24, July 18–20, 2024, Atlanta, Georgia, GA, USA Lyu et al.

Summary of results of RQ2: In summary, students recognize CodeTutor's ability to understand their queries and assist with programming syntax yet question its capacity to promote critical thinking skills. Additionally, students' confidence in CodeTutor's comprehension abilities decreases over time, with a growing preference for support from human teaching assistants.

4.3 RQ3: Students' Engagement with CodeTutor
In total, we documented 82 conversation sessions¹ with CodeTutor, encompassing a total of 2,567 messages. In these sessions, 415 unique topics were discussed, averaging 5.06 topics per session and 6.19 messages per topic.
¹ In our analysis, a conversation session is a continuous exchange of messages between users and CodeTutor within a specific period, characterized by a coherent topic or purpose.

4.3.1 Message Classification & Interaction Patterns. In total, we collected 2,567 conversational messages exchanged between users and CodeTutor. Of these, 1,288 messages originated from the users, and CodeTutor responded with 1,279 messages.
Table 4 presents categorizations of messages between users and CodeTutor. Each category has a description and an example to illustrate the message type. Categories of messages from both users and CodeTutor include Programming Task inquiries, addressing specific Python programming challenges; Grammar and Syntax questions, focusing on Python's basic grammar or syntax without necessitating runnable programs; General Questions, which are not directly related to Python; and Greetings, initiating or finishing an interaction.
From the users' side, additional categories highlight their engagement with CodeTutor: Modification Requests for alterations to previous answers; Help Ineffective, indicating issues or errors in CodeTutor's provided solutions; Further Information, to elaborate on prior queries; and Debug Requests, for assistance in resolving bugs or errors in code snippets.
CodeTutor's responses are classified into Corrections, which address and amend errors in previous responses, and Explanations, which provide further details on provided solutions or clarify why certain requests cannot be fulfilled.

Table 4: Categorizations of messages, from the users' side and from CodeTutor's side. [Code Snippet] represents a Python code segment. The Percentage column represents the ratio of occurrences of each category to the total number of messages. Note that some categories may only apply to messages sent by either users or CodeTutor, and messages may carry multiple categories.

Programming Task (86.52%): Any questions or answers related to Python programming. Example: "Write a function that prints the nth (argument) prime number."
Grammar & Syntax (14.26%): When a message is related to basic Python grammar or syntax problems; a runnable program is most likely unnecessary. Example: "What does {} do in Python?"
General Question (4.29%): When a message is not directly related to Python. Example: "What is ASCII?"
Greetings (0.62%): When a message is a greeting. Example: "Hello! How can I assist you today?"
Help Ineffective (12.86%): When a user message says the previous answer generated by CodeTutor is wrong or provides error information. Example: "This code still fails."
Debug Request (8.22%): When a user message asks CodeTutor to fix bugs or explain what was wrong in code snippets provided or in previous messages. Example: "Debug this code. [Code Snippet]"
Modification Request (4.48%): When a user requires CodeTutor to change something in its previous answer. Example: "Remove comments."
Further Information (3.97%): When a user message provides more context on their previous input. Example: "All the input strings will be the same length."
Explanation (28.94%): When CodeTutor explains something in previous messages or why it cannot complete the current task from users. Example: "I'm sorry, but I need more information to provide the answers for questions 4 and 6."
Correction (13.95%): When CodeTutor corrects content in its previous answer. Example: "Apologies for the syntax error. Here is the corrected version: [Code Snippet]"

4.3.2 Analysis of Prompt Quality & Correlation with Response Effectiveness. To further examine user interaction patterns with CodeTutor and their implications for its educational value, we analyzed the relationship between prompt quality and response accuracy. This analysis stems from the premise that detailed and precise prompts are likely to improve the AI's understanding of user requirements, thereby potentially raising the standard of its responses.
To do so, we evaluated a corpus of 1,190 prompts, after removing all greeting messages, to assess their quality. Our analysis showed that 37% were deemed good quality; the remaining 63% were identified as poor quality. We defined "good quality" prompts as those providing sufficient detail for CodeTutor to generate an accurate response. In contrast, "poor quality" prompts were those that did not meet this criterion. We categorized the deficiencies in poor-quality prompts into four types: incomplete information (n = 189, 25%), which lacked specific details necessary for CodeTutor to understand the context; lack of clear goals (n = 172, 23%), where the desired outcome was not explicitly stated; over-reliance on CodeTutor (n = 362, 48%), where assignment questions were directly copied and pasted into CodeTutor; and poor structural organization (n = 25, 3%), which exhibited unclear or confusing request structures. Prompts were further labeled as "working" if they elicited an appropriate response from CodeTutor, and "not working" if they failed to do so.
Using a chi-square test, we investigated whether prompt quality and the effectiveness of CodeTutor's responses were independent. Our results showed a significant association (χ² = 144.84, p < 0.001). In other words, clearer and more detailed prompts are associated with responses that are more likely to be effective.
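The independence test above corresponds to a standard chi-square test on a two-by-two contingency table, sketched here in Python with hypothetical label names.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical labels from the manual coding described above.
prompts = pd.read_csv("prompt_labels.csv")  # columns: quality ('good'/'poor'), outcome ('working'/'not working')

table = pd.crosstab(prompts["quality"], prompts["outcome"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}, dof = {dof}")
```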
Summary of results of RQ3: We characterized the messages exchanged between users and CodeTutor. We categorized these interactions into inquiries (e.g., programming tasks, syntax questions) and feedback from users, alongside CodeTutor's responses (corrections and explanations), illustrating a dynamic exchange aimed at facilitating learning. We also found that the clarity and completeness of prompts are significantly correlated with the quality of responses from CodeTutor.

5 DISCUSSION
Our semester-long field study provided insights into how students in introductory computer science courses utilized CodeTutor and its effects on educational outcomes. In short, our results show that 1) students who used CodeTutor showed significant improvements in scores; 2) while CodeTutor was valued for its assistance in comprehension and syntax, students expressed concerns about its capacity to enhance critical thinking skills; 3) skepticism regarding CodeTutor as an alternative to human teaching assistants grew over time; 4) CodeTutor was primarily used for various coding tasks, including syntax comprehension, debugging, and clarifying fundamental concepts; and 5) the effectiveness of CodeTutor's responses was notably higher when prompts were clearer and more detailed. Building on these findings, we discuss the implications for future enhancements and research directions in the rest of this section.

5.1 Towards Enhancing Generative AI Literacy
Our research indicates a positive correlation between the use of Generative AI tools and improved student learning outcomes. However, 63% of student-generated prompts were deemed unsatisfactory, indicating a lack of essential skills to fully exploit Generative AI tools. This finding also suggests the need to promote Generative AI literacy among students. Here, we define Generative AI literacy as the ability to effectively interact with AI tools and to understand how to formulate queries and interpret responses. Our findings suggest that while students can leverage CodeTutor for practical coding assistance and syntax understanding, there is a gap in using these tools to enhance critical thinking skills. We suggest educational programs integrate Generative AI literacy as a core component of their curriculum, teaching students how to use these tools for immediate problem-solving and how to engage with them to promote deeper analytical and critical thinking. This could include workshops on effective query formulation, sessions on interpreting AI responses, and exercises designed to challenge students to critically evaluate the information and solutions offered by AI tools.
Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study L@S ’24, July 18–20, 2024, Atlanta, Georgia, GA, USA

We also propose approaches to integrate HCI tools and principles into LLM-enabled platforms, such as prompt construction templates that provide users with templates or structured forms for crafting queries. They can guide users in formulating more effective and precise questions. Templates could include placeholders for essential details and context, providing the necessary information for the AI to generate accurate responses to users. Furthermore, integrating Critical Thinking Prompts might be particularly effective in stimulating in-depth analytical thinking. For example, the interface could pose follow-up questions encouraging users to assess AI answers' adequacy critically. Questions such as "Does this response fully address your query?" or "What additional information might you need?" may prompt users to engage in a more thorough evaluation of the information provided, fostering a habit of critical reflection and assessment. Another possible approach is Facilitating Collaborative Query Building, which leverages the power of collective intelligence. By designing interfaces that support real-time collaboration among users, individuals can work together to construct and refine queries. We can also use LLMs to evaluate and refine user questions instantly, as they perform well in prompting [59].
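The sketch below illustrates what such a prompt template and critical-thinking follow-up could look like; all names and wording are illustrative assumptions, not features of the current CodeTutor.

```python
# Illustrative sketch of a structured prompt template and critical-thinking
# follow-ups; the field names and question wording are our own assumptions.
PROMPT_TEMPLATE = (
    "Task: {task}\n"
    "What I have tried so far: {attempt}\n"
    "Error message or unexpected output: {observed}\n"
    "What I want to understand (not just the fixed code): {learning_goal}"
)

FOLLOW_UP_QUESTIONS = [
    "Does this response fully address your query?",
    "What additional information might you need?",
    "Can you explain the suggested fix in your own words?",
]

def build_prompt(task: str, attempt: str, observed: str, learning_goal: str) -> str:
    # Fill the structured form so the request carries enough context for the model.
    return PROMPT_TEMPLATE.format(
        task=task, attempt=attempt, observed=observed, learning_goal=learning_goal
    )

def append_follow_ups(assistant_reply: str) -> str:
    # Attach reflection questions to each reply to nudge critical evaluation.
    return assistant_reply + "\n\n" + "\n".join(f"- {q}" for q in FOLLOW_UP_QUESTIONS)
```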

5.2 Turning to the Temporal Dynamics of LLM-Powered Tutoring Tools
The temporal aspect of using CodeTutor in computer science education presents a nuanced perspective on its integration and effectiveness over time. Our analysis reveals a complex relationship between the duration of CodeTutor use and students' attitudes towards it. Specifically, our results show that although students initially found CodeTutor a reliable tool for understanding their queries, their confidence in its accuracy diminished with prolonged use. Additionally, our model uncovers a weakly significant decrease in students' preference for CodeTutor as a TA replacement over time. This trend implies a growing inclination among students to seek human TA support as they progress in their courses, possibly due to the nuanced understanding and personalized feedback that human TAs can offer, which might not be fully replicated by LLMs. However, our study found no significant temporal change in students' attitudes toward CodeTutor's impact on critical thinking, syntax mastery, and independent learning. This stability suggests that while students may question CodeTutor's comprehension abilities and its adequacy as a TA replacement over time, they still recognize its utility in facilitating certain aspects of the learning process, such as mastering syntax and promoting independent study habits.
Collectively, our findings highlight the importance of investigating the temporal dynamics of students' attitudes towards, and use of, LLM-powered tools for learning, and shed light on the need for a balanced approach to integrating LLMs into CS education. While these tools offer great support in specific areas, their limitations become more apparent with extended use. In other words, it is important to complement LLMs with human instruction to address learning objectives, such as critical thinking and problem-solving, which are crucial for computer science education. Furthermore, we argue that educators and developers should work collaboratively to enhance the capabilities of LLM-powered tutoring systems, ensuring they remain effective and relevant over time.

5.3 Alignments of LLMs for Education
Our observations regarding students' utilization of CodeTutor provide insights into their learning approaches and completion of assignments. The exams that prohibit using CodeTutor reflect students' understanding of programming, as they must rely solely on their internal knowledge.

their internal knowledge. Conversely, assignments and lab tasks additional instructional features. Additionally, it is crucial to empha-
that permit using CodeTutor result in higher scores, indicating that size the boundaries of using LLM-powered tools, clarifying what is
students may prioritize completion over deep comprehension [17]. permissible and the consequences of inappropriate usage.
While students employ CodeTutor to fulfill homework require-
ments, they may not perceive it as a tool for a comprehensive
6 LIMITATIONS AND FUTURE WORK
understanding of course materials.
Our results show that nearly half of the low-quality prompts clas- Our study, while providing valuable insights into the use of LLM-
sified as over-reliance were copied and pasted original assignment powered tools in educational settings, has several limitations that
questions into CodeTutor. This suggests that students primarily suggest avenues for further research. First, The current study was
used CodeTutor as a quick-fix solution, neglecting the opportu- conducted on a relatively small scale, limiting the generalizability
nity to engage with the underlying question logic and determine of our findings. Therefore, our future work will conduct larger-
appropriate solutions to the question. As the complexity of assign- scale tests involving more diverse student populations and settings.
ments increased, students’ perceptions of CodeTutor’s ability to Second, regarding the applicability to different levels of coding
understand their queries turned more negative. However, students courses, our work has focused on beginning levels of CS courses.
acknowledge its proficiency in syntax mastery, which reveals a Our findings may not directly translate to intermediate or advanced
gap between their expectations and the tool’s capabilities. Complex programming courses. Furthermore, we relied on GPT-3.5 in this
questions require students to integrate and apply the knowledge study, which may not always provide accurate or contextually ap-
acquired in class [51], challenging the notion that CodeTutor can propriate responses, potentially affecting the quality of tutoring
easily break down questions into manageable components. Addi- provided. Lastly, controlling the experimental environment in a
tionally, CodeTutor’s limitations, such as its training on a predeter- semester-long study, particularly the control group, was challeng-
mined database and inability to handle custom or complex queries, ing, indicating the need for more experimental designs in future
suggest that it is important to simplify questions and structure studies to better understand the factors affecting student learning.
prompts effectively for optimal results.
Furthermore, we argue that students' previous experiences with chatbots, when limited to unstructured, one-line requests (e.g., "help me write a summary") rather than structured learning, may not adequately prepare them to use CodeTutor effectively in a programming context, as evidenced by our finding that nearly 70% of student submissions in our corpus were of poor quality. Students with limited experience interacting with chatbots might also be hesitant to trust tools like CodeTutor fully, potentially affecting their use of and reliance on its outputs. This lack of familiarity could lead them to prefer traditional learning approaches, which foster deeper analytical thinking and minimize dependency on automated assistance.
Design Implications. Our findings shed light on the future implementation and enhancement of CodeTutor in programming courses. The inherent limitations of CodeTutor, which is trained on a general dataset, may necessitate the creation of custom datasets tailored to specific class contexts. Instructors' reflections on the quality of students' assignments make it evident that, while CodeTutor produces impressive results owing to its training on datasets crafted by professional programmers who prioritize efficiency, entry-level classes should instead emphasize human-readable code over complex solutions. One potential solution is to leverage GPT models with the Assistants API [31]. This API enables the development of AI assistants with features such as the Code Interpreter [33], which can execute Python code in a sandboxed environment, and Knowledge Retrieval [34], which allows users to upload documents to extend the assistant's knowledge base. These features align more closely with the requirements of a virtual TA in entry-level programming courses: the Code Interpreter can improve the quality of responses containing code blocks, while Knowledge Retrieval empowers instructors to provide course-specific information. Meanwhile, providing systematic instructions to students can enhance their understanding of how to use the tool effectively while improving its accessibility through additional instructional features. Additionally, it is crucial to emphasize the boundaries of using LLM-powered tools, clarifying what is permissible and the consequences of inappropriate usage.
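As a rough sketch of this direction (not the implementation used in this study), an instructor could configure such an assistant through the OpenAI Python SDK. The instructions, file name, and model choice below are illustrative assumptions, and the parameter names follow the Assistants API beta documented in [31], which may change across API versions:

# Minimal sketch of a course-specific virtual TA using the Assistants API beta.
# The uploaded file, instructions, and model are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Upload a course document so Knowledge Retrieval can ground answers in it.
style_guide = client.files.create(
    file=open("course_style_guide.pdf", "rb"),
    purpose="assistants",
)

assistant = client.beta.assistants.create(
    name="Virtual TA",
    model="gpt-3.5-turbo",
    instructions=(
        "You are a teaching assistant for an introductory programming course. "
        "Prefer simple, human-readable code and explain each step."
    ),
    tools=[{"type": "code_interpreter"}, {"type": "retrieval"}],
    file_ids=[style_guide.id],
)

Student questions would then be routed through threads attached to this assistant: the Code Interpreter tool lets it run and check the code it suggests, while the uploaded document lets it answer course-specific questions in the instructor's preferred style.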
6 LIMITATIONS AND FUTURE WORK
Our study, while providing valuable insights into the use of LLM-powered tools in educational settings, has several limitations that suggest avenues for further research. First, the study was conducted on a relatively small scale, limiting the generalizability of our findings; our future work will therefore conduct larger-scale tests involving more diverse student populations and settings. Second, regarding applicability to different levels of coding courses, our work has focused on beginning-level CS courses, so our findings may not directly translate to intermediate or advanced programming courses. Furthermore, we relied on GPT-3.5 in this study, which may not always provide accurate or contextually appropriate responses, potentially affecting the quality of tutoring provided. Lastly, controlling the experimental environment over a semester-long study, particularly for the control group, was challenging, pointing to the need for more tightly controlled experimental designs in future studies to better understand the factors affecting student learning.

7 CONCLUSION
In this work, we conducted a semester-long between-subjects study with 50 students to examine the ways in which students use an LLM-powered virtual teaching assistant (i.e., CodeTutor) in their introductory-level programming learning. The experimental group using CodeTutor showed significant improvements in final scores over the control group, with first-time users of LLM-powered tools experiencing the most substantial gains. While positive feedback was received on CodeTutor's ability to understand queries and aid in syntax learning, concerns were raised about its effectiveness in cultivating critical thinking skills. Over time, we observed a shift towards preferring human teaching assistant support over CodeTutor, despite its utility in completing programming tasks, understanding syntax, and debugging. Our study also shows the importance of prompt quality in leveraging CodeTutor's effectiveness, indicating that detailed and clear prompts yield more accurate responses. Our findings point to the critical need to embed Generative AI literacy into educational curricula and to promote critical thinking abilities among students. Looking ahead, our research suggests that integrating LLM-powered tools into computer science education requires more tools, resources, and regulations to help students develop Generative AI literacy, as well as customized teaching strategies to bridge the gap between tool capabilities and educational goals. By adjusting expectations and guiding students on effective tool use, educators may harness the full potential of Generative AI to complement traditional teaching methods.

ACKNOWLEDGMENTS
This project is funded by the Studio for Teaching & Learning Innovation Learn, Discover, Innovate Grant, the Faculty Research Grant from William & Mary, and the Microsoft Accelerate Foundation Models Research Award. We thank our participants in this study and our anonymous reviewers for their feedback.
REFERENCES
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023). https://doi.org/10.48550/arXiv.2303.08774
[2] Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. SYNSHINE: improved fixing of syntax errors. IEEE Transactions on Software Engineering 49, 4 (2022), 2169–2181. https://doi.org/10.1109/TSE.2022.3212635
[3] John R Anderson, C Franklin Boyle, and Brian J Reiser. 1985. Intelligent tutoring systems. Science 228, 4698 (1985), 456–462. https://doi.org/10.1126/science.228.4698.456
[4] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48. https://doi.org/10.18637/jss.v067.i01
[5] Peter Brusilovsky et al. 1998. Adaptive educational systems on the world-wide-web: A review of available technologies. In Proceedings of Workshop "WWW-Based Tutoring" at 4th International Conference on Intelligent Tutoring Systems (ITS'98), San Antonio, TX.
[6] Peter Brusilovsky, Elmar Schwarz, and Gerhard Weber. 1996. ELM-ART: An intelligent tutoring system on World Wide Web. In Intelligent Tutoring Systems: Third International Conference, ITS'96, Montréal, Canada, June 12–14, 1996, Proceedings 3. Springer, 261–269. https://doi.org/10.1007/3-540-61327-7_123
[7] Cory J Butz, Shan Hua, and R Brien Maguire. 2006. A web-based bayesian intelligent tutoring system for computer programming. Web Intelligence and Agent Systems: An International Journal 4, 1 (2006), 77–97.
[8] Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P Veldhuizen, et al. 2023. The future landscape of large language models in medicine. Communications Medicine 3, 1 (2023), 141. https://doi.org/10.1038/s43856-023-00370-1
[9] Albert T Corbett, Kenneth R Koedinger, and John R Anderson. 1997. Intelligent tutoring systems. In Handbook of human-computer interaction. Elsevier, 849–874. https://doi.org/10.1016/B978-044481862-1.50103-5
[10] Dorottya Demszky and Jing Liu. 2023. M-Powering Teachers: Natural Language Processing Powered Feedback Improves 1:1 Instruction and Student Outcomes. (2023). https://doi.org/10.1145/3573051.3593379
[11] Paul Denny, Sami Sarsa, Arto Hellas, and Juho Leinonen. 2022. Robosourcing Educational Resources–Leveraging Large Language Models for Learnersourcing. arXiv preprint arXiv:2211.04715 (2022). https://doi.org/10.1145/3501385.3543957
[12] Felix Dobslaw and Peter Bergh. 2023. Experiences with Remote Examination Formats in Light of GPT-4. arXiv preprint arXiv:2305.02198 (2023). https://doi.org/10.48550/arXiv.2305.02198
[13] Gilan M El Saadawi, Eugene Tseytlin, Elizabeth Legowski, Drazen Jukic, Melissa Castine, Jeffrey Fine, Robert Gormley, and Rebecca S Crowley. 2008. A natural language intelligent tutoring system for training pathologists: Implementation and evaluation. Advances in health sciences education 13 (2008), 709–722. https://doi.org/10.1007/s10459-007-9081-3
[14] Mark Elsom-Cook. 1984. Design considerations of an intelligent tutoring system for programming languages. Ph.D. Dissertation. University of Warwick.
[15] GitHub, Inc. 2024. GitHub Copilot. https://github.com/features/copilot. Accessed: 2024-02-11.
[16] Arthur C Graesser, Xiangen Hu, and Robert Sottilare. 2018. Intelligent tutoring systems. In International handbook of the learning sciences. Routledge, 246–255.
[17] Morgan Gustafson. 2022. The Effect of Homework Completion on Students' Academic Performance. Dissertations, Theses, and Projects 662. https://red.mnstate.edu/thesis/662
[18] Yann Hicke, Anmol Agarwal, Qianou Ma, and Paul Denny. 2023. ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. arXiv preprint arXiv:2311.02775 (2023). https://doi.org/10.48550/arXiv.2311.02775
[19] Danial Hooshyar, Rodina Binti Ahmad, Moslem Yousefi, Farrah Dina Yusop, and S-J Horng. 2015. A flowchart-based intelligent tutoring system for improving problem-solving skills of novice programmers. Journal of computer assisted learning 31, 4 (2015), 345–361. https://doi.org/10.1111/jcal.12099
[20] Sajed Jalil, Suzzana Rafi, Thomas D LaToza, Kevin Moran, and Wing Lam. 2023. ChatGPT and software testing education: Promises & perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 4130–4137. https://doi.org/10.1109/ICSTW58534.2023.00078
[21] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences 103 (2023), 102274. https://doi.org/10.1016/j.lindif.2023.102274
[22] James A Kulik and JD Fletcher. 2016. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of educational research 86, 1 (2016), 42–78. https://doi.org/10.3102/0034654315581420
[23] Harsh Kumar, Ilya Musabirov, Mohi Reza, Jiakai Shi, Anastasia Kuzminykh, Joseph Jay Williams, and Michael Liut. 2023. Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception. arXiv preprint arXiv:2310.13712 (2023). https://doi.org/10.48550/arXiv.2310.13712
[24] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing code explanations created by students and large language models. arXiv preprint arXiv:2304.03938 (2023). https://doi.org/10.48550/arXiv.2304.03938
[25] Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A Becker. 2023. Using large language models to enhance programming error messages. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 563–569. https://doi.org/10.1145/3545945.3569770
[26] Mark Liffiton, Brad E Sheese, Jaromir Savelka, and Paul Denny. [n. d.]. CodeHelp: Using large language models with guardrails for scalable support in programming classes. ([n. d.]), 1–11. https://doi.org/10.1145/3631802.3631830
[27] Atharva Mehta, Nipun Gupta, Dhruv Kumar, Pankaj Jalote, et al. 2023. Can ChatGPT Play the Role of a Teaching Assistant in an Introductory Programming Course? arXiv preprint arXiv:2312.07343 (2023). https://doi.org/10.48550/arXiv.2312.07343
[28] Jesse G Meyer, Ryan J Urbanowicz, Patrick CN Martin, Karen O'Connor, Ruowang Li, Pei-Chen Peng, Tiffani J Bright, Nicholas Tatonetti, Kyoung Jae Won, Graciela Gonzalez-Hernandez, et al. 2023. ChatGPT and large language models in academia: opportunities and challenges. BioData Mining 16, 1 (2023), 20. https://doi.org/10.1186/s13040-023-00339-9
[29] Hyacinth S Nwana. 1990. Intelligent tutoring systems: an overview. Artificial Intelligence Review 4, 4 (1990), 251–277. https://doi.org/10.1007/BF00168958
[30] Derek H. Ogle, Jason C. Doll, A. Powell Wheeler, and Alexis Dinno. 2023. FSA: Simple Fisheries Stock Assessment Methods. https://CRAN.R-project.org/package=FSA R package version 0.9.4.
[31] OpenAI. 2024. Assistants Overview - OpenAI API. https://platform.openai.com/docs/assistants/overview. Accessed: 2024-02-11.
[32] OpenAI. 2024. ChatGPT. https://openai.com/chatgpt. Accessed: 2024-02-11.
[33] OpenAI. 2024. Code Interpreter. https://platform.openai.com/docs/assistants/tools/code-interpreter. Accessed: 2024-02-11.
[34] OpenAI. 2024. Knowledge Retrieval. https://platform.openai.com/docs/assistants/tools/knowledge-retrieval. Accessed: 2024-02-11.
[35] Maciej Pankiewicz and Ryan S Baker. 2023. Large Language Models (GPT) for automating feedback on programming assignments. arXiv preprint arXiv:2307.00150 (2023). https://doi.org/10.48550/arXiv.2307.00150
[36] Mike Perkins, Jasper Roe, Darius Postma, James McGaughran, and Don Hickerson. 2023. Detection of GPT-4 generated text in higher education: Combining academic judgement and software to identify generative AI tool misuse. Journal of Academic Ethics (2023), 1–25. https://doi.org/10.1007/s10805-023-09492-6
[37] Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. arXiv preprint arXiv:2302.04662 (2023). https://doi.org/10.48550/arXiv.2302.04662
[38] Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023. Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors. International Journal of Management 21, 2 (2023), 100790. https://doi.org/10.48550/arXiv.2306.17156
[39] Russell A Poldrack, Thomas Lu, and Gašper Beguš. 2023. AI-assisted coding: Experiments with GPT-4. arXiv preprint arXiv:2304.13187 (2023). https://doi.org/10.48550/arXiv.2304.13187
[40] James Prather, Paul Denny, Juho Leinonen, Brett A Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, et al. 2023. The robots are here: Navigating the generative ai revolution in computing education. arXiv preprint arXiv:2310.00658 (2023). https://doi.org/10.1145/3623762.3633499
[41] R Core Team. 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
[42] Steven Ritter, John R Anderson, Kenneth R Koedinger, and Albert Corbett. 2007. Cognitive Tutor: Applied research in mathematics education. Psychonomic bulletin & review 14 (2007), 249–255. https://doi.org/10.3758/BF03194060
[43] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research-Volume 1. 27–43.
[44] Jaromir Savelka, Arav Agarwal, Christopher Bogart, and Majd Sakr. 2023. Large language models (gpt) struggle to answer multiple-choice questions about code. arXiv preprint arXiv:2303.08033 (2023). https://doi.org/10.48550/arXiv.2303.08033
[45] Brad Sheese, Mark Liffiton, Jaromir Savelka, and Paul Denny. 2023. Patterns of Student Help-Seeking When Using a Large Language Model-Powered Programming Assistant. arXiv preprint arXiv:2310.16984 (2023). https://doi.org/10.1145/3636243.3636249
[46] Derek Sleeman and John Seely Brown. 1982. Intelligent tutoring systems. London: Academic Press.
L@S ’24, July 18–20, 2024, Atlanta, Georgia, GA, USA Lyu et al.

[47] Robert A Sottilare, Keith W Brawner, Benjamin S Goldberg, and Heather K Holden. 2012. The generalized intelligent framework for tutoring (GIFT). Orlando, FL: US Army Research Laboratory–Human Research & Engineering Directorate (ARL-HRED) (2012).
[48] Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024). https://doi.org/10.48550/arXiv.2401.05561
[49] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine 29, 8 (2023), 1930–1940. https://doi.org/10.1038/s41591-023-02448-8
[50] David R Thomas. 2006. A general inductive approach for analyzing qualitative evaluation data. American journal of evaluation 27, 2 (2006), 237–246. https://doi.org/10.1177/1098214005283748
[51] Ulrich Trautwein and Olaf Köller. 2003. The relationship between homework and achievement—still much of a mystery. Educational psychology review 15 (2003), 115–145. https://doi.org/10.1023/A:1023460414243
[52] Kurt VanLehn. 2011. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational psychologist 46, 4 (2011), 197–221. https://doi.org/10.1080/00461520.2011.611369
[53] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022). https://doi.org/10.48550/arXiv.2206.07682
[54] Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F Wong, and Lidia S Chao. 2023. A survey on LLM-generated text detection: Necessity, methods, and future directions. arXiv preprint arXiv:2310.14724 (2023). https://doi.org/10.48550/arXiv.2310.14724
[55] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023). https://doi.org/10.48550/arXiv.2303.17564
[56] JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny can't prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21. https://doi.org/10.1145/3544548.3581388
[57] Jiawei Zhou, Yixuan Zhang, Qianni Luo, Andrea G Parker, and Munmun De Choudhury. 2023. Synthetic lies: Understanding ai-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–20. https://doi.org/10.1145/3544548.3581318
[58] Kyrie Zhixuan Zhou, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Ted Underwood, Ece Gumusel, Mengyi Wei, Abhinav Choudhry, and Jinjun Xiong. 2024. "The teachers are confused as well": A Multiple-Stakeholder Ethics Discussion on Large Language Models in Computing Education. arXiv preprint arXiv:2401.12453 (2024). https://doi.org/10.48550/arXiv.2401.12453
[59] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022). https://doi.org/10.48550/arXiv.2211.01910
