
CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes

Mark Liffiton, Illinois Wesleyan University, Bloomington, Illinois, USA
Brad Sheese, Illinois Wesleyan University, Bloomington, Illinois, USA
Jaromir Savelka, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Paul Denny, The University of Auckland, Auckland, New Zealand

arXiv:2308.06921v1 [cs.CY] 14 Aug 2023

ABSTRACT
Computing educators face significant challenges in providing timely support to students, especially in large class settings. Large language models (LLMs) have emerged recently and show great promise for providing on-demand help at a large scale, but there are concerns that students may over-rely on the outputs produced by these models. In this paper, we introduce CodeHelp, a novel LLM-powered tool designed with guardrails to provide on-demand assistance to programming students without directly revealing solutions. We detail the design of the tool, which incorporates a number of useful features for instructors, and elaborate on the pipeline of prompting strategies we use to ensure generated outputs are suitable for students. To evaluate CodeHelp, we deployed it in a first-year computer and data science course with 52 students and collected student interactions over a 12-week period. We examine students' usage patterns and perceptions of the tool, and we report reflections from the course instructor and a series of recommendations for classroom use. Our findings suggest that CodeHelp is well-received by students who especially value its availability and help with resolving errors, and that for instructors it is easy to deploy and complements, rather than replaces, the support that they provide to students.

CCS CONCEPTS
• Social and professional topics → Computer science education; Software engineering education; • Human-centered computing → Interactive systems and tools.

KEYWORDS
Intelligent tutoring systems, Intelligent programming tutors, Programming assistance, Novice programmers, Natural language interfaces, Large language models, Guardrails

1 INTRODUCTION AND MOTIVATION
As student interest in programming continues to grow and class sizes expand, educators face significant challenges in providing effective and timely support to all students. Traditional approaches of offering on-demand expert help do not scale well in very large settings, and not all students feel comfortable approaching an instructor or a teaching assistant for help [13]. Similarly, authoring static hints or responses to commonly encountered issues that can be presented to students needing help is both time intensive and unlikely to be exhaustive. Thus, there is great need for scalable approaches for providing immediate, high-quality support to students who are learning to program.

Large language models (LLMs) have recently garnered considerable interest due to their capabilities for generating human-like text in a wide array of contexts, including computing education [27]. There, LLMs have shown great potential for generating resources such as programming exercises, code explanations and model solutions [11]. Recent work has even shown that LLM-generated explanations of code are perceived as more useful to students than explanations produced by their peers [20]. Thus, the prospect of using LLMs to produce real-time, on-demand help for students appears promising. However, a common concern is that students may rely too heavily on the outputs produced by such models, especially if they can be used to generate solutions directly [1]. Related concerns around student over-reliance on LLM-based tools are common in educational settings [16]. Indeed, when OpenAI recently released the widely publicised GPT-4 model, they showcased the example of a 'socratic' tutor, highlighting how the model could be steered away from revealing solutions directly to the user.¹

In this paper we introduce CodeHelp, an LLM-powered tool for generating real-time help for programming and computer science students. A key contribution of CodeHelp is its use of robust "guardrails" that are specifically designed to not reveal solutions directly while helping students resolve their issues, thus mitigating the over-reliance trap that direct use of LLMs may cause. We describe the design of the CodeHelp tool and elaborate on the LLM prompting strategies that we use to generate outputs that guide students towards a solution without producing answers directly. We also discuss the tool's useful features for instructors, including the ability to observe, summarise, and review how their students engage with it. To explore its potential, we deployed CodeHelp in a first-year computer- and data-science course with 52 students and monitored its usage over a 12-week period. We investigate when and how frequently students engaged with CodeHelp, what types of help they request, and how useful they found the tool. To date, there has been significant interest in the computing education literature focusing on the accuracy of LLMs, the types of resources they can generate, and comparative analyses involving historical student data [11].

¹ https://openai.com/research/gpt-4

To our knowledge, this work represents the first evaluation of an always-available LLM-powered teaching assistant with guardrails tailored for computer science education. We found that CodeHelp is well-received by students, it is easy and inexpensive to deploy, and most importantly, it appears to effectively complement and expand on the support students receive from course instructors and teaching assistants (TAs).

2 RELATED WORK
Providing effective automated assistance to novice programmers has been a longstanding research problem. Considerable attention has been devoted to the development and evaluation of so-called intelligent tutoring systems for programming, sometimes referred to as intelligent programming tutors (IPT). Such systems vary greatly and contain a large range of supplementary features [8]. Most of the work has been devoted to various approaches for the generation of effective hints [21, 22] and feedback [18]. The primary difference between CodeHelp and previous work in this area is that CodeHelp is able to respond to a far wider range of requests and requires little or no configuration or setup for any specific class context due to its underlying use of LLMs. Prior to the development and use of LLMs, similar tools had to rely on various rule-based and machine learning-based natural language processing techniques that were much more specialized and, hence, brittle. For example, they could only support a single programming language or type of support request. CodeHelp supports any programming language with sufficient coverage in the underlying LLM's training set. In particular, programming languages that are commonly used in computing education are covered very well. CodeHelp can also respond effectively to a wide variety of request types.

Chatbots provide a convenient interaction experience and have previously been deployed as intelligent assistants in programming education contexts. For example, Carreira et al. developed Pyo, a chatbot designed to help novice programmers in online courses by providing definitions of concepts, guiding them through errors, and assisting with exercises [4]. Although the goal of Pyo is very similar to that of CodeHelp, a notable distinction is that Pyo is rule-based with predetermined topics and conversation flows, while CodeHelp is far more flexible. In similar work, Konecki et al. proposed a rule-based intelligent assistant for programming education aiming to increase engagement, motivation and learning time [19]. Although the primary focus of CodeHelp is to assist students in resolving their issues when programming, we expect it may influence engagement and motivation as well.

Python-Bot [24] and RevBot [25] are examples of AI-based systems that help students understand Python syntax and practice past exam questions. Here, the focus is not on resolving issues, as with CodeHelp, but rather on helping students understand particular topics and testing their knowledge. Duckbot is another chatbot designed to enhance help-seeking between students and teaching staff in programming tutorials [29]. Walden et al. [34] developed a chatbot for teaching secure programming in PHP. Unlike many existing chatbot tools that have a narrow focus, CodeHelp leverages the power of LLMs to provide support across a wide variety of contexts involving various programming languages.

LLMs have been shown to exhibit remarkable performance on a broad range of tasks, including code generation [6]. Finnie-Ansley et al. found that Codex (GitHub Copilot) outperforms typical students in CS1 programming exams [12]. Similarly, Savelka et al. found that GPT-4 comfortably passes diverse types of assessments from introductory and intermediate Python programming classes at the post-secondary education level [31]. Denny et al. evaluated Copilot on 166 CS1 coding problems and found that it successfully solves around half of these problems on its very first attempt, and that it solves 60% of the remaining problems if the problem description is reformulated appropriately [9]. Tian et al. evaluated ChatGPT as a programming assistant and found that it successfully handles typical programming challenges [33]. LLMs have also been applied to other computing education tasks, such as writing tests [5, 15], and helping novices learn how to craft effective prompts [10]. Moreover, LLMs have been employed to generate example explanations as scaffolding to help students learn how to understand and explain code themselves [20] and to generate programming exercises and code explanations [30]. This prior work demonstrates the capabilities and the flexibility of the LLMs that power CodeHelp.

Despite their impressive performance at many tasks, LLMs may not be as effective as human tutors in some domains. For instance, LLMs may struggle with certain types of programming multiple-choice questions [32] or certain types of coding exercises [31]. An empirical evaluation of GitHub Copilot's code suggestions revealed limitations in generating reliable code [23]. Pardos and Bhandari [26] compared learning gains from hints generated by LLMs and human tutors, finding that although both led to positive learning gains, human-generated hints were superior. They also found that only 70% of ChatGPT-generated hints were usable. Our vision for CodeHelp is that it will serve to augment existing instruction, providing students with another convenient and accessible avenue to seek support, rather than replacing human instructors or TAs.

Two recent studies in the computing education literature provide excellent motivation for our work. Both studies highlight the pressing need for a tool that provides appropriate guardrails when generating responses to students' requests. The first study, by Kazemitabaar et al., analyses student use of their Coding Steps tool [17]. Coding Steps integrates an AI code generator into the user interface of an online programming tool. When a student uses this code generator, they provide a natural language prompt which is packaged together with their existing code and six static examples and sent to the OpenAI Codex API. The response from the API is then automatically inserted for the student into the code editor. In their study, where students tackled 45 Python programming tasks over ten 90-minute sessions, AI-generated code was submitted by students without any modification 49% of the time. This heavy use of the code generator raises concerns around student over-reliance which has been identified as a key challenge for educators [1, 3, 7, 28]. The second study that is particularly pertinent to our work is the recent paper by Hellas et al. exploring responses generated by Codex and GPT-3.5 to 150 student help requests from a historical dataset [14]. The data had previously been collected via a platform that allowed students to click a 'Request help' button when their code did not pass automated tests. This added their request to a queue that was monitored by a teacher who could respond manually. When assessing the GPT-3.5 model, they found that many of the generated responses were accurate and that 99% of the responses contained source code.

Interestingly, the authors characterise the language model as an 'unreliable tutor' that has a 'penchant for blurting out model solutions even when you directly ask them not to'. Again, this work emphasises the need for tools that can provide assistance to students without immediately revealing answers.

Our work differs from these recent studies in several key ways. Our primary contribution is the explicit design of appropriate guardrails to avoid student over-reliance on model-generated code. Like Kazemitabaar et al. [17], we deployed our tool in the classroom; however, our evaluation ran for 12 weeks, and we explore how students interact with it outside of scheduled class sessions. In the dataset used by Hellas et al. [14], students infrequently used the 'Request help' button likely due to the fact that requests were added to a queue and responded to manually by a teacher. In our work, students receive immediate feedback from CodeHelp at any time of the day or night.

3 CODEHELP DESIGN AND IMPLEMENTATION
We designed CodeHelp to augment and complement the learning support students receive from instructors and teaching assistants. We aimed to provide a tool in which a student could 1) request help with issues they face in programming activities and 2) immediately receive a helpful response that provides guidance and explanation without providing a complete solution. To accomplish this, we created CodeHelp with a simple, clear interface for students (Sec. 3.1); developed a workflow of multiple LLM prompts to generate the desired responses, with guardrails, from a student's input (Sec. 3.2); and implemented features specifically for instructors to manage and observe their students' usage (Sec. 3.3). For broad accessibility, CodeHelp is implemented as a web application; it is accessible at https://codehelp.app/.

3.1 Student Interfaces
CodeHelp's student interfaces are simple, with minimal choices and clear guidance. Students accessing CodeHelp are brought directly to the Help Request form, shown in Figure 1. We opted for a structured input, organizing it into several specific fields rather than having a single free-form text input. This both provides guidance to students about what information is typically needed for an effective query and gives more context and structure to the prompt that is ultimately fed to an LLM, which increases the chances of supporting the student successfully. Moreover, the structured input provides students an opportunity to practice asking technical questions, providing the necessary relevant context.

Figure 1: The Help Request form (text areas have been shrunk here to save space). The four separate inputs (language, code, error, and issue) and connected guidance text help students structure their request and encourage good practices when requesting support.

Students are asked to provide:
• The programming language in which they are working. The instructor can set a class-wide initial default, and the form then defaults to each student's most recently selected language.
• The relevant snippet of code. This is optional, as not all queries reference existing code.
• The error message. This is optional as well. If an error message is provided, the underlying LLM is prompted to explain the error message to the student.
• The question or description of the issue with which the student needs help.
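To make the structured input concrete, the sketch below shows one way the four fields could be bundled into a single request object before being inserted into the LLM prompts described in Section 3.2. The class and field names are illustrative assumptions, not CodeHelp's actual implementation; the example values are taken from a student query shown later in this paper.

# Illustrative only: a minimal sketch of the four structured inputs described above.
# The class and field names are hypothetical, not CodeHelp's actual data model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HelpRequest:
    language: str                  # e.g., "Python"; defaults to the class-wide or last-used language
    code: Optional[str] = None     # relevant snippet of code (optional)
    error: Optional[str] = None    # error message, if any (optional)
    issue: str = ""                # the question or description of the issue

# Example of a query a student might submit:
request = HelpRequest(
    language="Python",
    code="def character_remover(word, character):\n    word = word.remove(character)\n    return word",
    error="AttributeError",
    issue="I want this function to print the word, excluding the character.",
)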

After submitting a request for help, the student is brought to the response view, an example of which is shown in Figure 2. This view displays the query (for reference) and the generated response. Because there is a chance the LLM may generate an incorrect or confusing answer (discussed further in Section 4), a warning reminder is displayed prominently above every response. A simple feedback form allows the student to note whether the response was helpful. The query, response, and any feedback are stored for the student's future reference and made available to the instructor.

Figure 2: The Response View. Students are shown the details of their request followed by the system's response. A prominent warning reminds students that the response may be incorrect. A simple feedback form allows the students to indicate whether the answer was helpful.

In practice, students do not always provide sufficient information or context to provide accurate assistance. CodeHelp attempts to determine whether each request is lacking in this way, and if so, it presents the student with a request for clarification as shown in Figure 3. The clarification request attempts to help the student identify what additional information is needed. The determination and clarification request are generated by an LLM as well (described in Section 3.2), and because it could be incorrect, the student is also given a response to their request as written. This is mostly done to prevent students becoming stuck in a series of clarification requests without receiving any support. When a clarification is requested, the system describes the main response as an "attempt" at a response to indicate to the student that it may be less accurate given the missing information.

Figure 3: Example response with a request for clarification (the yellow panel in the middle). The retry button takes the student to a help request form pre-filled with the same query.

3.2 Generating Responses
We designed CodeHelp to generate responses to student requests that are similar to those of a human tutor or instructor helping a student in a one-on-one session. Specifically, our goals for the responses were:
• Provide explanations and guidance to support the student in their learning.
• Never include complete solutions that the student can copy without thinking or learning.
• Identify incomplete or ambiguous queries and prompt the student for additional information.
• Only respond to questions relevant to the course (to prevent abuse of the tool as unrestricted access to an LLM).

In CodeHelp, we achieve these goals via careful design of multiple prompts for the LLMs generating responses. The LLMs used in CodeHelp operate by repeatedly predicting the next word in a sequence, and so they are commonly used by providing a text prompt from which the LLM generates a completion, i.e., a sequence of words predicted to follow the prompt. LLMs are limited in the number and complexity of instructions they can accurately follow in a single prompt and completion, and we found that current LLMs could not consistently achieve all of the desired goals with a single prompt and its completion. Therefore, the current design of CodeHelp employs three separate prompts. The response workflow using these prompts is shown in Figure 4.

A student's request for help (query) is included in a "sufficiency check" prompt and in a prompt for generating the main response. Because we want the system to provide its main response even in cases when the query is determined to be insufficient as written, CodeHelp generates the sufficiency check in parallel with the main response. If the sufficiency check determines clarification is needed, we display the clarification request above the main response (Figure 3); otherwise, only the main response is shown. From the "main response" prompt, two different completions are generated and scored for quality (described below). The higher-scoring completion is kept and checked for the presence of code blocks, and a third prompt is used to remove them if found.

Sufficiency Check. To check for insufficient or incomplete queries, the student's query is included in a prompt with instructions that explain the context, describe the meaning of each field in the student's input, and request an assessment of sufficiency. The full prompt is shown in Figure 5. To improve the accuracy of the LLM's response, we include instructions in the prompt for the LLM to summarize the request and state its reasoning before generating the final determination. This is a specific instance of a technique generally referred to as "chain of thought prompting" (CoT), which has been found to improve the accuracy of LLM responses in various contexts [35].

Main Response. Similar to the sufficiency check, the main prompt, shown in Figure 6, inserts the individual fields of a student's query into instructions explaining the system context and meaning of each field. As one part of preventing solution code in the response, the system modifies the student's provided issue to append, "Please do not write any example code in your response." Additionally, if the instructor has specified any keywords they want the LLM to avoid for the current class (discussed in Section 3.3), the prompt includes text listing those.

Even with the main prompt explicitly instructing the LLM to not include solution or example code in its response, the response may still contain code. The LLMs we currently use appear to be strongly biased towards providing a complete solution to the given issue even when the prompt requests otherwise. Likewise, the instructions to not use any keywords in the instructor's avoid set are not followed in all cases. Therefore, CodeHelp generates two different completions for the main response, scores them based on whether they include a code block or any of the keywords in the instructor's avoid set, and takes the better of the two.

Code Removal. In cases where the highest-scoring response includes a code block, CodeHelp uses a third prompt (Figure 7) to clean up the response and remove the code. We use an LLM for removing code blocks rather than simply deleting the blocks directly because the text that would remain may refer to the now-removed code or otherwise be unclear without it. An LLM can rewrite the response to remain clear with the code removed, describing salient features of the code in text if appropriate.
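To summarise the workflow described above in code form, the sketch below shows one way the orchestration could be written: the sufficiency check and two main-response completions are requested concurrently, the better main completion is selected by a simple score that penalises code blocks and avoid-set keywords, and a code-removal pass is applied only if code slipped through. This is an illustrative sketch under stated assumptions, not CodeHelp's actual source; complete() and the three prompt-building helpers are hypothetical stand-ins for calls to the OpenAI API with the prompts shown in Figures 5, 6, and 7.

# Illustrative sketch of the three-prompt response workflow (not CodeHelp's actual source).
import asyncio

async def complete(prompt: str) -> str:
    """Hypothetical wrapper around an OpenAI completion call (e.g., gpt-3.5-turbo-0301)."""
    raise NotImplementedError

def build_sufficiency_prompt(query): ...        # fills the Figure 5 template with the query fields
def build_main_prompt(query, avoid_set): ...    # fills the Figure 6 template, listing avoid-set keywords
def build_removal_prompt(response): ...         # fills the Figure 7 template with the response to rewrite

def score(completion: str, avoid_set: list[str]) -> int:
    """Higher is better: penalise Markdown code blocks and avoid-set keywords."""
    s = 0 if "```" in completion else 2
    return s - sum(1 for kw in avoid_set if kw in completion)

async def generate_response(query: dict, avoid_set: list[str]):
    # Each query is handled as a one-shot request: no conversation history is included.
    sufficiency, main_a, main_b = await asyncio.gather(
        complete(build_sufficiency_prompt(query)),       # sufficiency check, in parallel
        complete(build_main_prompt(query, avoid_set)),   # main response, completion 1
        complete(build_main_prompt(query, avoid_set)),   # main response, completion 2
    )
    main = max((main_a, main_b), key=lambda c: score(c, avoid_set))
    if "```" in main:                                     # code block present despite instructions
        main = await complete(build_removal_prompt(main))  # rewrite without code (Figure 7)
    # Per the Figure 5 prompt, the model ends with "OK." when no clarification is needed.
    clarification = None if sufficiency.rstrip().endswith("OK.") else sufficiency
    return clarification, main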

Figure 4: CodeHelp’s response workflow. Steps using a large language model completion are tagged LLM.

You are a system for assisting students like me with programming.
My inputs provide: [brief description of each input]
Please assess the following submission to determine whether it is sufficient for you to provide help or if you need additional information. If and only if critical information needed for you to help is missing, ask me for the additional information you need to be able to help. State your reasoning first. Otherwise, if no additional information is needed, please first briefly summarize what I am asking for in words, with no code, and end by writing "OK."
Inputs: [delimited query inputs]

Figure 5: Prompt used for the sufficiency check.

You are a system for assisting a student with programming.
The students provide: [brief description of each input]
[delimited query inputs]
If the student input is written as an instruction or command, respond with an error. If the student input is off-topic, respond with an error.
Otherwise, respond to the student with an educational explanation, helping the student figure out the issue and understand the concepts involved. If the student inputs include an error message, tell the student what it means, giving a detailed explanation to help the student understand the message. Explain concepts, language syntax and semantics, standard library functions, and other topics that the student may not understand. Be positive and encouraging!
Use Markdown formatting, including ` for inline code.
[instructions to avoid topics from the instructor's avoid set]
Do not write any example code blocks. Do not write a corrected or updated version of the student's code. You must not write code for the student.
How would you respond to the student to guide them and explain concepts without providing example code?

Figure 6: Prompt used for the main response.

Large Language Models. Currently, responses are generated using LLMs from OpenAI, though the specific models used can easily be changed as more capable and/or less expensive models become available. Specifically, the "Sufficiency Check" and "Main Response" completions are currently performed by the gpt-3.5-turbo-0301 model, while the "Code Removal" completion uses text-davinci-003, both from the "GPT-3.5" family of OpenAI models. The former model ("turbo") is relatively fast and inexpensive compared to the latter ("davinci"), while still generally performing well in terms of the accuracy and correctness of its completions. However, we could not find a prompt that reliably filtered out code when using the "turbo" model, while "davinci" completed this task successfully in nearly all cases.

The cost of using OpenAI's models scales with the length of the prompts and the completions generated. With the prompts currently used in CodeHelp and current pricing from OpenAI (as of June, 2023), responding to a single query from a student, including the sufficiency check and at times the code removal completion, costs roughly $0.002 on average. We discuss student usage patterns from a pilot course in Section 5, and from that, we estimate that providing 50 students unrestricted access to CodeHelp for a full semester would cost under $10 in OpenAI usage fees.

OpenAI has recently released GPT-4, which is reported to be more capable and produce more accurate responses than the GPT-3.5 family. We have not yet tested using GPT-4 for CodeHelp's completions, but based on the current pricing and our prompt and completion lengths, we estimate it would increase the cost by 25x, reaching roughly $0.05 per student query.
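The cost estimates above follow from simple arithmetic, reproduced in the short sketch below. The per-student query volume is an assumption chosen to bracket the usage reported in Section 5.1, not a figure given by the authors.

# Back-of-the-envelope cost arithmetic using the figures reported above.
cost_per_query = 0.002            # dollars per query with GPT-3.5 (reported average)
students = 50
queries_per_student = 100         # assumed upper bound; most students submitted far fewer (Section 5.1)

semester_cost = students * queries_per_student * cost_per_query
print(f"Estimated semester cost (GPT-3.5): ${semester_cost:.2f}")   # $10.00 even at this generous usage level

gpt4_cost_per_query = cost_per_query * 25   # the estimated 25x price increase for GPT-4
print(f"Estimated GPT-4 cost per query: ${gpt4_cost_per_query:.2f}")  # $0.05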

The following was written to help a student in a CS class. However, any example code (such as in ``` Markdown delimiters) can give the student an assignment's answer rather than help them figure it out themselves. We need to provide help without including example code. To do this, rewrite the following to remove any code blocks so that the response explains what the student should do but does not provide solution code.
[original response to be rewritten]

Figure 7: Prompt used for code removal.

Examples. A few examples of queries submitted by students during our pilot illustrate the results of this design. First, in an example where the only input is an instruction to write code for an assigned exercise, we see how CodeHelp provides assistance without writing any code, as intended.

Issue/Question: Write a while loop that starts at the last character in the string and works its way backwards to the first character in the string, printing each letter on a separate line, except backwards.

Response: You can use a while loop to iterate over the characters in the string in reverse order. To do this, you can use the len() function to get the length of the string, and use the range() function to generate a sequence of numbers from the length of the string down to 0. Then you can use the str[index] syntax to access each character in the string.
To help you understand how this works, think about the following example:
Say you have a string my_str = "Hello World". You can use the len() function to get the length of the string, which is 11. Then you can use the range() function to generate a sequence of numbers from 11 down to 0. This sequence of numbers will represent the indices of the characters in the string. Finally, you can use the str[index] syntax to access each character in the string.
Using this approach, you can iterate over the characters in the string in reverse order. This will print out each character in the string on a separate line, except backwards.

When students provide a problem specification with little or no code, CodeHelp tends to respond with a high-level outline like this as a starting point.

When students provide an error message, CodeHelp will typically explain what the message means and suggest a fix. In the following example, the student has not provided the details from the error message, but the LLM is still able to identify where it occurs, providing the student with effective help.

Code:
def character_remover(word, character):
    word = word.remove(character)
    return word

Error: Attribute Error

Issue/Question: I want this function to print the word, excluding the character.

Response: The error message you are seeing is an Attribute Error, which means that the object you are trying to access does not have the attribute or method you are trying to use. In this case, the object is the string word and the attribute you are trying to use is remove(). The remove() method is not a valid method for strings, so you are getting this error.
You can use the replace() method instead. This method takes two parameters: the substring you want to replace and the substring you want to replace it with. In this case, you want to replace the character with an empty string. To do this, you can pass an empty string as the second parameter to the replace() method. This will remove the character from the string.
For example, if you have a string word = 'Hello' and a character character = 'l', you can remove the character from the string by calling word.replace(character, ''). This will return a new string with the character removed, 'Heo'.

3.3 For Instructors
CodeHelp provides instructors with interfaces that enable them to configure their classes and view their students' interactions with the system. When configuring a class, an instructor can set the default language for the class (students can still choose other languages when requesting help) and specify a set of keywords that should be avoided in any response to the student, aka an "avoid set." The avoid set is important because LLMs, with no context about the course for which help is being requested, tend to generate responses using common programming "best practices," and these practices can be too advanced or otherwise inappropriate for a given course context. For example, in an introductory programming course, an instructor may want students to practice summing the values in a list using a for loop and an accumulator variable, while an LLM might suggest a student use the sum() function to solve such a problem.

Instructors may access their students' queries in the "Instructor View" (Figure 8). This view provides a list of the users in their class with query counts (total and within the past week) and a list of all the student queries. The list of queries shows salient details of each query (with full text for any field appearing when hovering the cursor over it), and any row can be selected to take the instructor to the response view for that query. The list of queries can be filtered to show those from a selected user, and it is searchable (full text) and sortable. Instructors can also download their class data as CSV files.

CodeHelp integrates with learning management systems (LMSes) like Moodle or Canvas that support LTI (Learning Tools Interoperability). With a small amount of setup, an instructor can provide their students access to CodeHelp via a simple link in their course on the LMS. Via this link, students may access CodeHelp and be automatically authenticated without having to create, manage, or use a separate login. Instructors and TAs are identified automatically by LTI, so they have access to the instructor interfaces in CodeHelp with no additional work. They can then configure their course for student use and monitor their students' queries and the responses they are receiving.
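To make the avoid-set rationale described earlier in this section concrete, the snippet below contrasts the accumulator-style solution an introductory instructor might expect with the built-in an LLM tends to suggest; adding "sum" to the class's avoid set steers responses toward the former. The code is a simple illustration and is not taken from CodeHelp itself.

# The pattern an introductory course may want students to practice:
values = [3, 1, 4, 1, 5]
total = 0
for v in values:        # accumulator loop, reinforcing iteration and assignment
    total += v
print(total)            # 14

# The shortcut an LLM tends to suggest, which an instructor can discourage by
# adding "sum" to the course's avoid set:
print(sum(values))      # 14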

Figure 8: An instructor’s view of student help requests. The full contents of each field are displayed in a tooltip when the user
hovers a mouse pointer over it. Note that real usernames have been replaced with pseudonyms.

4 LIMITATIONS AND RISKS
CodeHelp is subject to many of the known limitations and risks of using LLMs. In particular, completions can be factually incorrect and can include harmful biases. The problem of inaccuracies in the LLM responses (sometimes called "hallucination" or "confabulation") is present in CodeHelp with the models it is currently using. Sometimes, the response contains one or more false statements, and this may confuse or mislead the user. Users are sensitised to this issue via the prominent notice above each response saying "Remember: It will not always be correct!" In our experience, when inaccuracies did occur, they were often in a particular detail of the response, which still gave correct high-level guidance or pointed the user in the right direction. In our and our students' experiences, the rate of inaccuracies is low enough for the tool to still be valuable and worth the students' time, and as models improve, the accuracy will improve.

LLMs can learn harmful biases such as gender or racial stereotypes from their training data, which can then be reflected in the completions they generate. This is a well-known and heavily studied issue in language model research [36], and it has been an important issue to the computing education community as well [1]. While the models used by CodeHelp have been specifically trained and improved by OpenAI to reduce these biases, some still exist [37]. These models generally do not make offensive statements unless one actively crafts a prompt to elicit one, but for example they might respond in a way that implicitly reflects a common stereotype. This is highly unlikely to occur in the context of requesting help on a specific programming issue, but the possibility exists.

The above issues apply to most LLM-based tools, and the likelihood of an LLM's response being incorrect, harmful, off-topic, or otherwise "off the rails" increases with additional rounds of user input and model response. Therefore, by design, every query to CodeHelp is a one-shot request, independent of any others and with no possibility for follow-up or dialogue. This limits the usefulness of the system, as asking a follow-up question or requesting additional information in the context of an initial response could be very helpful, but the one-shot limitation is imposed to mitigate many of the risks of using LLMs. Users can submit revised queries with additional information or questions informed by an earlier response if they choose to.

5 EXPERIENCES AND RESULTS
We used CodeHelp in two sections of an undergraduate introductory-level computer- and data-science course taught by an author of this paper in the Spring semester of 2023. Fifty-two students completed the course. Of those students, our analyses include data from 49 who used CodeHelp at least once during the semester, and data from 45 who completed a survey about using CodeHelp at the end of the semester. The course is designed to serve a broad audience and attracts students from across the institution who take the course to meet general education requirements or to meet requirements for data-analytic or data-science related credentials.

The course provides twelve weeks of instruction in Python foundations and three weeks of instruction in Pandas² and Seaborn³. The format of the course is "flipped," with students responsible for reading course materials prior to class, while class time is spent working through assignments on lab computers. The instructor and a TA assist students and provide instruction/support as needed. CodeHelp was introduced in the fourth week of the semester with a quick demonstration in class. During class, students were encouraged to use CodeHelp for assistance first before asking the instructor or TA for help, but they were otherwise free to make their own choices about when and how to use it.

5.1 Student Use
Even with no firm requirement to do so, students used CodeHelp consistently throughout the semester. Figure 9 shows that roughly half of the class used CodeHelp each week, and we saw that roughly 70% of the students used CodeHelp in four or more different weeks. We also observed a wide range of intensity of use between students. Roughly 80% of the class submitted 10 or more queries (indicating more than initial trial usage), roughly 50% submitted 30 or more, and seven of the 49 submitted over 100 queries, including one student with more than 600 queries. The heatmap in Figure 10 shows the usage concentrated during two separate class sessions (1 and 2pm on Mon/Wed/Fri) and before assignments were due on Saturday. Otherwise, there was some use across nearly all hours, including many when no instructor or TA would have been available.

² Pandas. Available at: https://pandas.pydata.org/ [accessed 2023-06-20]
³ Seaborn. Available at: https://seaborn.pydata.org/ [accessed 2023-06-20]

[Figures 9 and 10 (charts) appear here.]

Figure 9: Percentage of the class (y axis) using CodeHelp each week (x axis) across the semester [7 = spring break]. Note that the y axis scale only extends to 70. The figure shows consistent use across the whole semester.

Figure 10: Queries by hour (y axis) and day (x axis) over the whole term. The time span between 4 and 7 AM is not shown due to no activity. The high activity blocks on Mon, Wed, and Fri correspond to the times students were in the classroom. The higher activity on Saturday evening is prior to a recurring deadline for weekly assignments.

Overall, the continuing, consistent usage strongly suggests that the students generally found the tool beneficial.

5.2 Student Survey
At the end of the course we distributed an online survey to understand students' perceptions of CodeHelp. Taking the survey was optional, but students did receive extra-credit for completing it. A total of 45 students (87 percent of the class) completed the survey. Table 1 shows the results for a selection of questions about students' perceptions of the tool and its value to them. Overall, students found it valuable, and a large majority (95%) were interested in using it in future CS courses.

For additional detail, the survey included the following open-response questions, which were designed to elicit both positive and negative responses:
• Q1: What did you find most beneficial about using CodeHelp?
• Q2: Do you think there is anything negative about students using CodeHelp?

In general, responses were relatively short but tended to be longer for the first question on beneficial aspects (word count; M = 16.2, SD = 10.3) compared to the second question on negative aspects (M = 12.0, SD = 13.0). To understand the patterns present in the responses, we conducted a thematic analysis in which interesting features of each response were extracted as codes and then collated into higher-level themes [2]. We identified five prominent themes in the response to Q1, highlighted in bold in the text that follows.

The most prominent theme by a clear margin, appearing in 19 of the student responses, was around "availability" and specifically that students valued the convenience of being able to ask for assistance outside of the classroom when TAs and the professor were busy or unavailable. Responses representative of this theme include: "it was a tool that was always there when I needed it, I didn't have to go to office or TA hours or email" and "the ability to get help without talking to professor or TA".

Many students (11) explicitly appreciated that CodeHelp could aid them in "fixing errors", which was the next most common theme. This included getting help to understand error messages and producing explanations of errors. The following are two examples of typical quotes supporting this theme: "it was helpful in understanding some of the error message we hadn't learned about in class" and "it really helps with trouble shooting when it comes to semantic errors".

One interesting theme that emerged (10 students), distinct from the "availability" of CodeHelp, was that it supported "independence" by enabling students to make progress without the need to seek external help when they were stuck. This included providing initial support to students who had difficulty starting work, nudging students in the right direction when they were close to a solution, and helping students who were anxious to ask for help without the fear of embarrassment. Comments that supported this theme included "It was nice to have a source to ask when I was unsure how to begin coding", "it helped lead me in the right direction if I almost had the right code" and "I felt like I could ask it any question, even dumb ones, which I often did to avoid embarrassing myself in front of the Professor or TA".

The remaining themes, which were less common, focused on the "speed" (6) with which students could make progress or obtain feedback and the use of CodeHelp to assist with "learning/understanding" (7). Typical comments aligning with these themes included "Helped me work faster" and "it helped understand the code I was writing sometimes". Students also appreciated that CodeHelp would provide guidance rather than directly revealing the solution, as exemplified by the comment "It gave us help on the answer not just the answer itself". Overall, the responses to Q1 tell a story that CodeHelp was seen as a useful resource for obtaining rapid assistance and a complementary tool to traditional TA and instructor support.

As to the concerns (Q2), we also identified five prominent themes, again highlighted in bold. Around half of the students (24) stated that they had "no concerns".

Table 1: Results for selected questions in the student survey (𝑛 = 45 of 52 students). Rows may not sum to 100% due to rounding.

                                                          Strongly Agree   Agree   Disagree   Strongly Disagree
CodeHelp helped me complete my work successfully.               9%          71%      18%              2%
CodeHelp helped me learn the course material.                   7%          56%      33%              4%
If I took more Computer Science courses, I would like
to be able to use CodeHelp in those classes.                   31%          64%       4%              0%

Some of the students would even suggest the use of the tool should have been more extensive: "We should even use it during quizzes". Others explained why they did not have any concerns: "No, absolutely not, especially considering it never handed me the answer on a silver platter."

The most prominent theme as to the concerns was the perceived "difficulty" in using CodeHelp. Multiple students (14) stated that the tool is difficult to use when the problem is not understood: "sometimes i didnt know exactly what to ask.. but i usually got there eventually" and "I did not like how hard it was to ask something I do not understand.". Several students also reported receiving answers that were difficult to utilize or not helpful: "There were many times that CodeHelp misunderstood my question and gave me advice which confused me even more." and "Sometimes it gives really strange responses that are not related to the problem".

Several students (5) reported that sometimes an answer provided by CodeHelp contained elements that were "not covered" in class and, hence, the students were not expected to have knowledge of those elements. Responses representative of this theme included: "Sometimes it tells you to do code that we haven't learned in class" and "I would run into the issue where it wanted me to use concepts that I haven't been taught yet. This is both and good and a bad thing because it can introduce students to resources, but also confuse them.".

A small number of students' responses (3) hinted at using CodeHelp without investing proper effort in solving the problem independently (i.e., "over-reliance"). The responses suggest that the students were aware this could have negative effects on their learning, yet they would still engage in that practice: "think some people could complete the code without help and by going directly to CodeHelp their limiting themselves" and "I do think that sometimes I can get to dependent on CodeHelp and I have to scale it back a bit.".

Several responses (3) stated that CodeHelp is "not human" and, hence, its capabilities are in some way limited as compared to the assistance provided by an instructor or a TA. However, the responses do not go into much detail as to why this might be the case: "less personal" and "No, but it cannot be a substitute for a real person." One of the responses explained the preference for human assistance in terms of the difficulty (see above) of formulating the proper question for CodeHelp: "no but personally I prefer to ask a real person because its difficult to phrase you questions in a way that won't confuse CodeHelp".

5.3 Instructor Reflections
After the conclusion of the semester, the instructor, who is also one of the authors, reflected on what did and did not work:

CodeHelp was easy to introduce to the class. As an instructional resource, its utility is immediately and obviously apparent. Students required little convincing to give it a try. While in class, we requested that students ask CodeHelp for help before seeking help from the instructor or teaching assistant. We did not enforce this as a rule but encouraged it throughout the semester. The idea was that CodeHelp could provide an initial level of support and handle relatively straightforward but common concerns, such as syntax errors. CodeHelp performed very well in this capacity, and given its flexibility and low cost, it is a great addition to the classroom for this functionality alone. However, CodeHelp also provided much more sophisticated help on a huge range of introductory CS problems throughout the semester.

CodeHelp appeared to provide accurate and helpful responses to students the majority of the time. CodeHelp did not "give away the answer" or otherwise become a complete replacement for actively working through problems. It appears to strike a nice balance between providing enough information to move students forward without undermining the intent of the assignments.

CodeHelp was a great addition to the course in terms of serving students who had difficulty attending office hours or who needed frequent reassurance or feedback as they worked through assignments outside of class time. It was also exceptional in providing a novel avenue for delivering support to students who did not take advantage of traditional avenues of support. For example, some students who seemed uncomfortable, embarrassed, or otherwise reluctant to ask for help from the instructor or TA had no reservations about asking CodeHelp.

CodeHelp sometimes provided assistance that was inconsistent with the content of the class and the knowledge-level of the students. For example, CodeHelp might suggest solving problems with methods that had not yet been introduced. This was confusing and frustrating for some students. During the semester, the avoid set functionality (Section 3.3) was added to allow the instructor to explicitly prohibit certain kinds of content in CodeHelp responses, which largely resolved the problem. Students sometimes provided too little information describing their problem to get a useful response and required some coaching to provide detailed or thoughtful descriptions of problems to CodeHelp.

Reviewing student queries submitted to CodeHelp provided an entirely new type of insight into student learning. In comparison to submitted work, the queries were a much more direct and unfiltered look into student thinking as they worked through problems.

On some occasions, this feedback guided modifications of assignments and additional class instruction during the semester.

Overall, given its great utility in a wide range of circumstances, its ease of use, and low cost, I found CodeHelp to be a tremendous asset in my course. I intend to continue using it in all of my introductory courses moving forward.

6 RECOMMENDED PRACTICES
Based on our experiences, we have collected a few recommendations for integrating CodeHelp into a class effectively.

Initial introduction. When first introducing CodeHelp to students, motivate its use by sharing some of the benefits identified in this work, as relevant to your course. Explain carefully its strengths and limitations in the context of your course: how it will likely be able to help, and where it may produce incorrect responses. Provide guidance on how to ask for help most effectively. This includes providing the relevant portions of one's code, identifying and copying the important information from error messages, and providing enough information for the issue to be identified. These are the same skills one needs to effectively communicate issues to instructors or peers. Providing good and bad examples or taking a moment to roleplay a few situations may help here. Demonstrate CodeHelp with a few issues similar to those you expect your students to encounter. Model how to provide sufficient information and communicate clearly.

During Use. Throughout the course, while students are using CodeHelp, it is helpful to view the students' queries regularly. You can gain detailed insight into where they are struggling at each point in the term that may lead to adapting course plans. Additionally, you might identify students whose usage is not effective (e.g., repeatedly submitting ineffective queries or demonstrating over-reliance), and reach out to them directly to provide guidance or a nudge.

Instructors and TAs should sample CodeHelp's responses in each section of the course to spot and mitigate issues. For example, if CodeHelp suggests a technique, function, or concept that does not fit the design of your course, you can add that to the avoid set (Section 3.3) to prevent it from being used in future responses.

7 CONCLUSION AND FUTURE WORK
This work shows that LLMs, when properly implemented and integrated into a learning environment, can be a valuable aid to both students and educators. We developed CodeHelp to provide immediate, high-quality support to students working on programming exercises while mitigating the risk of fostering an over-reliance on the automated assistance. Providing an automated option for this kind of help can increase the level of support students receive throughout a course due to a combination of being constantly available and avoiding the anxiety associated with asking a professor or TA for help. In our pilot study, students found CodeHelp to be a welcome addition to direct support from a professor and teaching assistants.

Going forward, we intend to continue developing and improving CodeHelp. The "avoid set" functionality proved to be critical for obtaining course-appropriate responses in many cases, and we plan to give instructors more ways to provide context about their courses and thus further tailor the LLM responses for their students. Additionally, we plan to explore different forms or levels of intervention that might be appropriate depending on the complexity of the task, the experience level of the student, or even the specific learning objectives of the course. And we see many opportunities for the tool to be more individualized, adapting to the needs of each student. For example, it could record and maintain information about each individual student's mastery of different topics, using that to guide the responses generated for them.

While encouraging, this work presents only an initial exploration into the effective deployment of LLMs in computing education. For example, while students positively rated CodeHelp and the instructor found it easy to use and deploy, future work should establish more robust metrics for gauging efficacy, such as measuring impact on student learning outcomes or comparing student performance in classrooms that use CodeHelp to those that do not.

We also recognize that further work needs to be conducted with larger, more diverse populations of students. It would also be interesting to deploy CodeHelp in different educational settings, such as in distance learning or self-paced programming courses, to evaluate its flexibility and adaptability.

Our findings could have implications beyond computing education. LLMs such as those used in CodeHelp could potentially be adapted to support learning in other domains. We hope that our work serves as an impetus for other researchers and educators to explore the use of LLMs in diverse educational contexts, continuing the dialogue around the opportunities and challenges they present.

REFERENCES
[1] Brett A Becker, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and Eddie Antonio Santos. 2023. Programming Is Hard-Or at Least It Used to Be: Educational Opportunities and Challenges of AI Code Generation. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 500–506.
[2] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. https://doi.org/10.1191/1478088706qp063oa
[3] Peter Brusilovsky, Barbara J Ericson, Cay S Horstmann, and Christian Servin. 2023. The Future of Computing Education Materials. (2023).
[4] Gustavo Carreira, Leonardo Silva, Antonio Jose Mendes, and Hugo Goncalo Oliveira. 2022. Pyo, a Chatbot Assistant for Introductory Programming Students. In 2022 International Symposium on Computers in Education (SIIE). IEEE, Coimbra, Portugal, 1–6. https://doi.org/10.1109/SIIE56031.2022.9982349
[5] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code Generation with Generated Tests. arXiv:2207.10397 [cs.CL]
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374 [cs.LG]
[7] Jonathan E Collins. 2023. Policy Solutions: Policy questions for ChatGPT and artificial intelligence. Phi Delta Kappan 104, 7 (2023), 60–61.
[8] Tyne Crow, Andrew Luxton-Reilly, and Burkhard Wuensche. 2018. Intelligent tutoring systems for programming education: a systematic review. In Proceedings of the 20th Australasian Computing Education Conference. ACM, Brisbane Queensland Australia, 53–62. https://doi.org/10.1145/3160489.3160492
[9] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. ACM, Toronto ON Canada, 1136–1142. https://doi.org/10.1145/3545945.3569823
[10] Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche, Brett A. Becker, and Brent N. Reeves. 2023. Promptly: Using Prompt Problems to Teach Learners How to Effectively Utilize AI Code Generators. arXiv:2307.16364 [cs.HC]
[11] Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, and Sami Sarsa. 2023. Computing Education in the Era of Generative AI. arXiv:2306.02608 [cs.CY]
[12] James Finnie-Ansley, Paul Denny, Brett A Becker, Andrew Luxton-Reilly, and James Prather. 2022. The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. In Proceedings of the 24th Australasian Computing Education Conference. 10–19. https://doi.org/10.1145/3511861.3511863
[13] Zhikai Gao, Sarah Heckman, and Collin Lynch. 2022. Who Uses Office Hours? A Comparison of In-Person and Virtual Office Hours Utilization. In Proceedings of the 53rd ACM Technical Symposium on Computer Science Education - Volume 1 (Providence, RI, USA) (SIGCSE 2022). Association for Computing Machinery, New York, NY, USA, 300–306. https://doi.org/10.1145/3478431.3499334
[14] Arto Hellas, Juho Leinonen, Sami Sarsa, Charles Koutcheme, Lilja Kujanpää, and Juha Sorva. 2023. Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests. arXiv:2306.05715 [cs.CY]
[15] Sajed Jalil, Suzzana Rafi, Thomas D. LaToza, Kevin Moran, and Wing Lam. 2023. ChatGPT and Software Testing Education: Promises & Perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE. https://doi.org/10.1109/icstw58534.2023.00078 arXiv:2302.03287
[16] Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, Stephan Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, Jürgen Pfeffer, Oleksandra Poquet, Michael Sailer, Albrecht Schmidt, Tina Seidel, Matthias Stadler, Jochen Weller, Jochen Kuhn, and Gjergji Kasneci. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274. https://doi.org/10.1016/j.lindif.2023.102274
[17] Majeed Kazemitabaar, Justin Chow, Carl Ka To Ma, Barbara J. Ericson, David Weintrop, and Tovi Grossman. 2023. Studying the Effect of AI Code Generators on Supporting Novice Learners in Introductory Programming. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 455, 23 pages. https://doi.org/10.1145/3544548.3580919
[18] Hieke Keuning, Johan Jeuring, and Bastiaan Heeren. 2019. A Systematic Literature Review of Automated Feedback Generation for Programming Exercises. ACM Transactions on Computing Education 19, 1 (March 2019), 1–43. https://doi.org/10.1145/3231711
[19] Mario Konecki, Nikola Kadoic, and Rok Piltaver. 2015. Intelligent assistant for helping students to learn programming. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, Opatija, Croatia, 924–928. https://doi.org/10.1109/MIPRO.2015.7160406
[20] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing Code Explanations Created by Students and Large Language Models. arXiv:2304.03938 [cs.CY]
[21] Mariam Mahdaoui, Said Nouh, My Seddiq ELKASMI Alaoui, and Mounir Sadiq. 2022. Comparative study between automatic hint generation approaches in Intelligent Programming Tutors. Procedia Computer Science 198 (2022), 391–396. https://doi.org/10.1016/j.procs.2021.12.259
[22] Jessica McBroom, Irena Koprinska, and Kalina Yacef. 2022. A Survey of Automated Programming Hint Generation: The HINTS Framework. Comput. Surveys 54, 8 (Nov. 2022), 1–27. https://doi.org/10.1145/3469885
[23] Nhan Nguyen and Sarah Nadi. 2022. An empirical evaluation of GitHub Copilot’s code suggestions. In Proceedings of the 19th International Conference on Mining Software Repositories. ACM, Pittsburgh Pennsylvania, 1–5. https://doi.org/10.1145/3524842.3528470
[24] Chinedu Wilfred Okonkwo and Abejide Ade-Ibijola. 2021. Python-Bot: A Chatbot for Teaching Python Programming. Engineering Letters 29 (02 2021), 25–34.
[25] Chinedu Wilfred Okonkwo and Abejide Ade-Ibijola. 2022. Revision-Bot: A Chatbot for Studying Past Questions in Introductory Programming. IAENG International Journal of Computer Science 49, 3 (2022).
[26] Zachary A. Pardos and Shreya Bhandari. 2023. Learning gain differences between ChatGPT and human tutor generated algebra hints. arXiv:2302.06871 [cs.CY]
[27] James Prather, Paul Denny, Juho Leinonen, Brett A Becker, Ibrahim Albluwi, Michael E Caspersen, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, et al. 2023. Transformed by Transformers: Navigating the AI Coding Revolution for Computing Education: An ITiCSE Working Group Conducted by Humans. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 2. 561–562.
[28] James Prather, Brent N. Reeves, Paul Denny, Brett A. Becker, Juho Leinonen, Andrew Luxton-Reilly, Garrett Powell, James Finnie-Ansley, and Eddie Antonio Santos. 2023. “It’s Weird That it Knows What I Want”: Usability and Interactions with Copilot for Novice Programmers. arXiv:2304.02491 [cs.HC]
[29] Margot Rutgers. 2021. Duckbot: A chatbot to assist students in programming tutorials. Master’s thesis. University of Twente.
[30] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In Proceedings of the 2022 ACM Conference on International Computing Education Research V.1. ACM, Lugano and Virtual Event Switzerland, 27–43. https://doi.org/10.1145/3501385.3543957
[31] Jaromir Savelka, Arav Agarwal, Marshall An, Chris Bogart, and Majd Sakr. 2023. Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Course. In Proceedings of the 2023 ACM Conference on International Computing Education Research V.1. ACM.
[32] Jaromir Savelka, Arav Agarwal, Christopher Bogart, and Majd Sakr. 2023. Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code. arXiv:2303.08033 [cs.CL]
[33] Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, and Tegawendé F. Bissyandé. 2023. Is ChatGPT the Ultimate Programming Assistant – How far is it? arXiv:2304.11938 [cs.SE]
[34] James Walden, Nicholas Caporusso, and Ludiana Atnafu. 2022. A Chatbot for Teaching Secure Programming. In Proceedings of the EDSIG Conference ISSN, Vol. 2473. 4901.
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
[36] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, Courtney Biles, Sasha Brown, Zac Kenton, Will Hawkins, Tom Stepleton, Abeba Birhane, Lisa Anne Hendricks, Laura Rimell, William Isaac, Julia Haas, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2022. Taxonomy of Risks Posed by Language Models. In 2022 ACM Conference on Fairness, Accountability, and Transparency (Seoul, Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York, NY, USA, 214–229. https://doi.org/10.1145/3531146.3533088
[37] Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. 2023. Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. arXiv:2301.12867 [cs.CL]