
Using Program Repair as a Proxy for Language Models' Feedback Ability in Programming Education

Charles Koutcheme, Nicola Dainese, and Arto Hellas
Aalto University, Espoo, Finland
[email protected]

Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications, pages 165–181, June 20, 2024. ©2024 Association for Computational Linguistics.

Abstract

One of the key challenges in programming education is being able to provide high-quality feedback to learners. Such feedback often includes explanations of the issues in students' programs coupled with suggestions on how to fix these issues. Large language models (LLMs) have recently emerged as valuable tools that can help in this effort. In this article, we explore the relationship between the program repair ability of LLMs and their proficiency in providing natural language explanations of coding mistakes. We outline a benchmarking study that evaluates leading LLMs (including open-source ones) on program repair and explanation tasks. Our experiments study the capabilities of LLMs both on a course level and on a programming concept level, allowing us to assess whether the programming concepts practised in exercises with faulty student programs relate to the performance of the models. Our results highlight that LLMs proficient in repairing student programs tend to provide more complete and accurate natural language explanations of code issues. Overall, these results enhance our understanding of the role and capabilities of LLMs in programming education. Using program repair as a proxy for explanation evaluation opens the door for cost-effective assessment methods.

Figure 1: Summary benchmarking results. The quality of LLMs' natural language descriptions of issues in students' code (completeness) tends to increase with LLMs' ability to fix the student programs (pass@1).

1 Introduction

Large Language Models (LLMs) and applications leveraging them such as ChatGPT have been embraced by both the general public and academia. The adoption is also visible in the domain of computing and programming education, where researchers have highlighted a variety of learning tasks that LLMs can tackle (Denny et al., 2023; Prather et al., 2023), including their performance in providing help and feedback to students (Hellas et al., 2023).

Feedback is a crucial part of learning (Hattie and Timperley, 2007). While various forms of feedback exist in programming (Keuning et al., 2018), explaining code issues in natural language can be particularly useful. Providing students with natural language explanations of the mistakes in their code allows them to gain a better understanding of gaps in their knowledge.

With the increasing number of LLMs proficient, to some degree, at providing feedback (Koutcheme et al., 2023a), selecting the best one before deploying it in classrooms (Liu et al., 2024) can be challenging. Human evaluation can take time, as it requires either manual assessment or annotated datasets. While research in the automated evaluation of LLM generation is on the rise (Zheng et al., 2023), also in educational areas (Fernandez et al., 2024), the developed methods often rely on other language models (e.g., utilizing powerful yet expensive LLMs such as GPT-4), which can induce computational or financial costs. A more cost-effective approach is needed.

Before the advent of LLMs, a stream of work in programming education has focused on
educational program repair (Gulwani et al., 2018; Parihar et al., 2017; Yi et al., 2017), where the goal is to produce fixes for students' incorrect programs. Although repairs to student programs are not always directly provided to students, they serve as a fundamental step in generating different types of support, including next-step hints for Intelligent Tutoring Systems (McBroom et al., 2021). While direct evaluation of feedback with natural language explanations can be challenging, evaluating whether LLMs can fix programs is much more straightforward.

With this in mind, we hypothesize that the student program repair capability of an LLM may relate to its capability to provide natural language explanations of code issues. If this holds, program repair capability – which is easier to assess – could serve as a proxy for evaluating feedback quality. Our intuition is supported by prior work that has found relationships between LLMs' abilities in related domains. For instance, LLMs that are proficient in solving specific problems are effective judges of the quality of explanations in those domains (Zheng et al., 2023). Similarly, there is some evidence that instruction-tuned LLMs trained on specific tasks can generalize to unseen parallel or close tasks (Wei et al., 2022a).

In this article, we investigate whether there effectively exists a relationship between the ability of LLMs to repair students' programs and their ability to explain code issues in natural language. If our hypothesis holds, researchers could more easily benchmark LLMs for other educational purposes, allowing educators to streamline the selection of LLMs. Our evaluation focuses on several leading and popular open-source language models, as well as proprietary models.

The main contributions of this article are (1) the benchmarking of several leading language models' abilities for program repair and (2) natural language explanation of code issues, as well as (3) the analysis and identification of the relationship between the two tasks.

2 Related Work

2.1 Program Repair and Feedback

Propagating feedback. Generating natural language explanations of the issues in student programs has been a long-standing challenge, with much work leveraging human annotations to bootstrap efforts (Piech et al., 2015; Malik et al., 2021; Koivisto and Hellas, 2022). In that area, early pretrained code language models have also proven useful (Wu et al., 2021) in making human annotation as data-efficient as possible. However, coming up with such annotations remains a time-consuming endeavour.

Educational Program Repair. Trying to alleviate the need for manual annotation, feedback on programming assignments has often been generated with the aid of automated program repair tools (Hu et al., 2019a), attempting to repair syntax and/or semantic errors in students' programs. In this area, LLMs have also shown great promise. Much of this line of work has mainly used early versions of the OpenAI Codex model, obtaining both syntax fixes (Zhang et al., 2022; Ahmed et al., 2022; Leinonen et al., 2023) and semantic fixes for students' non-working solutions (Zhang et al., 2022). Such fixes can inform Intelligent Tutoring Systems, which could then provide next-step hints to students (Rivers and Koedinger, 2017). However, while automatically constructed next-step hints can tell the students what to do next (in templated natural language sentences), they are not always able to explain the reasons why the code does not work.

Natural Language Explanations. The rise of newer and more powerful LLMs (e.g., ChatGPT) has opened the possibility of directly generating high-quality code explanations (Sarsa et al., 2022). In addition to such progress, research in improving program repair remains useful. In particular, recent efforts suggest that generated repairs can be included in the prompt to allow language models to provide more accurate natural language explanations of a program's issues (Phung et al., 2023a).

In parallel, prior work has also explored using program repair to validate the quality of LLM-generated feedback. In this space, the quality of LLM-generated repairs (i.e., whether the repairs pass all unit tests) would indicate whether the associated LLM-generated feedback would be given to students. The repairs could be generated by the LLM providing the feedback (Shubham Sahai, 2023), or by another, less powerful LLM acting as an artificial student (Phung et al., 2023b).

In contrast to efforts using program repair as a means for validating single generations, our work aims to assess whether the overall ability of a single language model to provide repairs across a larger set of programs is indicative of the language model's overall ability to generate natural language explanations.
2.2 Evaluating Language Models

Benchmarking code language models. When new language models are released, their performance is often assessed through multiple code generation benchmarks such as HumanEval (Chen et al., 2021), APPS (Hendrycks et al., 2021), MBPP (Austin et al., 2021), or DS-1000 (Lai et al., 2022). In parallel, prior work has also evaluated LLMs' ability to fix buggy programs in benchmarks such as HumanEval+ (Muennighoff et al., 2023), CodeXGlue (Lu et al., 2021), or QuixBugs (Lin et al., 2017). However, while such benchmarks contain multiple tasks that could potentially inform us of LLMs' performance in educational contexts, it is important to note that students' submitted incorrect programs can contain issues or defects that go beyond simple bugs (e.g., implementation of the wrong algorithm). Hence, educational benchmarks are needed.

Benchmarking in education. In the educational context, much work has looked into the performance of proprietary models (Codex and ChatGPT) on private and educational datasets (Finnie-Ansley et al., 2022; Hellas et al., 2023), both for program synthesis (Finnie-Ansley et al., 2022; Savelka et al., 2023b) and feedback (Hellas et al., 2023).

Open-source language models. While there exist a few efforts looking at the performance of open language models for generating repairs (Koutcheme et al., 2023a; Koutcheme, 2023) or answering student programming questions (Hicke et al., 2023), only the work of Koutcheme et al. (2024) looks into the performance of open-source models for generating educational programming feedback. Still, none of these works studies the relationship between program repair abilities and the quality of LLM-generated natural language explanations.

3 Methodology

We (1) evaluate how LLMs perform in generating repairs to incorrect programs, (2) evaluate how LLMs perform in explaining the issues in programs, and (3) study the potential relationship between the ability to generate repairs and the ability to generate natural language explanations. To ensure a comprehensive assessment, our study encompasses zero-shot evaluations (Yogatama et al., 2019; Linzen, 2020) of proprietary and state-of-the-art open-source LLMs having 7 billion (7B) parameters or fewer. Our experiments leverage a publicly available dataset comprising real-life students' submissions to Python programming problems. Next, we describe the programming dataset, outline our evaluation methodology, and list the language models included in this evaluation. We release the code used to perform our experiments as an additional contribution at https://github.com/KoutchemeCharles/bea2024.

3.1 Dataset

We use a subset of the FalconCode dataset (de Freitas et al., 2023), a large-scale dataset containing thousands of first-year students' solutions (over three semesters) to hundreds of Python programming assignments. It is the largest and most comprehensive publicly available dataset of student programs at the time of writing this manuscript. Beyond its substantial scale, this dataset features free-form assignments (i.e., not scoped to function writing) and exercise-level programming concept annotations, enabling a broader evaluation of LLM feedback.

Dataset processing. Due to the financial and computational costs of running LLM evaluations, we curate a smaller subset of submissions for our experiments. The dataset contains three semesters' worth of submissions (fall 2021, spring 2021, and fall 2022). We start by selecting submissions from the last semester (fall 2022). Each exercise in the dataset can be categorized based on a type (practice or exam) and a level of difficulty ("skill", "lab", or "project", i.e., easy, medium, hard). We omit exam exercises and focus on practice exercises (as these are the ones students require help with). Additionally, we exclude the more complex "project" assignments, which require extensive code writing across multiple files, as well as those requiring external files. Following Hu et al. (2019b), we select only the final incorrect submission of each student for each assignment. Although this selection may not capture the full range of student difficulties, it aligns with the idea that a student's last attempt often reflects their final understanding. Finally, we remove submissions with identical abstract syntax tree structures after variable normalization (Koutcheme et al., 2023a,c). The final dataset contains 370 programs from 44 assignments.
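To make the last de-duplication step concrete, the sketch below shows one way to drop submissions whose abstract syntax trees coincide after variable normalization. It is a minimal illustration rather than the released experiment code; the submissions list and the exact normalization strategy (renaming identifiers to positional placeholders) are assumptions.

import ast

class _Normalizer(ast.NodeTransformer):
    """Rename variables, arguments, and function names to positional placeholders."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

def normalized_structure(source):
    """AST dump of a program with concrete identifier names abstracted away."""
    return ast.dump(_Normalizer().visit(ast.parse(source)))

def deduplicate(submissions):
    """Keep one representative per normalized-AST equivalence class."""
    seen, kept = set(), []
    for code in submissions:
        try:
            key = normalized_structure(code)
        except SyntaxError:
            key = code  # submissions that do not parse are kept as-is
        if key not in seen:
            seen.add(key)
            kept.append(code)
    return kept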
3.2 Repairing Student Programs

Given a student's incorrect program in our test set, the first task is for an LLM to produce a repair to that incorrect program that passes all unit tests. Because of the wide range of issues found in students' programs, in contrast to classical program repair benchmarks (Lin et al., 2017; Muennighoff et al., 2023), in most educational scenarios we do not assume the existence of a single unique ground truth repair for an incorrect program. However, while such a unique ground truth does not exist, repairs that align with the original incorrect programs are often preferred. The general assumption is that closely aligned programs can generate (Phung et al., 2023a), or are associated with (Koutcheme et al., 2023a), feedback (e.g., natural language explanations or hints) that is more understandable to students, as this feedback requires a lower cognitive load to understand the issues in the program and the modifications that need to be made to reach a solution (Shubham Sahai, 2023). Moreover, we aim to investigate whether the language model's ability to produce repairs that closely resemble the original incorrect programs correlates with its proficiency in generating complete and accurate natural language explanations of the issues in the programs. The constraints on functional correctness and closeness are reflected in our evaluation procedure, which we adapt from the work of Koutcheme et al. (2023a).

Evaluation procedure. To evaluate functional correctness, for each incorrect program in our test set, we generate a single repair using greedy decoding (Rozière et al., 2023). To measure the ability of the language model to generate close repairs, we compute the ROUGE-L (Lin, 2004) score between the incorrect program and the candidate repair extracted from the single greedy generation. While other distance measures exist and have been used to measure closeness between programs (e.g., BLEU (Papineni et al., 2002) and CodeBERTScore (Zhou et al., 2023b)), the ROUGE-L score has been shown to correlate well with human judgement of high-quality repairs (Koutcheme et al., 2023b) while remaining fast to compute, as it does not rely on a language model.

We report the average repair success rate as the pass rate ('pass@1' (Chen et al., 2021)) and the average ROUGE-L score, abbreviated as 'rouge', over the programs in our test set.

3.3 Explaining Issues in Students' Programs

The second task is for our language models to explain all the issues in a given student's incorrect program. For each incorrect program, we prompt our language model to explain the issues using the prompt shown in Figure 5 (Appendix A.1), a variant of the prompt used by Hellas et al. (2023). Following prior work, we generate a single output using greedy decoding (Hellas et al., 2023; Savelka et al., 2023a; Leinonen et al., 2023).

Evaluation criteria. For each natural language explanation, we focus on two particular quantitative aspects of quality: (1) ensuring that the feedback is complete, i.e., it identifies and mentions all issues in the code, and (2) ensuring that it avoids hallucinations, i.e., it does not mention non-existent issues (Phung et al., 2023b; Hicke et al., 2023; Hellas et al., 2023). We highlight that our explanation task is a specific form of feedback that differs from hints. In the explanation task, the answer is meant to be given to students, while for hints (Roest et al., 2024), the feedback helps the students find the answer themselves. While prior work in hint generation has investigated other qualitative aspects, such as the "right level of detail" (Phung et al., 2023a; Scarlatos et al., 2024), we believe these are less likely to be correlated with an LLM's repair ability.

Automated Evaluation. Given the scale of our dataset and the multitude of language models to assess, conducting a human evaluation would be impractical. Therefore, we rely on automated evaluation using language models (Zheng et al., 2023). Powerful language models like ChatGPT have exhibited near-human performance across various tasks, sparking interest in their application for evaluating other LLMs (Zhou et al., 2023a; Cui et al., 2023; Tunstall et al., 2023), including in educational contexts (McNichols et al., 2024; Hicke et al., 2023). Notably, GPT-4 has demonstrated good performance in evaluating programming feedback quality (Koutcheme et al., 2024). In our work, we ask GPT-4 to grade the quality of the natural language explanations for each incorrect program. We ask the model to provide a binary label of whether each criterion (completeness, and avoiding highlighting non-existent issues) holds for the feedback generated by each language model. Figure 6 (Appendix A.1) shows our prompt. For each criterion, we report the average over the test set.
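As a concrete reference for the repair metrics of Section 3.2, the following sketch shows how pass@1 (with a single greedy sample) and the average ROUGE-L score could be aggregated over the test set. It assumes the rouge-score package and a caller-supplied unit-test runner; the field names and the passes_unit_tests callback are illustrative, not taken from the paper's released code.

from rouge_score import rouge_scorer

def repair_metrics(examples, passes_unit_tests):
    """examples: dicts with 'incorrect' (student code), 'repair' (greedy LLM output),
    and 'exercise_id'; passes_unit_tests(repair, exercise_id) -> bool runs the exercise's tests."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    passed, rouge_total = 0, 0.0
    for ex in examples:
        # pass@1 with one greedy sample: did that single repair pass all unit tests?
        passed += int(passes_unit_tests(ex["repair"], ex["exercise_id"]))
        # closeness: ROUGE-L between the student's incorrect program and the candidate repair
        rouge_total += scorer.score(ex["incorrect"], ex["repair"])["rougeL"].fmeasure
    n = len(examples)
    return {"pass@1": passed / n, "rouge": rouge_total / n}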
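The automated grading step of Section 3.3 can be approximated as follows with the OpenAI Python client. The actual judging prompt is the one shown in Figure 6 (Appendix A.1) and is not reproduced here, so the JUDGE_PROMPT text, the yes/no parsing convention, and the helper name below are placeholders.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder standing in for the grading prompt of Figure 6 (Appendix A.1): it asks for
# binary labels on completeness and on mentioning non-existent issues.
JUDGE_PROMPT = (
    "You are given a student's incorrect Python program and feedback written about it.\n"
    "Answer with exactly two lines:\n"
    "complete: yes/no (does the feedback identify and mention every issue in the code?)\n"
    "nonexistent: yes/no (does the feedback mention issues that are not present in the code?)\n\n"
    "Program:\n{program}\n\nFeedback:\n{feedback}\n"
)

def judge_explanation(program, feedback):
    response = client.chat.completions.create(
        model="gpt-4",  # the paper keeps standard GPT-4 as the automatic evaluator
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(program=program, feedback=feedback)}],
    )
    verdict = response.choices[0].message.content.lower()
    return {"complete": "complete: yes" in verdict,
            "hallucinated": "nonexistent: yes" in verdict}

Averaging the complete flags over the test set gives the completeness score, and averaging the hallucinated flags gives the hallucination rate reported below.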
3.4 Models

We focus our evaluation on instruction-tuned and chat models. While pretrained language models can also be useful for multiple tasks, as prior studies using Codex (Phung et al., 2023a) have shown, instruction-tuned models alleviate the need for complex queries and allow easier interactions, which benefits educators and researchers.

Closed-source models. We evaluate GPT-3.5 (gpt-3.5-turbo) and GPT-4-turbo (gpt-4-1106-preview) on our two tasks. Due to the financial costs of running GPT-4, we use the Turbo version for feedback generation, but we keep the standard GPT-4 for evaluating the quality of the natural language generations.

Open-source models. While prior work in programming feedback using LLMs has focused mainly on ChatGPT models (i.e., GPT-3.5 and GPT-4), we aim to cover the wider range of available options and include a selected number of instruction-tuned open-source/permissive models. We report the performance of the following families of models:

• TinyLlama (Zhang et al., 2024), a 1.1B parameter model following the Llama (Touvron et al., 2023) architecture.

• CodeLlama (Rozière et al., 2023), a series of Llama (Touvron et al., 2023) models specialized for code. We report the performance of the 7B parameter model.

• Mistral 7B (Jiang et al., 2023), a 7B parameter language model released by the MistralAI team.

• Zephyr (Tunstall et al., 2023), 7B parameter language models fine-tuned by HuggingFace using Direct Preference Optimization (Rafailov et al., 2023) on top of the Mistral 7B model. We evaluate the performance of Zephyr 7B β.

• Gemma (Google, 2024), open-source models released by Google DeepMind. We evaluate the performance of the 2B and 7B parameter models.

We chose these families of models because they are fully open-source and well-documented, they perform competitively on various code benchmarks (for models of their size), and they are widely adopted in the community. Additionally, within these families, we choose language models having 7 billion parameters or less, as such models can generally fit within one large GPU (without quantization). This choice reflects the potential need for educators, who are unlikely to have the computational and financial resources to access more than a single GPU, to run models on their own hardware.

Table 1: Summary of the performance of the models in program repair and code issue explanation. For the metrics pass@1, rouge, and completeness, a higher score indicates better performance. Conversely, for the hallucination rate metric, a lower score is preferable. Legend: compl. (completeness), hall. rate (hallucination rate).

                    repair             explanation
model            pass@1   rouge    compl.   hall. rate (↓)
TinyLlama        0.070    0.062    0.068    0.335
Gemma-2b         0.224    0.175    0.165    0.400
CodeLlama        0.292    0.251    0.343    0.841
Zephyr-beta      0.295    0.236    0.624    0.716
Mistral          0.324    0.241    0.738    0.397
Gemma-7b         0.327    0.298    0.905    0.005
gpt-3.5-turbo    0.530    0.470    0.838    0.368
gpt-4-turbo      0.665    0.536    0.992    0.024

Technical details. We query the ChatGPT models using OpenAI's Python API. We run the selected open-source language models using the HuggingFace Transformers library (Wolf et al., 2020); each model is run on a single NVIDIA A100 on our institution's research cluster. We run all models using their recommended precision. The details of each model (the names) can be found in Table 3 (Appendix A.2).
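For the open-source models, the generation setup with the HuggingFace Transformers library can look roughly like the sketch below. This is illustrative rather than the released experiment code; the checkpoint name, token budget, and prompt handling are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"  # one of the evaluated 7B chat models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # the model's recommended precision
    device_map="auto",           # fits on a single large GPU without quantization
)

def generate(prompt, max_new_tokens=512):
    # Chat models expect their own conversation template around the raw prompt
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        # do_sample=False gives the single greedy generation used in the evaluation
        output = model.generate(input_ids, do_sample=False, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)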
4 Results

First, we describe our general results; then, we outline an ablation analysis detailing the performance of the selected models over a set of programming concepts.

4.1 Main Results

Table 1 summarizes the performance of the LLMs in program repair and in explaining issues in code. We can make the following observations:

LLMs proficient in program repair generate repairs closer to the original incorrect program. Figure 2 highlights the scaling relationship between the pass rate and the rouge score. We see that as language models become more and more proficient in generating repairs, these repairs become closer to students' original programs and thus more useful. One could expect that LLMs which produce more fixes could generate generic solutions (which are far away from the student code) (Koutcheme et al., 2023c) – however, this is not the case.

Figure 2: Relationship between pass rate and rouge score.

Hallucination conditionally decreases as a function of completeness. Figure 3 highlights the relationship between the ability of a model to identify all issues in a program (completeness) and the model's tendency to hallucinate (hallucination rate). If we omit language models with less than 2B parameters (i.e., TinyLlama and Gemma-2B), we observe that the hallucination rate decreases as completeness increases. This relationship seems to hold only for large enough language models. Our interpretation is supported by prior work that has shown that many emerging behaviours in language models appear only when sufficiently large sizes are reached (Wei et al., 2022b) (e.g., their ability to solve new tasks via chain-of-thought prompting (Wei et al., 2023)).

Figure 3: Relationship between completeness and hallucination rate.

The ability to explain moderately scales with the ability to repair. Figure 1 highlights the relationship between repair performance and explanation performance (in terms of completeness). Generally, a language model that is better at program repair tends to also produce more complete descriptions. In our set of LLMs, only Gemma-7B and GPT-3.5 disrupt this relationship: although Gemma-7B has a lower pass rate than GPT-3.5 (only slightly better than Mistral), it produces very complete explanations (and with fewer hallucinations). Interestingly, the performance gap between models' ability to repair does not reflect the gap between their ability to explain in natural language. For instance, the difference between CodeLlama and Zephyr-7B in pass@1 (0.003) is almost 10× smaller than the performance gap between the models' abilities to generate complete explanations (0.281).

Repairing student programs is harder than explaining issues in natural language. When looking at the maximum value that the pass@1 metric assumes (0.665), we see that it is smaller than that of completeness (0.992). We believe repairing programs is more challenging than providing explanations, as the latter requires understanding the issues while the former requires both comprehension and expertise on how to implement the fixes.

On base models and fine-tuning. We hypothesize that pass@1 and completeness are reflective of the capabilities of the underlying base model, while the hallucination rate seems to depend more on the fine-tuning procedure. Our intuition is justified by the following observations: Mistral and Zephyr share the same base model (but different fine-tuning) and have comparable pass@1 and completeness, yet very different hallucination rates; OpenAI and Google invest significant efforts into curating datasets for fine-tuning to avoid hallucinations; and the small language models (TinyLlama and Gemma-2b) are probably too inaccurate (i.e., not powerful enough) to even hallucinate.
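To put a number on the trend summarized in Figure 1, one can compute a rank correlation between the pass@1 and completeness columns of Table 1. The snippet below does this with SciPy using the reported values; the choice of Spearman's rank correlation is ours, not a statistic reported in the paper.

from scipy.stats import spearmanr

# pass@1 and completeness as reported in Table 1 (models in the same order)
pass_at_1 = [0.070, 0.224, 0.292, 0.295, 0.324, 0.327, 0.530, 0.665]
completeness = [0.068, 0.165, 0.343, 0.624, 0.738, 0.905, 0.838, 0.992]

rho, p_value = spearmanr(pass_at_1, completeness)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")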
4.2 Concept Level Performance Analysis

The FalconCode dataset contains information about 20 programming concepts or "skills" (e.g., function definition, assignment, conditionals). The authors of the dataset manually annotated each exercise with information on whether each of these skills is practised (or needs to be mastered) in that exercise. We refer the reader to the original paper for details about the concepts (de Freitas et al., 2023).

In the same way that some students exhibit varying struggles with understanding and practising specific programming concepts (Liu et al., 2023), we suspect that language models might face a similar challenge. By examining the performance of language models on a per-concept basis, we aim to provide insights into their strengths and weaknesses in addressing specific programming challenges, thus informing educators and developers about suitable application scenarios.

We thus conduct an ablation study looking at the per-concept performance of our language models for repair and natural language explanation generation.

Methodology. For each of the 20 concepts, we obtain the list of exercises practising the concept and subsequently retrieve the incorrect programs in our test set submitted to these exercises. For each concept, we then report and compare the performance of the language models for program repair and natural language generation (using the same evaluation metrics) based on the retrieved subset of incorrect programs.

It is important to note that because all exercises practise multiple concepts, knowing which individual concept is responsible for the language model failing to fix (or explain) the issues in a program is impossible. As such, the following results give us an overview of the likelihood that an LLM would struggle to support students if an exercise involves a given concept. Table 4 (Appendix A.3) shows the number of exercises and programs that practise each specific concept. We limit our analysis to concepts practised in more than 3 exercises.

Results. Due to space limitations, we focus our analysis on the concepts with which language models struggle the most. Table 2 shows these concepts for all performance metrics; they are derived from Table 6 in Appendix B.2, which shows the detailed scores of all models. We can make the following observations:

Table 2: Programming concepts performance summary. We show the programming concept with which each language model struggles the most. Legend: IS (input string), IC (input casting), C (conditionals), FC (function call), FD (function definition), L (list), LU (loop until), L2D (list 2D), hall. rate (hallucination rate).

model            pass@1   rouge   completeness   hall. rate
TinyLlama        IC       IC      LU             IS
Gemma-2b         LU       IS      LU             L2D
CodeLlama        IC       IC      L2D            L
Zephyr-beta      IS       IS      FD             C
Mistral          IS       IS      FD             L
Gemma-7b         IS       IS      FC             LU
gpt-3.5-turbo    IC       IC      LU             FC
gpt-4-turbo      IS       IS      LU             L2D

When looking at the concepts for which repairing student programs works worst, almost all of them are related to input manipulation (input string or input casting), similar to what has been observed in LLMs' capability to provide suggestions to programming help requests (Hellas et al., 2023). Moreover, LLMs that perform poorly at fixing a given concept are also likely to perform poorly at generating close solutions for that concept.

When looking at the worst concepts for natural language explanations, these concern a wider range (looping, data structures, functions, basic operations). For completeness, there is not much variation in the performance in explaining issues across concepts; rather, the overall performance is correlated with the pass@1 of the corresponding model. For hallucination rate, each model has its own "base performance", which does not correlate with pass@1 and is roughly constant across concepts, with the exceptions of Zephyr and gpt-3.5-turbo, which respectively over- and underperform on function-related concepts compared to other concepts. There is no clear association between the concepts where LLMs are accurate and those where they hallucinate. Both small language models (less than 7B parameters) and proprietary models struggle most to be accurate with the 'looping until' concept, while language models of 7B parameters struggle more with function-related assignments. It is important to note that "struggling" here is relative to the model's performance with other concepts. GPT-4 "struggling" more on completeness with looping is still accurate 90% of the time.
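A minimal version of the per-concept aggregation described in the Methodology paragraph above could look like the following; the field names (e.g., concepts) and the metric callback are illustrative assumptions rather than the authors' data schema.

from collections import defaultdict

def per_concept_scores(examples, metric):
    """examples: dicts that include a 'concepts' list (the skills the exercise practises);
    metric: callable(example) -> float, e.g. 1.0 if the greedy repair passed the unit tests."""
    buckets = defaultdict(list)
    for ex in examples:
        for concept in ex["concepts"]:
            buckets[concept].append(metric(ex))
    # average the metric over every incorrect program whose exercise practises the concept
    return {concept: sum(values) / len(values) for concept, values in buckets.items()}

def worst_concept(scores, higher_is_better=True):
    """The concept with the weakest score; Table 2 reports this per model and per metric."""
    return min(scores, key=scores.get) if higher_is_better else max(scores, key=scores.get)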
5 Discussion

Repair as a proxy for feedback. Our results suggest that language models' relative ability to fix students' programs (which is easy to evaluate) tells us how these language models will compare in finding all issues in students' code while avoiding hallucination (for big enough language models). Based on this finding, one can devise more efficient LLM selection pipelines. For instance, a simple strategy consists of filtering out language models whose repair performance does not reach a certain threshold, with the threshold set based on a few evaluations of LLM natural language generation performance. As an illustrative scenario, evaluating only the Mistral model on our dataset allows us to reasonably assume that language models performing worse than 0.32 in pass rate (pass@1) are unlikely to generate complete explanations for more than 73.8% of programs. This pass rate value can thus act as a selection lower bound. As LLMs are becoming more widely adopted in education (Prather et al., 2023; Denny et al., 2024), and as the number of available models is increasing, these insights can help in the adoption process: institutions evaluating LLMs for their context can potentially reduce the number of LLMs to consider or limit the number of tasks conducted during the evaluation.

Open-source language models strike back. Another important finding emerging from our results is that while high-performance program repair must rely on proprietary models, recent 7B parameter models such as Gemma-7B can generate high-quality feedback competitive with SOTA models (Koutcheme et al., 2024). This has positive implications for educators interested primarily in giving students feedback rather than repairing solutions, as such feedback can also be generated via privacy-preserving open-source models.

However, it is important to acknowledge that running such models requires custom computational resources. In the literature, 7B parameter models are sometimes termed "small" due to their relative size compared to many large language models (e.g., the Falcon-180B model (Almazrouei et al., 2023)). Yet, a 7B parameter model is not small in terms of computational resources, as it requires a large GPU to fit entirely into memory (without quantization). There is currently a trend towards developing small language models (less than 3B parameters), such as TinyLlama and Gemma, which can run on more modest hardware (e.g., a consumer laptop GPU or accelerated hardware). However, as our results suggest, the performance of such LLMs still lags behind their 7B parameter counterparts.

Identifying specific knowledge gaps. Unfortunately, our results do not yet allow us to identify, from program repair performance alone, which programming concepts LLMs will struggle to explain in natural language. While individual repair performance depends on the concept being practised, a language model's performance in explaining issues does not (i.e., the performance is constant across all concepts). We hypothesize that the per-concept performance gap is only revealed for the harder task of fixing students' programs. Uncovering LLM knowledge gaps with automated measures might require us to rely on harder automatically evaluable tasks (e.g., QLCs (Lehtinen et al., 2024)).

Interplay of programming feedback types. Our primary research objective is to deepen our understanding of LLMs' feedback capabilities in educational contexts. Specifically, we seek to explore the relationship between different forms of feedback and program repair. While we treated feedback (identifying and explaining issues in programs) and program repair as distinct tasks in this study, we acknowledge their inherent interdependence. Previous research suggests that high-quality repairs can induce high-quality feedback when provided in context (Phung et al., 2023b,a). However, generating high-quality repairs is inherently challenging, as our results suggest, requiring the language model to comprehend what is wrong in a program and how to address the issues. In contrast, we believe explanations of issues in students' programs could serve as reasoning steps (Wei et al., 2023), enhancing the subsequent generation of repairs (Chen et al., 2023). These refined repairs, in turn, could facilitate the generation of high-quality next-step hints (Roest et al., 2024). Research investigating the interplay between different types of feedback is thus pivotal in unlocking the full potential of language models to support programming education. By studying the performance of generating repairs without conditioning on feedback, and of generating feedback without conditioning on repairs, our work establishes a foundational understanding that will allow the research community to assess the extent to which various prompting techniques enhance feedback performance.
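The selection strategy sketched in the "Repair as a proxy for feedback" paragraph reduces to a simple filter once repair scores are available. The snippet below is purely illustrative; the 0.32 threshold is the example value discussed above, not a universal cut-off.

def shortlist(models, pass_at_1, threshold=0.32):
    """Keep only models whose repair pass rate clears the threshold; only the survivors
    are then subjected to the more expensive natural language feedback evaluation."""
    return [m for m in models if pass_at_1[m] >= threshold]

# Example with the pass@1 values of Table 1
pass_at_1 = {"TinyLlama": 0.070, "Gemma-2b": 0.224, "CodeLlama": 0.292, "Zephyr-beta": 0.295,
             "Mistral": 0.324, "Gemma-7b": 0.327, "gpt-3.5-turbo": 0.530, "gpt-4-turbo": 0.665}
print(shortlist(list(pass_at_1), pass_at_1))
# ['Mistral', 'Gemma-7b', 'gpt-3.5-turbo', 'gpt-4-turbo']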
6 Conclusions

In this article, we have uncovered an intriguing relationship between LLM performance in program repair and the capability to explain issues in code. Our evaluations encompassed both open-source and proprietary models, examining their generic performance as well as concept-specific proficiency.

While selecting and deploying a specific language model may not be challenging, identifying the most suitable one for a particular purpose can be complex, particularly when considering financial, hardware, or other limitations. At a time when there are calls to rethink how programming is taught (Denny et al., 2024), the insights gleaned from our work can provide valuable guidance for educators in choosing LLMs that align with their instructional contexts.

Future work. Our future work will involve two specific directions. First, we will continue our investigation of the relationships between various types of programming feedback and program repair. All these efforts remain an attempt to streamline the selection process of language models based on automated evaluation measures.

Besides studying LLM performance, our second objective is to leverage our computational resources to improve these LLMs' ability to provide feedback. In particular, the poor explaining performance of small language models suggests that these models would benefit from alignment procedures designed specifically to improve feedback abilities (Scarlatos et al., 2024).

Limitations

Our work is not free of limitations. We evaluated the LLMs on a subset of solutions from a single dataset (from one institution with one programming language). Moreover, our evaluation of natural language explanations relied on GPT-4, which, although a state-of-the-art language model, is not a perfect evaluator. Human evaluation is necessary to strengthen our results. Furthermore, the evaluation prompt would benefit from refinement (e.g., allowing GPT-4 to reason (Wei et al., 2023) before providing its final answers). Additionally, the results of our evaluation also depend on the specific prompts used to interact with each language model. Similarly, our benchmarking experiment was not exhaustive – although we included many popular state-of-the-art open-source and proprietary models, many more exist. Including more models would be necessary to strengthen the claim of the relationship between repair and natural language explanations. Beyond this, the concept analysis is only indicative, as many assignments feature multiple concepts. Finally, we only considered single-turn zero-shot repair, which does not take advantage of LLMs' ability to reason with few-shot examples (Brown et al., 2020) or to correct their own mistakes (Chen et al., 2023; Xia and Zhang, 2023).

Ethics Statement

The work in the present article has been conducted following national and institutional ethics guidelines. We recognize the increasing importance of ethical considerations in artificial intelligence research, particularly concerning data usage and potential societal impacts.

The dataset employed in this research is openly available to researchers. Our overarching goal is to contribute to the development and evaluation of open-source language models for providing feedback in programming education. By focusing on open-source models, we aim to promote transparency, accessibility, and accountability in AI research and development, thereby addressing concerns regarding the privacy implications of using proprietary language models.

We further acknowledge the broader ethical implications of our work, including issues related to fairness and accessibility of LLM feedback, how LLMs might favour certain styles of interaction, and how LLMs might contribute to inequalities in the quality of provided education worldwide.

References

Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. Synfix: Automatically fixing syntax errors using compiler diagnostics.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, et al. 2023. The falcon series of open language models.

Jacob Austin, Augustus Odena, Maxwell Nye, et al. 2021. Program synthesis with large language models.
Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners.

Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating language models trained on code.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug.

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback.

Adrian de Freitas, Joel Coffman, Michelle de Freitas, Justin Wilson, and Troy Weingart. 2023. Falconcode: A multiyear dataset of python code samples from an introductory computer science course. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pages 938–944, New York, NY, USA. Association for Computing Machinery.

Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, and Sami Sarsa. 2023. Computing education in the era of generative ai. arXiv preprint arXiv:2306.02608.

Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, and Sami Sarsa. 2024. Computing education in the era of generative ai. Commun. ACM, 67(2):56–67.

Nigel Fernandez, Alexander Scarlatos, and Andrew Lan. 2024. Syllabusqa: A course logistics question answering dataset.

James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The robots are coming: Exploring the implications of openai codex on introductory programming. In Proceedings of the 24th Australasian Computing Education Conference, ACE '22, pages 10–19, New York, NY, USA. Association for Computing Machinery.

Google. 2024. Gemma: Open models based on gemini research and technology. Technical report, Google DeepMind. https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf.

Sumit Gulwani, Ivan Radiček, and Florian Zuleger. 2018. Automated clustering and program repair for introductory programming assignments. ArXiv:1603.03165 [cs].

John Hattie and Helen Timperley. 2007. The power of feedback. Review of Educational Research, 77(1):81–112.

Arto Hellas, Juho Leinonen, Sami Sarsa, Charles Koutcheme, Lilja Kujanpää, and Juha Sorva. 2023. Exploring the responses of large language models to beginner programmers' help requests. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, ICER '23, pages 93–105, New York, NY, USA. Association for Computing Machinery.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with apps.

Yann Hicke, Anmol Agarwal, Qianou Ma, and Paul Denny. 2023. Ai-ta: Towards an intelligent question-answer teaching assistant using open-source llms.

Yang Hu, Umair Z. Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury. 2019a. Re-factoring based program repair applied to programming assignments. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 388–398. IEEE/ACM.

Yang Hu, Umair Z. Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury. 2019b. Re-factoring based program repair applied to programming assignments. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 388–398. IEEE/ACM.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. 2023. Mistral 7b.
Hieke Keuning, Johan Jeuring, and Bastiaan Heeren. 2018. A systematic literature review of automated feedback generation for programming exercises. ACM Transactions on Computing Education (TOCE), 19(1):1–43.

Teemu Koivisto and Arto Hellas. 2022. Evaluating codeclusters for effectively providing feedback on code submissions. In 2022 IEEE Frontiers in Education Conference (FIE), pages 1–9. IEEE.

Charles Koutcheme. 2023. Training language models for programming feedback using automated repair tools. In Artificial Intelligence in Education, pages 830–835, Cham. Springer Nature Switzerland.

Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, and Paul Denny. 2024. Open source language models can provide feedback: Evaluating llms' ability to help students using gpt-4-as-a-judge. In Proceedings of the 2024 Innovation and Technology in Computer Science Education, Volume 1, ITiCSE '24.

Charles Koutcheme, Nicola Dainese, Sami Sarsa, Juho Leinonen, Arto Hellas, and Paul Denny. 2023a. Benchmarking educational program repair. In NeurIPS'23 Workshop on Generative AI for Education (GAIED). NeurIPS.

Charles Koutcheme, Sami Sarsa, Juho Leinonen, Lassi Haaranen, and Arto Hellas. 2023b. Evaluating distance measures for program repair. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, ICER '23, pages 495–507, New York, NY, USA. Association for Computing Machinery.

Charles Koutcheme, Sami Sarsa, Juho Leinonen, Arto Hellas, and Paul Denny. 2023c. Automated program repair using generative models for code infilling. In Artificial Intelligence in Education, pages 798–803, Cham. Springer Nature Switzerland.

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2022. Ds-1000: A natural and reliable benchmark for data science code generation.

Teemu Lehtinen, Charles Koutcheme, and Arto Hellas. 2024. Let's ask ai about their programs: Exploring chatgpt's answers to program comprehension questions. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering Education and Training, ICSE-SEET '24.

Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A. Becker. 2023. Using language models to enhance programming error messages. In Proceedings of the 2023 ACM SIGCSE Technical Symposium on Computer Science Education.

Raymond Li, Loubna Ben Allal, Yangtian Zi, et al. 2023. Starcoder: may the source be with you!

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. Quixbugs: A multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH Companion 2017, pages 55–56, New York, NY, USA. Association for Computing Machinery.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

Qi Liu, Shuanghong Shen, Zhenya Huang, Enhong Chen, and Yonghe Zheng. 2023. A survey of knowledge tracing.

Rongxin Liu, Carter Zenke, Charlie Liu, Andrew Holmes, Patrick Thornton, and David J. Malan. 2024. Teaching cs50 with ai: Leveraging generative artificial intelligence in computer science education. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2024, pages 750–756, New York, NY, USA. Association for Computing Machinery.

Shuai Lu, Daya Guo, Shuo Ren, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation.
Ali Malik, Mike Wu, Vrinda Vasavada, Jinpeng Song, Madison Coots, John Mitchell, Noah Goodman, and Chris Piech. 2021. Generative grading: Near human-level accuracy for automated feedback on richly structured problems. In Proceedings of the 14th Educational Data Mining Conference.

Jessica McBroom, Irena Koprinska, and Kalina Yacef. 2021. A survey of automated programming hint generation: The hints framework. ACM Computing Surveys (CSUR), 54(8):1–27.

Hunter McNichols, Wanyong Feng, Jaewook Lee, Alexander Scarlatos, Digory Smith, Simon Woodhead, and Andrew Lan. 2024. Automated distractor and feedback generation for math multiple-choice questions via in-context learning.

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Sagar Parihar, Ziyaan Dadachanji, Praveen Kumar Singh, Rajdeep Das, Amey Karkare, and Arnab Bhattacharya. 2017. Automatic grading and feedback using program repair for introductory programming courses. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education, pages 92–97.

Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023a. Generating high-precision feedback for programming syntax errors using language models.

Tung Phung, Victor-Alexandru Pădurean, Anjali Singh, Christopher Brooks, José Cambronero, Sumit Gulwani, Adish Singla, and Gustavo Soares. 2023b. Automating human tutor-style programming feedback: Leveraging gpt-4 tutor model for hint generation and gpt-3.5 student model for hint validation.

Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas Guibas. 2015. Learning program embeddings to propagate feedback on student code.

James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, et al. 2023. The robots are here: Navigating the generative ai revolution in computing education. In Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education, pages 108–159.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.

Kelly Rivers and Kenneth R. Koedinger. 2017. Data-driven hint generation in vast solution spaces: a self-improving python programming tutor. International Journal of Artificial Intelligence in Education, 27(1):37–64.

Lianne Roest, Hieke Keuning, and Johan Jeuring. 2024. Next-step hint generation for introductory programming using large language models. In Proceedings of the 26th Australasian Computing Education Conference, ACE '24, pages 144–153, New York, NY, USA. Association for Computing Machinery.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, et al. 2023. Code llama: Open foundation models for code.

Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1, pages 27–43.

Jaromir Savelka, Arav Agarwal, Marshall An, Chris Bogart, and Majd Sakr. 2023a. Thrilled by your progress! large language models (gpt-4) no longer struggle to pass assessments in higher education programming courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, ICER '23, pages 78–92, New York, NY, USA. Association for Computing Machinery.

Jaromir Savelka, Arav Agarwal, Christopher Bogart, Yifan Song, and Majd Sakr. 2023b. Can generative pre-trained transformers (gpt) pass assessments in higher education programming courses? arXiv preprint.

Alexander Scarlatos, Digory Smith, Simon Woodhead, and Andrew Lan. 2024. Improving the validity of automatically generated feedback via reinforcement learning.
Shubham Sahai, Umair Z. Ahmed, and Ben Leong. 2023. Improving the coverage of gpt for automated feedback on high school programming assignments. In NeurIPS’23 Workshop on Generative AI for Education (GAIED). NeurIPS.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of lm alignment.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022b. Emergent abilities of large language models.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface’s transformers: State-of-the-art natural language processing.

Mike Wu, Noah D. Goodman, Chris Piech, and Chelsea Finn. 2021. Prototransformer: A meta-learning approach to providing student feedback. CoRR, abs/2107.14035.

Chunqiu Steven Xia and Lingming Zhang. 2023. Conversational automated program repair.

Jooyong Yi, Umair Z Ahmed, Amey Karkare, Shin Hwei Tan, and Abhik Roychoudhury. 2017. A feasibility study of using automated program repair for introductory programming assignments. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 740–751.

Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence.

Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022. Repairing bugs in python assignments using language models.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023a. Lima: Less is more for alignment.

Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023b. CodeBERTScore: Evaluating code generation with pretrained models of code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13921–13937, Singapore. Association for Computational Linguistics.

A Experiment details

A.1 Prompts used

Figure 4 (resp. Figure 5) shows our prompt for obtaining repairs (resp. feedback) from the language models. Figure 6 shows the prompt used to grade the feedback generated by the language models with GPT-4 as our automatic evaluator (we adapt the prompt from Koutcheme et al. (2024)). The reported value for "completeness" corresponds to the proportion of "yes" responses to the first criterion across our test dataset, while the reported value for the hallucination rate corresponds to the proportion of "no" responses to the second criterion. Regarding the issues present in the students' incorrect programs, we assumed them to be identified by GPT-4 during evaluation (without a separate prompt). We acknowledge the limitations of this prompting strategy (i.e., no space for reasoning), which we will refine in future work.
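To make this aggregation concrete, the following is a minimal sketch of how the two reported values could be computed from the judge's responses; the parsing helper, its name, and the data layout are illustrative assumptions rather than the exact implementation used in the study.

```python
import re


def parse_judge_response(response: str) -> dict:
    """Extract the '(1): Yes/No' and '(2): Yes/No' answers from a judge response.

    Assumes the judge followed the answer format requested in the judging prompt
    (Figure 6); criteria the judge did not answer are simply missing.
    """
    answers = {}
    for number, verdict in re.findall(r"\((\d)\)\s*:\s*(yes|no)", response, re.IGNORECASE):
        answers[int(number)] = verdict.lower() == "yes"
    return answers


def aggregate_judgements(judge_responses: list[str]) -> dict:
    """Compute completeness and hallucination rate over a set of judged feedback texts."""
    parsed = [parse_judge_response(r) for r in judge_responses]
    # Completeness: proportion of "yes" answers to criterion (1),
    # i.e. the feedback identifies and mentions all actual issues.
    completeness = sum(p.get(1, False) for p in parsed) / len(parsed)
    # Hallucination rate: proportion of "no" answers to criterion (2),
    # i.e. the feedback mentions at least one non-existent issue.
    hallucination_rate = sum(not p.get(2, True) for p in parsed) / len(parsed)
    return {"completeness": completeness, "hallucination_rate": hallucination_rate}
```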
A.2 Official model names

Table 3 translates each model name into its HuggingFace id.²

² https://huggingface.co/models
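As an illustration only, these identifiers can be passed directly to the Hugging Face transformers library to load a model and tokenizer; the snippet below is a generic usage sketch, not a description of the exact inference setup used in the study.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any id from Table 3 can be substituted here.
model_id = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```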
Repair generation

You are a computer science professor teaching introductory programming using Python. (1)

Below is a problem description and an incorrect program submitted by a student. Repair the student program with as few changes as possible such that the corrected program fulfils the requirements of the problem description. The corrected Python code must be between ```python and ```. (2)

**Problem:**
<handout>
**Incorrect code:** (3)
<submitted_code>

Figure 4: Our template for prompting the LLMs to repair student programs. (1) A system prompt specifying the behaviour of the model. (2) A description of the repair task. (3) Information necessary to repair the program.

Feedback generation

You are a computer science professor teaching introductory programming using Python. (1)

Below is a problem statement and an incorrect program submitted by a student. List and explain all the issues in the student program that prevent it from solving the associated problem and fulfilling all the requirements in the problem description. (2)

**Problem:**
<handout>
**Incorrect code:** (3)
<submitted_code>

Figure 5: Our template for prompting the LLMs to provide feedback. (1) A system prompt specifying the behaviour of the model. (2) A description of the feedback task. (3) Information necessary to explain the issues.

Judging

You are a computer science professor teaching introductory programming using Python. (1)

Below is a problem description and an incorrect program written by a student. You are also provided with the feedback generated by a language model. Your task is to evaluate the quality of the feedback (by saying yes or no) to ensure it adheres to the multiple criteria outlined below. For each criterion, provide your answer in a separate line with the format '(CRITERIA_NUMBER): Yes/No'. Do not provide comments, but be attentive to the problem description requirements. (2)

## Problem description:
<handout>
## Student Code:
<submitted_code>
## Feedback: (3)
<feedback>
## Criteria:
(1) Identifies and mentions all actual issues
(2) Does not mention any non-existent issue

Figure 6: Judging prompt template. We provide (1) a system prompt specifying GPT-4's behaviour, (2) a description of the grading task, and (3) contextual information.

Table 3: Official model names for HuggingFace models.

name          HuggingFace/OpenAI id
TinyLlama     TinyLlama/TinyLlama-1.1B-Chat-v1.0
CodeLlama     codellama/CodeLlama-7b-hf
Llama         meta-llama/Llama-2-7b-chat-hf
Mistral       mistralai/Mistral-7B-v0.1
Zephyr        HuggingFaceH4/zephyr-7b-beta
Gemma         google/gemma-7b-it

A.3 Concept analysis

Table 4 shows the number of exercises that practise each concept. Additionally, Figure 7 shows an upset plot of the number of incorrect programs for which each combination of programming concepts is practised.

[Upset plot omitted; y-axis: intersection size.]

Figure 7: Programming concepts upset plot.

B Results details

B.1 Additional performance scores

Some work in program synthesis has evaluated the ability of language models to generate programs using another method to estimate pass@1. This method, originally proposed by Chen et al. (2021), is based on generating multiple samples and is particularly suited to non-instruction-tuned models. We report the results obtained with this multi-sample strategy below.

Multi-sample performance evaluation. For each incorrect program, we generate n = 20 samples using top_p nucleus sampling and a temperature of 0.2 (Chen et al., 2021; Li et al., 2023). We evaluate functional correctness using the pass@1 estimator, which tells us the probability that a language model will fix an incorrect program in a single attempt (Muennighoff et al., 2023).
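For reference, the sketch below implements the unbiased pass@k estimator of Chen et al. (2021), which for k = 1 reduces to the fraction of passing samples c/n for each program; the function itself and the final dataset-level averaging step are our own illustrative choices.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem, given n generated samples
    of which c pass all unit tests (Chen et al., 2021):
    pass@k = 1 - C(n - c, k) / C(n, k), computed in a numerically stable form.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example: 20 repair samples, 5 of which pass all tests -> pass@1 = 5/20 = 0.25.
assert abs(pass_at_k(n=20, c=5, k=1) - 0.25) < 1e-9

# A dataset-level score can then be obtained by averaging the per-program estimates.
```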
To evaluate the ability of a language model to generate a solution close to the student program, we average the ROUGE-L score between the incorrect program and each of the k (k ≤ n) candidate repairs that pass all unit tests.

Table 4: Number of exercises and incorrect programs practised for each concept.

concept              # exercises   # programs
input string         4             18
input casting        27            257
output               28            249
assignment           26            217
conditional          22            257
function calling     8             63
function definition  9             65
function return      6             43
loop counting        9             105
loop until           5             38
loop elements        1             30
loop nested          1             3
stat calculation     10            38
list                 3             38
list 2D              3             17

Results. Table 5 shows the performance results with the adapted pass@1 and ROUGE scores for a subset of the models (those with at least 7B parameters).

Table 5: We show the pass@1, rouge, completeness, and hallucination rate (hall. rate).

model          pass@1   rouge   completeness   hall. rate
Gemma-7b       0.267    0.353   0.905          0.005
Zephyr-beta    0.276    0.336   0.624          0.716
Mistral        0.304    0.365   0.738          0.397
gpt-3.5-turbo  0.529    0.561   0.838          0.368
gpt-4-turbo    0.634    0.559   0.992          0.024

In general, we notice an absolute drop in performance compared to greedy decoding. Beyond this absolute difference, the main change is that the ranking of the models changed: Gemma-7B is now the least performant of the 7B-parameter models.
The performance of the 7B-parameter models thus appears to depend on the decoding strategy.
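For completeness, below is a minimal sketch of the averaged ROUGE-L similarity described at the start of this subsection, using the rouge_score package; the package choice, tokenisation settings, and handling of programs with no passing repairs are assumptions that may differ from our implementation.

```python
from rouge_score import rouge_scorer


def mean_rouge_l(incorrect_program: str, passing_repairs: list[str]) -> float:
    """Average ROUGE-L F-measure between the student's incorrect program and
    each candidate repair that passed all unit tests."""
    if not passing_repairs:
        return 0.0  # assumption: score 0 when no candidate repair passes
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    scores = [
        scorer.score(incorrect_program, repair)["rougeL"].fmeasure
        for repair in passing_repairs
    ]
    return sum(scores) / len(scores)
```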

B.2 Programming concepts performance


Table 6 shows the detailed per-concept performance results for all models.

Table 6: Per-concept performance results. Legend: IS (input string), IC (input casting), O (output), A (assignment), C (conditionals), FC (function call), FD (function definition), FR (function return), LC (loop counting), LU (loop until), SC (stat calculate), L (list), L2D (list 2D).

(a) Pass@1

IS IC O A C FC FD FR LC LU SC L L2D
TinyLlama 0.04 0.03 0.03 0.07 0.05 0.21 0.09 0.14 0.03 0.03 0.04 0.03 0.06
Gemma-2b 0.06 0.13 0.13 0.12 0.19 0.44 0.60 0.77 0.13 0.05 0.10 0.11 0.47
CodeLlama 0.22 0.18 0.24 0.24 0.26 0.54 0.52 0.56 0.19 0.29 0.21 0.24 0.59
Zephyr-beta 0.10 0.17 0.17 0.24 0.25 0.60 0.58 0.86 0.23 0.26 0.26 0.26 0.41
Mistral 0.13 0.23 0.22 0.27 0.28 0.56 0.49 0.67 0.19 0.24 0.24 0.34 0.65
Gemma-7b 0.16 0.22 0.25 0.29 0.25 0.52 0.52 0.53 0.21 0.47 0.21 0.26 0.47
gpt-3.5-turbo 0.44 0.41 0.50 0.52 0.46 0.84 0.86 0.91 0.49 0.50 0.55 0.68 0.76
gpt-4-turbo 0.21 0.58 0.63 0.64 0.63 0.86 0.92 1.00 0.39 0.50 0.48 0.42 0.76
average 0.17 0.24 0.27 0.30 0.30 0.57 0.57 0.68 0.23 0.29 0.26 0.29 0.52

(b) Completeness

IS IC O A C FC FD FR LC LU SC L L2D
TinyLlama 0.04 0.07 0.07 0.04 0.08 0.03 0.08 0.07 0.04 0.00 0.05 0.08 0.18
Gemma-2b 0.15 0.15 0.14 0.17 0.15 0.21 0.20 0.26 0.16 0.05 0.17 0.18 0.06
CodeLlama 0.31 0.33 0.32 0.37 0.35 0.43 0.35 0.35 0.33 0.45 0.42 0.26 0.24
Zephyr-beta 0.54 0.64 0.65 0.59 0.63 0.60 0.51 0.51 0.55 0.71 0.60 0.76 0.59
Mistral 0.81 0.74 0.71 0.77 0.75 0.79 0.69 0.79 0.73 0.76 0.79 0.79 0.82
Gemma-7b 0.94 0.94 0.92 0.91 0.95 0.76 0.86 0.91 0.90 1.00 0.94 0.95 1.00
gpt-3.5-turbo 0.93 0.82 0.82 0.87 0.84 0.84 0.86 0.81 0.83 0.76 0.89 0.97 0.94
gpt-4-turbo 1.00 0.99 0.99 1.00 0.99 0.98 0.98 0.98 0.99 0.97 1.00 1.00 1.00
average 0.59 0.58 0.58 0.59 0.59 0.58 0.57 0.58 0.57 0.59 0.61 0.62 0.60

(c) ROUGE

IS IC O A C FC FD FR LC LU SC L L2D
TinyLlama 0.04 0.03 0.03 0.07 0.05 0.17 0.08 0.13 0.03 0.03 0.04 0.03 0.06
Gemma-2b 0.05 0.11 0.11 0.11 0.15 0.32 0.45 0.57 0.12 0.05 0.10 0.09 0.31
CodeLlama 0.20 0.15 0.20 0.22 0.21 0.46 0.45 0.47 0.17 0.25 0.18 0.21 0.51
Zephyr-beta 0.07 0.13 0.13 0.20 0.19 0.50 0.48 0.71 0.18 0.21 0.21 0.22 0.32
Mistral 0.08 0.15 0.16 0.20 0.20 0.44 0.38 0.52 0.14 0.15 0.16 0.26 0.48
Gemma-7b 0.16 0.20 0.22 0.28 0.23 0.49 0.47 0.49 0.20 0.43 0.21 0.25 0.44
gpt-3.5-turbo 0.41 0.37 0.44 0.47 0.41 0.74 0.76 0.80 0.45 0.45 0.51 0.63 0.72
gpt-4-turbo 0.17 0.46 0.51 0.52 0.50 0.70 0.72 0.77 0.31 0.40 0.38 0.35 0.61
average 0.15 0.20 0.22 0.26 0.24 0.48 0.47 0.56 0.20 0.25 0.22 0.26 0.43

(d) Hallucination rate

IS IC O A C FC FD FR LC LU SC L L2D
TinyLlama 0.47 0.30 0.25 0.35 0.31 0.41 0.42 0.40 0.43 0.21 0.40 0.39 0.18
Gemma-2b 0.12 0.37 0.42 0.37 0.35 0.32 0.48 0.40 0.24 0.55 0.28 0.13 0.65
CodeLlama 0.87 0.86 0.87 0.82 0.85 0.75 0.82 0.81 0.82 0.89 0.80 0.97 0.88
Zephyr-beta 0.88 0.79 0.78 0.80 0.78 0.46 0.60 0.49 0.86 0.61 0.89 0.87 0.65
Mistral 0.43 0.42 0.42 0.39 0.41 0.32 0.32 0.35 0.41 0.32 0.37 0.45 0.41
Gemma-7b 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.01 0.03 0.02 0.00 0.00
gpt-3.5-turbo 0.28 0.31 0.35 0.36 0.35 0.54 0.49 0.51 0.34 0.21 0.29 0.24 0.24
gpt-4-turbo 0.00 0.02 0.02 0.02 0.02 0.00 0.03 0.05 0.00 0.05 0.01 0.05 0.06
average 0.38 0.38 0.39 0.39 0.38 0.35 0.40 0.38 0.39 0.36 0.38 0.39 0.38

