Automatic Grading of Short Answers Using Large Language Models in
Automatic Grading of Short Answers Using Large Language Models in
Research Collection School Of Computing and School of Computing and Information Systems
Information Systems
5-2024
Yi Meng CHAI
Part of the Educational Assessment, Evaluation, and Research Commons, Higher Education
Commons, and the Software Engineering Commons
This Conference Proceeding Article is brought to you for free and open access by the School of Computing and
Information Systems at Institutional Knowledge at Singapore Management University. It has been accepted for
inclusion in Research Collection School Of Computing and Information Systems by an authorized administrator of
Institutional Knowledge at Singapore Management University. For more information, please email
[email protected].
Automatic Grading of Short Answers Using Large
Language Models in Software Engineering Courses
Ta Nguyen Binh Duong, Chai Yi Meng
School of Computing and Information Systems
Singapore Management University
Email: [email protected]
Abstract—Short-answer based questions have been used widely techniques [3], [4], which require careful feature extractions
due to their effectiveness in assessing whether the desired learning before model training and score prediction. More recent ap-
outcomes have been attained by students. However, due to their proaches leverage deep learning techniques, which are able
open-ended nature, many different answers could be considered
entirely or partially correct for the same question. In the context to learn representative features from huge amounts of data
of computer science and software engineering courses where the instead manual feature engineering work. The deep learning
enrolment has been increasing recently, manual grading of short- based approaches may suffer from the lack of data on short
answer questions is a time-consuming and tedious process for answer based assessments.
instructors. Latest advances in pre-trained LLMs, e.g., OpenAI’s release
In software engineering courses, assessments concern not just
coding but many other aspects of software development such
of the GPT family of models, have enabled researchers to
as system analysis, architecture design, software processes and further investigate autograding for text-based responses from
operation methodologies such as Agile and DevOps. However, students, e.g., [5], [6]. However, not much work has been done
existing work in automatic grading/scoring of text-based answers for LLM-based autograding of short answers in the context of
in computing courses have been focusing more on coding-oriented computing education, especially software engineering courses
questions. In this work, we consider the problem of autograding a
broader range of short answers in software engineering courses.
[7]. Such courses cover a wide range of topics including
We propose an automated grading system incorporating both programming, system design, Agile processes, DevOps prac-
text embedding and completion approaches based on recently tices in system deployment, operation, and maintenance, etc.
introduced pre-trained large language models (LLMs) such as Assessment questions in these topics, e.g., “list one problem
GPT-3.5/4. We design and implement a web-based system so with agile processes such as Scrum?” could have a wide
that students and instructors can easily leverage autograding for
learning and teaching. Finally, we conduct an extensive evaluation
range of correct answers. We note that automated grading in
of our automated grading approaches. We use a popular public computing education has been focusing more on coding based
dataset in the computing education domain and a new software questions [8], [9], which could have a rather limited set of valid
engineering dataset of our own. The results demonstrate the responses and could be graded by running pre-determined unit
effectiveness of our approach, and provide useful insights for testcases.
further research in this area of AI-enabled education.
In this work, we consider the problem of autograding of
Index Terms—automatic grading, large language models, em-
bedding, software engineering courses, short answers short answers in the context of software engineering courses,
which are not limited to just programming/coding questions.
We makes the following contributions in this paper:
I. I NTRODUCTION
• We propose an automated grading method incorporating
Assessments in education can be done in many forms, for both text embedding and completion approaches based
instance multiple-choice questions, essays, short written re- on recently introduced pre-trained LLMs such as GPT-
sponses, coding tests, etc. We note that questions which require 3.5-Turbo and GPT-4. The completion-based autograding
short textual answers are popular in educational assessments approach also leverages Retrieval Augmented Generation
[1]. One of the main reasons is that they could be considered to [10] for better grading accuracy.
be more effective compared to multiple-choice questions due • We design and implement a web based system for our
to a greater level of information retrieval from memory when LLM-based autograding approaches. The system targets
trying to come up with answers [2]. However, short-answer both instructors and students. Instructors can use the
questions can accept different correct and partially correct web system to do manual adjustments of the autograded
answers. Grading many of such answers is undoubtedly a very scores and to provide additional feedback to answers from
tedious and time-consuming process, especially in computing students; while students can practice question answering
courses at the university level where the number of students with instant grading.
has been increasing significantly recently. • We compile a new dataset containing popular questions
Automatic grading/scoring of short textual answers is an and short answers in the context of our software engineer-
established problem in technology-enabled education. Various ing courses. These courses cover important software con-
existing approaches made use of traditional machine learning cepts in addition to programming, namely system design,
software testing, Agile processes, DevOps practices, etc. were described in [14] and [15], which leverage the Siamese
This dataset complements existing ones, e.g., the Mohler Bidirectional Long Short-Term Memory networks (BiLSTMs).
dataset [3] which is mainly about programming based Their results were also reported using the same dataset in
questions. [3]. More recent approaches to short answer grading including
• We conduct an extensive evaluation of our automated [16], which uses the Transformer architecture [17] and other
grading approach using the new dataset, together with optimization techniques to address the problem of insufficient
another public dataset in the domain of computer science. training data.
To this end, we compare our approach in short-answer
grading to some of the most popular existing deep learn- B. LLM-based approaches
ing based approaches including paragraph embeddings
Due to recent advances in pre-trained LLMs, there have
and Siamese long short-term memory (LSTM) neural
been a growing body of work making use of LLMs for
networks. The results demonstrate the effectiveness of our
automated grading in educational contexts. In particular, [5]
approach, and provide useful insights for further work in
investigated text augmentation techniques using GPT-3.5 to
this area.
improve the dataset for training machine learning models
This paper is organized as follows. Section II discusses which will be used to provide automated feedback to students.
related work in short answer autograding, especially recent [6] evaluated the accuracy of using GPT-3’s text-davinci-003
work in deep learning and LLMs. Section III describes our model for automatic grading of essays. Using 12,100 essays,
approaches to autograding of short answers. Section IV pro- it concluded that GPT-3 models, combined with linguistic
vides details on our web based system implementation. Section features, provided a high level of accuracy. Note that this is for
V presents our evaluation methodology, while Section VI essay scoring, not short answer grading in computer science
discusses the experimental results. Section VII concludes the related courses. [18] also used OpenAI’s GPT-3.5 text-davinci-
paper and highlights possible future work. 003 model for one-shot prompting and the text completion
II. R ELATED WORK API to do automatic grading. However, they made use of the
Prize Short Answer Scoring dataset, which includes questions
Below we summarize several key recent and existing work from science, biology, English, etc., but not computer science
in automatic short answer grading. We will compare the related courses. Similarly, [19] investigated automated scoring
reported performance for these approaches to ours in this paper for the subject of divergent thinking. The authors performed
where possible. fine-tuning of LLMs on human-judged responses. The authors
A. Deep learning based approaches of [20] evaluated GPT-4 for short answer grading using the
SciEntsBank and Beetle datasets. They found that for these
Traditional machine learning techniques have been applied
datasets, GPT-4’s performance is comparable to manually
to the problem of automated short answer grading for many
crafted machine learning models.
years. In these approaches, e.g., [3], [4], [11], manual feature
Regarding autograding of short answers in the context of
engineering is needed before training the models on a part
computer science related courses, very recent works including
of the dataset. For instance, [4] described feature extraction
[21] which made use of ChatGPT for grading exams in a data
methods including text similarity, question demoting, term
science course. They also evaluated ChatGPT for a German-
weighting, etc. Using these features, a simple ridge regression
based information system introductory course. They found that
model was trained. The authors reported autograding perfor-
such LLM deployment can be valuable, but it is not yet ready
mance, e.g., accuracy, in the form of the Pearson correlation
for fully automated grading. ChatGPT was also used in [22]
coefficient value of 0.592, and the root mean squared error
to provide corrections to open-ended answers from software
(RMSE) of 0.887. They used a dataset consisting of many
development professionals participating in technical training.
computer programing related questions and answers [3] made
The authors found that subject matter experts usually agreed
available by Mohler et al.
with the corrections given by ChatGPT. None of these work
Recently, deep learning based approaches have gained
made use of well-known datasets in computer science courses
much popularity. Deep learning based autograders automat-
such as the Mohler dataset [3]. The exception is [7], in which
ically learn representative features from large datasets. In
the authors compared pre-trained LLMs such as ELMo, BERT,
[12], the authors did a comprehensive survey of deep learn-
GPT-2, etc., directly on their autograding performance for the
ing approaches, including embedding, sequential models and
Mohler dataset. We note that this work was done a while ago
attention-based neural networks for short answer grading.
so the latest GPT models were not included.
The authors then showed that the features learned by deep
learning methods mainly work as complementary to manually
C. Summary
crafted features of the autograding model. [13] considered
automatic grading of short answers using two different types We note that existing deep learning based approaches to
of paragraph embedding models. They obtained a Pearson short answer grading could provide good accuracy, but they
correlation coefficient of 0.569 and RMSE of 0.797 on the need to be combined with hand-crafted features and require
Mohler dataset [3]. Other neural network based approaches extensive training with large datasets. On the other hand, more
recent approaches based on generative AI, in particular pre- The cosine similarity will range from 0 to 1, with 0 being
trained LLMs, have been focusing more on other educational the least similar and 1 being the most similar. After comparing
domains which are not computer science related. In addition, the cosine similarities between each student answer and all
many of the existing approaches made use of the computer the reference answers, the most similar reference answer to
science dataset from [3] which had been released a while the student answer will be selected (lines 12-14 of Listing
ago. This dataset is about basic data structures and computer 1). A mark will then be given to the student answer which
programming concepts. is proportional to the cosine similarity (line 16 of Listing 1).
In this work, we aim to develop new LLM-based approaches This is done by multiplying the cosine similarity score with
which do not require training, and to evaluate these approaches the reference answer’s score.
using an entirely new dataset obtained from software engineer- The embedding-based autoscoring of short answers can be
ing courses which include many more topics and concepts implemented and deployed to use quickly due to the general
beyond just programming. We plan to release our new dataset availability and affordability of of state-of-the-art text embed-
publicly to encourage further research in this area. ding models such as text-embedding-ada-002. For instance, its
III. LLM- BASED AUTO -G RADING A PPROACHES pricing as the time of writing is just $0.0001 per 1K tokens.
However, this approach might require a wide range of possible
In this section, we describe in details our proposed
reference answers to be provided for more accurate grading.
approaches to auto-grading short answers, namely the
For short-answer questions, this is potentially challenging as
embedding-based, and the completion-based approach. Both
there can be a large number of possibly correct answers to a
of the approaches are based on latest advances in pre-trained
single question. We can mitigate this issue by using correct
LLMs, in particular the text embedding and chat completion
answers from students as reference answers. Another issue is
models released publicly by OpenAI.
that, although models such as text-embedding-ada-002 is quite
A. Embedding-based affordable, computing embeddings for answers every time you
Text embeddings are numerical representations of text in need to do grading (lines 8 and 10 of Listing 1) will add to
which words or phrases are represented as a vector of numbers. the total cost. For this, we could use a vector database such
They are used to capture semantic meanings and relationships as Chroma (https://www.trychroma.com) to store and retrieve
between words or phrases, enabling more efficient processing the pre-computed embeddings when required.
and understanding of human languages [23].
B. Completion-based
1 Input: pair of question, answer (Q, A)
2 list R = [reference answers for Q] Completion is essentially the generation of output based
3
4 Output: numerical score S for A on the text prompts given to a pre-trained LLM such as
5 Steps: GPT-3.5-Turbo. Prompt construction, or prompt engineering
6 Ch = 0 for LLMs is an active research area [25]. In a prompt, we
7 Sq = 0
8 Compute the embedding Ea for A may provide relevant instructions, examples, etc., in natural
9 For each reference answer Ar in R languages. Such data would help direct the model to produce
10 Compute the embedding Er for Ar the desired output. One way to do prompting is called zero-
11 Compute a cosine similarity Cr = cos(Er, Ea)
12 If Cr > Ch: shot, in which a query is sent to the LLM without concrete
13 Ch = Cr examples of expected results. On the other hand, in few-
14 Sq = score of Ar shot prompting, we provide multiple examples of questions
15
16 S = Ch * Sq and their corresponding answers in a simulated multi-turn
17 Return S conversation with the LLM. At the end of the conversation,
Listing 1. Embedding-based autograding approach we can ask the LLM to score a student answer for a given
question.
The algorithm for our embedding-based autoscoring ap-
In this completion-based autograding approach, we make
proach is shown in Listing 1. In this approach, the algorithm
use of the OpenAI’s Chat Completions API1 . The API defines
computes the embeddings of all the reference answers and
prompts as sequences of messages. Each message has two
student answers for a particular question using an available text
components, namely role and content. The role can be “sys-
embedding model (lines 8-11 of Listing 1). In this work, we
tem”, “user”, or “assistant”. A message with ”system” role
use OpenAI’s text-embedding-ada-002 model as it is OpenAI’s
is usually used first to define the behavior of the model. A
best and most cost-effective embedding model as of 2023.
“user” message gives instructions, and an “assistant” message
The cosine similarity [24] between each reference answer
provides an example of the desired output. The prompt is
and student answer (to be auto-graded) is then calculated using
constructed with all the required messages and sent to the LLM
their corresponding embedding vectors, A and B respectively,
via an API call. Our completion-based autograding approach
as follows:
is shown in Listing 2.
Pn
A·B i=1 Ai Bi 1 https://platform.openai.com/docs/guides/text-generation/chat-completions-
cos(A, B) = = pPn pPn (1)
∥A∥∥B∥ i=1 (A i)
2
i=1 (Bi )
2 api
1 Message 1: {"role": "system", "content": "You are an We note that more examples used translates to more cost,
AI assistant for teaching software engineering as models such as OpenAI’s offerings charge based on the
concepts."}
2
number of tokens in the requests and responses. However, in
3 # Start providing examples in the prompt here this work we focus on ways to provide more relevant examples
4 Message 2: {"role": "user", "content": "Given the to improve grading accuracy rather than cost.
question ’What could be a problem with
monolithic software?’, provide a score for the In our completion-based grading approach, we split the an-
corresponding answer ’Scaling needs to be done swers in a dataset into three different categories, namely low-
for the whole application’."} quality (having low marks), medium-quality (having average
5 Message 3: {"role": "assistant", "content": "Score:
4/4"} to quite decent marks), and high-quality (having full marks).
6 During the automated grading process for a particular question,
7 Message 4: {"role": "user", "content": "Given the our algorithm will select a random answer from each answer
question ’What could be a problem with
monolithic software?’, provide a score for the category and construct the appropriate prompt to be sent to
corresponding answer ’It is easier to develop’." the LLM. The number of answers to be used as examples for
} each category can be configurable. For instance, in this work
8 Message 5: {"role": "assistant", "content": "Score:
1/4"} we have considered using 1, 2, and 3 answers per category as
9 examples. As a result, the completion-based grading approach
10 # Provide more examples using additional messages if can construct prompts having a total of 3, 6, or 9 examples
needed
11
(for 3 categories). We believe that this approach provides the
12 # This message is used for autograding LLM with a better understanding of the grading rubrics for
13 Last message: {"role": "user", "content": "Given the each given question.
question ’What could be a problem with
monolithic software?’, provide a score for the Incorporating Retrieval Augmented Generation (RAG):
corresponding answer ’It is hard to make changes Pre-trained LLMs have been shown to perform well in many
.’"} common NLP tasks. However, their knowledge base could
14
15 # The LLM will respond with an appropriate score in not be easily revised or expanded beyond simple fine-tuning,
the below message and they may hallucinate in their responses [26]. RAG [10],
16 Message: {"role": "assistant", "content": "Score: < [27] enables a LLM to access external knowledge databases
predicted_score>"}
to complete domain-specific tasks with better consistency,
Listing 2. Completion-based autograding approach reliability and reduced hallucinations. Given an input, e.g.,
When instructors need to do autograding, the completion- a question, RAG retrieves relevant texts from the specified
based approach constructs a sequence of messages as de- external knowledge databases, and adds those texts as context
scribed in Listing 2. Each “user” message provides the ques- to the prompt to be sent to the LLM. With more appropriate
tion and a corresponding answer, which could be a reference context, the LLM can generate output with higher quality.
answer, or a student answer. This “user” message is imme- In the completion-based autograding approach using RAG,
diately followed by an “assistant” message which has the we make use of the course content to provide additional
score given for the answer. Together, this pair of messages context. The aim is to improve grading accuracy and reliability.
provides a concrete example of how scoring should be done For our software engineering courses, we make available PDF
for a question and its corresponding answer. For example, in lecture notes for each topic covered, e.g., automation, software
Listing 2, messages 2 and 3 provide a score of 4/4 for the processes, software testing, etc. The lecture notes will then
following question/answer pair: “What could be a problem be parsed and partitioned into chunks of texts for which
with monolithic software?” / ”Scaling needs to be done for corresponding text embeddings will be computed. Given a
the whole application”. Similarly, messages 4 and 5 provide specific question to be graded, the most relevant chunks will be
another example for the same question with a different answer. retrieved based on comparing the embedding of the question
We can give more examples to the prompt by providing more versus the embedding of each chunk. The relevant chunks are
of such pair of messages. Finally, the last “user” message then fed into the LLM as the grading context.
in the prompt will provide the answer to be graded for the
same question which has been used in the previous examples. IV. I MPLEMENTATION
Following the Chat Completions API, the LLM, e.g., GPT- For instructors and students to take advantage of LLM-
3.5-Turbo, will provide a predicted score for this answer. based autograding, we design and implement a web system
In the below, we discuss two important considerations for incorporating both the embedding-based and completion-based
the completion-based autograding approach, namely example autograding approaches. The system components are shown in
selection and the incorporation of RAG (Retrieval Augmented Figure 1.
Generation):
Selecting examples for prompt construction: The number A. Components
of examples in a prompt could be varied. Providing more The system is designed for both computing students and
examples would likely yield better scoring results as the instructors at the undergraduate level. The web interface pro-
LLM can learn more effectively using the relevant examples. vides functionalities for instructors to create/read/update/delete
and MongoDB as the database. Figure 2 shows the web inter-
face in which students can practice answering short questions.
When students answer a question, their answers and the marks
given by our autograding approaches will be automatically
added into the database. In Figure 3, the instructors can edit
any answers and marks given by the autograding approaches,
as well as providing additional feedback for each answer.
Instructors can also add more questions/answers for students
to practice.
TABLE I TABLE II
M OHLER DATASET: S AMPLE Q UESTION , A NSWERS AND S CORES SE LECTURE NOTES INCORPORATED AS GRADING CONTEXTS