
Automatic Grading of Short Answers Using Large
Language Models in Software Engineering Courses
Ta Nguyen Binh Duong, Chai Yi Meng
School of Computing and Information Systems
Singapore Management University
Email: [email protected]

Abstract—Short-answer based questions have been used widely due to their effectiveness in assessing whether the desired learning outcomes have been attained by students. However, due to their open-ended nature, many different answers could be considered entirely or partially correct for the same question. In the context of computer science and software engineering courses, where enrolment has been increasing recently, manual grading of short-answer questions is a time-consuming and tedious process for instructors.

In software engineering courses, assessments concern not just coding but many other aspects of software development, such as system analysis, architecture design, software processes and operation methodologies such as Agile and DevOps. However, existing work on automatic grading/scoring of text-based answers in computing courses has been focusing more on coding-oriented questions. In this work, we consider the problem of autograding a broader range of short answers in software engineering courses. We propose an automated grading system incorporating both text embedding and completion approaches based on recently introduced pre-trained large language models (LLMs) such as GPT-3.5/4. We design and implement a web-based system so that students and instructors can easily leverage autograding for learning and teaching. Finally, we conduct an extensive evaluation of our automated grading approaches. We use a popular public dataset in the computing education domain and a new software engineering dataset of our own. The results demonstrate the effectiveness of our approach and provide useful insights for further research in this area of AI-enabled education.

Index Terms—automatic grading, large language models, embedding, software engineering courses, short answers

I. INTRODUCTION

Assessments in education can take many forms, for instance multiple-choice questions, essays, short written responses, coding tests, etc. Questions which require short textual answers are popular in educational assessments [1]. One of the main reasons is that they can be considered more effective than multiple-choice questions, due to the greater level of information retrieval from memory when students formulate their answers [2]. However, short-answer questions can accept different correct and partially correct answers. Grading many such answers is undoubtedly a tedious and time-consuming process, especially in computing courses at the university level, where the number of students has been increasing significantly.

Automatic grading/scoring of short textual answers is an established problem in technology-enabled education. Various existing approaches made use of traditional machine learning techniques [3], [4], which require careful feature extraction before model training and score prediction. More recent approaches leverage deep learning techniques, which are able to learn representative features from large amounts of data instead of relying on manual feature engineering. However, deep learning based approaches may suffer from the lack of data on short-answer based assessments.

The latest advances in pre-trained LLMs, e.g., OpenAI's release of the GPT family of models, have enabled researchers to further investigate autograding of text-based responses from students, e.g., [5], [6]. However, not much work has been done on LLM-based autograding of short answers in the context of computing education, especially software engineering courses [7]. Such courses cover a wide range of topics including programming, system design, Agile processes, and DevOps practices in system deployment, operation and maintenance. Assessment questions on these topics, e.g., "list one problem with agile processes such as Scrum?", could have a wide range of correct answers. We note that automated grading in computing education has been focusing more on coding based questions [8], [9], which have a rather limited set of valid responses and can be graded by running pre-determined unit test cases.

In this work, we consider the problem of autograding short answers in the context of software engineering courses, which are not limited to just programming/coding questions. We make the following contributions in this paper:
• We propose an automated grading method incorporating both text embedding and completion approaches based on recently introduced pre-trained LLMs such as GPT-3.5-Turbo and GPT-4. The completion-based autograding approach also leverages Retrieval Augmented Generation [10] for better grading accuracy.
• We design and implement a web based system for our LLM-based autograding approaches. The system targets both instructors and students. Instructors can use the web system to manually adjust the autograded scores and to provide additional feedback on answers from students, while students can practice question answering with instant grading.
• We compile a new dataset containing popular questions and short answers from our software engineering courses. These courses cover important software concepts in addition to programming, namely system design,
software testing, Agile processes, DevOps practices, etc. This dataset complements existing ones, e.g., the Mohler dataset [3], which is mainly about programming based questions.
• We conduct an extensive evaluation of our automated grading approach using the new dataset, together with another public dataset in the domain of computer science. To this end, we compare our approach to short-answer grading with some of the most popular existing deep learning based approaches, including paragraph embeddings and Siamese long short-term memory (LSTM) neural networks. The results demonstrate the effectiveness of our approach and provide useful insights for further work in this area.

This paper is organized as follows. Section II discusses related work in short answer autograding, especially recent work in deep learning and LLMs. Section III describes our approaches to autograding of short answers. Section IV provides details on our web based system implementation. Section V presents our evaluation methodology, while Section VI discusses the experimental results. Section VII concludes the paper and highlights possible future work.

II. RELATED WORK

Below we summarize several key recent and existing works on automatic short answer grading. We compare the reported performance of these approaches to ours where possible.

A. Deep learning based approaches

Traditional machine learning techniques have been applied to the problem of automated short answer grading for many years. In these approaches, e.g., [3], [4], [11], manual feature engineering is needed before training the models on a part of the dataset. For instance, [4] described feature extraction methods including text similarity, question demoting, term weighting, etc. Using these features, a simple ridge regression model was trained. The authors reported autograding performance in the form of a Pearson correlation coefficient of 0.592 and a root mean squared error (RMSE) of 0.887. They used a dataset consisting of many computer programming related questions and answers [3] made available by Mohler et al.

Recently, deep learning based approaches have gained much popularity. Deep learning based autograders automatically learn representative features from large datasets. In [12], the authors did a comprehensive survey of deep learning approaches, including embedding, sequential models and attention-based neural networks for short answer grading. The authors showed that the features learned by deep learning methods mainly work as a complement to the manually crafted features of the autograding model. [13] considered automatic grading of short answers using two different types of paragraph embedding models. They obtained a Pearson correlation coefficient of 0.569 and an RMSE of 0.797 on the Mohler dataset [3]. Other neural network based approaches were described in [14] and [15], which leverage Siamese Bidirectional Long Short-Term Memory networks (BiLSTMs). Their results were also reported on the same dataset from [3]. More recent approaches to short answer grading include [16], which uses the Transformer architecture [17] and other optimization techniques to address the problem of insufficient training data.

B. LLM-based approaches

Due to recent advances in pre-trained LLMs, there has been a growing body of work making use of LLMs for automated grading in educational contexts. In particular, [5] investigated text augmentation techniques using GPT-3.5 to improve the dataset for training machine learning models which are then used to provide automated feedback to students. [6] evaluated the accuracy of using GPT-3's text-davinci-003 model for automatic grading of essays. Using 12,100 essays, it concluded that GPT-3 models, combined with linguistic features, provided a high level of accuracy. Note that this is for essay scoring, not short answer grading in computer science related courses. [18] also used OpenAI's GPT-3.5 text-davinci-003 model with one-shot prompting and the text completion API to do automatic grading. However, they made use of the Prize Short Answer Scoring dataset, which includes questions from science, biology, English, etc., but not computer science related courses. Similarly, [19] investigated automated scoring for the subject of divergent thinking. The authors performed fine-tuning of LLMs on human-judged responses. The authors of [20] evaluated GPT-4 for short answer grading using the SciEntsBank and Beetle datasets. They found that for these datasets, GPT-4's performance is comparable to manually crafted machine learning models.

Regarding autograding of short answers in the context of computer science related courses, very recent works include [21], which made use of ChatGPT for grading exams in a data science course. They also evaluated ChatGPT on a German-language introductory information systems course. They found that such LLM deployments can be valuable, but they are not yet ready for fully automated grading. ChatGPT was also used in [22] to provide corrections to open-ended answers from software development professionals participating in technical training. The authors found that subject matter experts usually agreed with the corrections given by ChatGPT. None of these works made use of well-known datasets in computer science courses such as the Mohler dataset [3]. The exception is [7], in which the authors compared pre-trained LLMs such as ELMo, BERT, GPT-2, etc., directly on their autograding performance for the Mohler dataset. We note that this work was done a while ago, so the latest GPT models were not included.

C. Summary

We note that existing deep learning based approaches to short answer grading can provide good accuracy, but they need to be combined with hand-crafted features and require extensive training with large datasets. On the other hand, more
recent approaches based on generative AI, in particular pre-trained LLMs, have been focusing more on other educational domains which are not computer science related. In addition, many of the existing approaches made use of the computer science dataset from [3], which had been released a while ago. This dataset is about basic data structures and computer programming concepts.

In this work, we aim to develop new LLM-based approaches which do not require training, and to evaluate these approaches using an entirely new dataset obtained from software engineering courses which include many more topics and concepts beyond just programming. We plan to release our new dataset publicly to encourage further research in this area.

III. LLM-BASED AUTO-GRADING APPROACHES

In this section, we describe in detail our proposed approaches to auto-grading short answers, namely the embedding-based and the completion-based approach. Both approaches are based on the latest advances in pre-trained LLMs, in particular the text embedding and chat completion models released publicly by OpenAI.

A. Embedding-based

Text embeddings are numerical representations of text in which words or phrases are represented as a vector of numbers. They are used to capture semantic meanings and relationships between words or phrases, enabling more efficient processing and understanding of human languages [23].

1  Input: pair of question, answer (Q, A)
2  list R = [reference answers for Q]
3
4  Output: numerical score S for A
5  Steps:
6  Ch = 0
7  Sq = 0
8  Compute the embedding Ea for A
9  For each reference answer Ar in R
10   Compute the embedding Er for Ar
11   Compute a cosine similarity Cr = cos(Er, Ea)
12   If Cr > Ch:
13     Ch = Cr
14     Sq = score of Ar
15
16 S = Ch * Sq
17 Return S

Listing 1. Embedding-based autograding approach
The algorithm for our embedding-based autoscoring approach is shown in Listing 1. In this approach, the algorithm computes the embeddings of all the reference answers and student answers for a particular question using an available text embedding model (lines 8-11 of Listing 1). In this work, we use OpenAI's text-embedding-ada-002 model as it is OpenAI's best and most cost-effective embedding model as of 2023.

The cosine similarity [24] between each reference answer and student answer (to be auto-graded) is then calculated using their corresponding embedding vectors, A and B respectively, as follows:

cos(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}   (1)

The cosine similarity ranges from 0 to 1, with 0 being the least similar and 1 being the most similar. After comparing the cosine similarities between the student answer and all the reference answers, the reference answer most similar to the student answer is selected (lines 12-14 of Listing 1). A mark proportional to the cosine similarity is then given to the student answer (line 16 of Listing 1). This is done by multiplying the cosine similarity score with the reference answer's score.

The embedding-based autoscoring of short answers can be implemented and deployed quickly due to the general availability and affordability of state-of-the-art text embedding models such as text-embedding-ada-002. For instance, its pricing at the time of writing is just $0.0001 per 1K tokens. However, this approach might require a wide range of possible reference answers to be provided for more accurate grading. For short-answer questions, this is potentially challenging as there can be a large number of possibly correct answers to a single question. We can mitigate this issue by using correct answers from students as reference answers. Another issue is that, although models such as text-embedding-ada-002 are quite affordable, computing embeddings for answers every time grading is needed (lines 8 and 10 of Listing 1) will add to the total cost. For this, we could use a vector database such as Chroma (https://www.trychroma.com) to store and retrieve the pre-computed embeddings when required.
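To make the above concrete, the following is a minimal Python sketch of Listing 1 using the OpenAI Python client (v1.x) and text-embedding-ada-002; the function names and the usage comment at the end are illustrative and not part of our actual system.

import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    # Compute the embedding vector for a piece of text (lines 8 and 10 of Listing 1).
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine(a, b):
    # Cosine similarity as defined in Equation (1).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def grade_by_embedding(student_answer, reference_answers):
    # reference_answers: list of (reference answer text, mark awarded to that answer).
    ea = embed(student_answer)
    best_sim, best_score = 0.0, 0.0
    for ref_text, ref_score in reference_answers:
        sim = cosine(embed(ref_text), ea)
        if sim > best_sim:              # keep the most similar reference answer
            best_sim, best_score = sim, ref_score
    return best_sim * best_score        # mark proportional to the similarity (line 16)

# Example: grade_by_embedding("a pointer holds a memory location",
#                             [("A variable that contains the address in memory of another variable.", 5.0)])

In practice the reference embeddings would be pre-computed and cached (for example in a vector database, as discussed above) rather than recomputed on every call.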
B. Completion-based

Completion is essentially the generation of output based on the text prompt given to a pre-trained LLM such as GPT-3.5-Turbo. Prompt construction, or prompt engineering, for LLMs is an active research area [25]. In a prompt, we may provide relevant instructions, examples, etc., in natural language. Such data helps direct the model to produce the desired output. One way to do prompting is called zero-shot, in which a query is sent to the LLM without concrete examples of expected results. On the other hand, in few-shot prompting, we provide multiple examples of questions and their corresponding answers in a simulated multi-turn conversation with the LLM. At the end of the conversation, we can ask the LLM to score a student answer for a given question.

In this completion-based autograding approach, we make use of OpenAI's Chat Completions API (https://platform.openai.com/docs/guides/text-generation/chat-completions-api). The API defines prompts as sequences of messages. Each message has two components, namely role and content. The role can be "system", "user", or "assistant". A message with the "system" role is usually used first to define the behavior of the model. A "user" message gives instructions, and an "assistant" message provides an example of the desired output. The prompt is constructed with all the required messages and sent to the LLM via an API call. Our completion-based autograding approach is shown in Listing 2.

1  Message 1: {"role": "system", "content": "You are an AI assistant for teaching software engineering concepts."}
2
3  # Start providing examples in the prompt here
4  Message 2: {"role": "user", "content": "Given the question 'What could be a problem with monolithic software?', provide a score for the corresponding answer 'Scaling needs to be done for the whole application'."}
5  Message 3: {"role": "assistant", "content": "Score: 4/4"}
6
7  Message 4: {"role": "user", "content": "Given the question 'What could be a problem with monolithic software?', provide a score for the corresponding answer 'It is easier to develop'."}
8  Message 5: {"role": "assistant", "content": "Score: 1/4"}
9
10 # Provide more examples using additional messages if needed
11
12 # This message is used for autograding
13 Last message: {"role": "user", "content": "Given the question 'What could be a problem with monolithic software?', provide a score for the corresponding answer 'It is hard to make changes.'"}
14
15 # The LLM will respond with an appropriate score in the message below
16 Message: {"role": "assistant", "content": "Score: <predicted_score>"}

Listing 2. Completion-based autograding approach
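The message sequence of Listing 2 maps directly onto the Chat Completions API. A minimal Python sketch is shown below, assuming the OpenAI Python client (v1.x); the score-parsing convention and helper names are ours for illustration only.

import re
from openai import OpenAI

client = OpenAI()

def build_messages(question, graded_examples, answer_to_grade, max_mark):
    # graded_examples: list of (example answer, awarded mark) pairs for this question.
    msgs = [{"role": "system",
             "content": "You are an AI assistant for teaching software engineering concepts."}]
    for ex_answer, ex_mark in graded_examples:
        msgs.append({"role": "user",
                     "content": f"Given the question '{question}', provide a score for the "
                                f"corresponding answer '{ex_answer}'."})
        msgs.append({"role": "assistant", "content": f"Score: {ex_mark}/{max_mark}"})
    msgs.append({"role": "user",
                 "content": f"Given the question '{question}', provide a score for the "
                            f"corresponding answer '{answer_to_grade}'."})
    return msgs

def grade_by_completion(question, graded_examples, answer_to_grade, max_mark=4):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the grading as deterministic as the API allows
        messages=build_messages(question, graded_examples, answer_to_grade, max_mark),
    )
    reply = resp.choices[0].message.content
    match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*\d+", reply)  # expects "Score: x/y"
    return float(match.group(1)) if match else None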
When instructors need to do autograding, the completion-based approach constructs a sequence of messages as described in Listing 2. Each "user" message provides the question and a corresponding answer, which could be a reference answer or a student answer. This "user" message is immediately followed by an "assistant" message which contains the score given for that answer. Together, this pair of messages provides a concrete example of how scoring should be done for a question and its corresponding answer. For example, in Listing 2, messages 2 and 3 provide a score of 4/4 for the question/answer pair "What could be a problem with monolithic software?" / "Scaling needs to be done for the whole application". Similarly, messages 4 and 5 provide another example for the same question with a different answer. We can add more examples to the prompt by providing more such pairs of messages. Finally, the last "user" message in the prompt provides the answer to be graded for the same question used in the previous examples. Following the Chat Completions API, the LLM, e.g., GPT-3.5-Turbo, will respond with a predicted score for this answer.

Below, we discuss two important considerations for the completion-based autograding approach, namely example selection and the incorporation of RAG (Retrieval Augmented Generation).

Selecting examples for prompt construction: The number of examples in a prompt can be varied. Providing more examples would likely yield better scoring results, as the LLM can learn more effectively from the relevant examples. We note that using more examples also translates to more cost, as models such as OpenAI's charge based on the number of tokens in the requests and responses. However, in this work we focus on ways to provide more relevant examples to improve grading accuracy rather than on cost.

In our completion-based grading approach, we split the answers in a dataset into three different categories, namely low-quality (having low marks), medium-quality (having average to quite decent marks), and high-quality (having full marks). During the automated grading process for a particular question, our algorithm selects a random answer from each answer category and constructs the appropriate prompt to be sent to the LLM. The number of answers used as examples for each category is configurable. For instance, in this work we have considered using 1, 2, and 3 answers per category as examples. As a result, the completion-based grading approach can construct prompts having a total of 3, 6, or 9 examples (for 3 categories). We believe that this approach gives the LLM a better understanding of the grading rubrics for each given question. (A short sketch of this selection step is given at the end of this section.)

Incorporating Retrieval Augmented Generation (RAG): Pre-trained LLMs have been shown to perform well in many common NLP tasks. However, their knowledge base cannot be easily revised or expanded beyond simple fine-tuning, and they may hallucinate in their responses [26]. RAG [10], [27] enables an LLM to access external knowledge databases to complete domain-specific tasks with better consistency, reliability and reduced hallucination. Given an input, e.g., a question, RAG retrieves relevant texts from the specified external knowledge databases and adds those texts as context to the prompt sent to the LLM. With more appropriate context, the LLM can generate output of higher quality.

In the completion-based autograding approach using RAG, we make use of the course content to provide additional context. The aim is to improve grading accuracy and reliability. For our software engineering courses, we make available PDF lecture notes for each topic covered, e.g., automation, software processes, software testing, etc. The lecture notes are parsed and partitioned into chunks of text, for which corresponding text embeddings are computed. Given a specific question to be graded, the most relevant chunks are retrieved by comparing the embedding of the question against the embedding of each chunk. The relevant chunks are then fed into the LLM as the grading context.
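The following sketch illustrates the answer partitioning and example selection described above. The mark thresholds roughly mirror the split used later for our SE dataset (marks out of 4); the function and variable names are illustrative only.

import random

def categorise(mark, max_mark=4.0):
    # Bucket a graded answer into low/medium/high quality.
    if mark <= 0.25 * max_mark:
        return "low"
    if mark < max_mark:
        return "medium"
    return "high"

def select_examples(graded_answers, per_category=2):
    # graded_answers: list of (answer text, mark) pairs for one question.
    buckets = {"low": [], "medium": [], "high": []}
    for answer, mark in graded_answers:
        buckets[categorise(mark)].append((answer, mark))
    examples = []
    for bucket in buckets.values():
        examples.extend(random.sample(bucket, min(per_category, len(bucket))))
    return examples  # 3, 6 or 9 examples for per_category = 1, 2, 3

The selected (answer, mark) pairs can then be used as the graded examples when assembling a prompt in the style of Listing 2.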
IV. IMPLEMENTATION

For instructors and students to take advantage of LLM-based autograding, we design and implement a web system incorporating both the embedding-based and completion-based autograding approaches. The system components are shown in Figure 1.

Fig. 1. Components in our web based autograding system

A. Components

The system is designed for both computing students and instructors at the undergraduate level. The web interface provides functionalities for instructors to create/read/update/delete (CRUD) questions and answers, and to monitor student performance. Students can use the system as a way to practice for quizzes by answering questions according to the topics covered in the course. In this way, students continue to provide more data, so the system can get better at autograding over time. The database ensures that all questions/answers/marks are persisted.

We implement a data partitioning mechanism to automatically divide the student answers into categories, e.g., high quality, medium quality, etc., as mentioned in Section III. The mechanism should be rerun once more answers from students have been added to the database. For context extraction, we use OpenAI's text-embedding-ada-002 and the Faiss library [28], a popular package for similarity search developed at Meta AI Research, to compute and extract the chunks of lecture notes relevant to a question which needs to be auto-graded. The context is then incorporated into the prompt together with the grading examples.
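A minimal sketch of this Faiss-based context extraction is shown below, assuming chunk embeddings from text-embedding-ada-002; the chunking strategy, helper names and the number of retrieved chunks are illustrative assumptions rather than the exact implementation.

import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding, dtype="float32")

def build_index(chunks):
    # Index L2-normalised embeddings so that inner product equals cosine similarity.
    vectors = np.stack([embed(c) for c in chunks])
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def retrieve_context(question, chunks, index, k=3):
    query = embed(question).reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)  # indices of the top-k most similar chunks
    return "\n\n".join(chunks[i] for i in ids[0])

The returned text is what gets prepended to the grading prompt as the additional context.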
B. LLM deployments

OpenAI's GPT LLMs are used in the implementation of both the embedding-based and completion-based automated grading approaches. These pre-trained LLMs are deployed in the Azure cloud and accessible via web APIs. In particular, our system accesses the Chat Completions API and the text embedding API via the endpoints at https://api.openai.com/v1/chat/completions and https://api.openai.com/v1/embeddings, respectively.

In our implementation, we have incorporated three different LLM deployments from OpenAI, namely GPT-3.5-Turbo, GPT-4, and text-embedding-ada-002. As these are pay-per-use models, cost is a concern, especially when more students use our system in the future. For this reason, GPT-3.5-Turbo is used as the default LLM most of the time instead of GPT-4, as the former is quite capable and cost-effective, i.e., about 30x cheaper than the latter. The embedding model provided by OpenAI is rather inexpensive, costing just $0.0001 per 1K tokens.

C. Web-based implementation

We have implemented a complete web based system using Vue.js for the frontend interface, Flask for the backend logic, and MongoDB as the database. Figure 2 shows the web interface in which students can practice answering short questions. When students answer a question, their answers and the marks given by our autograding approaches are automatically added to the database. As shown in Figure 3, instructors can edit any answers and marks given by the autograding approaches, as well as provide additional feedback for each answer. Instructors can also add more questions/answers for students to practice on.

Fig. 2. Students can practice on short answer questions. Their answers will be graded automatically.

Fig. 3. Instructors can edit answers and marks given automatically for any question, as well as providing more feedback.

V. EVALUATION METHODOLOGY

This section describes the datasets and the performance measures used in our evaluation.

A. Datasets

Two complementary datasets are used to evaluate the performance of our proposed embedding-based and completion-based autograding approaches.

Mohler dataset [3]: This dataset has been widely used in evaluating automatic grading approaches for short answers. Most questions in the dataset are about programming/coding concepts. We use it mainly for fair comparisons with existing approaches in this area. The dataset was obtained through exams/assignments given to students in an introductory computer science class at the University of North Texas. Every student answer is marked by two graders, and the average mark is calculated for each answer in the range of 0 to 5 marks, with 5 being the maximum.

The dataset consists of a total of 87 questions with 1 reference answer for each question, but 6 questions are excluded from the dataset as they are not short answer questions. There are 24 to 31 student answers per question in the dataset, summing up to 2273 answers with an average of 28 answers per question. All results obtained through this dataset are based on the 81 questions and 2273 answers. A sample question and its corresponding answers/scores extracted from [3] are shown in Table I.

In our work, the answers in this dataset are split into 3 different categories: low-quality (less than or equal to 2 marks), medium-quality (less than or equal to 4 marks) and high-quality (5 marks). This partitioning is important for the evaluation of the completion-based approach, where different numbers of examples are used for prompting the LLMs.
TABLE I. MOHLER DATASET: SAMPLE QUESTION, ANSWERS AND SCORES
Question:          What is a pointer?
Reference Answer:  A variable that contains the address in memory of another variable.
Student Answer:    a pointer holds a memory location.
Score 1:           5
Score 2:           4
Average Score:     4.5

Software engineering (SE) dataset: This is a dataset on the broader topic of software development, with subtopics consisting of automation, software design, versioning, agile processes, extreme programming (XP), security, solution support, and testing. The summary of each subtopic is listed in Table II. It nicely complements the Mohler dataset, which is mainly about programming. The dataset consists of a total of 32 short-answer questions, with the number of reference answers per question ranging from 1 to 4. There is a total of 421 graded answers with their corresponding marks, with an average of 13 answers per question. The marks for each question range from 0 to 4, with 4 being the maximum. Along with this dataset, there are PDF lecture notes for each of the subtopics, with the number of pages ranging from 22 to 50. The PDFs are used as additional contexts for the grading of questions related to their respective subtopics.

TABLE II. SE LECTURE NOTES INCORPORATED AS GRADING CONTEXTS
Topic              | Summary                                                                              | Pages
Automation         | Software deployment models, infrastructure and CI/CD                                 | 25
Software design    | Dependency injection, REST API design                                                | 32
Software processes | Waterfall, iterative and agile processes                                             | 29
Security           | Confidentiality, integrity, availability approaches                                  | 30
Versioning         | Distributed version control, Git workflows                                           | 36
XP practices       | Code review, refactoring, and pair programming                                       | 30
Software support   | Events, incidents and problem management for software systems                        | 50
Software testing   | Blackbox, whitebox, input space partitioning, unit, integration, regression testing  | 22

The answers in this dataset are also split into 3 different mark categories: low-quality (less than or equal to 1 mark), medium-quality (less than or equal to 3.5 marks) and high-quality (4 marks). An example is shown in Table III.

TABLE III. SE DATASET: SAMPLE QUESTION, ANSWERS AND SCORE
Subtopic:               Automation
Question:               What is one advantage of canary deployment?
Reference Answer:       Can minimize the impact of errors to a subset of users
Graded Answer:          it is cheaper to do
Graded Answer's Score:  1

B. Performance measures

Similar to existing work in this area [15], the results are evaluated using the Pearson correlation coefficient, the mean absolute error (MAE) and the root mean square error (RMSE).

The Pearson correlation coefficient is one of the most common ways to measure linear correlation. The result ranges from -1 to 1 depending on the strength and direction of the relationship. A larger absolute value signifies a stronger correlation between the two variables tested.

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}   (2)

where x_i represents the actual mark given by the human graders and y_i represents the mark given by the autograding approach for the same answer. \bar{x} and \bar{y} represent the means of x and y, respectively.

The mean absolute error (MAE) is calculated by averaging the absolute differences between the actual and predicted marks:

MAE = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|   (3)

Finally, the root mean square error (RMSE) is also widely used to measure the quality of predictions:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{n}}   (4)

In (3) and (4), n is the total number of answers being evaluated, x_i is the actual mark for the i-th answer and y_i is the predicted mark given by the autograding approach for the same answer. Both MAE and RMSE are reliable metrics for assessing the accuracy of predictions.

C. Research questions

In the evaluation, we aim to answer the following research questions (RQs):
• RQ1: Which is the better approach for autograding short answers: embedding-based or completion-based?
• RQ2: How do embedding-based and completion-based autograding compare to existing deep learning based approaches?
• RQ3: Does adding context from relevant lecture notes on the question's topic using RAG produce more accurate grading results when using the completion-based approach?
• RQ4: How do different versions of the same LLM family, e.g., GPT-3.5-Turbo and GPT-4, compare to each other in the autograding of short answers?
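For completeness, the three measures defined in Section V-B can be computed directly from the lists of human and predicted marks, as in the short sketch below (plain Python, no external dependencies assumed).

import math

def evaluate(actual, predicted):
    # Pearson correlation, MAE and RMSE as defined in Equations (2)-(4).
    n = len(actual)
    mean_x, mean_y = sum(actual) / n, sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    var_x = sum((x - mean_x) ** 2 for x in actual)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    pearson = cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0
    mae = sum(abs(x - y) for x, y in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n)
    return {"pearson": pearson, "mae": mae, "rmse": rmse}

# Example: evaluate([4.5, 3.0, 5.0], [4.0, 2.5, 5.0])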
VI. RESULTS

A. RQ1: Embedding-based vs. completion-based

We first compare our proposed embedding-based and completion-based approaches using both the Mohler dataset [3] and the SE dataset. The default LLM used is GPT-3.5-Turbo. The results are shown in Table IV.

TABLE IV. EMBEDDING VS. COMPLETION
Model                   | Pearson Correlation Coefficient | RMSE  | MAE
Mohler Dataset
Embedding               | 0.557 | 0.932 | 0.749
Completion (3 examples) | 0.450 | 1.185 | 0.960
Completion (6 examples) | 0.406 | 0.975 | 0.780
Completion (9 examples) | 0.525 | 0.922 | 0.706
SE Dataset
Embedding               | 0.507 | 2.017 | 1.727
Completion (3 examples) | 0.621 | 1.342 | 1.044
Completion (6 examples) | 0.694 | 1.207 | 0.872
Completion (9 examples) | 0.674 | 1.240 | 0.852

For the Mohler dataset [3], the embedding-based approach produces the highest Pearson correlation coefficient of 0.557. On the other hand, the completion-based approach with 9 examples produces a Pearson correlation of 0.525, as well as the best RMSE and MAE of 0.922 and 0.706, respectively.

For the SE dataset, the results produced by the completion-based approach with 6 examples are quite similar to those produced by the same approach with 9 examples. The former has a higher Pearson correlation coefficient of 0.694 and a lower RMSE of 1.207, while the latter has a lower MAE of 0.852. However, given that more examples have to be passed into the prompt, grading may take longer and cost more. Therefore, in this case the completion-based approach with 6 examples will potentially be more useful due to its balance between efficiency and accuracy.

We can observe a large discrepancy between the results produced by the embedding-based approach for the two different datasets, where the RMSE and MAE for the SE dataset are more than twice those produced for the Mohler dataset. In particular, for the Mohler dataset, the embedding-based approach produced an RMSE and MAE of 0.932 and 0.749, respectively. On the other hand, for the SE dataset, the same approach produced an RMSE and MAE of 2.017 and 1.727. The Pearson correlation coefficient is 0.557 and 0.507 for the Mohler and SE datasets, respectively.

From the experiments, we note that the embedding-based approach is biased towards giving higher scores, due to the relatively high cosine similarities obtained between the student answers and the reference answers in many cases, unless the student's answer is hardly related to the question text. This can be observed from the fact that 86% of the scores predicted for the Mohler dataset [3] are above 4 marks when using the embedding-based approach. At the same time, we also note that the Mohler dataset has about 63% of the answers scoring above 4 out of 5 marks as given by the human graders. On the other hand, the SE dataset only has 27% of the answers scoring above 3 out of 4 marks. This explains why the embedding-based approach does better on the Mohler dataset, but much worse on the SE dataset. In this case, we observe that the completion-based approach is the better way to do autograding of short answers, as it significantly outperforms the embedding-based approach on the SE dataset.

Summary-RQ1: The completion-based approach could be considered the better autograding approach overall, as it is more consistent with the predicted marks given to answers in both datasets, regardless of the actual mark distribution in either of them. In both cases, we will need to provide more relevant examples of answers and actual scores in the completion-based prompt to improve the autograding performance.

B. RQ2: Comparison to deep learning based methods

We now compare the embedding and completion based approaches with other existing autograding methods which made use of deep learning techniques. For a fair comparison, all the approaches are evaluated using performance measures reported previously on the Mohler dataset. Table V summarizes the key results. In particular, we collected results reported in [14], which implemented and evaluated several variations of the Long Short-Term Memory (LSTM) neural network for short answer grading. The authors of [13] considered different types of paragraph embedding models. The use of Bidirectional LSTMs (BiLSTMs) has also been shown to perform well in automated grading [15]. More recently, pre-trained models such as ELMo [7] were also applied to the Mohler dataset.
TABLE V. COMPARING WITH EXISTING DEEP LEARNING BASED AUTOGRADING APPROACHES - MOHLER DATASET
Approach                                  | Pearson Correlation Coefficient | RMSE  | MAE
LSTM-EMD-SVOR [14]                        | 0.550 | 0.830 | 0.490
LSTM-EMD-Logits [14]                      | 0.649 | 1.135 | 0.657
Paragraph embedding (doc2vec) [13]        | 0.569 | 0.797 | -
Siamese BiLSTM + feature engineering [15] | 0.655 | 0.889 | 0.618
Stacked BiLSTM (ELMo) [7]                 | 0.485 | 0.978 | -
Embedding-based                           | 0.557 | 0.932 | 0.749
Completion-based (9 examples)             | 0.525 | 0.922 | 0.706

As observed from Table V, the embedding and completion based approaches may not produce the best result on any single metric. However, they offer a good balance among all three metrics, and their overall performance is quite comparable to the existing deep learning based approaches, e.g., [13], [14]. It is important to note that our approaches required no extensive training or fine-tuning with a large labelled dataset. On the other hand, existing deep learning based approaches incur significant training cost on a large part of the same dataset prior to prediction [16]. In some cases, e.g., [15], manual feature engineering is also needed to improve the prediction performance.

We note that popular pre-trained LLMs such as BERT and ELMo [29] have been applied to the Mohler dataset. As shown in Table V, the performance of the ELMo-based approach still has a gap when compared to the embedding and completion based approaches. There has been research on how to leverage the GPT family of models for short answer grading, e.g., [20], [21]. However, we could not find other recent works making use of the latest pre-trained LLMs such as GPT-3.5-Turbo or GPT-4 on the Mohler dataset.

Summary-RQ2: The embedding and completion based approaches do not require extensive training or fine-tuning to perform reasonably well. Therefore, they are more generally applicable to a wide variety of grading scenarios, not just in our specific SE courses but also in other courses.

C. RQ3: Using course materials as additional context in completion-based autograding

We would like to find out whether there is an improvement in the autograding capability of the completion-based approach when it is provided with more context extracted from relevant course materials, such as lecture notes, when they are available. This is referred to as RAG - retrieval augmented generation [27]. As described in the implementation, we use OpenAI's text-embedding-ada-002 and the Faiss library [28] to store and extract the chunks of lecture notes (as detailed in Table II) relevant to a question which needs to be auto-graded. The context is then incorporated into the prompt together with the grading examples.

TABLE VI. EVALUATING THE EFFECT OF ADDITIONAL CONTEXT IN AUTOGRADING - SE DATASET
Approach                             | Pearson Correlation Coefficient | RMSE  | MAE
Completion (3 examples)              | 0.621 | 1.342 | 1.044
Completion with context (3 examples) | 0.631 | 1.338 | 1.018
Completion (6 examples)              | 0.694 | 1.207 | 0.872
Completion with context (6 examples) | 0.642 | 1.149 | 0.795
Completion (9 examples)              | 0.674 | 1.240 | 0.852
Completion with context (9 examples) | 0.748 | 1.026 | 0.693

The results are shown in Table VI for the SE dataset, for which we have the corresponding course materials. With relevant context given in the prompt, there are generally notable improvements in the quality of autograding for short answers. For instance, the completion-based approach with 9 examples (no context provided) produces a Pearson correlation coefficient of 0.674, an RMSE of 1.240 and an MAE of 0.852. With the context, the same approach produces better predictions, with a Pearson correlation coefficient of 0.748, an RMSE of 1.026 and an MAE of 0.693. The only exception is the completion-based approach with 6 examples, in which the additional context does not improve the already high Pearson correlation. However, Table VI shows that the RMSE and MAE are still improved by the added context in that case.

Summary-RQ3: Relevant context extracted from course materials and given to the LLM prompt could significantly improve the autograding accuracy in most cases.

D. RQ4: Comparison between GPT-3.5-Turbo and GPT-4

We also compare the grading performance of the GPT-4 and GPT-3.5-Turbo LLMs. We note that the cost of GPT-4 is significantly higher than that of GPT-3.5-Turbo for the same number of tokens. Due to the limited budget, we have not had the chance to fully explore GPT-4's capabilities. In our system, we use GPT-4 to implement the completion-based approach with 6 and 9 examples, and evaluate it on the SE dataset.

In Table VII, there is a significant improvement in the GPT-4 based approach when compared to the one using GPT-3.5-Turbo. The GPT-4 completion-based approach with 9 examples achieved the highest Pearson correlation coefficient of 0.844, and a low RMSE and MAE of 0.828 and 0.566, respectively. While the results are very promising, a more extensive evaluation of GPT-4 based approaches is needed when the cost becomes less of an issue. It is worth noting that, at the time of writing, GPT-4 generally costs about 30x more than GPT-3.5-Turbo.
TABLE VII. GPT-4 VS. GPT-3.5-TURBO FOR THE COMPLETION-BASED APPROACH - SE DATASET
Approach                   | Pearson Correlation Coefficient | RMSE  | MAE
GPT-3.5-Turbo (6 examples) | 0.694 | 1.207 | 0.872
GPT-4 (6 examples)         | 0.784 | 0.896 | 0.616
GPT-3.5-Turbo (9 examples) | 0.674 | 1.240 | 0.852
GPT-4 (9 examples)         | 0.844 | 0.828 | 0.566

Summary-RQ4: Newer LLM versions such as GPT-4 could significantly outperform previous models in short answer autograding.

E. Limitations

Here we discuss some limitations of this work. First, outputs from LLMs can vary from time to time, which might affect the autograding accuracy reported in Section VI. We have attempted to mitigate this issue by reporting the accuracy using a large number of answers from two different datasets. Second, due to funding constraints we could not fully evaluate GPT-4's autograding accuracy. It is possible that newer and more expensive LLM versions will provide improved performance compared to what we have reported here. Finally, it would be better to build a larger dataset with more questions and answers on various software engineering topics. We plan to do so with the latest LLM versions, e.g., GPT-4 Turbo, when the cost becomes more manageable.

VII. CONCLUSION

This work on LLM-based automatic grading of short answers has the potential to reduce the marking burden on instructors teaching a variety of courses, especially in the domain of computer science and software engineering, where the number of students has been increasing recently. We have proposed two new approaches for autograding short answers using embedding and completion models, which are based on OpenAI's GPT family of LLMs.

We have conducted extensive evaluations and comparisons to the existing methods in this area using a well-known dataset and a new dataset of our own from software engineering courses at the university level. The datasets capture different kinds of mark distributions, which could affect any autograding method. We found that our approaches, especially the completion-based approach, which do not require time-consuming training of deep learning models, could work well for the given datasets. We also found that relevant context in the form of lecture notes for the course helps improve grading performance. Lastly, newer models like GPT-4 look very promising for autograding tasks. However, the cost of such models is still a concern, especially for educational institutions. We plan to investigate ways to do more accurate and fair autograding while minimizing LLM cost in our future work.

ACKNOWLEDGEMENT

This work is supported by the UResearch programme from the School of Computing and Information Systems, Singapore Management University.

REFERENCES

[1] T. Puthiaparampil and M. M. Rahman, "Very short answer questions: a viable alternative to multiple choice questions," BMC Medical Education, vol. 20, no. 1, pp. 1-8, 2020.
[2] S. Greving and T. Richter, "Examining the testing effect in university teaching: Retrievability and question format matter," Frontiers in Psychology, vol. 9, 2018.
[3] M. Mohler, R. Bunescu, and R. Mihalcea, "Learning to grade short answer questions using semantic similarity measures and dependency graph alignments," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 752-762.
[4] M. A. Sultan, C. Salazar, and T. Sumner, "Fast and easy short answer grading with high accuracy," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1070-1075.
[5] K. Cochran, C. Cohn, J. F. Rouet, and P. Hastings, "Improving automated evaluation of student text responses using GPT-3.5 for text data augmentation," in International Conference on Artificial Intelligence in Education. Springer, 2023, pp. 217-228.
[6] A. Mizumoto and M. Eguchi, "Exploring the potential of using an AI language model for automated essay scoring," Research Methods in Applied Linguistics, vol. 2, no. 2, 2023.
[7] S. K. Gaddipati, D. Nair, and P. G. Plöger, "Comparative evaluation of pretrained transfer learning models on automatic short answer grading," arXiv preprint arXiv:2009.01303, 2020.
[8] J. Mitra, "Studying the impact of auto-graders giving immediate feedback in programming assignments," in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 388-394.
[9] D. S. Mishra and S. H. Edwards, "The programming exercise markup language: Towards reducing the effort needed to use automated grading tools," in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 395-401.
[10] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, and S. Nanayakkara, "Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering," Transactions of the Association for Computational Linguistics, vol. 11, pp. 1-17, 2023.
[11] S. Basu, C. Jacobs, and L. Vanderwende, "Powergrading: a clustering approach to amplify human effort for short answer grading," Transactions of the Association for Computational Linguistics, vol. 1, pp. 391-402, 2013.
[12] S. Haller, A. Aldea, C. Seifert, and N. Strisciuglio, "Survey on automated short answer grading with deep learning: from word embeddings to transformers," arXiv preprint arXiv:2204.03503, 2022.
[13] S. Hassan, A. A. Fahmy, and M. El-Ramly, "Automatic short answer scoring based on paragraph embeddings," International Journal of Advanced Computer Science and Applications, vol. 9, no. 10, 2018.
[14] S. Kumar, S. Chakrabarti, and S. Roy, "Earth mover's distance pooling over Siamese LSTMs for automatic short answer grading," in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17). AAAI Press, 2017, pp. 2046-2052.
[15] A. Prabhudesai and T. N. Duong, "Automatic short answer grading using Siamese bidirectional LSTM based regression," in 2019 IEEE International Conference on Engineering, Technology and Education (TALE). IEEE, 2019, pp. 1-6.
[16] X. Zhu, H. Wu, and L. Zhang, "Automatic short-answer grading via BERT-based deep neural networks," IEEE Transactions on Learning Technologies, vol. 15, no. 3, pp. 364-375, 2022.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] S.-Y. Yoon, "Short answer grading using one-shot prompting and text similarity scoring model," arXiv preprint arXiv:2305.18638, 2023.
[19] P. Organisciak, S. Acar, D. Dumas, and K. Berthiaume, "Beyond semantic distance: automated scoring of divergent thinking greatly improves with large language models," Thinking Skills and Creativity, p. 101356, 2023.
[20] G. Kortemeyer, "Performance of the pre-trained large language model GPT-4 on automated short answer grading," arXiv preprint arXiv:2309.09338, 2023.
[21] J. Schneider, B. Schenk, C. Niklaus, and M. Vlachos, "Towards LLM-based autograding for short textual answers," arXiv preprint arXiv:2309.11508, 2023.
[22] G. Pinto, I. Cardoso-Pereira, D. Monteiro, D. Lucena, A. Souza, and K. Gama, "Large language models for education: Grading open-ended questions using ChatGPT," in Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 2023, pp. 293-302.
[23] J. M. Gomez-Perez, R. Denaux, and A. Garcia-Silva, "Understanding word embeddings and language models," in A Practical Guide to Hybrid Natural Language Processing: Combining Neural Models and Knowledge Graphs for NLP, pp. 17-31, 2020.
[24] P. Xia, L. Zhang, and F. Li, "Learning similarity with cosine similarity ensemble," Information Sciences, vol. 307, pp. 39-52, 2015.
[25] P. Denny, V. Kumar, and N. Giacaman, "Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language," in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 1136-1142.
[26] A. Martino, M. Iannelli, and C. Truong, "Knowledge injection to counter large language model (LLM) hallucination," in European Semantic Web Conference. Springer, 2023, pp. 182-185.
[27] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459-9474, 2020.
[28] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535-547, 2019.
[29] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, "Recent advances in natural language processing via large pre-trained language models: A survey," ACM Computing Surveys, vol. 56, no. 2, pp. 1-40, 2023.
