
Automatic Grading of Short Answers Using Large
Language Models in Software Engineering Courses
Ta Nguyen Binh Duong, Chai Yi Meng
School of Computing and Information Systems
Singapore Management University
Email: [email protected]

Abstract—Short-answer based questions have been used widely due to their effectiveness in assessing whether the desired learning outcomes have been attained by students. However, due to their open-ended nature, many different answers could be considered entirely or partially correct for the same question. In the context of computer science and software engineering courses, where enrolment has been increasing recently, manual grading of short-answer questions is a time-consuming and tedious process for instructors.

In software engineering courses, assessments concern not just coding but many other aspects of software development, such as system analysis, architecture design, software processes and operation methodologies such as Agile and DevOps. However, existing work on automatic grading/scoring of text-based answers in computing courses has been focusing more on coding-oriented questions. In this work, we consider the problem of autograding a broader range of short answers in software engineering courses. We propose an automated grading system incorporating both text embedding and completion approaches based on recently introduced pre-trained large language models (LLMs) such as GPT-3.5/4. We design and implement a web-based system so that students and instructors can easily leverage autograding for learning and teaching. Finally, we conduct an extensive evaluation of our automated grading approaches. We use a popular public dataset in the computing education domain and a new software engineering dataset of our own. The results demonstrate the effectiveness of our approach and provide useful insights for further research in this area of AI-enabled education.

Index Terms—automatic grading, large language models, embedding, software engineering courses, short answers

I. INTRODUCTION

Assessments in education can take many forms, for instance multiple-choice questions, essays, short written responses, coding tests, etc. Questions which require short textual answers are popular in educational assessments [1]. One of the main reasons is that they can be considered more effective than multiple-choice questions, due to the greater level of information retrieval from memory when students formulate their answers [2]. However, short-answer questions can accept different correct and partially correct answers. Grading many such answers is undoubtedly a tedious and time-consuming process, especially in computing courses at the university level, where the number of students has been increasing significantly.

Automatic grading/scoring of short textual answers is an established problem in technology-enabled education. Various existing approaches made use of traditional machine learning techniques [3], [4], which require careful feature extraction before model training and score prediction. More recent approaches leverage deep learning techniques, which are able to learn representative features from large amounts of data instead of relying on manual feature engineering. However, deep learning based approaches may suffer from the lack of data on short-answer based assessments.

The latest advances in pre-trained LLMs, e.g., OpenAI's release of the GPT family of models, have enabled researchers to further investigate autograding of text-based responses from students, e.g., [5], [6]. However, not much work has been done on LLM-based autograding of short answers in the context of computing education, especially software engineering courses [7]. Such courses cover a wide range of topics including programming, system design, Agile processes, and DevOps practices in system deployment, operation and maintenance. Assessment questions on these topics, e.g., "list one problem with agile processes such as Scrum?", could have a wide range of correct answers. We note that automated grading in computing education has been focusing more on coding based questions [8], [9], which have a rather limited set of valid responses and can be graded by running pre-determined unit test cases.

In this work, we consider the problem of autograding short answers in the context of software engineering courses, which are not limited to just programming/coding questions. We make the following contributions in this paper:
• We propose an automated grading method incorporating both text embedding and completion approaches based on recently introduced pre-trained LLMs such as GPT-3.5-Turbo and GPT-4. The completion-based autograding approach also leverages Retrieval Augmented Generation [10] for better grading accuracy.
• We design and implement a web based system for our LLM-based autograding approaches. The system targets both instructors and students. Instructors can use the web system to manually adjust the autograded scores and to provide additional feedback on answers from students, while students can practice question answering with instant grading.
• We compile a new dataset containing popular questions and short answers from our software engineering courses. These courses cover important software concepts in addition to programming, namely system design,
software testing, Agile processes, DevOps practices, etc. This dataset complements existing ones, e.g., the Mohler dataset [3], which is mainly about programming based questions.
• We conduct an extensive evaluation of our automated grading approach using the new dataset, together with another public dataset in the domain of computer science. To this end, we compare our approach to short-answer grading with some of the most popular existing deep learning based approaches, including paragraph embeddings and Siamese long short-term memory (LSTM) neural networks. The results demonstrate the effectiveness of our approach and provide useful insights for further work in this area.

This paper is organized as follows. Section II discusses related work in short answer autograding, especially recent work in deep learning and LLMs. Section III describes our approaches to autograding of short answers. Section IV provides details on our web based system implementation. Section V presents our evaluation methodology, while Section VI discusses the experimental results. Section VII concludes the paper and highlights possible future work.

II. RELATED WORK

Below we summarize several key recent and existing works on automatic short answer grading. We compare the reported performance of these approaches to ours where possible.

A. Deep learning based approaches

Traditional machine learning techniques have been applied to the problem of automated short answer grading for many years. In these approaches, e.g., [3], [4], [11], manual feature engineering is needed before training the models on a part of the dataset. For instance, [4] described feature extraction methods including text similarity, question demoting, term weighting, etc. Using these features, a simple ridge regression model was trained. The authors reported autograding performance in the form of a Pearson correlation coefficient of 0.592 and a root mean squared error (RMSE) of 0.887. They used a dataset consisting of many computer programming related questions and answers [3] made available by Mohler et al.

Recently, deep learning based approaches have gained much popularity. Deep learning based autograders automatically learn representative features from large datasets. In [12], the authors did a comprehensive survey of deep learning approaches, including embedding, sequential models and attention-based neural networks for short answer grading. The authors showed that the features learned by deep learning methods mainly work as a complement to the manually crafted features of the autograding model. [13] considered automatic grading of short answers using two different types of paragraph embedding models. They obtained a Pearson correlation coefficient of 0.569 and an RMSE of 0.797 on the Mohler dataset [3]. Other neural network based approaches were described in [14] and [15], which leverage Siamese Bidirectional Long Short-Term Memory networks (BiLSTMs). Their results were also reported on the same dataset from [3]. More recent approaches to short answer grading include [16], which uses the Transformer architecture [17] and other optimization techniques to address the problem of insufficient training data.

B. LLM-based approaches

Due to recent advances in pre-trained LLMs, there has been a growing body of work making use of LLMs for automated grading in educational contexts. In particular, [5] investigated text augmentation techniques using GPT-3.5 to improve the dataset for training machine learning models which are then used to provide automated feedback to students. [6] evaluated the accuracy of using GPT-3's text-davinci-003 model for automatic grading of essays. Using 12,100 essays, it concluded that GPT-3 models, combined with linguistic features, provided a high level of accuracy. Note that this is for essay scoring, not short answer grading in computer science related courses. [18] also used OpenAI's GPT-3.5 text-davinci-003 model with one-shot prompting and the text completion API to do automatic grading. However, they made use of the Prize Short Answer Scoring dataset, which includes questions from science, biology, English, etc., but not computer science related courses. Similarly, [19] investigated automated scoring for the subject of divergent thinking. The authors performed fine-tuning of LLMs on human-judged responses. The authors of [20] evaluated GPT-4 for short answer grading using the SciEntsBank and Beetle datasets. They found that for these datasets, GPT-4's performance is comparable to manually crafted machine learning models.

Regarding autograding of short answers in the context of computer science related courses, very recent works include [21], which made use of ChatGPT for grading exams in a data science course. They also evaluated ChatGPT on a German-language introductory information systems course. They found that such LLM deployments can be valuable, but they are not yet ready for fully automated grading. ChatGPT was also used in [22] to provide corrections to open-ended answers from software development professionals participating in technical training. The authors found that subject matter experts usually agreed with the corrections given by ChatGPT. None of these works made use of well-known datasets in computer science courses such as the Mohler dataset [3]. The exception is [7], in which the authors compared pre-trained LLMs such as ELMo, BERT, GPT-2, etc., directly on their autograding performance for the Mohler dataset. We note that this work was done a while ago, so the latest GPT models were not included.

C. Summary

We note that existing deep learning based approaches to short answer grading can provide good accuracy, but they need to be combined with hand-crafted features and require extensive training with large datasets. On the other hand, more
recent approaches based on generative AI, in particular pre-trained LLMs, have been focusing more on other educational domains which are not computer science related. In addition, many of the existing approaches made use of the computer science dataset from [3], which had been released a while ago. This dataset is about basic data structures and computer programming concepts.

In this work, we aim to develop new LLM-based approaches which do not require training, and to evaluate these approaches using an entirely new dataset obtained from software engineering courses which include many more topics and concepts beyond just programming. We plan to release our new dataset publicly to encourage further research in this area.

III. LLM-BASED AUTO-GRADING APPROACHES

In this section, we describe in detail our proposed approaches to auto-grading short answers, namely the embedding-based and the completion-based approach. Both approaches are based on the latest advances in pre-trained LLMs, in particular the text embedding and chat completion models released publicly by OpenAI.

A. Embedding-based

Text embeddings are numerical representations of text in which words or phrases are represented as a vector of numbers. They are used to capture semantic meanings and relationships between words or phrases, enabling more efficient processing and understanding of human languages [23].

1  Input: pair of question, answer (Q, A)
2  list R = [reference answers for Q]
3
4  Output: numerical score S for A
5  Steps:
6  Ch = 0
7  Sq = 0
8  Compute the embedding Ea for A
9  For each reference answer Ar in R
10   Compute the embedding Er for Ar
11   Compute a cosine similarity Cr = cos(Er, Ea)
12   If Cr > Ch:
13     Ch = Cr
14     Sq = score of Ar
15
16 S = Ch * Sq
17 Return S

Listing 1. Embedding-based autograding approach
The algorithm for our embedding-based autoscoring approach is shown in Listing 1. In this approach, the algorithm computes the embeddings of all the reference answers and student answers for a particular question using an available text embedding model (lines 8-11 of Listing 1). In this work, we use OpenAI's text-embedding-ada-002 model as it is OpenAI's best and most cost-effective embedding model as of 2023.

The cosine similarity [24] between each reference answer and student answer (to be auto-graded) is then calculated using their corresponding embedding vectors, A and B respectively, as follows:

cos(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}   (1)

The cosine similarity ranges from 0 to 1, with 0 being the least similar and 1 being the most similar. After comparing the cosine similarities between the student answer and all the reference answers, the reference answer most similar to the student answer is selected (lines 12-14 of Listing 1). A mark proportional to the cosine similarity is then given to the student answer (line 16 of Listing 1). This is done by multiplying the cosine similarity score with the reference answer's score.

The embedding-based autoscoring of short answers can be implemented and deployed quickly due to the general availability and affordability of state-of-the-art text embedding models such as text-embedding-ada-002. For instance, its pricing at the time of writing is just $0.0001 per 1K tokens. However, this approach might require a wide range of possible reference answers to be provided for more accurate grading. For short-answer questions, this is potentially challenging as there can be a large number of possibly correct answers to a single question. We can mitigate this issue by using correct answers from students as reference answers. Another issue is that, although models such as text-embedding-ada-002 are quite affordable, computing embeddings for answers every time grading is needed (lines 8 and 10 of Listing 1) will add to the total cost. For this, we could use a vector database such as Chroma (https://www.trychroma.com) to store and retrieve the pre-computed embeddings when required.
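To make the above concrete, the following is a minimal Python sketch of Listing 1 using the OpenAI Python client (v1.x) and text-embedding-ada-002; the function names and the usage comment at the end are illustrative and not part of our actual system.

import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    # Compute the embedding vector for a piece of text (lines 8 and 10 of Listing 1).
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine(a, b):
    # Cosine similarity as defined in Equation (1).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def grade_by_embedding(student_answer, reference_answers):
    # reference_answers: list of (reference answer text, mark awarded to that answer).
    ea = embed(student_answer)
    best_sim, best_score = 0.0, 0.0
    for ref_text, ref_score in reference_answers:
        sim = cosine(embed(ref_text), ea)
        if sim > best_sim:              # keep the most similar reference answer
            best_sim, best_score = sim, ref_score
    return best_sim * best_score        # mark proportional to the similarity (line 16)

# Example: grade_by_embedding("a pointer holds a memory location",
#                             [("A variable that contains the address in memory of another variable.", 5.0)])

In practice the reference embeddings would be pre-computed and cached (for example in a vector database, as discussed above) rather than recomputed on every call.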
B. Completion-based

Completion is essentially the generation of output based on the text prompt given to a pre-trained LLM such as GPT-3.5-Turbo. Prompt construction, or prompt engineering, for LLMs is an active research area [25]. In a prompt, we may provide relevant instructions, examples, etc., in natural language. Such data helps direct the model to produce the desired output. One way to do prompting is called zero-shot, in which a query is sent to the LLM without concrete examples of expected results. On the other hand, in few-shot prompting, we provide multiple examples of questions and their corresponding answers in a simulated multi-turn conversation with the LLM. At the end of the conversation, we can ask the LLM to score a student answer for a given question.

In this completion-based autograding approach, we make use of OpenAI's Chat Completions API (https://platform.openai.com/docs/guides/text-generation/chat-completions-api). The API defines prompts as sequences of messages. Each message has two components, namely role and content. The role can be "system", "user", or "assistant". A message with the "system" role is usually used first to define the behavior of the model. A "user" message gives instructions, and an "assistant" message provides an example of the desired output. The prompt is constructed with all the required messages and sent to the LLM via an API call. Our completion-based autograding approach is shown in Listing 2.

1  Message 1: {"role": "system", "content": "You are an AI assistant for teaching software engineering concepts."}
2
3  # Start providing examples in the prompt here
4  Message 2: {"role": "user", "content": "Given the question 'What could be a problem with monolithic software?', provide a score for the corresponding answer 'Scaling needs to be done for the whole application'."}
5  Message 3: {"role": "assistant", "content": "Score: 4/4"}
6
7  Message 4: {"role": "user", "content": "Given the question 'What could be a problem with monolithic software?', provide a score for the corresponding answer 'It is easier to develop'."}
8  Message 5: {"role": "assistant", "content": "Score: 1/4"}
9
10 # Provide more examples using additional messages if needed
11
12 # This message is used for autograding
13 Last message: {"role": "user", "content": "Given the question 'What could be a problem with monolithic software?', provide a score for the corresponding answer 'It is hard to make changes.'"}
14
15 # The LLM will respond with an appropriate score in the message below
16 Message: {"role": "assistant", "content": "Score: <predicted_score>"}

Listing 2. Completion-based autograding approach
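The message sequence of Listing 2 maps directly onto the Chat Completions API. A minimal Python sketch is shown below, assuming the OpenAI Python client (v1.x); the score-parsing convention and helper names are ours for illustration only.

import re
from openai import OpenAI

client = OpenAI()

def build_messages(question, graded_examples, answer_to_grade, max_mark):
    # graded_examples: list of (example answer, awarded mark) pairs for this question.
    msgs = [{"role": "system",
             "content": "You are an AI assistant for teaching software engineering concepts."}]
    for ex_answer, ex_mark in graded_examples:
        msgs.append({"role": "user",
                     "content": f"Given the question '{question}', provide a score for the "
                                f"corresponding answer '{ex_answer}'."})
        msgs.append({"role": "assistant", "content": f"Score: {ex_mark}/{max_mark}"})
    msgs.append({"role": "user",
                 "content": f"Given the question '{question}', provide a score for the "
                            f"corresponding answer '{answer_to_grade}'."})
    return msgs

def grade_by_completion(question, graded_examples, answer_to_grade, max_mark=4):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the grading as deterministic as the API allows
        messages=build_messages(question, graded_examples, answer_to_grade, max_mark),
    )
    reply = resp.choices[0].message.content
    match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*\d+", reply)  # expects "Score: x/y"
    return float(match.group(1)) if match else None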
When instructors need to do autograding, the completion-based approach constructs a sequence of messages as described in Listing 2. Each "user" message provides the question and a corresponding answer, which could be a reference answer or a student answer. This "user" message is immediately followed by an "assistant" message which contains the score given for that answer. Together, this pair of messages provides a concrete example of how scoring should be done for a question and its corresponding answer. For example, in Listing 2, messages 2 and 3 provide a score of 4/4 for the question/answer pair "What could be a problem with monolithic software?" / "Scaling needs to be done for the whole application". Similarly, messages 4 and 5 provide another example for the same question with a different answer. We can add more examples to the prompt by providing more such pairs of messages. Finally, the last "user" message in the prompt provides the answer to be graded for the same question used in the previous examples. Following the Chat Completions API, the LLM, e.g., GPT-3.5-Turbo, will respond with a predicted score for this answer.

Below, we discuss two important considerations for the completion-based autograding approach, namely example selection and the incorporation of RAG (Retrieval Augmented Generation).

Selecting examples for prompt construction: The number of examples in a prompt can be varied. Providing more examples would likely yield better scoring results, as the LLM can learn more effectively from the relevant examples. We note that using more examples also translates to more cost, as models such as OpenAI's charge based on the number of tokens in the requests and responses. However, in this work we focus on ways to provide more relevant examples to improve grading accuracy rather than on cost.

In our completion-based grading approach, we split the answers in a dataset into three different categories, namely low-quality (having low marks), medium-quality (having average to quite decent marks), and high-quality (having full marks). During the automated grading process for a particular question, our algorithm selects a random answer from each answer category and constructs the appropriate prompt to be sent to the LLM. The number of answers used as examples for each category is configurable. For instance, in this work we have considered using 1, 2, and 3 answers per category as examples. As a result, the completion-based grading approach can construct prompts having a total of 3, 6, or 9 examples (for 3 categories). We believe that this approach gives the LLM a better understanding of the grading rubrics for each given question. (A short sketch of this selection step is given at the end of this section.)

Incorporating Retrieval Augmented Generation (RAG): Pre-trained LLMs have been shown to perform well in many common NLP tasks. However, their knowledge base cannot be easily revised or expanded beyond simple fine-tuning, and they may hallucinate in their responses [26]. RAG [10], [27] enables an LLM to access external knowledge databases to complete domain-specific tasks with better consistency, reliability and reduced hallucination. Given an input, e.g., a question, RAG retrieves relevant texts from the specified external knowledge databases and adds those texts as context to the prompt sent to the LLM. With more appropriate context, the LLM can generate output of higher quality.

In the completion-based autograding approach using RAG, we make use of the course content to provide additional context. The aim is to improve grading accuracy and reliability. For our software engineering courses, we make available PDF lecture notes for each topic covered, e.g., automation, software processes, software testing, etc. The lecture notes are parsed and partitioned into chunks of text, for which corresponding text embeddings are computed. Given a specific question to be graded, the most relevant chunks are retrieved by comparing the embedding of the question against the embedding of each chunk. The relevant chunks are then fed into the LLM as the grading context.
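The following sketch illustrates the answer partitioning and example selection described above. The mark thresholds roughly mirror the split used later for our SE dataset (marks out of 4); the function and variable names are illustrative only.

import random

def categorise(mark, max_mark=4.0):
    # Bucket a graded answer into low/medium/high quality.
    if mark <= 0.25 * max_mark:
        return "low"
    if mark < max_mark:
        return "medium"
    return "high"

def select_examples(graded_answers, per_category=2):
    # graded_answers: list of (answer text, mark) pairs for one question.
    buckets = {"low": [], "medium": [], "high": []}
    for answer, mark in graded_answers:
        buckets[categorise(mark)].append((answer, mark))
    examples = []
    for bucket in buckets.values():
        examples.extend(random.sample(bucket, min(per_category, len(bucket))))
    return examples  # 3, 6 or 9 examples for per_category = 1, 2, 3

The selected (answer, mark) pairs can then be used as the graded examples when assembling a prompt in the style of Listing 2.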
IV. IMPLEMENTATION

For instructors and students to take advantage of LLM-based autograding, we design and implement a web system incorporating both the embedding-based and completion-based autograding approaches. The system components are shown in Figure 1.

Fig. 1. Components in our web based autograding system

A. Components

The system is designed for both computing students and instructors at the undergraduate level. The web interface provides functionalities for instructors to create/read/update/delete (CRUD) questions and answers, and to monitor student performance. Students can use the system as a way to practice for quizzes by answering questions according to the topics covered in the course. In this way, students continue to provide more data, so the system can get better at autograding over time. The database ensures that all questions/answers/marks are persisted.

We implement a data partitioning mechanism to automatically divide the student answers into categories, e.g., high quality, medium quality, etc., as mentioned in Section III. The mechanism should be rerun once more answers from students have been added to the database. For context extraction, we use OpenAI's text-embedding-ada-002 and the Faiss library [28], a popular package for similarity search developed at Meta AI Research, to compute and extract the chunks of lecture notes relevant to a question which needs to be auto-graded. The context is then incorporated into the prompt together with the grading examples.
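A minimal sketch of this Faiss-based context extraction is shown below, assuming chunk embeddings from text-embedding-ada-002; the chunking strategy, helper names and the number of retrieved chunks are illustrative assumptions rather than the exact implementation.

import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding, dtype="float32")

def build_index(chunks):
    # Index L2-normalised embeddings so that inner product equals cosine similarity.
    vectors = np.stack([embed(c) for c in chunks])
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def retrieve_context(question, chunks, index, k=3):
    query = embed(question).reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)  # indices of the top-k most similar chunks
    return "\n\n".join(chunks[i] for i in ids[0])

The returned text is what gets prepended to the grading prompt as the additional context.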
B. LLM deployments

OpenAI's GPT LLMs are used in the implementation of both the embedding-based and completion-based automated grading approaches. These pre-trained LLMs are deployed in the Azure cloud and accessible via web APIs. In particular, our system accesses the Chat Completions API and the text embedding API via the endpoints at https://api.openai.com/v1/chat/completions and https://api.openai.com/v1/embeddings, respectively.

In our implementation, we have incorporated three different LLM deployments from OpenAI, namely GPT-3.5-Turbo, GPT-4, and text-embedding-ada-002. As these are pay-per-use models, cost is a concern, especially when more students use our system in the future. For this reason, GPT-3.5-Turbo is used as the default LLM most of the time instead of GPT-4, as the former is quite capable and cost-effective, i.e., about 30x cheaper than the latter. The embedding model provided by OpenAI is rather inexpensive, costing just $0.0001 per 1K tokens.

C. Web-based implementation

We have implemented a complete web based system using Vue.js for the frontend interface, Flask for the backend logic, and MongoDB as the database. Figure 2 shows the web interface in which students can practice answering short questions. When students answer a question, their answers and the marks given by our autograding approaches are automatically added to the database. As shown in Figure 3, instructors can edit any answers and marks given by the autograding approaches, as well as provide additional feedback for each answer. Instructors can also add more questions/answers for students to practice on.

Fig. 2. Students can practice on short answer questions. Their answers will be graded automatically.

Fig. 3. Instructors can edit answers and marks given automatically for any question, as well as providing more feedback.

V. EVALUATION METHODOLOGY

This section describes the datasets and the performance measures used in our evaluation.

A. Datasets

Two complementary datasets are used to evaluate the performance of our proposed embedding-based and completion-based autograding approaches.

Mohler dataset [3]: This dataset has been widely used in evaluating automatic grading approaches for short answers. Most questions in the dataset are about programming/coding concepts. We use it mainly for fair comparisons with existing approaches in this area. The dataset was obtained through exams/assignments given to students in an introductory computer science class at the University of North Texas. Every student answer is marked by two graders, and the average mark is calculated for each answer in the range of 0 to 5 marks, with 5 being the maximum.

The dataset consists of a total of 87 questions with 1 reference answer for each question, but 6 questions are excluded from the dataset as they are not short answer questions. There are 24 to 31 student answers per question in the dataset, summing up to 2273 answers with an average of 28 answers per question. All results obtained through this dataset are based on the 81 questions and 2273 answers. A sample question and its corresponding answers/scores extracted from [3] are shown in Table I.

In our work, the answers in this dataset are split into 3 different categories: low-quality (less than or equal to 2 marks), medium-quality (less than or equal to 4 marks) and high-quality (5 marks). This partitioning is important for the evaluation of the completion-based approach, where different numbers of examples are used for prompting the LLMs.
TABLE I. MOHLER DATASET: SAMPLE QUESTION, ANSWERS AND SCORES
Question:          What is a pointer?
Reference Answer:  A variable that contains the address in memory of another variable.
Student Answer:    a pointer holds a memory location.
Score 1:           5
Score 2:           4
Average Score:     4.5

Software engineering (SE) dataset: This is a dataset on the broader topic of software development, with subtopics consisting of automation, software design, versioning, agile processes, extreme programming (XP), security, solution support, and testing. The summary of each subtopic is listed in Table II. It nicely complements the Mohler dataset, which is mainly about programming. The dataset consists of a total of 32 short-answer questions, with the number of reference answers per question ranging from 1 to 4. There is a total of 421 graded answers with their corresponding marks, with an average of 13 answers per question. The marks for each question range from 0 to 4, with 4 being the maximum. Along with this dataset, there are PDF lecture notes for each of the subtopics, with the number of pages ranging from 22 to 50. The PDFs are used as additional contexts for the grading of questions related to their respective subtopics.

TABLE II. SE LECTURE NOTES INCORPORATED AS GRADING CONTEXTS
Topic              | Summary                                                                              | Pages
Automation         | Software deployment models, infrastructure and CI/CD                                 | 25
Software design    | Dependency injection, REST API design                                                | 32
Software processes | Waterfall, iterative and agile processes                                             | 29
Security           | Confidentiality, integrity, availability approaches                                  | 30
Versioning         | Distributed version control, Git workflows                                           | 36
XP practices       | Code review, refactoring, and pair programming                                       | 30
Software support   | Events, incidents and problem management for software systems                        | 50
Software testing   | Blackbox, whitebox, input space partitioning, unit, integration, regression testing  | 22

The answers in this dataset are also split into 3 different mark categories: low-quality (less than or equal to 1 mark), medium-quality (less than or equal to 3.5 marks) and high-quality (4 marks). An example is shown in Table III.

TABLE III. SE DATASET: SAMPLE QUESTION, ANSWERS AND SCORE
Subtopic:               Automation
Question:               What is one advantage of canary deployment?
Reference Answer:       Can minimize the impact of errors to a subset of users
Graded Answer:          it is cheaper to do
Graded Answer's Score:  1

B. Performance measures

Similar to existing work in this area [15], the results are evaluated using the Pearson correlation coefficient, the mean absolute error (MAE) and the root mean square error (RMSE).

The Pearson correlation coefficient is one of the most common ways to measure linear correlation. The result ranges from -1 to 1 depending on the strength and direction of the relationship. A larger absolute value signifies a stronger correlation between the two variables tested.

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}   (2)

where x_i represents the actual mark given by the human graders and y_i represents the mark given by the autograding approach for the same answer. \bar{x} and \bar{y} represent the means of x and y, respectively.

The mean absolute error (MAE) is calculated by averaging the absolute differences between the actual and predicted marks:

MAE = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|   (3)

Finally, the root mean square error (RMSE) is also widely used to measure the quality of predictions:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{n}}   (4)

In (3) and (4), n is the total number of answers being evaluated, x_i is the actual mark for the i-th answer and y_i is the predicted mark given by the autograding approach for the same answer. Both MAE and RMSE are reliable metrics for assessing the accuracy of predictions.

C. Research questions

In the evaluation, we aim to answer the following research questions (RQs):
• RQ1: Which is the better approach for autograding short answers: embedding-based or completion-based?
• RQ2: How do embedding-based and completion-based autograding compare to existing deep learning based approaches?
• RQ3: Does adding context from relevant lecture notes on the question's topic using RAG produce more accurate grading results when using the completion-based approach?
• RQ4: How do different versions of the same LLM family, e.g., GPT-3.5-Turbo and GPT-4, compare to each other in the autograding of short answers?
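For completeness, the three measures defined in Section V-B can be computed directly from the lists of human and predicted marks, as in the short sketch below (plain Python, no external dependencies assumed).

import math

def evaluate(actual, predicted):
    # Pearson correlation, MAE and RMSE as defined in Equations (2)-(4).
    n = len(actual)
    mean_x, mean_y = sum(actual) / n, sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    var_x = sum((x - mean_x) ** 2 for x in actual)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    pearson = cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0
    mae = sum(abs(x - y) for x, y in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n)
    return {"pearson": pearson, "mae": mae, "rmse": rmse}

# Example: evaluate([4.5, 3.0, 5.0], [4.0, 2.5, 5.0])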
VI. RESULTS

A. RQ1: Embedding-based vs. completion-based

We first compare our proposed embedding-based and completion-based approaches using both the Mohler dataset [3] and the SE dataset. The default LLM used is GPT-3.5-Turbo. The results are shown in Table IV.

TABLE IV. EMBEDDING VS. COMPLETION
Model                   | Pearson Correlation Coefficient | RMSE  | MAE
Mohler Dataset
Embedding               | 0.557 | 0.932 | 0.749
Completion (3 examples) | 0.450 | 1.185 | 0.960
Completion (6 examples) | 0.406 | 0.975 | 0.780
Completion (9 examples) | 0.525 | 0.922 | 0.706
SE Dataset
Embedding               | 0.507 | 2.017 | 1.727
Completion (3 examples) | 0.621 | 1.342 | 1.044
Completion (6 examples) | 0.694 | 1.207 | 0.872
Completion (9 examples) | 0.674 | 1.240 | 0.852

For the Mohler dataset [3], the embedding-based approach produces the highest Pearson correlation coefficient of 0.557. On the other hand, the completion-based approach with 9 examples produces a Pearson correlation of 0.525, as well as the best RMSE and MAE of 0.922 and 0.706, respectively.

For the SE dataset, the results produced by the completion-based approach with 6 examples are quite similar to those produced by the same approach with 9 examples. The former has a higher Pearson correlation coefficient of 0.694 and a lower RMSE of 1.207, while the latter has a lower MAE of 0.852. However, given that more examples have to be passed into the prompt, grading may take longer and cost more. Therefore, in this case the completion-based approach with 6 examples will potentially be more useful due to its balance between efficiency and accuracy.

We can observe a large discrepancy between the results produced by the embedding-based approach for the two different datasets, where the RMSE and MAE for the SE dataset are more than twice those produced for the Mohler dataset. In particular, for the Mohler dataset, the embedding-based approach produced an RMSE and MAE of 0.932 and 0.749, respectively. On the other hand, for the SE dataset, the same approach produced an RMSE and MAE of 2.017 and 1.727. The Pearson correlation coefficient is 0.557 and 0.507 for the Mohler and SE datasets, respectively.

From the experiments, we note that the embedding-based approach is biased towards giving higher scores, due to the relatively high cosine similarities obtained between the student answers and the reference answers in many cases, unless the student's answer is hardly related to the question text. This can be observed from the fact that 86% of the scores predicted for the Mohler dataset [3] are above 4 marks when using the embedding-based approach. At the same time, we also note that the Mohler dataset has about 63% of the answers scoring above 4 out of 5 marks as given by the human graders. On the other hand, the SE dataset only has 27% of the answers scoring above 3 out of 4 marks. This explains why the embedding-based approach does better on the Mohler dataset, but much worse on the SE dataset. In this case, we observe that the completion-based approach is the better way to do autograding of short answers, as it significantly outperforms the embedding-based approach on the SE dataset.

Summary-RQ1: The completion-based approach could be considered the better autograding approach overall, as it is more consistent with the predicted marks given to answers in both datasets, regardless of the actual mark distribution in either of them. In both cases, we will need to provide more relevant examples of answers and actual scores in the completion-based prompt to improve the autograding performance.

B. RQ2: Comparison to deep learning based methods

We now compare the embedding and completion based approaches with other existing autograding methods which made use of deep learning techniques. For a fair comparison, all the approaches are evaluated using performance measures reported previously on the Mohler dataset. Table V summarizes the key results. In particular, we collected results reported in [14], which implemented and evaluated several variations of the Long Short-Term Memory (LSTM) neural network for short answer grading. The authors of [13] considered different types of paragraph embedding models. The use of Bidirectional LSTMs (BiLSTMs) has also been shown to perform well in automated grading [15]. More recently, pre-trained models such as ELMo [7] were also applied to the Mohler dataset.
TABLE V. COMPARING WITH EXISTING DEEP LEARNING BASED AUTOGRADING APPROACHES - MOHLER DATASET
Approach                                  | Pearson Correlation Coefficient | RMSE  | MAE
LSTM-EMD-SVOR [14]                        | 0.550 | 0.830 | 0.490
LSTM-EMD-Logits [14]                      | 0.649 | 1.135 | 0.657
Paragraph embedding (doc2vec) [13]        | 0.569 | 0.797 | -
Siamese BiLSTM + feature engineering [15] | 0.655 | 0.889 | 0.618
Stacked BiLSTM (ELMo) [7]                 | 0.485 | 0.978 | -
Embedding-based                           | 0.557 | 0.932 | 0.749
Completion-based (9 examples)             | 0.525 | 0.922 | 0.706

As observed from Table V, the embedding and completion based approaches may not produce the best result on any single metric. However, they offer a good balance among all three metrics, and their overall performance is quite comparable to the existing deep learning based approaches, e.g., [13], [14]. It is important to note that our approaches required no extensive training or fine-tuning with a large labelled dataset. On the other hand, existing deep learning based approaches incur significant training cost on a large part of the same dataset prior to prediction [16]. In some cases, e.g., [15], manual feature engineering is also needed to improve the prediction performance.

We note that popular pre-trained LLMs such as BERT and ELMo [29] have been applied to the Mohler dataset. As shown in Table V, the performance of the ELMo-based approach still has a gap when compared to the embedding and completion based approaches. There has been research on how to leverage the GPT family of models for short answer grading, e.g., [20], [21]. However, we could not find other recent works making use of the latest pre-trained LLMs such as GPT-3.5-Turbo or GPT-4 on the Mohler dataset.

Summary-RQ2: The embedding and completion based approaches do not require extensive training or fine-tuning to perform reasonably well. Therefore, they are more generally applicable to a wide variety of grading scenarios, not just in our specific SE courses but also in other courses.

C. RQ3: Using course materials as additional context in completion-based autograding

We would like to find out whether there is an improvement in the autograding capability of the completion-based approach when it is provided with more context extracted from relevant course materials, such as lecture notes, when they are available. This is referred to as RAG - retrieval augmented generation [27]. As described in the implementation, we use OpenAI's text-embedding-ada-002 and the Faiss library [28] to store and extract the chunks of lecture notes (as detailed in Table II) relevant to a question which needs to be auto-graded. The context is then incorporated into the prompt together with the grading examples.

TABLE VI. EVALUATING THE EFFECT OF ADDITIONAL CONTEXT IN AUTOGRADING - SE DATASET
Approach                             | Pearson Correlation Coefficient | RMSE  | MAE
Completion (3 examples)              | 0.621 | 1.342 | 1.044
Completion with context (3 examples) | 0.631 | 1.338 | 1.018
Completion (6 examples)              | 0.694 | 1.207 | 0.872
Completion with context (6 examples) | 0.642 | 1.149 | 0.795
Completion (9 examples)              | 0.674 | 1.240 | 0.852
Completion with context (9 examples) | 0.748 | 1.026 | 0.693

The results are shown in Table VI for the SE dataset, for which we have the corresponding course materials. With relevant context given in the prompt, there are generally notable improvements in the quality of autograding for short answers. For instance, the completion-based approach with 9 examples (no context provided) produces a Pearson correlation coefficient of 0.674, an RMSE of 1.240 and an MAE of 0.852. With the context, the same approach produces better predictions, with a Pearson correlation coefficient of 0.748, an RMSE of 1.026 and an MAE of 0.693. The only exception is the completion-based approach with 6 examples, in which the additional context does not improve the already high Pearson correlation. However, Table VI shows that the RMSE and MAE are still improved by the added context in that case.

Summary-RQ3: Relevant context extracted from course materials and given to the LLM prompt could significantly improve the autograding accuracy in most cases.

D. RQ4: Comparison between GPT-3.5-Turbo and GPT-4

We also compare the grading performance of the GPT-4 and GPT-3.5-Turbo LLMs. We note that the cost of GPT-4 is significantly higher than that of GPT-3.5-Turbo for the same number of tokens. Due to the limited budget, we have not had the chance to fully explore GPT-4's capabilities. In our system, we use GPT-4 to implement the completion-based approach with 6 and 9 examples, and evaluate it on the SE dataset.

In Table VII, there is a significant improvement in the GPT-4 based approach when compared to the one using GPT-3.5-Turbo. The GPT-4 completion-based approach with 9 examples achieved the highest Pearson correlation coefficient of 0.844, and a low RMSE and MAE of 0.828 and 0.566, respectively. While the results are very promising, a more extensive evaluation of GPT-4 based approaches is needed when the cost becomes less of an issue. It is worth noting that, at the time of writing, GPT-4 generally costs about 30x more than GPT-3.5-Turbo.
TABLE VII. GPT-4 VS. GPT-3.5-TURBO FOR THE COMPLETION-BASED APPROACH - SE DATASET
Approach                   | Pearson Correlation Coefficient | RMSE  | MAE
GPT-3.5-Turbo (6 examples) | 0.694 | 1.207 | 0.872
GPT-4 (6 examples)         | 0.784 | 0.896 | 0.616
GPT-3.5-Turbo (9 examples) | 0.674 | 1.240 | 0.852
GPT-4 (9 examples)         | 0.844 | 0.828 | 0.566

Summary-RQ4: Newer LLM versions such as GPT-4 could significantly outperform previous models in short answer autograding.

E. Limitations

Here we discuss some limitations of this work. First, outputs from LLMs can vary from time to time, which might affect the autograding accuracy reported in Section VI. We have attempted to mitigate this issue by reporting the accuracy using a large number of answers from two different datasets. Second, due to funding constraints we could not fully evaluate GPT-4's autograding accuracy. It is possible that newer and more expensive LLM versions will provide improved performance compared to what we have reported here. Finally, it would be better to build a larger dataset with more questions and answers on various software engineering topics. We plan to do so with the latest LLM versions, e.g., GPT-4 Turbo, when the cost becomes more manageable.

VII. CONCLUSION

This work on LLM-based automatic grading of short answers has the potential to reduce the marking burden on instructors teaching a variety of courses, especially in the domain of computer science and software engineering, where the number of students has been increasing recently. We have proposed two new approaches for autograding short answers using embedding and completion models, which are based on OpenAI's GPT family of LLMs.

We have conducted extensive evaluations and comparisons to the existing methods in this area using a well-known dataset and a new dataset of our own from software engineering courses at the university level. The datasets capture different kinds of mark distributions, which could affect any autograding method. We found that our approaches, especially the completion-based approach, which do not require time-consuming training of deep learning models, could work well for the given datasets. We also found that relevant context in the form of lecture notes for the course helps improve grading performance. Lastly, newer models like GPT-4 look very promising for autograding tasks. However, the cost of such models is still a concern, especially for educational institutions. We plan to investigate ways to do more accurate and fair autograding while minimizing LLM cost in our future work.

ACKNOWLEDGEMENT

This work is supported by the UResearch programme from the School of Computing and Information Systems, Singapore Management University.

REFERENCES

[1] T. Puthiaparampil and M. M. Rahman, "Very short answer questions: a viable alternative to multiple choice questions," BMC Medical Education, vol. 20, no. 1, pp. 1-8, 2020.
[2] S. Greving and T. Richter, "Examining the testing effect in university teaching: Retrievability and question format matter," Frontiers in Psychology, vol. 9, 2018.
[3] M. Mohler, R. Bunescu, and R. Mihalcea, "Learning to grade short answer questions using semantic similarity measures and dependency graph alignments," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 752-762.
[4] M. A. Sultan, C. Salazar, and T. Sumner, "Fast and easy short answer grading with high accuracy," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1070-1075.
[5] K. Cochran, C. Cohn, J. F. Rouet, and P. Hastings, "Improving automated evaluation of student text responses using GPT-3.5 for text data augmentation," in International Conference on Artificial Intelligence in Education. Springer, 2023, pp. 217-228.
[6] A. Mizumoto and M. Eguchi, "Exploring the potential of using an AI language model for automated essay scoring," Research Methods in Applied Linguistics, vol. 2, no. 2, 2023.
[7] S. K. Gaddipati, D. Nair, and P. G. Plöger, "Comparative evaluation of pretrained transfer learning models on automatic short answer grading," arXiv preprint arXiv:2009.01303, 2020.
[8] J. Mitra, "Studying the impact of auto-graders giving immediate feedback in programming assignments," in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 388-394.
[9] D. S. Mishra and S. H. Edwards, "The programming exercise markup language: Towards reducing the effort needed to use automated grading tools," in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 395-401.
[10] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, and S. Nanayakkara, "Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering," Transactions of the Association for Computational Linguistics, vol. 11, pp. 1-17, 2023.
[11] S. Basu, C. Jacobs, and L. Vanderwende, "Powergrading: a clustering approach to amplify human effort for short answer grading," Transactions of the Association for Computational Linguistics, vol. 1, pp. 391-402, 2013.
[12] S. Haller, A. Aldea, C. Seifert, and N. Strisciuglio, "Survey on automated short answer grading with deep learning: from word embeddings to transformers," arXiv preprint arXiv:2204.03503, 2022.
[13] S. Hassan, A. A. Fahmy, and M. El-Ramly, "Automatic short answer scoring based on paragraph embeddings," International Journal of Advanced Computer Science and Applications, vol. 9, no. 10, 2018.
[14] S. Kumar, S. Chakrabarti, and S. Roy, "Earth mover's distance pooling over Siamese LSTMs for automatic short answer grading," in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17). AAAI Press, 2017, pp. 2046-2052.
[15] A. Prabhudesai and T. N. Duong, "Automatic short answer grading using Siamese bidirectional LSTM based regression," in 2019 IEEE International Conference on Engineering, Technology and Education (TALE). IEEE, 2019, pp. 1-6.
[16] X. Zhu, H. Wu, and L. Zhang, "Automatic short-answer grading via BERT-based deep neural networks," IEEE Transactions on Learning Technologies, vol. 15, no. 3, pp. 364-375, 2022.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] S.-Y. Yoon, "Short answer grading using one-shot prompting and text similarity scoring model," arXiv preprint arXiv:2305.18638, 2023.
[19] P. Organisciak, S. Acar, D. Dumas, and K. Berthiaume, "Beyond semantic distance: automated scoring of divergent thinking greatly improves with large language models," Thinking Skills and Creativity, p. 101356, 2023.
[20] G. Kortemeyer, "Performance of the pre-trained large language model GPT-4 on automated short answer grading," arXiv preprint arXiv:2309.09338, 2023.
[21] J. Schneider, B. Schenk, C. Niklaus, and M. Vlachos, "Towards LLM-based autograding for short textual answers," arXiv preprint arXiv:2309.11508, 2023.
[22] G. Pinto, I. Cardoso-Pereira, D. Monteiro, D. Lucena, A. Souza, and K. Gama, "Large language models for education: Grading open-ended questions using ChatGPT," in Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 2023, pp. 293-302.
[23] J. M. Gomez-Perez, R. Denaux, and A. Garcia-Silva, "Understanding word embeddings and language models," in A Practical Guide to Hybrid Natural Language Processing: Combining Neural Models and Knowledge Graphs for NLP, pp. 17-31, 2020.
[24] P. Xia, L. Zhang, and F. Li, "Learning similarity with cosine similarity ensemble," Information Sciences, vol. 307, pp. 39-52, 2015.
[25] P. Denny, V. Kumar, and N. Giacaman, "Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language," in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 1136-1142.
[26] A. Martino, M. Iannelli, and C. Truong, "Knowledge injection to counter large language model (LLM) hallucination," in European Semantic Web Conference. Springer, 2023, pp. 182-185.
[27] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459-9474, 2020.
[28] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535-547, 2019.
[29] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, "Recent advances in natural language processing via large pre-trained language models: A survey," ACM Computing Surveys, vol. 56, no. 2, pp. 1-40, 2023.
