Grading Programming Assignments by Summarization - LLMs
ACM-TURC ’24, July 05–07, 2024, Changsha, China Dong Dong and Yue Liang
The remainder of this paper is structured as follows: Section 2 reviews prior work on automated grading of programming assignments. Section 3 describes the proposed LLM-based approach. Section 4 details the experiments used to train the proposed model and test the prototype system, together with an analysis of the results. Finally, Section 5 concludes and suggests directions for further work.

2 REVIEW OF RELATED WORK
This paper focuses solely on automatic grading algorithms and systems for programming assignments. The topic has interested computer science educators since the 1960s [6] and remains attractive to researchers up to the present [7]-[15]. Most systems perform fully automated assessment to offer near-instantaneous feedback, which can increase student satisfaction [10]. Most research has focused on assessing correctness. Both dynamic and static approaches are used to automate the assessment of programming assignments [10]. Dynamic approaches, primarily black-box tests, need to run the submitted programs. Static approaches evaluate a student's submission by comparing it against one or more solutions provided by educators [11]. Static approaches apply various methods to evaluate the similarity between a student's submission and the suggested solution(s), such as ASTs (Abstract Syntax Trees), control flow graphs, or dependency graphs [14].

The Virtual Teaching Assistant [16] matches a student's submission against location-free or location-specific patterns, relaxing the strict specification of program output and allowing partial marks to be awarded for programs with partial functional correctness. Location-free means the order of program output tokens, characters, or specific structure can be ignored when determining the grade. Conversely, if the order of the output or the specific structure does impact the grade, the pattern is referred to as location-specific. Semantic similarity-based approaches have also been studied.

Many systems in the educational technology domain employ the dynamic approach, such as LeetCode (leetcode.com), Codeforces (codeforces.com), and PKU JudgeOnline (poj.org). These solutions are dedicated to instant feedback mechanisms for programmers. For example, CG Online Judge (course.educg.net) is an excellent teaching platform for computer science. Like other OJs, CG assesses student submissions for programming assignments with a set of black-box test cases. In addition, CG provides multiple strategies for aligning expected outputs with the actual outputs generated by the student's program, some of which are detailed below:

1. Perfect Match: This is the default and most frequently used matching strategy of the OJ system. It mandates that the student's program output strictly align with the predefined expected output. For the purpose of comparison, whitespace characters (such as tabs, spaces, and control characters that are not visually apparent) are automatically removed from the beginning and end of lines, along with any blank lines.

2. Exact Matching: This approach demands that the student's program output mirror the expected output with complete precision. In contrast to the "Perfect Match" strategy, it does not overlook any whitespace characters, including line breaks.

3. Wildcard Matching: This strategy permits wildcard characters, such as the asterisk (*) and question mark (?), in the expected output, matched case-insensitively. Furthermore, wildcard fuzzy matching is applicable where the output spans multiple lines; in such cases, the output lines from the user's program are deemed correct provided they constitute a subset of the expected output.

4. Regular Expression Matching: The expected output is expressed as regular expressions.

5. Set Matching: The expected output must equal the set of user program output lines, regardless of line order.

6. Longest Common Subsequence Matching.

7. Edit Distance Matching.

These distinctive features imply that both instructors and students expect an objective grade that matches students' effort as exactly as an OJ can.

Machine learning technology first trains a model on data sets using various algorithms, then predicts outcomes with the model without human intervention. Supervised learning is a widely used strategy, in which algorithms are trained on datasets annotated by human graders. In contrast, unsupervised learning lets algorithms explore unlabeled datasets, seeking underlying patterns and commonalities among student submissions. In this context, machine learning methods such as regression models, decision trees, and support vector machines are leveraged to classify code segments [17]-[20].

For performance evaluation, most papers primarily compare grades from the automatic assessment tools to those provided by human graders. However, it is difficult to reproduce results and compare tools on a common dataset because no common benchmark exists. Furthermore, most of these systems cannot award partial grades for incomplete programs, so students receive a grade of zero. In contrast, a human grader would typically give partial marks for source code that implements a subset of the features or has minor logical or syntax errors. Deep learning models, with their capacity to process vast quantities of data, distill intricate features, and render nuanced judgments, have become increasingly popular within automated grading systems [21]-[25]. Deep learning has evolved significantly in recent years, particularly in automatic grading, where it demonstrates markedly superior performance compared to conventional machine learning models.

3 THE PROPOSED SYSTEM
Our proposed system integrates the analytical prowess of an LLM to mitigate the inherent limitations of conventional black-box test assessment methodologies in programming education. Suppose we are given a set of programming assignments P and a set of solutions
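For illustration, several of the OJ output-matching strategies described in this section can be sketched in Python. This is a minimal sketch, not CG's actual implementation: the function names are ours, wildcard matching is built on the standard `re` module, and `difflib`'s ratio stands in for a true edit-distance or LCS score.

```python
import re
from difflib import SequenceMatcher


def _nonblank_stripped_lines(text: str) -> list[str]:
    """Strip whitespace from both ends of each line and drop blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def perfect_match(expected: str, actual: str) -> bool:
    """Strategy 1: outputs must agree after trimming line ends and blank lines."""
    return _nonblank_stripped_lines(expected) == _nonblank_stripped_lines(actual)


def exact_match(expected: str, actual: str) -> bool:
    """Strategy 2: no normalization; every whitespace character and line break counts."""
    return expected == actual


def wildcard_match(pattern: str, actual: str) -> bool:
    """Strategy 3: case-insensitive match where '*' and '?' act as wildcards."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, actual, flags=re.IGNORECASE | re.DOTALL) is not None


def set_match(expected: str, actual: str) -> bool:
    """Strategy 5: compare outputs as sets of lines, ignoring line order."""
    return set(expected.splitlines()) == set(actual.splitlines())


def similarity(expected: str, actual: str) -> float:
    """In the spirit of strategies 6 and 7: a score in [0, 1] usable for partial credit."""
    return SequenceMatcher(None, expected, actual).ratio()
```

A dichotomous strategy returns pass/fail per test case, whereas the similarity score at the end is the kind of signal a grader could turn into partial marks.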
Figure 5: An example of source code and its summarization used for CodeBERT training
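The pipeline that Figure 5 belongs to (summarize the student's source code with a CodeBERT-style model, then compare the generated summary to a reference summary and map the comparison onto a grade) can be sketched schematically. Everything here is a hypothetical stand-in: the summaries are assumed to have been produced by the LLM already, `summary_similarity` uses `difflib` rather than real semantic embeddings, and the 5-score thresholds are invented for illustration.

```python
from difflib import SequenceMatcher


def summary_similarity(student_summary: str, reference_summary: str) -> float:
    """Stand-in for semantic comparison of two natural-language summaries;
    a real system might embed both with an LLM and take a cosine similarity."""
    return SequenceMatcher(None, student_summary.lower(),
                           reference_summary.lower()).ratio()


def five_score(sim: float) -> int:
    """Map a similarity in [0, 1] onto a 5-point grade (thresholds are invented)."""
    for threshold, score in [(0.9, 5), (0.75, 4), (0.6, 3), (0.4, 2), (0.2, 1)]:
        if sim >= threshold:
            return score
    return 0


def grade(student_summary: str, reference_summary: str) -> int:
    """Grade a submission by comparing its generated summary to the reference."""
    return five_score(summary_similarity(student_summary, reference_summary))
```

Unlike a pass/fail test case, this mapping always yields a graded score, which is what allows partial credit for partially correct submissions.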
the other systems do, but it can also accommodate the diversity of solutions inherent in programming assignments and produce grades by a 5-score method. Our approach can encourage students according to their efforts, since it produces grades as exactly as it can. This kind of acknowledging feedback motivates students to concentrate on their individual learning paths, thereby enhancing their skills. Our method represents a novel type of automatic grading system that has received less investigation: one based on Large Language Models (LLMs). The core concept is to evaluate problem answers by summarizing source code through LLMs. This methodology surpasses the limitations of traditional online judges.

Future research should address fairness. Human judgment and expertise may still be necessary to ensure equitable assessment, given the inherent challenges in evaluating creativity, critical thinking, and logical structures. The integration of LLMs in assessment raises concerns about the potential displacement of human graders. Educators should aim to use LLMs as a tool to augment their roles, not to supplant them.

Another direction is to explore the automatic grading of student achievements based on educational taxonomies such as Bloom's Cognitive Competency Model.

As mentioned in the introduction, the importance of the problem-solving idea often exceeds that of the answer itself. How to discover ideas from source code is also a challenging research area.

ACKNOWLEDGMENTS
This work was partially supported by the Education Development Project of Hebei Province Education Department, China (Grant No. WTZX202421) and the Humanities and Social Sciences Foundation of Hebei Normal University (Grant No. S23JX003).

REFERENCES
[1] Rajendra K. Raj and Amruth N. Kumar. 2022. Toward computer science curricular guidelines 2023 (CS2023). ACM Inroads 13, 4 (December 2022), 22–25. https://doi.org/10.1145/3571092
[2] Marcus Messer. 2022. Grading programming assignments with an automated grading and feedback assistant. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners' and Doctoral Consortium: 23rd International Conference, AIED 2022, Durham, UK, July 27–31, 2022, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 35–40. https://doi.org/10.1007/978-3-031-11647-6_6
[3] A. Eckerdal, A. Berglund, and M. Thuné. 2023. Learning programming practice and programming theory in the computer laboratory. European Journal of Engineering Education 49, 2 (2023), 330–347. https://doi.org/10.1080/03043797.2023.2294953
[4] I. Ivanochko and Y. Kostiv. 2023. Designing and implementation of online judgment system. In Developments in Information and Knowledge Management Systems for Business Applications, N. Kryvinska, M. Greguš, and S. Fedushko (Eds.). Studies in Systems, Decision and Control, Vol. 462. Springer, Cham. https://doi.org/10.1007/978-3-031-25695-0_10
[5] Huiting Wu, Yanshen Liu, Lin Qiu, and Yi Liu. 2016. Online judge system and its applications in C language teaching. In International Symposium on Educational Technology (ISET), Beijing, China, 19–21 July 2016. IEEE, 57–60. doi: 10.1109/ISET.2016.14
[6] J. C. Mitchell. 1996. Foundations for Programming Languages. MIT Press, Cambridge.
[7] Gouri Ginde, Rahul Aedula, and Snehanshu Saha. 2017. Big data acquisition, preparation, and analysis using Apache Software Foundation tools. In Big Data Analytics: Tools and Technology for Effective Planning, Anirudh K. Somani and Ganesh Chandra Deka (Eds.). 2017.
[8] David Hovemeyer. 2022. A framework for declarative autograders. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 2, 6 March 2023. ACM Press, New York, NY, USA, 1282. https://doi.org/10.1145/3545947.3576228
[9] Jack Hollingsworth. 1960. Automatic graders for programming classes. Communications of the ACM 3, 10 (1960), 528–529.
[10] Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2023. Machine learning-based automated grading and feedback tools for programming: a meta-analysis. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2023). ACM Press, New York, NY, USA, 491–497. https://doi.org/10.1145/3587102.3588822
[11] Kirsti M. Ala-Mutka. 2005. A survey of automated assessment approaches for programming assignments. Computer Science Education 15, 2 (2005), 83–102.
[12] J. C. Caiza and J. M. Del Alamo. 2013. Programming assignments automatic grading: review of tools and implementations. In Proceedings of the 7th International Technology, Education and Development Conference, 4–5 March 2013, Valencia, Spain. IATED Publications, Valencia, Spain, 5691–5700.
[13] Draylson M. Souza, Katia R. Felizardo, and Ellen F. Barbosa. 2016. A systematic literature review of assessment tools for programming assignments. In IEEE 29th International Conference on Software Engineering Education and Training (CSEET), Dallas, TX, USA, 2016, 147–156. doi: 10.1109/CSEET.2016.48
[14] Adidah Lajis, Shahidatul Arfah Baharudin, and Diyana Ab Kadir. 2018. A review of techniques in automatic programming assessment for practical skill test. Journal of Telecommunication, Electronic and Computer Engineering (JTEC) 10, 2-5 (2018), 109–113.
[15] H. Aldriye, A. Alkhalaf, and M. Alkhalaf. 2019. Automated grading systems for programming assignments: a literature review. International Journal of Advanced Computer Science and Applications 10, 3 (2019). doi: 10.14569/IJACSA.2019.0100328
[16] Chih-Yueh Chou and Yan-Jhih Chen. 2021. Virtual teaching assistant for grading programming assignments: non-dichotomous pattern based program output matching and partial grading approach. In 2021 IEEE 4th International Conference on Knowledge Innovation and Invention (ICKII). IEEE, 2021.
[17] Y. Brun and M. D. Ernst. 2004. Finding latent code errors via machine learning over program executions. In Proceedings of the 26th International Conference on Software Engineering. IEEE, 2004.
[18] Shashank Srikant and Varun Aggarwal. 2013. Automatic grading of computer programs: a machine learning approach. In 12th International Conference on Machine Learning and Applications, Miami, FL, USA, 2013, 85–92. doi: 10.1109/ICMLA.2013.22
[19] Arjun Verma, Prateksha Udhayanan, Rahul Murali Shankar, Nikhila KN, and Sujit Kumar Chakrabarti. 2021. Source-code similarity measurement: syntax tree fingerprinting for automated evaluation. In Proceedings of the First International Conference on AI-ML Systems (AIMLSystems '21), Bangalore, India. ACM Press, New York, NY, USA. doi: 10.1145/3486001.3486228
[20] Walker Orr and Nathaniel Russell. 2021. Automatic assessment of the design quality of Python programs with personalized feedback. In Proceedings of the 14th International Conference on Educational Data Mining, June 2021, 495–501.
[21] Wenhui Wang, Furu Wei, and Li Dong. 2020. MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems 33 (2020), 5776–5788.
[22] Mostafizer Rahman, Yutaka Watanobe, and Keita Nakamura. 2020. Source code assessment and classification based on estimated error probability using attentive LSTM language model and its application in programming education. Applied Sciences 10, 8 (2020), 2973.
[23] Fabio Rezende de Souza, Francisco de Assis Zampirolli, and Guiou Kobayashi. 2019. Convolutional neural network applied to code assignment grading. In 11th International Conference on Computer Supported Education, January 2019. SCITEPRESS. doi: 10.5220/0007711000620069
[24] M. L. Wickramasinghe, H. P. Wijethunga, and S. R. Yapa. 2020. Smart exam evaluator for object-oriented programming modules. In 2020 2nd International Conference on Advancements in Computing (ICAC), Vol. 1. IEEE, 2020, 1–6.
[25] Roshan Vasu Muddaluru and Sharvaani Ravikumar Thoguluva. 2023. Auto-grading C programming assignments with CodeBERT and Random Forest Regressor. In 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2023, 1–6.