Grading Programming Assignments by Summarization - LLMs
ACM-TURC ’24, July 05–07, 2024, Changsha, China Dong Dong and Yue Liang
The remainder of this paper is structured as follows: Section 2 reviews prior work on automated grading of programming assignments. Section 3 describes the proposed LLM-based approach. Section 4 details the experiments used to train the proposed model and test the prototype system, together with an analysis of the results. Finally, Section 5 concludes and suggests directions for further work.

2 REVIEW OF RELATED WORK
This paper focuses solely on automatic grading algorithms and systems for programming assignments. The topic has interested computer science educators since the 1960s [6] and remains attractive to researchers up to the present [7]-[15]. Most systems perform fully automated assessment to offer near-instantaneous feedback, which can increase student satisfaction [10]. Most research has focused on assessing correctness. Both dynamic and static approaches are used to automate the assessment of programming assignments [10]. Dynamic approaches, primarily black-box tests, need to run the submitted programs. Static approaches evaluate a student's submission by comparing it against one or more solutions provided by educators [11]. Static approaches apply various methods to evaluate the similarity between a student's submission and the suggested solution(s), such as ASTs (Abstract Syntax Trees), control flow graphs, or dependency graphs [14].

The Virtual Teaching Assistant [16] matches a student's submission against location-free or location-specific patterns, relaxing the strict specification of program output and allowing partial marks to be awarded for programs with partial functional correctness. Location-free means the order of program output tokens, characters, or specific structure can be ignored when determining the grade. Conversely, if the order of the output or the specific structure does impact the grade, the pattern is referred to as location-specific. Semantic similarity-based approaches have also been studied.

Many systems in the educational technology domain employ the dynamic approach, such as LeetCode (leetcode.com), Codeforces (codeforces.com), and PKU JudgeOnline (poj.org). These solutions are dedicated to instant feedback mechanisms for programmers. For example, CG Online Judge (course.educg.net) is an excellent teaching platform for computer science. Like other OJs, CG assesses student submissions for programming assignments with a set of black-box test cases. In addition, CG provides multiple strategies for aligning expected outputs with the actual outputs generated by the student's program, some of which are detailed below:

1. Perfect Match: This is the default and most frequently used matching strategy of the OJ system. It mandates that the student's program output strictly align with the predefined expected output. For the purpose of comparison, whitespace characters (such as tabs, spaces, and control characters that are not visually apparent) are automatically removed from the beginning and end of lines, along with any blank lines.

2. Exact Matching: This approach demands that the student's program output mirror the expected output with complete precision. In contrast to the "Perfect Match" strategy, it does not overlook any whitespace characters, including line breaks.

3. Wildcard Matching: This strategy permits wildcard characters, such as the asterisk (*) and question mark (?), in the expected output, matched case-insensitively. Furthermore, wildcard fuzzy matching is applicable where the output spans multiple lines; in such cases, the output lines from the user's program are deemed correct provided they constitute a subset of the expected output.

4. Regular Expression Matching: The expected output is expressed as regular expressions.

5. Set Matching: The expected output must equal the set of user program output lines, regardless of line order.

6. Longest Common Subsequence Matching.

7. Edit Distance Matching.

These distinctive features imply that both instructors and students expect an objective grade that matches students' effort as exactly as an OJ can.

Machine learning technology first trains a model on data sets using various algorithms, then predicts outcomes with the model without human intervention. Supervised learning is a widely used strategy, in which algorithms are trained on datasets annotated by human graders. In contrast, unsupervised learning lets algorithms explore unlabeled datasets, seeking underlying patterns and commonalities among student submissions. In this context, machine learning methods such as regression models, decision trees, and support vector machines are leveraged to classify code segments [17]-[20].

For performance evaluation, most papers primarily compare grades from the automatic assessment tools to those provided by human graders. However, it is difficult to reproduce results and compare tools on a common dataset because no common benchmark exists. Furthermore, most of these systems cannot award partial grades for incomplete programs, so students receive a grade of zero. In contrast, a human grader would typically give partial marks for source code that implements a subset of the features or has minor logical or syntax errors. Deep learning models, with their capacity to process vast quantities of data, distill intricate features, and render nuanced judgments, have become increasingly popular within automated grading systems [21]-[25]. Deep learning has evolved significantly in recent years, particularly in automatic grading, where it demonstrates markedly superior performance compared to conventional machine learning models.

3 THE PROPOSED SYSTEM
Our proposed system integrates the analytical prowess of an LLM to mitigate the inherent limitations of conventional black-box test assessment methodologies in programming education. Suppose we are given a set of programming assignments P and a set of solutions
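For illustration, several of the OJ output-matching strategies described in this section can be sketched in Python. This is a minimal sketch, not CG's actual implementation: the function names are ours, wildcard matching is built on the standard `re` module, and `difflib`'s ratio stands in for a true edit-distance or LCS score.

```python
import re
from difflib import SequenceMatcher


def _nonblank_stripped_lines(text: str) -> list[str]:
    """Strip whitespace from both ends of each line and drop blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def perfect_match(expected: str, actual: str) -> bool:
    """Strategy 1: outputs must agree after trimming line ends and blank lines."""
    return _nonblank_stripped_lines(expected) == _nonblank_stripped_lines(actual)


def exact_match(expected: str, actual: str) -> bool:
    """Strategy 2: no normalization; every whitespace character and line break counts."""
    return expected == actual


def wildcard_match(pattern: str, actual: str) -> bool:
    """Strategy 3: case-insensitive match where '*' and '?' act as wildcards."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, actual, flags=re.IGNORECASE | re.DOTALL) is not None


def set_match(expected: str, actual: str) -> bool:
    """Strategy 5: compare outputs as sets of lines, ignoring line order."""
    return set(expected.splitlines()) == set(actual.splitlines())


def similarity(expected: str, actual: str) -> float:
    """In the spirit of strategies 6 and 7: a score in [0, 1] usable for partial credit."""
    return SequenceMatcher(None, expected, actual).ratio()
```

A dichotomous strategy returns pass/fail per test case, whereas the similarity score at the end is the kind of signal a grader could turn into partial marks.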
Figure 5: An example of source code and its summarization used for CodeBERT training
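The pipeline that Figure 5 belongs to (summarize the student's source code with a CodeBERT-style model, then compare the generated summary to a reference summary and map the comparison onto a grade) can be sketched schematically. Everything here is a hypothetical stand-in: the summaries are assumed to have been produced by the LLM already, `summary_similarity` uses `difflib` rather than real semantic embeddings, and the 5-score thresholds are invented for illustration.

```python
from difflib import SequenceMatcher


def summary_similarity(student_summary: str, reference_summary: str) -> float:
    """Stand-in for semantic comparison of two natural-language summaries;
    a real system might embed both with an LLM and take a cosine similarity."""
    return SequenceMatcher(None, student_summary.lower(),
                           reference_summary.lower()).ratio()


def five_score(sim: float) -> int:
    """Map a similarity in [0, 1] onto a 5-point grade (thresholds are invented)."""
    for threshold, score in [(0.9, 5), (0.75, 4), (0.6, 3), (0.4, 2), (0.2, 1)]:
        if sim >= threshold:
            return score
    return 0


def grade(student_summary: str, reference_summary: str) -> int:
    """Grade a submission by comparing its generated summary to the reference."""
    return five_score(summary_similarity(student_summary, reference_summary))
```

Unlike a pass/fail test case, this mapping always yields a graded score, which is what allows partial credit for partially correct submissions.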
the other systems do, but it can also accommodate the diversity of solutions inherent in programming assignments and produce grades by a 5-score method. Our approach can encourage students according to their efforts, since it produces grades as exactly as it can. This kind of acknowledging feedback motivates students to concentrate on their individual learning paths, thereby enhancing their skills. Our method represents a novel type of automatic grading system that has received less investigation: one based on Large Language Models (LLMs). The core concept is to evaluate problem answers by summarizing source code through LLMs. This methodology surpasses the limitations of traditional online judges.

Future research should address fairness. Human judgment and expertise may still be necessary to ensure equitable assessment, given the inherent challenges in evaluating creativity, critical thinking, and logical structures. The integration of LLMs in assessment raises concerns about the potential displacement of human graders. Educators should aim to use LLMs as a tool to augment their roles, not to supplant them.

Another direction is to explore the automatic grading of student achievements based on educational taxonomies such as Bloom's Cognitive Competency Model.

As mentioned in the introduction, the importance of the problem-solving idea often exceeds that of the answer itself. How to discover ideas from source code is also a challenging research area.

ACKNOWLEDGMENTS
This work was partially supported by the Education Development Project of Hebei Province Education Department, China (Grant No. WTZX202421) and the Humanities and Social Sciences Foundation of Hebei Normal University (Grant No. S23JX003).

REFERENCES
[1] Rajendra K. Raj and Amruth N. Kumar. 2022. Toward computer science curricular guidelines 2023 (CS2023). ACM Inroads 13, 4 (December 2022), 22–25. https://doi.org/10.1145/3571092
[2] Marcus Messer. 2022. Grading programming assignments with an automated grading and feedback assistant. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners' and Doctoral Consortium: 23rd International Conference, AIED 2022, Durham, UK, July 27–31, 2022, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 35–40. https://doi.org/10.1007/978-3-031-11647-6_6
[3] A. Eckerdal, A. Berglund, and M. Thuné. 2023. Learning programming practice and programming theory in the computer laboratory. European Journal of Engineering Education 49, 2 (2023), 330–347. https://doi.org/10.1080/03043797.2023.2294953
[4] I. Ivanochko and Y. Kostiv. 2023. Designing and implementation of online judgment system. In Developments in Information and Knowledge Management Systems for Business Applications, N. Kryvinska, M. Greguš, and S. Fedushko (Eds.). Studies in Systems, Decision and Control, Vol. 462. Springer, Cham. https://doi.org/10.1007/978-3-031-25695-0_10
[5] Huiting Wu, Yanshen Liu, Lin Qiu, and Yi Liu. 2016. Online judge system and its applications in C language teaching. In International Symposium on Educational Technology (ISET), Beijing, China, 19–21 July 2016. IEEE, 57–60. doi: 10.1109/ISET.2016.14
[6] J. C. Mitchell. 1996. Foundations for Programming Languages. MIT Press, Cambridge.
[7] Gouri Ginde, Rahul Aedula, and Snehanshu Saha. 2017. Big data acquisition, preparation, and analysis using Apache Software Foundation tools. In Big Data Analytics: Tools and Technology for Effective Planning, Anirudh K. Somani and Ganesh Chandra Deka (Eds.). 2017.
[8] David Hovemeyer. 2022. A framework for declarative autograders. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 2, 6 March 2023. ACM Press, New York, NY, USA, 1282. https://doi.org/10.1145/3545947.3576228
[9] Jack Hollingsworth. 1960. Automatic graders for programming classes. Communications of the ACM 3, 10 (1960), 528–529.
[10] Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2023. Machine learning-based automated grading and feedback tools for programming: a meta-analysis. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2023). ACM Press, New York, NY, USA, 491–497. https://doi.org/10.1145/3587102.3588822
[11] Kirsti M. Ala-Mutka. 2005. A survey of automated assessment approaches for programming assignments. Computer Science Education 15, 2 (2005), 83–102.
[12] J. C. Caiza and J. M. Del Alamo. 2013. Programming assignments automatic grading: review of tools and implementations. In Proceedings of the 7th International Technology, Education and Development Conference, 4–5 March 2013, Valencia, Spain. IATED Publications, Valencia, Spain, 5691–5700.
[13] Draylson M. Souza, Katia R. Felizardo, and Ellen F. Barbosa. 2016. A systematic literature review of assessment tools for programming assignments. In IEEE 29th International Conference on Software Engineering Education and Training (CSEET), Dallas, TX, USA, 2016, 147–156. doi: 10.1109/CSEET.2016.48
[14] Adidah Lajis, Shahidatul Arfah Baharudin, and Diyana Ab Kadir. 2018. A review of techniques in automatic programming assessment for practical skill test. Journal of Telecommunication, Electronic and Computer Engineering (JTEC) 10, 2-5 (2018), 109–113.
[15] H. Aldriye, A. Alkhalaf, and M. Alkhalaf. 2019. Automated grading systems for programming assignments: a literature review. International Journal of Advanced Computer Science and Applications 10, 3 (2019). doi: 10.14569/IJACSA.2019.0100328
[16] Chih-Yueh Chou and Yan-Jhih Chen. 2021. Virtual teaching assistant for grading programming assignments: non-dichotomous pattern based program output matching and partial grading approach. In 2021 IEEE 4th International Conference on Knowledge Innovation and Invention (ICKII). IEEE, 2021.
[17] Y. Brun and M. D. Ernst. 2004. Finding latent code errors via machine learning over program executions. In Proceedings of the 26th International Conference on Software Engineering. IEEE, 2004.
[18] Shashank Srikant and Varun Aggarwal. 2013. Automatic grading of computer programs: a machine learning approach. In 12th International Conference on Machine Learning and Applications, Miami, FL, USA, 2013, 85–92. doi: 10.1109/ICMLA.2013.22
[19] Arjun Verma, Prateksha Udhayanan, Rahul Murali Shankar, Nikhila KN, and Sujit Kumar Chakrabarti. 2021. Source-code similarity measurement: syntax tree fingerprinting for automated evaluation. In Proceedings of the First International Conference on AI-ML Systems (AIMLSystems '21), Bangalore, India. ACM Press, New York, NY, USA. doi: 10.1145/3486001.3486228
[20] Walker Orr and Nathaniel Russell. 2021. Automatic assessment of the design quality of Python programs with personalized feedback. In Proceedings of the 14th International Conference on Educational Data Mining, June 2021, 495–501.
[21] Wenhui Wang, Furu Wei, and Li Dong. 2020. MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems 33 (2020), 5776–5788.
[22] Mostafizer Rahman, Yutaka Watanobe, and Keita Nakamura. 2020. Source code assessment and classification based on estimated error probability using attentive LSTM language model and its application in programming education. Applied Sciences 10, 8 (2020), 2973.
[23] Fabio Rezende de Souza, Francisco de Assis Zampirolli, and Guiou Kobayashi. 2019. Convolutional neural network applied to code assignment grading. In 11th International Conference on Computer Supported Education, January 2019. SCITEPRESS. doi: 10.5220/0007711000620069
[24] M. L. Wickramasinghe, H. P. Wijethunga, and S. R. Yapa. 2020. Smart exam evaluator for object-oriented programming modules. In 2020 2nd International Conference on Advancements in Computing (ICAC), Vol. 1. IEEE, 2020, 1–6.
[25] Roshan Vasu Muddaluru and Sharvaani Ravikumar Thoguluva. 2023. Auto-grading C programming assignments with CodeBERT and Random Forest Regressor. In 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2023, 1–6.