Analytics 03 00004 v2
1 Graduate School of Natural Science and Technology, Okayama University, Okayama 700-8530, Japan;
[email protected] (E.E.H.); [email protected] (K.H.W.);
[email protected] (S.T.A.); [email protected] (X.L.)
2 Department of Computer and Information Science, Tokyo University of Agriculture and Technology,
Tokyo 184-8588, Japan; [email protected]
3 Department of Electrical Engineering, National Taiwan Normal University, Taipei 106, Taiwan;
[email protected]
* Correspondence: [email protected]
† This paper is an extended version of our paper published in 2023 11th International Conference on
Information and Education Technology (ICIET), 18–20 March 2023, Fujisawa, Japan.
Abstract: A web-based Java programming learning assistant system (JPLAS) has been developed for
novice students to study Java programming by themselves while enhancing code reading and code
writing skills. One type of the implemented exercise problems is the code writing problem (CWP), which
asks students to create a source code that can pass the given test code. The correctness of this answer
code is validated by running it on JUnit. In previous works, a Python-based answer code validation
program was implemented to assist teachers. It automatically verifies the source codes from all the
students for one test code, and reports the number of test cases passed by each code in a CSV file.
While this program plays a crucial role in checking the correctness of code behaviors, it cannot detect
code plagiarism that can often happen in programming courses. In this paper, we implement a code
plagiarism checking function in the answer code validation program, and present its application results to a
Java programming course at Okayama University, Japan. This function first removes the whitespace
characters and the comments using regular expressions. Next, it calculates the Levenshtein distance
and the similarity score for each pair of source codes from different students in the class. If the score is
larger than a given threshold, the pair is regarded as plagiarism. Finally, it outputs the scores as a CSV
file with the student IDs. For evaluations, we applied the proposed function to a total of 877 source
codes for 45 CWP assignments submitted from 9 to 39 students and analyzed the results. It was
found that (1) CWP assignments asking for shorter source codes generate higher scores than those
for longer codes due to the use of test codes, (2) proper thresholds are different by assignments, and
(3) some students often copied source codes from certain students.
Keywords: Java programming learning; JPLAS; JUnit; code writing problem; plagiarism; Levenshtein
distance; Python
Citation: Htet, E.E.; Wai, K.H.; Aung, S.T.; Funabiki, N.; Lu, X.; Kyaw, H.H.S.; Kao, W.-C. Code Plagiarism
Checking Function and Its Application for Code Writing Problem in Java Programming Learning
Assistant System. Analytics 2024, 3, 46–62. https://doi.org/10.3390/analytics3010004
Academic Editors: Ping-Feng Pai and Qingshan Jiang
Received: 5 November 2023
Revised: 16 December 2023
Accepted: 15 January 2024
Published: 17 January 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
For decades, Java has been widely used in a variety of practical application systems,
including large enterprise systems in large companies as well as compact systems such
as embedded ones. Therefore, there has been a strong need for engineers who have high
Java programming skills in IT companies. To meet this demand, a lot of universities and
professional schools have provided Java programming courses.
To help Java programming education, we have developed a web-based Java programming
learning assistant system (JPLAS) for novice students to study Java programming by themselves
while enhancing code reading and code writing skills. JPLAS provides various exercise
problems at different difficulty levels to support the learning of students at different
learning stages.
For JPLAS, the answer platform has been implemented to help self-studies of students
at home [1]. It is implemented on Node.js [2], and will be distributed to students using
Docker [3]. The correctness of an answer to any exercise problem is verified automatically
and instantly by running the automatic marking functions.
In programming education, novice students should start by solving easy problems to
develop code reading skills. These problems may have short and simple codes that can help
students learn the language rules and basic programming ideas. After gaining solid code
reading skills, they can move to code writing study. If students are not able to understand
source codes written by others, they will have difficulty in writing their own programs
correctly. The learning goals should advance gradually as the students' understanding
improves.
To assist this progressive programming learning by novice students, JPLAS offers the
following types of exercise problems. The grammar-concept understanding problem (GUP)
asks students to identify important words, such as reserved words and common libraries of
the programming language, in the given source code by answering questions that describe
their concepts. The value trace problem (VTP) asks for the current values of important
variables and the output messages in the given source code. The element fill-in-blank problem
(EFP) asks students to fill in the blank elements in the given source code so that the original
source code is restored. The code completion problem (CCP) asks students to correct and complete
the given source code that has several blank elements and incorrect ones. The code writing
problem (CWP) asks students to write a source code that will pass the tests in the given test code,
where the code testing is applied by running both codes on JUnit. In every exercise problem,
the correctness of any answer from a student is verified automatically in the answer platform.
String matching with the correct answer is adopted in GUP, VTP, EFP, and CCP. Unit testing
of the answer code is applied in CWP.
Among these problem types, CWP is designed to study writing source codes from
scratch that will satisfy the requested specifications in the assignment. The answer platform
automatically runs JUnit with the given test code and the submitted source code for code
testing when the answer submission button is clicked [1].
Previously, we implemented the answer code validation program in Python to help
a teacher assign and mark a lot of CWP assignments to many students in their Java
programming course in a university or professional school [4]. This program automatically
verifies the source codes from all the students for each CWP assignment and reports the
number of passed test cases by each code in the CSV file. By checking the summary of
the test results of all the students in the CSV file, the teacher can easily grasp the learning
progress of the students and grade them. However, although this program plays a crucial
role in evaluating the correctness of code behaviors, it cannot detect code plagiarism that can
often happen in programming courses.
In this paper, we implement a code plagiarism checking function in the answer code
validation program, and present its application results to a Java programming course at
Okayama University, Japan. First, this function removes the whitespace characters, such
as spaces or tabs, and the comment lines using the regular expressions. Next, it calculates
the Levenshtein distance and the similarity score for each pair of source codes from different
students in the class. If the score is larger than a given threshold, these two source codes are
regarded as plagiarism. Finally, it outputs the scores as a CSV file with the student IDs. The
function will enhance the functionality of the answer code validation program and contribute
to comprehensive and effective Java programming studies of novice students.
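The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names normalize, BLOCK_COMMENT, LINE_COMMENT, and WHITESPACE are hypothetical, and the patterns assume that comment markers never appear inside string literals (a literal such as "http://..." would be truncated by the line-comment pattern).

```python
import re

# Assumed patterns for Java comments; the paper does not list its exact expressions.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)  # /* ... */ and Javadoc blocks
LINE_COMMENT = re.compile(r"//[^\n]*")               # // ... up to the end of the line
WHITESPACE = re.compile(r"\s+")                      # spaces, tabs, and newlines


def normalize(code: str) -> str:
    """Remove comments first, then every whitespace character."""
    code = BLOCK_COMMENT.sub("", code)
    code = LINE_COMMENT.sub("", code)
    return WHITESPACE.sub("", code)


sample = '''// greeting program
public class Hello {
    /* entry point */
    public static void main(String[] args) {
        System.out.println("Hi");  // print
    }
}'''

print(normalize(sample))
```

Removing comments before whitespace matters here, because the line-comment pattern relies on the newline that ends each comment.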
Currently, code plagiarism has become a serious issue due to the great progress of generative
AI tools such as ChatGPT. For CWP, students can obtain an answer source code for each
assignment by submitting the test code to such a tool. It is expected that the code plagiarism
checking function will also detect source codes generated by AI tools if the possible generated
codes are collected in advance.
Analytics 2024, 3 48
For evaluations, we applied the proposed function to a total of 877 source codes for
45 CWP assignments submitted from 9 to 39 students and analyzed the results. It was
found that (1) CWP assignments asking for shorter source codes generate higher scores
than those for longer codes due to the use of test codes, (2) proper thresholds are different
by assignments, and (3) some students often copied source codes from certain students.
The rest of this paper is organized as follows: Section 2 discusses related works in
literature. Section 3 reviews our previous works of the code writing problem. Section 4
presents the implementation of the code plagiarism checking function. Section 5 discusses
application results. Finally, Section 6 concludes this paper with future works.
2. Literature Review
In this section, we discuss works in the literature that are related to this study.
the authors present the integration of different detection tools as a practical solution to
enhance code similarity examinations in programming courses. Additionally, in [20], the
authors emphasize the need to exclude segments unlikely to indicate plagiarism, offering a
pragmatic approach to refining detection accuracy. Moreover, in [21], the authors introduce
Deimos, a tool with practical implications for instructors, providing efficient and language-
independent plagiarism detections, and enhancing programming education. These studies
collectively emphasize the significance of pragmatic strategies and innovative tools in
practical implementations within programming education.
By examining these papers, our proposal introduces a simple and unique method
for detecting code plagiarism by utilizing regular expressions to streamline source codes
and employing the Levenshtein distance for similarity scoring. We applied the proposal
to a Java programming course at Okayama University, demonstrating its practicality and
effectiveness in a real-world educational context. Although the Levenshtein distance is a
useful metric for detecting plagiarism, the current proposal may have some weaknesses,
such as not considering syntax or grammar.
This test code includes the three import statements for the JUnit packages at Lines
2, 3, and 4. It also declares the BubbleSortTest class at Line 5, which contains one test
method annotated with “@Test” at Line 6. This annotation indicates that the following lines
represent a test case that will be executed on JUnit as the following procedure:
1. Generate the bubbleSort object of the BubbleSort class in the source code.
2. Call the sort method of the bubbleSort object with the arguments for the input data.
3. Compare the output codeOutput of the sort method with the expected one expOutput
using the assertEquals method.
This platform follows the MVC model. For the model (M) part, JUnit is used, and the programs
are implemented in Java. The file system is used to manage the data, where all data are provided
as files. For the view (V) part on the browser, Embedded JavaScript (EJS) is used
instead of the default template engine of Express.js, to avoid a complex syntax structure.
For the control (C) part, Node.js and Express.js are adopted together, and the programs are
implemented in JavaScript.
Figure 2 illustrates the answer interface to solve a CWP assignment on a web browser.
The right side of the interface shows the test code of the assignment. The left side shows
the input form for a student to write the answer source code. A student needs to write the
code to pass all the tests in the test code while looking at it. After completing the source
code, the student needs to submit it by clicking the “Submit” button. Then, the code testing
is immediately conducted by compiling the source code and running the test code with
it on JUnit. The test results will appear on the lower side of the interface. It is noted that
Figures 1 and 2 are adopted from a previous paper [4].
codevalidator
student_codes
student1
student2
addon
test
Java_CWP_algorithm
Java_CWP_basic
output
student1_Java_CWP_basic_output.txt
csv
student1_Java_CWP_basic.csv
where max(length of string1, length of string2) represents the larger length between two
strings string1 and string2.
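For reference, the "where" clause above matches the usual way of normalizing a Levenshtein distance into a percentage. The expression below is a reconstruction under that assumption, not a quotation of the paper's equation.

```latex
\[
\text{similarity score} =
\left( 1 - \frac{\text{Levenshtein distance}(\text{string1}, \text{string2})}
                {\max(\text{length of string1}, \text{length of string2})} \right)
\times 100\%
\]
```

Under this form, identical strings give 100%, and a distance of 2 between strings whose longer length is 100 gives (1 - 2/100) × 100% = 98%.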
By Student 1
    package p1;
    public class HelloWorld{
        public static void main(String[] args) {
            System.out.println("Hello World!");
        }
    }
By Student 2
    package p1;
    public class HelloWorld{
        public static void main(String[] args){
            System.out.print ("Hello World!");
        }
    }
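The comparison of these two source codes can be traced with a short sketch. It assumes that all whitespace is removed before comparison and that the similarity score is (1 - distance / longer length) × 100; the function names strip_whitespace, levenshtein, and similarity_score are hypothetical, and the distance uses the standard dynamic programming algorithm rather than any particular library.

```python
import re

# Source codes of the two students, copied from the listing above.
CODE_1 = '''package p1;
public class HelloWorld{
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}'''

CODE_2 = '''package p1;
public class HelloWorld{
    public static void main(String[] args){
        System.out.print ("Hello World!");
    }
}'''


def strip_whitespace(code: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", code)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> float:
    """Assumed normalization: (1 - distance / longer length) * 100."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(a, b) / longest) * 100


s1, s2 = strip_whitespace(CODE_1), strip_whitespace(CODE_2)
distance = levenshtein(s1, s2)
score = similarity_score(s1, s2)
print(distance, round(score, 2))
```

After whitespace removal, the two codes differ only in println versus print, so the distance is 2 and the score is about 98%. The differing spaces around the parentheses and braces no longer affect the result.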
Levenshtein distance computation is given by $O(nm)$, where $n$ and $m$ represent the lengths
of the two source codes. Therefore, the complexity of each computation depends on the
length of the files being compared. However, the source codes to be checked were made by
the students for the same assignment. Thus, it is possible to assume that every code has $n$
characters. As a result, the complexity for each code pair checking would be $O(n^2)$.
The number of source code pairs is given by $k(k-1)/2$ when $k$ students submit source
codes. Therefore, the final computational complexity of the function is given by $O(k^2 n^2)$.
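The pairwise structure behind the $k(k-1)/2$ count can be sketched as follows. This is an illustrative outline only: check_pairs, to_csv, the CSV column names, and the sample student IDs are hypothetical, the inputs are assumed to be already normalized, and the similarity formula is the assumed (1 - distance / longer length) × 100.

```python
import csv
import io
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance in O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> float:
    """Assumed normalization: (1 - distance / longer length) * 100."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else (1 - levenshtein(a, b) / longest) * 100


def check_pairs(codes: dict, threshold: float) -> list:
    """Score every unordered pair of students: k(k-1)/2 comparisons."""
    rows = []
    for (id_a, code_a), (id_b, code_b) in combinations(sorted(codes.items()), 2):
        score = similarity_score(code_a, code_b)
        # A pair is flagged when its score is larger than the threshold.
        rows.append((id_a, id_b, round(score, 2), score > threshold))
    return rows


def to_csv(rows: list) -> str:
    """Render the pair scores as CSV text with hypothetical column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["student_a", "student_b", "similarity", "flagged"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical, already-normalized submissions from three students.
codes = {"s1": "inta=1;", "s2": "inta=1;", "s3": "doubleb=2.5;"}
rows = check_pairs(codes, threshold=90.0)
print(to_csv(rows))
```

With three submissions the loop runs 3 × 2 / 2 = 3 times, and each iteration costs one distance computation, which is where the $O(k^2 n^2)$ bound comes from.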
In addition, in Section 5.1, we measure the CPU time for applying the code
plagiarism checking function to all the source codes for each assignment.
The PC environment consists of an Intel® Core™ i5-7500K CPU @ 3.40 GHz with a 64-bit
Windows 10 Pro operating system. The function was implemented in Python 3.9.6.
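The paper does not state how the CPU time was taken. One plausible way in Python is time.process_time, which counts the CPU time of the current process and excludes time spent sleeping; the helper name measure_cpu_time and the sample workload below are hypothetical.

```python
import time
from typing import Callable, Tuple


def measure_cpu_time(task: Callable[[], object]) -> Tuple[object, float]:
    """Run task once and return its result with the CPU seconds it consumed."""
    start = time.process_time()
    result = task()
    elapsed = time.process_time() - start
    return result, elapsed


# Hypothetical workload standing in for one assignment's pairwise check.
result, seconds = measure_cpu_time(lambda: sum(i * i for i in range(100_000)))
print(f"CPU time: {seconds:.2f} s")
```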
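Table 1.
Group Topic                  ID  Assignment Title      Number of Students  LOC  CPU Time (s)
basic grammar                 1  helloworld                            33    6          1.13
basic grammar                 2  messagedisplay                        33    8          0.27
basic grammar                 3  codecorrection1                       32   11          0.23
basic grammar                 4  codecorrection2                       32   12          0.25
basic grammar                 5  ifandswitch                           32   27          0.25
basic grammar                 6  escapeusage                           32    6          0.23
basic grammar                 7  returnandbreak                        32   18          0.25
basic grammar                 8  octalnumber                           32    8          0.23
basic grammar                 9  hexadecimal                           32    9          1.38
basic grammar                10  maxitem                               32   11          1.02
basic grammar                11  minitem                               31   11          1.05
data structure               12  arraylistimport                       19   35          0.20
data structure               13  linkedlistdemo                        18   28          0.19
data structure               14  hashmapdemo                           17   26          0.22
data structure               15  treesetdemo                           17   32          0.11
data structure               16  que                                   16   17          0.06
data structure               17  stack                                 16   17          0.06
object-oriented programming  18  animal                                16   18          0.06
object-oriented programming  19  animal1                               16   20          0.08
object-oriented programming  20  animalinterfaceusage                  16   29          0.41
object-oriented programming  21  author                                16   34          0.13
object-oriented programming  22  book                                  16   43          0.55
object-oriented programming  23  book1                                 16   24          0.08
object-oriented programming  24  bookdata                              16   40          0.11
object-oriented programming  25  car                                   16   21          0.09
object-oriented programming  26  circle                                16   22          0.09
object-oriented programming  27  gameplayer                            16   13          0.27
object-oriented programming  28  methodoverloading                     16   13          0.31
object-oriented programming  29  physicsteacher                        16   25          0.08
object-oriented programming  30  student                               16   17          0.27
fundamental algorithms       31  binarysearch                          12   12          0.16
fundamental algorithms       32  binsort                               11   20          0.19
fundamental algorithms       33  bubblesort                            11   21          0.22
fundamental algorithms       34  bubblesort1                           11   16          0.17
fundamental algorithms       35  divide                                11    8          0.09
fundamental algorithms       36  GCD                                   11   19          0.13
fundamental algorithms       37  LCM                                   11   18          0.16
fundamental algorithms       38  heapsort                              10   38          0.14
fundamental algorithms       39  insertionsort                         10   23          0.16
fundamental algorithms       40  shellsort                             10   28          0.19
fundamental algorithms       41  quicksort1                             9   38          0.28
fundamental algorithms       42  quicksort2                             9   25          0.11
fundamental algorithms       43  quicksort3                             9   30          0.13
final examination            44  makearray                             39   25          0.34
final examination            45  primenumber                           39   20          0.27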
Table 2 shows, for the basic grammar group, the number of student pairs that had a 100%
similarity score in each given number of assignments. It indicates that one pair submitted
identical source codes for all 11 assignments, and another pair did the same for 10 assignments.
With high probability, these pairs submitted copied source codes, which suggests that some
students repeatedly copied source codes from particular students.
Table 3 shows, for the data structure group, the number of student pairs that had a 100%
similarity score in each given number of assignments. It indicates that one pair submitted
identical source codes for five assignments, and another pair did the same for four assignments.
With high probability, these pairs submitted copied source codes, which again suggests that
some students repeatedly copied source codes from particular students.
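The tallies summarized in Tables 2 and 3 can be derived from the per-assignment pair scores along the following lines. This is a sketch under assumptions: the function name identical_pair_histogram, the nested dictionary layout, and the sample scores are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def identical_pair_histogram(scores_by_assignment: dict) -> Counter:
    """Map 'number of assignments with a 100% score' to 'number of student pairs'.

    scores_by_assignment has the assumed shape
    {assignment_title: {(student_a, student_b): similarity_score}}.
    """
    per_pair = Counter()
    for pair_scores in scores_by_assignment.values():
        for pair, score in pair_scores.items():
            if score >= 100.0:
                # frozenset makes (a, b) and (b, a) count as the same pair.
                per_pair[frozenset(pair)] += 1
    return Counter(per_pair.values())


# Hypothetical scores for three students over two assignments.
scores = {
    "helloworld": {("s1", "s2"): 100.0, ("s1", "s3"): 100.0, ("s2", "s3"): 100.0},
    "maxitem": {("s1", "s2"): 100.0, ("s1", "s3"): 72.5, ("s2", "s3"): 70.0},
}
histogram = identical_pair_histogram(scores)
for assignments, pairs in sorted(histogram.items()):
    print(assignments, pairs)
```

In this sample, one pair is identical in both assignments and two pairs are identical in only one. A pair that matches in many assignments is a stronger signal than a single match on a short assignment, where identical answers can arise independently.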
6. Conclusions
This paper presented the code plagiarism checking function in the code validation program.
It removes the whitespace characters and the comment lines using regular expressions, and
calculates the similarity score from the Levenshtein distance between every pair of source
codes from students. If the score is larger than a given threshold, the pair is regarded as
plagiarism. The results are output in the CSV file. For evaluations, we applied the proposal
to a total of 877 source codes for 45 CWP assignments from 9 to 39 students and analyzed
the results. The results confirm the validity and effectiveness of the proposal.
We also applied this code plagiarism checking function to this year’s Java programming
class. Although we instructed the students not to copy from each other, we still found
that 4 of the 55 students submitted copied source codes for some assignments. In
future works, we will give new assignments to students in Java programming courses,
and apply the proposal to them. We will also study a coding rule checking function to
improve the readability and efficiency of the codes.
Author Contributions: Methodology, S.T.A.; Validation, X.L.; Writing—original draft, E.E.H. and
K.H.W.; Writing—review and editing, E.E.H., K.H.W., N.F. and H.H.S.K.; Supervision, N.F. and
W.-C.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: All data is contained within the manuscript.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Aung, S.T.; Funabiki, N.; Aung, L.H.; Htet, H.; Kyaw, H.H.S.; Sugawara, S. An implementation of Java programming learning
assistant system platform using Node.js. In Proceedings of the International Conference on Information and Education Technology,
Matsue, Japan, 9–11 April 2022; pp. 47–52.
2. Node.js. Available online: https://nodejs.org/en (accessed on 4 November 2023).
3. Docker. Available online: https://www.docker.com/ (accessed on 4 November 2023).
4. Wai, K.H.; Funabiki, N.; Aung, S.T.; Mon, K.T.; Kyaw, H.H.S.; Kao, W.-C. An implementation of answer code validation program
for code writing problem in java programming learning assistant system. In Proceedings of the International Conference on
Information and Education Technology, Fujisawa, Japan, 18–20 March 2023; pp. 193–198.
5. Ala-Mutka, K. Problems in Learning and Teaching Programming. A Literature Study for Developing Visualizations in the Codewitz-Minerva
Project; Tampere University of Technology: Tampere, Finland, 2004; pp. 1–13.
6. Konecki, M. Problems in programming education and means of their improvement. In DAAAM International Scientific Book;
DAAAM International: Vienna, Austria, 2014; pp. 459–470.
7. Queiros, R.A.; Peixoto, L.; Paulo, J. PETCHA—A programming exercises teaching assistant. In Proceedings of the ACM Annual
Conference on Innovation and Technology in Computer Science Education, Haifa, Israel, 3–5 July 2012; pp. 192–197.
8. Li, F.W.-B.; Watson, C. Game-based concept visualization for learning programming. In Proceedings of the ACM Workshop on
Multimedia Technologies for Distance Learning, Scottsdale, AZ, USA, 1 December 2011; pp. 37–42.
9. Ünal, E.; Çakir, H. Students’ views about the problem based collaborative learning environment supported by dynamic web
technologies. Malays. Online J. Edu. Tech. 2017, 5, 1–19.
10. Zinovieva, I.S.; Artemchuk, V.O.; Iatsyshyn, A.V.; Popov, O.O.; Kovach, V.O.; Iatsyshyn, A.V.; Romanenko, Y.O.; Radchenko, O.V.
The use of online coding platforms as additional distance tools in programming education. J. Phys. Conf. Ser. 2021, 1840, 012029.
11. Denny, P.; Luxton-Reilly, A.; Tempero, E.; Hendrickx, J. CodeWrite: Supporting student-driven practice of Java. In Proceedings of
the ACM Technical Symposium on Computer Science Education, Dallas, TX, USA, 9–12 March 2011; pp. 471–476.
12. Shamsi, F.A.; Elnagar, A. An intelligent assessment tool for student’s Java submission in introductory programming courses. J.
Intelli. Learn. Syst. Appl. 2012, 4, 59–69.
13. Edwards, S.H.; Pérez-Quiñones, M.A. Experiences using test-driven development with an automated grader. J. Comput. Sci. Coll.
2007, 22, 44–50.
14. Tung, S.H.; Lin, T.T.; Lin, Y.H. An exercise management system for teaching programming. J. Softw. 2013, 8, 1718–1725.
15. Rani, S.; Singh, J. Enhancing Levenshtein’s edit distance algorithm for evaluating document similarity. In Proceedings of the
International Conference on Computing, Analytics and Networks, Singapore, 27–28 October 2018; pp. 72–80.
16. Ihantola, P.; Ahoniemi, T.; Karavirta, V.; Seppälä, O. Review of recent systems for automatic assessment of programming
assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research, New York, NY,
USA, 28 October 2010; pp. 86–93.
17. Duric, Z.; Gasevic, D. A source code similarity system for plagiarism detection. Comput. J. 2013, 56, 70–86.
18. Ahadi, A.; Mathieson, L. A comparison of three popular source code similarity detecting student plagiarism. In Proceedings of
the Twenty-First Australasian Computing Education Conference, Sydney, Australia, 29–31 January 2019; pp. 112–117.
19. Novak, M.; Joy, M.; Keremek, D. Source-code similarity detection and detection tools used in academia: A systematic review.
ACM Trans. Comp. Educ. 2019, 19, 1–37.
20. Karnalim, S.O.; Sheard, J.; Dema, I.; Karkare, A.; Leinonen, J.; Liut, M.; McCauley, R. Choosing code segments to exclude from
code similarity detection. In Proceedings of the Working Group Reports on Innovation and Technology in Computer Science
Education, Trondheim, Norway, 17–18 June 2020; pp. 1–19.
21. Kustanto, C.; Liem, I. Automatic source code plagiarism detection. In Proceedings of the 10th ACIS International Conference on
Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, Daegu, Republic of Korea, 27–29
May 2009; pp. 481–486.
22. JUnit. Available online: https://en.wikipedia.org/wiki/JUnit (accessed on 4 November 2023).
23. Bubble Sort. Available online: https://en.wikipedia.org/wiki/Bubble_sort (accessed on 4 November 2023).
24. Levenshtein Distance. Available online: https://en.wikipedia.org/wiki/Levenshtein_distance (accessed on 4 November 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.