Plagiarism Detection Research
A Comparative Study
In Partial Fulfillment
By:
Dela-Cerna, Arwin.
Pazo, Wawie.
Romero, Neil.
Trinidad, Joyce.
May 2023
Republic of the Philippines
North Eastern Mindanao State University
Rosario, Tandag City, Surigao del Sur
Telefax No. 086-214-4221
Website: www.nemsu.edu.ph
TABLE OF CONTENTS
PRELIMINARY
TITLE PAGE
TABLE OF CONTENTS
LIST OF TABLES
CHAPTER 1 – INTRODUCTION
INTRODUCTION
CHAPTER 2 – RELATED LITERATURE AND STUDIES
REVIEW OF RELATED LITERATURE
CHAPTER 3 – METHODOLOGY
METHODOLOGY
ALGORITHMS DESCRIPTION
ALGORITHMS COMPARISON
CHAPTER 4 – RESULTS AND DISCUSSION
RESULTS AND DISCUSSION
CHAPTER 5 – CONCLUSION
CONCLUSION
REFERENCES
List of Tables
Table 1: Test results with the Jaro Winkler Distance algorithm and MS Word Feature Results
Table 2: Test results with the Winnowing algorithm and MS Word Feature Results
Chapter I
INTRODUCTION
Plagiarism is the act of copying someone else's work and passing it off as your own.
It is a common offense in the academic world. With the development of
sophisticated technology, plagiarism is becoming easier to commit, making it very easy for anyone to
access it and extremely useful for coursework, practicum reports, journals, final
namely plagiarism. Plagiarism can harm a student's integrity in several ways, including
making them unable to come up with new ideas and making them lazy. Acts that are
classified as plagiarism include presenting the ideas and works of others as the
result of one's own work, and copying from articles without citing where
the writing was taken from. Plagiarism can also include the use of someone else's ideas or
concepts without proper citation or acknowledgement. For this reason, it is crucial to
always give fair credit to the author of a piece of work and to properly credit any
For this study, the researchers used the Jaro Winkler Distance algorithm and the
algorithm uses a string metric approach to measure the similarity between two strings.
This research focuses on two algorithms, Jaro Winkler and Winnowing. The Jaro Winkler
Distance Algorithm and the Winnowing Algorithm have not yet been compared to determine
which one gives the better results. For this reason, both algorithms
were used in this study. Data for this study were gathered
from journal articles. The data were taken from the website (https://garuda.ristekbrin.go.id/).
The scope of the comparison in this test is limited to the abstract section of each journal
document. The purpose of this research is to identify the better algorithm for detecting
word similarity by comparing the Jaro Winkler Distance Algorithm with the
Winnowing Algorithm.
Chapter II
REVIEW OF RELATED LITERATURE
Distance Algorithm and Winnowing Algorithm, which we obtained from various sources such as
the internet. To make this study more efficient and understandable, we also extensively
essay or a point of view from another person, does not cite the original sources, and
then claims it as their own. To limit plagiarism, it is important to
build a virtual machine that detects plagiarism in works to be uploaded to
online journal media. Most of the tools used, including Turnitin, Plagscout, Viper,
and Plagiarism Checker, are foreign-made. Students and professors must connect to their
network in order to check for plagiarism. Most of the fee-based plagiarism detectors
have the potential to deplete the nation's foreign exchange. The authenticity of
article titled "Jaro Winkler Algorithm for Measuring Similarity Online News" that online
news is a source of information for people; this affects journalists as news writers who
can find news information quickly and accurately every day. Journalists may plagiarize
other journalists or take news information from other news media sites and post it in the
required. The Jaro Winkler algorithm was proposed in this study, with the result obtained
from the calculation normalised so that 0 means there is no similarity and 1 means it has
an exact match. Data from 20 online news media sources in Central
Kalimantan were used. To obtain news on the same topic, the scraping procedure
The news similarity computation with the Jaro Winkler algorithm
yielded an average value of 74.49%, with 43 news items at a severe plagiarism level
and 12 news items at a moderate plagiarism level. The Jaro Winkler algorithm has
weaknesses in determining the similarity value for the given data: some items that should
have been rated as high but not severe plagiarism were not detected as such, and vice versa.
In a study carried out by Melani and Clara (2022), plagiarism is described as the act of
reproducing another person's work and presenting it as one's own, which is unlawful.
definition, and one of its similarities is the substance of a work that uses phrases,
individuals. The first step in minimizing plagiarism is to find similarities between two
ideas in a work. With software that can find similar phrases in a document, calculations
inside the document is therefore important in order to solve the issue. The researchers went
through a number of steps, starting with the preprocessing stage, and then used the
Jaro Winkler Distance method and the Winnowing technique to calculate similarities. With
the aid of the simulator, the computation procedure was carried out manually. As can be
observed from the accuracy values achieved by the Winnowing Algorithm, which was
tested against the Jaro Winkler Distance Algorithm, of 94.4% and 88.8%, respectively,
Chapter III
METHODOLOGY
The methodology used in this study to identify the better algorithm between the
Jaro Winkler Distance algorithm and the Winnowing algorithm consists of data collection,
text preprocessing, and word similarity detection using the Jaro Winkler Distance Algorithm
ALGORITHMS DESCRIPTION
shifting) required to make the strings match. The algorithm assigns a value
works:
distance. The scale factor is determined based on the length of the common
hash value.
The core idea of the Winnowing algorithm is to select the minimum hash
value within a window as the representative for that window. By doing so, it
captures the most significant hash values while discarding less important
collections of documents.
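To illustrate the window-minimum idea described above, the following is a minimal Python sketch, not the implementation used in this study. It assumes character-level k-grams; the k-gram length k, window size w, hash base, and modulus are illustrative values, and all function names are hypothetical. The last two functions compare the resulting fingerprint sets with the Jaccard coefficient, which the methodology names as the final similarity measure.

```python
def kgram_hashes(text, k=5, base=257, mod=1_000_000_007):
    """Polynomial rolling hash of every k-gram in text (k, base, mod are assumed values)."""
    if len(text) < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the character that leaves the k-gram
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        hashes.append(h)
    return hashes


def winnow(hashes, w=4):
    """Keep the minimum hash value of every window of w consecutive hashes."""
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two fingerprint sets (0.0 for two empty sets, by convention here)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=5, w=4):
    """Fingerprint both documents and compare the fingerprint sets."""
    return jaccard(winnow(kgram_hashes(doc_a, k), w),
                   winnow(kgram_hashes(doc_b, k), w))


if __name__ == "__main__":
    print(similarity("plagiarism detection research",
                     "research on plagiarism detection"))
```

Because only one hash per window is kept, the fingerprint set is much smaller than the full list of k-gram hashes, while any shared passage long enough to fill a window still contributes a common fingerprint.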
ALGORITHMS COMPARISON
A. Data Collection
In this study, the researchers used three categories of data. Each data item in a
category consists of two documents, which are compared to determine
their percentage of similarity. The categories are as follows:
1. Category All Words Are Not Equal: In this category, none of the words in the
two documents are the same. This category uses 1 data item. The naming of the
2. Category Partial Words Same: In this category, some of the words in the
documents are the same; the documents being compared
have the same object of research in the journal. Whether the same words
exist is checked manually. This category includes several variations in the data
to be tested. The data variations in this category are the displacement of the
sentence position, the change in the sentence pattern from the active
effect of the same initial word, and the influence of the match distance range.
3. Category All Words Same: In this category, both documents are taken from
the same journal, so the words and the number of words in document 1
and document 2 are the same. The variations used in this category are the
the effect of the same initial word and the influence of the match
distance range. This category uses 6 data items. The naming of the data used is
DU-
B. Text Preprocessing
After data collection is finished, the next step is preprocessing the data.
documents into lowercase letters (a-z), omits characters other than letters,
text into chunks of words known as tokens. Stopword removal is the stage that removes words that
carry little information from the text. Stopwords are common words that usually
process of finding the root word of each word by removing all affixes.
2. Winnowing Algorithm
capital letters in documents into lowercase letters (a-z), characters other than
for removing words that carry little information from the text. Stopwords are
common words that usually appear in large numbers and are considered
meaningless. Stemming is the process of finding the root word of each word
by removing all affixes.
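The preprocessing stages named above (case folding, removing non-letter characters, tokenizing, stopword removal, and stemming) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the stopword list is a tiny made-up sample, the suffix-stripping stemmer is a toy stand-in for a real stemmer for the language of the abstracts, and all function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword list, not the list used in the study.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}

# Assumption: a toy suffix stripper standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def case_fold(text):
    """Convert all capital letters to lowercase."""
    return text.lower()


def filter_chars(text):
    """Replace every character other than a-z and whitespace with a space."""
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Split the text into word tokens."""
    return text.split()


def remove_stopwords(tokens):
    """Drop common low-information words."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip one known suffix, keeping at least three characters of the root."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the stages in the order described in the text."""
    tokens = tokenize(filter_chars(case_fold(text)))
    return [stem(t) for t in remove_stopwords(tokens)]


if __name__ == "__main__":
    print(preprocess("The Algorithm is Detecting 2 matched Words!"))
    # ['algorithm', 'detect', 'match', 'word']
```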
After preprocessing, the two documents in each data item of each
category are compared using the Jaro Winkler Distance algorithm and the
The Jaro Winkler Distance algorithm works by calculating the Jaro
similarity value of the two documents that have gone through
preprocessing, calculating the Jaro Winkler distance value, and calculating
the percentage. Computing the Jaro similarity value uses several variables,
including the number of words in each of the two texts, the
values in both documents. The Jaro similarity value, the prefix value, and the
distance value are the variables used to calculate the Jaro Winkler distance.
(The prefix value is the number of identical words at the beginning of the two
documents before the first mismatch, with a maximum value of 4; the constant
value is p = 0.1.) After the Jaro Winkler distance is obtained, the results are
converted to a percentage.
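The calculation described above can be sketched in Python using the standard textbook form of the Jaro and Jaro Winkler measures; this is a sketch under that assumption, not the exact procedure followed in this study. It works on any sequence, so it can be applied to lists of word tokens as in the text or to plain strings. The constants p = 0.1 and the maximum prefix of 4 come from the text; the matching-window rule and the function names are assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity of two sequences (word lists or strings), in [0, 1]."""
    if not s1 or not s2:
        return 0.0
    # Assumed standard matching window: items match only if they are this close.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0  # number of matching items
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched items that appear in a different order, halved.
    seq1 = [s1[i] for i in range(len(s1)) if matched1[i]]
    seq2 = [s2[j] for j in range(len(s2)) if matched2[j]]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro similarity by the length of the common prefix (at most 4)."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


if __name__ == "__main__":
    doc1 = ["plagiarism", "detection", "with", "string", "metric"]
    doc2 = ["plagiarism", "detection", "with", "metric", "string"]
    print(f"{jaro_winkler(doc1, doc2) * 100:.2f}%")
```

With character strings, the well-known pair "MARTHA" and "MARHTA" gives a Jaro value of about 0.944 and a Jaro Winkler value of about 0.961, which is a convenient sanity check for the sketch.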
2. Winnowing Algorithm
The Winnowing algorithm works by converting the contents of the two
documents into a series of n-grams, then converting each n-gram in the
sequence to a hash value using the rolling hash method. The hash values
are then divided into windows of size w. Windowing is the process of
forming substrings of w consecutive hash values, and from the
formation); then the Jaccard Coefficient is applied to calculate the word
algorithm. Accuracy is the ratio of correctly identified cases to the total number of
cases. Data are labeled as having low, moderate, or high levels of
plagiarism, and these labels are used to calculate accuracy. The same word pairs from
Formula 1
• Calculating the time complexity of the Jaro Winkler Distance algorithm and
Chapter IV
RESULTS AND DISCUSSION
accuracy and time complexity of the Jaro Winkler Distance algorithm and the
Winnowing Algorithm.
A. This chapter presents the accuracy of
word similarity detection by the Jaro Winkler Distance algorithm and the
TABLE 1. Test results with the Jaro Winkler Distance algorithm and MS Word Feature Results

Data    MS Word Feature Results   Label   Jaro Winkler Distance Algorithm Results   Label
DU-1    0%                        R       0%                                        R
DU-2a   37.15%                    S       46.2%                                     S
DU-2b   33.51%                    S       40.8%                                     R
DU-2c   37.15%                    S       43.4%                                     S
DU-2d   37.15%                    S       45.2%                                     S
DU-2e   37.15%                    S       40.8%                                     R
DU-2f   37.15%                    S       51.6%                                     S
DU-2g   1.04%                     R       34%                                       S
DU-2h   1.04%                     R       0%                                        R
DU-2i   44.24%                    S       50.8%                                     S
DU-2j   54.86%                    S       54.2%                                     S
DU-2k   54.86%                    S       54.1%                                     S
DU-3a   100%                      B       100%                                      B
DU-3b   100%                      B       70%                                       S
DU-3c   100%                      B       94.8%                                     B
DU-3d   100%                      B       100%                                      B
DU-3e   100%                      B       98%                                       B
Information:
R = Light
S = Medium
B = Heavy
TR = 2
TS = 9
TB = 5
FRB = 0
FRS = 1
FBR = 0
FBS = 1
FSR = 0
FSB = 0
accuracy = 16/18 × 100% = 88.8%
2. Test results and calculation of the Winnowing Algorithm
TABLE 2. Test results with Winnowing algorithm and MS Word Feature Results
Information:
R = Light
S = Medium
B = Heavy
TR = 3
TS = 8
TB = 6
FRB = 0
FRS = 0
FSR = 1
FBR = 0
FBS = 0
FSB = 0
accuracy=17/18×100%=94.4%
Based on these accuracy calculations, the Jaro Winkler Distance
Algorithm achieved an accuracy of 88.8% and the Winnowing Algorithm achieved an
accuracy of 94.4%.
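The two accuracy figures can be reproduced from the counts listed under each table. The Python sketch below is only an illustration: it assumes that TR, TS, and TB count the document pairs whose algorithm label agrees with the MS Word label (R, S, or B) and that the six F counts are the disagreements; the text does not define these abbreviations, so that reading and the function name are assumptions.

```python
def accuracy(true_counts, false_counts):
    """Percentage of correctly labeled pairs out of all labeled pairs."""
    correct = sum(true_counts.values())
    total = correct + sum(false_counts.values())
    return 100.0 * correct / total


# Counts as reported for the Jaro Winkler Distance algorithm.
jaro_winkler_acc = accuracy(
    {"TR": 2, "TS": 9, "TB": 5},
    {"FRB": 0, "FRS": 1, "FBR": 0, "FBS": 1, "FSR": 0, "FSB": 0},
)

# Counts as reported for the Winnowing algorithm.
winnowing_acc = accuracy(
    {"TR": 3, "TS": 8, "TB": 6},
    {"FRB": 0, "FRS": 0, "FSR": 1, "FBR": 0, "FBS": 0, "FSB": 0},
)

if __name__ == "__main__":
    print(f"Jaro Winkler: {jaro_winkler_acc:.2f}%")  # 16/18, reported as 88.8%
    print(f"Winnowing:    {winnowing_acc:.2f}%")     # 17/18, reported as 94.4%
```

The reported values 88.8% and 94.4% correspond to 16/18 and 17/18 with the decimals truncated after one place.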
B. Time Complexity
The time complexity obtained by the Jaro Winkler Distance algorithm is O(n^2) and the
Chapter V
CONCLUSION
Based on the results of the analysis and discussion carried out, the conclusions of this
1. The accuracy obtained with the Jaro Winkler Distance Algorithm
is 88.8% and with the Winnowing Algorithm is 94.4%, so the best algorithm in
Algorithm.
2. Jaro Winkler Distance Algorithm and Winnowing Algorithm have the same
have their own strengths and weaknesses. The Jaro-Winkler Distance Algorithm is a
string-matching algorithm that is widely used to determine the similarity between two
strings. It is effective at detecting plagiarism in shorter texts but may be less efficient
on longer texts. It is also faster than the Winnowing Algorithm. On the
other hand, the Winnowing Algorithm is a fingerprinting algorithm that is best suited for
plagiarism in larger texts but may be less efficient on shorter texts. It is also slower
than the Jaro-Winkler Distance Algorithm. In conclusion, the choice between these two
algorithms depends on the specific requirements of the plagiarism detection task
at hand. If the task involves shorter texts, the Jaro-Winkler Distance Algorithm may be
the better choice. However, if the task involves longer texts, the Winnowing Algorithm
may be the better choice.
REFERENCES
(2020): 2130-2136.
LLC, 2022.
(2022): 975-982.
4. Hakim, L., 2019. Penggunaan N-Gram dan Jaro Winkler Distance pada aplikasi
Machine ( NBSVM ) Classifier,” 2019 Int. Conf. Comput. Sci. Inf. Technol. Electr.