
Plagiarism Detection Research

This document presents a comparative study between the Jaro Winkler Distance algorithm and the Winnowing algorithm for detecting plagiarism. It provides a literature review of previous studies that have analyzed the Jaro Winkler and Winnowing algorithms for measuring similarity between texts. Specifically, one study found that the Jaro Winkler algorithm achieved an average similarity value of 74.49% when comparing online news articles. Another study directly compared the Jaro Winkler and Winnowing algorithms and found that Winnowing had higher accuracy rates of 94.4% versus 88.8% for Jaro Winkler. The current study aims to apply both algorithms to detect plagiarism and determine which provides the best results.

Republic of the Philippines

North Eastern Mindanao State University


Rosario, Tandag City, Surigao del Sur
Telefax No. 086-214-4221
Website: www.nemsu.edu.ph

COMPARISONS OF ALGORITHMS USED IN PLAGIARISM DETECTION SOFTWARE

A Comparative Study

Presented to the Faculty of College of Information Technology Education

NORTH EASTERN MINDANAO STATE UNIVERSITY

Tandag City, Surigao del Sur

In Partial Fulfillment

of the Requirements for the Degree

Bachelor of Science in Computer Science

By:

Dela-Cerna, Arwin.

Pazo, Wawie.

Romero, Neil.

Trinidad, Joyce.

May 2023

TABLE OF CONTENTS
PRELIMINARY
    TITLE PAGE
    TABLE OF CONTENTS
    LIST OF TABLES
CHAPTER 1 – INTRODUCTION
    INTRODUCTION
CHAPTER 2 – RELATED LITERATURE AND STUDIES
    REVIEW OF RELATED LITERATURE
CHAPTER 3 – METHODOLOGY
    METHODOLOGY
    ALGORITHMS DESCRIPTION
    ALGORITHMS COMPARISON
CHAPTER 4 – RESULTS AND DISCUSSION
    RESULTS AND DISCUSSION
CHAPTER 5 – CONCLUSION
    CONCLUSION
REFERENCES


List of Tables

Table 1: Test results with the Jaro Winkler Distance algorithm and MS Word Feature Results

Table 2: Test results with the Winnowing algorithm and MS Word Feature Results

Chapter I

INTRODUCTION

Plagiarism is the act of copying someone else's work and passing it off as your own. It is an offense that is very common in the academic world. As technology such as the internet grows and becomes more sophisticated, material for coursework, practicum reports, journals, and final assignments becomes easy for anyone to access, which also makes plagiarism easier to commit. Plagiarism harms a student's integrity in several ways, including making them less capable of producing new ideas and making them lazy. Acts classified as plagiarism include presenting the ideas and works of others as the result of one's own work, and taking material from several articles without giving a reference to where the writing was taken from. Plagiarism also includes using someone else's ideas or concepts without proper citation or acknowledgement. For this reason, it is crucial to always give fair credit to the author of a piece of work and to properly cite any sources used.

For this study, the researchers used the Jaro Winkler Distance algorithm and the Winnowing algorithm to compare the words in two documents. The Jaro Winkler Distance algorithm uses a string metric approach to measure the similarity between two strings. The Jaro Winkler Distance Algorithm and the Winnowing Algorithm have not yet been compared to determine which gives the best results, which is why both were used in this study. Data for this study were gathered from journal articles taken from the website https://garuda.ristekbrin.go.id/. The scope of the comparison in this test is the abstract section of each journal document. The purpose of this research is to find the better algorithm for detecting word similarity by comparing the Jaro Winkler Distance Algorithm with the Winnowing Algorithm.

Chapter II

REVIEW OF RELATED LITERATURE


This chapter reviews studies related to the comparison between the Jaro-Winkler Distance Algorithm and the Winnowing Algorithm, gathered from various sources such as the internet. To make this study more efficient and understandable, the researchers also evaluated and analyzed these studies extensively.

According to Moeliono (2020), plagiarism happens when someone takes an essay or a point of view from another person, does not refer to the original source, and then claims it as their own. To limit plagiarism, it is important to build a software tool that checks works before they are uploaded to online journal media. Most of the tools in use, including Turnitin, Plagscout, Viper, and Plagiarism Checker, are foreign-made. Students and professors must connect to these tools' networks in order to check for plagiarism, and most of the fee-based plagiarism detectors have the potential to deplete the nation's foreign exchange. The authenticity of documents can be determined using a variety of algorithms, one of which is the Winnowing algorithm.

In an earlier study, Teguh Efriyanto and Mardhiya Hayaty (2022), in an article titled "Jaro Winkler Algorithm for Measuring Similarity Online News", wrote that online news is a source of information for people, which requires journalists, as news writers, to find news information quickly and accurately every day. Journalists may plagiarize other journalists or take news information from other news media sites and publish it without citation. An algorithm is therefore required to assess the similarity of online news. The study proposed the Jaro Winkler algorithm, with the result normalised so that 0 means there is no similarity and 1 means an exact match. Data from 20 online news media sources in Central Kalimantan were used. To obtain news on the same topic, the scraping procedure used the Custom Search JSON API and keywords.

The news similarity computation with the Jaro Winkler algorithm yielded an average value of 74.49%, with 43 news items at a severe plagiarism level and 12 news items at a moderate plagiarism level. The Jaro Winkler algorithm has flaws in determining the similarity value for the given data: some items that should have been rated as high but not severe plagiarism were not detected as such, and vice versa.

In a study by Melani and Clara (2022), plagiarism is described as the act of reproducing another person's work and presenting it as one's own, which is unlawful and very common in Indonesian educational institutions. Plagiarism has a broad definition; one form is a work whose content uses phrases, sentences, or paragraphs previously produced or published by other people. The first step in minimizing plagiarism is to find the similarities between two pieces of work. Software that finds similar phrases can be used to calculate how similar two documents are, so determining word similarity within the documents is important for solving the problem. The researchers went through a number of steps, starting with preprocessing, and then used the Jaro Winkler Distance method and the Winnowing technique to calculate similarity. The computation was carried out manually with the aid of a simulator. The Winnowing Algorithm achieved an accuracy of 94.4% and the Jaro Winkler Distance Algorithm 88.8%, so of the two methods the Winnowing Algorithm performed better.

Chapter III

METHODOLOGY
The methodology used in this study to identify the better of the Jaro Winkler Distance algorithm and the Winnowing algorithm consists of data collection, text preprocessing, word similarity detection using both algorithms, and selection of the best algorithm.

ALGORITHMS DESCRIPTION

1. The Jaro-Winkler Distance algorithm considers the number of matching characters between two strings, as well as the transpositions (matching characters that appear in a different order) between them. The algorithm assigns a value between 0 and 1, with 0 denoting no similarity and 1 denoting an exact match. The algorithm works in the following steps:

• Calculate the Jaro Distance: The Jaro distance is calculated by counting the number of matching characters between the two strings, considering a specific matching window based on the length of the strings. It also counts the number of transpositions among the matched characters.

• Calculate the Jaro-Winkler Distance: The Jaro-Winkler distance is an enhancement of the Jaro distance that gives more weight to strings that share a common prefix. It adds a prefix bonus to the Jaro distance. The bonus is determined by the length of the common prefix (commonly denoted as l) and a predefined constant scaling factor (commonly denoted as p).

• Interpret the Result: The resulting Jaro-Winkler distance ranges between 0 and 1. A value of 1 indicates a perfect match, while values closer to 0 represent less similarity between the strings.

The Jaro-Winkler Distance algorithm is commonly used in applications such as record linkage (matching records from different databases), fuzzy matching, and spell checking. It is particularly useful for comparing names or strings that are likely to contain typographical errors or variations due to human input.
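The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `jaro_similarity` and `jaro_winkler` are hypothetical, the matching window is assumed to be floor(max(|s1|, |s2|) / 2) - 1 as in the usual definition, and the defaults p = 0.1 and a maximum prefix length of 4 are the conventional values.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity between two sequences (strings or lists of tokens)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Items match only if they are equal and no farther apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched items that line up in a different order, halved.
    seq1 = [s1[i] for i in range(len(s1)) if matched1[i]]
    seq2 = [s2[j] for j in range(len(s2)) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity plus a bonus for a shared prefix of up to max_prefix items."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)
```

Because the functions only index and compare items, they accept either strings (character level) or lists of tokens (word level). For the common textbook pair "MARTHA" and "MARHTA", this sketch gives a Jaro similarity of about 0.944 and a Jaro-Winkler value of about 0.961.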

2. The Winnowing algorithm is a technique used in computer science and data mining to identify near-duplicate documents or to find similar subsequences within a document. It is primarily employed for tasks such as plagiarism detection, document fingerprinting, and content similarity analysis. The algorithm is based on a sliding window approach. It starts by dividing a document into overlapping subsequences of a fixed size, often referred to as k-mers or k-shingles. Each subsequence is then hashed to produce a hash value.

The core idea of the Winnowing algorithm is to select the minimum hash value within each window as the representative of that window. This "winnowing" process keeps one hash value per window and discards the rest, which reduces the amount of data to be processed while retaining a set of values that still characterizes the document. Once the representative hash values are chosen, they can be used to compare documents or subsequences. Documents or subsequences with similar content share many representative hash values, which enables efficient comparison and identification of near-duplicates or similar content. The Winnowing algorithm is commonly employed in applications such as plagiarism detection systems, document clustering, similarity search, and malware detection. It provides an effective and scalable approach to identifying similar content within large collections of documents.
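The fingerprinting process described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names `rolling_hashes` and `winnow`, the polynomial hash with `base=257` and `mod=1_000_000_007`, and the defaults `k=5` and `w=4` are all illustrative choices, and breaking ties by taking the rightmost minimum is one common convention rather than something this document specifies.

```python
def rolling_hashes(text, k=5, base=257, mod=1_000_000_007):
    """Hash every overlapping k-gram of `text` with a polynomial rolling hash."""
    if len(text) < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the k-gram
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, len(text)):
        # Drop the oldest character, shift, and append the new one.
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes


def winnow(hashes, w=4):
    """Return the set of (hash, position) fingerprints chosen by winnowing."""
    fingerprints = set()
    if not hashes:
        return fingerprints
    # A sequence shorter than w is treated as a single window.
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, keep the rightmost minimum so neighbouring windows agree.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints
```

A document's fingerprint would then be `winnow(rolling_hashes(text, k), w)`, where `text` is assumed to be already preprocessed. Storing the position with each hash lets the same hash value selected at two different places count as two fingerprints.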

ALGORITHMS COMPARISON

A. Data Collection

In this study, the researchers used three categories of data. Each data item in a category consists of two documents, which are compared to determine their percentage of similarity. The categories are as follows:

1. Category All Words are not Equal. In this category, none of the words in the two documents are the same. This category uses 1 data item. The naming of the data used is DU-

2. Category Partial Words Same. In this category, some of the words in the documents are the same; the documents being compared have the same object of research in the journal. Whether the same words exist is checked manually. The data in this category are tested with several variations: the displacement of the sentence position, the change of the sentence pattern from active to passive, the displacement of the word position, the effect of the same initial word, and the influence of the match distance range. This category uses 11 data items.

3. Category All Words Same. In this category, both documents come from the same journal, so the words and the number of words in document 1 and document 2 are the same. The variations used in this category are the displacement of the sentence position, the displacement of the word position, the effect of the same initial word, and the influence of the match distance range. This category uses 6 data items. The naming of the data used is DU-

B. Text Preprocessing

After data collection is finished, the next step is to preprocess the data.

1. Jaro Winkler Distance Algorithm

Text preprocessing for the Jaro Winkler Distance algorithm consists of the following stages:

• Case folding converts capital letters in the documents into lowercase letters (a-z) and omits characters other than letters, such as numbers and punctuation marks.

• Tokenizing cuts the text into chunks of words known as tokens.

• Stopword removal (filtering) removes words that carry little information from the text. Stopwords are common words that usually appear in large numbers and are considered meaningless.

• Stemming finds the root word of each word by removing all affixes.

2. Winnowing Algorithm

Text preprocessing for the Winnowing algorithm consists of the following stages:

• Case folding converts capital letters in the documents into lowercase letters (a-z) and removes characters other than letters, such as numbers and punctuation marks.

• Stopword removal (filtering) removes words that carry little information from the text. Stopwords are common words that usually appear in large numbers and are considered meaningless.

• Stemming finds the root word of each word by removing all affixes.

• Whitespace insensitivity removes all spaces in the document.
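The preprocessing stages listed above could look roughly like this in Python. This is only a sketch: the `STOPWORDS` set and the `naive_stem` suffix stripper are hypothetical stand-ins (the study does not state which stopword list or stemmer it used, and a real pipeline would use a proper stemmer for the language of the documents), and the function name `preprocess` and its `for_winnowing` flag are illustrative.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, for_winnowing=False):
    text = text.lower()                                   # case folding
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop digits and punctuation
    tokens = text.split()                                 # tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    tokens = [naive_stem(t) for t in tokens]              # stemming
    if for_winnowing:
        return "".join(tokens)                            # whitespace insensitivity
    return tokens
```

With these toy settings, `preprocess("The Students are detecting plagiarism in 2023!")` returns `['student', 'are', 'detect', 'plagiarism']`, and passing `for_winnowing=True` returns the same tokens joined into one string without spaces. The text is split into words in both modes here because stopword removal and stemming operate on words.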

C. Word Similarity Detection

After preprocessing, the two documents of each data item in each category are compared using the Jaro Winkler Distance algorithm and the Winnowing algorithm to obtain the percentage similarity value.

1. Jaro Winkler Distance Algorithm

The Jaro Winkler Distance algorithm calculates the Jaro similarity value of the two preprocessed documents, then calculates the Jaro Winkler distance value, and finally converts it to a percentage. The Jaro similarity value is computed from the number of words in each of the two documents, the number of identical words in both documents, and the number of transpositions. The Jaro Winkler distance is computed from the Jaro similarity value, the prefix value (the number of identical words at the beginning of both documents before the first difference is found, with a maximum value of 4), and the constant p = 0.1. The Jaro Winkler distance is then multiplied by 100% to get the final percentage result.
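Written out, the calculation described above takes the standard Jaro-Winkler form. The symbols are assumptions introduced here for illustration: |s_1| and |s_2| are the word counts of the two documents, m is the number of identical words, t is the number of transpositions, and ℓ is the prefix value (at most 4).

```latex
\[
d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),
\qquad
d_w = d_j + \ell \, p \, (1 - d_j),
\qquad
\text{similarity} = d_w \times 100\%
\]
% d_j is taken as 0 when m = 0.
% Worked example with hypothetical values |s_1| = |s_2| = 10, m = 8, t = 1, \ell = 2, p = 0.1:
\[
d_j = \frac{1}{3}\left(0.8 + 0.8 + 0.875\right) = 0.825,
\qquad
d_w = 0.825 + 2 \times 0.1 \times (1 - 0.825) = 0.86
\]
```

In this hypothetical case the final percentage would be 86%. The values are chosen only to show the arithmetic and are not taken from the study's data.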

2. Winnowing Algorithm

The Winnowing algorithm converts the contents of the two documents into a series of n-grams, then converts each n-gram into a hash value using the rolling hash method. The hash values are then divided into windows of size w, where each window is a run of w consecutive hash values. The winnowing process takes the minimum value of each window to produce the fingerprint of the document. Finally, the Jaccard Coefficient is applied to the fingerprints of the two documents to calculate the word similarity.
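The final comparison step can be illustrated with a short Python sketch of the Jaccard Coefficient, which divides the number of fingerprints the two documents share by the number of distinct fingerprints in either. The function name `jaccard_percent` and the fingerprint values in the usage note are hypothetical, and returning 0 when both fingerprint sets are empty is an assumed convention.

```python
def jaccard_percent(fingerprints_1, fingerprints_2):
    """Jaccard coefficient of two fingerprint collections, as a percentage."""
    a, b = set(fingerprints_1), set(fingerprints_2)
    if not a and not b:
        return 0.0  # assumed convention for two empty fingerprints
    return 100.0 * len(a & b) / len(a | b)
```

For example, `jaccard_percent({17, 8, 39}, {17, 39, 50, 77})` shares 2 of 5 distinct values and so returns 40.0.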


D. Get the Best Algorithm


After the percentage of word similarity has been obtained for each data item using the Jaro Winkler Distance algorithm and the Winnowing algorithm, the next step is to find the better of the two algorithms. The steps taken are calculating accuracy and calculating time complexity.

• Accuracy is calculated using a confusion matrix. Accuracy is the number of correctly identified cases divided by the total number of cases. The data are labeled as having low, moderate, or high levels of plagiarism, and these labels are used to calculate accuracy. The identical word pairs in both documents are counted manually as the benchmark for the accuracy estimates. The percentage of identical words is determined manually with the following formula:

Formula 1. Count the same word pairs manually on two documents

• Time complexity is calculated for the Jaro Winkler Distance algorithm and the Winnowing algorithm.
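The accuracy step can be sketched in Python as follows. This is an illustration only: the function name `accuracy_from_labels` is hypothetical, and the label lists in the usage note are made-up values rather than the study's data. Each data item contributes one (benchmark label, algorithm label) pair, and accuracy is the share of pairs where the two labels agree.

```python
from collections import Counter


def accuracy_from_labels(benchmark, predicted):
    """Return the confusion counts and the accuracy (in percent) of `predicted`."""
    if len(benchmark) != len(predicted) or not benchmark:
        raise ValueError("label lists must be non-empty and of equal length")
    # Confusion matrix stored as {(benchmark_label, predicted_label): count}.
    confusion = Counter(zip(benchmark, predicted))
    correct = sum(n for (b, p), n in confusion.items() if b == p)
    return confusion, 100.0 * correct / len(benchmark)
```

For the hypothetical lists `["R", "S", "S", "B", "B", "S"]` (benchmark) and `["R", "S", "R", "B", "B", "S"]` (algorithm), five of the six labels agree, so the accuracy is about 83.3%, and the single disagreement appears in the confusion counts under the key `("S", "R")`.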

Chapter IV

RESULTS AND DISCUSSION

This chapter contains the results and the calculations of accuracy and time complexity for the Jaro Winkler Distance algorithm and the Winnowing algorithm.

A. This section shows the accuracy of word similarity detection by the Jaro Winkler Distance algorithm and the Winnowing algorithm. The MS Word Feature percentage for each data item is obtained from the calculation with Formula 1.

1. Test results and calculation of the Jaro Winkler Distance Algorithm

TABLE 1. Test results with the Jaro Winkler Distance algorithm and MS Word Feature Results

Data     MS Word Feature Results   Label   Jaro Winkler Distance Algorithm Results   Label
DU-1     0%                        R       0%                                        R
DU-2a    37.15%                    S       46.2%                                     S
DU-2b    33.51%                    S       40.8%                                     R
DU-2c    37.15%                    S       43.4%                                     S
DU-2d    37.15%                    S       45.2%                                     S
DU-2e    37.15%                    S       40.8%                                     R
DU-2f    37.15%                    S       51.6%                                     S
DU-2g    1.04%                     R       34%                                       S
DU-2h    1.04%                     R       0%                                        R
DU-2i    44.24%                    S       50.8%                                     S
DU-2j    54.86%                    S       54.2%                                     S
DU-2k    54.86%                    S       54.1%                                     S
DU-3a    100%                      B       100%                                      B
DU-3b    100%                      B       70%                                       S
DU-3c    100%                      B       94.8%                                     B
DU-3d    100%                      B       100%                                      B
DU-3e    100%                      B       98%                                       B
DU-3f    100%                      B       96%                                       B

Information:

R = Low (from the Indonesian "ringan")

S = Moderate ("sedang")

B = High ("berat")

So that the results are:

TR = 2, TS = 9, TB = 5

FRB = 0, FRS = 1, FBR = 0, FBS = 1, FSR = 0, FSB = 0

Then the accuracy result is:

accuracy = 16/18 × 100% = 88.8%

2. Test results and calculation of the Winnowing Algorithm

TABLE 2. Test results with the Winnowing algorithm and MS Word Feature Results

Data     MS Word Feature Results   Label   Winnowing Algorithm Results   Label
DU-1     0%                        R       0%                            R
DU-2a    37.15%                    S       31.16%                        S
DU-2b    33.51%                    S       31.62%                        S
DU-2c    37.15%                    S       28.57%                        R
DU-2d    37.15%                    S       30.48%                        S
DU-2e    37.15%                    S       30.82%                        S
DU-2f    37.15%                    S       30.72%                        S
DU-2g    1.04%                     R       20.71%                        R
DU-2h    1.04%                     R       21.15%                        R
DU-2i    44.24%                    S       31.21%                        S
DU-2j    54.86%                    S       48.47%                        S
DU-2k    54.86%                    S       47.97%                        S
DU-3a    100%                      B       100%                          B
DU-3b    100%                      B       97.79%                        B
DU-3c    100%                      B       92.35%                        B
DU-3d    100%                      B       100%                          B
DU-3e    100%                      B       98.9%                         B
DU-3f    100%                      B       96.69%                        B

Information:

R = Low (from the Indonesian "ringan")

S = Moderate ("sedang")

B = High ("berat")

So that the results are:

TR = 3, TS = 8, TB = 6

FRB = 0, FRS = 0, FSR = 1, FBR = 0, FBS = 0, FSB = 0

Then the accuracy result is:

accuracy = 17/18 × 100% = 94.4%

From the accuracy calculations above, the Jaro Winkler Distance Algorithm has an accuracy of 88.8% and the Winnowing Algorithm has an accuracy of 94.4%.

B. Time Complexity

The time complexity obtained for the Jaro Winkler Distance algorithm is O(n^2), and the time complexity obtained for the Winnowing algorithm is also O(n^2).

Chapter V

CONCLUSION
Based on the results of the analysis and discussion, the conclusions of this research are as follows.

1. The accuracy obtained by the Jaro Winkler Distance Algorithm is 88.8% and by the Winnowing Algorithm is 94.4%, so the better algorithm for detecting word similarity and plagiarism is the Winnowing Algorithm.

2. The Jaro Winkler Distance Algorithm and the Winnowing Algorithm have the same time complexity, namely O(n^2).

After analyzing and comparing the Jaro-Winkler Distance Algorithm and the Winnowing Algorithm in detecting plagiarism, we can conclude that both algorithms have their own strengths and weaknesses. The Jaro-Winkler Distance Algorithm is a string-matching algorithm that is widely used to determine the similarity between two strings. It is effective in detecting plagiarism in shorter texts but may not be as efficient in longer texts. However, it is relatively faster than the Winnowing Algorithm. The Winnowing Algorithm, on the other hand, is a fingerprinting algorithm that is best suited to identifying similar blocks of text in longer documents. It is highly effective in detecting plagiarism in larger texts but may not be as efficient in shorter texts, and it is slower than the Jaro-Winkler Distance Algorithm. The choice between the two algorithms therefore depends on the specific requirements of the plagiarism detection task at hand. If the task involves shorter texts, the Jaro-Winkler Distance Algorithm may be the better choice; if it involves longer texts, the Winnowing Algorithm may be more appropriate.



REFERENCES


1. Faisal, Muhammad, et al. "Plagiarism detection using manber and winnowing algorithm." International Journal of Advanced Science and Technology 29.6s (2020): 2130-2136.

2. Bu'ulolo, Inte Christinawati, Melani Isabella Siregar, and Clara Fellysa Simanjuntak. AIP Conference Proceedings. Vol. 2658. No. 1. AIP Publishing LLC, 2022.

3. Efriyanto, Teguh, and Mardhiya Hayaty. "Jaro Winkler Algorithm for Measuring Similarity Online News." Jurnal Teknik Informatika (Jutif) 3.4 (2022): 975-982.

4. Hakim, L. "Penggunaan N-Gram dan Jaro Winkler Distance pada aplikasi kelas daring untuk deteksi plagiat." e-ISSN, 2019.

5. Muhammad, A. N., S. Bukhori, and P. Pandunata. "Sentiment Analysis of Positive and Negative of YouTube Comments Using Naïve Bayes – Support Vector Machine (NBSVM) Classifier." 2019 Int. Conf. Comput. Sci. Inf. Technol. Electr. Eng., vol. 1 (2019): 199–205. Available: https://doi.org/10.1109/ICOMITEE.2019.8920923 [Accessed: 28-Jan-2021].

6. Billhaqqi, T. T. I., G. W. Wicaksono, and C. S. K. Aditya. "Comparison analysis of Rabin-Karp and Winnowing algorithms in automated essay answer assessment system." AIP Conference Proceedings. Vol. 2453. No. 1. AIP Publishing LLC, July 2022: 030018.
