
A System for Detection of Plagiarism of Ideas Based on Deep Learning Algorithm
El Mostafa Hambi *, Faouzia Benabbou *, Nadia Bouhriz *
Information Technology and Modeling Laboratory
Science Faculty Ben M’sik, Casablanca, Morocco

[email protected], [email protected],
[email protected]

Abstract. The growing availability of online documents and the ease of retrieving them have led to a serious plagiarism problem. Many plagiarism detection techniques exist, but plagiarism of ideas remains the most difficult to detect. Some methods have been proposed to perform this task, but many challenges remain. In this paper, we propose a system for detecting plagiarism of ideas based on deep learning algorithms. The proposed approach addresses several problems encountered in detecting plagiarism of ideas, such as the loss of meaning and the difficulty of detecting similarity between documents.
Our system consists of two parts: a learning part (deep learning) and a plagiarism detection part.

Keywords: Plagiarism, Deep Learning, Preprocessing, Doc2vec, Neural Network

1 Introduction
The development of information technology (IT), and especially the Internet, has considerably increased the availability of information and has consequently led to a rise in plagiarism. Plagiarism of ideas is more fundamental and usually involves paraphrasing as well as semantic and vocabulary changes, which makes it the most difficult type of plagiarism to detect. The unacknowledged use of original works is considered plagiarism and has been one of the greatest problems of scientific publication throughout history; this is why automatic identification methods for plagiarism detection have been developed as a possible countermeasure [1, 2, 3].
Plagiarism detection is considered a branch of Natural Language Processing (NLP). The aim is to find the most representative words or concepts of a text or document. Traditional NLP approaches often use a list of words to detect similarity. Such methods do not take the similarity between synonymous words into account; and if we instead transform words into concepts to obtain a semantic representation using WordNet, another problem arises: ambiguity, which can cause the meaning of the processed sentences to be lost. Nowadays, deep learning techniques constitute the best available solution to these problems. Deep Learning (DL) is an important component of computational intelligence, with machine learning research at its core. It provides efficient algorithms for dealing with large-scale data in neuroscience, computer vision, speech recognition, language processing, biomedical informatics, recommender systems, learning theory, robotics, games, and so on [6, 7].
The essential goal of deep learning here is to improve the processing and pre-processing methods of NLP in an automatic, efficient, and fast way. In text mining applications, deep learning methods represent words as vectors of numerical values. This representation captures a major part of the syntactic as well as semantic regularities of the text data. In applications such as similarity detection and text classification, larger units such as phrases, sentences, and documents should also be described as vectors. Vectorized representation of text data makes it easy to compare words and sentences and minimizes the need for lexicons [2].
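As a minimal illustration of why vectorized text is easy to compare (the vectors below are toy values, not the output of a real model), cosine similarity reduces the comparison of two documents to simple arithmetic:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two toy document vectors; in practice these would come from a model such as doc2vec.
doc_a = [0.2, 0.7, 0.1]
doc_b = [0.25, 0.65, 0.05]
similarity = cosine_similarity(doc_a, doc_b)
```

The same function works unchanged for word, sentence, or document vectors, which is precisely the appeal of the vectorized representation.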
The remainder of this paper is organized as follows. Section 2 presents a definition of plagiarism. Section 3 reviews related work. Section 4 gives an overall illustration of our approach, and Section 5 details it. The conclusion introduces the future work to be carried out.

2 Definition of Plagiarism
Plagiarism is an attempt to use another person's ideas and present them as one's own work, which is considered both illegal and immoral. Plagiarism is carried out in various ways, among which we mention the following types [1, 4]:

 Copy-paste: textual (word-for-word) copying, in which the content of the text is copied from one or more sources; the copied content may be slightly modified.
 Paraphrasing: changing the grammar, using synonyms, reorganizing the sentences of the original work, and deleting some parts of the text.
 Use of false references: adding references that are false or do not even exist.
 Plagiarism with translation: the contents are translated and used without reference to the original work.
 Plagiarism of ideas: the most difficult plagiarism to detect, because it is more evolved than the previous types; it involves not simple manipulations of the text but a more advanced form. This type of plagiarism consists in using the concepts and ideas of others with a reformulation of the sentences.

3 Related Works
Many plagiarism detection methods are available. Some of them incorporate natural language processing techniques, which are applied to process the set of documents and to analyze their structure. The similarity between two documents can be measured by families of methods that address different kinds of plagiarism [5]:

Lexical methods: These methods treat text as a sequence of characters or terms [14]. The processing techniques this approach relies on include tokenization, lowercasing, punctuation removal, and stemming [15]. The underlying assumption is that the more terms two documents have in common, the more similar they are. Methods that use features such as the longest common subsequence, n-grams, and fingerprints belong to this category. The comparison units adopted for detecting plagiarism differ from one technique to another and include words, sentences, passages, human-defined sliding windows, and n-grams. Work using these techniques includes [16], [17], [18], [19], [20], [21], [22].

These methods usually produce good results as long as words have not been replaced by their synonyms.
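A minimal sketch of the lexical idea, using character trigrams as the comparison unit and Jaccard overlap as the similarity score (both are illustrative choices on our part, not the specific configuration of any cited system):

```python
def char_ngrams(text, n=3):
    # Fingerprint a text as its set of character n-grams.
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    # Overlap of two n-gram sets: higher means more lexical similarity.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

src = "the cat sat on the mat"
susp = "the cat sat on a mat"
score = jaccard(char_ngrams(src), char_ngrams(susp))
```

Replacing "mat" with a synonym such as "rug" would sharply lower the score even though the meaning is preserved, which is exactly the weakness noted above.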

Syntactical methods: Some methods use a text's syntactical units to compare the similarity between documents, realizing the intuition that similar documents have similar syntactic structure. These methods use characteristics such as POS tags to compare documents; for example, [25] used low-level syntactic structure to show linguistic similarities, along with a similarity measure.

The disadvantage of this approach is that it cannot give a reliable result on its own: two documents can share the same syntax and still be non-plagiarized.

Semantic methods: These methods use semantic similarity to compare documents. Methods that use synonyms, antonyms, hypernyms, and hyponyms are placed in this category [4].

In this approach, different semantic features (synonyms, hyponyms, hypernyms, semantic dependencies) [5] are extracted from the source documents, and these features are then used to trace plagiarism cases in the corpus and in a fact database built from already existing documents.

The semantic approach aims to achieve high detection performance and should address the issues of polysemy and synonymy (different words referring to the same thing, such as car and automobile) that are not handled by the lexical (straightforward term-matching) approach. Lin et al. [23] explored semantic similarity using lexical databases such as Stanford WordNet to acquire synonyms. Other algorithms that can be used to extract the semantic features of sentences include Latent Dirichlet Allocation [24].
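The synonym-matching idea can be sketched as follows; the tiny hand-made synonym table is a stand-in for a real lexical database such as WordNet, and `semantic_overlap` is a hypothetical helper, not an API from any cited work:

```python
# Toy synonym table standing in for WordNet (a real system would query
# WordNet synsets instead of this hand-made dictionary).
SYNONYMS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
    "buy": {"buy", "purchase"},
    "purchase": {"buy", "purchase"},
}

def expand(word):
    # A word always matches itself plus any listed synonyms.
    return SYNONYMS.get(word, {word})

def semantic_overlap(sent_a, sent_b):
    # Fraction of words in sent_a that match a word of sent_b or one of its synonyms.
    tokens_a, tokens_b = sent_a.split(), sent_b.split()
    expanded_b = set().union(*(expand(w) for w in tokens_b))
    hits = sum(1 for w in tokens_a if w in expanded_b)
    return hits / len(tokens_a)

overlap = semantic_overlap("i buy a car", "i purchase an automobile")
```

Here the two sentences share no content word, yet the overlap is high because "buy"/"purchase" and "car"/"automobile" are linked; the ambiguity problem described below arises when a word belongs to several synonym sets with different senses.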

This approach suffers, however, from the problem of ambiguity: a word with several senses can be matched wrongly, so the meaning of the processed sentences can be lost.

None of the above methods can resolve the problem of plagiarism of ideas when the content of the document has been modified both by the substitution of synonyms and by paraphrasing.

4 Proposed approach
The figure below shows the overall workflow of the proposed system. The approach is implemented in two main phases: a deep learning phase, and a detection phase that builds on this learning to detect similarity.

The first part consists of preparing our corpus of documents, which contains the source documents and the plagiarized documents corresponding to each source document; we call it the Learning Corpus.

During the preparation stage, this corpus is used to build our learning system: each document is transformed into a list of vectors using the doc2vec principle.

The learning system itself is a supervised neural network whose input data correspond to the vectors of the source documents and whose output data correspond to the vectors of the plagiarized documents of our Learning Corpus.

After this learning phase, we move directly to the plagiarism detection phase. First, we need the document to be analyzed and a corpus of documents in which we carry out our search; we call this corpus the corpus of source documents.

The learning system is used at this level to detect whether this document is plagiarized or not.

Finally, if a type of plagiarism is detected, the plagiarized document is added to the corpus of plagiarized documents and the source document to the corpus of source documents.

[Figure 1 sketches the pipeline: Document 1 (the document to be analyzed) and Document 2 (a document from the corpus of source documents) go through pre-processing and vector representation, important sentences are detected, and the deep learning system decides whether the document is plagiarized.]

Figure 1: Global architecture
The figure above gives an overall view of our approach. Our system takes two documents as input: the first is the document to be processed and the second is a document from the corpus of source documents. These two documents go through a preprocessing phase and are then represented as lists of vectors, which subsequently become the input of our deep learning system; the system then detects whether the two documents are similar or not. We detail this step in the next section.

5 Details of our approach


In the following sub-sections, we present some details of the main methods used for detecting
plagiarism.

5-1 Document representation phase

The figure below summarizes the steps of the pre-processing phase, which is applied to each processed document: the document to be analyzed, the documents of the learning corpus, and the documents of the corpus of source documents.

[Figure 2 shows the document representation pipeline applied to the learning corpus, the suspicious document, and the source documents: sentence segmentation and tokenization, lemmatization, sentence-vector construction using doc2vec, then detection of important sentences via Term Frequency-Inverse Sentence Frequency, yielding one list of sentence vectors per document.]

Figure 2: Pre-processing phase

A- Pre-Processing & Vector Representation

The initial module is a pre-treatment of the dataset of source and suspicious documents. It includes sentence segmentation, tokenization, lemmatization, and vector construction.
Sentence segmentation and tokenization: each document is represented as a set of sentences, and each source and suspicious sentence is then tokenized.
Lemmatization: words are converted into their basic dictionary forms to make comparisons easy.
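A minimal sketch of these three steps, with a tiny hand-made lemma table standing in for a real lemmatizer (in practice a dictionary-backed tool would be used):

```python
import re

# Minimal lemma table: a stand-in for a full dictionary-based lemmatizer.
LEMMAS = {"cats": "cat", "sat": "sit", "was": "be", "ideas": "idea"}

def segment_sentences(document):
    # Naive segmentation: split on sentence-final punctuation.
    return [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]

def tokenize(sentence):
    # Lowercase and keep alphabetic tokens.
    return re.findall(r"[a-z']+", sentence.lower())

def lemmatize(tokens):
    # Map each token to its dictionary form when known.
    return [LEMMAS.get(t, t) for t in tokens]

doc = "The cats sat on the mat. New ideas emerge."
sentences = [lemmatize(tokenize(s)) for s in segment_sentences(doc)]
```

Each document thus becomes a list of lemmatized sentences, ready for the vector construction step described next.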
Constructing sentence vectors using doc2vec: the word2vec model proved effective and useful for grouping and finding similar words in a huge corpus, which raised a natural question: is a higher level of representation possible, for sentences, paragraphs, or even documents? To achieve this, we chose to work with the DM (distributed memory) model: the paragraph is treated as an additional word, and its vector is concatenated or averaged with the local context word vectors during prediction.

Figure 3: doc2vec principle

In other words, we treat each document as an additional word: the document ID / paragraph ID is represented as a single vector, and documents are thereby embedded in the same continuous vector space.
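To make the DM idea concrete, here is a minimal toy implementation in NumPy (assuming NumPy is available): one paragraph vector per document is trained by averaging it with the left-context word vectors to predict the next word. A real system would use a library implementation such as gensim's Doc2Vec rather than this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

dim, lr = 8, 0.1
doc_vecs = rng.normal(scale=0.1, size=(len(docs), dim))    # one vector per document
word_vecs = rng.normal(scale=0.1, size=(len(vocab), dim))  # input word embeddings
out_w = rng.normal(scale=0.1, size=(len(vocab), dim))      # softmax output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# PV-DM with averaging: the paragraph vector acts as an extra context "word"
# and is averaged with the context word vectors to predict the target word.
for _ in range(200):
    for d, words in enumerate(docs):
        for t in range(1, len(words)):          # predict each word from its left context
            ctx = [w2i[w] for w in words[:t]]
            h = (doc_vecs[d] + word_vecs[ctx].sum(axis=0)) / (1 + len(ctx))
            p = softmax(out_w @ h)
            grad = p.copy()
            grad[w2i[words[t]]] -= 1.0          # d(cross-entropy)/d(logits)
            dh = out_w.T @ grad
            out_w -= lr * np.outer(grad, h)
            doc_vecs[d] -= lr * dh / (1 + len(ctx))
            word_vecs[ctx] -= lr * dh / (1 + len(ctx))

# After training, each row of doc_vecs is that document's embedding.
```

The key property is visible in the update rule: the document vector participates in every prediction made inside that document, so it accumulates a representation of the document as a whole.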

B- Detect important sentences

Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect the importance of a sentence for a document in a collection or corpus. The weight is obtained by multiplying two measures:

tfidf_{i,j} = tf_{i,j} · idf_i    (1)

tf_{1,1} = n_{1,1} / Σ_k n_{k,1}    (2)

where n_{1,1} is the number of sentence vectors in the document that are closest to the processed vector, and Σ_k n_{k,1} is the total number of sentences in the document.

idf_1 = |D| / |{d_j : t_1 ∈ d_j}|    (3)

where |D| is the number of documents and |{d_j : t_1 ∈ d_j}| is the number of documents that contain vectors closest to the processed vector.
For each document we compute the weight of each of its sentences and then choose the most important sentences. Once the documents are pre-processed and represented by their most important sentence vectors, we move directly to the construction of our learning system, using the representative vectors of each document of the source corpus and of the corpus containing the corresponding plagiarized documents.
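Equations (1)-(3) can be sketched directly in code. The "closeness" test between sentence vectors is implemented here with a cosine threshold, which is an assumption on our part, since the paper does not fix a particular measure:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sentence vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def sentence_weight(vec, doc, corpus, threshold=0.9):
    # tf, eq. (2): fraction of sentences in this document whose vectors are close to vec.
    tf = sum(1 for s in doc if cosine(vec, s) >= threshold) / len(doc)
    # idf, eq. (3): number of documents over documents containing a close vector.
    containing = sum(1 for d in corpus if any(cosine(vec, s) >= threshold for s in d))
    idf = len(corpus) / max(containing, 1)
    return tf * idf            # eq. (1)

doc1 = [[1.0, 0.0], [0.0, 1.0]]
doc2 = [[1.0, 0.0]]
weight = sentence_weight([1.0, 0.0], doc1, [doc1, doc2])
```

Ranking a document's sentences by this weight and keeping the top ones gives the "most important sentences" used as the document's representation.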

5-2 Deep Learning phase

Deep Learning is a machine learning method that teaches computers to do what comes naturally to humans: a computer model learns to perform classification tasks directly from images, text, or audio. Deep learning models can achieve an exceptional level of accuracy, sometimes exceeding human performance; models are trained on large sets of labeled data using neural network architectures that contain many layers.
We are inspired by the principle used in word2vec, where the vector representation of a word is generated from the word's neighbors.
More precisely, we make our system learn the types of plagiarism present in the training data: we build a supervised neural network whose input contains the source documents and whose output contains the plagiarized documents.
To do this, a weight matrix W is first initialized to small random values; W is then adjusted automatically during the iterations of the neural network's learning phase, and this learned W serves to detect the similarity between documents.
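A minimal sketch of this learning step, assuming NumPy is available; the single linear map below is a deliberately simplified stand-in for the paper's multi-layer network, but it shows the same mechanism: W starts at small random values and is adjusted iteratively to map source-document vectors onto the vectors of their plagiarized versions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training pairs: vector of a source document -> vector of its plagiarized version.
X = rng.normal(size=(20, 4))                        # source-document vectors
true_map = rng.normal(size=(4, 4))                  # hidden ground-truth transformation
Y = X @ true_map + 0.01 * rng.normal(size=(20, 4))  # plagiarized-document vectors

W = rng.normal(scale=0.1, size=(4, 4))  # initialized to small random values
lr = 0.05
for _ in range(500):
    pred = X @ W
    grad = X.T @ (pred - Y) / len(X)    # gradient of mean squared error
    W -= lr * grad                      # iterative adjustment of W

mse = float(((X @ W - Y) ** 2).mean())
```

After training, applying W to a new source vector predicts what its plagiarized counterpart would look like; how far a suspicious document's vector falls from that prediction drives the detection phase.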
Here is a figure that illustrates our learning system:

Figure 4: Our learning system

5-3 Detection of plagiarism

In the plagiarism detection phase, the documents must first go through the pre-processing and sentence-vector construction phases described in the previous sections; we then use the matrix W built by our learning system above. If a similarity is detected, we refresh the learning system by adding the two documents and building a new learning matrix W.
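The detection step can then be sketched as a threshold on the similarity between sentence vectors; the 0.8 threshold and the pairwise comparison strategy below are illustrative assumptions, not values fixed by the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sentence vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def is_plagiarized(suspicious_vecs, source_vecs, threshold=0.8):
    # Flag the pair if any suspicious sentence vector is close to any source one.
    for sv in suspicious_vecs:
        for tv in source_vecs:
            if cosine(sv, tv) >= threshold:
                return True
    return False
```

In the full system the source-side vectors would first be passed through the learned matrix W, so that a paraphrased sentence is compared against the model's prediction of its plagiarized form rather than against the raw source vector.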

6 Conclusion
In this paper, we propose a deep representation of words for the plagiarism detection task. Sentence-by-sentence comparison is used to find text similarities through the construction of a deep learning system, which allows us to record each kind of plagiarism detected. The advantages of this method include its simplicity and its fast sentence comparison. Future work consists of putting this method into practice and comparing it with the other related methods.

References:
1. Tuomo Kakkonen, Maxim Mozgovoy. Hermetic and Web Plagiarism Detection Systems for Student Essays: An Evaluation of the State of the Art. Journal of Educational Computing Research, vol. 42, no. 2, pp. 135-159, 2010. University of Joensuu, Finland; University of Aizu, Japan [online].
2. Bela Gipp. State-of-the-art in detecting academic plagiarism. International Journal for Educational Integrity. University of California, Berkeley and University of Magdeburg, Department of Computer Science.
3. Maurer, H. and Zaka, B., 2007. Plagiarism–a problem and how to fight it. Proceeding of Ed-Media 2007, 4451-
4458.
4. Ahmed Jabr Ahmed Muftah. Document Plagiarism Detection Algorithm Using Semantic Networks. A project
report submitted in partial fulfillment of the requirements for the award of the degree of Master of Science
(Computer Science). Faculty of Computer Science and Information Systems University Technology Malaysia
(2009).
5. Erfaneh Gharavi, Kayvan Bijari, and Kiarash Zahirnia. A Deep Learning Approach to Persian Plagiarism Detection. Journal of Machine Learning Research (2011).
6. Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with
multitask learning. In Proceedings of the 25th international conference on Machine learning ACM, 160-167 (2008).
7. Chong, M.Y.M.,. A study on plagiarism detection and plagiarism direction identification using natural language
processing techniques (2013).
8. Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger. From Word Embeddings to Document Distances. Washington University in St. Louis, 1 Brookings Dr., St. Louis, MO 63130, 2016.
9. Aristomenis Thanopoulos, Nikos Fakotakis, and George Kokkinakis. Tokenization for Knowledge-free Automatic Extraction of Lexical Similarities. TALN 2003, Batz-sur-Mer, 11-14 June 2003. Electrical and Computer Engineering Department, University of Patras, 26500 Rion, Greece (2003).
10. Hendrik Heuer. Text comparison using word vector representations and dimensionality reduction. Proceeding of
8th Python in Science Conference - Austin, Texas (August 18 - 23, 2009).
11. Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, Christopher D. Manning. Parsing Natural Scenes and Natural
Language with Recursive Neural Networks. Computer Science Department, Stanford University, Stanford, CA
94305, USA, 2015.
12. Mikolov, T., Chen, K., Corrado, G., and Dean, J., 2013. Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
13. Torres, S. and Gelbukh, A., Comparing similarity measures for original WSD lesk algorithm. Research in
Computing Science 43, 155-166 (2009).
14. S. M. Alzahrani, N. Salim, and A. Abraham, “Understanding plagiarism linguistic patterns, textual features, and
detection methods,” Trans. Sys.Man Cyber Part C, vol. 42, no. 2, pp. 133–149, Mar. 2012. [Online]. Available:
http://dx.doi.org/10.1109/TSMCC.2011.2134847
15. M. Chong and L. Specia, “Lexical generalisation for word-level matching in plagiarism detection,” in RANLP,
2011, pp. 704–709.
16. S. Brin, J. Davis, and H. Garcia-Molina, “Copy detection mechanisms for digital documents,” in SIGMOD
Conference, 1995, pp. 398–409.
17. D. R. White and M. Joy, “Sentence-based natural language plagiarism detection,” ACM Journal of Educational
Resources in Computing, vol. 4, no. 4, pp. 1–20, 2004.
18. S. Niezgoda and T. P. Way, “Snitch: a software tool for detecting cut and paste plagiarism,” in SIGCSE, 2006, pp.
51–55.
19. A. Barr´ on-Cedeno and P. Rosso, “On automatic plagiarism detection based on n-grams comparison,” in ECIR,
2009, pp. 696–700.
20. M. S. Pera and Y.-K. Ng, “A naıve bayes classifier for web document summaries created by using word similarity
and significant factors,” International Journal on Artificial Intelligence Tools, vol. 19, no. 4, pp. 465–486, 2010.
21. E. Stamatatos, “Plagiarism detection using stopword n-grams,” JASIST, vol. 62, no. 12, pp. 2512–2527, 2011.
22. J. Grman and R. Ravas, “Improved implementation for finding text similarities in large sets of data - notebook for
pan at clef 2011,” in CLEF (Notebook Papers/Labs/Workshop), 2011.
23. H.-H. Chen, M.-S. Lin, and Y.-C. Wei, “Novel association measures using web search with double checking,” in ACL, 2006.
24. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
25. Uzuner, O., Katz, B., and Nahnsen, T.: Using Syntactic Information to Identify Plagiarism. In: 2nd Workshop on Building Educational Applications Using NLP (2005).
26. G. Tsatsaronis, I. Varlamis, and M. Vazirgiannis, “Text relatedness based on a word thesaurus,” J. Artif. Intell. Res. (JAIR), vol. 37, pp. 1-39, 2010.
