Source Code Plagiarism Detection using Machine Learning
Daniël Heres
Utrecht University
August 2017
Contents

1 Introduction
  1.1 Formal Description
  1.2 Thesis Overview
2 Literature Overview
  2.1 Tool Comparisons
  2.2 Tools and Methods
    2.2.1 Moss
    2.2.2 JPlag
    2.2.3 Sherlock
    2.2.4 Plaggie
    2.2.5 SIM
    2.2.6 Marble
    2.2.7 GPlag
    2.2.8 Evolving Similarity Functions
    2.2.9 Plague Doctor, Feature-based Neural Network
    2.2.10 Callgraph Matching
    2.2.11 Holmes
    2.2.12 DECKARD Code Clone Detection
3 Research Questions
  3.1 Plagiarism Detection Tool goals
  3.2 Research Approach
    3.2.1 Dataset
    3.2.2 Modeling and Training
    3.2.3 Evaluation
4 Infinitemonkey Text Retrieval Baseline
5 A Source Code Similarity Model
  5.1 Machine Learning Techniques
    5.1.1 Word and Character Embeddings
    5.1.2 Gradient Descent
    5.1.3 Neural Network Layers
    5.1.4 Early Stopping
    5.1.5 Loss Function
    5.1.6 Batch Normalization
  5.2 Github Training Dataset
    5.2.1 Random Code Obfuscation
  5.3 Neural Network for Source Code Similarity
    5.3.1 Other Experiments
6 Visualization of Source Code Similarity
7 Evaluation of Results
8 Discussion
  8.1 Limitations and Future Research
9 Conclusion
Bibliography
Abstract
In this thesis we study how machine learning techniques can be applied to improve
source code plagiarism detection. We present a system, InfiniteMonkey, that
can identify suspicious similarities between source code documents using two
methods. For fast retrieval of source code similarities, we use a system based on
n-gram features, tf-idf weighting and cosine similarity. The second method applies
more complex neural network models, trained on a large synthetic source code
plagiarism dataset, to classify source code plagiarism. This dataset is created
using an automatic refactoring system we developed for learning this task. The
methods are evaluated and compared to other tools on a number of different
datasets. We show that the traditional approach compares favorably with other
approaches, while the deep model trained on synthetic data does not generalize
well to the evaluation tasks. We also present a simple technique for visualizing
source code similarities.
Chapter 1
Introduction
Learning programming skills often requires a lot of practice, time and effort.
For some programmers it becomes tempting to cheat and submit the work of
another person. This may occur when they feel they cannot solve a task before
the deadline, or do not want to invest the time required to complete the task.
Students can also work together "too much" and partially re-use each other's
work when this is forbidden. Plagiarism is generally considered unacceptable, and
universities and other organizations have rules in place to deal with it.
Detecting cases of plagiarism in large groups quickly becomes infeasible: the
number of unique pairs that could contain plagiarism grows quadratically with
the number of submissions n:
number-of-pairs(n) = n × (n − 1) / 2

The number of pairs to compare grows very quickly, as is visible in Figure 1.1.
Without support from plagiarism detection tools, identifying plagiarism takes
too much work, especially in big classes, in MOOC environments and in courses
that span multiple years.
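For concreteness, a small sketch of this growth:

def number_of_pairs(n):
    # Unique unordered pairs among n submissions.
    return n * (n - 1) // 2

for n in (10, 50, 100, 500):
    print(n, number_of_pairs(n))  # 45, 1225, 4950 and 124750 pairs respectively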
Source code plagiarism detection tools work as follows: they take in a collection
of source code documents and sort all unique pairs according to their measured
similarity. Tools can also give an explanation of this ordering by visualizing the
similarities between the two programs. An abstract view of this task can be seen
in Figure 1.2.
In an earlier project we performed a comparison of a number of existing
source code plagiarism detection tools. While these tools often work quite well,
it became clear that many of them fail to incorporate domain knowledge and/or
remove useful information from the source code, which results in a bad ranking.
For a given programming task and programming language, some similarities are
to be expected (e.g. in import statements and language keywords), while other
similarities should be a reason to report a high level of suspicion. The frequency
of those patterns could be used to improve the results of a plagiarism detection
system.
Figure 1.1: Relation between number of submissions and number of pairs to check for plagiarism.
Figure 1.3: Moss document comparison view
Besides differences in detection performance, some tools were also more helpful
than others: Moss, for example, shows for each document pair exactly where the
two documents are most similar. This can make visual identification of plagiarism
a lot easier, as can be seen in the Moss file comparison in Figure 1.3. The same
information that could be used to improve the quality of similarity detection can
also be used to improve the visualization of source code similarity: we could not
only show that certain parts are similar, but also which features contribute the most.
1.1 Formal Description

G_D = { (d1, d2, s) | d1 ∈ D, d2 ∈ D, s ∈ {0, 1} }

The ground truth G_D is a (human-)annotated dataset of document pairs from a
collection D, each considered either positive or negative; it contains the known
cases of plagiarism between documents. We measure the performance of a
similarity ranking on this set using a performance metric P that computes a score
from the sequence of ordered similarities and the ground truth G_D. In this thesis
we use the average precision as the metric for comparing the quality of the results:
it averages the precision values obtained at each correctly predicted example.
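To make the metric concrete, a minimal sketch, assuming the candidate pairs have already been sorted by decreasing similarity and labelled 1 for an annotated plagiarism pair and 0 otherwise (scikit-learn's average_precision_score provides an equivalent implementation):

def average_precision(ranked_labels):
    # ranked_labels: 1 for a true plagiarism pair, 0 otherwise,
    # ordered by decreasing similarity score.
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 1, 0]))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806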
Chapter 2
Literature Overview
In this chapter we give an overview of the literature on research and tools for
plagiarism and source code similarity detection.
2.2 Tools and Methods
2.2.1 Moss
Moss [5] is a tool developed to detect source code similarity for the purpose of
detecting software plagiarism. It is available as a web service: documents can
be uploaded using a script and the results are visible through a web interface
after processing. Moss uses character-level n-grams (contiguous subsequences of
length n) as features to compare documents. Instead of comparing all n-grams,
only some of the features are compared for reasons of efficiency. A commonly
used technique for selecting textual features is to calculate a hash value for each
feature and to keep only those features whose hash is 0 mod p, for a fixed p.
The authors observe that this technique often leaves gaps in the documents,
increasing the probability of missing matches between documents. To prevent
this, they use an algorithm they call winnowing: instead of randomly selecting
n-grams from the document, they select at least one feature from every window.
Furthermore, they use a large value for n to avoid noisy matches, and they remove
white space characters to avoid matching on white space.
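The core idea of winnowing can be sketched as follows (a simplified illustration, not Moss's actual implementation: the k-gram length, window size and hash function are arbitrary choices here, and Moss additionally records fingerprint positions in order to report matched regions):

def kgram_hashes(text, k=5):
    # Hash every character k-gram of the document, after removing white space.
    text = "".join(text.split())
    return [hash(text[i:i + k]) for i in range(len(text) - k + 1)]

def winnow(hashes, window=4):
    # Keep the minimum hash of every window of consecutive k-gram hashes;
    # this guarantees at least one fingerprint per window, avoiding large gaps.
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

def fingerprint_overlap(doc_a, doc_b):
    # A rough similarity score: the Jaccard overlap of the two fingerprint sets.
    fa, fb = winnow(kgram_hashes(doc_a)), winnow(kgram_hashes(doc_b))
    return len(fa & fb) / max(1, len(fa | fb))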
2.2.2 JPlag
JPlag [6] is a tool that orders a set of programs by their pairwise similarity.
The authors argue that comparing programs based on a feature vector alone
discards too much of the structural information. Instead, they match on what
they call structural features: rather than using the text directly, they first convert
the Java source code to a list of tokens such as BEGINCLASS, ENDCLASS,
BEGINMETHOD and ENDMETHOD. They then use an algorithm that finds
matches between documents on these token lists, from the largest to the smallest
match. A parameter sets the minimum match length; without it, small spurious
matches would occur too often. They apply a few runtime optimizations, based
on the Karp-Rabin algorithm [7], to the basic comparison algorithm with worst-case
O(n³) time complexity (where n is the size of the documents). They compare
different cut-off criteria, i.e. different methods to derive a threshold from the
similarity values on a dataset, and they measure the influence of the minimum
match length and of the chosen token set on several datasets. They also show
some possible attacks against JPlag.
2.2.3 Sherlock
Sherlock [8] is a simple C-program to sort text document pairs like source code
according to their similarity. The program first generates signatures. While
generating the signature, it drops whitespace characters and drops a fraction of
the other characters from the text file in a somewhat random fashion. Finally
all the signatures are compared against each other.
2.2.4 Plaggie
Plaggie [9] is another tool that supports checking for similarities between Java
source code documents. It works similarly to JPlag. At the time of publication
the main differences were that Plaggie was open source and could be run locally;
currently, JPlag is also open source and can be run locally.
2.2.5 SIM
SIM [10] [11] is a software and text plagiarism detection tool written in C.
It works by tokenizing the files first and searching for the longest common
subsequence in the file pairs.
2.2.6 Marble
Marble [12] is a tool developed with simplicity in mind. It consists of three
phases: normalization, sorting and detection. The normalization phase converts
source code from raw text to a more abstract program: keywords and common
type names such as class, extends and String are kept, identifiers are converted
to X and numeric literals to N. Operators and symbols are left in place, and import
declarations are discarded. After normalization, classes and class members are
sorted lexicographically, which makes the tool insensitive to reordering of these
program constructs. Finally, the simplified programs are compared with the Unix
line-based diffing tool diff, using the number of changed lines normalized by the
total length of the two files.
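As an illustration of this kind of normalization, a rough Python sketch (hypothetical and greatly simplified: the keyword list is incomplete, the sorting phase is omitted, and Marble itself uses the Unix diff tool rather than difflib):

import re
from difflib import SequenceMatcher

KEYWORDS = {"class", "extends", "public", "private", "protected", "static",
            "void", "int", "String", "new", "return", "if", "else", "for", "while"}

def normalize(line):
    # Import declarations are discarded entirely.
    if line.strip().startswith("import"):
        return None
    # Replace non-keyword identifiers by X, then numeric literals by N.
    line = re.sub(r"\b[A-Za-z_]\w*\b",
                  lambda m: m.group(0) if m.group(0) in KEYWORDS else "X", line)
    return re.sub(r"\b\d+\b", "N", line)

def similarity(file_a, file_b):
    a = [l for l in (normalize(l) for l in open(file_a)) if l is not None]
    b = [l for l in (normalize(l) for l in open(file_b)) if l is not None]
    # Fraction of matching lines, analogous to normalising diff output by file length.
    return SequenceMatcher(None, a, b).ratio()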
2.2.7 GPlag
GPlag [13] uses program dependence graph (PDG) analysis to measure the
similarity between two programs. The idea is that the dependencies between
program parts often remain the same, even after code has been refactored. After
the programs are converted to a PDG, all subgraphs larger than some "trivial"
size are tested for graph isomorphism, relaxed with a relaxation parameter
γ ∈ (0, 1]: a graph G is said to be γ-isomorphic to G′ if a subgraph S ⊆ G is
subgraph-isomorphic to G′ and |S| ≥ γ|G′|. To avoid testing every pair of
subgraphs, pairs of graphs whose vertex histograms are not similar are rejected.
2.2.8 Evolving Similarity Functions
This approach evolves similarity functions: an evolutionary search creates new
candidate functions based on a pool of the best current functions. The possible
functions are restricted by a simple grammar supporting, for example, addition
and multiplication, and some terminals such as within-document term frequency,
within-query term frequency and document length. A penalty is added for the
size of the similarity functions, and three different fitness functions are optimized.
The resulting functions perform worse than the other similarity functions they are
compared against, which the authors suggest may be due to the functions
returning negative similarity scores as well as overfitting to the training set.
2.2.11 Holmes
Holmes [17] is a plagiarism detection tool for Haskell programs. The tool combines
a number of different techniques to compare two programs: a fingerprinting
technique similar to Moss, a token stream comparison, and different ways of
comparing call-graph degree signatures, among them the Levenshtein edit distance
between the degree sequences of two vertices and the Levenshtein distance
combined with vertex position. Local functions are not considered for the call
graph. Holmes also performs a reachability analysis and, to some extent, removes
code that is not reachable from the entry point. Comments and most identifiers
are removed entirely in the transformation. Template code can be added by a
teacher to prevent matching on code that students are permitted to use.
Chapter 3
Research Questions
• Can we achieve better results than existing tools on plagiarism detection
and source code similarity tasks using machine learning techniques?
• How can we visualize the similarities between two programs, and how does
this compare to e.g. the Unix diff tool and Moss [5]?
Dataset      Origin   Annotated as similar   Annotated as dissimilar   Total nr. files
a1           SOCO     54                     n/a                       3241
a2           SOCO     47                     n/a                       3093
b1           SOCO     73                     n/a                       3268
b2           SOCO     35                     n/a                       2266
c2           SOCO     14                     n/a                       88
mandelbrot   UU       111                    85                        1434
prettyprint  UU       9                      135                       290
reversi      UU       117                    83                        1921
quicksort    UU       81                     94                        1353

Table 3.1: Characteristics of the evaluation datasets
3.2.1 Dataset
For building, validating and testing our models we need a number of datasets
with an annotated ground truth to optimize and evaluate our models. During a
previous experiment we used data from the SOCO [19] challenge and manually
annotated a few datasets from UU courses. During this thesis we extended this
set with one extra dataset. The characteristics (name, origin of the data, number
annotated as similar, number annotated as dissimilar, total number of files) of
these datasets are listed in Table 3.1. In both the SOCO and the UU sets,
documents can occur in multiple pairs, because multiple people may re-use code
from another document. In the SOCO datasets, only the number of documents
annotated as similar is given. The total number of documents annotated as
similar or dissimilar may be higher than listed in the table, as in all sets only a
subset of the document pairs has been evaluated.
3.2.2 Modeling and Training
In this thesis, we experiment with and evaluate different kinds of feature
extraction methods and machine learning models. We consider the following
approaches:

• an information retrieval baseline based on n-gram features, tf-idf weighting
and cosine similarity;
• supervised neural network models trained on a large synthetic source code
plagiarism dataset.

We try these approaches as they seem the most promising for our task. Other
directions could be unsupervised learning (which we experimented with) and
semi-supervised learning. For the models we use the machine learning library
Keras [20] with the computation graph backends TensorFlow [21] or Theano [22],
in combination with scikit-learn [23], for creating models, text processing,
numerical computation, model evaluation and hyperparameter optimization.
3.2.3 Evaluation
We evaluate our models by reporting metrics on a collection of human-annotated
validation sets, i.e. the SOCO [19] dataset and data from Computer Science
courses at Utrecht University. These metrics and results are compared against
those of other source code similarity tools that are both publicly available and
support comparing Java source code.
Chapter 4
Infinitemonkey Text Retrieval Baseline
In this chapter we describe the methods we used to build a baseline for
InfiniteMonkey. We show how we compute a representation of documents using
n-grams, tf-idf weighting and cosine similarity, and how we find a reasonable
setting for our hyperparameters using grid search.
4.1.1 N-grams
An n-gram is a contiguous subsequence of length n of sequential data, often text.
For example, the Haskell program main = putStrLn "hello world" >>
putStrLn "exit", split on non-alphanumeric characters (which are kept as
tokens, except spaces), contains the 1-grams shown in Table 4.1. 1-grams are also
called unigrams or "bag-of-words". The downside of the bag-of-words model is
that it discards all information about word order: all possible permutations of a
sequence result in the same vector. To overcome this issue of local ordering,
n-grams split the text into all subsequences of length n: bigrams (n = 2), trigrams
(n = 3), etc. For an example of bigrams (n = 2) see Table 4.2. All possible
tokens not found in this document, but that occur in other documents, have a
value of 0.
1-gram     Count
main       1
=          1
putStrLn   2
"          4
hello      1
world      1
>>         1
exit       1

Table 4.1: 1-gram counts for the example program

2-gram          Count
main =          1
= putStrLn      1
putStrLn "      2
" hello         1
hello world     1
world "         1
" >>            1
>> putStrLn     1
" exit          1
exit "          1

Table 4.2: 2-gram (bigram) counts for the example program

Instead of splitting the text into words, n-grams can also be computed at the
character level. For example, the text range(2) contains the character-level
3-grams {ran, ang, nge, ge(, e(2, (2)}. The representation of this can be seen
in Table 4.3.
To determine the full mapping from n-grams to vector indices, a dictionary of
all the n-grams in the text has to be computed. A common way to avoid building
such a dictionary is the so-called "hashing trick": instead of building up a
dictionary, a hash function is used to map each n-gram to an index of the feature
vector. The size of the vector can be tuned to keep the expected number of
collisions low, while also not being unnecessarily big. The values for n, and which
choice or combination of feature representations to use, can be optimized using
hyperparameter optimization.
3-gram   Count
ran      1
ang      1
nge      1
ge(      1
e(2      1
(2)      1

Table 4.3: Character-level 3-gram counts for the text range(2)
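The hashing trick described above can be sketched as follows (illustrative only: the vector size and hash function are arbitrary, and scikit-learn's HashingVectorizer implements the same idea):

import zlib

def char_ngrams(text, n=3):
    # All contiguous character subsequences of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hashed_count_vector(text, n=3, dim=2**16):
    # Map every n-gram to an index of a fixed-size vector via a hash function,
    # so no dictionary of all n-grams has to be built beforehand.
    vec = [0] * dim
    for gram in char_ngrams(text, n):
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1
    return vec

vector = hashed_count_vector("range(2)")  # six non-zero entries, one per 3-gram (barring collisions)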
The term frequency tf can simply be the raw count f_{t,d}, but it can also be
scaled logarithmically, using tf(t, d) = 1 + log f_{t,d}, or be binary (1 if the term
is present in the document and 0 otherwise).
4.1.3 Cosine-similarity
Cosine similarity is used to compute a scalar value in the interval [0, 1]:
cosine-similarity(d1, d2) = (d1 · d2) / (∥d1∥ ∥d2∥)    (4.2)
Because we normalize the vectors before computing the distance matrix, we only
need to calculate the dot product between the two vectors. This operation is
very cheap to compute: for example, the pairwise distance matrix of more than
3000 sparse feature vectors (which results in a matrix of size 3000 × 3000) can be
computed in about 10 seconds on a modern CPU, depending on the number of
features.
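A minimal sketch of this pipeline with scikit-learn (the file paths are hypothetical; the parameter values follow Table 4.5, but any setting from the grid in Table 4.4 could be substituted):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical paths to the submitted files.
submission_paths = ["sub01/Main.java", "sub02/Main.java", "sub03/Main.java"]
documents = [open(path).read() for path in submission_paths]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(5, 5),
                             sublinear_tf=True, lowercase=False, min_df=2)
X = vectorizer.fit_transform(documents)  # rows are L2-normalised tf-idf vectors

# Because the rows are normalised, this is simply a matrix of dot products.
similarities = cosine_similarity(X)
print(similarities[0, 1])  # similarity of the first and second submission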
Hyperparameter   Values
ngram_range      {(3, 3), (3, 4), (3, 5), (3, 6), (4, 4) ... (6, 6)}
binary           {0, 1}
smooth_idf       {0, 1}
sublinear_tf     {0, 1}
lowercase        {0, 1}
max_df           {0.5, 0.75, 1.0}
min_df           {1, 2, 3}
analyzer         {"char", "char_wb", "word"}
min_threshold    {400.0, 600.0}

Hyperparameter   Value
ngram_range      (5, 5)
binary           0
smooth_idf       0
sublinear_tf     1
lowercase        0
max_df           1.0
min_df           2
analyzer         "char"
min_threshold    400.0

Table 4.5: Best performing hyperparameters found using grid search on the mandelbrot dataset

Hyperparameter   Value
ngram_range      (6, 6)
binary           0
smooth_idf       1
sublinear_tf     1
lowercase        1
max_df           1.0
min_df           1
analyzer         "char_wb"
min_threshold    400.0

Table 4.6: Best performing hyperparameters found using grid search on the b2 dataset
on this task, while the highest-scoring configuration with the "char_wb" analyzer
reaches an average precision of 0.9353.
With our hyperparameter set, the grid search covers a total of 8064 combinations.
Performing the grid search on a single dataset takes more than 24 hours for both
the mandelbrot and the b2 dataset.
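A compact sketch of such a grid search (illustrative: only a subset of the grid in Table 4.4 is used, and documents and true_pairs, the list of submissions and the set of annotated plagiarism pairs (i, j), are assumed to exist):

from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "analyzer": ["char", "char_wb", "word"],
    "ngram_range": [(3, 3), (4, 4), (5, 5), (6, 6)],
    "sublinear_tf": [True, False],
    "lowercase": [True, False],
})

def average_precision(ranked_pairs, true_pairs):
    hits, precisions = 0, []
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in true_pairs:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(1, len(precisions))

best_score, best_params = -1.0, None
for params in grid:
    X = TfidfVectorizer(**params).fit_transform(documents)
    sims = cosine_similarity(X)
    ranked = sorted(combinations(range(len(documents)), 2),
                    key=lambda pair: sims[pair], reverse=True)
    score = average_precision(ranked, true_pairs)
    if score > best_score:
        best_score, best_params = score, params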
Chapter 5
A Source Code Similarity Model
In this chapter we show how we learn a similarity model using a neural network
and a dataset built from open source code retrieved from Github.
5.1 Machine Learning Techniques
5.1.1 Word and Character Embeddings
Representing each word or character directly as a one-hot vector is very
inefficient compared to using a lookup table (an embedding), as the one-hot vector
grows with the size of the vocabulary.

5.1.3 Neural Network Layers
A densely connected layer computes

y = b + W^⊺ x    (5.1)

where b is a bias vector, W a weight matrix of dimension n × m that is multiplied
with the input vector x of dimension n, resulting in an m-dimensional output
vector y.
This can be used for example for a logistic regression model by combining
it with the logistic sigmoid function σ:
σ(z) = 1 / (1 + e^(−z))    (5.2)

When the output of the densely connected layer is a scalar value, it can be
combined with the sigmoid function to build a logistic regression model.
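As a sketch, this combination of a densely connected layer and a sigmoid, i.e. logistic regression, can be written in Keras as follows (the input dimensionality is illustrative):

from keras.models import Sequential
from keras.layers import Dense

n_features = 1000  # illustrative input dimensionality

# A single densely connected unit with a sigmoid activation is logistic regression.
model = Sequential([Dense(1, activation="sigmoid", input_dim=n_features)])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])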
5.1.4 Early Stopping
Overfitting occurs when there is a (big) difference between the error of a model
on the training set and its error on a validation set. It can be reduced by using a
bigger training set and by regularization methods.
Early stopping is a simple technique for preventing overfitting by stopping the
optimization procedure when the validation loss no longer improves. We use a
variant of early stopping that only writes a model to disk when it has a lower
error on a hold-out set than the model that was already saved.
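In Keras this behaviour corresponds roughly to the EarlyStopping and ModelCheckpoint callbacks (a sketch: model, x_train, y_train, x_val and y_val are assumed to exist, and the file name and patience value are arbitrary):

from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop once the validation loss has not improved for a few epochs.
    EarlyStopping(monitor="val_loss", patience=3),
    # Only overwrite the saved model when the validation loss improves.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=50, callbacks=callbacks)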
5.1.5 Loss Function
We minimize the binary cross-entropy loss:

L(W) = − Σ_{i=1}^{n} [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ]    (5.4)

where y_i ∈ {0, 1} is the binary class label (in our case similar or non-similar),
n is the number of samples in our training set, and p(x_i) denotes the conditional
probability p(y_i = 1 | x_i) predicted by our model. We optimize the difference
between the true labels and these predicted probabilities; minimizing this loss
means we better capture the underlying probability distribution.
Repositories that are tagged as Java by Github are retrieved, and the files ending
in the .java extension are copied to another directory. In addition to this
automatically retrieved source code set, we use source code from the RosettaCode
project; the original data is available at the following URL:
https://github.com/acmeism/RosettaCodeData.
The dataset is split into a set of similar source code document pairs and a set of
unique documents. After processing, the dataset contains 58548 file pairs that are
considered similar and 57170 file pairs that are considered dissimilar.
Listing 5.1: Original Java Code

public class SumStack {
    private int[] items;
    private int top = -1;
    private int sum = 0;

    public SumStack (int size) {
        items = new int[size];
    }

    public void push (int d) {
        if (top < items.length) {
            top = top + 1;
            items[top] = d;
            sum = sum + d;
        }
    }

    public int pop () {
        int d = -1;
        if (top >= 0) {
            d = items[top];
            top = top - 1;
            sum = sum - d;
        }
        return d;
    }

    public int sum () {
        return sum;
    }
}

Listing 5.2: Obfuscated Java Code (formatted)

public class O {
    public void B9a310 (int u34n9qK) {
        if (Br7Bo8c5 < vM.length) {
            Br7Bo8c5 = 1 + Br7Bo8c5;
            vM[Br7Bo8c5] = u34n9qK;
            pVn_ = pVn_ + u34n9qK;
        }
    }

    private int pVn_ = 0;

    public int pVn_ () {
        return pVn_;
    }

    public O (int VTX) {
        vM = new int[VTX];
    }

    private int[] vM;

    public int Ga () {
        int CfluOXd3 = -1;
        if (Br7Bo8c5 >= 0) {
            CfluOXd3 = vM[Br7Bo8c5];
            Br7Bo8c5 = Br7Bo8c5 - 1;
            pVn_ = pVn_ - CfluOXd3;
        }
        return CfluOXd3;
    }

    private int Br7Bo8c5 = -1;
}
The Java parser we use removes both comments and formatting. Because of
this, after pretty-printing the code we also remove all spacing and newline
characters, as they no longer carry any additional information. Furthermore, the
implementation of the tool does not support the entire Java language, so some
parts of the code may remain unchanged after the obfuscation.
All code samples that are larger than 256 characters and smaller than 3000
characters after removing whitespace are converted to similar pairs and "unique"
pairs, 114338 Java files in total.
The obfuscation tool makes sure that there is at least some similarity by
keeping at least one class method.
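A greatly simplified sketch of one of the obfuscations, identifier renaming (hypothetical: the actual tool works on the parsed Java AST, also reorders class members, and here the list of identifiers to rename is supplied by hand):

import random
import re
import string

def random_name(length=8):
    # A random Java-style identifier: a letter followed by letters, digits or underscores.
    first = random.choice(string.ascii_letters)
    rest = "".join(random.choice(string.ascii_letters + string.digits + "_")
                   for _ in range(length - 1))
    return first + rest

def rename_identifiers(source, identifiers):
    # Replace each listed identifier with a fresh random name (whole-word matches only).
    mapping = {name: random_name() for name in identifiers}
    for old, new in mapping.items():
        source = re.sub(r"\b%s\b" % re.escape(old), new, source)
    return source

original = open("SumStack.java").read()  # hypothetical path to the original file
obfuscated = rename_identifiers(
    original, ["SumStack", "items", "top", "sum", "push", "pop", "size", "d"])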
5.3 Neural Network for Source Code Similarity
The network consists of three main parts. An encoder transforms the character
embeddings into a fixed-size representation using an LSTM layer. The encoded
documents are then concatenated in both orders and processed by a "comparison"
module. The left and right parts of the network share the same parameters. By
concatenating the output of the encoder in both orders, sharing parameters, and
summing the outputs of the "comparison" modules, the learned similarity output
of the network is symmetric: sim(Document_a, Document_b) =
sim(Document_b, Document_a). The model contains a total of 136,929 parameters.
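A sketch of such a symmetric ("siamese") network in Keras (the layer sizes are illustrative and do not necessarily reproduce the 136,929-parameter model of Figure 5.1):

from keras.layers import Input, Embedding, LSTM, Dense, concatenate, add
from keras.models import Model

max_len, vocab_size, embed_dim, hidden = 3000, 128, 16, 64  # illustrative sizes

# Shared encoder: a character embedding followed by an LSTM producing a fixed-size vector.
doc_input = Input(shape=(max_len,))
encoded = LSTM(hidden)(Embedding(vocab_size, embed_dim)(doc_input))
encoder = Model(doc_input, encoded)

left_in, right_in = Input(shape=(max_len,)), Input(shape=(max_len,))
left, right = encoder(left_in), encoder(right_in)

# A shared "comparison" layer applied to both concatenation orders; summing the two
# outputs makes the similarity score symmetric in its two arguments.
compare = Dense(hidden, activation="relu")
summed = add([compare(concatenate([left, right])),
              compare(concatenate([right, left]))])
score = Dense(1, activation="sigmoid")(summed)

model = Model([left_in, right_in], score)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])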
Figure 5.1: Architecture of Similarity Detection Network
For training the network we randomly split the data into a training and a
validation set, using respectively 80% (91470 samples) and 20% (22868 samples)
of the data. The model is trained using the stochastic gradient descent method
Adam with its default parameters. The model is saved automatically after each
iteration in which the accuracy on the validation set improves. The progression of
the training of the neural network can be seen in Figure 5.2: the blue line shows
the cross-entropy loss of the network on the validation set (lower is better) after
each full iteration over the training data, and the red line shows the accuracy on
the validation set. The high performance on the validation set shows that the
model separates the similar and dissimilar classes well on this synthetic dataset.
To use this large model in practice for plagiarism tasks, we re-rank the most
similar results of the cosine-similarity model using the predictions of the neural
network model: the outputs of the two models are combined by taking their
average.
Figure 5.2: Training progression of the Similarity Detection Network: validation cross-entropy loss and validation accuracy (%) per epoch.
the output of the final layer is used as a vector representing each document. All
vectors are compared using cosine similarity, under the assumption that similar
documents have vectors pointing in a similar direction. Both models did not
perform very well on the mandelbrot dataset: they seem to predict high similarity
values when documents are almost identical, but fail on samples that are harder
to predict.
Chapter 6
Visualization of Source Code Similarity
6.1 Visualization
We use the following method to visualize the similarities between two source
code files:
• Calculate the relative importance of all the features for each pair (a sketch of
this computation is shown below the list)
• Locate these features in the document pairs
• Display the fragments with the highest cosine similarity value.
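A minimal sketch of the feature-importance computation (doc_a and doc_b are hypothetical file contents; with L2-normalised tf-idf rows, the elementwise product of the two vectors sums to their cosine similarity, so each entry is the contribution of one n-gram). In the tool the vectorizer is of course fit on the whole collection; fitting on a single pair only keeps the sketch short:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(5, 5))
X = vectorizer.fit_transform([doc_a, doc_b])

# Per-feature contribution to the cosine similarity of this pair.
contributions = X[0].multiply(X[1]).toarray().ravel()

index_to_ngram = {i: g for g, i in vectorizer.vocabulary_.items()}
for i in np.argsort(contributions)[::-1][:10]:
    print(index_to_ngram[i], round(float(contributions[i]), 4))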
Figure 6.1: InfiniteMonkey upload screen
We can also show the top 5 results within a file pair, as can be seen in Figure 6.4:
in our detail view, we show the five fragment combinations between the two files
with the highest cosine similarity.
Compared to the line-based tool diff and the plagiarism tool Moss, we show the
importance of individual features rather than, respectively, identical lines or
similar text blocks.
Figure 6.3: Top 5 results on the b2 dataset
Figure 6.4: Detail view: top 5 similarities for a single file pair
Chapter 7
Evaluation of Results
Dataset      cpd    infinitemonkey  diff   difflib  jplag  marble  moss   plaggie  sherlock  sim    infinitemonkey-rerank
a1           0.611  0.928           0.929  0.94     0.842  0.766   0.775  0.857    0.877     0.778  0.754
a2           0.574  0.873           0.919  0.939    0.829  0.727   0.83   0.877    0.855     0.755  0.683
b1           0.676  0.965           0.979  0.981    0.872  0.71    0.829  0.864    0.77      0.783  0.819
b2           0.522  0.92            0.912  0.951    0.83   0.624   0.888  0.911    0.553     0.72   0.903
c2           1.0    1.0             1.0    1.0      1.0    1.0     1.0    1.0      1.0       1.0    1.0
prettyprint  0.0    0.767           0.1    1.0      0.764  0.604   0.579  0.633    0.0       1.0    0.071
reversi      1.0    1.0             0.99   0.945    0.964  0.962   1.0    0.962    0.744     0.71   1.0
quicksort    0.998  1.0             0.966  0.965    0.92   0.968   0.995  0.93     0.849     0.833  0.976
mandelbrot   0.975  1.0             0.995  0.998    0.996  1.0     1.0    0.963    0.978     0.59   1.0
nr. best     2      3               1      6        1      1       2      1        1         1      3
mean         0.706  0.939           0.866  0.969    0.898  0.82    0.872  0.89     0.792     0.792  0.763

Dataset      cpd    infinitemonkey  diff   difflib  jplag  marble  moss   plaggie  sherlock  sim
synthetic    0.919  0.983           0.921  0.815    0.985  0.987   0.91   0.983    0.806     0.96
Chapter 8
Discussion
We tried two main approaches to source code plagiarism detection using machine
learning. The first is an unsupervised approach using cosine similarity, n-gram
features and tf-idf weighting. The main contribution here is that this baseline
tf-idf approach does well on a large set of plagiarism tasks and achieves better
scores than other popular tools. Systems that also use textual features, such as
Moss, could likely add tf-idf weighting of features to improve their ranking quality.
Our second contribution is a model based on a deep recurrent neural network.
This model shows that similarities in source code can be learned using our
architecture and a large synthetic dataset. However, the model does not
generalize well to our evaluation data.
Finally, we compare all the tools on a number of different datasets using the
average precision metric and evaluate the results.
A simpler supervised model could be built using multiple similarity metrics or the
outputs of different tools as features; such a model could learn the importance of
each similarity metric. Another interesting direction would be semi-supervised
learning: first learn a good representation of source code on a large dataset using
an unsupervised learning algorithm, and then use these representations as
features for a supervised model.
Chapter 9
Conclusion
In this thesis we compared a number of tools and developed our own tool,
InfiniteMonkey. We experimented with a more traditional information retrieval
approach as well as with machine learning models. We also performed a
comparison on nine datasets drawn from two different sources. The n-gram model
with tf-idf weighting and cosine similarity works well for this problem and scores
well across the different datasets. The tf-idf weighted features can also be used as
part of a visualization method, as the more distinctive parts of a document pair
show up as more important than other similarities. We developed a web-based
GUI based on this simple method. We also tried a deep neural network model on
synthetic data generated from open source repositories. This model, trained on
the synthetic open source dataset, did not generalize well to the source code
plagiarism tasks we evaluated; this approach needs more work to overcome this
problem.
Bibliography
[1] J. Hage, P. Rademaker, and N. van Vugt. Plagiarism detection for Java: a tool comparison. In G. C. van der Veer, P. B. Sloep, and M. C. J. D. van Eekelen, editors, Computer Science Education Research Conference, CSERC 2011, Heerlen, The Netherlands, April 7-8, 2011, pages 33-46. ACM, 2011.

[2] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. On the detection of source code re-use. In P. Majumder, M. Mitra, M. Agrawal, and P. Mehta, editors, Proceedings of the Forum for Information Retrieval Evaluation, FIRE 2014, Bangalore, India, December 5-7, 2014, pages 21-30. ACM, 2014.

[3] E. Flores, P. Rosso, E. Villatoro-Tello, L. Moreno, R. Alcover, and V. Chirivella. PAN@FIRE: Overview of CL-SOCO track on the detection of cross-language source code re-use. In P. Majumder, M. Mitra, M. Agrawal, and P. Mehta, editors, Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation, Gandhinagar, India, December 4-6, 2015, volume 1587 of CEUR Workshop Proceedings, pages 1-5. CEUR-WS.org, 2015.

[4] P. Modiba, V. Pieterse, and B. Haskins. Evaluating plagiarism detection software for introductory programming assignments. In Proceedings of the Computer Science Education Research Conference 2016, CSERC '16, pages 37-46, New York, NY, USA, 2016. ACM.

[8] R. Pike. Sherlock. http://www.cs.usyd.edu.au/~scilect/sherlock/.

[9] A. Ahtiainen, S. Surakka, and M. Rahikainen. Plaggie: GNU-licensed source code plagiarism detection engine for Java exercises. In A. Berglund and M. Wiggberg, editors, 6th Baltic Sea Conference on Computing Education Research, Koli Calling, Baltic Sea '06, Koli, Joensuu, Finland, November 9-12, 2006, pages 141-142. ACM, 2006.

[10] D. Grune and M. Huntjens. Het detecteren van kopieën bij informatica-practica. Informatie, 31(11):864-867, 1989.

[11] D. Grune and M. Huntjens. Sim. http://dickgrune.com/Programs/similarity_tester/.

[19] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. On the detection of source code re-use. In Proceedings of the Forum for Information Retrieval Evaluation, pages 21-30. ACM, 2014.

[20] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.