
Unsupervised Detection of Solving Strategies for Competitive Programming

Alexandru Ştefan Stoica¹, Daniel Băbiceanu¹, Marian Cristian Mihăescu¹(B), and Traian Rebedea²

¹ University of Craiova, Craiova, Romania
[email protected]
² University Politehnica of Bucharest, Bucharest, Romania
[email protected]

Abstract. Transformers are increasingly used for solving various Natural Language Processing tasks. Recently, they have also been employed to process source code in order to analyze very large code bases automatically. This paper presents a custom-designed data analysis pipeline that can classify source code from competitive programming solutions. Our experiments show that the proposed models accurately determine the number of distinct solutions for a programming challenge, even in an unsupervised setting. Together with our model, we also introduce a new dataset for this task, called AlgoSol-10, which consists of ten programming problems together with all their source code submissions, manually clustered by experts according to the algorithmic solution used to solve each problem. Given the success of the approach on small source codes, we discuss the potential of further using transformers for the analysis of large code bases.

Keywords: Transformers · Competitive programming · Source code analysis · Unsupervised learning

© Springer Nature Switzerland AG 2021. H. Yin et al. (Eds.): IDEAL 2021, LNCS 13113, pp. 157–165, 2021. https://doi.org/10.1007/978-3-030-91608-4_16

1 Introduction
This paper introduces a new dataset called AlgoSol-10 and an approach for determining distinct solutions, in terms of algorithmic approach and implementation, in the context of competitive programming. For consistency, we use the term algorithmic solution for the approach used to solve a competitive programming problem. The term source code solution (or just solution) denotes a particular implementation of an algorithmic solution in the C++ programming language. The solution of a problem is represented by a source code file that has been written by a competitor during a contest.
In competitive programming, the number of distinct algorithmic solutions for a problem is typically not known in advance. We consider that a problem has different algorithmic solutions if they use different computer science methodology and concepts to correctly solve the same problem. For example, the problem of


finding whether a number is in a vector can be solved by different algorithmic solutions: 1) sorting the vector and using binary search, 2) using a hash table, 3) using a balanced tree, 4) using a linear search, etc. The trivial approach is to go through the source code solutions of every participant and determine by hand how many correct distinct algorithmic solutions were submitted. Considering that a problem may have hundreds of source code solutions, we want to provide a method that automatically determines the correct label for each solution, given a fixed number of distinct algorithmic solutions.
We did not find any available dataset with problems labelled in terms of distinct algorithmic solutions to validate our method, so we created our own. It contains 10 problems from Infoarena¹ for which we manually labelled all source code solutions. To the best of our knowledge, we are the first to propose an approach to this problem in the context of competitive programming. We provide a supervised method as well as a new unsupervised method that performs well in comparison with classical unsupervised methods.
One of the most challenging aspects of the tackled task is building the labelled dataset, because it requires an expert in competitive programming who can correctly label the various source code solutions that implement the same algorithmic solution. For example, for the problem of finding a value in a vector, the algorithmic solution that uses sorting and binary search may have different source code implementations of the sort (i.e., bubble sort, quick sort, STL sort, etc.) and of the binary search (i.e., with a while loop, with STL, etc.). The difficulty also comes from the fact that code written in competitive programming does not respect the industry's coding standards, which implies that one cannot make any assumptions about specific rules: Dijkstra's algorithm could be implemented in a function named f, variables could be named as single letters, etc. Finally, not every piece of code is still compilable, even though it compiled at some point in time: compilers change over time, and a solution may have been written many years ago, with no record of the compiler version used at that moment.
The proposed solution is based on unsupervised learning, while the ten manually labelled problems are used for evaluation purposes. The embedding methods used within the workflow are the classical Term Frequency-Inverse Document Frequency (Tf-idf) [10] and Word2Vec (W2V) [12], as well as the more recent deep learning (DL) based Self-Attentive Function Embeddings (SAFE) [11].

2 Related Work
Natural Language Processing (NLP) generally deals with processing text infor-
mation using various Machine Learning and rule-based techniques, with the
former receiving wider interest lately. Recent developments target using state-
of-the-art NLP techniques for processing source code as well, as it can also
be regarded as text. Thus, Chen and Monperrus [6] present a literature study
¹ Infoarena, www.infoarena.ro, last accessed on 30th June 2021.

on diverse techniques for source code embeddings as a novel research direction within NLP. Further, Rabin et al. [13] aim to demystify the dimensions of source code embeddings, a new attempt to shed light on this research domain.
The usage of DL for source code analysis has become more popular due to the
attention mechanism from NLP that has been used in Code2Vec [1] and SAFE
[11]. Code2Vec uses the attention mechanism on various paths in the abstract
syntax tree (AST) of a function for generating corresponding embeddings. On
the other hand, SAFE only uses the attention mechanism on assembly code
obtained after compiling the source code and generates an embedding for each
function. A classification problem (i.e., prediction of labels) that uses W2V and
DL for the analysis of source code has been reported by Iacob et al. [8]. A usage of
Code2Vec for recommending function names has been reported by Jiang et al. [9],
although the proposed method has poor results on real-world code repositories
as the domain is still in its infancy.
Other approaches build the profile of individual students based on their pro-
gramming design [3], determine source code authorship [5], predict the bug prob-
ability of software [15], build source code documentation by summarization [2],
classify source code [4] or detect the design pattern [7].

3 Proposed Approach
A particular aspect of the task is that in competitive programming, a problem
usually has a small number of distinct algorithmic solutions which allows the
possibility of manual labelling. This context has two implications: 1) we face a
classification problem or an unsupervised learning situation with a known and
small number of clusters, and 2) we may build a labelled dataset for validation
purposes. Therefore, we can consider that for a particular problem, we know
K (i.e., the number of distinct algorithmic solutions), and the task becomes to
determine the label of each source code solution. Given that we do not have any
labels for our instances (i.e., source code solutions to a given problem), we are
clearly in the area of unsupervised learning.
As a proposed approach for this problem, we build a custom data analysis
pipeline from the following components.
Preprocessing. The first step in the pipeline is the preprocessing of source code solutions. This step is compulsory for all source code solutions that are given to W2V and Tf-idf for building the embeddings. The following steps are performed (see the sketch below): 1) delete the #include directives; 2) delete comments; 3) delete functions that are never called; 4) replace (expand) all macro directives; 5) delete apostrophe characters; 6) delete all non-ASCII characters (for example, strings may contain Unicode characters); 7) tokenize the source code. We mention that SAFE embeddings do not require this preprocessing because SAFE takes as input the object code, not the source code; the only requirement for a source code solution to be used with SAFE is that it compiles. Thus, some embedding methods use source code as input (i.e., W2V, Tf-idf) while SAFE uses object code. The goal of this step is to perform whatever preprocessing is necessary to obtain higher-quality code embeddings, since the embeddings carry the main information.
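
A minimal Python sketch of this preprocessing is shown below. The regular expressions and the tokenizer are illustrative assumptions, not the exact rules of our pipeline, and the two steps that require real C++ parsing (removing uncalled functions, expanding macros) are omitted.

    import re

    def preprocess_solution(source: str) -> list[str]:
        # 1) delete #include directives
        source = re.sub(r'#\s*include.*', '', source)
        # 2) delete /* ... */ and // comments
        source = re.sub(r'/\*.*?\*/', '', source, flags=re.DOTALL)
        source = re.sub(r'//.*', '', source)
        # 3) and 4) (dead-function removal, macro expansion) need a real
        #    C++ parser and are omitted from this sketch
        # 5) delete apostrophe characters
        source = source.replace("'", '')
        # 6) keep only ASCII characters
        source = source.encode('ascii', errors='ignore').decode()
        # 7) tokenize into identifiers/numbers and single punctuation symbols
        return re.findall(r'\w+|[^\w\s]', source)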
Embedding Computation. For generating the source code embeddings with W2V and SAFE we use AlgoLabel [8]. The embedding computation with Tf-idf is implemented separately and follows the classical bag-of-words approach.
W2V uses the tokens of each source code solution obtained after preprocessing. Based on these tokens, we train a neural network whose goal is to predict the current token given the nearby tokens, following the C-BOW architecture [12]. The algorithm uses a window of five tokens, and each embedding is a vector of dimension 128. The resulting embedding of a source code solution is the average of the embeddings of the tokens that make up the solution.
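
With gensim, for instance, this can be sketched as follows; this is a minimal illustration under the stated hyperparameters, and AlgoLabel's actual training setup may differ.

    import numpy as np
    from gensim.models import Word2Vec

    # token_lists: one token list per preprocessed solution
    model = Word2Vec(sentences=token_lists, vector_size=128, window=5,
                     sg=0, min_count=1)  # sg=0 selects the C-BOW architecture

    def w2v_embedding(tokens):
        # average the embeddings of the tokens present in the vocabulary
        vectors = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vectors, axis=0)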
The Tf-idf algorithm uses the tokens obtained for each source code solution after preprocessing. These tokens are filtered by removing the stop words

"(", ")", ".", "#", ";", ",", ">>", "<<", "{", "}", "[", "]"

and by keeping only the tokens that match the following regular expression:

\w+|\+|-|=|!=|\*|/|%|!

The output of the Tf-idf algorithm is a matrix whose number of columns is the cardinality of the vocabulary (i.e., the number of distinct filtered tokens).
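
For illustration, a minimal sklearn version could look like this; the parameter choices are assumptions, since our implementation is separate.

    from sklearn.feature_extraction.text import TfidfVectorizer

    stop_tokens = ["(", ")", ".", "#", ";", ",", ">>", "<<", "{", "}", "[", "]"]
    vectorizer = TfidfVectorizer(
        token_pattern=r"\w+|\+|-|=|!=|\*|/|%|!",  # the filter regex above
        stop_words=stop_tokens,
    )
    # one row per solution, one column per distinct filtered token
    tfidf_matrix = vectorizer.fit_transform(solution_texts)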
SAFE uses only binary code that has been obtained after compilation of
the source code solution. The pretrained SAFE model is used to compute one
embedding for each function in the source code solution. The dimension of the
obtained vector is 100, and the embedding of a source code solution is the average
of the embeddings of all functions from the source code solution itself.
Building Clusters for Known Value of K. For determining the clusters (or
patterns) within a set of source code solutions, we have used three flavours of
unsupervised learning.
Clustering with a Single View. This is the simplest approach: for each embedding technique, we run a clustering algorithm that partitions the available source code solutions into K groups, as sketched below.
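
In sklearn terms, a single view is simply one embedding matrix passed to one clustering algorithm; a sketch, with K known in advance:

    from sklearn.cluster import KMeans, SpectralClustering, AgglomerativeClustering

    # X: the embedding matrix of one view (W2V, Tf-idf or SAFE)
    # K: the known number of distinct algorithmic solutions
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    # or: SpectralClustering(n_clusters=K).fit_predict(X)
    # or: AgglomerativeClustering(n_clusters=K).fit_predict(X)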
Clustering with Multi-View. We consider the V available embedding methods as distinct views of the same dataset. This enables the usage of the Multi-View Spectral Clustering (MVSC) [14] algorithm, which also determines K groups.
Clustering by Voting. We assume that we have V views and, for each view, an associated embedding. For each view we use a clustering algorithm which partitions the embeddings of that view into K groups. Thus, with each source code solution we associate a vector of dimension V whose value at position i lies in the range 0 to K − 1 and represents the cluster to which the solution belongs in the i-th view. More exactly, if we define $S(p)$ as the set of all source code solutions associated with a problem p and let $I_K = \{0, 1, \ldots, K-1\}$ with $K \geq 2$, then every solution $s_i \in S(p)$ is mapped to a label vector

$v^p(s_i) = (x_0, x_1, \ldots, x_{V-1}) \in I_K^V$.

We observe that the same vector may represent several source code solutions. Thus, we define the frequency of a vector as the number of source code solutions it represents:

$s^p(v) = \{ s_i \mid s_i \in S(p),\; v^p(s_i) = v \}, \qquad f^p(v) = |s^p(v)|$.

Having this function defined, we look for K vectors whose sum of frequencies is maximal and which have no coordinate in common. If we cannot find K such vectors (i.e., as many as the known number of clusters), we must choose fewer clusters. Formally, we define

$A = \{ v_1, \ldots, v_K \} \subset I_K^V, \quad v_i = (x_0^{(i)}, x_1^{(i)}, \ldots, x_{V-1}^{(i)})$,

where $x_d^{(i)} \neq x_d^{(j)}$ for all $i \neq j$, $i, j \in \{1, \ldots, K\}$, $d \in \{0, \ldots, V-1\}$. The task is to determine A such that $\sum_{v_i \in A} f^p(v_i)$ has maximum value. Each algorithmic solution itself is then represented by the set $s^p(v_i)$.
The chosen vectors represent the K distinct algorithmic solutions extracted across the views; each one groups together similar source code solutions.
To solve this problem, we use a dictionary in which we compute the frequency of each vector; the key of the dictionary is a string obtained by concatenating the coordinates with a separator. We extract the keys and enumerate all subsets of size K that satisfy the no-common-coordinate condition. For each such subset, we compute the sum of frequencies and keep the subset with the maximum sum. The complexity is exponential, but the algorithm is tractable because in our dataset V ≤ 3 and K ≤ 3.
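
A direct sketch of this exhaustive search, where Counter plays the role of the frequency dictionary (names are illustrative):

    from collections import Counter
    from itertools import combinations

    def best_vector_set(label_vectors, K):
        # label_vectors: one tuple of per-view cluster ids per solution
        freq = Counter(map(tuple, label_vectors))
        best, best_sum = None, -1
        for subset in combinations(freq, K):
            # no two chosen vectors may share a coordinate
            distinct = all(a[d] != b[d]
                           for a, b in combinations(subset, 2)
                           for d in range(len(a)))
            total = sum(freq[v] for v in subset)
            if distinct and total > best_sum:
                best, best_sum = subset, total
        return best  # tractable since V <= 3 and K <= 3 in our dataset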

Algorithm 1. Semi-supervised data analysis pipeline

Require: Solutions-Dataset = solutions for a problem
1: # Setup voters with their parameters: embedding, clustering algorithm and classification algorithm
2: # Build Ground-Truth-Dataset, which maximizes $\sum_{v_i \in A} f^p(v_i)$
3: # Sols-Train = Ground-Truth-Dataset
4: # Sols-Unlabeled = solutions not in Sols-Train
5: while (# of valid solutions greater than threshold and # Sols-Unlabeled greater than 0) do
6:   Voter-X_i = train classifier on Sols-Train based on the i-th view
7:   for all (Sols-Unlabeled) do
8:     # Predict the label of the solution with all voters
9:     if (solution has the same label for all voters) then
10:      # Append solution to Sols-Train
11:      # Remove solution from Sols-Unlabeled
12:    end if
13:  end for
14: end while
15: Sols-Test = Sols-Train
16: # Validate Sols-Test

The algorithm presented above can be interpreted in the following way: each view represents a voter that partitions the items (i.e., the source code solutions) into K clusters. Each voter has its own parameters in terms of embedding, clustering algorithm and classification algorithm. The ideal situation occurs when the voters perfectly agree on the distribution of items into clusters. The task is to correctly associate the clusters predicted by voter A with the clusters predicted by voter B; with more voters, the matching becomes harder to determine. The goal is to find the matching with the maximum number of identical items in coupled clusters. As initialization, we build a ground-truth matching given by the first coordinate (i.e., the cluster id to which the item belongs). After that, in the main loop, we predict the label (i.e., the cluster id) of the remaining items and consider a label correct if all classifiers predict it. These items are appended to the training dataset, and we re-train the voters as long as the number of appended items is above a threshold and there are still unlabeled items. Finally, the items that could not be labelled are discarded.
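
The main loop can be sketched as follows; this is a simplified illustration of Algorithm 1, and the variable names and the stopping threshold are assumptions.

    import numpy as np

    def co_train(classifiers, views, y, labeled, unlabeled, threshold=5):
        # classifiers[i] is the voter for the i-th view, views[i] its embedding
        # matrix; y holds the labels of the items indexed by `labeled`;
        # `labeled` and `unlabeled` are numpy index arrays into the dataset
        while len(unlabeled) > 0:
            for clf, X in zip(classifiers, views):
                clf.fit(X[labeled], y[labeled])          # one voter per view
            preds = np.vstack([clf.predict(X[unlabeled])
                               for clf, X in zip(classifiers, views)])
            agree = (preds == preds[0]).all(axis=0)      # unanimous votes only
            if agree.sum() < threshold:                  # too few new items: stop
                break
            newly = unlabeled[agree]
            y[newly] = preds[0][agree]                   # accept the agreed label
            labeled = np.concatenate([labeled, newly])
            unlabeled = unlabeled[~agree]
        return labeled, y                                # the rest is discarded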
Evaluation Methodology. We evaluate each method taking into account the optimal number of clusters for each problem. We denote the set of embeddings by E = {Word2Vec, Tf-idf, SAFE}, the set of clustering algorithms by C = {KMeans, Spectral Clustering, Agglomerative Clustering}, and the set of classification algorithms by Clf = {Random Forest, SVM, XGBoost}. We define $[X]^n$ as the set of all subsets of size n of X, for X ∈ {E, C, Clf}. We mention that the baseline results are obtained without hyper-parameter tuning of either the clustering or the classification algorithms. As a general approach, the evaluation of a clustering algorithm takes into consideration all label permutations, and the one with the greatest F1 score is selected as the winner. Since problems may have algorithmic solutions with an imbalanced number of source code solutions, the chosen quality metric is F1-macro, because we want to treat all classes in the same way.
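
This permutation-based scoring can be written directly; a sketch, feasible because K is small:

    from itertools import permutations
    import numpy as np
    from sklearn.metrics import f1_score

    def clustering_f1_macro(y_true, cluster_ids, K):
        # try every mapping of cluster ids to class ids, keep the best score
        best = 0.0
        for perm in permutations(range(K)):
            mapped = np.array([perm[c] for c in cluster_ids])
            best = max(best, f1_score(y_true, mapped, average="macro"))
        return best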
The evaluation of the method of building clusters with a single view takes into consideration all combinations obtained by the Cartesian product $[E]^1 \times [C]^1$ and evaluates each combination by the approach presented above.
Similarly, the method that uses multi-view spectral clustering uses as embeddings $[E]^2$ and $[E]^3$.
The method that uses clustering by voting is validated by determining the best results after co-training with the employed classifiers and the best results after classification. These approaches have as setup the Cartesian products $[E]^2 \times [C]^2 \times [Clf]^2$ and $[E]^3 \times [C]^3 \times [Clf]^3$, respectively.
After that, a supervised validation method is used to evaluate the obtained models' capability to predict distinct algorithmic solutions. We have used a train-test split frequently employed for smaller datasets: 80% of the data for training and 20% for testing, on the Cartesian product $[E]^1 \times [Clf]^1$. Finally, we determine for each problem the best method of clustering its solutions.

4 Experimental Results
To evaluate the performance of the proposed algorithms with the employed code embeddings, it is compulsory to have ground truth. Considering that there is no labelled dataset of distinct solutions for competitive programming, we decided to label several problems manually. The AlgoSol-10 dataset consists of ten problems from Infoarena, and the criteria for their selection are: 1) the problem must have at least two distinct algorithmic solutions; 2) the number of source code solutions for each class (i.e., algorithmic solution) should be large enough; 3) the classes should be as balanced as possible; 4) the algorithmic solutions should be as distinct as possible, so that we may better evaluate how well the embeddings describe the algorithmic solutions across various implementations.

Table 1. Method with the highest F1-macro score per problem (UV: Unsupervised Voting, MVSC: Multi-View Spectral Clustering). Each problem block lists one line per view (embedding, clustering algorithm, classifier).

Method  Pb.        Embed.   Clustering           Estimator       F1                 Size
UV      mst        w2v      KMeans               Random forest   Micro: 0.9783      323
                   safe     SpectralClustering   XGBClassifier   Macro: 0.9731
                   tf-idf   KMeans               XGBClassifier   Weighted: 0.9785
UV      cppp       w2v      KMeans               XGBClassifier   Micro: 0.5476      409
                   tf-idf   SpectralClustering   SVC             Macro: 0.5176
                                                                 Weighted: 0.4956
UV      gcd        w2v      KMeans               SVC             Micro: 0.7412      429
                   tf-idf   SpectralClustering   SVC             Macro: 0.7163
                                                                 Weighted: 0.7350
UV      eval       w2v      SpectralClustering   XGBClassifier   Micro: 0.9871      388
                   safe     SpectralClustering   Random forest   Macro: 0.9796
                                                                 Weighted: 0.9870
UV      inv        w2v      KMeans               XGBClassifier   Micro: 0.9725      400
                   safe     KMeans               XGBClassifier   Macro: 0.8447
                   tf-idf   KMeans               Random forest   Weighted: 0.8447
UV      invmod     w2v      SpectralClustering   SVC             Micro: 0.9852      271
                   safe     KMeans               Random forest   Macro: 0.9849
                                                                 Weighted: 0.9852
UV      schi       safe     SpectralClustering   Random forest   Micro: 0.9953      429
                   tf-idf   SpectralClustering   Random forest   Macro: 0.9953
                                                                 Weighted: 0.9953
UV      ancestors  w2v      KMeans               Random forest   Micro: 0.9316      161
                   safe     SpectralClustering   Random forest   Macro: 0.9107
                   tf-idf   KMeans               XGBClassifier   Weighted: 0.9287
UV      strmatch   w2v      SpectralClustering   XGBClassifier   Micro: 0.9792      434
                   tf-idf   KMeans               Random forest   Macro: 0.9792
                                                                 Weighted: 0.9791
MVSC    scc        safe     None                 None            Micro: 0.7012      472
                   tf-idf                                        Macro: 0.6651
                                                                 Weighted: 0.6651

In terms of size, each problem has about 500 solutions written in the C/C++ programming languages. The folders containing the source code solutions are hierarchically organized: the root folder has the name of the problem, and each problem's folder has one sub-folder per distinct algorithmic solution, containing all the source code solutions that belong to that algorithmic approach. The name of each source code solution file is its Infoarena submission id.
The dataset was built by running a custom-developed scraper² which downloaded only correct source code solutions. The selected problems are: strmatch (ro. potrivirea sirurilor), ancestors (ro. stramosi), schi, scc (ro. componente tari conexe, en. strongly connected components), gcd (ro. cel mai mare divizor comun, en. greatest common divisor), cppp (ro. cele mai apropiate două puncte in plan, en. closest pair of points in the plane), eval (ro. evaluare, en. evaluation), invmod (ro. invers modular), mst (ro. arbore partial de cost minim, en. minimum cost spanning tree), and inv. Detailed problem statements and source code solutions are openly accessible on the Infoarena site.
From Table 1, it can be seen that UV (i.e., unsupervised voting) obtains the highest score on all problems except one. From this we can conclude that initializing the dataset with the maximum-frequency method improves the results, although the improvement depends on the quality of the clustering and embedding methods: if all of them yield bad results, the unsupervised voting method will not give perfect results either. The implementation of the proposed solution may be found in the AlgoDistinctSolutions³ git repository, which contains the code for downloading source code solutions, the labelled dataset and the implementation itself.

5 Conclusions

The paper presents a custom-designed semi-supervised data analysis pipeline that assigns source code solutions of competitive programming problems to their correct algorithmic approach. For validation purposes, we have created a manually labelled dataset and have used several embedding methods with different clustering algorithms integrated into an unsupervised learning pipeline. The fully unsupervised setting has also produced excellent results, leading us to conclude that the proposed method solves the tackled problem correctly. In the future, we plan to extend the number of problems in the dataset and to make all source code solutions compile. Further improvements may include designing a metric that estimates the likely number of clusters in the dataset. Finally, we believe that this work opens the way towards other practical applications of Deep Learning based NLP on source code, such as generating alternative implementations in large code bases or detecting plagiarism by comparing source code embeddings.

Acknowledgements. This work was partially supported by the grant 135C/2021 "Development of software applications that integrate machine learning algorithms", financed by the University of Craiova.

² InfoarenaScrappingTool, https://github.com/Arkin1/InfoarenaScrappingTool.
³ AlgoDistinctSolutions, https://github.com/Arkin1/AlgoDistinctSolutions.

References
1. Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: Learning distributed
representations of code. In: Proceedings of the ACM on Programming Languages,
vol. 3(POPL), pp. 1–29 (2019)
2. Arthur, M.P.: Automatic source code documentation using code summarization
technique of NLP. Procedia Comput. Sci. 171, 2522–2531 (2020)
3. Azcona, D., Arora, P., Hsiao, I.H., Smeaton, A.: user2code2vec: embeddings for
profiling students based on distributional representations of source code. In: Pro-
ceedings of the 9th International Conference on Learning Analytics & Knowledge,
pp. 86–95 (2019)
4. Barchi, F., Parisi, E., Urgese, G., Ficarra, E., Acquaviva, A.: Exploration of con-
volutional neural network models for source code classification. Eng. Appl. Artif.
Intell. 97, 104075 (2021)
5. Burrows, S., Uitdenbogerd, A.L., Turpin, A.: Application of information retrieval
techniques for source code authorship attribution. In: Zhou, X., Yokota, H., Deng,
K., Liu, Q. (eds.) DASFAA 2009. LNCS, vol. 5463, pp. 699–713. Springer, Heidel-
berg (2009). https://doi.org/10.1007/978-3-642-00887-0_61
6. Chen, Z., Monperrus, M.: A literature study of embeddings on source code. arXiv
preprint arXiv:1904.03061 (2019)
7. Chihada, A., Jalili, S., Hasheminejad, S.M.H., Zangooei, M.H.: Source code and
design conformance, design pattern detection from source code by classification
approach. Appl. Soft Comput. 26, 357–367 (2015)
8. Iacob, R.C.A., et al.: A large dataset for multi-label classification of algorithmic
challenges. Mathematics 8(11), 1995 (2020)
9. Jiang, L., Liu, H., Jiang, H.: Machine learning based recommendation of method
names: how far are we. In: 2019 34th IEEE/ACM International Conference on
Automated Software Engineering (ASE), pp. 602–614. IEEE (2019)
10. Jones, K.S.: A statistical interpretation of term specificity and its application in
retrieval. J. Documentation (1972)
11. Massarelli, L., Di Luna, G.A., Petroni, F., Baldoni, R., Querzoni, L.: SAFE: self-
attentive function embeddings for binary similarity. In: Perdisci, R., Maurice, C.,
Giacinto, G., Almgren, M. (eds.) DIMVA 2019. LNCS, vol. 11543, pp. 309–329.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22038-9_15
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Rabin, M.R.I., Mukherjee, A., Gnawali, O., Alipour, M.A.: Towards demystifying
dimensions of source code embeddings. In: Proceedings of the 1st ACM SIGSOFT
International Workshop on Representation Learning for Software Engineering and
Program Languages, pp. 29–38 (2020)
14. Kanaan-Izquierdo, S., Ziyatdinov, A., Perera-Lluna, A.: Multiview and multifeature spectral clustering using common eigenvectors. Pattern Recogn. Lett. 102, 30–36 (2018)
15. Shi, K., Lu, Y., Chang, J., Wei, Z.: PathPair2Vec: an AST path pair-based code representation method for defect prediction. J. Comput. Lang. 59, 100979 (2020)
