Unsupervised Detection of Solving Strategies for Competitive Programming
Abstract. Transformers are increasingly used to solve various Natural Language Processing tasks. Recently, they have also been employed to process source code in order to analyze very large code bases automatically. This paper presents a custom-designed data analysis pipeline that can classify source code from competitive programming solutions. Our experiments show that the proposed models accurately determine the number of distinct solutions for a programming challenge task, even in an unsupervised setting. Together with our model, we also introduce a new dataset for this task, called AlgoSol-10, which consists of ten programming problems together with all their source code submissions, manually clustered by experts according to the algorithmic solution used to solve each problem. Given the success of the approach on small source codes, we discuss the potential of further using transformers for the analysis of large code bases.
1 Introduction
This paper introduces a new dataset called AlgoSol-10 and an approach to determine distinct solutions, in terms of algorithmic approach and implementation, in the context of competitive programming. For consistency of terminology, we use the term algorithmic solution for the approach used to solve a competitive programming problem. The term source code solution (or just solution) denotes a particular implementation of the algorithmic solution in the C++ programming language. The solution of a problem is represented by a source code file written by a competitor during a contest.
In competitive programming, the number of distinct algorithmic solutions for a problem is not known in advance. We consider that a problem has different algorithmic solutions if they use different computer science methodology and concepts to correctly solve the same problem.
2 Related Work
Natural Language Processing (NLP) generally deals with processing text information using various Machine Learning and rule-based techniques, with the former receiving wider interest lately. Recent developments target using state-of-the-art NLP techniques for processing source code as well, as it can also be regarded as text. Thus, Chen and Monperrus [6] present a literature study of embeddings on source code.
1 Infoarena, www.infoarena.ro, last accessed on 30th June 2021.
3 Proposed Approach
A particular aspect of the task is that, in competitive programming, a problem usually has a small number of distinct algorithmic solutions, which makes manual labelling feasible. This context has two implications: 1) we face a classification problem or an unsupervised learning situation with a known and small number of clusters, and 2) we may build a labelled dataset for validation purposes. Therefore, we can consider that for a particular problem we know K (i.e., the number of distinct algorithmic solutions), and the task becomes to determine the label of each source code solution. Given that we do not have any labels for our instances (i.e., source code solutions to a given problem), we are clearly in the area of unsupervised learning.
As a proposed approach for this problem, we build a custom data analysis
pipeline from the following components.
Preprocessing. The first step in the pipeline is the preprocessing of the source code solutions. This step is compulsory for all source code solutions that are given to W2V and Tf-idf for building the embeddings. The following steps are performed: 1) delete the #include directives; 2) delete comments; 3) delete functions that are not called; 4) replace all macro directives; 5) delete apostrophe characters; 6) delete all non-ASCII characters (for example, strings may contain Unicode characters); 7) tokenize the source code. We mention that SAFE embeddings do not require this preprocessing because SAFE takes as input the object code, not the source code. Thus, the only requirement for a source code solution to be provided as input to SAFE is that it compiles. As can be seen, some embedding methods use source code as input (i.e., W2V, Tf-idf), while SAFE uses object code. The main idea of this step is to do any preprocessing necessary in order to obtain higher-quality code embeddings, since they carry the main information.
160 A. Ş. Stoica et al.
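A minimal sketch of the preprocessing steps listed above; the regular expressions are illustrative assumptions, not the authors' exact implementation, and steps 3 (removing uncalled functions) and 4 (macro replacement) are omitted because they require compiler-level tooling.

```python
import re

def preprocess(source: str) -> list[str]:
    # 1) delete #include directives
    source = re.sub(r'^\s*#include.*$', '', source, flags=re.MULTILINE)
    # 2) delete /* ... */ and // ... comments
    source = re.sub(r'/\*.*?\*/', '', source, flags=re.DOTALL)
    source = re.sub(r'//.*', '', source)
    # 5) delete apostrophe characters
    source = source.replace("'", '')
    # 6) keep ASCII characters only
    source = source.encode('ascii', errors='ignore').decode('ascii')
    # 7) tokenize: identifiers/numbers and single punctuation marks
    return re.findall(r'\w+|[^\w\s]', source)
```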
Embedding Computation. For generating the source code embeddings with W2V and SAFE, we use AlgoLabel [8]. The embedding computation with Tf-idf is implemented separately and follows the classical bag-of-words approach.
W2V uses the tokens of each source code solution after preprocessing. Based on the obtained tokens, we build a neural network whose goal is to predict the current token given the nearby tokens, in a C-BOW architecture [12]. The algorithm uses a window of five tokens, and the embedding is a vector of dimension 128. The resulting embedding of a source code solution is the average of the embeddings of the tokens that make up the source code solution itself.
The Tf-idf algorithm uses the tokens obtained for each source code solution after preprocessing. These tokens are filtered by removing the stop words and by keeping only the tokens that match the following regular expression:
\w+|\+|-|=|!=|\\*|\/|%|!|
Let $S(p) = \{s_i \mid s_i \in p\}$ denote all source code solutions associated to a problem $p$, and let $I_K = \{0, 1, \ldots, K-1\}$, $K \geq 2$, be the set of cluster labels. Each source code solution is mapped to a label vector $v^p : S(p) \to I_K^V$, $v^p(s_i) = (x_0, x_1, \ldots, x_{V-1})$, where $V$ is the number of views.
We observe that the same vector may represent several source code solutions. Thus, we define the frequency of a vector as the number of source code solutions represented by that vector:
$$s^p(v) = \{s_i \mid s_i \in S(p),\ v = v^p(s_i)\}, \qquad f^p(v) = |s^p(v)|.$$
Having this function defined, we find vectors whose sum of frequencies is maximal and which share no coordinate. Furthermore, if we cannot find as many such vectors as $K$ (i.e., the known number of clusters), we must choose fewer clusters. We define
$$A = \{v_i \mid v_i \in I_K^V,\ v_i = (x_0^{(i)}, x_1^{(i)}, \ldots, x_{V-1}^{(i)}),\ i \in \{1, \ldots, K\}\},$$
where $x_d^{(i)} \neq x_d^{(j)}$ for all $i \neq j$, $i, j \in \{1, \ldots, K\}$, $d \in \{0, \ldots, V-1\}$. The task is to determine $A$ such that $\sum_{v_i \in A} f^p(v_i)$ has maximum value. The algorithmic solution itself is represented by the sets $s^p(v_i)$.
The selected vectors represent the K distinct algorithmic solutions extracted across the views. Thus, each algorithmic solution groups similar source code solutions.
To solve this problem, we use a dictionary in which we compute the frequency of each vector. We extract the keys and choose all subsets of size K that satisfy the condition of having no identical coordinate. For each subset, we compute the sum of the key values and keep the subset with the maximum sum. The key of the dictionary is a string obtained by concatenating the coordinates separated by a separator. The complexity is exponential, but the algorithm is tractable because in our dataset V ≤ 3 and K ≤ 3.
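The selection step described above can be sketched as follows; the function name and data layout are illustrative assumptions (tuples are used as dictionary keys instead of separator-joined strings).

```python
from collections import Counter
from itertools import combinations

def select_representatives(label_vectors, K):
    """Pick K label vectors with maximal total frequency such that
    no two of them agree on any coordinate."""
    freq = Counter(tuple(v) for v in label_vectors)
    best, best_sum = None, -1
    # Exponential in the number of distinct vectors, but tractable
    # because V <= 3 views and K <= 3 clusters in the dataset.
    for subset in combinations(freq, K):
        # Reject subsets where two chosen vectors share a coordinate.
        if any(a[d] == b[d]
               for a, b in combinations(subset, 2)
               for d in range(len(a))):
            continue
        total = sum(freq[v] for v in subset)
        if total > best_sum:
            best, best_sum = subset, total
    # best is None when no valid subset of size K exists, in which
    # case fewer clusters must be chosen, as noted in the text.
    return best, best_sum
```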
The algorithm presented above can be interpreted in the following way: each view represents a voter that partitions the items (i.e., the source code solutions) into K clusters. Each voter has its own parameters in terms of the embedding, clustering algorithm, and classification algorithm used. The ideal situation occurs when the voters perfectly agree on the distribution of items into clusters. The task is to correctly associate the clusters predicted by voter A with the clusters predicted by voter B; with more voters, the matching becomes more difficult to determine. The task is therefore to determine the matching with the maximum number of identical items in coupled clusters. As initialization, we build a ground-truth matching given by the first coordinate (i.e., the cluster id to which the item belongs). After that, in the main loop, we predict the label (i.e., the cluster id) of the remaining items and consider a label correct only if all classifiers predict it. These items are appended to the training dataset, and we re-train the voters as long as the number of appended items is above a threshold and there are still unlabeled items. Finally, the items that could not be labelled are discarded.
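The voting loop above might be sketched as follows; the classifier interface, threshold value, and data layout are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def co_train(views, classifiers, seed_idx, seed_labels, threshold=10):
    """views: one embedding matrix per voter (same row order);
    classifiers: one untrained sklearn-style classifier per view;
    seed_idx/seed_labels: the initial ground-truth matching."""
    n = views[0].shape[0]
    labels = -np.ones(n, dtype=int)          # -1 marks unlabeled items
    labels[seed_idx] = seed_labels
    while True:
        train = labels >= 0
        for clf, X in zip(classifiers, views):
            clf.fit(X[train], labels[train])
        rest = np.where(~train)[0]
        if len(rest) == 0:
            break
        preds = np.stack([clf.predict(X[rest])
                          for clf, X in zip(classifiers, views)])
        # A label is accepted only when every voter agrees on it.
        agree = (preds == preds[0]).all(axis=0)
        if agree.sum() < threshold:
            break  # too few confident items; the rest are discarded
        labels[rest[agree]] = preds[0][agree]
    return labels
```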
Evaluation Methodology. We evaluate each method taking into account the optimal number of clusters for each problem. We denote the set of embeddings by E = {Word2Vec, Tf-idf, SAFE}, the set of clustering algorithms by C = {K-means, Spectral Clustering, Agglomerative Clustering}, and the set of classification algorithms by Clf = {Random Forest, SVM, XGBoost}. We define $[X]^n$ as the set of all subsets of X of size n, with X ∈ {E, C, Clf}. We mention that baseline results are obtained without hyper-parameter tuning for either the clustering or the classification algorithms. As a general approach, the evaluation of a clustering algorithm takes into consideration all label permutations, and the one with the greatest F1 score is selected as the winner. Since problems may have algorithmic solutions with an imbalanced number of source code solutions, the chosen quality metric is F1-macro, because we want to treat all classes in the same way.
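The permutation-based scoring can be sketched as follows; the function name is an illustrative assumption.

```python
from itertools import permutations
from sklearn.metrics import f1_score

def best_f1_macro(y_true, y_pred, K):
    """Try every mapping of predicted cluster ids to true labels
    and keep the best macro-F1, as described in the text."""
    best = 0.0
    for perm in permutations(range(K)):
        remapped = [perm[c] for c in y_pred]
        best = max(best, f1_score(y_true, remapped, average="macro"))
    return best
```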
The evaluation of the method of building clusters with a single view takes into consideration all the combinations obtained by the Cartesian product $[E]^1 \times [C]^1$ and evaluates each combination by the approach previously presented.
Similarly, the method that uses multi-view spectral clustering will use as embeddings $[E]^2$ and $[E]^3$.
The method that uses clustering by voting is validated by determining the best results after co-training by employing classifiers and obtaining the best results after classification. These approaches use as setup the Cartesian products $[E]^2 \times [C]^2 \times [Clf]^2$ and $[E]^3 \times [C]^3 \times [Clf]^3$, respectively.
After that, the supervised validation method is used to evaluate the obtained models' capability to predict distinct algorithmic solutions. We use a train-test split frequently used for smaller datasets: 80% of the data is used for training and 20% for testing, on the Cartesian product $[E]^1 \times [Clf]^1$. Finally, we determine for each problem the best method of clustering solutions.
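A minimal sketch of this supervised validation step, with synthetic data standing in for the real code embeddings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one (embedding, classifier) pair; the real
# inputs would be the 128-dimensional code embeddings and labels.
X = np.random.default_rng(0).normal(size=(100, 128))
y = (X[:, 0] > 0).astype(int)

# 80/20 train-test split, scored with F1-macro as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
score = f1_score(y_te, clf.predict(X_te), average='macro')
```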
4 Experimental Results
For evaluating the performance of the proposed algorithms with the employed code embeddings, it is compulsory to have the ground truth. Considering that there is no labelled dataset of distinct solutions for competitive programming, we decided to label several problems manually. The AlgoSol-10 dataset consists of ten problems from Infoarena, and the criteria for selection are: 1) the problem must have at least two distinct algorithmic solutions; 2) the number of source code solutions for each class (i.e., algorithmic solution) should be large enough; 3) the classes should be as balanced as possible; 4) the algorithmic solutions should be as distinct as possible, such that we may better evaluate how the embeddings describe the algorithmic solutions across various implementations.
Table 1. Method with the highest F1-Macro score per problem. (UV - Unsupervised Voting, MVSC - Multi-View Spectral Clustering)
In terms of the size of the dataset, each problem has about 500 solutions written in the C/C++ programming languages. The folders containing the source code solutions are hierarchically organized, such that the root folder has the name of the problem. Each problem's folder has one sub-folder for each distinct algorithmic solution, containing all the source code solutions that belong to that particular algorithmic approach. The name of each source code solution file is the id of that solution on Infoarena.
The dataset was built by running a custom-developed scraper 2, which downloaded only correct source code solutions. The selected problems are: strmatch (ro., potrivirea sirurilor, en., string matching), ancestors (ro., stramosi), schi, scc (ro., componente tari conexe, en., strongly connected components), gcd (ro., cel mai mare divizor comun, en., greatest common divisor), cppp (ro., cele mai apropiate două puncte in plan, en., closest pair of points in the plane), eval (ro., evaluare, en., evaluation), invmod (ro., invers modular, en., modular inverse), mst (ro., arbore partial de cost minim, en., minimum cost spanning tree), and inv. Detailed problem statements and source code solutions are openly accessible on the Infoarena site.
From Table 1, it can be seen that UV (i.e., unsupervised voting) obtains the highest score for all problems except one. From this, we can conclude that initializing the dataset with the maximum-frequency method improves the results, but the improvement depends on the quality of the clustering and embedding methods. In other words, if all the above methods yield bad results, then the unsupervised voting method will not give perfect results either. The implementation of the proposed solution may be found in the AlgoDistinctSolutions 3 git repository, which contains the code for downloading the source code solutions, the labelled dataset, and the implementation itself.
5 Conclusions
Acknowledgements. This work was partially supported by the grant 135C /2021
“Development of software applications that integrate machine learning algorithms”,
financed by the University of Craiova.
2 InfoarenaScrappingTool, https://github.com/Arkin1/InfoarenaScrappingTool.
3 AlgoDistinctSolutions, https://github.com/Arkin1/AlgoDistinctSolutions.
References
1. Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: Learning distributed
representations of code. In: Proceedings of the ACM on Programming Languages,
vol. 3(POPL), pp. 1–29 (2019)
2. Arthur, M.P.: Automatic source code documentation using code summarization
technique of NLP. Procedia Comput. Sci. 171, 2522–2531 (2020)
3. Azcona, D., Arora, P., Hsiao, I.H., Smeaton, A.: user2code2vec: embeddings for
profiling students based on distributional representations of source code. In: Pro-
ceedings of the 9th International Conference on Learning Analytics & Knowledge,
pp. 86–95 (2019)
4. Barchi, F., Parisi, E., Urgese, G., Ficarra, E., Acquaviva, A.: Exploration of con-
volutional neural network models for source code classification. Eng. Appl. Artif.
Intell. 97, 104075 (2021)
5. Burrows, S., Uitdenbogerd, A.L., Turpin, A.: Application of information retrieval
techniques for source code authorship attribution. In: Zhou, X., Yokota, H., Deng,
K., Liu, Q. (eds.) DASFAA 2009. LNCS, vol. 5463, pp. 699–713. Springer, Heidel-
berg (2009). https://doi.org/10.1007/978-3-642-00887-0_61
6. Chen, Z., Monperrus, M.: A literature study of embeddings on source code. arXiv
preprint arXiv:1904.03061 (2019)
7. Chihada, A., Jalili, S., Hasheminejad, S.M.H., Zangooei, M.H.: Source code and
design conformance, design pattern detection from source code by classification
approach. Appl. Soft Comput. 26, 357–367 (2015)
8. Iacob, R.C.A., et al.: A large dataset for multi-label classification of algorithmic
challenges. Mathematics 8(11), 1995 (2020)
9. Jiang, L., Liu, H., Jiang, H.: Machine learning based recommendation of method
names: how far are we. In: 2019 34th IEEE/ACM International Conference on
Automated Software Engineering (ASE), pp. 602–614. IEEE (2019)
10. Jones, K.S.: A statistical interpretation of term specificity and its application in
retrieval. J. Documentation (1972)
11. Massarelli, L., Di Luna, G.A., Petroni, F., Baldoni, R., Querzoni, L.: SAFE: self-
attentive function embeddings for binary similarity. In: Perdisci, R., Maurice, C.,
Giacinto, G., Almgren, M. (eds.) DIMVA 2019. LNCS, vol. 11543, pp. 309–329.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22038-9_15
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Rabin, M.R.I., Mukherjee, A., Gnawali, O., Alipour, M.A.: Towards demystifying
dimensions of source code embeddings. In: Proceedings of the 1st ACM SIGSOFT
International Workshop on Representation Learning for Software Engineering and
Program Languages, pp. 29–38 (2020)
14. Kanaan-Izquierdo, S., Ziyatdinov, A., Perera-Lluna, A.: Multiview and multifeature spectral clustering using common eigenvectors. Pattern Recogn. Lett. 102, 30–36 (2018)
15. Shi, K., Lu, Y., Chang, J., Wei, Z.: Pathpair2vec: an ast path pair-based code
representation method for defect prediction. J. Comput. Lang. 59, 100979 (2020)