Coding Contest RP
$$\max_{\theta} \sum_{i=1}^{m} w^{(i)} \log \hat{y}^{(i)}_{y^{(i)}}$$

For rank prediction, since the goal is to predict within one rank of the actual rank, we trained a separate logistic regression model for each rank. Each training example of rank r is considered to be a positive example in the models for ranks within 1 rank of r.

$$\hat{y}^{(i)}_{k} = \frac{1}{1 + \exp\left(-\theta_{k}^{T} x^{(i)}\right)}$$

$$\max_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{\#\text{ranks}} w^{(i)} \left[ \mathbf{1}\{|j - y^{(i)}| \le 1\} \log \hat{y}^{(i)}_{j} + \mathbf{1}\{|j - y^{(i)}| > 1\} \log\left(1 - \hat{y}^{(i)}_{j}\right) \right]$$

$$\mu_{k} = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = k\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = k\}}$$

$$\Sigma = \frac{\sum_{i=1}^{m} w^{(i)} \left(x^{(i)} - \mu_{y^{(i)}}\right) \left(x^{(i)} - \mu_{y^{(i)}}\right)^{T}}{\sum_{i=1}^{m} w^{(i)}}$$

[Fig. 2 shows the feature pipeline feeding the network: processed tokens (e.g. "int", "!!VAR", ";", "{", "scanf", "!!STR", "&") and an AST traversal (e.g. "TranslationUnit", "VarDecl", "FunctionDecl", "CompoundStmt", "CallExpr", "endblock") are concatenated into a single sequence; bigrams with at least 1% frequency are extracted; the counts are normalized and scaled, yielding approximately 2k features.]

Fig. 2. Neural network architecture.
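As an illustration of the per-rank labeling scheme above, the binary targets for the separate logistic regression models can be built as follows. This is a minimal sketch, not the paper's code; it assumes ranks are encoded as integers 0-9.

```python
import numpy as np

def rank_targets(y, num_ranks=10):
    """Build the per-rank binary targets: example i is a positive
    example for the rank-j model whenever |j - y[i]| <= 1."""
    y = np.asarray(y)
    ranks = np.arange(num_ranks)                # shape (num_ranks,)
    # Broadcast to shape (m, num_ranks): one column per rank model.
    return (np.abs(ranks[None, :] - y[:, None]) <= 1).astype(float)

def sigmoid(z):
    """The logistic function used for each per-rank prediction."""
    return 1.0 / (1.0 + np.exp(-z))

# An example with actual rank 3 is positive (1.0) exactly for
# the rank 2, 3, and 4 models.
targets = rank_targets([3], num_ranks=10)
```

Training one logistic model per column of these targets, with the weighted objective above, gives the per-rank ensemble described in the text.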
VI. EXPERIMENTS

All experiments were conducted using 10-fold cross-validation. For each type of model, we trained 10 models, where each model is trained on 9 contests (~54k examples) and tested on 1 contest (~6k examples). The values reported here are averages over the 10 models. With this methodology, the models are tested on problems never seen in training. This ensures that the models are not learning specific features about the problems in the training set.

TABLE IV
ACCURACY FOR EACH MODEL (10-FOLD CROSS VALIDATION)

Model                 Rank±1 Train   Rank±1 Test   Country Train   Country Test
Random/constant           30.0%          30.0%          10.0%          10.0%
Linear regression         69.6%          60.1%            N/A            N/A
GDA                       75.7%          67.2%          75.0%          65.0%
Logistic regression       86.1%          71.6%          92.2%          68.4%
Neural network            94.4%          77.2%          97.0%          72.5%
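The by-contest folds described above amount to grouped cross-validation: all examples from one contest form the held-out fold. A minimal sketch, not the original experiment code; the toy contest IDs are hypothetical.

```python
import numpy as np

def contest_folds(contest_ids):
    """Yield (train_idx, test_idx) pairs where each fold holds out
    every example from exactly one contest, so the test problems
    are never seen in training."""
    contest_ids = np.asarray(contest_ids)
    for contest in np.unique(contest_ids):
        test = np.where(contest_ids == contest)[0]
        train = np.where(contest_ids != contest)[0]
        yield train, test

# With 10 contests this produces the 10 folds used in the paper.
folds = list(contest_folds([1, 1, 2, 2, 3]))   # toy labels: 3 folds
```

The same effect can be obtained with scikit-learn's `LeaveOneGroupOut`, using the contest ID as the group label.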
Due to the class imbalance described before, accuracy is defined as the weighted accuracy, where the weight w(i) of each example is the inverse of the class size in the test set. For rank, we allow the predicted rank to be within one rank of the actual rank. If y(i) is the actual label and ŷ(i) is the predicted label for example i:

$$\text{Accuracy (Country)} = \frac{\sum_{i=1}^{m} w^{(i)} \mathbf{1}\{y^{(i)} = \hat{y}^{(i)}\}}{\sum_{i=1}^{m} w^{(i)}}$$

$$\text{Accuracy (Rank}\pm 1\text{)} = \frac{\sum_{i=1}^{m} w^{(i)} \mathbf{1}\{|y^{(i)} - \hat{y}^{(i)}| \le 1\}}{\sum_{i=1}^{m} w^{(i)}}$$

The weighted accuracy shows how well the model can predict all classes and not just the majority. A model that strongly favors larger classes would achieve a high unweighted accuracy but a low weighted accuracy.

For the linear regression model, we also report the weighted root mean-squared error (RMSE) for the predicted rating:

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{m} w^{(i)} \left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}{\sum_{i=1}^{m} w^{(i)}}}$$

Scikit-learn [11] is used to train the linear regression and GDA models, while TensorFlow [12] is used to train the logistic regression and neural network models. Models were trained with the entire training set as a single batch. For logistic regression, we used gradient descent with a 0.1 learning rate, while for the neural network, we used the Adam algorithm [13] with a 0.0001 learning rate. These learning rates were experimentally found to converge. 50% dropout is used for the hidden layer, meaning that on every iteration, 50% of the hidden nodes are inactive. This helps prevent the network from overfitting and was found to increase the test accuracy.
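The weighted metrics defined above translate directly into code. A short sketch, not the authors' implementation, with w(i) taken as the inverse of each example's class size in the test set and integer-encoded labels assumed:

```python
import numpy as np
from collections import Counter

def inverse_class_weights(y):
    """w(i) = 1 / (size of example i's class in the test set)."""
    counts = Counter(y)
    return np.array([1.0 / counts[label] for label in y])

def weighted_accuracy(y, y_hat, tolerance=0):
    """tolerance=0 gives Accuracy (Country);
    tolerance=1 gives Accuracy (Rank±1)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    w = inverse_class_weights(list(y))
    hits = (np.abs(y - y_hat) <= tolerance).astype(float)
    return float(np.sum(w * hits) / np.sum(w))

def weighted_rmse(y, y_hat):
    """Weighted RMSE for the regression model's predicted rating."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    w = inverse_class_weights(list(y))
    return float(np.sqrt(np.sum(w * (y - y_hat) ** 2) / np.sum(w)))
```

Because the weights are inverse class sizes, each class contributes equally to the metric regardless of how many test examples it has.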
VII. RESULTS AND DISCUSSION

The accuracies obtained for each model are shown in Table IV. For reference, the accuracy of a model that outputs a random or constant output is shown in the first row. A model that outputs a constant or random rank, except for the highest and lowest rank, would achieve 30% accuracy because there are 3 ranks within 1 rank of the chosen rank. For country, however, we require that the model classify the exact country, and there are 10 countries in the data set.

Classification was found to work better than regression when predicting the rank. This may be because classification optimizes what we actually care about, which is predicting the correct rank, rather than the rating. The linear regression model had a weighted RMSE (as previously defined) of 545 when predicting a user's rating in the test set. Given that ranks have a rating range of ~200, this is a fairly large error.

GDA worked surprisingly well, achieving accuracies that are almost as high as logistic regression. While GDA assumes that p(x|y) is multivariate Gaussian, logistic regression does not make that assumption and is capable of modeling a large variety of other distributions. Since the accuracies are similar, this indicates that p(x|y) is Gaussian to some degree.

Out of all the algorithms, the neural network had the highest accuracies. The neural network was probably able to learn more complex relationships between the features than the other algorithms could. Perhaps some combination of several bigrams is highly indicative of rank or country. Interpretation of the neural network is out of scope of this project, however.

The high training accuracies, compared to the test accuracies, may indicate overfitting. In the neural network, dropout helped reduce overfitting (as described before), but no other regularization techniques were used. We briefly tried using principal component analysis (PCA) to reduce the number of features, and L2 regularization on the parameters, but these techniques decreased the test accuracy. More data helped reduce overfitting, as the accuracy values are about 5% higher than in initial tests performed with 5 contests instead of 10.

For each actual rank and country, the neural network test accuracies are shown in Figs. 3 and 4. The model seems to be able to predict all ranks with similar accuracy. For country, the model is able to predict the more common countries with higher accuracy despite the weighted loss function used. This may be because there is significantly more training data for the more common countries.

[Fig. 3: bar chart of test accuracy (0 to 1) for each rank: Legendary Grandmaster, International Grandmaster, Grandmaster, International Master, Master, Candidate Master, Expert, Specialist, Pupil, Newbie.]

Fig. 3. Neural network test accuracy for rank (±1) by actual rank.

[Fig. 4: bar chart of test accuracy (0 to 1) for each country: India, China, Russia, Bangladesh, Vietnam, Ukraine, Poland, Egypt, United States, Iran.]

Fig. 4. Neural network test accuracy for country by actual country.
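The per-class breakdowns plotted in Figs. 3 and 4 correspond to computing accuracy separately for each actual class. A sketch assuming integer-encoded labels, not the authors' code:

```python
import numpy as np

def per_class_accuracy(y, y_hat, tolerance=0):
    """Accuracy within each actual class, as in Figs. 3 and 4
    (tolerance=1 for the Rank±1 case, 0 for country)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    result = {}
    for label in np.unique(y):
        mask = y == label
        hits = np.abs(y[mask] - y_hat[mask]) <= tolerance
        result[int(label)] = float(np.mean(hits))
    return result
```

Within one class every example has the same weight, so the weighted and unweighted per-class accuracies coincide; the weighting only matters when classes are aggregated.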
VIII. INTERPRETATION OF THE GDA MODEL

While the GDA model did not achieve the highest accuracy, its simplicity makes it possible to interpret the learned model more easily. For this analysis, we randomly chose one of the models from the 10-fold cross validation. To determine the unigrams and bigrams that were the strongest indicators of high and low skill level, we compared the class means μk for the International Grandmaster and Pupil ranks and found the features where the difference between the class means was largest (positive) and smallest (negative). These features are shown in Tables V and VI. The features are ordered in decreasing strength from left to right and top to bottom.

TABLE V
STRONGEST INDICATORS OF HIGH SKILL LEVEL

ifdef    # ifdef    assert    endif    (
# endif    assert (    ( ...    |
FunctionTemplate    TemplateTypeParameter    |
__VA_ARGS__    FunctionTemplate    ClassTemplate
ifdef LOCAL    LOCAL endblock    ClassTemplate

TABLE VI
STRONGEST INDICATORS OF LOW SKILL LEVEL

cin >>    cin    >> !!VAR    >>
cout <<    cout
TranslationUnit    InclusionDirective
TranslationUnit std    ;
IfStmt    BinaryOperator    main    main (

From this analysis, we can see that both tokens, like cin >>, and AST nodes, like FunctionTemplate, are important to the model. Both unigrams and bigrams are also important, although they are often related.

High skilled competitors appear to use #ifdef significantly, perhaps to change the code's behavior at compile time by defining macros in the compiler flags. They also appear to use assertions and C++ function templates.

Low skilled programmers appear to use cin and cout for input. This makes sense since scanf and printf are faster input methods that are often preferred by experienced competitors.

It is interesting to see TranslationUnit as a strong indicator of low skill level. TranslationUnit is the root of the AST and appears exactly once per program, but since the count is normalized by the L2 norm of the count vector, its value will be higher in shorter programs. Thus, it appears that GDA has learned to associate shorter programs with lower skill levels, despite the L2 normalization intended to prevent this. It makes sense that a long program would likely indicate a hard problem and a highly skilled competitor.

Tables VII and VIII show the features with the highest class means for Chinese and American competitors respectively. It seems that Chinese competitors often use getchar to read single characters from standard input, and import C input libraries like cstdio. American competitors seem to often spell out std in their code (like std::cout << std::endl) instead of importing the entire namespace with using namespace std, and to use ld, a commonly used alias for long double.

TABLE VII
STRONGEST INDICATORS OF A CHINESE COMPETITOR

= getchar    getchar    getchar (    char !!VAR
; char    cstdio    cstdio >    < cstdio
cstring    cstring >    < cstring    > !!CHR
< !!CHR    {    scanf
UnexposedExpr    CharacterLiteral    >= !!CHR

TABLE VIII
STRONGEST INDICATORS OF AN AMERICAN COMPETITOR

std    << std    ld    > >    struct
( ::    < ld    std ::
, std    struct ;    template
os <<

IX. CONCLUSION AND FUTURE WORK

In this paper, we studied the application of machine learning techniques to predicting the rank and country of a Codeforces competitor from a single source code submission. The neural network model achieved the highest accuracies: 77.2% in predicting rank (within one rank) and 72.5% in predicting country. Despite not achieving the highest accuracy, the GDA model was easier to interpret, and we were able to find unigrams and bigrams that were the strongest indicators of certain skill levels and countries.

Future work may include testing RNN- or LSTM-based models, as discussed in Related Work. Acquiring more data may help reduce overfitting. Token processing could be improved, for example by replacing class and macro names with special tokens in addition to variable and function names. N-grams with N > 2 could be tested, as only unigrams and bigrams were considered here. More hidden units or layers could be added to the neural network. Interpretation of the logistic regression or neural network model could also be attempted.
REFERENCES
[1] M. Mirzayanov. Codeforces. [Online]. Available: http://codeforces.com/
[2] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey
of machine learning for big code and naturalness,” ACM Comput.
Surv., vol. 51, no. 4, pp. 81:1–81:37, Jul. 2018. [Online]. Available:
http://doi.acm.org/10.1145/3212695
[3] S. Burrows and S. M. Tahaghoghi, “Source code authorship attribution
using n-grams,” in Proceedings of the Twelfth Australasian Document
Computing Symposium, Melbourne, Australia, RMIT University. Cite-
seer, 2007, pp. 32–39.
[4] S. Ugurel, R. Krovetz, and C. L. Giles, “What’s the code?: automatic
classification of source code archives,” in Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data
mining. ACM, 2002, pp. 632–638.
[5] B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt,
“Source code authorship attribution using long short-term memory based
networks,” in European Symposium on Research in Computer Security.
Springer, 2017, pp. 65–82.
[6] Q. Le and T. Mikolov, “Distributed representations of sentences and
documents,” in International Conference on Machine Learning, 2014,
pp. 1188–1196.
[7] C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, M. Sahami, and
L. Guibas, “Learning program embeddings to propagate feedback on
student code,” in Proceedings of the 32nd International Conference
on International Conference on Machine Learning - Volume 37,
ser. ICML’15. JMLR.org, 2015, pp. 1093–1102. [Online]. Available:
http://dl.acm.org/citation.cfm?id=3045118.3045235
[8] V. J. Hellendoorn and P. Devanbu, “Are deep neural networks the best
choice for modeling source code?” in Proceedings of the 2017 11th
Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE
2017. New York, NY, USA: ACM, 2017, pp. 763–773. [Online].
Available: http://doi.acm.org/10.1145/3106237.3106290
[9] S. Behnel, M. Faassen, and I. Bicking, “lxml: XML and HTML with
Python,” 2005.
[10] G. Salton and M. J. McGill, “Introduction to modern information
retrieval,” 1986.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[12] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng,
“TensorFlow: Large-scale machine learning on heterogeneous systems,”
2015, software available from tensorflow.org. [Online]. Available:
https://www.tensorflow.org/
[13] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available:
http://arxiv.org/abs/1412.6980