DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
arXiv:2102.07492v2 [cs.CL] 16 Feb 2021

Abstract

Recent advances in self-supervised learning have dramatically improved the state of the art on a [...]

[...] and phrases (Sun et al., 2019), sampling masked words according to their frequencies (Lample & Conneau, 2019), replacing words with plausible alternatives (Clark et al., 2020) [...]

[...] and variables have been replaced by uninformative names, back to their original forms. Suggesting proper variable and function names is a difficult task that requires understanding what the program does. In the context of source code, deobfuscation is a more sensible, but also a more difficult, pre-training task than MLM. Indeed, we observe (c.f. Figure 1) that predicting the content of randomly masked tokens is usually quite simple, as it often boils down to making syntax-related predictions (e.g. predicting that what has been masked out is a parenthesis, a semicolon, etc.). These simple predictions provide little training signal to the model. In practice, MLM also masks out variable names, but if a given variable appears multiple times in a function, it is easy for the model to simply copy its name from one of the other occurrences. Our model does not have this issue, as all occurrences of a masked variable are replaced by the same VAR_i special token.
In this paper, we make the following contributions:

• We present DOBF, a new pre-training objective based on deobfuscation, and show its effectiveness on multiple programming languages.

• We show that DOBF significantly outperforms MLM (e.g. BERT) on multiple tasks such as code search, code summarization, and unsupervised code translation.

• We show that, by design, models pre-trained with DOBF have interesting applications and can be used to understand functions with uninformative identifier names. Besides, the model is able to successfully deobfuscate fully obfuscated source files.
In the next section, we discuss the related work. Then, we present our objective, and the downstream tasks we consider for fine-tuning. Finally, we present our results and the potential applications of our model.

2. Related work

Masked Language Modeling pre-training. Large pre-trained transformers such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019) led to significant improvements in the majority of natural language processing tasks. The quality of pre-training mainly comes from the MLM objective (i.e. the cloze task), which allows the model to make predictions by leveraging left and right contexts, unlike causal language modeling (CLM) where the model predictions are only conditioned on previous words. In MLM, the model takes as input a sentence and uniformly selects 15% of its tokens. Of the selected tokens, 80% are replaced by a special symbol [MASK], 10% are left unchanged, and the remaining 10% are replaced by random tokens from the vocabulary. The MLM objective consists in recovering the initial sentence given the corrupted one. Lample & Conneau (2019) noticed that the masked words are often easy to predict, and proposed to sample the 15% masked words according to their frequencies instead of uniformly. This way, rare words are sampled more often, making the pre-training task more difficult for the model, which results in a better learning signal and faster training. Sun et al. (2019) also noticed that recovering the tokens masked by MLM is too simple in some contexts (e.g. predicting the two tokens "Harry Potter" is much harder than predicting only "Harry" if you know the next word is "Potter"). To address this issue, they proposed to mask phrases and named entities instead of individual tokens. Joshi et al. (2020) and Song et al. (2019) made a similar observation and proposed to mask random spans of text. They showed that this simple modification improves the performance on many downstream NLP tasks.
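To make the selection and corruption scheme above concrete, here is a minimal Python sketch of BERT-style masking. The tokenizer, vocabulary size, mask token id, and the -100 "ignore" convention are illustrative assumptions, not the exact setup of the cited papers.

import random

def mlm_corrupt(token_ids, vocab_size, mask_id, select_prob=0.15):
    # Uniformly select 15% of the positions; of those, replace 80% with
    # [MASK], 10% with a random token, and leave 10% unchanged.
    inputs = list(token_ids)
    targets = [-100] * len(token_ids)   # -100: position not predicted (a common convention)
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        targets[i] = tok                # the loss is computed only on selected positions
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)
        # else: keep the original token unchanged
    return inputs, targets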
Alternative objectives. Other pre-training objectives have been proposed in addition to MLM. For instance, Devlin et al. (2018) also uses the next sentence prediction (NSP) objective, a binary classification task that consists in predicting whether two input sentences follow each other in the original corpus. The NSP objective was originally designed to improve the performance on downstream NLP tasks, but recent studies (Lample & Conneau, 2019; Liu et al., 2019) showed that training MLM on streams of sentences to leverage longer context, and removing the NSP objective, improves the quality of pre-training. To improve the sample efficiency of MLM (where only 15% of tokens are predicted), Electra (Clark et al., 2020) proposed to replace (and not mask) some tokens with plausible alternatives, and to train a network to detect the tokens that have been replaced. They showed that this new Replaced Token Detection (RTD) objective matches the performance of RoBERTa while using four times less computational resources. Dong et al. (2019) proposed a model that combines multiple pre-training tasks, including bidirectional, but also left-to-right and right-to-left language modeling objectives. Lewis et al. (2019) also proposed different pre-training objectives, e.g. to detect whether input sentences have been permuted, whether tokens have been deleted or inserted, etc.
Figure 1. Illustration of the MLM and DOBF objectives. Given an input function, the masked language modeling (MLM) task randomly samples tokens to mask out. With source code, a large fraction of these tokens are related to the language syntax (e.g. commas, parentheses, etc.), are trivial for the model to predict, and provide a poor training signal. Instead, we propose to obfuscate the code by masking the names of functions and variables, and to train the model to recover the original function by deobfuscating the code (DOBF). When a variable is masked out, we mask all occurrences of this variable with the same mask symbol (e.g. all occurrences of "visited" are replaced by "V0") to prevent the model from copying names. The DOBF objective is more difficult and provides a better learning signal.
Code Generation Pre-training. Recent studies showed that pre-training methods developed for natural language processing are also effective for programming languages. For instance, Feng et al. (2020) proposed CodeBERT, a RoBERTa-based model trained on source code using the MLM and RTD objectives. They showed that their model performs well on downstream code generation tasks and outperforms previous pre-training approaches. Kanade et al. (2020) applied MLM and the next sentence prediction objectives to pre-train models on Python code. More recently, Roziere et al. (2020) applied the unsupervised machine translation principles of Lample et al. (2018a;b) to monolingual source code from GitHub. They showed that the resulting model, TransCoder, was able to translate source code between Python, Java, and C++ in a fully unsupervised way. However, the two above studies build upon pre-training strategies developed in the context of natural language processing. In this paper, we propose to use a code-specific objective to better pre-train models designed to be fine-tuned on code generation tasks: code deobfuscation.

[...] David et al. (2020) used a transformer together with augmented representations obtained from static analysis to infer procedure names in stripped binary files. These models are already used to understand obfuscated and compiled source code. UnuglifyJS, which is based on JSNICE (Raychev et al., 2015) and available online, is especially famous in the Javascript community. However, none of these studies investigated the use of deobfuscation for model pre-training.
[...] tokens are composed of Python keywords or symbols related to syntax: ",", "[", "while", "=", "if", ")", "return". These symbols are easy to recover, and a model will quickly learn to predict them with perfect accuracy. This effect is accentuated by the verbosity of the language: for instance, we would see significantly more of these tokens in Java. Retrieving the obfuscated graph token is also relatively simple: the model only needs to retrieve the most relevant variable in the scope. More generally, retrieving an identifier name is often easy when given its full context, including its definition and usages. Overall, we suspect that the MLM objective is too simple for programming languages, and we introduce a new objective, DOBF, which encourages the model to learn a deeper understanding of code semantics.

3.2. Deobfuscation Objective

Instead of MLM, we propose a new pre-training objective, DOBF, that leverages the particular structure of programming languages. We obfuscate code snippets by replacing class, function and variable names with special tokens, and train a model to recover the original names. When an identifier is selected, all of its instances in the code are replaced by the same special token. This differs from MLM, where the name of a variable can appear multiple times while being masked only a single time. For instance, in Figure 1, DOBF will replace the two occurrences of node by the same symbol V5, while MLM will only mask one of these occurrences. As a result, the fraction of meaningful tokens masked by the objective is language independent: for more verbose languages (e.g. Java), the less informative syntax-related tokens will not be masked out by the DOBF objective.

Each identifier is replaced with probability p_obf ∈ [0, 1]. We ensure that the original input is modified: if no identifier is replaced, we draw a random one to obfuscate. When p_obf = 0, we always obfuscate exactly one random identifier in the input. When p_obf = 1, we obfuscate all the identifiers defined in the file. We ensure that the obfuscated code has the same behavior as the original. The second row in Figure 1 shows an example of obfuscated code with p_obf = 1, where we obfuscate a function bfs which implements a breadth-first search. The function append is not obfuscated as it is a standard Python function not defined in the file. The model is given the obfuscated code as input and has to restore the original name of each special token CLASS_i, FUNC_i and VAR_i. In other words, the model needs to output a dictionary mapping special tokens to their initial values.
Finding informative names for obfuscated identifiers requires the model to learn a deep understanding of code semantics, which is desirable for a pre-training task. MLM will mask only some of the occurrences of the identifiers and leave the other ones unchanged, so that the model can simply copy identifier names. In Figure 1, with MLM masking, the model can simply notice that a variable named queue is called on the fourth line. Since the variable is not defined, the model can easily guess that queue has to be defined on the third line, and infer the value of the corresponding [MASK] token. With the deobfuscation objective, the model needs to analyze code patterns and understand the semantics of the variable to infer that, since its elements are popped with .pop(0), the variable V3 implements a queue. If its elements were popped with .pop(), our model would name it stack instead of queue (c.f. Figure 8 in the appendix).

3.3. Implementation

Overall, the deobfuscation objective operates like a supervised machine translation objective, where a seq2seq model is trained to map an obfuscated code snippet to a dictionary represented as a sequence of tokens. At inference time, the model is able to suggest meaningful class, function and variable names for a piece of code with an arbitrary number of obfuscated identifiers. Obfuscated classes, functions, and variables are replaced with associated special tokens: CLASS_0 ... CLASS_N, FUNC_0 ... FUNC_N and VAR_0 ... VAR_N. We serialize the output dictionary as a sequence of tokens where the entries are separated by a delimiter symbol "|". In the obfuscated example given in Figure 1, the model is trained to generate the sequence: FUNC_0 bfs | VAR_0 graph | VAR_1 root | VAR_2 visited | VAR_3 queue | VAR_4 neighbor | VAR_5 node.
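As an illustration of the obfuscation and serialization steps described in Sections 3.2 and 3.3, here is a minimal sketch. It collapses all identifier kinds to VAR_i, relies on whole-word regex replacement, and assumes the list of defined identifiers is provided (in practice this would come from a tokenizer or parser); it is not the actual DOBF implementation.

import random
import re

def obfuscate(code, identifiers, p_obf=0.5):
    # `identifiers` lists the names defined in the snippet. Each one is
    # obfuscated with probability p_obf; if none is selected, one is drawn
    # at random so that the input is always modified.
    selected = [name for name in identifiers if random.random() < p_obf]
    if not selected:
        selected = [random.choice(identifiers)]
    mapping = {}
    for i, name in enumerate(selected):
        special = f"VAR_{i}"          # the real objective also uses FUNC_i / CLASS_i
        mapping[special] = name
        # replace every occurrence of the identifier by the same special token
        code = re.sub(rf"\b{re.escape(name)}\b", special, code)
    # serialize the dictionary, entries separated by the "|" delimiter
    target = " | ".join(f"{key} {name}" for key, name in mapping.items())
    return code, target

With p_obf = 1 on the bfs example of Figure 1, this would produce a target sequence similar to the one shown above.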
4. Experiments

We train DOBF with the deobfuscation objective. First, we evaluate our model on two straightforward deobfuscation applications. Then, we show its performance on multiple downstream tasks.

4.1. Deobfuscation

We evaluate our model on two applications of the deobfuscation task: when p_obf = 0 (the model has to retrieve a single identifier name), and p_obf = 1 (the model has to retrieve all the identifier names).

Deobfuscating a single identifier. When p_obf = 0, only one identifier is obfuscated. In that case, the model has to propose a relevant name for a single identifier using the rest of the non-obfuscated file as context. It can be applied as a tool that suggests relevant variable names. Integrated development environments (e.g. PyCharm or IntelliJ) already perform this task, often using simple handcrafted rules.
Deobfuscating all identifiers. Obfuscators are commonly used to make code smaller and more efficient, or to protect it by making it more difficult to understand and reuse. They typically apply several transformations, one of them being to replace every identifier name with short and uninformative names (e.g. a, b, c). In our work, such a transformation corresponds to obfuscating a file with p_obf = 1. To measure our model's ability to revert the obfuscation operation, we evaluate its accuracy when obfuscating all identifier names. Another application would be to help understand source code written with uninformative variable names.

Evaluation metric. We evaluate the ability of our model to retrieve identifier names from the original non-obfuscated code. We report the accuracy, which is the percentage of recovered tokens that exactly match the ground truth. Following previous works (Allamanis et al., 2015; 2016; Alon et al., 2018; 2019), we also report the subtoken score, a more flexible metric which computes the precision, recall, and F1 scores for retrieving the original case-insensitive subtokens. Each token is broken into subtokens using uppercase letters for camelCase and underscores for snake_case. For instance, decoderAttention would be considered a perfect match for decoder_attention or attentionDecoder. attention would have a perfect precision but a recall of 0.5, so an F1 score of 66.7. crossAttentionDecoder would have a perfect recall but a precision of 2/3, corresponding to an F1 score of 80.0. We compute the overall subtoken precision, recall and F1 scores averaged over each recovered token.
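A minimal sketch of the subtoken scores described above; the exact splitting rules used in the paper may differ in corner cases.

import re

def subtokens(name):
    # split on underscores and on lowercase/digit-to-uppercase boundaries, case-insensitively
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

def subtoken_scores(recovered, ground_truth):
    rec, ref = set(subtokens(recovered)), set(subtokens(ground_truth))
    common = len(rec & ref)
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(rec)
    recall = common / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Examples from the text:
# subtoken_scores("decoderAttention", "decoder_attention")      -> (1.0, 1.0, 1.0)
# subtoken_scores("attention", "decoder_attention")             -> (1.0, 0.5, 0.667)
# subtoken_scores("crossAttentionDecoder", "decoder_attention") -> (0.667, 1.0, 0.8)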
4.2. Fine-tuning on downstream tasks

In order to evaluate DOBF as a pre-training model, we fine-tune DOBF on TransCoder and on three tasks from CodeXGLUE (Cod, 2020), a benchmark for programming languages. We only consider the Java and Python tasks with an encoder in the model architecture for which the training, validation, and test sets are publicly available.

CodeXGLUE Clone Detection. This task is a binary classification problem where the model has to predict whether two code snippets are semantically equivalent. It is evaluated using the F1 score. The model is composed of a single encoder and a classification layer. An input consists of two snippets of code, which are concatenated before being fed to the model. This task is available in Java.

CodeXGLUE Code Summarization. Given a code snippet, the model is trained to generate the corresponding documentation in natural language. The architecture is a sequence-to-sequence transformer model evaluated using the BLEU score (Papineni et al., 2002). The dataset includes both Java and Python source code.

CodeXGLUE NL Code Search. Given a code search query in natural language, the model has to retrieve the most semantically related code within a collection of code snippets. This is a ranking problem evaluated using the Mean Reciprocal Rank (MRR) metric. The model is composed of two encoders. The natural language query and the code are encoded separately, and we compute the dot product between the first hidden states of the encoders' last layers. This task is available in Python.
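For illustration, a minimal PyTorch-style sketch of the two-encoder scoring described above; the encoder modules and their output shapes are assumptions, not the actual fine-tuning code.

import torch.nn as nn

class CodeSearchScorer(nn.Module):
    def __init__(self, query_encoder: nn.Module, code_encoder: nn.Module):
        super().__init__()
        self.query_encoder = query_encoder   # e.g. a transformer encoder initialized from DOBF
        self.code_encoder = code_encoder

    def forward(self, query_ids, code_ids):
        # Assumes each encoder returns last-layer hidden states of shape (batch, seq_len, dim).
        q = self.query_encoder(query_ids)[:, 0, :]   # first hidden state of the last layer
        c = self.code_encoder(code_ids)[:, 0, :]
        # Dot product between the two representations gives the relevance score.
        return (q * c).sum(dim=-1)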
TransCoder. TransCoder (Roziere et al., 2020) is an unsupervised machine translation model which translates functions and methods between C++, Java, and Python. A single seq2seq model is trained for all languages. In the original work, TransCoder is pre-trained with MLM and trained with denoising auto-encoding and back-translation. TransCoder is evaluated using the Computational Accuracy metric, which computes the percentage of correct solutions according to a series of unit tests. We only consider a single model output (CA@1), with beam sizes of 1 and 10.
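A rough sketch of how such a metric can be computed; the translate and run_unit_tests helpers are hypothetical placeholders, and the actual TransCoder evaluation harness differs.

def computational_accuracy(examples, translate, run_unit_tests, beam_size=10):
    # CA@1: a translation counts as correct if the single highest-scoring
    # hypothesis of the beam passes all unit tests of the problem.
    correct = 0
    for source_function, unit_tests in examples:
        hypotheses = translate(source_function, beam_size=beam_size)  # ranked by model score
        if run_unit_tests(hypotheses[0], unit_tests):
            correct += 1
    return 100.0 * correct / len(examples)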
4.3. Experimental details

Model Architecture. For DOBF, we consider a seq2seq model with attention, composed of an encoder and a decoder using a transformer architecture (Vaswani et al., 2017). We train two models with different sizes in order to provide fair comparisons to our baselines (CodeBERT and TransCoder): one model with 12 layers, 12 attention heads, and a hidden dimensionality of 768, and one model with 6 layers, 8 attention heads, and a hidden dimensionality of 1024.

Training dataset. As in Roziere et al. (2020), we use the GitHub public dataset available on Google BigQuery and select all Python and Java files within the available projects. Following Lopes et al. (2017) and Allamanis (2019), we remove duplicate files. We also ensure that each fork belongs to the same split as its source repository. We obfuscate each file and create the corresponding dictionary of masked identifier names, resulting in a parallel (obfuscated file, dictionary) dataset of 19 GB for Python and 26 GB for Java. We show some statistics about this dataset in Table 1. We use the same tokenizers as Roziere et al. (2020). For comparison purposes, we apply either the BPE codes used by Roziere et al. (2020) or by Feng et al. (2020). In practice, we train only on files containing less than 2000 tokens, which corresponds to more than 90% and 80% of the Java and Python files respectively.
Table 1. Dataset statistics.

                                 Java      Python
All - Size                       26 GB     19 GB
All - Nb files                   7.9M      3.6M
Av. nb of tokens / file          718       1245
Av. nb of identifiers / file     25.9      41.8
Training details. We train DOBF to translate obfuscated files into lists of identifier names. During DOBF training, we alternate between batches of Java and Python composed of 3000 tokens per GPU. We optimize DOBF with the Adam optimizer (Kingma & Ba, 2014) and an inverse square-root learning rate scheduler (Vaswani et al., 2017). We implement our models in PyTorch (Paszke et al., 2019) and train them on 32 V100 GPUs. We use float16 operations to speed up training and to reduce the memory usage of our models. We try different initialization schemes: training from scratch, and initializing from a Python-Java MLM model following Roziere et al. (2020). We train DOBF with three different obfuscation probability parameters: p_obf ∈ {0, 0.5, 1}. For each p_obf value, we train models with multiple initial learning rates ranging from 10^-4 to 3·10^-4 and select the best one using the average subtoken F1 score computed on the validation dataset.
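For reference, a common form of the inverse square-root schedule of Vaswani et al. (2017) is sketched below; the warmup length is an assumption, as it is not specified here.

def inverse_sqrt_lr(step, base_lr=1e-4, warmup_steps=4000):
    # Linear warmup followed by decay proportional to 1/sqrt(step),
    # normalized so that the learning rate equals base_lr at the end of warmup.
    step = max(step, 1)
    scale = min(step ** -0.5, step * warmup_steps ** -1.5) * warmup_steps ** 0.5
    return base_lr * scale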
Fine-tuning details. Depending on the fine-tuning task, we consider different model architectures: seq2seq models with an encoder and a decoder, architectures with two encoders, or a single encoder. In all cases, we initialize the encoders of these models with the encoder of DOBF and fine-tune all parameters. For fair comparison, we rerun all baselines and train models with the same architectures, number of GPUs, batch sizes, and optimizers as in the original papers. For CodeXGLUE, we noticed that the tasks are quite sensitive to the learning rate used during fine-tuning. We perform a grid search on five learning rates ranging from 5·10^-6 to 10^-4 and select the best value on the validation dataset. For TransCoder, we use a learning rate of 10^-4 as in Roziere et al. (2020).

5. Results

5.1. Deobfuscation

In Table 2, we evaluate the ability of our model to recover identifier names, either when only one identifier is obfuscated (p_obf = 0) or when all identifiers are obfuscated (p_obf = 1), for models trained with p_obf ∈ {0, 0.5, 1}. Even when evaluating with p_obf = 0, training with p_obf = 0 is less efficient than p_obf = 0.5, since the model is only trained to generate a single variable for each input sequence. Training with p_obf = 0.5 is a more difficult task that requires the model to learn and understand more about code semantics. Forcing the model to understand the structure of the code may be useful even when testing with p_obf = 0, as some identifier names cannot be guessed only from the names of other identifiers. When DOBF has to recover a fully obfuscated function, it obtains the best accuracy when trained with p_obf = 1. It manages to recover 45.6% of the initial identifier names. We also observe that, for every configuration, pre-training DOBF with MLM improves the performance.

Obfuscated input:

def FUNC_0(VAR_0, VAR_1):
    VAR_2 = [VAR_1]
    VAR_3 = [VAR_1]
    while VAR_3:
        VAR_4 = VAR_3.pop(0)
        for VAR_5 in VAR_0[VAR_4]:
            if (VAR_5 not in VAR_2):
                VAR_2.add(VAR_5)
                VAR_3.append(VAR_5)
    return VAR_2

Code recovered by DOBF:

def bfs(graph, start):
    visited = [start]
    queue = [start]
    while queue:
        node = queue.pop(0)
        for neighbor in graph[node]:
            if (neighbor not in visited):
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

Figure 2. Full deobfuscation of a breadth-first-search function by DOBF. The code on top has been fully obfuscated. The code on the bottom was recovered using DOBF by replacing the function name and every variable name using the generated dictionary. DOBF is able to suggest relevant function and variable names. It makes the code much more readable and easier to understand.

Figure 2 shows an example of a fully obfuscated function recovered by our model. DOBF successfully manages to understand the purpose of the function and to predict appropriate variable names. Figure 3 shows examples of function name proposals by DOBF for functions implementing matrix operations in Python. We observe that DOBF manages to identify the key tokens and to properly infer the purpose of similar but very different functions. Figures 5, 6, and 7 in the appendix show additional examples of function name proposals by DOBF in Java and Python. Figure 8 shows additional examples demonstrating that DOBF also leverages non-obfuscated identifier names to understand the meaning of input functions. Figures 9 and 10 in the appendix show examples of deobfuscation of fully obfuscated Python code snippets using DOBF. It is able to understand the semantics and purposes of a variety of obfuscated classes and functions, including an LSTM cell.
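To recover readable code from a generated dictionary, as in Figure 2, the output sequence can be parsed and applied back to the obfuscated snippet. A naive sketch based on whole-token string replacement (the actual pipeline presumably works on the tokenized code):

import re

def deobfuscate(obfuscated_code, generated_sequence):
    # generated_sequence looks like "FUNC_0 bfs | VAR_0 graph | VAR_1 root | ..."
    mapping = dict(entry.strip().split(" ", 1) for entry in generated_sequence.split("|"))
    for special, name in mapping.items():
        obfuscated_code = re.sub(rf"\b{special}\b", name, obfuscated_code)
    return obfuscated_code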
5.2. Downstream tasks

For fine-tuning, we considered models pre-trained with p_obf = 0.5 and p_obf = 1. Since they gave very similar results on downstream tasks, we only use models pre-trained with p_obf = 0.5 in the rest of the paper. As baselines, we consider a randomly initialized model and a model pre-trained with MLM only. For CodeXGLUE tasks, we also consider CodeBERT as a baseline. We compare results for DOBF trained from scratch and DOBF initialized with MLM (MLM+DOBF), and report results in Table 3.
Figure 3. Additional examples of function name proposals for matrix operations in Python. DOBF is able to find the right name for
each matrix operation, showing that it learned to attend to the most important parts of the code. Even when the function only differs by
one token (e.g. a subtraction instead of an addition operator), DOBF successfully and confidently (c.f. scores) understands the semantics
of the function and its purpose.
Table 3. Results on downstream tasks for different pre-training configurations. Models pre-trained with MLM and DOBF signif-
icantly outperform both CodeBERT and models trained with MLM only. MLM+DOBF outperforms CodeBERT by 7% on natural
language code search (NLCS), and MLM by 6% in Java → Python computational accuracy. It also beats CodeBERT on every task except
Clone Detection, on which CodeBERT scores much higher than our MLM. The tasks where MLM provides large improvements over the
transformer baseline (first row, no pre-training) are also the tasks where DOBF provides the largest gains (e.g. clone detection, natural
language code search, and unsupervised translation).
              Clone Det    Code Sum Java   Code Sum Python   NLCS     Python→Java (CA@1)   Java→Python (CA@1)
              (F1 score)   (BLEU)          (BLEU)            (MRR)    k=1       k=10       k=1       k=10

Transformer   88.14        16.58           16.43             0.025    37.6      38.9       31.8      42.1
CodeBERT      96.50        18.25           18.22             0.315    -         -          -         -
MLM           91.89        18.59           17.95             0.308    40.3      42.2       44.7      46.6
DOBF          96.52        18.19           17.51             0.272    38.9      45.7       44.7      46.4
MLM+DOBF      95.87        19.05           18.24             0.383    43.5      44.9       49.2      52.5
[...] tasks, outperforming CodeBERT and MLM. The deobfuscation objective is already effective as a pre-training task: it leads to results comparable to MLM on most tasks and is much more effective on clone detection. The MLM+DOBF model outperforms MLM on all downstream tasks, the major improvement being for NL code search, which is also the task that benefited the most from MLM pretraining. For TransCoder, MLM+DOBF increases the computational accuracy of the MLM model by 2.7% when translating from Python to Java, and by 5.9% when translating from Java to Python with beam size 10. In Figure 4, we can see that [...]

Figure 4. TransCoder results for different pre-training schemes. Pre-training our model with MLM+DOBF instead of MLM only allows it to quickly reach higher levels of computational accuracy when fine-tuning for Java → Python translation. The gap between MLM+DOBF and MLM persists until convergence.

6. Conclusion

In this paper, we introduce a new deobfuscation objective and show that it can be used for three purposes: recovering fully obfuscated code, suggesting relevant identifier names, and pre-training transformer models for programming language related tasks. Although it does not require any parallel corpus of source code aligned with natural language, DOBF outperforms CodeBERT and MLM pre-training on multiple downstream tasks, including clone detection, code summarization, natural language code search, and unsupervised code translation. These results show that DOBF leverages the particular structure of source code to add noise to the input sequence in a particularly effective way. Other noise functions or surrogate objectives adapted to source code may improve the performance further, for instance training the model to find the type of given variables or the signature of a method, or to repair a piece of code that has been corrupted.

Since models pretrained on source code benefit from structured noise, it would be interesting to see whether these findings can be applied to natural languages as well. Although ambiguous, natural languages also have an underlying structure. Leveraging the constituency or dependency parse trees of sentences (as opposed to abstract syntax trees in programming languages) may help design better pre-training objectives for natural languages.
References

Codexglue: An open challenge for code intelligence. arXiv, 2020.

Allamanis, M. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.

Allamanis, M., Barr, E. T., Bird, C., and Sutton, C. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 281–293, 2014.

Allamanis, M., Barr, E. T., Bird, C., and Sutton, C. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pp. 38–49, 2015.

Allamanis, M., Peng, H., and Sutton, C. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pp. 2091–2100, 2016.

Alon, U., Zilberstein, M., Levy, O., and Yahav, E. A general path-based representation for predicting program properties. ACM SIGPLAN Notices, 53(4):404–419, 2018.

Alon, U., Zilberstein, M., Levy, O., and Yahav, E. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.

Bavishi, R., Pradel, M., and Sen, K. Context2name: A deep learning-based approach to infer natural variable names from usage contexts. arXiv preprint arXiv:1809.05193, 2018.

Bichsel, B., Raychev, V., Tsankov, P., and Vechev, M. Statistical deobfuscation of android applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 343–355, 2016.

Butler, S., Wermelinger, M., Yu, Y., and Sharp, H. Relating identifier naming flaws and code quality: An empirical study. In 2009 16th Working Conference on Reverse Engineering, pp. 31–35. IEEE, 2009.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

David, Y., Alon, U., and Yahav, E. Neural reverse engineering of stripped binaries using augmented control flow graphs. Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–28, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.

Fu, C., Chen, H., Liu, H., Chen, X., Tian, Y., Koushanfar, F., and Zhao, J. Coda: An end-to-end neural program decompiler. In Advances in Neural Information Processing Systems, pp. 3703–3714, 2019.

Gellenbeck, E. M. and Cook, C. R. An investigation of procedure and variable names as beacons during program comprehension. In Empirical studies of programmers: Fourth workshop, pp. 65–81. Ablex Publishing, Norwood, NJ, 1991.

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.

Kanade, A., Maniatis, P., Balakrishnan, G., and Shi, K. Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning, pp. 5110–5121. PMLR, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lacomis, J., Yin, P., Schwartz, E., Allamanis, M., Le Goues, C., Neubig, G., and Vasilescu, B. Dire: A neural approach to decompiled identifier naming. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 628–639. IEEE, 2019.

Lample, G. and Conneau, A. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.

Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. ICLR, 2018a.

Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. Phrase-based & neural unsupervised machine translation. In EMNLP, 2018b.
Lawrie, D., Morrell, C., Feild, H., and Binkley, D. What's in a name? a study of identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC'06), pp. 3–12. IEEE, 2006.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

Liblit, B., Begel, A., and Sweetser, E. Cognitive perspectives on the role of naming in computer programs. In PPIG, pp. 11, 2006.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Lopes, C. V., Maj, P., Martins, P., Saini, V., Yang, D., Zitny, J., Sajnani, H., and Vitek, J. Déjàvu: a map of code duplicates on github. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–28, 2017.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037, 2019.

Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. Mass: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pp. 5926–5936, 2019.

Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., and Wu, H. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.

[...] comprehensibility: an experimental investigation. J. Prog. Lang., 4(3):143–167, 1996.

Vasilescu, B., Casalnuovo, C., and Devanbu, P. Recovering clear, natural identifiers from obfuscated js names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 683–693, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763, 2019.
Figure 5. Examples of name proposal in Java. DOBF is able to suggest relevant function names for a variety of Java methods and demonstrates its ability to understand the semantics of the code. In the first two examples, the first element in the beam shows that it is able to select relevant names in the context to find a function name: it uses Files.delete and Files.createDirectories to suggest the tokens deleteFile and createDir. DOBF also finds relevant names for Java methods without copying any part of the other tokens, for example for the third method, which combines two lists as the Python zip function does, for the fourth method, which computes the n-th element of the Fibonacci series, and for the last method, which computes the dot product of two vectors.
[...]

def FUNC_0 (l):
    return list(set(l))

Proposals: unique 24.8%, remove_duplicates 23.8%, removeDuplicates 18.8%, uniquify 18.7%, unique_items 13.8%

def FUNC_0 (path):
    with gzip.open(path, 'rb') as f:
        content = f.read()
    return content

Proposals: read_gzip_file 22.9%, read_gzip 22.1%, ungzip 20.8%, gzip_content 18.2%, gzip_read 16.0%

def f(n):
    VAR_0 = [True for i in range(n + 1)]
    p = 2
    while (p * p <= n):
        if (VAR_0[p] == True):
            for i in range(p * 2, n + 1, p):
                VAR_0[i] = False
        p += 1
    VAR_0[0] = False
    VAR_0[1] = False
    return [p for p in range(n + 1) if VAR_0[p]]

Proposals for VAR_0: prime 30.6%, l 20.5%, isPrime 18.0%, a 16.4%, primes 14.6%

Figure 6. Examples of name proposal in Python. Our model trained with DOBF goes well beyond copying tokens from the context. For instance, in the first example, it understands that this function is used to get environment variables. In the second example, it proposes names related to what the function actually does (removing duplicates in a list) instead of the individual operations it uses (converting to a set and then to a list). The last two rows show proposals for two different identifiers in a function computing the list of prime numbers below n using the sieve of Eratosthenes. The proposals for the function name are all relevant, and the third one names exactly the algorithm that is used. The variable v is a list of booleans: at the end of the algorithm, v[i] is true if and only if i is prime. The proposed names prime and isPrime are very relevant as they describe what the list contains. Although l and a are not very informative, they indicate that the variable is a list or an array.
def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a * b for a, b in zip(v1, v2)]

Proposals: multiply_lists 28.7%, multiply_list 23.5%, multiply 18.1%, multiply_vectors 14.9%, mul 14.8%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return sum([a * b for a, b in zip(v1, v2)])

Proposals: dotproduct 34.8%, dot_product 19.2%, dotProduct 18.1%, dot 15.6%, multiply_by_addition 12.3%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a ^ b for a, b in zip(v1, v2)]

Proposals: xor 62.9%, XOR 12.8%, vector_xor 10.8%, xors 7.4%, xor_lists 6.1%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a ** b for a, b in zip(v1, v2)]

Proposals: power 29.8%, list_power 20.9%, lcm 19.9%, power_list 15.1%, powersum 14.3%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a + b for a, b in zip(v1, v2)]

Proposals: add_lists 27.0%, add 22.9%, sum_lists 17.9%, list_concat 17.7%, list_add 14.5%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a - b for a, b in zip(v1, v2)]

Proposals: minus 30.4%, subtract 29.8%, difference 14.1%, subtract_lists 13.3%, substract 12.4%

Figure 7. Examples of function name proposal in Python using DOBF. DOBF is able to identify the key tokens in each function, to properly infer its purpose, and to suggest appropriate names along with a confidence score. In particular, even though the first two code snippets are very similar in terms of edit distance, they implement very different functions, and DOBF is able to name them appropriately.
Function 1: FUNC_0 bfs | VAR_0 queue
Function 2: FUNC_0 dfs | VAR_0 stack
Function 3: FUNC_0 bfs

Figure 8. Deobfuscation of graph traversal functions. These three functions perform graph traversals. The only difference between the first and the second function is that the first uses a queue to select the next element (.pop(0)) while the second uses a stack (.pop()). The first function implements a breadth-first search (bfs) in the graph and the second implements a depth-first search (dfs). DOBF is able to find the right function and variable names in each case. In the last function, we replaced the anonymized VAR_0 variable with queue in the implementation of depth-first search. This erroneous information leads DOBF to believe that this function performs breadth-first search. It shows that, just like human programmers, DOBF uses the names of the other variables to understand programs and choose relevant identifier names. When working on code with misleading identifier names, it is often preferable to obfuscate several identifiers.
Obfuscated input:

def __init__(VAR_0, VAR_1, VAR_2, VAR_3):
    super(CLASS_0, VAR_0).__init__()
    VAR_0.VAR_1 = VAR_1
    VAR_0.VAR_2 = VAR_2
    VAR_0.VAR_4 = nn.Linear(VAR_1, (4 * VAR_2), bias=VAR_3)
    VAR_0.VAR_5 = nn.Linear(VAR_2, (4 * VAR_2), bias=VAR_3)
    VAR_0.FUNC_0()

Deobfuscated output:

def __init__(self, input_size, hidden_size, bias):
    super(LSTM, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.h1 = nn.Linear(input_size, (4 * hidden_size), bias=bias)
    self.h2 = nn.Linear(hidden_size, (4 * hidden_size), bias=bias)
    self.init_weights()

Figure 9. Deobfuscation of an LSTM cell. DOBF is able to recover several of the original tokens, including the class name (LSTM) and the full signature of the __init__ method. Even though DOBF does not always recover the original token, it generally proposes very relevant tokens, which improves code readability. In particular, for some tokens the accuracy and subtoken scores would be zero even though the recovered tokens are still very relevant. For instance, reset_parameters (FUNC_0) was renamed to init_weights, std (VAR_7) was renamed to stdv, and hidden (VAR_13) was renamed to prev_state. In those instances, the original and recovered tokens share no subtoken despite having very similar semantics.
def FUNC_0(VAR_0, VAR_1):
    return sum(map(operator.mul, VAR_0, VAR_1))

Generated dictionary: FUNC_0 dotProduct | VAR_0 list1 | VAR_1 list2

def FUNC_0(VAR_0, VAR_1):
    return list(collections.deque(VAR_0, maxlen=VAR_1))

Generated dictionary: FUNC_0 tail | VAR_0 s | VAR_1 n

def FUNC_0(VAR_0):
    return sum((VAR_1 for VAR_1 in VAR_0 if ((VAR_1 % 2) == 0)))

Generated dictionary: FUNC_0 even_sum | VAR_0 nums | VAR_1 n

Figure 10. Examples of full deobfuscation of Python functions. Even when every identifier is obfuscated, DOBF is able to propose relevant names. The proposed function name is informative and relevant in all examples since the first function computes a dot product, the second downloads an HTML page and returns its content, the third evaluates whether the input contains only unique elements, the fourth computes the tail of an iterable, and the fifth computes the sum of the even elements of an iterable.