DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
arXiv:2102.07492v2 [cs.CL] 16 Feb 2021

Abstract

Recent advances in self-supervised learning have dramatically improved the state of the art on a [...]

[...] and phrases (Sun et al., 2019), sampling masked words according to their frequencies (Lample & Conneau, 2019), replacing words with plausible alternatives (Clark et al., 2020) [...]

[...] and variables have been replaced by uninformative names, back to their original forms. Suggesting proper variable and function names is a difficult task that requires understanding what the program does. In the context of source code, deobfuscation is a more sensible, but also a more difficult, pre-training task than MLM. Indeed, we observe (c.f. Figure 1) that predicting the content of randomly masked tokens is usually quite simple, as it often boils down to making syntax-related predictions (e.g. predicting that what has been masked out is a parenthesis, a semicolon, etc.). These simple predictions provide little training signal to the model. In practice, MLM also masks out variable names, but if a given variable appears multiple times in a function, it is easy for the model to simply copy its name from one of the other occurrences. Our model does not have this issue, as all occurrences of a masked variable are replaced by the same VAR_i special token.
In this paper, we make the following contributions:

• We present DOBF, a new pre-training objective based on deobfuscation, and show its effectiveness on multiple programming languages.

• We show that DOBF significantly outperforms MLM (e.g. BERT) on multiple tasks such as code search, code summarization, and unsupervised code translation.

• We show that, by design, models pre-trained with DOBF have interesting applications and can be used to understand functions with uninformative identifier names. Besides, the model is able to successfully deobfuscate fully obfuscated source files.
In the next section, we discuss the related work. Then, we present our objective, and the downstream tasks we consider for fine-tuning. Finally, we present our results and the potential applications of our model.

2. Related work

Masked Language Modeling pre-training. Large pre-trained transformers such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019) led to significant improvements in the majority of natural language processing tasks. The quality of pre-training mainly comes from the MLM objective (i.e. the cloze task), which allows the model to make predictions by leveraging left and right contexts, unlike causal language modeling (CLM) where the model predictions are only conditioned on previous words. In MLM, the model takes as input a sentence and uniformly selects 15% of its tokens. Of the selected tokens, 80% are replaced by a special symbol [MASK], 10% are left unchanged, and the remaining 10% are replaced by random tokens from the vocabulary. The MLM objective consists in recovering the initial sentence given the corrupted one. Lample & Conneau (2019) noticed that the masked words are often easy to predict, and proposed to sample the 15% masked words according to their frequencies instead of uniformly. This way, rare words are sampled more often, making the pre-training task more difficult for the model, which results in a better learning signal and faster training. Sun et al. (2019) also noticed that recovering the tokens masked by MLM is too simple in some contexts (e.g. predicting the two tokens "Harry Potter" is much harder than predicting only "Harry" if you know the next word is "Potter"). To address this issue, they proposed to mask phrases and named entities instead of individual tokens. Joshi et al. (2020) and Song et al. (2019) made a similar observation and proposed to mask random spans of text. They showed that this simple modification improves the performance on many downstream NLP tasks.
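To make the selection and corruption scheme above concrete, here is a minimal Python sketch of BERT-style masking. The tokenizer, vocabulary size, mask token id, and the -100 "ignore" convention are illustrative assumptions, not the exact setup of the cited papers.

import random

def mlm_corrupt(token_ids, vocab_size, mask_id, select_prob=0.15):
    # Uniformly select 15% of the positions; of those, replace 80% with
    # [MASK], 10% with a random token, and leave 10% unchanged.
    inputs = list(token_ids)
    targets = [-100] * len(token_ids)   # -100: position not predicted (a common convention)
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        targets[i] = tok                # the loss is computed only on selected positions
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)
        # else: keep the original token unchanged
    return inputs, targets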
Alternative objectives. Other pre-training objectives have been proposed in addition to MLM. For instance, Devlin et al. (2018) also uses the next sentence prediction (NSP) objective, a binary classification task that consists in predicting whether two input sentences follow each other in the original corpus. The NSP objective was originally designed to improve the performance on downstream NLP tasks, but recent studies (Lample & Conneau, 2019; Liu et al., 2019) showed that training MLM on streams of sentences to leverage longer context, and removing the NSP objective, improves the quality of pre-training. To improve the sample efficiency of MLM (where only 15% of tokens are predicted), Electra (Clark et al., 2020) proposed to replace (and not mask) some tokens with plausible alternatives, and to train a network to detect the tokens that have been replaced. They showed that this new Replaced Token Detection (RTD) objective matches the performance of RoBERTa while using four times less computational resources. Dong et al. (2019) proposed a model that combines multiple pre-training tasks, including bidirectional, but also left-to-right and right-to-left language modeling objectives. Lewis et al. (2019) also proposed different pre-training objectives, e.g. to detect whether input sentences have been permuted, whether tokens have been deleted or inserted, etc.
Figure 1. Illustration of the MLM and DOBF objectives. Given an input function, the masked language modeling (MLM) task randomly samples tokens to mask out. With source code, a large fraction of these tokens are related to the language syntax (e.g. commas, parentheses, etc.), are trivial for the model to predict, and provide a poor training signal. Instead, we propose to obfuscate the code by masking the names of functions and variables, and to train the model to recover the original function by deobfuscating the code (DOBF). When a variable is masked out, we mask all occurrences of this variable with the same mask symbol (e.g. all occurrences of "visited" are replaced by "V0") to prevent the model from copying names. The DOBF objective is more difficult and provides a better learning signal.
Code Generation Pre-training. Recent studies showed that pre-training methods developed for natural language processing are also effective for programming languages. For instance, Feng et al. (2020) proposed CodeBERT, a RoBERTa-based model trained on source code using the MLM and RTD objectives. They showed that their model performs well on downstream code generation tasks and outperforms previous pre-training approaches. Kanade et al. (2020) applied MLM and the next sentence prediction objectives to pre-train models on Python code. More recently, Roziere et al. (2020) applied the unsupervised machine translation principles of Lample et al. (2018a;b) to monolingual source code from GitHub. They showed that the resulting model, TransCoder, was able to translate source code between Python, Java, and C++ in a fully unsupervised way. However, the two above studies build upon pre-training strategies developed in the context of natural language processing. In this paper, we propose to use a code-specific objective to better pre-train models designed to be fine-tuned on code generation tasks: code deobfuscation.

[...] David et al. (2020) used a transformer together with augmented representations obtained from static analysis to infer procedure names in stripped binary files. These models are already used to understand obfuscated and compiled source code. UnuglifyJS, which is based on JSNICE (Raychev et al., 2015) and available online, is especially famous in the Javascript community. However, none of these studies investigated the use of deobfuscation for model pre-training.
[...] tokens are composed of Python keywords or symbols related to syntax: ",", "[", "while", "=", "if", ")", "return". These symbols are easy to recover, and a model will quickly learn to predict them with perfect accuracy. This effect is accentuated by the verbosity of the language: for instance, we would see significantly more of these tokens in Java. Retrieving the obfuscated graph token is also relatively simple: the model only needs to retrieve the most relevant variable in the scope. More generally, retrieving an identifier name is often easy when given its full context, including its definition and usages. Overall, we suspect that the MLM objective is too simple for programming languages, and we introduce a new objective, DOBF, which encourages the model to learn a deeper understanding of code semantics.

3.2. Deobfuscation Objective

Instead of MLM, we propose a new pre-training objective, DOBF, that leverages the particular structure of programming languages. We obfuscate code snippets by replacing class, function and variable names with special tokens, and train a model to recover the original names. When an identifier is selected, all of its instances in the code are replaced by the same special token. This differs from MLM, where the name of a variable can appear multiple times while being masked only a single time. For instance, in Figure 1, DOBF will replace the two occurrences of node by the same symbol V5, while MLM will only mask one of these occurrences. As a result, the fraction of meaningful tokens masked by the objective is language independent: for more verbose languages (e.g. Java), the less informative syntax-related tokens will not be masked out by the DOBF objective.

Each identifier is replaced with probability p_obf ∈ [0, 1]. We ensure that the original input is modified: if no identifier is replaced, we draw a random one to obfuscate. When p_obf = 0, we always obfuscate exactly one random identifier in the input. When p_obf = 1, we obfuscate all the identifiers defined in the file. We ensure that the obfuscated code has the same behavior as the original. The second row in Figure 1 shows an example of obfuscated code with p_obf = 1, where we obfuscate a function bfs which implements a breadth-first search. The function append is not obfuscated as it is a standard Python function not defined in the file. The model is given the obfuscated code as input and has to restore the original name of each special token CLASS_i, FUNC_i and VAR_i. In other words, the model needs to output a dictionary mapping special tokens to their initial values.
Finding informative names for obfuscated identifiers requires the model to learn a deep understanding of code semantics, which is desirable for a pre-training task. MLM will mask only some of the occurrences of the identifiers and leave the other ones unchanged, so that the model can simply copy identifier names. In Figure 1, with MLM masking, the model can simply notice that a variable named queue is called on the fourth line. Since the variable is not defined, the model can easily guess that queue has to be defined on the third line, and infer the value of the corresponding [MASK] token. With the deobfuscation objective, the model needs to analyze code patterns and understand the semantics of the variable to infer that, since its elements are popped with .pop(0), the variable V3 implements a queue. If its elements were popped with .pop(), our model would name it stack instead of queue (c.f. Figure 8 in the appendix).

3.3. Implementation

Overall, the deobfuscation objective operates like a supervised machine translation objective, where a seq2seq model is trained to map an obfuscated code snippet to a dictionary represented as a sequence of tokens. At inference time, the model is able to suggest meaningful class, function and variable names for a piece of code with an arbitrary number of obfuscated identifiers. Obfuscated classes, functions, and variables are replaced with associated special tokens: CLASS_0 ... CLASS_N, FUNC_0 ... FUNC_N and VAR_0 ... VAR_N. We serialize the output dictionary as a sequence of tokens where the entries are separated by a delimiter symbol "|". In the obfuscated example given in Figure 1, the model is trained to generate the sequence: FUNC_0 bfs | VAR_0 graph | VAR_1 root | VAR_2 visited | VAR_3 queue | VAR_4 neighbor | VAR_5 node.
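As an illustration of the obfuscation and serialization steps described in Sections 3.2 and 3.3, here is a minimal sketch. It collapses all identifier kinds to VAR_i, relies on whole-word regex replacement, and assumes the list of defined identifiers is provided (in practice this would come from a tokenizer or parser); it is not the actual DOBF implementation.

import random
import re

def obfuscate(code, identifiers, p_obf=0.5):
    # `identifiers` lists the names defined in the snippet. Each one is
    # obfuscated with probability p_obf; if none is selected, one is drawn
    # at random so that the input is always modified.
    selected = [name for name in identifiers if random.random() < p_obf]
    if not selected:
        selected = [random.choice(identifiers)]
    mapping = {}
    for i, name in enumerate(selected):
        special = f"VAR_{i}"          # the real objective also uses FUNC_i / CLASS_i
        mapping[special] = name
        # replace every occurrence of the identifier by the same special token
        code = re.sub(rf"\b{re.escape(name)}\b", special, code)
    # serialize the dictionary, entries separated by the "|" delimiter
    target = " | ".join(f"{key} {name}" for key, name in mapping.items())
    return code, target

With p_obf = 1 on the bfs example of Figure 1, this would produce a target sequence similar to the one shown above.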
4. Experiments

We train DOBF with the deobfuscation objective. First, we evaluate our model on two straightforward deobfuscation applications. Then, we show its performance on multiple downstream tasks.

4.1. Deobfuscation

We evaluate our model on two applications of the deobfuscation task: when p_obf = 0 (the model has to retrieve a single identifier name), and p_obf = 1 (the model has to retrieve all the identifier names).

Deobfuscating a single identifier. When p_obf = 0, only one identifier is obfuscated. In that case, the model has to propose a relevant name for a single identifier using the rest of the non-obfuscated file as context. It can be applied as a tool that suggests relevant variable names. Integrated development environments (e.g. PyCharm or IntelliJ) already perform this task, often using simple handcrafted rules.
Deobfuscating all identifiers. Obfuscators are commonly used to make code smaller and more efficient, or to protect it by making it more difficult to understand and reuse. They typically apply several transformations, one of them being to replace every identifier name with short and uninformative names (e.g. a, b, c). In our work, such a transformation corresponds to obfuscating a file with p_obf = 1. To measure our model's ability to revert the obfuscation operation, we evaluate its accuracy when obfuscating all identifier names. Another application would be to help understand source code written with uninformative variable names.

Evaluation metric. We evaluate the ability of our model to retrieve identifier names from the original non-obfuscated code. We report the accuracy, which is the percentage of recovered tokens that exactly match the ground truth. Following previous works (Allamanis et al., 2015; 2016; Alon et al., 2018; 2019), we also report the subtoken score, a more flexible metric which computes the precision, recall, and F1 scores for retrieving the original case-insensitive subtokens. Each token is broken into subtokens using uppercase letters for camelCase and underscores for snake_case. For instance, decoderAttention would be considered a perfect match for decoder_attention or attentionDecoder. attention would have a perfect precision but a recall of 0.5, so an F1 score of 66.7. crossAttentionDecoder would have a perfect recall but a precision of 2/3, corresponding to an F1 score of 80.0. We compute the overall subtoken precision, recall and F1 scores averaged over each recovered token.
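A minimal sketch of the subtoken scores described above; the exact splitting rules used in the paper may differ in corner cases.

import re

def subtokens(name):
    # split on underscores and on lowercase/digit-to-uppercase boundaries, case-insensitively
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

def subtoken_scores(recovered, ground_truth):
    rec, ref = set(subtokens(recovered)), set(subtokens(ground_truth))
    common = len(rec & ref)
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(rec)
    recall = common / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Examples from the text:
# subtoken_scores("decoderAttention", "decoder_attention")      -> (1.0, 1.0, 1.0)
# subtoken_scores("attention", "decoder_attention")             -> (1.0, 0.5, 0.667)
# subtoken_scores("crossAttentionDecoder", "decoder_attention") -> (0.667, 1.0, 0.8)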
4.2. Fine-tuning on downstream tasks

In order to evaluate DOBF as a pre-training model, we fine-tune DOBF on TransCoder and on three tasks from CodeXGLUE (Cod, 2020), a benchmark for programming languages. We only consider the Java and Python tasks with an encoder in the model architecture for which the training, validation, and test sets are publicly available.

CodeXGLUE Clone Detection. This task is a binary classification problem where the model has to predict whether two code snippets are semantically equivalent. It is evaluated using the F1 score. The model is composed of a single encoder and a classification layer. An input consists of two snippets of code, which are concatenated before being fed to the model. This task is available in Java.

CodeXGLUE Code Summarization. Given a code snippet, the model is trained to generate the corresponding documentation in natural language. The architecture is a sequence-to-sequence transformer model evaluated using the BLEU score (Papineni et al., 2002). The dataset includes both Java and Python source code.

CodeXGLUE NL Code Search. Given a code search query in natural language, the model has to retrieve the most semantically related code within a collection of code snippets. This is a ranking problem evaluated using the Mean Reciprocal Rank (MRR) metric. The model is composed of two encoders. The natural language query and the code are encoded separately, and we compute the dot product between the first hidden states of the encoders' last layers. This task is available in Python.
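For illustration, a minimal PyTorch-style sketch of the two-encoder scoring described above; the encoder modules and their output shapes are assumptions, not the actual fine-tuning code.

import torch.nn as nn

class CodeSearchScorer(nn.Module):
    def __init__(self, query_encoder: nn.Module, code_encoder: nn.Module):
        super().__init__()
        self.query_encoder = query_encoder   # e.g. a transformer encoder initialized from DOBF
        self.code_encoder = code_encoder

    def forward(self, query_ids, code_ids):
        # Assumes each encoder returns last-layer hidden states of shape (batch, seq_len, dim).
        q = self.query_encoder(query_ids)[:, 0, :]   # first hidden state of the last layer
        c = self.code_encoder(code_ids)[:, 0, :]
        # Dot product between the two representations gives the relevance score.
        return (q * c).sum(dim=-1)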
TransCoder. TransCoder (Roziere et al., 2020) is an unsupervised machine translation model which translates functions and methods between C++, Java, and Python. A single seq2seq model is trained for all languages. In the original work, TransCoder is pre-trained with MLM and trained with denoising auto-encoding and back-translation. TransCoder is evaluated using the Computational Accuracy metric, which computes the percentage of correct solutions according to a series of unit tests. We only consider a single model output (CA@1), with beam sizes of 1 and 10.
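A rough sketch of how such a metric can be computed; the translate and run_unit_tests helpers are hypothetical placeholders, and the actual TransCoder evaluation harness differs.

def computational_accuracy(examples, translate, run_unit_tests, beam_size=10):
    # CA@1: a translation counts as correct if the single highest-scoring
    # hypothesis of the beam passes all unit tests of the problem.
    correct = 0
    for source_function, unit_tests in examples:
        hypotheses = translate(source_function, beam_size=beam_size)  # ranked by model score
        if run_unit_tests(hypotheses[0], unit_tests):
            correct += 1
    return 100.0 * correct / len(examples)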
4.3. Experimental details

Model Architecture. For DOBF, we consider a seq2seq model with attention, composed of an encoder and a decoder using a transformer architecture (Vaswani et al., 2017). We train two models with different sizes in order to provide fair comparisons to our baselines (CodeBERT and TransCoder): one model with 12 layers, 12 attention heads, and a hidden dimensionality of 768, and one model with 6 layers, 8 attention heads, and a hidden dimensionality of 1024.

Training dataset. As in Roziere et al. (2020), we use the GitHub public dataset available on Google BigQuery and select all Python and Java files within the available projects. Following Lopes et al. (2017) and Allamanis (2019), we remove duplicate files. We also ensure that each fork belongs to the same split as its source repository. We obfuscate each file and create the corresponding dictionary of masked identifier names, resulting in a parallel (obfuscated file, dictionary) dataset of 19 GB for Python and 26 GB for Java. We show some statistics about this dataset in Table 1. We use the same tokenizers as Roziere et al. (2020). For comparison purposes, we apply either the BPE codes used by Roziere et al. (2020) or by Feng et al. (2020). In practice, we train only on files containing less than 2000 tokens, which corresponds to more than 90% and 80% of the Java and Python files respectively.
Table 1. Dataset statistics.

                                 Java      Python
All - Size                       26 GB     19 GB
All - Nb files                   7.9M      3.6M
Av. nb of tokens / file          718       1245
Av. nb of identifiers / file     25.9      41.8
Training details. We train DOBF to translate obfuscated files into lists of identifier names. During DOBF training, we alternate between batches of Java and Python composed of 3000 tokens per GPU. We optimize DOBF with the Adam optimizer (Kingma & Ba, 2014) and an inverse square-root learning rate scheduler (Vaswani et al., 2017). We implement our models in PyTorch (Paszke et al., 2019) and train them on 32 V100 GPUs. We use float16 operations to speed up training and to reduce the memory usage of our models. We try different initialization schemes: training from scratch, and initializing from a Python-Java MLM model following Roziere et al. (2020). We train DOBF with three different obfuscation probability parameters: p_obf ∈ {0, 0.5, 1}. For each p_obf value, we train models with multiple initial learning rates ranging from 10^-4 to 3·10^-4 and select the best one using the average subtoken F1 score computed on the validation dataset.
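For reference, a common form of the inverse square-root schedule of Vaswani et al. (2017) is sketched below; the warmup length is an assumption, as it is not specified here.

def inverse_sqrt_lr(step, base_lr=1e-4, warmup_steps=4000):
    # Linear warmup followed by decay proportional to 1/sqrt(step),
    # normalized so that the learning rate equals base_lr at the end of warmup.
    step = max(step, 1)
    scale = min(step ** -0.5, step * warmup_steps ** -1.5) * warmup_steps ** 0.5
    return base_lr * scale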
Fine-tuning details. Depending on the fine-tuning task, we consider different model architectures: seq2seq models with an encoder and a decoder, architectures with two encoders, or a single encoder. In all cases, we initialize the encoders of these models with the encoder of DOBF and fine-tune all parameters. For fair comparison, we rerun all baselines and train models with the same architectures, number of GPUs, batch sizes, and optimizers as in the original papers. For CodeXGLUE, we noticed that the tasks are quite sensitive to the learning rate used during fine-tuning. We perform a grid search on five learning rates ranging from 5·10^-6 to 10^-4 and select the best value on the validation dataset. For TransCoder, we use a learning rate of 10^-4 as in Roziere et al. (2020).

5. Results

5.1. Deobfuscation

In Table 2, we evaluate the ability of our model to recover identifier names, either when only one identifier is obfuscated (p_obf = 0) or when all identifiers are obfuscated (p_obf = 1), for models trained with p_obf ∈ {0, 0.5, 1}. Even when evaluating with p_obf = 0, training with p_obf = 0 is less efficient than p_obf = 0.5, since the model is only trained to generate a single variable for each input sequence. Training with p_obf = 0.5 is a more difficult task that requires the model to learn and understand more about code semantics. Forcing the model to understand the structure of the code may be useful even when testing with p_obf = 0, as some identifier names cannot be guessed only from the names of other identifiers. When DOBF has to recover a fully obfuscated function, it obtains the best accuracy when trained with p_obf = 1. It manages to recover 45.6% of the initial identifier names. We also observe that, for every configuration, pre-training DOBF with MLM improves the performance.

Obfuscated input:

def FUNC_0(VAR_0, VAR_1):
    VAR_2 = [VAR_1]
    VAR_3 = [VAR_1]
    while VAR_3:
        VAR_4 = VAR_3.pop(0)
        for VAR_5 in VAR_0[VAR_4]:
            if (VAR_5 not in VAR_2):
                VAR_2.add(VAR_5)
                VAR_3.append(VAR_5)
    return VAR_2

Code recovered by DOBF:

def bfs(graph, start):
    visited = [start]
    queue = [start]
    while queue:
        node = queue.pop(0)
        for neighbor in graph[node]:
            if (neighbor not in visited):
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

Figure 2. Full deobfuscation of a breadth-first-search function by DOBF. The code on top has been fully obfuscated. The code on the bottom was recovered using DOBF by replacing the function name and every variable name using the generated dictionary. DOBF is able to suggest relevant function and variable names. It makes the code much more readable and easier to understand.

Figure 2 shows an example of a fully obfuscated function recovered by our model. DOBF successfully manages to understand the purpose of the function and to predict appropriate variable names. Figure 3 shows examples of function name proposals by DOBF for functions implementing matrix operations in Python. We observe that DOBF manages to identify the key tokens and to properly infer the purpose of similar but very different functions. Figures 5, 6, and 7 in the appendix show additional examples of function name proposals by DOBF in Java and Python. Figure 8 shows additional examples demonstrating that DOBF also leverages non-obfuscated identifier names to understand the meaning of input functions. Figures 9 and 10 in the appendix show examples of deobfuscation of fully obfuscated Python code snippets using DOBF. It is able to understand the semantics and purposes of a variety of obfuscated classes and functions, including an LSTM cell.
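To recover readable code from a generated dictionary, as in Figure 2, the output sequence can be parsed and applied back to the obfuscated snippet. A naive sketch based on whole-token string replacement (the actual pipeline presumably works on the tokenized code):

import re

def deobfuscate(obfuscated_code, generated_sequence):
    # generated_sequence looks like "FUNC_0 bfs | VAR_0 graph | VAR_1 root | ..."
    mapping = dict(entry.strip().split(" ", 1) for entry in generated_sequence.split("|"))
    for special, name in mapping.items():
        obfuscated_code = re.sub(rf"\b{special}\b", name, obfuscated_code)
    return obfuscated_code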
5.2. Downstream tasks

For fine-tuning, we considered models pre-trained with p_obf = 0.5 and p_obf = 1. Since they gave very similar results on downstream tasks, we only use models pre-trained with p_obf = 0.5 in the rest of the paper. As baselines, we consider a randomly initialized model and a model pre-trained with MLM only. For CodeXGLUE tasks, we also consider CodeBERT as a baseline. We compare results for DOBF trained from scratch and DOBF initialized with MLM (MLM+DOBF), and report results in Table 3.
Figure 3. Additional examples of function name proposals for matrix operations in Python. DOBF is able to find the right name for
each matrix operation, showing that it learned to attend to the most important parts of the code. Even when the function only differs by
one token (e.g. a subtraction instead of an addition operator), DOBF successfully and confidently (c.f. scores) understands the semantics
of the function and its purpose.
Table 3. Results on downstream tasks for different pre-training configurations. Models pre-trained with MLM and DOBF signif-
icantly outperform both CodeBERT and models trained with MLM only. MLM+DOBF outperforms CodeBERT by 7% on natural
language code search (NLCS), and MLM by 6% in Java → Python computational accuracy. It also beats CodeBERT on every task except
Clone Detection, on which CodeBERT scores much higher than our MLM. The tasks where MLM provides large improvements over the
transformer baseline (first row, no pre-training) are also the tasks where DOBF provides the largest gains (e.g. clone detection, natural
language code search, and unsupervised translation).
              Clone Det    Code Sum Java   Code Sum Python   NLCS     Python→Java (CA@1)   Java→Python (CA@1)
              (F1 score)   (BLEU)          (BLEU)            (MRR)    k=1       k=10       k=1       k=10

Transformer   88.14        16.58           16.43             0.025    37.6      38.9       31.8      42.1
CodeBERT      96.50        18.25           18.22             0.315    -         -          -         -
MLM           91.89        18.59           17.95             0.308    40.3      42.2       44.7      46.6
DOBF          96.52        18.19           17.51             0.272    38.9      45.7       44.7      46.4
MLM+DOBF      95.87        19.05           18.24             0.383    43.5      44.9       49.2      52.5
[...] tasks, outperforming CodeBERT and MLM. The deobfuscation objective is already effective as a pre-training task: it leads to results comparable to MLM on most tasks and is much more effective on clone detection. The MLM+DOBF model outperforms MLM on all downstream tasks, the major improvement being for NL code search, which is also the task that benefited the most from MLM pretraining. For TransCoder, MLM+DOBF increases the computational accuracy of the MLM model by 2.7% when translating from Python to Java, and by 5.9% when translating from Java to Python with beam size 10. In Figure 4, we can see that [...]

Figure 4. TransCoder results for different pre-training schemes. Pre-training our model with MLM+DOBF instead of MLM only allows it to quickly reach higher levels of computational accuracy when fine-tuning for Java → Python translation. The gap between MLM+DOBF and MLM persists until convergence.

6. Conclusion

In this paper, we introduce a new deobfuscation objective and show that it can be used for three purposes: recovering fully obfuscated code, suggesting relevant identifier names, and pre-training transformer models for programming language related tasks. Although it does not require any parallel corpus of source code aligned with natural language, DOBF outperforms CodeBERT and MLM pre-training on multiple downstream tasks, including clone detection, code summarization, natural language code search, and unsupervised code translation. These results show that DOBF leverages the particular structure of source code to add noise to the input sequence in a particularly effective way. Other noise functions or surrogate objectives adapted to source code may improve the performance further, for instance training the model to find the type of given variables or the signature of a method, or to repair a piece of code that has been corrupted.

Since models pretrained on source code benefit from structured noise, it would be interesting to see whether these findings can be applied to natural languages as well. Although ambiguous, natural languages also have an underlying structure. Leveraging the constituency or dependency parse trees of sentences (as opposed to abstract syntax trees in programming languages) may help design better pre-training objectives for natural languages.
References

Codexglue: An open challenge for code intelligence. arXiv, 2020.

Allamanis, M. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.

Allamanis, M., Barr, E. T., Bird, C., and Sutton, C. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 281–293, 2014.

Allamanis, M., Barr, E. T., Bird, C., and Sutton, C. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pp. 38–49, 2015.

Allamanis, M., Peng, H., and Sutton, C. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pp. 2091–2100, 2016.

Alon, U., Zilberstein, M., Levy, O., and Yahav, E. A general path-based representation for predicting program properties. ACM SIGPLAN Notices, 53(4):404–419, 2018.

Alon, U., Zilberstein, M., Levy, O., and Yahav, E. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.

Bavishi, R., Pradel, M., and Sen, K. Context2name: A deep learning-based approach to infer natural variable names from usage contexts. arXiv preprint arXiv:1809.05193, 2018.

Bichsel, B., Raychev, V., Tsankov, P., and Vechev, M. Statistical deobfuscation of android applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 343–355, 2016.

Butler, S., Wermelinger, M., Yu, Y., and Sharp, H. Relating identifier naming flaws and code quality: An empirical study. In 2009 16th Working Conference on Reverse Engineering, pp. 31–35. IEEE, 2009.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

David, Y., Alon, U., and Yahav, E. Neural reverse engineering of stripped binaries using augmented control flow graphs. Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–28, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.

Fu, C., Chen, H., Liu, H., Chen, X., Tian, Y., Koushanfar, F., and Zhao, J. Coda: An end-to-end neural program decompiler. In Advances in Neural Information Processing Systems, pp. 3703–3714, 2019.

Gellenbeck, E. M. and Cook, C. R. An investigation of procedure and variable names as beacons during program comprehension. In Empirical studies of programmers: Fourth workshop, pp. 65–81. Ablex Publishing, Norwood, NJ, 1991.

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.

Kanade, A., Maniatis, P., Balakrishnan, G., and Shi, K. Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning, pp. 5110–5121. PMLR, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lacomis, J., Yin, P., Schwartz, E., Allamanis, M., Le Goues, C., Neubig, G., and Vasilescu, B. Dire: A neural approach to decompiled identifier naming. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 628–639. IEEE, 2019.

Lample, G. and Conneau, A. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.

Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. ICLR, 2018a.

Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. Phrase-based & neural unsupervised machine translation. In EMNLP, 2018b.
Lawrie, D., Morrell, C., Feild, H., and Binkley, D. What's in a name? a study of identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC'06), pp. 3–12. IEEE, 2006.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

Liblit, B., Begel, A., and Sweetser, E. Cognitive perspectives on the role of naming in computer programs. In PPIG, pp. 11, 2006.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Lopes, C. V., Maj, P., Martins, P., Saini, V., Yang, D., Zitny, J., Sajnani, H., and Vitek, J. Déjàvu: a map of code duplicates on github. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–28, 2017.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037, 2019.

Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. Mass: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pp. 5926–5936, 2019.

Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., and Wu, H. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.

[...] comprehensibility: an experimental investigation. J. Prog. Lang., 4(3):143–167, 1996.

Vasilescu, B., Casalnuovo, C., and Devanbu, P. Recovering clear, natural identifiers from obfuscated js names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 683–693, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763, 2019.
Figure 5. Examples of name proposal in Java. DOBF is able to suggest relevant function names for a variety of Java methods and demonstrates its ability to understand the semantics of the code. In the first two examples, the first element in the beam shows that it is able to select relevant names in the context to find a function name: it uses Files.delete and Files.createDirectories to suggest the tokens deleteFile and createDir. DOBF also finds relevant names for Java methods without copying any part of the other tokens, for example for the third method, which combines two lists as the Python zip function does, for the fourth method, which computes the n-th element of the Fibonacci series, and for the last method, which computes the dot product of two vectors.
[...]

def FUNC_0 (l):
    return list(set(l))

Proposals: unique 24.8%, remove_duplicates 23.8%, removeDuplicates 18.8%, uniquify 18.7%, unique_items 13.8%

def FUNC_0 (path):
    with gzip.open(path, 'rb') as f:
        content = f.read()
    return content

Proposals: read_gzip_file 22.9%, read_gzip 22.1%, ungzip 20.8%, gzip_content 18.2%, gzip_read 16.0%

def f(n):
    VAR_0 = [True for i in range(n + 1)]
    p = 2
    while (p * p <= n):
        if (VAR_0[p] == True):
            for i in range(p * 2, n + 1, p):
                VAR_0[i] = False
        p += 1
    VAR_0[0] = False
    VAR_0[1] = False
    return [p for p in range(n + 1) if VAR_0[p]]

Proposals for VAR_0: prime 30.6%, l 20.5%, isPrime 18.0%, a 16.4%, primes 14.6%

Figure 6. Examples of name proposal in Python. Our model trained with DOBF goes well beyond copying tokens from the context. For instance, in the first example, it understands that this function is used to get environment variables. In the second example, it proposes names related to what the function actually does (removing duplicates in a list) instead of the individual operations it uses (converting to a set and then to a list). The last two rows show proposals for two different identifiers in a function computing the list of prime numbers below n using the sieve of Eratosthenes. The proposals for the function name are all relevant, and the third one names exactly the algorithm that is used. The variable v is a list of booleans: at the end of the algorithm, v[i] is true if and only if i is prime. The proposed names prime and isPrime are very relevant as they describe what the list contains. Although l and a are not very informative, they indicate that the variable is a list or an array.
def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a * b for a, b in zip(v1, v2)]

Proposals: multiply_lists 28.7%, multiply_list 23.5%, multiply 18.1%, multiply_vectors 14.9%, mul 14.8%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return sum([a * b for a, b in zip(v1, v2)])

Proposals: dotproduct 34.8%, dot_product 19.2%, dotProduct 18.1%, dot 15.6%, multiply_by_addition 12.3%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a ^ b for a, b in zip(v1, v2)]

Proposals: xor 62.9%, XOR 12.8%, vector_xor 10.8%, xors 7.4%, xor_lists 6.1%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a ** b for a, b in zip(v1, v2)]

Proposals: power 29.8%, list_power 20.9%, lcm 19.9%, power_list 15.1%, powersum 14.3%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a + b for a, b in zip(v1, v2)]

Proposals: add_lists 27.0%, add 22.9%, sum_lists 17.9%, list_concat 17.7%, list_add 14.5%

def FUNC_0 (v1, v2):
    assert len(v1) == len(v2)
    return [a - b for a, b in zip(v1, v2)]

Proposals: minus 30.4%, subtract 29.8%, difference 14.1%, subtract_lists 13.3%, substract 12.4%

Figure 7. Examples of function name proposal in Python using DOBF. DOBF is able to identify the key tokens in each function, to properly infer its purpose, and to suggest appropriate names along with a confidence score. In particular, even though the first two code snippets are very similar in terms of edit distance, they implement very different functions, and DOBF is able to name them appropriately.
Function 1: FUNC_0 bfs | VAR_0 queue
Function 2: FUNC_0 dfs | VAR_0 stack
Function 3: FUNC_0 bfs

Figure 8. Deobfuscation of graph traversal functions. These three functions perform graph traversals. The only difference between the first and the second function is that the first uses a queue to select the next element (.pop(0)) while the second uses a stack (.pop()). The first function implements a breadth-first search (bfs) in the graph and the second implements a depth-first search (dfs). DOBF is able to find the right function and variable names in each case. In the last function, we replaced the anonymized VAR_0 variable with queue in the implementation of depth-first search. This erroneous information leads DOBF to believe that this function performs breadth-first search. It shows that, just like human programmers, DOBF uses the names of the other variables to understand programs and choose relevant identifier names. When working on code with misleading identifier names, it is often preferable to obfuscate several identifiers.
Obfuscated input:

def __init__(VAR_0, VAR_1, VAR_2, VAR_3):
    super(CLASS_0, VAR_0).__init__()
    VAR_0.VAR_1 = VAR_1
    VAR_0.VAR_2 = VAR_2
    VAR_0.VAR_4 = nn.Linear(VAR_1, (4 * VAR_2), bias=VAR_3)
    VAR_0.VAR_5 = nn.Linear(VAR_2, (4 * VAR_2), bias=VAR_3)
    VAR_0.FUNC_0()

Deobfuscated output:

def __init__(self, input_size, hidden_size, bias):
    super(LSTM, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.h1 = nn.Linear(input_size, (4 * hidden_size), bias=bias)
    self.h2 = nn.Linear(hidden_size, (4 * hidden_size), bias=bias)
    self.init_weights()

Figure 9. Deobfuscation of an LSTM cell. DOBF is able to recover several of the original tokens, including the class name (LSTM) and the full signature of the __init__ method. Even though DOBF does not always recover the original token, it generally proposes very relevant tokens, which improves code readability. In particular, for some tokens the accuracy and subtoken scores would be zero even though the recovered tokens are still very relevant. For instance, reset_parameters (FUNC_0) was renamed to init_weights, std (VAR_7) was renamed to stdv, and hidden (VAR_13) was renamed to prev_state. In those instances, the original and recovered tokens share no subtoken despite having very similar semantics.
def FUNC_0(VAR_0, VAR_1):
    return sum(map(operator.mul, VAR_0, VAR_1))

Generated dictionary: FUNC_0 dotProduct | VAR_0 list1 | VAR_1 list2

def FUNC_0(VAR_0, VAR_1):
    return list(collections.deque(VAR_0, maxlen=VAR_1))

Generated dictionary: FUNC_0 tail | VAR_0 s | VAR_1 n

def FUNC_0(VAR_0):
    return sum((VAR_1 for VAR_1 in VAR_0 if ((VAR_1 % 2) == 0)))

Generated dictionary: FUNC_0 even_sum | VAR_0 nums | VAR_1 n

Figure 10. Examples of full deobfuscation of Python functions. Even when every identifier is obfuscated, DOBF is able to propose relevant names. The proposed function name is informative and relevant in all examples since the first function computes a dot product, the second downloads an HTML page and returns its content, the third evaluates whether the input contains only unique elements, the fourth computes the tail of an iterable, and the fifth computes the sum of the even elements of an iterable.