
Towards Neural Decompilation

Omer Katz (Technion, Israel)
Yuval Olshaker (Technion, Israel)
Yoav Goldberg (Bar Ilan University, Israel)
Eran Yahav (Technion, Israel)
arXiv:1905.08325v1 [cs.PL] 20 May 2019

Abstract
We address the problem of automatic decompilation, converting a program in low-level representation back to a higher-level, human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates is much easier when done over source code.

The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high.

We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T, our approach automatically trains a decompiler that can translate (decompile) T back to S. We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks, respectively.

Figure 1. Example input (a) and output (b) of decompilation.

1 Introduction
Given a low-level program in binary form or in some intermediate representation, decompilation is the task of lifting that program to human-readable high-level source code.

Fig. 1 provides a high-level example of decompilation. The input to the decompilation task is a low-level code snippet, such as the one in Fig. 1(a). The goal of decompilation is to generate a corresponding equivalent high-level code snippet. The C code snippet of Fig. 1(b) is the desired output for Fig. 1(a).

There are many uses for decompilation. The most common is for security purposes. Searching for software vulnerabilities and analyzing malware both start with understanding the low-level code comprising the program. Currently this is done manually by reverse engineering the program. Reverse engineering is a slow and tedious process by which a specialist tries to understand what a program does and how it does it. Decompilation can greatly improve this process by translating the binary code to more readable, higher-level code.

Decompilation has many applications beyond security. For example, porting a program to a new hardware architecture or operating system is easier when source code is available and can be compiled to the new environment. Decompilation also opens the door to the application of source-level analysis and optimization tools.

Existing Decompilers Existing decompilers, such as Hex-Rays [2] and Phoenix [34], rely on pattern matching to identify the high-level control-flow structure in a program. These decompilers try to match segments of a program's control-flow graph (CFG) against patterns known to originate from certain control-flow structures (e.g. if-then-else or loops). This approach often fails when faced with non-trivial code, and falls back on goto statements to emulate the control flow of the binary code. The resulting code is often low-level, and is really assembly transliterated into C (e.g. assigning variables to temporary values/registers, using gotos, and using low-level operations rather than high-level constructs provided by the language). While it is usually semantically equivalent to the original binary code, it is hard to read, and in some cases less efficient, prohibiting recompilation of the decompiled code.
There are goto-free decompilers, such as DREAM++ [36, 37], that can decompile code without resorting to using gotos in the generated code. However, all existing decompilers, even goto-free ones, are based on hand-crafted rules designed by experts, making decompiler development slow and costly.

Even if a decompiler from a low-level language L_low to a high-level language L_high exists, given a new language L_high', it is nontrivial to create a decompiler from L_low to L_high' based on the existing decompiler. There is no guarantee that any of the existing rules can be reused for the new decompiler.

Neural Machine Translation Recent years have seen tremendous progress in Neural Machine Translation (NMT) [16, 22, 35]. NMT systems use neural networks to translate text from one language to another, and are widely used on natural languages. Intuitively, one can think of NMT as encoding an input text on one side and decoding it to the output language on the other side (see Section 3 for more details). Recent work suggests that neural networks are also effective in summarizing source code [9, 11–13, 20, 21, 28, 29].

Recently, Katz et al. [23] suggested using neural networks, specifically RNNs, for decompilation. Their approach trains a model for translating binary code directly to C source code. However, they did not compensate for the differences between natural languages and programming languages, leading to poor results. For example, the code they generate often cannot be compiled or is not equivalent to the original source code. Their work, however, did highlight the viability of using neural machine translation for decompilation, thus supporting the direction we are pursuing. Section 8 provides additional discussion of [23].

Our Approach We present a novel automatic neural decompilation technique, using a two-phased approach. In the first phase, we generate a templated code snippet which is structurally equivalent to the input. The code template determines the computation structure without assignment of variables and numerical constants. Then, in the second phase, we fill the template with values to get the final decompiled program. The second phase is described in Section 5.

Our approach can facilitate the creation of a decompiler from L_low to L_high for every pair of languages for which a compiler from L_high to L_low exists.

The technique suggested by [23] attempted to apply NMT to binary code as-is, i.e. without any additional steps and techniques to support the translation. We recognize that for a trainable decompiler, and specifically an NMT-based decompiler, to be useful in practice, we need to augment it with programming-languages knowledge (i.e. domain knowledge). Using domain knowledge, we can make translations simpler and overcome many shortcomings of the NMT model. This insight is implemented in our approach as our canonicalization step (Section 4.3, for simplifying translations) and template filling (Section 5, for overcoming NMT shortcomings). Our technique is still modest in its abilities, but presents a significant step forward towards trainable decompilers and towards the application of NMT to the problem of decompilation. The first phase of our approach borrows techniques from natural language processing (NLP) and applies them to programming languages. We use an existing NMT system to translate a program in a lower-level language to a templated program in a higher-level language.

Since we are working on programming languages rather than natural languages, we can overcome some major pitfalls for traditional NMT systems, such as training data generation (Section 4.2) and verification of translation correctness (Section 4.4). We incorporate these insights to create a decompilation technique capable of self-improvement by identifying decompilation failures as they occur, and triggering further training as needed to overcome such failures.

By using NMT techniques as the core of our decompiler's first phase, we avoid the manual work required in traditional decompilers. The core of our technique is language-agnostic, requiring only minimal manual intervention (i.e., implementing a compiler interface).

One of the reasons that NMT works well in our setting is that, compared to natural language, code has a more repetitive structure and a significantly smaller vocabulary. This enables training with significantly fewer examples than what is typically required for NLP [26] (see Section 6).

Mission Statement Our goal is to decompile short snippets of low-level code to equivalent high-level snippets. We aim to handle multiple languages (e.g. x86 assembly and LLVM IR). We focus on code compiled using existing off-the-shelf compilers (e.g. gcc [1] and clang [4]), with compiler optimizations enabled, for the purpose of finding bugs and vulnerabilities in benign software. More specifically, we do not attempt to handle hand-crafted assembly as is often found in malware.

Many previous works aimed to use decompilation as a means of understanding the low-level code, and thus focused mostly on code readability. In addition to readability, we place great emphasis on generating code that is correct (i.e., can be compiled without further modifications) and equivalent to the given input.

We wish to further emphasize that the goal of our work is not to outperform existing decompilers (e.g., Hex-Rays [2]). Many years of development have been invested in such decompilers, resulting in mature and
well-tested (though not yet perfect) tools. Rather, we wish to shed light on trainable decompilation, and NMT-based decompilation in particular, as a promising alternative to traditional decompilation. This new approach holds an advantage over existing decompilers not in its current results, but in its potential to handle new languages, features, compilers, and architectures with minimal manual intervention. We believe this ability will play a vital role as decompilation becomes more widely used for finding vulnerabilities.

Main Contributions The paper makes the following contributions:
• A significant step towards neural decompilation, combining ideas from neural machine translation (NMT) and program analysis. Our work brings this promising approach to decompilation closer to being practically useful and viable.
• A decompilation framework that automatically generates training data and checks the correctness of translations using a verifier.
• A decompilation technique that is applicable to many pairs of source and target languages and is mostly independent of the actual low-level source and high-level target languages used.
• An implementation of our technique in a framework called TraFix (short for TRAnslate and FIX) that, given a compiler from L_high to L_low, automatically learns a decompiler from L_low to L_high.
• An instantiation of our framework for decompilation of C source code from LLVM intermediate representation (IR) [3] and x86 assembly. We used these instances to evaluate our technique on decompilation of small, simple code snippets.
• An evaluation showing that our framework decompiles statements in both LLVM IR and x86 assembly back to C source code with high success rates. The evaluation demonstrates the framework's ability to successfully self-advance as needed.

2 Overview
In this section we provide an informal overview of our approach.

2.1 Motivating Example
Consider the x86 assembly example of Fig. 1(a). Fig. 2 shows the major steps we take for decompiling that example.

The first step in decompiling a given input is applying canonicalization. In this example, for the sake of simplicity, we limited canonicalization to only splitting numbers into digits (Section 4.3.1), thus replacing 14 with 1 4, resulting in the code in block (1). This code is provided to the decompiler for translation.

The output of our decompiler's NMT model is a canonicalized version of C, as seen in block (2). In this example, output canonicalization consists of splitting numbers into digits, the same as was applied to the input, and printing the code in post-order (Section 4.3.2), i.e. each operator appears after its operands. We apply un-canonicalization to the output, which converts it from post-order to in-order, resulting in the code in block (3). The output of un-canonicalization might contain decompilation errors, thus we treat it as a code template. Finally, by comparing the code in block (3) with the original input in Fig. 1, we fill the template (i.e. by determining the correct numeric values that should appear in the code, see Section 5), resulting in the code in block (4). The code in block (4) is then returned to the user as the final output.

For further details on the canonicalizations applied by the decompiler, see Section 4.3.

2.2 Decompilation Approach
Our approach to decompilation consists of two complementary phases: (1) generating a code template that, when compiled, matches the computation structure of the input, and (2) filling the template with values and constants that result in code equivalent to the input.

2.2.1 First Phase: Obtaining a Template
Fig. 3 provides a schematic representation of this phase. At the heart of our decompiler is the NMT model. We surround the NMT model with a feedback loop that allows the system to determine success/failure rates and improve itself as needed by further training.

Denote the input language of our decompiler as L_low and the output language as L_high, such that the grammar of both languages is known. Given a dataset of input statements in L_low to decompile, and a compiler from L_high to L_low, the decompiler can either start from scratch, with an empty model, or from a previously trained model. The decompiler translates each of the input statements to L_high. For each statement, the NMT model generates the few translations that it deems most likely. The decompiler then evaluates the generated translations. It compiles each suggested translation from L_high to L_low using existing off-the-shelf compilers. The compiled translations are compared against the original input statement in L_low and classified as successful translations or failed translations. At this phase, the translations are code templates, not yet actual code, thus the comparison focuses on matching the computation structure. A failed translation therefore does not match the structure of the input, and cannot produce code equivalent to the input in phase 2. We denote input statements for which there was no successful translation
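The classification step of this phase can be sketched as follows. This is a minimal sketch, not the TraFix implementation: `compile_high_to_low` and `same_structure` are hypothetical stand-ins for the compiler interface and the structural comparison of Section 4.4.

```python
def evaluate_candidates(input_low, candidates, compile_high_to_low, same_structure):
    """Classify candidate translations: recompile each candidate and compare
    its computation structure against the original low-level input."""
    successes, failures = [], []
    for cand in candidates:
        recompiled = compile_high_to_low(cand)
        # Candidates are still templates at this point, so the comparison
        # only checks structure, not concrete values.
        if same_structure(input_low, recompiled):
            successes.append(cand)
        else:
            failures.append(cand)
    return successes, failures
```

Successful candidates proceed to phase 2; failed ones feed the retraining loop described next.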
Figure 2. Steps for decompiling x86 assembly to C: (1) canonicalized x86 input, (2) NMT output, (3) templated output, (4) final fixed output.

Figure 3. Schematic overview of the first phase of our decompiler.

as failed inputs. Successful translations are passed to the second phase and made available to the user.

The existence of failed inputs triggers a retraining session. The training dataset and validation dataset (used to evaluate progress during training) are updated with additional samples, and the model resumes training using the new datasets. This feedback loop, between the failed inputs and the model's training session, drives the decompiler to improve itself and keep learning as long as it has not reached its goal. These iterations continue until a predetermined stop condition has been met, e.g. a significant enough portion of the input statements has been decompiled successfully. The loop also allows us to focus training on aspects where the model is weaker, as determined by the failed inputs.

The well-defined structure of programming languages allows us to make predictable and reversible modifications to both the input and output of the NMT model. These modifications are referred to as canonicalization and un-canonicalization, and are aimed at simplifying the translation problem. These steps rely on domain-specific knowledge and do not exist in traditional NMT systems for natural languages. Section 4.3 motivates and describes our canonicalization methods.

Updating the Datasets After each iteration we update the dataset used for training. Retraining without doing so would lead to over-fitting the model to the existing dataset, and would be ineffective at teaching the model to handle new inputs.

We update the dataset by adding new samples obtained from two sources:
• Failed translations – We compile failed translations from L_high to L_low and use them as additional training samples. Training on these samples serves to teach the model the correct inputs for these translations, thus reducing the chances that the model will generate these translations again in future iterations.
• Random samples – We generate a predetermined number of random code samples in L_high and compile these samples to L_low.

The validation dataset is updated using only random samples. It is also shuffled and truncated to a constant size. The validation dataset is translated and evaluated many times during training; truncating it prevents the validation overhead from increasing.

2.2.2 Second Phase: Filling the Template
The first phase of our approach produces a code template that can lead to code equivalent to the input. The goal of the second phase is to find the right values for instantiating actual code from the template. Note that the NMT model provides initial values. We need to verify that these values are correct and replace them with appropriate values if they are wrong.
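This verify-or-replace step can be sketched as a search over candidate value transformations. This is an illustrative sketch only: the transformation list is hypothetical (Section 5 discusses the actual transformations), and `recompiles_to_input` stands in for recompiling the filled template and comparing it against the original input.

```python
# Candidate transformations relating a value x taken from the low-level
# input to the value N that should appear in the high-level template.
# Hypothetical list, for illustration.
TRANSFORMS = [
    ("identity", lambda x: x),           # value can be copied directly
    ("power-of-two", lambda x: 2 ** x),  # e.g. shift amount -> multiplier
    ("negate", lambda x: -x),            # e.g. subtraction encoded as addition
]

def fill_value(input_value, recompiles_to_input):
    """Try transformations of the input value until the filled template,
    when recompiled, reproduces the original input. Returns the name of
    the transformation and the value, or None on failure."""
    for name, transform in TRANSFORMS:
        candidate = transform(input_value)
        if recompiles_to_input(candidate):
            return name, candidate
    return None  # no known transformation works: report a failed input
```

For the motivating example of Fig. 2, the value 14 is accepted as-is by the identity transformation, while the shift amount 2 requires the power-of-two transformation, yielding 4.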
This step is inspired by the common NLP practice of delexicalization [18]. In NLP, using delexicalization, some words in a sentence are replaced with placeholders (e.g. NAME1 instead of an actual name). After translation, these placeholders are replaced with values taken directly from the input.

Similarly, we use the input statement as the source for the values needed for filling our template. Unlike delexicalization, it is not always the case that we can take a value directly from the input. In many cases, usually due to optimizations, we must apply some transformation to the values in the input in order to find the correct value to use.

In the example of Fig. 2, the code contains two numeric values which we need to "fill" – 14 and 2. For each of these values we need to either verify or replace it. The case of 14 is relatively simple, as the NMT provided a correct initial value. We can determine that by comparing 14 in the output to 14 in the original input. For 2, however, copying the value 2 from the input does not provide the correct output. Compiling the output with the value 2 would result in the instruction sall 1, %eax rather than the desired sall 2, %eax. We thus replace 2 with a variable N and try to find the right value for N. To get the correct value, we need to apply a transformation to the input. Specifically, if the input value is x, the relevant transformation for this example is N = 2^x, resulting in N = 4 which, when recompiled, yields the desired output. Therefore we replace 2 with 4, resulting in the code in Fig. 2(4).

Section 5 further elaborates on this phase and provides additional possible transformations.

3 Background
Current Neural Machine Translation (NMT) models follow the sequence-to-sequence paradigm introduced in [14]. Conceptually, they have two components, an encoder and a decoder. The encoder encodes an arbitrary-length sequence of tokens x_1, ..., x_n over alphabet A into a sequence of vectors, where each vector represents a given input token x_i in the context in which it appears. The decoder then produces an arbitrary-length sequence of tokens y_1, ..., y_m from alphabet B, conditioned on the encoded vectors. The sequence y_1, ..., y_m is generated a token at a time, until an end-of-sequence token is generated. When generating the i-th token, the model considers the previously generated tokens as well as the encoded input sequence. An attention mechanism is used to choose which subset of the encoded vectors to consider at each generation step. The generation procedure is either greedy, choosing the best continuation symbol at each step, or uses beam search to develop several candidates in parallel. The NMT system (including the encoder, decoder and attention mechanism) is trained over many input-output sequence pairs, where the goal of the training is to produce a correct output sequence for each input sequence. The encoder and the decoder are implemented as recurrent neural networks (RNNs), in particular as specific flavors of RNNs called LSTM [19] and GRU [14] (we use LSTMs in this work). Refer to [30] for further details on NMT systems.

4 Decompilation with Neural Machine Translation
In this section we describe the algorithm of our decompilation framework using NMT. First, in Section 4.1, we describe the algorithm at a high level. We then describe the realization of operations used in the algorithm, such as canonicalization (Section 4.3), the evaluation of the resulting translations (Section 4.4), and the stopping condition (Section 4.5).

4.1 Decompiler Algorithm
Our framework implements the process depicted by Fig. 3. This process is also formally described in Algorithm 1. The algorithm uses a Dataset data structure which holds pairs (x, y) of statements such that x ∈ L_high, y ∈ L_low, and y is the output of compiling x.

The framework takes two inputs: (1) a set of statements for decompilation, and (2) a compiler interface. The output is a set of successfully decompiled statements.

Decompilation starts with empty sets for training and validation and canonicalizes (Section 4.3) the input set. It then iteratively extends the training and validation sets (Section 4.2), trains a model on the new sets, and attempts to translate the input set. Each translation is then recompiled and evaluated against the original input (Section 4.4 and Section 5). Successful translations are put in a Success set, which will eventually be returned to the user. Failed translations are put in a Failed set, which will be used to further extend the training set. The framework repeats these steps as long as the stopping condition has not been reached (Section 4.5).

4.2 Generating Samples
To generate samples for our decompiler to train on, we generate random code samples from a subset of the C programming language. This is done by sampling the grammar of the language. The samples are guaranteed to be syntactically and grammatically correct. We then compile our code samples using the provided compiler. Doing so results in a dataset of matching pairs of statements, one in C and the other in L_low, that can be used by the model for training and validation.
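The grammar-sampling idea can be sketched as follows. The toy grammar below is a hypothetical illustration, not the paper's actual C subset; statements are generated by recursively expanding nonterminals, with recursion depth bounded so expansion always terminates.

```python
import random

# A toy expression-statement grammar (hypothetical subset, for illustration).
# Each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "stmt": [["var", "=", "expr", ";"]],
    "expr": [["num"], ["var"], ["expr", "op", "expr"]],
    "op":   [["+"], ["-"], ["*"]],
}

def sample(symbol, depth=0):
    """Recursively expand a grammar symbol into a list of tokens."""
    if symbol == "var":
        return ["X%d" % random.randint(1, 3)]  # generic names, see Sec. 4.3.1
    if symbol == "num":
        return [str(random.randint(0, 99))]
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token, emit as-is
    rules = GRAMMAR[symbol]
    if depth > 3:
        rules = rules[:1]  # bound recursion: fall back to the shortest rule
    return [tok for s in random.choice(rules) for tok in sample(s, depth + 1)]

def gen_random_samples(n):
    """Generate n random, grammatically correct statements."""
    return [" ".join(sample("stmt")) for _ in range(n)]
```

Each generated statement would then be compiled to obtain the matching low-level half of a training pair.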
Algorithm 1 Decompilation algorithm
Input: inputset, collection of statements in L_low; compile, API to compile L_high to L_low
Output: Dataset of successfully decompiled statements in L_high
Data Types: Dataset: collection of pairs (x, y), such that x = compile(y)
1: procedure Decompile
2:   inputset ← canonicalize(inputset)
3:   Train ← newDataset
4:   Validate ← newDataset
5:   model ← newModel
6:   Success ← newDataset
7:   Failures ← newDataset
8:   while stopping condition not met do
9:     Train ← Train ∪ Failures ∪ gen_random_samples()
10:    Validate ← Validate ∪ gen_random_samples()
11:    model.retrain(Train, Validate)
12:    decompiled ← model.translate(inputset)
13:    recompiled ← compile(decompiled)
14:    for each i in 0...inputset.size do
15:      pair ← (inputset[i], decompiled[i])
16:      if equiv(inputset[i], recompiled[i]) then
17:        if fill(inputset[i], recompiled[i]) then
18:          Success ← Success ∪ [pair]
19:        else
20:          Failures ← Failures ∪ [pair]
21:        end if
22:      else
23:        Failures ← Failures ∪ [pair]
24:      end if
25:    end for
26:  end while
27:  return uncanonicalize(Success)
28: end procedure

We note that, alternatively, we could use code snippets from publicly available code repositories as training samples, but these are less likely to cover uncommon coding patterns.

4.3 Improving Translation Performance with Canonicalization
It is possible to improve the performance of NMT models without intervening in the actual model. This can be achieved by manipulating the inputs in ways that simplify the translation problem. In the context of our work, we refer to these domain-specific manipulations as canonicalization.

Following are two forms of canonicalization used by our implementation:

4.3.1 Reducing Vocabulary Size
The vocabulary size of the samples provided to the model, either for training or translating, directly affects the performance and efficiency of the model. In the case of code, a large portion of the vocabulary is devoted to numerical constants and names (such as variable names, method names, etc.).

Names and numbers are usually considered "uncommon" words, i.e. words that do not appear frequently. Descriptive variable names, for example, are often used within a single method but are not often reused in other methods. This results in a distinctive vocabulary, consisting largely of uncommon words, and leads to a large vocabulary.

We observe that the actual variable names do not matter for preserving the semantics of the code. Furthermore, these names are actually removed as part of the stripping process. Therefore, we replace all names in our samples with generic names (e.g. X1 for a variable). This allows for more reuse of names in the code, and therefore more examples from which the model can learn how to treat such names. Restoring informative, descriptive names in source code is a known and orthogonal research problem for which several solutions exist (e.g. [10, 17, 32]).

Numbers cannot be handled in a similar way. Their values cannot be replaced with generic values, since that would alter the semantic meaning of the code. Furthermore, essentially every number used in the samples becomes a word in the vocabulary. Even limiting the values of numbers to some range [1, K] would still result in K different words.

To deal with the abundance of numbers, we take inspiration from NMT for natural languages. Whenever an NMT model for natural language encounters an uncommon word, instead of trying to directly translate that word, it falls back to a sub-word representation (i.e. processes the word as several symbols). Similarly, we split all numbers in our samples into digits. We train the model to handle single digits and then fuse the digits in the output into numbers. Fig. 4 provides an example of this process on a simple input. Using this process, we reduce the portion of the vocabulary dedicated to numbers to only 10 symbols, one per digit. Note that this reduction comes at the expense of prolonging our input sentences.

movl 1234 , X1                           movl 1 2 3 4 , X1
(a) Original input                       (b) Input after splitting numbers to digits

X1 = 1 2 3 4 ;                           X1 = 1234 ;
(c) Translation output                   (d) Output after fusing digits to numbers

Figure 4. Reducing vocabulary by splitting numbers to digits
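The split-and-fuse canonicalization of Fig. 4 can be sketched in a few lines. This is a minimal sketch assuming numbers appear as separate, space-delimited tokens; a digit embedded in a generic name such as X1 is left intact only because single digits are unchanged by both directions.

```python
import re

def split_numbers(text):
    """Canonicalization: replace each number with its space-separated
    digits, e.g. '1234' -> '1 2 3 4'."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)

def fuse_digits(text):
    """Un-canonicalization: fuse runs of space-separated single digits
    back into one number, e.g. '1 2 3 4' -> '1234'."""
    return re.sub(r"\d(?: \d)+", lambda m: m.group().replace(" ", ""), text)
```

On the Fig. 4 example, `split_numbers("movl 1234 , X1")` yields `"movl 1 2 3 4 , X1"`, and `fuse_digits("X1 = 1 2 3 4 ;")` yields `"X1 = 1234 ;"`.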
Alternative Method for Reducing Vocabulary Size We observe that, in terms of usage and semantic meaning, all numbers are equivalent (other than a very few specific numbers that hold special meaning, e.g. 0 and 1). Thus, as an alternative to splitting numbers into digits, we tried replacing all numbers with constants (e.g. N1, N2, ...). Similarly to variable names, the purpose of this replacement was to increase reuse of the relevant words while reducing the vocabulary. When applying these replacements to our input statements, we maintained a record of all applied replacements. After translation, we used this record to restore the original values in the output.

This approach worked well for unoptimized code, but failed on optimized code. In unoptimized code there is a direct correlation between constants in high-level and low-level code. That correlation allowed us to restore the values in the output. In optimized code, compiler optimizations and transformations break that correlation, making it impossible for us to restore the output based on the kept record.

4.3.2 Order Transformation
Most high-level programming languages write code in-order, i.e. an operator appears between its two operands. On the other hand, low-level programming languages, which are "closer" to the hardware, often use post-order, i.e. both operands appear before the operator.

The code in Fig. 5 demonstrates this difference. Fig. 5a shows a simple statement in C and Fig. 5c the x86 assembly obtained by compiling it. The different colors represent the correlation between the different parts of the computation.

Intuitively, if one were charged with the task of translating a statement, it would be helpful if both input and output shared the same order. Having a shared order simplifies "planning" the output by localizing the dependencies to some area of the input rather than spreading them across the entire input.

Similarly, NMT models often perform better when the source and target languages follow similar word orders, even though the model reads the entire input before generating any output. We therefore modify the structure of the C input statements to post-order to create a better correlation with the output. Fig. 5b shows the code obtained by canonicalizing the code in Fig. 5a. After translation, we can easily parse the generated post-order code using a simple bottom-up parser to obtain the corresponding in-order code.

(a) original C code
(b) post-order C code
(c) compiled x86 assembly
Figure 5. Example of code structure alignment

4.4 Evaluating Translations
We rely on the deterministic nature of compilation as the basis of this evaluation. After translating the inputs, for each pair of input i and corresponding translation t (i.e. the decompiled code), we recompile t and compare the output to i. This allows us to keep track of progress and success rates, even when the correct translation is not known in advance.

Comparing computation structure After the first step of our decompiler, the structure of the computation in the decompiled program should match that of the original program. We therefore compare the original program and the templated program from decompilation by comparing their program dependence graphs. We convert each code snippet to its corresponding Program Dependence Graph (PDG). The nodes of the graph are the different instructions in the snippet. The graph contains two types of edges: data dependency edges and control dependency edges. A data dependency edge from node n1 to node n2 means that n2 uses a value set by n1. A control dependency between n1 and n2 means that execution of n2 depends on the outcome of n1. Fig. 6b shows an example of a program dependence graph for the code in Fig. 6a. Solid arrows in the graph represent data dependencies between code lines and dashed arrows represent control dependencies. Since line 2 uses the variable x, which was defined in line 1, we have an arrow from 1 to 2. Similarly, line 8 uses the variable z, which can be defined in either line 4 or line 6. Therefore, line 8 has a data dependency on both line 4 and line 6. Furthermore, the execution of lines 4 and 6 is dependent on the outcome of line 3. This dependency is represented by the dashed arrows from 3 to 4 and from 3 to 6.

We extend the PDG with nodes "initializing" the different variables in the code. These nodes allow us to maintain a separation between the different variables.

We then search for an isomorphism between the two graphs, such that if nodes n and n′ are matched by the isomorphism it is guaranteed that either 1. both n and n′ correspond to variables, 2. both n and n′ correspond to numeric constants, or 3. n and n′ correspond to the
same operator (e.g. addition, subtraction, branching, etc.).

1: x = 3;
2: y = x * x;
3: if y % 2 == 0 then
4:     z = x + 5;
5: else
6:     z = x - 7;
7: end if
8: w = z * 2;

(a) Source code    (b) Program Dependence Graph (graph omitted; nodes 1, 2, 3, 4, 6, 8)

Figure 6. Example of Program Dependence Graph. Solid arrows for data dependencies, dashed arrows for control dependencies.

If such an isomorphism exists, we know that both code snippets implement the same computation structure. The snippets might still differ in the variables or numeric constants they use. However, the way the snippets use these variables and constants is equivalent in both snippets. Thus, if we could assign the correct variables and constants to the code, we would get an identical computation in both snippets. We consider translations that reach this point as a successful template and attempt to fill the template as described in Section 5. A translation is determined fully successful only if filling the template (Section 5) is also successful.

This kind of evaluation allows us to overcome instruction reordering, variable renaming, minor translation errors and small modifications to the code (often due to optimizations).

4.5 Stopping Decompilation
Our framework terminates the decompilation iterations when one of three conditions is met:
1. Sufficient results: given a percentage threshold p, after each iteration the framework checks the number of test samples that remain untranslated and stops when at least p% of the initial test set was successfully decompiled.
2. No more progress: the framework keeps track of the number of remaining test samples. When the framework detects that that number has not changed in x iterations, meaning no progress was made during these iterations, it terminates. Such cases highlight samples that are too difficult for our decompiler to handle.
3. Iteration limit: given some number n, we can terminate the decompilation process after n iterations have finished. This criterion is optional and can be left empty, in which case only the first two conditions apply.

4.6 Extending the Language
An important feature of our framework is that we can focus the training done in the first phase on language features exhibited by the input. Essentially, we can start by "learning" to decompile a subset of the high-level language.

Learning to decompile some subset s of the high-level language takes time and resources. Therefore, given a new input dataset, utilizing another subset of the language s', we would like to reuse what we have learned from s.

Because the vocabulary of s' is not necessarily contained in the vocabulary of s, i.e. vocab(s') ⊈ vocab(s), we have implemented a dynamic vocabulary extension mechanism in our framework. When the framework detects that the current vocabulary is not the same as the vocabulary used for previous training sessions, it creates a new model and partially initializes it using values from a previously trained model. This allows us to add support for new tokens in the language without starting from scratch.

Note that all tokens are equivalent in the eyes of the NMT model. Specifically, the model does not know that a variable is different from a number or an operator. It only learns a difference between the tokens from the different contexts in which they appear. Therefore, using this mechanism, we can extend the language supported by the decompiler with new operators, features and constructs, as needed. For example, starting from a subset of the language containing only arithmetic expressions, we can easily add if statements to the subset without losing any previous progress we've made while training on arithmetic expressions.

The extension mechanism is also used during training on a specific language subset. At each iteration, our framework generates new training samples to extend the existing training set. These new samples can, for example, contain new variables/numbers that weren't previously part of the vocabulary, thus requiring an extension of the vocabulary.

It is important to note that in a real-world use-case we don't expect training sessions to be frequent. Additional training should only be applied when dealing with new features, a new language or with relatively harder samples than previous samples. We expect the majority of decompilation problems to be solved using an existing model.
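The structural comparison of Section 4.4 can be sketched in a few lines. The following is a simplified illustration, not the paper's implementation: it handles only straight-line assignments (so the PDG contains data-dependency edges only), and it assumes the statements of the two snippets are already aligned in order, whereas the actual framework searches for a full graph isomorphism over both edge types. The helper names (`build_pdg`, `same_structure`) are made up for this sketch.

```python
import re

def build_pdg(stmts):
    """Build a data-dependency graph for straight-line assignments.

    Returns (edges, labels): edges is a set of (def_index, use_index)
    pairs; labels abstracts each right-hand side by renaming variables
    to V and numbers to N, mirroring the canonicalization step.
    """
    last_def = {}   # variable name -> index of the statement defining it
    edges = set()
    labels = []
    for i, stmt in enumerate(stmts):
        lhs, rhs = (part.strip() for part in stmt.split("=", 1))
        for tok in re.findall(r"[A-Za-z_]\w*", rhs):
            if tok in last_def:                 # data dependency: def -> use
                edges.add((last_def[tok], i))
        shape = re.sub(r"[A-Za-z_]\w*", "V", rhs)
        labels.append(re.sub(r"\d+", "N", shape).replace(" ", ""))
        last_def[lhs] = i
    return edges, labels

def same_structure(snippet_a, snippet_b):
    """Snippets match if dependency edges and abstracted shapes agree,
    regardless of the concrete variable names and constants used."""
    return build_pdg(snippet_a) == build_pdg(snippet_b)
```

For example, `same_structure(["x = 3", "y = x * x"], ["a = 7", "b = a * a"])` holds even though names and constants differ, while replacing the multiplication with an addition breaks the match.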
5 Filling the Template
In Section 4, we saw how the decompiler takes a low-level program and produces a high-level templated program, where some constant assignments require filling. In this section, we describe how to fill the parameters in the templated program.

5.1 Motivation
From our experimentation with applying NMT models to code, we learned that NMT performs well at generating correct code structure. We also learned that NMT has difficulties with constants and generating/predicting the right ones. This is exhibited by many cases where the proposed translation differs from an exact translation by only a numerical constant or a variable.

The use of word embeddings in NMT is a major contributor to these translation errors. A word embedding is essentially a summary of the different contexts in which that word appears. It is very common in NLP for identifying synonyms and other interchangeable words. For example, assume we have an NMT model for NLP which trains on the sentence "The house is blue". While training, the model will learn that different colors often appear in similar contexts. The model can then generalize what it has learned from "The house is blue" and apply it to the sentence "The house is green", which it has never encountered before. In practice, word embeddings are numerical vectors, and the distance between the embeddings of words that appear in similar contexts will be smaller than the distance between embeddings of words that do not appear in similar contexts. The model itself does not operate on the actual words provided by the user. It instead translates the input to embeddings and operates on those vectors.

Since we are dealing with code rather than natural languages, we have many more "interchangeable" words to handle. During training, all numerical values appear in the same contexts, resulting in very similar (if not identical) embeddings. Thus, the model is often unable to distinguish between different numbers. Therefore, while word embeddings are still useful for generalizing from training examples, using embeddings in our case results in translation errors when constants are involved.

Due to the above, we have decided to treat the output of the NMT model not as a final translation but as a template that needs filling. The 1st phase of our decompilation process verifies that the computation structure resulting from recompiling the translation matches that of the input. If that is the case, any differences are most likely the result of using incorrect constants. The 2nd phase of our decompilation process deals with correcting any such false constants.

Given that the computation structure of our translation and the input is the same, errors can only be found in variable names and numeric values. In the first phase, as part of comparing the computation structure, we also verify that there are no cases where a variable should have been a numeric value or vice versa. That means we can treat these two cases in isolation.

We note that since we are dealing with low-level languages, in which there are often no variable names to begin with, using the correct name is inconsequential. In the case of variables, all that matters is that for each variable in the input there exists a variable in the translation that is used in exactly the same manner. This requirement is already fulfilled by matching the computation structure (Section 4.4).

5.2 Finding assignments for constants
We focus on correcting errors resulting from using wrong numeric values. Denoting the input as i, the translation as t and the result of recompiling the translation as r, there are three questions that we need to address:

Which numbers in r need to change? And to which other numbers?  Since the NMT model was trained on code containing numeric values and constants, the generated translation also contains such values (generated directly by the model) and constants (due to the numeric abstraction step we describe in Section 4.3.1), replaced with their original values. We use these numbers as an initial suggestion as to which values should be used.

As explained in Section 4.4, we compare r and i by building their corresponding program dependence graphs and looking for an isomorphism between the graphs. If such an isomorphism is found, it essentially entails a mapping from nodes in one graph to nodes in the other. Using this mapping we can search for pairs of nodes nr and ni such that nr ∈ r is mapped to ni ∈ i, both nodes are numeric values, but nr != ni. Such nodes highlight which numbers need to be changed (nr) and to which other numbers (ni).

Which numbers in t affect which numbers in r?  Note that although we know that n ∈ r is wrong and needs to be fixed, we cannot apply the fix directly. Instead we need to apply a fix to t that will result in the desired fix to r. The first step towards achieving that is to create a mapping from numbers in t to numbers in r such that changing nt ∈ t results in a change to nr ∈ r.

By making small controlled changes to t we can observe how r is changed. We find some number nt ∈ t, replace it with nt', resulting in t', and recompile it to get r'. We then compare r and r' to verify that the change we made maintains the same low-level computation structure. If that is the case, we identify all numbers nr ∈ r that were changed and record those as affected by nt.
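The controlled-change probe just described can be sketched as follows. This is an illustrative approximation, not TraFix itself: `compile_fn` stands in for the real compiler interface, literals are nudged by 1 to discover which output numbers each translation literal influences, and all names here are hypothetical. The toy `fold_compile` "compiler" exists purely to make the sketch executable.

```python
import re

def numbers(code):
    """All numeric literals in a piece of code, in order of appearance."""
    return [int(m) for m in re.findall(r"\d+", code)]

def influence_map(t, compile_fn):
    """Map each literal position in translation t to the positions of the
    numbers it affects in the recompiled output r (the Section 5.2 probe)."""
    base = numbers(compile_fn(t))
    influence = {}
    for k, m in enumerate(re.finditer(r"\d+", t)):
        # small controlled change: nudge one literal, recompile, diff
        t_prime = t[:m.start()] + str(int(m.group()) + 1) + t[m.end():]
        probed = numbers(compile_fn(t_prime))
        if len(probed) == len(base):   # computation structure preserved
            influence[k] = [j for j, (a, b) in enumerate(zip(base, probed))
                            if a != b]
    return influence

def fold_compile(stmt):
    """Toy 'compiler' that constant-folds the right-hand side."""
    lhs, rhs = stmt.split("=", 1)
    return lhs + "= " + str(eval(rhs))  # eval is fine for this toy example
```

With an identity "compiler" each literal maps to itself; with `fold_compile`, both literals of `x = 5 + 7` map to the single folded output number — exactly the situation in which a fix to r cannot simply be copied into t.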
How do we enact the right changes in t?  At this point we know which number nt ∈ t we should change and we know the target value ni we want to have instead of nr ∈ r. All we need to determine now is how to correctly modify nt to end up with ni.

The simple case is such that nt == nr, which means whatever number we put in t is copied directly to r, and thus we simply need to replace nt with ni.

However, due to optimizations (some applied even when using -O0), numbers are not always copied as is. Following are three examples we encountered in our work with x86 assembly.

Replacing numbers in conditions  Assuming x is a variable of type int, given the code if (x >= 5), it is compiled to assembly equivalent to if (x > 4), which is semantically identical but is slightly more efficient.

Division/multiplication by powers of 2  These operations are often replaced with semantically equivalent shift operations. For example, division by 8 would be compiled as shift right by 3.

Implementing division using multiplication  Since division is usually considered the most expensive operation to execute, when the divisor is known at compilation time, it is more efficient to implement the division using a sequence of multiplication and shift operations. For example, calculating x/3 can be done as (x * 1431655766) >> 32 because 1431655766 ≈ 2^32/3.

We identified a set of common patterns used to make such optimizations in common compilers. Using these patterns, we generate candidate replacements for nt. We test each replacement by applying it to t, recompiling and checking whether the affected values nr ∈ r are now equal to their ni ∈ i counterparts.

We declare a translation as successful only if an appropriate fix can be found for all incorrect numeric values and constants.

6 Evaluation
In this section we describe the evaluation of our decompilation technique and present our results.

6.1 Implementation
We implemented our technique in a framework called TraFix. Our framework takes as input an implementation of our compiler interface and uses it to build a decompiler. The resulting decompiler takes as input a set of sentences in a low-level language Llow, translates the sentences and outputs a corresponding set of sentences in a high-level language Lhigh, specifically C in our implementation. Each sentence represents a sequence of statements in the relevant language.

Our implementation uses the NMT implementation provided by DyNmt [6] with slight modifications. DyNmt implements the standard encoder-decoder model for NMT using DyNet [31], a dynamic neural network toolkit.

Compiler Interface  The compiler interface consists of a set of methods encapsulating usage of the compiler and representation-specific information (e.g. how does the compiler represent numbers in the assembly?). The core of the API consists of: (1) a compile method that takes a sequence of C statements and returns the sequence of statements in Llow resulting from compiling it (the returned code is "cleaned up" by removing parts of it that don't contribute any useful information); and (2) an Instruction class that describes the effects of different instructions, which is used for building a PDG during translation evaluation (Section 4.4).

We implemented such compiler interfaces for compilation (1) from C to LLVM IR, and (2) from C to x86 assembly. Fig. 7 shows the result of compiling the simple C statement of Fig. 7a using both compilers.

X0 = X1 + X2;
(a) C code

%1 = load i32, i32* @X1
%2 = load i32, i32* @X2
%3 = add i32 %1, %2
store i32 %3, i32* @X0
(b) LLVM IR

movl X1, %edx
movl X2, %eax
addl %edx, %eax
movl %eax, X0
(c) x86 assembly

Figure 7. Example of code structure alignment

6.2 Benchmarks
We evaluate TraFix using random C snippets sampled from a subset of the C programming language. Each snippet is a sequence of statements, where each statement is either an assignment of an expression to a variable, an if condition (with or without an else branch), or a while loop. Expressions consist of numbers, variables, binary operators and unary operators. If and while statements are composed using a condition – a relational operator between two expressions – and a sequence of statements which serves as the body. We limit each sequence of statements to at most 5. Table 1 provides the formal grammar from which the benchmarks are sampled.

All of our benchmarks were compiled using the compiler's default optimizations. Working on optimized code
introduces several challenges, as mentioned in Section 5.2, but is crucial for the practicality of our approach. Note that we didn't strip the code after compilation. However, our "original" C code that we compile is already essentially stripped, since our canonicalization step abstracts all names in the code.

Statements  := Statement | Statements Statement
Statement   := Assignment | Branch | Loop
Assignments := Assignment | Assignments Assignment
Assignment  := Var = Expr ;
Var         := ID
Expr        := Var | Number | BinaryExpr | UnaryExpr
UnaryExpr   := UnaryOp Var | Var UnaryOp
UnaryOp     := ++ | --
BinaryExpr  := Expr BinaryOp Expr
BinaryOp    := + | - | * | / | %
Branch      := if ( Condition ) { Statements } |
               if ( Condition ) { Statements } else { Statements }
Loop        := while ( Condition ) { Statements }
Condition   := Expr Relation Expr
Relation    := > | >= | < | <= | == | !=

Table 1. Grammar for experiments. Terminals are underlined.

During benchmark generation we make sure that there is no overlap between the Training dataset, Validation dataset and our Test dataset (used as input statements to the decompiler).

Evaluating Benchmarks  Despite holding the ground-truth for our test set (the C used to generate the set), we decided not to compare the decompiled code to the ground-truth. We observe that, in some cases, different C statements could be compiled to the same low-level code (e.g. the statements x = x + 1 and x++). We decided to evaluate in a manner that allows for such occurrences and is closer to what would be applied in a real use-case. We thus opted to evaluate our benchmarks by recompiling the decompiled code and comparing it against the input, as described in Section 4.4.

6.3 Experimental Design and Setup
We ran several experiments of TraFix. For each experiment we generated 2,000 random statements to be used as the test set. TraFix was configured to generate an initial set of 10,000 training samples and an additional 5,000 training samples at each iteration. An additional 1,000 random samples served as the validation set. There is no overlap between the test set and the training/validation sets. We decided, at each iteration, to drop half of the training samples from the previous iteration. This serves to limit the growth of the training set (and thus the training time), and assigns a higher weight to samples obtained through recent failures compared to older samples. Each iteration was limited to 2,000 epochs. In practice, our experiments never reached this limit. No iteration of our experiments with LLVM and x86 exceeded 140 epochs (and no more than 100 epochs when excluding the first iteration). For each test input we generated 5 possible translations using beam-search. We stopped each experiment when it had successfully translated over 95% of the test statements or when no progress was made for the last 10 iterations.

Recall that the validation set is periodically translated during training and used to evaluate training progress. TraFix is capable of stopping a training session early (before the epoch limit is reached) if no progress was observed in the last consecutive k validation sessions. Intuitively, this process detects when the model has reached a stable state close enough to the optimal state that can be reached on the current training set. In our experiments a validation session is triggered after processing 1,000 batches of training samples (each batch containing 32 samples) and k was set to 10. All training sessions were stopped early, before reaching the epochs limit.

The NMT model consists of a single layer each for the encoder and decoder. Each layer consists of 100 nodes and the word embedding size was set to 300.

We ran our experiments on Amazon AWS instances. Each instance is of type r5a.2xlarge – a Linux machine with 8 Intel Xeon Platinum 8175M processors, each operating at 2.5GHz, and 64GiB of RAM, running Ubuntu 16.04 with GCC [1] version 5.4.0 and Clang [4] version 3.8.0.

We executed our experiments as a single process using only a single CPU, without utilizing a GPU, in order to mimic the scenario of running the decompiler on an end-user's machine. This configuration highlights the applicability of our approach such that it can be used by many users without requiring specialized hardware.

6.4 Results
6.4.1 Estimating Problem Hardness
As a measure of problem complexity, we first evaluated our decompiler on several different subsets of C using only a single iteration. The purpose of these measurements is to estimate how difficult a specific grammar is going to be for our decompiler.

We used 8 different grammars for these measurements. Each grammar builds upon the previous one, meaning that grammar i contains everything in grammar i − 1 and adds a new grammar feature (the only exception is grammar 4, which does not contain unary operators). The grammars are:
1. Only assignments of numbers to variables
2. Assignments of variables to variables
3. Computations involving unary operators
4. Computations involving binary operators
5. Computations involving both operator types
6. If branches
7. While loops
8. Nested branches and loops

Fig. 8 shows the success rate, i.e. the percentage of successfully decompiled inputs, for the different grammars, of decompiling x86 assembly with and without compiler optimizations. Note that the measured success rates are after only a single iteration of our decompilation algorithm (Section 4.1).

[Chart omitted.] Figure 8. Success rate of x86 decompiler after a single iteration on various grammars, with compiler optimizations enabled and disabled.

As can be expected, the success rate drops as the complexity of the grammar increases. That means that for more complicated grammars, our decompiler will require more iterations and/or more training data to reach the same performance level as on simpler grammars.

As can also be expected, and as can be observed from the figure, decompiling optimized code is a slightly more difficult problem for our decompiler compared to unoptimized code. Although optimizations reduce our success rate by a few percent (at most 5% in our experiments), it seems that the decisive factor for the hardness of the decompilation problem is the grammar complexity, not optimizations.

Recall that, given a compiler, our framework learns the inverse of that compiler. That means that, in the eyes of the decompiler, optimizations are "transparent". Optimizations only cause the decompiler to learn more complex patterns than it would have learned without optimizations, but don't increase the number of patterns learned nor the vocabulary handled. Grammar complexity, on the other hand, increases both the number and complexity of the patterns the decompiler needs to learn and handle, and the vocabulary size, thus making the decompilation task much harder to learn.

We emphasize that enabling/disabling compiler optimizations in our framework required no changes to the framework. The only change necessary was adding the appropriate flags in the compiler interface.

6.4.2 Iterative Decompilation
In our second set of experiments we allowed each experiment to execute iteratively to observe the effects of multiple iterations on our decompilation success rates.

We implemented and evaluated 2 instances of our framework: from LLVM IR to C, and from x86 assembly to C.

We ran each experiment 5 times using the configuration described in Section 6.3. We allowed each experiment to run until it reached either a success rate of 95% or 6 iterations. The results reported below are averaged over all 5 experiments.

Decompiling LLVM IR  Out of the 5 experiments we conducted using our LLVM IR instance, 3 reached the goal of 95% success rate after a single iteration. The other 2 experiments required one additional iteration to reach that goal. Table 2 reports average statistics for these two iterations. The columns epochs, train time and translate time report averages for each iteration (i.e. an average of measurements from 5 experiments for the 1st iteration and from only 2 experiments for the 2nd iteration). The successful translations column reports the overall success rate, not just the successes in that specific iteration.

#   epochs   train time   translate time   successful translations
1   75.6     14:16        03:25            1913.6 (95.68%)
2   76.5     14:11        00:42            1940.2 (97.01%)

Table 2. Statistics of iterative experiments of LLVM IR

The statistics in the table demonstrate that our LLVM decompiler performed exceptionally well, even though it was decompiling optimized code snippets (which are traditionally considered harder to handle).

On average, our LLVM experiments successfully decompiled 97% of the benchmarks before autonomously terminating. These include benchmarks consisting of up to 845 input tokens and 286 output tokens. We intentionally set the goal lower than 100%. Setting it higher than 95% and allowing our instances to run for further iterations would take longer but would also lead to a higher overall success rate.

The timing measurements reported in the table highlight that the majority of execution time is spent on training the NMT model. Translation is very fast, taking only a few seconds per input, as witnessed by the first iteration. The execution time of our translation evaluation (including parsing each translation into a PDG, comparing with the input PDG, and attempting to fill the templates correlating to the translations) is extremely
low, taking only a couple of minutes for the entire set of benchmarks.

These observations are important due to the expected operating scenario of our decompiler. We expect the majority of inputs to be resolved using a previously trained model. Retraining an NMT model should be done only when the language grammar is extended or when significantly difficult inputs are provided. Thus, in normal operation, the execution time of the decompiler, consisting of only translation and evaluation, will be mere seconds.

Decompiling x86 Assembly  Table 3 provides statistics of our x86 experiments. All of these experiments terminated when they reached the iterations limit, which was set to 6.

#   epochs   train time   translate time   successful translations
1   86.0     15:58        03:46            1470.8 (73.54%)
2   58.2     15:55        01:59            1614.2 (80.71%)
3   51.4     14:47        01:38            1683.2 (84.16%)
4   51.4     14:07        01:26            1721.0 (86.05%)
5   65.8     17:28        01:18            1745.6 (87.28%)
6   63.4     16:38        01:14            1762.4 (88.12%)

Table 3. Statistics of iterative experiments of x86 assembly

Fig. 9 visualizes the successful translations column. The figure plots our average success rate as a function of the number of completed iterations. It is evident that with each iteration the success rate increases, eventually reaching over 88% after 6 iterations. Overall, our decompiler successfully handled samples of up to 668 input tokens and 177 output tokens.

[Chart omitted.] Figure 9. Cumulative success rate of x86 decompiler as a function of how many iterations the decompiler performed.

Our decompilation success rates on x86 were lower than those on LLVM, terminating at around 88%. This correlates with the nature of x86 assembly, which has a smaller vocabulary than that of LLVM IR. The smaller vocabulary shortens overall training times, but also results in longer dependencies and meaningful patterns that are harder to deduce and learn.

We note that, in the case of a traditional decompiler, bridging the remaining gap of 12% failure rate would require a team of developers crafting additional rules and patterns. Using our technique this can be achieved by allowing the decompiler to train longer and on more training data.

7 Discussion
7.1 Limitations
Manual examination of our results from Section 6.4 revealed that currently our main limitation is input length. There was no threshold such that inputs longer than the threshold would definitely fail. We observed both successful and failed long inputs, often of the same length. We did, however, observe a correlation between input length and a reduced success rate. As the length of an input increases, it becomes more likely to fail.

We found no other outstanding distinguishing features, in the code structure or used vocabulary, that we could claim are a consistent cause of failures.

This limitation stems from the NMT model we used. Long inputs are a known challenge for existing NMT systems [26]. NMT for natural languages is usually limited to roughly 60 words [26]. Due to the nature of code (i.e. limited vocabulary, limited structure) we can handle inputs much longer than typical natural language sentences (668 words for x86 and 845 words for LLVM). Regardless, this challenge also applies to us, resulting in poorer results when handling longer inputs. As the field of NMT evolves to better handle long inputs, so would our results improve.

To verify that this limitation is not due to our specific implementation, we created another variant of our framework. This new variant is based on TensorFlow [5, 8] rather than DyNet. Experimenting with this variant, we got similar results to those reported in Section 6.4, and ultimately reached the same conclusion — the observed limitation on input length is inherent to using NMT.

7.1.1 Other Decompilation Failures
Though we do not consider this a limitation, another aspect that could be improved is our template filling phase (Section 5). Our manual analysis identified some possibilities for improving this second phase.

The first type of failure we have observed is the result of constant folding – a compiler optimization that replaces computations involving only constants with their results. Fig. 10 demonstrates this kind of failure. Given the C code in Fig. 10a, the compiler determines that 63 * 5
X3 = 63 * ( 5 * X1 ) ; X2 = 48 + (X5 * (X14 * 66));
(a) High-level code (a) High-level code

movl X1 , %eax X2 = ((N8 * X14) * X5) - N4;


imull 315 , %eax , %eax (b) Suggested decompilation
movl %eax , X3
(b) Low-level code Figure 12. Failure due to incorrect operator

X3 = ( X1 * 43 ) * 70 ;
(c) Suggested decompilation theorem prover based template filling algorithm could
detect that and assign the appropriate values to the
Figure 10. Example of decompilation failure constants, including N 11, resulting in equivalent code.
Fig. 12 shows another kind of failure. In this example
the difference between the expected output and suggested
X2 = ((X0 % 40) * 63) / ((98 - X1) - X0);

(a) High-level code

X2 = ((X0 % N3) * N13) / (((N2 - X1) + N11) - X0);

(b) Suggested decompilation

Figure 11. Failure due to redundant number

can be replaced with 315. Therefore, the x86 assembly in Fig. 10b contains the constant 315. Using the code of Fig. 10b as input, our decompiler suggests the C code in Fig. 10c.
Note that the decompiler suggested code that is identical in structure to the input. The first phase of our decompiler handled this example correctly, resulting in a matching code template. The failure occurred in the second phase, in which we were unable to find the appropriate numerical values. This failure occurs because our current implementation attempts to find a value for each number independently of the other numbers in the code. Essentially, this resulted in floating-point numbers, which were deemed unacceptable by the decompiler because our benchmarks use only integers.
This kind of failure can be mitigated by either (1) applying constant folding to the high-level decompiled code, (2) allowing the template to be filled with floating-point numbers (which was disabled since the benchmarks contained only integers), or (3) encoding the code as constraints and using a theorem prover to find appropriate assignments to constants.
A similar example is found in Fig. 11. We left the suggested translation in this example as constants to simplify the example. One can see that the suggested translation in Fig. 11b is structurally identical to the expected output in Fig. 11a, up to the addition of N11. This example was not considered a matching code template by our implementation, because any value for N11 other than 0 results in a different computation structure. However, if N11 = 0, we get an exact match between the suggested translation and the expected output. Using a theorem prover, as suggested above, we could check whether such an assignment of the template parameters exists. A related failure occurs when the only difference between the expected code and the suggested translation is a + that was replaced with −. Currently only variable names and numeric constants are treated as template parameters. This kind of difference can be overcome by considering operators as template parameters as well. Since the number of options for each operator type (unary, binary) is extremely small, we could try all options for filling these template parameters.

7.2 Framework Tradeoffs

There are a few tradeoffs that should be taken into account when using our decompilation framework:
• Iterations limit – Applying an iterations limit allows trading off decompilation success rates for shorter decompilation times, and would make sense in environments with limited resources (time, budget, etc.). On the other hand, setting the limit too low will prevent the decompiler from reaching its full potential and will result in a low rate of successful translations.
• Training set size – In our experiments we initialized the training set to 10,000 random samples and generated an additional 5,000 new random samples each iteration. As the training set grows, so do the training time and memory consumption. Using too many initial training samples would be wasteful in the case of relatively simple test samples, for which a shorter training session, with fewer training samples, might suffice. On the other hand, using too few samples would result in many training sessions when dealing with harder test samples. This is also applicable when setting the number of random samples added at each iteration. Furthermore, rather than always generating a constant number of samples, one can dynamically decide the number of samples to generate based on some measure of progress (e.g., generate fewer samples when progressing at a higher rate).
• Patience – The patience parameter determines how many iterations to wait before terminating due to not observing any progress. Setting this parameter
too high would result in wasted time. This is because any training performed since the last time we observed progress would essentially have been in vain. On the other hand, it is possible for the model to make no progress for a few iterations only to resume progressing once it generates the training samples it needs. Setting the patience parameter too low might cause the decompiler to stop before it can reach its full potential.

7.3 Extracting Rules
As mentioned in Section 1, traditional decompilers rely heavily on pattern matching. Development of such decompilers depends on hand-crafted rules and patterns, designed by experts to detect specific control-flow structures. Hand-crafting rules is slow, expensive and cumbersome. We observe that the successful decompilations produced by our decompiler can be re-templatized to form rules that can be used by traditional decompilers, thus simplifying traditional decompiler development. Appendix A provides examples of such rules.

7.4 Evaluating Readability
Measuring the readability of our translations requires a user study, which we did not perform. However, note that given some training set, a model trained on that set will generate code that is similar to what it was trained on. Thus, the readability of our translations stems from the readability of our training samples. Our translations are as readable as the training samples we generated. This was also verified by an empirical sampling of our results. Therefore, given readable code as training samples, we can surmise that any decompiled code we generate and output will also be readable.

8 Related Work
Decompilation The Hex-Rays decompiler [2] was considered the state of the art in decompilation, and is still considered the de-facto industry standard. Schwartz et al. [34] presented the Phoenix decompiler, which improved upon Hex-Rays using new analysis techniques and iterative refinement, but was still unable to guarantee goto-free code (since goto instructions are rarely used in practice, they should not be part of the decompiler output). Yakdan et al. [36, 37] introduced Dream and its successor Dream++, taking a significant step forward by guaranteeing goto-free code. RetDec [7], short for Retargetable Decompiler, is an open-source decompiler released in December 2017 by Avast, aiming to be the first "generic" decompiler capable of supporting many architectures, languages, ABIs, etc.
While previous work made significant improvements to decompilation, all previous work falls under the title of rule-based decompilers. Rule-based decompilers require manually written rules and patterns to detect known control-flow structures. These rules are very hard to develop, prone to errors, and usually only capture part of the known control-flow structures. According to data published by Avast, it took a team of 24 developers 7 years to develop RetDec. This data emphasizes that traditional decompiler development is extremely difficult and time consuming, supporting our claim that the future of decompilers lies in approaches that can avoid this step. Our technique removes the burden of rule writing from the developer, replacing it with an automatic, neural-network-based approach that can autonomously extract relevant patterns from the data.
Katz et al. [23] suggested the first technique to use NMT for decompilation. While they set out to solve the same problem, in practice they provide a solution to a different and significantly easier problem: producing source-level code that is readable, without any guarantee of equivalence, semantic or even syntactic. Further, the code they generate is not guaranteed to compile (and does not in practice). Because their code neither compiles nor is equivalent, if we apply our evaluation criteria to their results, their accuracy would be at most 3.8%. Further, beyond the cardinal difference in the problem itself, they have the following limitations:
• They can only operate on code compiled with a special version of Clang which they modified for their purposes.
• All of their benchmarks are compiled without optimizations. We apply the compiler's default optimizations to all of our benchmarks.
• They limit their input to 112 tokens and output to 88 tokens. This limits their input to single statements. We successfully decompiled x86 benchmarks of up to 668 input tokens and 177 output tokens. Each of our samples contains several statements.
• Their methodology is flawed as they do not control for overlaps between the training and test sets. We verify that there is no such overlap in our sets.
Modeling Source Code Modeling source code using various statistical models has seen a lot of interest for various applications.
Iyer et al. [21] used LSTMs to generate natural language descriptions for C# source code snippets and SQL queries. Allamanis et al. [11] generated descriptions for Java source code using convolutional neural networks with attention. Hu et al. [20] tackled the same problem using neural networks with a structure-based traversal of the abstract syntax tree, aimed at better representing the structure of the code. Loyola et al. [28] took a similar approach for generating descriptions of changes in
source code, i.e., translating commits to source code repositories into commit messages. The success presented by these papers highlights that neural networks are useful for summarizing code, and supports the use of neural networks for decompilation.
Another application of source code modeling is predicting names for variables, methods and classes. Raychev et al. [33] used conditional random fields (CRFs) to predict variable names in obfuscated JavaScript code. He et al. [17] also used CRFs, but for the purpose of predicting debug information in stripped binaries, focusing on names and types of variables. Allamanis et al. [9] used neural language models to predict variable, method and class names, relying on word embeddings to determine semantically similar names. We consider this problem as orthogonal to our own. Given semantically equivalent source code produced by our decompiler, these techniques could be used to supplement it with variable names, etc.
Chen et al. [15] used neural networks to translate code between high-level programming languages. This problem resembles that of decompilation, but is in fact simpler. Translating low-level languages to high-level languages, as we do, is more challenging: the similarities between high-level languages are more prevalent than between high-level and low-level languages. Furthermore, translating source code to source code directly bypasses many challenges added by compilation and optimizations.
Levy and Wolf [27] used neural networks to predict the alignment between source code and compiled object code. Their results can be useful in improving our second phase, i.e., filling the template and correcting errors. Specifically, their alignment prediction can be utilized to pinpoint locations in the source code that lead to errors.
Katz et al. [24, 25] used statistical language models for modeling binary code and aspects of program structure. Based on a combination of static analysis and simple statistical language models, they predict targets of virtual function calls [24] and inheritance relations between types [25]. Their work further highlights that these techniques can deduce high-level information from the low-level representation in binaries.

9 Conclusion
We address the problem of decompilation: converting low-level code to high-level, human-readable source code. Decompilation is extremely useful to security researchers, as the cost of finding vulnerabilities and understanding malware drastically drops when source code is available.
A major problem of traditional decompilers is that they are rule-based. This means that experts are needed for hand-crafting the rules and patterns used for detecting control-flow structures and idioms in low-level code and lifting them to source level. As a result, decompiler development is very costly.
We presented a new approach to the decompilation problem. We base our decompiler framework on neural machine translation. Given a compiler, our framework automatically learns a decompiler from it. We implemented an instance of our framework for decompiling LLVM IR and x86 assembly to C. We evaluated these instances on randomly generated inputs with high success rates.

References
[1] 1987. GCC, the GNU Compiler Collection. https://gcc.gnu.org/.
[2] 1998. The IDA Pro disassembler and debugger. http://www.hex-rays.com/idapro/.
[3] 2002. The LLVM compiler infrastructure project. http://llvm.org.
[4] 2007. clang: a C language family frontend for LLVM. https://clang.llvm.org/.
[5] 2015. TensorFlow. https://www.tensorflow.org.
[6] 2017. DyNMT, a DyNet based neural machine translation. https://github.com/roeeaharoni/dynmt-py.
[7] 2017. Retargetable Decompiler. https://retdec.com/.
[8] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. (2016).
[9] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering.
[10] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering.
[11] Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. CoRR (2016). http://arxiv.org/abs/1602.03001
[12] Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37.
[13] Matthew Amodio, Swarat Chaudhuri, and Thomas W. Reps. 2017. Neural Attribute Machines for Program Generation. CoRR (2017). http://arxiv.org/abs/1705.09231
[14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR (2014).
[15] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree Neural Networks for Program Translation. CoRR (2018).
[16] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. CoRR (2014).
[17] Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, and Martin Vechev. 2018. Debin: Predicting Debug Information in Stripped Binaries. In Proceedings of the ACM SIGSAC
Conference on Computer and Communications Security.
[18] Matthew Henderson, Blaise Thomson, and Steve J. Young. 2014. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. IEEE Spoken Language Technology Workshop (2014).
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation (1997).
[20] Xing Hu, Yuhan Wei, Ge Li, and Zhi Jin. 2017. CodeSum: Translate Program Language to Natural Language. CoRR (2017). http://arxiv.org/abs/1708.01837
[21] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[22] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[23] D. S. Katz, J. Ruchti, and E. Schulte. 2018. Using recurrent neural networks for decompilation. In IEEE 25th International Conference on Software Analysis, Evolution and Reengineering.
[24] Omer Katz, Ran El-Yaniv, and Eran Yahav. 2016. Estimating Types in Binaries Using Predictive Modeling. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.
[25] Omer Katz, Noam Rinetzky, and Eran Yahav. 2018. Statistical Reconstruction of Class Hierarchies in Binaries. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems.
[26] Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. (2017).
[27] Dor Levy and Lior Wolf. 2017. Learning to Align the Source Code to the Compiled Object Code. In Proceedings of the 34th International Conference on Machine Learning.
[28] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. CoRR (2017). http://arxiv.org/abs/1704.04856
[29] Chris J. Maddison and Daniel Tarlow. 2014. Structured Generative Models of Natural Source Code. CoRR (2014). http://arxiv.org/abs/1401.0514
[30] Graham Neubig. 2017. Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. CoRR (2017).
[31] Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980 (2017).
[32] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Proceedings of the 42nd Annual Symposium on Principles of Programming Languages (POPL '15).
[33] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.
[34] Edward J. Schwartz, JongHyup Lee, Maverick Woo, and David Brumley. 2013. Native x86 Decompilation Using Semantics-preserving Structural Analysis and Iterative Control-flow Structuring. In Proceedings of the 22nd USENIX Conference on Security.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems.
[36] K. Yakdan, S. Dechand, E. Gerhards-Padilla, and M. Smith. 2016. Helping Johnny to Analyze Malware: A Usability-Optimized Decompiler and Malware Analysis User Study. In IEEE Symposium on Security and Privacy (SP).
[37] Khaled Yakdan, Sebastian Eschweiler, Elmar Gerhards-Padilla, and Matthew Smith. 2015. No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations. In 22nd Annual Network and Distributed System Security Symposium, NDSS.

A Extracting Decompilation Rules
Table 4 contains examples of decompilation rules extracted from our decompiler. For brevity, we present mostly relatively simple rules, but longer and more complicated rules were also found by our decompiler (examples of such rules appear at the bottom of the table, below the separating line).

input | output
movl X1, eax ; addl N1, eax ; movl eax, X2 ; | X2 = N1 + X1 ;
movl X1, eax ; subl N1, eax ; movl eax, X2 ; | X2 = X1 - N1 ;
movl X1, eax ; imull N1, eax, eax ; movl eax, X2 ; | X2 = X1 * N1 ;
movl X1, ecx ; movl N1, eax ; idivl ecx ; movl eax, X2 ; | X2 = N1 / X1 ;
movl X1, eax ; movl X2, ecx ; idivl ecx ; movl eax, X3 ; | X3 = X1 / X2 ;
movl X1, eax ; sall N1, eax ; movl eax, X2 ; | X2 = X1 * 2^N1 ;
movl X1, ecx ; movl N1, eax ; idivl ecx ; movl edx, eax ; movl eax, X2 ; | X2 = N1 % X1 ;
movl X1, eax ; movl X2, ecx ; idivl ecx ; movl edx, eax ; movl eax, X3 ; | X3 = X1 % X2 ;
movl X1, eax ; leal 1(eax), edx ; movl edx, X1 ; movl eax, X2 ; | X2 = X1++ ;
movl X1, eax ; leal -1(eax), edx ; movl edx, X1 ; movl eax, X2 ; | X2 = X1-- ;
movl X1, eax ; addl 1, eax ; movl eax, X1 ; movl X1, eax ; movl eax, X2 ; | X2 = ++X1 ;
movl X1, eax ; imull N1, eax, eax ; addl N2, eax ; movl eax, X2 ; | X2 = N2 + (N1 * X1) ;
movl X1, eax ; addl N1, eax ; sall N2, eax ; movl eax, X2 ; | X2 = (X1 + N1) * 2^N2 ;
movl X1, eax ; imull N1, eax, ecx ; movl N2, eax ; idivl ecx ; movl eax, X2 ; | X2 = N2 / (X1 * N1) ;
movl X1, eax ; cmpl N1, eax ; jg .L0 ; movl N2, X2 ; .L0: ; | if (X1 < (N1 + 1)) { X2 = N2 ; }
jmp .L1 ; .L0: ; movl N1, X1 ; .L1: ; movl X2, eax ; cmpl N2, eax ; jg .L0 ; | while (X2 > N2) { X1 = N1 ; }
jmp .L1 ; .L0: ; movl N1, X1 ; .L1: ; movl X2, eax ; cmpl N2, eax ; jne .L0 ; | while (N2 != X2) { X1 = N1 ; }
movl X1, eax ; cmpl N1, eax ; jne .L0 ; movl N2, X2 ; movl X3, eax ; movl eax, X4 ; .L0: ; | if (N1 == X1) { X2 = N2 ; X4 = X3 ; }
movl X1, edx ; movl X2, eax ; cmpl eax, edx ; jg .L0 ; movl N1, X3 ; jmp .L1 ; .L0: ; movl N2, X4 ; .L1: ; | if (X1 <= X2) { X3 = N1 ; } else { X4 = N2 ; }
----------------------------------------------------------------------------------------------------
jmp .L1 ; .L0: ; movl X1, eax ; addl N1, eax ; movl eax, X2 ; .L1: ; movl X3, eax ; cmpl N2, eax ; jle .L0 ; | while (X3 <= N2) { X2 = N1 + X1 ; }
jmp .L1 ; .L0: ; movl X2, eax ; addl 1, eax ; movl eax, X2 ; movl X2, edx ; movl X2, eax ; addl edx, ea... | while ((X1 - N1) > (X2 % (X2 - N2))) { X3 = (++X2) + X2 ; ...
movl X1, eax ; addl 1, eax ; movl eax, X1 ; movl X1, edx ; movl X2, eax ; movl N1, ecx ; subl eax, ecx... | if (++X1 == (((X2 * (N1 - X2)) - N2) * (N3 - X3))) { X2 = ...
movl X3, edx ; movl X4, eax ; addl edx, eax ; movl X4, ecx ; movl X5, edx ; addl edx, ecx ; idivl ecx ; ... | X1 = X2 * ((X3 + X4) % (X4 + X5)) ; X6 = (X7 + X9) / ((N1 - ...
movl N1, X1 ; movl X1, eax ; movl eax, X2 ; movl X2, eax ; movl X3, edx ; addl N3, edx ; subl edx, ea... | X1 = N1 ; X2 = X1 ; if ((N2 + (X2 - (X3 + N3))) <= X4) { X ...
jmp .L1 ; .L0: ; movl X1, ebx ; movl N3, eax ; idivl ebx ; movl eax, X1 ; .L1: ; movl X1, edx ; movl X2... | while ((X1 * X2) >= (N1 % (X3 + N2))) { X1 = N3 / X1 ; } ; X4 ...

Table 4. Decompilation rules extracted from TraFix
