Towards Neural Decompilation
Abstract
We address the problem of automatic decompilation, converting a program in low-level representation back to a higher-level human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates is much easier when done over source code.
The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high.
We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T, our approach automatically trains a decompiler that can translate (decompile) T back to S. We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks, respectively.

Figure 1. Example input (a) and output (b) of decompilation.

1 Introduction
Given a low-level program in binary form or in some intermediate representation, decompilation is the task of lifting that program to human-readable high-level source code.
Fig. 1 provides a high-level example of decompilation. The input to the decompilation task is a low-level code snippet, such as the one in Fig. 1(a). The goal of decompilation is to generate a corresponding equivalent high-level code. The C code snippet of Fig. 1(b) is the desired output for Fig. 1(a).
There are many uses for decompilation. The most common is for security purposes. Searching for software vulnerabilities and analyzing malware both start with understanding the low-level code comprising the program. Currently this is done manually by reverse engineering the program. Reverse engineering is a slow and tedious process by which a specialist tries to understand what a program does and how it does it. Decompilation can greatly improve this process by translating the binary code to a more readable, higher-level code.
Decompilation has many applications beyond security. For example, porting a program to a new hardware architecture or operating system is easier when source code is available and can be compiled to the new environment. Decompilation also opens the door to application of source-level analysis and optimization tools.

Existing Decompilers  Existing decompilers, such as Hex-Rays [2] and Phoenix [34], rely on pattern matching to identify the high-level control-flow structure in a program. These decompilers try to match segments of a program's control-flow graph (CFG) to some patterns known to originate from certain control-flow structures (e.g. if-then-else or loops). This approach often fails when faced with non-trivial code, and uses goto statements to emulate the control-flow of the binary code. The resulting code is often low-level, and is really assembly transliterated into C (e.g. assigning variables to temporary values/registers, using gotos, and using low-level operations rather than high-level constructs provided by the language). While it is usually semantically equivalent to the original binary code, it is hard to read, and in some cases less efficient, prohibiting recompilation of the decompiled code.
There are goto-free decompilers, such as DREAM++ [36, 37], that can decompile code without resorting to using gotos in the generated code. However, all existing decompilers, even goto-free ones, are based on hand-crafted rules designed by experts, making decompiler development slow and costly.
Even if a decompiler from a low-level language Llow to a high-level language Lhigh exists, given a new language L'high, it is nontrivial to create a decompiler from Llow to L'high based on the existing decompiler. There is no guarantee that any of the existing rules can be reused for the new decompiler.

Neural Machine Translation  Recent years have seen tremendous progress in Neural Machine Translation (NMT) [16, 22, 35]. NMT systems use neural networks to translate a text from one language to another, and are widely used on natural languages. Intuitively, one can think of NMT as encoding an input text on one side and decoding it to the output language on the other side (see Section 3 for more details). Recent work suggests that neural networks are also effective in summarizing source code [9, 11–13, 20, 21, 28, 29].
Recently, Katz et al. [23] suggested using neural networks, specifically RNNs, for decompilation. Their approach trains a model for translating binary code directly to C source code. However, they did not compensate for the differences between natural languages and programming languages, thus leading to poor results. For example, the code they generate often cannot be compiled or is not equivalent to the original source code. Their work, however, did highlight the viability of using Neural Machine Translation for decompilation, thus supporting the direction we are pursuing. Section 8 provides additional discussion of [23].

Our Approach  We present a novel automatic neural decompilation technique, using a two-phased approach. In the first phase, we generate a templated code snippet which is structurally equivalent to the input. The code template determines the computation structure without assignment of variables and numerical constants. Then, in the second phase, we fill the template with values to get the final decompiled program. The second phase is described in Section 5.
Our approach can facilitate the creation of a decompiler from Llow to Lhigh for every pair of languages for which a compiler from Lhigh to Llow exists.
The technique suggested by [23] attempted to apply NMT to binary code as-is, i.e. without any additional steps and techniques to support the translation. We recognize that for a trainable decompiler, and specifically an NMT-based decompiler, to be useful in practice, we need to augment it with programming-languages knowledge (i.e. domain knowledge). Using domain knowledge, we can make translations simpler and overcome many shortcomings of the NMT model. This insight is implemented in our approach as our canonicalization step (Section 4.3, for simplifying translations) and template filling (Section 5, for overcoming NMT shortcomings). Our technique is still modest in its abilities, but presents a significant step forward towards trainable decompilers and in the application of NMT to the problem of decompilation. The first phase of our approach borrows techniques from natural language processing (NLP) and applies them to programming languages. We use an existing NMT system to translate a program in a lower-level language to a templated program in a higher-level language.
Since we are working on programming languages rather than natural languages, we can overcome some major pitfalls for traditional NMT systems, such as training data generation (Section 4.2) and verification of translation correctness (Section 4.4). We incorporate these insights to create a decompilation technique capable of self-improvement by identifying decompilation failures as they occur, and triggering further training as needed to overcome such failures.
By using NMT techniques as the core of our decompiler's first phase, we avoid the manual work required in traditional decompilers. The core of our technique is language-agnostic, requiring only minimal manual intervention (i.e., implementing a compiler interface).
One of the reasons that NMT works well in our setting is the fact that, compared to natural language, code has a more repetitive structure and a significantly smaller vocabulary. This enables training with significantly fewer examples than what is typically required for NLP [26] (see Section 6).
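To make the two-phase structure concrete, the following minimal Python sketch shows how the phases fit together. It is an illustration only; the function and method names (decompile_statement, translate, same_structure, fill_template) are assumptions and not the actual TraFix API.

def decompile_statement(low_level_input, nmt_model, compiler, fill_template):
    # Phase 1: the NMT model proposes candidate high-level translations.
    # Each candidate is a template fixing the computation structure,
    # not necessarily the exact variable names and constants.
    for template in nmt_model.translate(low_level_input, n_best=5):
        recompiled = compiler.compile(template)
        if not compiler.same_structure(recompiled, low_level_input):
            continue  # structural mismatch: a failed translation
        # Phase 2: verify/replace names and numeric constants so that the
        # filled template recompiles to code equivalent to the input.
        filled = fill_template(template, low_level_input, compiler)
        if filled is not None:
            return filled
    return None  # failed input: in the full system this triggers further training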
Mission Statement  Our goal is to decompile short snippets of low-level code to equivalent high-level snippets. We aim to handle multiple languages (e.g. x86 assembly and LLVM IR). We focus on code compiled using existing off-the-shelf compilers (e.g. gcc [1] and clang [4]), with compiler optimizations enabled, for the purpose of finding bugs and vulnerabilities in benign software. More specifically, we do not attempt to handle hand-crafted assembly as is often found in malware.
Many previous works aimed to use decompilation as a means of understanding the low-level code, and thus focused mostly on code readability. In addition to readability, we place a great emphasis on generating code that is correct (i.e., can be compiled without further modifications) and equivalent to the given input.
We wish to further emphasize that the goal of our work is not to outperform existing decompilers (e.g., Hex-Rays [2]). Many years of development have been invested in such decompilers, resulting in mature and well-tested (though not yet perfect) tools. Rather, we wish to shed light on trainable decompilation, and NMT-based decompilation in particular, as a promising alternative approach to traditional decompilation. This new approach holds the advantage over existing decompilers not in its current results, but in its potential to handle new languages, features, compilers, and architectures with minimal manual intervention. We believe this ability will play a vital role as decompilation becomes more widely used for finding vulnerabilities.

Main Contributions  The paper makes the following contributions:
• A significant step towards neural decompilation by combining ideas from neural machine translation (NMT) and program analysis. Our work brings this promising approach to decompilation closer to being practically useful and viable.
• A decompilation framework that automatically generates training data and checks the correctness of translation using a verifier.
• A decompilation technique that is applicable to many pairs of source and target languages and is mostly independent of the actual low-level source and high-level target languages used.
• An implementation of our technique in a framework called TraFix (short for TRAnslate and FIX) that, given a compiler from Lhigh to Llow, automatically learns a decompiler from Llow to Lhigh.
• An instantiation of our framework for decompilation of C source code from LLVM intermediate representation (IR) [3] and x86 assembly. We used these instances to evaluate our technique on decompilation of small simple code snippets.
• An evaluation showing that our framework decompiles statements in both LLVM IR and x86 assembly back to C source code with high success rates. The evaluation demonstrates the framework's ability to successfully self-advance as needed.

2 Overview
In this section we provide an informal overview of our approach.

2.1 Motivating Example
Consider the x86 assembly example of Fig. 1(a). Fig. 2 shows the major steps we take for decompiling that example.
The first step in decompiling a given input is applying canonicalization. In this example, for the sake of simplicity, we limited canonicalization to only splitting numbers to digits (Section 4.3.1), thus replacing 14 with 1 4, resulting in the code in block (1). This code is provided to the decompiler for translation.
The output of our decompiler's NMT model is a canonicalized version of C, as seen in block (2). In this example, output canonicalization consists of splitting numbers to digits, same as was applied to the input, and printing the code in post-order (Section 4.3.2), i.e. each operator appears after its operands. We apply un-canonicalization to the output, which converts it from post-order to in-order, resulting in the code in block (3). The output of un-canonicalization might contain decompilation errors, thus we treat it as a code template. Finally, by comparing the code in block (3) with the original input in Fig. 1, we fill the template (i.e. by determining the correct numeric values that should appear in the code, see Section 5), resulting in the code in block (4). The code in block (4) is then returned to the user as the final output.
For further details on the canonicalizations applied by the decompiler, see Section 4.3.

Figure 2. Steps for decompiling x86 assembly to C: (1) canonicalized x86 input, (2) NMT output, (3) templated output, (4) final fixed output.

2.2 Decompilation Approach
Our approach to decompilation consists of two complementary phases: (1) generating a code template that, when compiled, matches the computation structure of the input, and (2) filling the template with values and constants that result in code equivalent to the input.

2.2.1 First Phase: Obtaining a Template
Fig. 3 provides a schematic representation of this phase.
At the heart of our decompiler is the NMT model. We surround the NMT model with a feedback loop that allows the system to determine success/failure rates and improve itself as needed by further training.
Denote the input language of our decompiler as Llow and the output language as Lhigh, such that the grammar of both languages is known. Given a dataset of input statements in Llow to decompile, and a compiler from Lhigh to Llow, the decompiler can either start from scratch, with an empty model, or from a previously trained model. The decompiler translates each of the input statements to Lhigh. For each statement, the NMT model generates a few translations that it deems to be most likely. The decompiler then evaluates the generated translations. It compiles each suggested translation from Lhigh to Llow using existing off-the-shelf compilers. The compiled translations are compared against the original input statement in Llow and classified as successful translations or failed translations. At this phase, the translations are code templates, not yet actual code, thus the comparison focuses on matching the computation structure. A failed translation therefore does not match the structure of the input, and cannot produce code equivalent to the input in phase 2. We denote input statements for which there was no successful translation
as failed inputs. Successful translations are passed to the second phase and made available to the user.
The existence of failed inputs triggers a retraining session. The training dataset and validation dataset (used to evaluate progress during training) are updated with additional samples, and the model resumes training using the new datasets. This feedback loop, between the failed inputs and the model's training session, drives the decompiler to improve itself and keep learning as long as it has not reached its goal. These iterations will continue until a predetermined stop condition has been met, e.g. a significant enough portion of the input statements were decompiled successfully. It also allows us to focus training on aspects where the model is weaker, as determined by the failed inputs.
The well-defined structure of programming languages allows us to make predictable and reversible modifications to both the input and output of the NMT model. These modifications are referred to as canonicalization and un-canonicalization, and are aimed at simplifying the translation problem. These steps rely on domain-specific knowledge and do not exist in traditional NMT systems for natural languages. Section 4.3 motivates and describes our canonicalization methods.

Updating the Datasets  After each iteration we update the dataset used for training. Retraining without doing so would lead to over-fitting the model to the existing dataset, and will be ineffective at teaching the model to handle new inputs. We update the dataset by adding new samples obtained from two sources:
• Failed translations – We compile failed translations from Lhigh to Llow and use them as additional training samples. Training on these samples serves to teach the model the correct inputs for these translations, thus reducing the chances that the model will generate these translations again in future iterations.
• Random samples – We generate a predetermined number of random code samples in Lhigh and compile these samples to Llow.
The validation dataset is updated using only random samples. It is also shuffled and truncated to a constant size. The validation dataset is translated and evaluated many times during training; thus truncating it prevents the validation overhead from increasing.
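The update policy just described can be summarized by a short sketch. The code below is only an illustration of the policy, not the TraFix implementation; the helper names (compile_batch, random_samples) and the default sizes are assumptions, with the sizes taken from the experimental setup of Section 6.3.

import random

def update_datasets(train_set, validation_set, failed_translations,
                    compile_batch, random_samples,
                    n_random=5000, validation_size=1000):
    # Section 6.3: half of the previous iteration's training samples are dropped,
    # giving samples obtained from recent failures a higher relative weight.
    train_set = random.sample(train_set, len(train_set) // 2) if train_set else []
    # Failed translations: compile the incorrect high-level outputs and train on
    # the resulting (low-level, high-level) pairs.
    train_set += [(compile_batch(hl), hl) for hl in failed_translations]
    # Random samples: freshly generated high-level snippets compiled to Llow.
    train_set += [(compile_batch(hl), hl) for hl in random_samples(n_random)]
    # The validation set receives random samples only; it is shuffled and
    # truncated to a constant size to keep the validation overhead from growing.
    validation_set += [(compile_batch(hl), hl) for hl in random_samples(n_random)]
    random.shuffle(validation_set)
    return train_set, validation_set[:validation_size]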
2.2.2 Second Phase: Filling the Template
The first phase of our approach produces a code template that can lead to code equivalent to the input. The goal of the second phase is to find the right values for instantiating actual code from the template. Note that the NMT model provides initial values. We need to verify that these values are correct and replace them with appropriate values if they are wrong.
This step is inspired by the common NLP practice of delexicalization [18]. In NLP, using delexicalization, some words in a sentence would be replaced with placeholders (e.g. NAME1 instead of an actual name). After translation these placeholders would be replaced with values taken directly from the input.
Similarly, we use the input statement as the source for the values needed for filling our template. Unlike delexicalization, it is not always the case that we can take a value directly from the input. In many cases, and usually due to optimizations, we must apply some transformation to the values in the input in order to find the correct value to use.
In the example of Fig. 2, the code contains two numeric values which we need to "fill" – 14 and 2. For each of these values we need to either verify or replace it. The case of 14 is relatively simple, as the NMT provided a correct initial value. We can determine that by comparing 14 in the output to 14 in the original input. For 2, however, copying the value 2 from the input did not provide the correct output. Compiling the output with the value 2 would result in the instruction sall 1, %eax rather than the desired sall 2, %eax. We thus replace 2 with a variable N and try to find the right value for N. To get the correct value, we need to apply a transformation to the input. Specifically, if the input value is x, the relevant transformation for this example is N = 2^x, resulting in N = 4 that, when recompiled, yields the desired output. Therefore we replace 2 with 4, resulting in the code in Fig. 2(4).
Section 5 further elaborates on this phase and provides additional possible transformations.
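A minimal sketch of this value-filling step is shown below. It is not the TraFix implementation; the transformation list and helper names are assumptions. For each numeric template parameter it tries a small set of candidate transformations of values taken from the input and keeps the first candidate whose recompilation reproduces the original low-level input.

CANDIDATE_TRANSFORMS = [
    lambda x: x,        # value copied directly from the input
    lambda x: 2 ** x,   # e.g. `sall x` encodes a multiplication by 2**x
    lambda x: -x,       # e.g. additions of negative constants
]

def fill_value(template, placeholder, input_values, low_level_input,
               compile_fn, matches_fn):
    # Find a value for `placeholder` (e.g. "N") in `template`, a C string.
    for x in input_values:                 # numeric values taken from the input
        for transform in CANDIDATE_TRANSFORMS:
            candidate = template.replace(placeholder, str(transform(x)))
            # Recompile and compare against the original low-level input.
            if matches_fn(compile_fn(candidate), low_level_input):
                return candidate
    return None  # no candidate found; the translation is reported as failed

In the example of Fig. 2, x = 2 is taken from sall 2, %eax, and the second transformation yields 2^2 = 4, which recompiles to the desired instruction.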
3 Background
Current Neural Machine Translation (NMT) models follow a sequence-to-sequence paradigm introduced in [14]. Conceptually, they have two components, an encoder and a decoder. The encoder encodes an arbitrary length sequence of tokens x_1, ..., x_n over alphabet A into a sequence of vectors, where each vector represents a given input token x_i in the context in which it appears. The decoder then produces an arbitrary length sequence of tokens y_1, ..., y_m from alphabet B, conditioned on the encoded vectors. The sequence y_1, ..., y_m is generated a token at a time, until generating an end-of-sequence token. When generating the i-th token, the model considers the previously generated tokens as well as the encoded input sequence. An attention mechanism is used to choose which subset of the encoded vectors to consider at each generation step. The generation procedure is either greedy, choosing the best continuation symbol at each step, or uses beam-search to develop several candidates in parallel. The NMT system (including the encoder, decoder and attention mechanism) is trained over many input-output sequence pairs, where the goal of the training is to produce correct output sequences for each input sequence. The encoder and the decoder are implemented as recurrent neural networks (RNNs), and in particular as specific flavors of RNNs called LSTM [19] and GRU [14] (we use LSTMs in this work). Refer to [30] for further details on NMT systems.
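The difference between greedy decoding and beam-search can be illustrated with a short, self-contained sketch. The step function below is a stand-in for one decoder step and is an assumption for illustration, not an actual NMT decoder: given the tokens generated so far, it returns scored continuation symbols, and beam-search keeps the k highest-scoring partial outputs at every step instead of committing to a single one.

EOS = "<eos>"

def beam_search(step, beam_width=5, max_len=50):
    # step(prefix) -> list of (token, log_prob) continuations for a partial output.
    beams = [([], 0.0)]                  # (generated tokens, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step(tokens):
                candidates.append((tokens + [token], score + logp))
        # Keep only the beam_width best partial outputs.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)

def greedy(step, max_len=50):
    # Greedy decoding is beam-search with a beam of width 1.
    results = beam_search(step, beam_width=1, max_len=max_len)
    return results[0][0] if results else []

In our experiments (Section 6.3), beam-search with a beam of 5 is used to produce five candidate translations per input.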
4 Decompilation with Neural Machine Translation
In this section we describe the algorithm of our decompilation framework using NMT. First, in Section 4.1, we describe the algorithm at a high level. We then describe the realization of operations used in the algorithm, such as canonicalization (Section 4.3), the evaluation of the resulting translation (Section 4.4), and the stopping condition (Section 4.5).

4.1 Decompiler Algorithm
Our framework implements the process depicted by Fig. 3. This process is also formally described in Algorithm 1. The algorithm uses a Dataset data structure which holds pairs (x, y) of statements such that x ∈ Lhigh, y ∈ Llow, and y is the output of compiling x.
The framework takes two inputs: (1) a set of statements for decompilation, and (2) a compiler interface. The output is a set of successfully decompiled statements. Decompilation starts with empty sets for training and validation and canonicalizes (Section 4.3) the input set. It then iteratively extends the training and validation sets (Section 4.2), trains a model on the new sets, and attempts to translate the input set. Each translation is then recompiled and evaluated against the original input (Section 4.4 and Section 5). Successful translations are then put in a Success set that will eventually be returned to the user. Failed translations are put in a Failed set that will be used to further extend the training set. The framework repeats these steps as long as the stopping condition has not been reached (Section 4.5).

4.2 Generating Samples
To generate samples for our decompiler to train on, we generate random code samples from a subset of the C programming language. This is done by sampling the grammar of the language. The samples are guaranteed to be syntactically and grammatically correct. We then compile our code samples using the provided compiler. Doing so results in a dataset of matching pairs of statements, one in C and the other in Llow, that can be used by the model for training and validation.
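The following self-contained sketch illustrates this sampling step on a tiny fragment of the expression grammar from Table 1. It is an illustration only, not the TraFix generator; each generated C statement would then be compiled to obtain its low-level counterpart.

import random

BIN_OPS = ["+", "-", "*", "/", "%"]

def gen_expr(depth=0, max_depth=3):
    # Expr := Var | Number | Expr BinaryOp Expr   (fragment of Table 1)
    if depth >= max_depth or random.random() < 0.4:
        return random.choice([f"X{random.randint(0, 9)}",      # generic variable name
                              str(random.randint(0, 99))])     # numeric constant
    return f"( {gen_expr(depth + 1)} {random.choice(BIN_OPS)} {gen_expr(depth + 1)} )"

def gen_assignment():
    # Assignment := Var = Expr ;
    return f"X{random.randint(0, 9)} = {gen_expr()} ;"

if __name__ == "__main__":
    for _ in range(3):
        print(gen_assignment())   # e.g. "X3 = ( X1 * 14 ) ;"
        # Each sample would then be compiled (gcc/clang) to produce the
        # matching low-level statement of the training pair.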
We note that, alternatively, we could use code snippets from publicly available code repositories as training samples, but these are less likely to cover uncommon coding patterns.

Algorithm 1 Decompilation algorithm
Input: inputset, a collection of statements in Llow
       compile, an API to compile Lhigh to Llow
Output: Dataset of successfully decompiled statements in Lhigh
Data Types: Dataset, a collection of pairs (x, y) such that x = compile(y)
1: procedure Decompile
2:   inputset ← canonicalize(inputset)
3:   Train ← newDataset
4:   Validate ← newDataset
5:   model ← newModel
6:   Success ← newDataset
7:   Failures ← newDataset
8:   while (?) do
9:     Train ← Train ∪ Failures ∪ gen_random_samples()
10:    Validate ← Validate ∪ gen_random_samples()
11:    model.retrain(Train, Validate)
12:    decompiled ← model.translate(inputset)
13:    recompiled ← compile(decompiled)
14:    for each i in 0...inputset.size do
15:      pair ← (inputset[i], decompiled[i])
16:      if equiv(inputset[i], recompiled[i]) then
17:        if fill(inputset[i], recompiled[i]) then
18:          Success ← Success ∪ [pair]
19:        else
20:          Failures ← Failures ∪ [pair]
21:        end if
22:      else
23:        Failures ← Failures ∪ [pair]
24:      end if
25:    end for
26:  end while
27:  return uncanonicalize(Success)
28: end procedure
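For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1. It is a paraphrase, not the TraFix source; equiv, fill, the stopping condition, and the model/compiler objects are assumed to be supplied by the compiler interface and the NMT backend, and failures are collected per iteration.

def decompile(inputset, compiler, model, gen_random_samples,
              equiv, fill, canonicalize, uncanonicalize, stop_condition):
    inputset = canonicalize(inputset)
    train, validate = [], []
    success, failures = [], []
    while not stop_condition(success, inputset):
        # Extend the datasets with past failures and fresh random samples.
        train = train + failures + gen_random_samples()
        validate = validate + gen_random_samples()
        model.retrain(train, validate)
        decompiled = model.translate(inputset)           # candidate templates in Lhigh
        recompiled = [compiler.compile(d) for d in decompiled]
        failures = []
        for original, template, low in zip(inputset, decompiled, recompiled):
            pair = (original, template)
            # Structural equivalence (Section 4.4) and successful template
            # filling (Section 5) are both required for a successful translation.
            if equiv(original, low) and fill(original, low):
                success.append(pair)
            else:
                failures.append(pair)
    return uncanonicalize(success)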
4.3 Improving Translation Performance with Canonicalization
It is possible to improve the performance of NMT models without intervening in the actual model. This can be achieved by manipulating the inputs in ways that simplify the translation problem. In the context of our work, we refer to these domain-specific manipulations as canonicalization.
Following are two forms of canonicalization used by our implementation:

movl 1234 , X1                    movl 1 2 3 4 , X1
(a) Original input                (b) Input after splitting numbers to digits
X1 = 1 2 3 4 ;                    X1 = 1234 ;
(c) Translation output            (d) Output after fusing digits to numbers
Figure 4. Reducing vocabulary by splitting numbers to digits

4.3.1 Reducing Vocabulary Size
The vocabulary size of the samples provided to the model, either for training or translating, directly affects the performance and efficiency of the model. In the case of code, a large portion of the vocabulary is devoted to numerical constants and names (such as variable names, method names, etc.).
Names and numbers are usually considered "uncommon" words, i.e. words that do not appear frequently. Descriptive variable names, for example, are often used within a single method but are not often reused in other methods. This results in a distinctive vocabulary, consisting largely of uncommon words, and leading to a large vocabulary.
We observe that the actual variable names do not matter for preserving the semantics of the code. Furthermore, these names are actually removed as part of the stripping process. Therefore, we replace all names in our samples with generic names (e.g. X1 for a variable). This allows for more reuse of names in the code, and therefore more examples from which the model can learn how to treat such names. Restoring informative descriptive names in source code is a known and orthogonal research problem for which several solutions exist (e.g. [10, 17, 32]).
Numbers cannot be handled in a similar way. Their values cannot be replaced with generic values, since that would alter the semantic meaning of the code. Furthermore, essentially every number used in the samples becomes a word in the vocabulary. Even limiting the values of numbers to some range [1, K] would still result in K different words.
To deal with the abundance of numbers we take inspiration from NMT for natural languages. Whenever an NMT model for NL encounters an uncommon word, instead of trying to directly translate that word, it falls back to a sub-word representation (i.e. process the word as several symbols). Similarly, we split all numbers in our samples to digits. We train the model to handle single digits and then fuse the digits in the output into numbers. Fig. 4 provides an example of this process on a simple input. Using this process, we reduce the portion of the vocabulary dedicated to numbers to only 10 symbols, one per digit. Note that this reduction comes at the expense of prolonging our input sentences.
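A self-contained sketch of this canonicalization is shown below (illustrative only): numbers in a token stream are split into space-separated digits before translation, and digit runs in the model's output are fused back into numbers, reproducing the example of Fig. 4.

def split_numbers(tokens):
    # 'movl 1234 , X1' -> 'movl 1 2 3 4 , X1'   (Fig. 4a -> 4b)
    out = []
    for tok in tokens.split():
        out.extend(list(tok) if tok.isdigit() else [tok])
    return " ".join(out)

def fuse_digits(tokens):
    # 'X1 = 1 2 3 4 ;' -> 'X1 = 1234 ;'   (Fig. 4c -> 4d)
    out = []
    for tok in tokens.split():
        if tok.isdigit() and out and out[-1].isdigit():
            out[-1] += tok          # merge consecutive single digits
        else:
            out.append(tok)
    return " ".join(out)

assert split_numbers("movl 1234 , X1") == "movl 1 2 3 4 , X1"
assert fuse_digits("X1 = 1 2 3 4 ;") == "X1 = 1234 ;"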
Similarly, NMT models often perform better when the source and target languages follow similar word orders, even though the model reads the entire input before generating any output. We therefore modify the structure of the C input statements to post-order to create a better correlation with the output. Fig. 5b shows the code obtained by canonicalizing the code in Fig. 5a. After translation, we can easily parse the generated post-order code using a simple bottom-up parser to obtain the corresponding in-order code.

Figure 5. (a) original C code; (b) post-order C code.
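The post-order canonicalization and its inverse can be sketched as follows. This is a simplified illustration for fully parenthesized binary expressions only, not the real implementation, which handles the full grammar of Table 1: the C expression is emitted with each operator after its operands, and the generated post-order output is parsed back with a simple stack-based (bottom-up) pass.

BIN_OPS = {"+", "-", "*", "/", "%"}

def to_postorder(tokens):
    # In-order tokens with full parentheses -> post-order,
    # e.g. ['(', 'X1', '+', '2', ')'] -> ['X1', '2', '+'].
    out, ops = [], []
    for tok in tokens:
        if tok == "(":
            pass
        elif tok in BIN_OPS:
            ops.append(tok)
        elif tok == ")":
            out.append(ops.pop())
        else:
            out.append(tok)                 # variable or number
    return out + list(reversed(ops))

def to_inorder(postorder):
    # Bottom-up parse of the post-order output back into an in-order string.
    stack = []
    for tok in postorder:
        if tok in BIN_OPS:
            right, left = stack.pop(), stack.pop()
            stack.append(f"( {left} {tok} {right} )")
        else:
            stack.append(tok)
    return stack[0]

assert to_inorder(to_postorder("( X1 + ( 2 * X2 ) )".split())) == "( X1 + ( 2 * X2 ) )"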
6 Evaluation
In this section we describe the evaluation of our decompilation technique and present our results.

6.1 Implementation
We implemented our technique in a framework called TraFix. Our framework takes as input an implementation of our compiler interface and uses it to build a decompiler. The resulting decompiler takes as input a set of sentences in a low-level language Llow, translates the sentences, and outputs a corresponding set of sentences in a high-level language Lhigh, specifically C in our implementation. Each sentence represents a sequence of statements in the relevant language.

6.2 Benchmarks
We evaluate TraFix using random C snippets sampled from a subset of the C programming language. Each snippet is a sequence of statements, where each statement is either an assignment of an expression to a variable, an if condition (with or without an else branch), or a while loop. Expressions consist of numbers, variables, binary operators and unary operators. If and while statements are composed using a condition – a relational operator between two expressions – and a sequence of statements which serves as the body. We limit each sequence of statements to at most 5. Table 1 provides the formal grammar from which the benchmarks are sampled.

Statements  := Statement | Statements Statement
Statement   := Assignment | Branch | Loop
Assignments := Assignment | Assignments Assignment
Assignment  := Var = Expr ;
Var         := ID
Expr        := Var | Number | BinaryExpr | UnaryExpr
UnaryExpr   := UnaryOp Var | Var UnaryOp
UnaryOp     := ++ | --
BinaryExpr  := Expr BinaryOp Expr
BinaryOp    := + | - | * | / | %
Branch      := if ( Condition ) { Statements } |
               if ( Condition ) { Statements } else { Statements }
Loop        := while ( Condition ) { Statements }
Condition   := Expr Relation Expr
Relation    := > | >= | < | <= | == | !=
Table 1. Grammar for experiments. Terminals are underlined.

All of our benchmarks were compiled using the compiler's default optimizations. Working on optimized code introduces several challenges, as mentioned in Section 5.2, but is crucial for the practicality of our approach. Note that we didn't strip the code after compilation. However, our "original" C code that we compile is already essentially stripped, since our canonicalization step abstracts all names in the code.
During benchmark generation we make sure that there is no overlap between the Training dataset, Validation dataset and our Test dataset (used as input statements to the decompiler).

Evaluating Benchmarks  Despite holding the ground-truth for our test set (the C used to generate the set), we decided not to compare the decompiled code to the ground-truth. We observe that, in some cases, different C statements could be compiled to the same low-level code (e.g. the statements x = x + 1 and x++). We decided to evaluate them in a manner that allows for such occurrences and is closer to what would be applied in a real use-case. We thus opted to evaluate our benchmarks by recompiling the decompiled code and comparing it against the input, as described in Section 4.4.
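A minimal sketch of this evaluation strategy appears below. It is illustrative only; the normalization performed in Section 4.4 is more involved than a plain string comparison, and callers are assumed to pass a complete compilable C unit. Each decompiled candidate is recompiled with the same off-the-shelf compiler and compared against the original low-level input rather than against the C ground-truth.

import os
import subprocess
import tempfile

def compile_to_asm(c_code, compiler="gcc"):
    # Compile a (complete, compilable) C source text to assembly text.
    with tempfile.TemporaryDirectory() as tmp:
        src, asm = os.path.join(tmp, "a.c"), os.path.join(tmp, "a.s")
        with open(src, "w") as f:
            f.write(c_code)
        subprocess.run([compiler, "-S", "-o", asm, src], check=True)
        with open(asm) as f:
            return f.read()

def evaluate(original_asm, decompiled_c, normalize):
    # A decompiled candidate is successful if it recompiles to the input.
    recompiled = compile_to_asm(decompiled_c)
    return normalize(recompiled) == normalize(original_asm)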
6.3 Experimental Design and Setup
We ran several experiments of TraFix. For each experiment we generated 2,000 random statements to be used as the test set. TraFix was configured to generate an initial set of 10,000 training samples and an additional 5,000 training samples at each iteration. An additional 1,000 random samples served as the validation set. There is no overlap between the test set and the training/validation sets. We decided, at each iteration, to drop half of the training samples from the previous iteration. This serves to limit the growth of the training set (and thus the training time), and assigns a higher weight to samples obtained through recent failures compared to older samples. Each iteration was limited to 2,000 epochs. In practice, our experiments never reached this limit. No iteration of our experiments with LLVM and x86 exceeded 140 epochs (and no more than 100 epochs when excluding the first iteration). For each test input we generated 5 possible translations using beam-search. We stopped each experiment when it had successfully translated over 95% of the test statements or when no progress was made for the last 10 iterations.
Recall that the validation set is periodically translated during training and used to evaluate training progress. TraFix is capable of stopping a training session early (before the epoch limit was reached) if no progress was observed in the last consecutive k validation sessions. Intuitively, this process detects when the model has reached a stable state close enough to the optimal state that can be reached on the current training set. In our experiments a validation session is triggered after processing 1000 batches of training samples (each batch containing 32 samples) and k was set to 10. All training sessions were stopped early, before reaching the epochs limit.
The NMT model consists of a single layer each for the encoder and decoder. Each layer consists of 100 nodes and the word embedding size was set to 300.
We ran our experiments on Amazon AWS instances. Each instance is of type r5a.2xlarge – a Linux machine with 8 Intel Xeon Platinum 8175M processors, each operating at 2.5GHz, and 64GiB of RAM, running Ubuntu 16.04 with GCC [1] version 5.4.0 and Clang [4] version 3.8.0.
We executed our experiments as a single process using only a single CPU, without utilizing a GPU, in order to mimic the scenario of running the decompiler on an end-user's machine. This configuration highlights the applicability of our approach such that it can be used by many users without requiring specialized hardware.

6.4 Results
6.4.1 Estimating Problem Hardness
As a measure of problem complexity, we first evaluated our decompiler on several different subsets of C using only a single iteration. The purpose of these measurements is to estimate how difficult a specific grammar is going to be for our decompiler.
We used 8 different grammars for these measurements. Each grammar builds upon the previous one, meaning that grammar i contains everything in grammar i − 1 and adds a new grammar feature (the only exception is grammar 4, which does not contain unary operators). The grammars are:
1. Only assignments of numbers to variables
2. Assignments of variables to variables
3. Computations involving unary operators
4. Computations involving binary operators
iteration   epochs   train time   translate time   successful translations
1           75.6     14:16        03:25            1913.6 (95.68%)
2           76.5     14:11        00:42            1940.2 (97.01%)
Table 2. Statistics of iterative experiments of LLVM IR
low, taking only a couple of minutes for the entire set of benchmarks.
These observations are important due to the expected operating scenario of our decompiler. We expect the majority of inputs to be resolved using a previously trained model. Retraining an NMT model should be done only when the language grammar is extended or when significantly difficult inputs are provided. Thus, in normal operations, the execution time of the decompiler, consisting of only translation and evaluation, will be mere seconds.

Figure 9. Cumulative success rate of the x86 decompiler as a function of how many iterations the decompiler performed.

Decompiling x86 Assembly  Table 3 provides statistics of our x86 experiments. All of these experiments terminated when they reached the iterations limit, which was set to 6.
Fig. 9 visualizes the successful translations column. The figure plots our average success rate as a function of the number of completed iterations. It is evident that with each iteration the success rate increases, eventually reaching over 88% after 6 iterations. Overall, our decompiler successfully handled samples of up to 668 input tokens and 177 output tokens.
Our decompilation success rates on x86 were lower than that of LLVM, terminating at around 88%. This correlates with the nature of x86 assembly, which has a smaller vocabulary than that of LLVM IR. The smaller

7 Discussion
7.1 Limitations
Manual examination of our results from Section 6.4 revealed that currently our main limitation is input length. There was no threshold such that inputs longer than the threshold would definitely fail. We observed both successful and failed long inputs, often of the same length. We did, however, observe a correlation between input length and a reduced success rate. As the length of an input increases, it becomes more likely to fail.
We found no other outstanding distinguishing features, in the code structure or used vocabulary, that we could claim are a consistent cause of failures.
This limitation stems from the NMT model we used. Long inputs are a known challenge for existing NMT systems [26]. NMT for natural languages is usually limited to roughly 60 words [26]. Due to the nature of code (i.e. limited vocabulary, limited structure) we can handle inputs much longer than typical natural language sentences (668 words for x86 and 845 words for LLVM). Regardless, this challenge also applies to us, resulting in poorer results when handling longer inputs. As the field of NMT evolves to better handle long inputs, so would our results improve.
To verify that this limitation is not due to our specific implementation, we created another variant of our framework. This new variant is based on TensorFlow [5, 8] rather than DyNet. Experimenting with this variant, we got similar results as those reported in Section 6.4, and ultimately reached the same conclusion — the observed limitation on input length is inherent to using NMT.

7.1.1 Other Decompilation Failures
Though we do not consider this a limitation, another aspect that could be improved is our template filling phase. Our manual analysis identified some possibilities for improving our second phase – the template filling phase (Section 5).
The first type of failure we have observed is the result of constant folding – a compiler optimization that replaces computations involving only constants with their results. Fig. 10 demonstrates this kind of failure. Given the C code in Fig. 10a, the compiler determines that 63∗5
X3 = 63 * ( 5 * X1 ) ;
(a) High-level code
X3 = ( X1 * 43 ) * 70 ;
(c) Suggested decompilation
Figure 10. Example of decompilation failure

X2 = ((X0 % 40) * 63) / ((98 - X1) - X0);
(a) High-level code
X2 = ((X0 % N3) * N13) / (((N2 - X1) + N11) - X0);
(b) Suggested decompilation
Figure 11. Failure due to redundant number

X2 = 48 + (X5 * (X14 * 66));
(a) High-level code

theorem prover based template filling algorithm could detect that and assign the appropriate values to the constants, including N11, resulting in equivalent code.
Fig. 12 shows another kind of failure. In this example the difference between the expected output and suggested translation is a + that was replaced with −. Currently only variable names and numeric constants are treated as template parameters. This kind of difference can be overcome by considering operators as template parameters as well. Since the number of options for each operator type (unary, binary) is extremely small, we could try all options for filling these template parameters.
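The operator-filling extension suggested above could look roughly like the following sketch. This is our illustration of the suggestion, not an implemented TraFix feature; the OP1, OP2, ... slot convention and the helper names are assumptions. Each binary operator in the template becomes one more template parameter, and since there are only five binary operators in the grammar of Table 1, all options can be tried exhaustively and validated by recompilation.

from itertools import product

BINARY_OPS = ["+", "-", "*", "/", "%"]   # the binary operators of Table 1

def fill_operators(template, low_level_input, compile_fn, matches_fn):
    # Try every assignment of binary operators to the template's operator slots,
    # where slots are marked as OP1, OP2, ... in the template string.
    slots = sorted({tok for tok in template.split() if tok.startswith("OP")})
    for ops in product(BINARY_OPS, repeat=len(slots)):
        candidate = template
        for slot, op in zip(slots, ops):
            candidate = candidate.replace(slot, op)
        if matches_fn(compile_fn(candidate), low_level_input):
            return candidate
    return None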
A Extracting Decompilation Rules
Table 4 contains examples of decompilation rules extracted from our decompiler. For brevity, we present mostly relatively simple rules, but longer and more complicated rules were also found by our decompiler (examples of such rules are found at the bottom of the table, below the separating line).

input → output
movl X1 , eax ; addl N1 , eax ; movl eax , X2 ;  →  X2 = N1 + X1 ;
movl X1 , eax ; subl N1 , eax ; movl eax , X2 ;  →  X2 = X1 - N1 ;
movl X1 , eax ; imull N1 , eax , eax ; movl eax , X2 ;  →  X2 = X1 * N1 ;
movl X1 , ecx ; movl N1 , eax ; idivl ecx ; movl eax , X2 ;  →  X2 = N1 / X1 ;
movl X1 , eax ; movl X2 , ecx ; idivl ecx ; movl eax , X3 ;  →  X3 = X1 / X2 ;
movl X1 , eax ; sall N1 , eax ; movl eax , X2 ;  →  X2 = X1 * 2^N1 ;
movl X1 , ecx ; movl N1 , eax ; idivl ecx ; movl edx , eax ; movl eax , X2 ;  →  X2 = N1 % X1 ;
movl X1 , eax ; movl X2 , ecx ; idivl ecx ; movl edx , eax ; movl eax , X3 ;  →  X3 = X1 % X2 ;
movl X1 , eax ; leal 1 ( eax ) , edx ; movl edx , X1 ; movl eax , X2 ;  →  X2 = X1 ++ ;
movl X1 , eax ; leal -1 ( eax ) , edx ; movl edx , X1 ; movl eax , X2 ;  →  X2 = X1 -- ;
movl X1 , eax ; addl 1 , eax ; movl eax , X1 ; movl X1 , eax ; movl eax , X2 ;  →  X2 = ++ X1 ;
movl X1 , eax ; imull N1 , eax , eax ; addl N2 , eax ; movl eax , X2 ;  →  X2 = N2 + ( N1 * X1 ) ;
movl X1 , eax ; addl N1 , eax ; sall N2 , eax ; movl eax , X2 ;  →  X2 = ( X1 + N1 ) * 2^N2 ;
movl X1 , eax ; imull N1 , eax , ecx ; movl N2 , eax ; idivl ecx ; movl eax , X2 ;  →  X2 = N2 / ( X1 * N1 ) ;
movl X1 , eax ; cmpl N1 , eax ; jg .L0 ; movl N2 , X2 ; .L0: ;  →  if ( X1 < ( N1 + 1 ) ) { X2 = N2 ; }
jmp .L1 ; .L0: ; movl N1 , X1 ; .L1: ; movl X2 , eax ; cmpl N2 , eax ; jg .L0 ;  →  while ( X2 > N2 ) { X1 = N1 ; }
jmp .L1 ; .L0: ; movl N1 , X1 ; .L1: ; movl X2 , eax ; cmpl N2 , eax ; jne .L0 ;  →  while ( N2 != X2 ) { X1 = N1 ; }
movl X1 , eax ; cmpl N1 , eax ; jne .L0 ; movl N2 , X2 ; movl X3 , eax ; movl eax , X4 ; .L0: ;  →  if ( N1 == X1 ) { X2 = N2 ; X4 = X3 ; }
movl X1 , edx ; movl X2 , eax ; cmpl eax , edx ; jg .L0 ; movl N1 , X3 ; jmp .L1 ; .L0: ; movl N2 , X4 ; .L1: ;  →  if ( X1 <= X2 ) { X3 = N1 ; } else { X4 = N2 ; }
jmp .L1 ; .L0: ; movl X1 , eax ; addl N1 , eax ; movl eax , X2 ; .L1: ; movl X3 , eax ; cmpl N2 , eax ; jle .L0 ;  →  while ( X3 <= N2 ) { X2 = N1 + X1 ; }
--------------------------------------------------------------------------------
jmp .L1 ; .L0: ; movl X2 , eax ; addl 1 , eax ; movl eax , X2 ; movl X2 , edx ; movl X2 , eax ; addl edx , ea...  →  while ( ( X1 - N1 ) > ( X2 % ( X2 - N2 ) ) ) { X3 = ( ++ X2 ) + X2 ; ...
movl X1 , eax ; addl 1 , eax ; movl eax , X1 ; movl X1 , edx ; movl X2 , eax ; movl N1 , ecx ; subl eax , ecx...  →  if ( ++ X1 == ( ( ( X2 * ( N1 - X2 ) ) - N2 ) * ( N3 - X3 ) ) ) { X2 = ...
movl X3 , edx ; movl X4 , eax ; addl edx , eax ; movl X4 , ecx ; movl X5 , edx ; addl edx , ecx ; idivl ecx ; ...  →  X1 = X2 * ( ( X3 + X4 ) % ( X4 + X5 ) ) ; X6 = ( X7 + X9 ) / ( ( N1 - ...
movl N1 , X1 ; movl X1 , eax ; movl eax , X2 ; movl X2 , eax ; movl X3 , edx ; addl N3 , edx ; subl edx , ea...  →  X1 = N1 ; X2 = X1 ; if ( ( N2 + ( X2 - ( X3 + N3 ) ) ) <= X4 ) { X ...
jmp .L1 ; .L0: ; movl X1 , ebx ; movl N3 , eax ; idivl ebx ; movl eax , X1 ; .L1: ; movl X1 , edx ; movl X2 ...  →  while ( ( X1 * X2 ) >= ( N1 % ( X3 + N2 ) ) ) { X1 = N3 / X1 ; } ; X4 ...
Table 4. Decompilation rules extracted from TraFix