Augmenting Decompiler Output With Learned Variable Names and Types
    typedef struct point {
        float x;
        float y;
    } pnt;

    void fun() {                    void fun() {
        pnt p1, p2;                     float v1[2], v2[2];
        p1.x = 1.5;                     v1[0] = 1.5;
        p1.y = 2.3;                     v1[1] = 2.3;
        // ...                          // ...
        use_pts(&p1, &p2);              use_pts(v1, v2);
    }                               }

    (a) Original code               (b) Decompiled fun

Figure 1: A function with a struct and its decompilation.

    void fun() {                    void fun() {
        // stack layout:                // stack layout:
        // [xxx][p][yyyy]               // [xxxx][yyyy]
        char x[3];                      char x[4];
        int y;                          int y;
        // ...                          // ...
    }                               }

    (a) Original code               (b) Decompiled fun

Figure 2: A function illustrating the data layout problem in decompilation. In the stack layout the characters x, y, and p represent a single byte assigned to the variables x and y, or padding data, respectively. The decompiler cannot recognize that the inserted padding data does not belong to the x array.
is not clear which array index refers to which coordinate, or even that the coordinates are Cartesian (instead of, e.g., polar).

Unlike names, types are constrained by memory layouts, and thus theoretically should be easier to recover (only types that fit that memory layout should be considered as candidates). In fact, decompilers already narrow down possible type choices using the fact that base types targeting a specific platform can only be assigned to variables with a specific memory layout (e.g., on most platforms an int variable can never be retyped to a char because they require different amounts of memory). This already makes it possible for decompilers to infer base types and a small set of commonly-used typedefs.

On the other hand, despite performing a battery of complex binary analyses, the data layout inferred by the decompiler is often incorrect, which makes the problem harder. For example, consider the program shown in Figure 2. Two top-level variables are declared: x, a three-byte char array, and y, a four-byte int. During compilation, the compiler inserts a single byte of padding after the x array for alignment. When this function is decompiled, the decompiler can tell where x and y begin, but it cannot tell whether x is a three-byte array followed by a single byte of padding or a four-byte array whose last element is never used.
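The padding behavior can be reproduced outside a compiler. The following minimal sketch uses Python's standard ctypes module (which lays structures out according to the platform C ABI) to mirror Figure 2; the class name Frame is ours, purely for illustration:

    import ctypes

    # Figure 2's layout: a 3-byte char array followed by a 4-byte int.
    class Frame(ctypes.Structure):
        _fields_ = [("x", ctypes.c_char * 3),
                    ("y", ctypes.c_int)]

    print(ctypes.sizeof(Frame))  # 8 on typical platforms, not 7
    print(Frame.y.offset)        # 4: one padding byte was inserted after x

From the raw bytes alone, nothing distinguishes this layout from char x[4] followed by int y, which is exactly the ambiguity the decompiler faces.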
Prior work on reconstructing types falls into two groups. The first, such as TIE [31], attempts to recover syntactic types, e.g., struct {float; float}, but not the names of the structure or its fields. The second, such as REWARDS [33], attempts to also recover the type name (referred to as semantic types). However, these systems typically only support a small set of manually-defined types and well-known library calls, and neither the first nor the second deals with the padding issue above.

In contrast, our system DIRTY (DecompIled variable ReTYper) recovers both semantic and syntactic types, handles padding, and is not limited to a small set of manually-defined types. Instead, DIRTY supports 48,888 possible types encountered "in the wild" in open-source C code (compared to the 150 different type names in 84 standard library calls supported by REWARDS). At a high level, DIRTY is a Transformer-based [50] neural network model that recommends types in a particular context and operates as a postprocessing step to decompilation. DIRTY takes a decompiled function as input and outputs probable names and types for all of its variables.

To build DIRTY, we start by mining open-source C code from GitHub, and then use a decompiler's typical ability to import variable names and types from DWARF debugging information to create a parallel corpus of decompiled functions with and without their corresponding original names and types. As a side effect of this large-scale mining effort, we also automatically compile a library of types encountered across our open-source corpus. We then train DIRTY on this data, introducing two task-specific innovations. First, we use a Data Layout Encoder to incorporate memory layout information into DIRTY's predictions and simultaneously address a fundamental limitation of decompilers caused by padding. Second, we address both the variable renaming and retyping tasks simultaneously with a joint Multi-Task architecture, enabling them to benefit from each other.

We show that DIRTY can assign variable types that agree with those written by developers up to 75.8% of the time, and DIRTY also outperforms prior work on variable names. Note that even though we implement DIRTY on top of the Hex-Rays¹ decompiler because of its positive reputation and its programmatic access to decompiler internals, our approach is not fundamentally specific to Hex-Rays, and should conceptually work with any decompiler that names variables using DWARF debug symbols.

In summary, we contribute:

• DIRT—the Dataset for Idiomatic ReTyping—a large-scale public dataset of C code for training models to retype or rename decompiled code, consisting of nearly 1 million unique functions and 368 million code tokens.

• DIRTY—the DecompIler variable ReTYper—an open-source Transformer-based neural network model to recover syntactic and semantic types of decompiled variables. DIRTY uses the data layout of variables to improve retyping accuracy, and is able to simultaneously retype and rename variables in decompiled code.

Example output from DIRTY is available online at https://dirtdirty.github.io/explorer.html.

¹ https://www.hex-rays.com/products/decompiler/
2 Model Design

In this section, we describe our machine learning model and the decisions that influenced its design, starting with some relevant background. Our model is a neural network with an encoder-decoder architecture.

2.1 The Encoder-Decoder Architecture

Our task consists of generating variable types (and names) as output given individual functions in decompiled code as input. This means that unlike a traditional classification problem with a fixed number of classes, both our input and output are sequences of variable length: input functions (e.g., fed into the network as a sequence of tokens) can have arbitrarily many variables, each requiring a type (and name) prediction.

Therefore, we adopt an encoder-decoder architecture [7], commonly used for sequence-to-sequence transformations, as opposed to the traditional feed-forward neural network architecture used in classification problems with a fixed-length input vector and prediction target. More specifically, the encoder takes the variable-length input and encodes it as a fixed-length vector. Then, this fixed-length encoding is passed to the decoder, which converts the fixed-length vector into a variable-length output sequence. This architecture, further enhanced through the attention mechanism [3], has been shown to be effective in many tasks such as machine translation, text summarization [36], and image captioning [53].
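As a concrete illustration of the pattern (not DIRTY's actual implementation), the following PyTorch sketch encodes a variable-length token sequence into a fixed-length state and then unrolls a decoder for as many steps as there are variables; all names and sizes are ours:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, vocab_size, num_types, d_model=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.encoder = nn.GRU(d_model, d_model, batch_first=True)
            self.decoder = nn.GRUCell(d_model, d_model)
            self.out = nn.Linear(d_model, num_types)

        def forward(self, tokens, num_steps):
            # Encode the variable-length input into a fixed-length state.
            _, state = self.encoder(self.embed(tokens))
            h = state.squeeze(0)
            preds = []
            step_input = torch.zeros_like(h)  # start-of-sequence stand-in
            for _ in range(num_steps):        # one decoding step per variable
                h = self.decoder(step_input, h)
                preds.append(self.out(h).argmax(-1))
                step_input = h                # carry context to the next step
            return torch.stack(preds, dim=1)

    model = Seq2Seq(vocab_size=1000, num_types=50)
    print(model(torch.randint(0, 1000, (2, 17)), num_steps=3).shape)  # (2, 3)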
2.2 Transformers

There are several ways to implement an encoder-decoder. Until recently, the standard implementation used a particular type of recurrent neural network (RNN) with specialized neurons called long short-term memory units; these neurons and the networks constructed from them are commonly referred to as LSTMs [24]. More recently, Transformer-based models [4, 15, 42, 57], building on the original Transformer architecture [50], have been shown to outperform LSTMs and are considered the state of the art for a wide range of natural language processing tasks, including machine translation [4], question answering and abstractive summarization [10, 32], and dialog systems [1]. Transformer-based models have also been shown to outperform convolutional neural networks such as ResNet [19] on image recognition tasks [11].

Transformers have several properties that make them a particularly good fit for our type prediction task. First, they capture long-range dependencies, which commonly occur in program code, more effectively than RNNs. For example, a variable declared at the beginning of a function may not be used until much later; an ideal model captures information about all uses of a variable. Second, Transformers can perform more computations in parallel on typical GPUs than LSTMs. As a result, training is faster, and a Transformer can train on more data in the same amount of time. In our case, this enables us to train on our large-scale, real-world dataset, which consists of 368 million decompiled code tokens.

Although there have been a number of advances in neural machine translation since the original Transformer model [50], most recent advances focus on improvements in other factors, such as training data and objectives [4, 10, 32, 42], dealing with longer sequences [57], efficiency [6], and scaling [15], rather than changing the fundamental architecture. Moreover, most of these improvements are tailored to the natural language domain, making them less generalizable than the original model and inapplicable to our task. Instead, we keep our model simple, which allows different, better architectures or implementations to be used out-of-the-box in the future. For example, the recent Vision Transformer (ViT) [11] also intentionally follows the original Transformer architecture "as closely as possible" when adapting Transformers to computer vision tasks.

We omit the technical details of Transformers, including multi-headed self-attention, positional encoding, and the specifics of training, as they are beyond the scope of this paper.

2.3 DIRTY's Architecture

In DIRTY, we cast the retyping problem as a transformation from a sequence of tokens representing the decompiled code to a sequence of types, one for each variable in the original source code. This section describes DIRTY's architecture in detail. Figure 3 shows an overview of the architecture.
[Figure 3 diagram omitted. The pictured example feeds variable tokens (VAR1, VAR1, VAR2, ...) into the model together with a data layout record for VAR1 (ID: VAR1, Size: 12, Loc: Stack 0x1c, Offsets: [0, 8]) and predicts the type struct timeval {time_t tv_sec; suseconds_t tv_usec;}.]

Figure 3: Overview of DIRTY's neural model architecture for predicting types. Decompiled code is sequentially fed into the Code Encoder. When the input of the code encoder corresponds to a specific variable (e.g., VAR1), it is pooled with other instances of the same variable to generate a single encoding for that variable. Each pooled encoding is then passed into the Type Decoder, which outputs a vector of the log-odds (logits) for predicted types. This vector is masked with a vector generated by the Data Layout encoder and the most probable type is chosen from the masked logits.
Code Encoder. The encoder converts the sequence of code tokens of the decompiled function (lower-left of Figure 3), x = (x_1, x_2, ..., x_n), into a sequence of representations

    H = (h_1, h_2, ..., h_n),    (1)

where each continuous vector h_i ∈ R^{d_model} is the contextualized representation for the i-th token x_i. During training, the encoder learns to encode the information in the decompiled function x relevant to solving the task into H. For example, for a code token x_i = v1, useful information about v1 in the context of x (e.g., operations performed on v1) is automatically learned and stored in h_i.

Specifically, we denote the encoding procedure as

    H = f_en(x; θ_en),    (2)

where the input x = (x_1, x_2, ..., x_n) is the code token sequence of the decompiled function and the output H = (h_1, h_2, ..., h_n) is the sequence of deep contextualized representations. f_en denotes the encoder, implemented with neural networks, and θ_en denotes its learnable parameters.

The ultimate goal of DIRTY is to make type predictions about each variable that appears in the decompiled function. However, the encoder produces hidden representations for every code token (e.g., "v1", ":", "=", "v1", "+", "1" are all tokens). Because a variable can appear multiple times in the code tokens of a function, we need a way to summarize all appearances of a variable. We achieve this through pooling, where the representation for the t-th variable² is computed based on all of its appearances in the code tokens, A_t, using an average pooling operation [29]:

    v_t = AveragePool_{x_i ∈ A_t}(h_i),  t = 1, ..., m,    (3)

where m is the number of variables in the function. This solution removes from the model the burden of gathering all information about a variable throughout the function into a single token representation. The pooled representation for the first variable, VAR1, is shown in the upper-left of Figure 3.

Type Decoder. Given the encoding of the decompiled tokens, the decoder predicts the most probable (i.e., idiomatic) types for all variables in the function. The decoder takes the encoded representations of the code tokens (H) and identifiers (v_t) as input and predicts the original types ŷ = (ŷ_1, ŷ_2, ..., ŷ_m) for all m variables in the function. Unlike the encoder, the decoder predicts the output step by step, using former predictions as input for later ones.³

At each time step t, the decoder tries to predict the type for the t-th variable as follows:

1. The decoder takes the code representations H and the variable representation v_t from the encoder, and also its own previous predictions ŷ_1, ŷ_2, ..., ŷ_{t-1}, to compute a hidden representation z_t ∈ R^{d_model}:

    z_t = f_de(ŷ_1, ŷ_2, ..., ŷ_{t-1}, v_t, H; θ_de),    (4)

where f_de and θ_de denote the decoder and its parameters. The hidden representation z_t is then used for prediction.

2. The output layer of the decoder then uses its learnable weight matrix W and bias vector b to transform the hidden representation z_t into the logits for prediction:

    s_t = W z_t + b,    (5)

where s_t ∈ R^{|T|}, W ∈ R^{|T| × d_model}, b ∈ R^{|T|}, and |T| is the number of types in the type library. The logits s_t are the unnormalized scores the model assigns to all types.

3. Finally, the softmax function computes a probability distribution over all possible types from s_t:

    Pr(ŷ_t | ŷ_1, ŷ_2, ..., ŷ_{t-1}, x) = softmax(s_t).    (6)

Note that the type library T is fixed, meaning DIRTY can only predict types that it has seen during training. We discuss this limitation, its implications, and potential mitigations in Section 5. However, DIRTY can recover structure types as well as normal types, as both are simply entries in T.

The goal of the decoder is to find the optimal set of type predictions for all variables in a given function (i.e., the predictions with the highest combined probability): argmax_ŷ Pr(ŷ | x). This probability can be factorized as the product of probabilities at each step:

    Pr(ŷ | x) = ∏_{t=1}^{m} Pr(ŷ_t | ŷ_1, ŷ_2, ..., ŷ_{t-1}, x).    (7)

We have shown how to compute Pr(ŷ_t | ŷ_1, ŷ_2, ..., ŷ_{t-1}, x) with the decoder, but finding the optimal ŷ = (ŷ_1, ŷ_2, ..., ŷ_m) is not an easy task, because each variable can have |T| possible predictions, and each prediction affects subsequent predictions. The time complexity of exhaustive search is O(|T|^m). Therefore, finding the optimal prediction is often computationally intractable.

² t is commonly used in the RNN literature because it refers to a "timestep".
³ This is known as an autoregressive model.
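To make Equations 3, 5, and 6 concrete, the following PyTorch sketch implements one decoding step; the tensors stand in for the quantities in the text (z_t here is random rather than produced by a real decoder), and greedy selection is shown only as the simplest practical alternative to exhaustive search:

    import torch
    import torch.nn as nn

    d_model, num_types = 256, 48888        # |T|: the size of the type library
    H = torch.randn(12, d_model)           # h_1..h_n from the code encoder (Eq. 1)

    # Eq. 3: average-pool the representations of every occurrence of a variable.
    A_t = [2, 5, 9]                        # token positions where VAR1 appears
    v_t = H[A_t].mean(dim=0)               # pooled representation for VAR1

    # Eqs. 5-6: project a decoder hidden state to logits over the type
    # library, then normalize with softmax.
    output_layer = nn.Linear(d_model, num_types)  # learnable W and b
    z_t = torch.randn(d_model)             # stand-in for f_de's output (Eq. 4)
    s_t = output_layer(z_t)                # unnormalized scores over all types
    p_t = torch.softmax(s_t, dim=-1)       # Pr(y_t | y_1..y_{t-1}, x)
    print(p_t.argmax().item())             # greedy choice for step t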
The training and prediction procedures remain almost the same, with two notable exceptions. First, to improve performance, the Data Layout encoder is not activated when the decoder is predicting a variable's name. This is unnecessary because name prediction depends on the predicted type, which has already incorporated the data layout information. Preliminary experiments confirmed no improvement in accuracy when using the Data Layout encoder for name prediction.

Second, there are two ways to interleave the predictions of types and names: types first or names first. In theory, this does not matter, because they are equivalent if the learned model and the decoding algorithm are ideal. In practice, we chose to predict types first because we believe the type prediction task should be easier (since there is more information) and it

3 Evaluation

form training examples.

Since DIRE was only concerned with renaming, its dataset did not include variables which did not correspond to a named variable in the original source code. Many such variables are actually caused by mistakes made by the decompiler during type recovery, for instance decompiling a structure into multiple scalar variables. Since the goal of DIRT is to enable type recovery and fix such mistakes, we label these instances as <Component> to denote that they are components of a variable in the source code. This allows the model to combine them with other variables into an array or a struct.

⁴ https://ghtorrent.org

The final DIRT dataset consists of 75,656 binaries randomly sampled from the full set of 4,346,134 binaries to
yield a dataset that we could fully process based on the computational resources we had available. We split the dataset per-binary as opposed to per-function, which ensures that different functions from the same binary cannot be in both the test and training sets. The training dataset consists of 997,632 decompiled functions and a total of 48,888 different types. We also preprocess the decompiled code with byte-pair encoding (BPE) [45], a technique widely adopted in NLP tasks to represent rare words with a limited vocabulary by tokenizing them into subword units. After this step, the DIRT dataset consists of 368 million decompiled code tokens, with an average of 220.3 tokens per function. Detailed statistics about the DIRT dataset and the train/valid/test split can be found in Table 11 in Appendix A.
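The per-binary split can be sketched as follows; the record layout (a binary field per function) is an assumption for illustration, not DIRT's actual schema:

    import random

    def split_by_binary(functions, test_frac=0.1, seed=0):
        """Split so that no binary contributes functions to both sets."""
        binaries = sorted({f["binary"] for f in functions})
        random.Random(seed).shuffle(binaries)
        test_bins = set(binaries[:int(len(binaries) * test_frac)])
        train = [f for f in functions if f["binary"] not in test_bins]
        test = [f for f in functions if f["binary"] in test_bins]
        return train, test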
Metrics. We evaluate DIRTY using two metrics:

Name Match: Following DIRE [29], we consider a variable name prediction correct if it exactly string-matches the name assigned by the original developer. We compute the prediction accuracy as the average percentage of correct predictions across all functions in the test set.

Type Match: We consider a type prediction to be correct only if the predicted type fully matches the ground truth type, including data layout, and the type and name of any fields if applicable. We serialize types to strings and use string matching to determine type matching.

Note that both metrics are conservative. Predictions may still be meaningful even if not identical to the original names. A human study evaluating the quality of predicted types and names is beyond the scope of the current paper.
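A minimal sketch of the Type Match comparison, with a toy serializer standing in for our actual type-to-string serialization (the representation here is ours):

    def serialize(ty):
        """Canonical string for a type; a stand-in for the real serializer."""
        if isinstance(ty, str):       # scalar types are just their names
            return ty
        name, fields = ty             # ("struct point", [("float", "x"), ...])
        body = " ".join(f"{ftype} {fname};" for ftype, fname in fields)
        return f"{name} {{{body}}}"

    def type_match(pred, truth):
        # Correct only if the serialized forms match exactly, including
        # data layout, field types, and field names.
        return serialize(pred) == serialize(truth)

    assert type_match("int", "int")
    assert not type_match(("struct point", [("float", "x"), ("float", "y")]),
                          ("struct point", [("float", "x"), ("float", "z")]))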
Meaningful Subsets of the Test Data. We introduce several subsets of the DIRT test set to better interpret the results:

Function in training vs. Function not in training. Similarly to Lacomis et al. [29], Function in training consists of the functions in the test set that also appear in the training set, which are mainly library functions. Allowing this duplication simulates the realistic use case of analyzing a new binary that uses well-known libraries. We also separately measure the cases where the function is not known during training (i.e., Function not in training) to measure the model's generalizability.

Structure types. Only 1.8% of variables in DIRT have structure types. Because of this low percentage, examining overall accuracy may not reflect DIRTY's accuracy when predicting structure types, which we have found anecdotally to be more challenging. To mitigate this, we separately measure DIRTY's accuracy on structures in addition to its overall accuracy.
3.2 RQ1: Overall Effectiveness

We evaluate DIRTY on the idiomatic retyping task and report its accuracy compared to several baselines.

Baselines. We measure our accuracy with respect to two baseline methods for predicting variable types:

Frequency by Size. The number of bytes a variable occupies is the most basic information for a type. For this technique, we predict the most common developer-assigned type for a given size (as reported by the decompiler). E.g., int is the most common 4-byte type, and __int64 is the most common 8-byte type; this baseline simply assigns these types to variables of the respective size.

Hex-Rays [22]. During decompilation, Hex-Rays already predicts a type for each variable, so we can use these predictions as a baseline. However, Hex-Rays cannot predict developer-generated types without prior knowledge of them, e.g., Hex-Rays assigns unsigned __int16 instead of the more common uint16_t, which puts it at an unfair disadvantage. For this baseline, we reassign the type chosen by Hex-Rays to the most common developer-chosen name associated with it (e.g., we replace every unsigned __int16 with uint16_t).
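Both baselines reduce to frequency tables over the training data; a sketch, where the input tuple shapes are our assumptions for illustration:

    from collections import Counter

    def fit_frequency_by_size(train_vars):
        """Most common developer type per variable size, e.g., 4 -> int."""
        by_size = {}
        for size, dev_type in train_vars:          # e.g., (4, "int")
            by_size.setdefault(size, Counter())[dev_type] += 1
        return {s: c.most_common(1)[0][0] for s, c in by_size.items()}

    def fit_hexrays_remap(train_vars):
        """Most common developer type for each Hex-Rays type, e.g.,
        unsigned __int16 -> uint16_t."""
        remap = {}
        for hr_type, dev_type in train_vars:
            remap.setdefault(hr_type, Counter())[dev_type] += 1
        return {hr: c.most_common(1)[0][0] for hr, c in remap.items()}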
                Overall          In Train         Not in Train
Method          All     Struct   All     Struct   All     Struct
FSize           23.6    9.7      23.5    9.1      23.8    10.4
HR              37.9    28.7     39.0    28.7     36.4    28.7
DIRTY           75.8    68.6     89.9    79.2     56.4    54.6

Table 1: DIRTY has higher retyping accuracy than Frequency by Size (FSize) and Hex-Rays (HR) on the DIRT dataset, both for all types (All) and on structural types alone (Struct).

Results. As shown in Table 1, DIRTY can correctly recover 75.8% of the original (developer-written) types from the decompiled code. In contrast, Hex-Rays, the highest-scoring baseline, can only recover 37.9% of the original types.

As expected, DIRTY performs even better when it has seen a particular function before (In Train), generating the same type as the developer 89.9% of the time. This indicates that DIRTY works particularly well on common code such as libraries. Even when a function has never been seen (Not in Train), DIRTY predicts the correct type 56.4% of the time.

Table 1 also shows the performance of DIRTY on structure types alone. Correctly predicting structure types is more difficult than predicting scalar types, and all models show a drop in performance. Despite this drop, DIRTY still achieves 68.6% accuracy overall, and 54.6% accuracy in the Function not in training category. Frequency by Size struggles on structures, with only 9.7% accuracy; this is expected, since structures of a given size can have many possible types. Hex-Rays is slightly more accurate at 28.7%, as the decompiler is able to analyze the layout of structures.
[Table 2 contents omitted; the developer-assigned types shown include int, char *, and class std::string.]

Table 2: Example variable types from the Function not in training testing partition. The top rows are the developer-assigned types and the columns show DIRTY's top-5 most frequent predictions. <Component> represents a prediction that the variable in the decompiled code does not correspond to a variable in the source code (e.g., because it corresponds to a member of a struct).
Table 2 shows several examples of retyping predictions from the Function not in training partition. These examples show that accuracy is not the full story; even when DIRTY is unable to predict the correct type, the differences are often minor (e.g., unsigned int vs. int, and const char * vs. char *). The bottom half of Table 2 shows prediction examples for structure types.⁶ DIRTY is able to recover the actual structure much of the time. At other times, DIRTY also produces some semantically reasonable but syntactically unacceptable predictions, like char[32] for class std::string.

3.3 RQ2: Comparison with Prior Work

We further compare DIRTY with recent work on type recovery [58] and variable name recovery [29].

Type Recovery. While there is prior work on type recovery (see also Section 4), none of the existing approaches, TIE [31], Howard [47], Retypd [39], TypeMiner [34], and OSPREY [58], are publicly available. We are grateful to Zhang et al. [58], the authors of OSPREY, for kindly sharing their evaluation material so we could compare results.

OSPREY is a recently proposed probabilistic technique for variable and structure recovery that outperforms existing work including Howard [47], Angr [46], Hex-Rays [22], and Ghidra [58]. The OSPREY authors provided us with the GNU coreutils⁷ executables they used in their evaluation, which were compiled with -O0 to disable optimization. We ran DIRTY on these executables, but only evaluated on stack and heap variables, since OSPREY does not recover register variables. This benchmark consists of 101 binaries and 17,089 variables. We also define two subsets of the dataset:

Visited. A subset of 13,020 variables that are covered by BDA [59], a binary abstract interpretation tool that OSPREY relies on. OSPREY is expected to perform better on these covered functions than on uncovered functions, which we also report as Non-Visited.⁸ However, DIRTY is not subject to this limitation.

Struct. A subset of 3,061 variables related to structure types. Following OSPREY, we include structs allocated on the stack, pointers to structs on the heap, and arrays of structs. These variables do not have to be in the Visited subset.

Because DIRTY can predict up to 48,888 different types, each including the full syntactic and semantic information, we convert its predictions in a post-hoc manner to make them comparable with OSPREY.⁹

Table 3 compares the accuracies of both systems. On the overall coreutils benchmark, DIRTY slightly outperforms OSPREY (76.8% vs. 71.6%). OSPREY outperforms DIRTY on the Visited subset but, as expected, performs worse on the Non-Visited functions. Meanwhile, DIRTY is more consistent across Visited and Non-Visited. When looking only at structure types, OSPREY outperforms DIRTY (26.6% vs. 15.7%).

However, this comparison puts DIRTY at a disadvantage, since OSPREY was designed for this task of recovering syntactic types, while DIRTY was trained to recover variable and type/field names, and much of this information is thrown out for this evaluation. To address this, we trained a new model, DIRTYLight, on DIRT, but tailored the training to OSPREY's simplified task. The accuracy of this model is also reported in Table 3. As expected, the DIRTYLight model outperforms the off-the-shelf DIRTY model, since it is trained specifically for this task. DIRTYLight greatly improves prediction accuracy on the Struct subset, and even outperforms OSPREY.

To get a finer-grained comparison with OSPREY, we calculate accuracy on the 101 coreutils binaries individually, and show the prediction accuracies of DIRTY and OSPREY with respect to the number of variables in the programs in Figure 6. We observe that DIRTY is competitive with OSPREY. Interestingly, while the results on large binaries are close, DIRTY performs better on small binaries. This suggests that our learning-based method trained on GitHub data might generalize better to rare patterns than empirical methods developed from observations of a limited number of common and relatively larger programs.

In addition, DIRTY is much faster and more scalable. On average, OSPREY takes around 10 minutes to analyze one binary in coreutils, while it takes 75 seconds for DIRTYLight to finish inference on the whole coreutils benchmark.

⁶ We omit the full predicted contents of structs here for conciseness.
⁷ https://www.gnu.org/software/coreutils/
⁸ A majority of uncovered functions are unreachable from the entry point of the binary, and the others are indirect call targets which BDA fails to analyze.
⁹ Specifically, we discard type names and field names. For example, bool and char are both converted to Primitive_1, which stands for a primitive type occupying 1 byte of memory, const char * and char * are converted to Pointer<Primitive_1>, and struct ImVec2 {float x; float y;} is converted to Struct<Primitive_4, Primitive_4>.
                        Coreutils
Model          All     Visited   Non-Visited   Struct
OSPREY         71.6    83.8      32.4          26.6
DIRTY          76.8    79.1      69.6          15.7
DIRTYLight     80.1    80.1      80.1          27.7

Table 3: Accuracy comparison on the coreutils benchmark.

            Accuracy
Model       Overall   Struct
DIRTYS      74.5      65.4
DIRTY       75.8      68.6

Table 5: Effect of model size. The accuracy columns show the overall accuracy and the accuracy on struct types.
[Figure 6 line chart omitted. x-axis: Number of Variables in Binary (100-500); y-axis: Accuracy (0%-100%); one line each for DIRTY and OSPREY.]

Figure 6: Accuracy of DIRTY and OSPREY on 101 individual programs in the coreutils benchmark with different numbers of variables. The two methods are competitive on large binaries, while DIRTY performs much better on small binaries.

Overall, we believe both methods are valuable. Since at this point DIRTY is using the Hex-Rays-recovered data layout as input to its Data Layout Encoder, we believe a promising future direction is to combine these two methods—using OSPREY's results as the input to DIRTY—and the combined approach can potentially achieve even better results.
Name Recovery. The Decompiled Identifier Renaming Engine (DIRE) is a state-of-the-art neural approach for decompiled variable name recovery [29]. The DIRE model consists of both a lexical encoder and a structural encoder, utilizing both the tokenized decompiled code and the reconstructed abstract syntax tree (AST). In contrast, DIRTY's simpler encoder only uses the tokenized decompiled code.

The DIRE authors provide a public dataset for decompiled variable renaming compiled with -O0. To compare with DIRE, we train DIRTY on the DIRE dataset and also train DIRE on the DIRT dataset. Since DIRE is focused on variable renaming, and there is no type information collected in its dataset, we cannot use the Data Layout Encoder for these experiments. Instead, we only use our Code Encoder and Renaming Decoder. We report the accuracy of both systems in Table 4. DIRTY significantly outperforms DIRE in terms of overall accuracy on both the DIRE dataset (81.4% vs. 72.8%) and the DIRT dataset (66.4% vs. 57.5%). DIRTY also generalizes better than DIRE: when functions are not in the training set, DIRTY outperforms DIRE on both the DIRE dataset (42.8% vs. 33.5%) and the DIRT dataset (36.9% vs. 31.8%).

           DIRE Dataset            DIRT Dataset
Model      All     FIT     FNIT    All     FIT     FNIT
DIRE       72.8    84.1    33.5    57.5    75.6    31.8
DIRTY      81.4    92.6    42.8    66.4    87.1    36.9

Table 4: Accuracy comparison of DIRE and DIRTY on the DIRE and DIRT datasets. Accuracy is reported overall (All), when functions are in the training set (FIT), and when functions are not in the training set (FNIT).

DIRTY outperforms DIRE in spite of the fact that it only leverages the decompiled code, whereas DIRE leverages both the decompiled code and the reconstructed AST from Hex-Rays. Since the primary difference between DIRTY without type prediction and DIRE is that DIRTY uses a Transformer as its encoder and decoder network, we attribute this improvement to the power of Transformers, which allow modeling interactions between any pair of tokens, unrestricted by a sequential or tree structure as in DIRE.

Also notable is that DIRTY trains faster than DIRE. We found that DIRTY surpassed DIRE in accuracy after training for 30 GPU hours, compared to the 200 GPU hours required to train DIRE on the full DIRT dataset, which we again attribute to the efficiency of the Transformer architecture.

3.4 RQ3: Ablation Study

To understand how each component of DIRTY contributes to its overall performance, we perform an ablation study.

Model Size. Transformers have the merit of scaling easily to larger representational power by stacking more layers and increasing the number of hidden units and attention heads per layer [10, 50]. We compare DIRTY to a modified, smaller version, DIRTYS. DIRTY contains 167M parameters, while DIRTYS contains only 40M. Table 10 contains details of the hyperparameter differences between the two models.

Table 5 shows that overall DIRTY is 75.8% accurate vs. 74.5% for DIRTYS. This indicates that increasing the model size has a positive effect on retyping performance. The gain from increased model capacity is notably larger when comparing performance on structures. This improvement suggests that complex types are more challenging and require a model with larger representational capacity. We are not able to train a
larger model due to limits on computation power.

[Figure 7 line chart omitted. x-axis: Size of Training Set (20%-100%); y-axis: Accuracy (0%-100%); one line each for All, In train, and Not in train.]

Figure 7: Effect of training data size. With 100% of the data, the accuracies of All, In train, and Not in train are 75.8%, 89.9%, and 56.4%, respectively. With 20%, these drop to 67.9%, 82.3%, and 48.0%, respectively.

Model        Overall   In train   Not in train
DIRTYNDL     72.2      88.4       49.9
DIRTY        75.8      89.9       56.4

Table 6: Effect of the Data Layout encoder on the accuracy of DIRTY. Accuracy is reported for the model with (DIRTY) and without (DIRTYNDL) the encoder.

Dataset Size. We examine the impact of training data size on prediction accuracy. As a data-driven approach, DIRTY relies on a large-scale code dataset; studying the impact of data size gives us insight into the amount of data to collect. We trained DIRTY on 20%, 40%, 60%, 80%, and 100% portions of the full training partition and report the results in Figure 7.

Figure 7 shows the change in accuracy with respect to the percentage of training data. Increasing the size of the training data has a significant positive effect on accuracy. Between 20% and 100% of the full size, the accuracy increases from 67.9% to 75.8%, a relative gain of 11.6%.

Notably, accuracy on Function not in training has a relative gain of 17.5%, much larger than on the Function in training partition. This is likely because the Function in training partition contains common library functions shared by programs in both the training and test sets, and even a smaller dataset will have programs that use these functions. In contrast, the Function not in training part is open-ended and diverse.

It is also worth noting that the accuracy drops sharply when the training set size is decreased from 40% to 20%, justifying the necessity of using a large-scale dataset.

Data Layout Encoder. We explore the impact of the Data Layout encoder on DIRTY's performance. We experiment with a new model with no Data Layout encoder, DIRTYNDL. Table 6 shows the accuracy results overall and on the Function in training and Function not in training partitions. The inclusion of the Data Layout encoder improves overall accuracy from 72.2% to 75.8%, indicating that the Data Layout encoder is effective. The results are even more interesting when broken into the two partitions. The relative gain on the Function not in training partition is 13% (49.9% to 56.4%), compared to 1.7% on the Function in training partition (88.8% to 89.9%). This suggests the Data Layout encoder greatly improves DIRTY's generalization ability.

Table 7 compares example predictions from DIRTY and DIRTYNDL on the same types from the Function not in training partition. For the __int64 example, the type predictions from DIRTY mostly have the correct size of 8 bytes. DIRTYNDL, however, often incorrectly predicts int and unsigned int. This is understandable because in situations where the value does not exceed a 32-bit integer, __int64 can be safely interchanged with int, and these situations can be identified in some decompiled code. However, apart from the correctness of the retyped program, accuracy with respect to the original binary (i.e., allocating 8 bytes instead of 4) is also important. DIRTY achieves this better than DIRTYNDL.

In the second example, the struct __m128d type occupies 16 bytes, and has two members at offsets 0 and 8. DIRTYNDL mainly mistakes this structure for a double, which might make sense semantically but is unacceptable syntactically. With the Data Layout encoder, DIRTY effectively reduces these errors. This demonstrates that this component achieves the soft masking effect on type prediction as intended in Section 2.4.
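The soft masking referred to above can be pictured as adding a penalty to the logits of types whose layout disagrees with the variable's; in DIRTY the mask comes from the learned Data Layout encoder, whereas the fixed penalty below is only an illustration:

    import torch

    s_t = torch.randn(6)                            # type logits from the decoder
    type_sizes = torch.tensor([1, 2, 4, 4, 8, 16])  # layout of each candidate type
    var_size = 8                                    # the variable occupies 8 bytes

    # Soft mask: layout-incompatible types are penalized, not forbidden.
    mask = (type_sizes != var_size).float() * -5.0
    p_t = torch.softmax(s_t + mask, dim=-1)
    print(p_t.argmax().item())                      # typically 4: the only 8-byte candidate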
Multi-Task Decoder. In this section we study the effectiveness of the Multi-Task decoder compared to decoders designed for only retyping or only renaming. Inspecting the accuracy numbers reported in Table 8, the Multi-Task decoder has similar, but slightly lower, overall accuracy on both tasks than the two specialized models (-0.8% for retyping and -1.3% for renaming). One possible reason is that the Multi-Task model has twice the decoding length of a specialized model, which makes greedy decoding harder.

Despite the small decrease in performance, the unified model has advantages. These are illustrated in the XName and XType columns of Table 8. XName and XType stand for the subsets of the full dataset where the Multi-Task decoder makes correct renaming predictions and correct retyping predictions, and we evaluate the retyping and renaming performance on them, respectively.¹⁰ The Multi-Task decoder outperforms the specialized models by 1.9% and 2.4%, relatively, on these metrics, in spite of the longer decoding length. This means the type and name predictions from the Multi-Task decoder are more consistent with each other than those from specialized models. In other words, making a correct prediction on one task increases the probability of success on the other task.

In practice, this offers additional flexibility and opens the opportunity for more applications. For example, consider a

¹⁰ The probability of success on the other task also increases by chance, because success on one task implies it is easier than average. We have eliminated this influence by, e.g., comparing 92.3 to 90.6, instead of 74.9.
[Table 7 contents omitted; the columns group DIRTY's and DIRTYNDL's top predictions for the developer types __int64 and struct __m128d.]

Table 7: Comparative examples from DIRTY with and without the Data Layout encoder, from the Function not in training partition. Predictions inside a gray box have a different data layout than the ground truth type. DIRTY effectively suppresses these, which helps guide the model to a correct prediction. The structure's full type is struct __m128d {double[2] m128d_f64;}.

             Retyping            Renaming
Model        Overall   XName     Overall   XType
Retyping     75.8      90.6      -         -
Renaming     -         -         66.4      82.6
Multi-Task   74.9      92.3      65.1      84.6

Table 8: Performance comparison of the Retyping-only, Renaming-only, and Multi-Task decoders. Overall performance is shown, in addition to performance on retyping when the name is correct (XName) and performance on renaming when the type is correct (XType).

             GNU coreutils
Model        -O0      -O1      -O2      -O3
DIRTY        48.20    46.01    46.04    46.00

we believe -O0 code to be simpler. Going from -O0 to -O1, DIRTY's accuracy drops from 48.2% to 46.0%. However, there is little difference in performance between -O1, -O2, and -O3. This suggests that DIRTY does slightly better on the optimization level of the code it was trained on, but that the effect of optimizations is small. We believe this is because Hex-Rays recognizes and will "undo" some optimizations so that the decompiled code will be very similar. For example, unoptimized code will often reference stack variables using a frame pointer, but optimized code will reference such variables using the stack pointer, or even maintain them in a register. Both implementations will look similar in the decompiled code, since the mechanism used to reference the variable is not important at the C level. Since DIRTY operates on the decompiled code, the decompiler effectively insulates DIRTY from these optimizations.
    int find_unused_picture(int a1, int a2, int a3) {
        int i, j, v1;
        if (a3) {
            for (i = <Num>;; ++i) {
                if (i > <Num>)
                    goto LABEL_13;
                if (!*(*(<Num> * i + a2) + <Num>))
                    break;
            }
            v1 = i;
        } else {
            for (j = <Num>;; ++j) {
                if (j > <Num>) {
    LABEL_13:
                    av_log(a1, <Num>, <Str>);
                    abort();
                }
                if (pic_is_unused(<Num> * j + a2))
                    break;
            }
            v1 = j;
        }
        return v1;
    }

    ID   Developer                 DIRTY
    a1   AVCodecContext_0 *avctx   MpegEncContext_0 *s
    a2   Picture_0 *picture        Picture_0 *pic
    a3   int shared                int shared
    v1   int result                int result

Figure 8: Simplified Hex-Rays output. <Num> and <Str> are placeholder tokens for constant numbers and strings, respectively. The table summarizes the original developer names and types along with the names and types predicted by DIRTY.

4 Related Work

Other projects related to type recovery for decompilation are REWARDS [33], TIE [31], Retypd [39], and OSPREY [58]. Unlike our approach, they use program analyses to compute constraints on types. Additionally, they are limited either to only predicting the syntactic type (TIE, Retypd, OSPREY) or to only predicting one of a small set of hand-written types (150 for REWARDS). In comparison, DIRTY automatically generates a database of types by observing real-world code.

Other projects use machine learning to predict types, but target different languages than DIRTY. DeepTyper [20] learns type inference for JavaScript, and OptTyper [40], LambdaNet [52], and R-GNNNS-CTX [56] target TypeScript. Training a machine learning algorithm for the task of typing dynamic languages like these is a slightly easier task: generating a parallel corpus is simple, since the types can simply be removed without changing the semantics. The DIRT dataset is fundamentally different: including debug information often changes the layout of the code as the decompiler adds structures and syntax for accessing them.

To the best of our knowledge, the most directly related work to DIRTY is TypeMiner [34]. TypeMiner is a pioneering work, providing the proof of concept for recovering types from C binaries. However, it uses much simpler machine learning algorithms, and its dataset only consists of 23,482 variables and 17 primitive types. Escalada et al. [14] have provided similar insights. They adopt simple classification algorithms to predict function return types in C, but they consider only 10 different (syntactic) types, and their dataset is limited to 2,339 functions from real programs and 18,000 synthetic functions.

Several other projects target the improvement of decompiler output using neural models: DIRE [29], which predicts variable names; DIRECT [38], which extends DIRE using transformer-based models; and Nero [8], which generates procedure names. Other approaches work directly on assembly [16, 26, 27], and learn code structure generation instead of aiming to recover developer-specified variable types or names. Similarly, DEBIN [18] and CATI [5] use machine learning to respectively predict debug information and types directly from stripped binaries without a decompiler.

5 Discussion

In this paper we presented DIRTY, a novel deep-learning-based technique for predicting variable types and names in decompiled code. Still, DIRTY is limited in several ways that provide key opportunities for future improvements.

Alternative Decompilers to Hex-Rays. We implement DIRTY on top of the Hex-Rays decompiler because of its positive reputation and the programmatic access it affords to decompiler internals. However, DIRTY is not fundamentally specific to Hex-Rays, and the technique should conceptually work with any decompiler that names variables using DWARF debug symbols. Note that, due to its recent popularity and promise, we attempted to evaluate our techniques using the newer, open-source Ghidra decompiler. Unfortunately, this is currently infeasible, because Ghidra routinely failed to accurately name stack variables based on DWARF. This appears to be a combination of specific issues¹¹ and the general design of the decompiler. Ghidra's decompiler consists of many passes which modify and augment the current decompilation. Some of these passes combine variables, but in doing so may combine a DWARF-named variable with others. Since the combined variable no longer corresponds directly to the DWARF variable information, Ghidra discards the name. We are optimistic, however, that when the above-mentioned issues are addressed, Ghidra may again be a reasonable target for our approach.

¹¹ https://github.com/NationalSecurityAgency/ghidra/issues/2322

Generalizing to Unseen Types. A limitation of DIRTY's current decoder is that it can only predict types seen during training. Fortunately, there appears, empirically, to be sufficient redundancy across large corpora that DIRTY is still frequently able to successfully recover structural types. This
lends credence to the hypothesis that code is natural, an observation that has been explored in several domains [9, 23]. It moreover appears that data layout is of particular importance here: layout information recovered from the decompiler imposes key constraints on the overall prediction problem. Indeed, our results in Section 3.4 corroborate the intuition that the Data Layout Encoder is especially important for succeeding on previously unseen code.

We envision meaningful future opportunities to more directly expand DIRTY's capabilities to predict unseen structures. This problem is analogous to machine translation models that must deal with rare or compound words (e.g., xenophobia) that are not present in their dictionary. Byte Pair Encoding (BPE) [45] is the most frequently used technique to tackle this problem in the natural language domain. It automatically splits words into multiple tokens that are present in the dictionary (e.g., xeno and ##phobia). (The ## indicates the token is still part of the current word, instead of starting a new word.) This technique greatly increases the number of words a model can handle despite a limited dictionary size, and enables the composition of new words that were not seen during training. This suggests that we can similarly extend DIRTY's decoder to predict previously unseen types by decomposing structure types into multiple pieces with BPE. For example, the structure type struct timeval {time_t tv_sec; suseconds_t tv_usec;} is split into four separate tokens, which are 1) struct timeval, 2) time_t tv_sec;, 3) suseconds_t tv_usec;, and 4) <end_of_struct>.
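The decomposition of that example is mechanical; a minimal sketch, assuming a simple (struct name, field list) representation of our own:

    def decompose(struct_name, fields):
        """Split a structure type into a sequence of sub-type tokens,
        analogous to BPE subword units; <end_of_struct> closes the type."""
        tokens = [f"struct {struct_name}"]
        tokens += [f"{ftype} {fname};" for ftype, fname in fields]
        tokens.append("<end_of_struct>")
        return tokens

    print(decompose("timeval", [("time_t", "tv_sec"), ("suseconds_t", "tv_usec")]))
    # ['struct timeval', 'time_t tv_sec;', 'suseconds_t tv_usec;', '<end_of_struct>']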
However, unfortunately, our preliminary experiments suggested that this hurts overall prediction accuracy. It also significantly slows down prediction, since it drastically increases the number of decoding steps. It moreover requires finer-grained accuracy metrics, like tree distance, to allow us to measure and credit partially correct predictions. Based on these observations, we believe unseen structure types should be handled specially with a tailored model, a problem we leave to future work.

Supporting Non-C Languages. A benefit of decompiling to C is that, as a relatively low-level language, it can express the behavior of executables beyond those written in C. Although we designed DIRTY to be used with C programs and types, DIRTY can run on non-C programs, and will try to identify the C type that best captures the way in which a variable is being used. Thus, DIRTY provides value to analysts seeking to understand non-C programs, similar to how C decompilers such as Hex-Rays help analysts understand C++ programs.

However, many compiled programming languages have type systems far richer than C's, and expressing these types in terms of C types may be confusing. For example, in C++, virtual functions are often implemented by reading an address out of a virtual function table [13, 44]. Although techniques like DIRTY can recognize such tables as structs or arrays of code pointers, they do not reveal the connection to the higher-level C++ behavior of virtual functions.

Extending DIRTY to support higher-level languages such as C++ is an interesting open problem. To some degree, as long as the decompiler is able to import the higher-level type information from debug symbols into the decompiler output, it should be possible to train DIRTY to recognize non-C types. For instance, 6% of the programs in DIRT are written in C++, and our evaluation measures DIRTY's ability to predict common C++ class types such as std::string. But recovering higher-level properties of these types, especially for those never seen during training, is a challenging problem and is likely to require language-specific adaptations [13, 44].

Limited Input Length. As is common with Transformers, we truncate the decompiled function if its length n exceeds some upper limit max_seq_length, which makes training more efficient. In our experiments we set max_seq_length to 512 for two reasons. First, 512 is the default value for max_seq_length in many Transformer models [10, 50]. Second, in DIRT, the average number of tokens in a function is 220.3, and only 8.8% of the functions have more than 512 tokens, i.e., we exclude relatively few of the possible inputs encountered in the wild. Still, if enough computational resources are available, we recommend using efficient Transformer implementations such as Big Bird [57] instead. These can deal with a much larger max_seq_length and can be used out-of-the-box to replace our implementation.

6 Conclusion

The decompiler is an important tool used for reverse engineering. While decompilers attempt to reverse compilation by transforming binaries into high-level languages, generating the same code originally written by the developer is impossible. Many of the useful abstractions provided by high-level languages, such as loops, typed variables, and comments, are irreversibly destroyed by compilation. Decompilers are able to deterministically reconstruct some structural properties of code, but comments, variable names, and custom variable types are technically impossible to recover.

In this paper we address the problem of assigning decompiled variables meaningful names and types by statistically modeling how developers write code. We present DIRTY (DecompIled variable ReTYper), a novel technique for improving the quality of decompiler output that automatically generates meaningful variable names and types. Empirical evaluation of DIRTY on a novel dataset of C code mined from GitHub shows that DIRTY outperforms prior approaches by a sizable margin, recovering the original names written by developers 66.4% of the time and the original types 75.8% of the time.
References Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, et al. An
[1] Daniel Adiwardana, Minh-Thang Luong, David R. So, image is worth 16x16 words: Transformers for image
Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, recognition at scale. arXiv preprint arXiv:2010.11929,
Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and 2020.
Quoc V. Le. Towards a human-like open-domain chat-
bot. arXiv preprint arXiv:2001.09977, 2020. [12] Lukas Durfina, Jakub Kroustek, and Petr Zemek. PsybOt
malware: A step-by-step decompilation case study. In
[2] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Working Conference on Reverse Engineering, WCRE,
and Charles Sutton. A survey of machine learning for pages 449–456, 2013.
big code and naturalness. ACM Computing Surveys
(CSUR), 51(4):81, 2018. [13] Rukayat Ayomide Erinfolami and Aravind Prakash.
Devil is virtual: Reversing virtual inheritance in C++
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- binaries. In Proceedings of the ACM Conference on
gio. Neural machine translation by jointly learning to Computer and Communications Security, CCS, pages
align and translate. In International Conference on 133–148, 2020.
Learning Representations, ICLR, 2015.
[14] Javier Escalada, Ted Scully, and Francisco Ortin. Im-
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie proving type information inferred by decompilers
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nee- with supervised machine learning. arXiv preprint
lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, arXiv:2101.08116, 2021.
et al. Language models are few-shot learners. arXiv
preprint arXiv:2005.14165, 2020. [15] William Fedus, Barret Zoph, and Noam Shazeer. Switch
transformers: Scaling to trillion parameter models
[5] Ligeng Chen, Zhongling He, and Bing Mao. CATI: with simple and efficient sparsity. arXiv preprint
Context-assisted type inference from stripped binaries. arXiv:2101.03961, 2021.
In International Conference on Dependable Systems
and Networks, DSN, 2020. [16] Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuan-
dong Tian, Farinaz Koushanfar, and Jishen Zhao. Coda:
[6] Rewon Child, Scott Gray, Alec Radford, and Ilya An end-to-end neural program decompiler. In Con-
Sutskever. Generating long sequences with sparse trans- ference on Neural Information Processing Systems,
formers. 2019. NeurIPS, 2019.
[7] Kyunghyun Cho, Bart van Merriënboer, Caglar Gul- [17] Edward M. Gellenbeck and Curtis R. Cook. An inves-
cehre, Dzmitry Bahdanau, Fethi Bougares, Holger tigation of procedure and variable names as beacons
Schwenk, and Yoshua Bengio. Learning phrase rep- during program comprehension. Technical report, Ore-
resentations using RNN encoder-decoder for statistical gon State University, 1991.
machine translation. In Conference on Empirical Meth-
ods in Natural Language Processing, EMNLP, 2014. [18] Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Ray-
chev, and Martin Vechev. D EBIN: Predicting debug
[8] Yaniv David, Uri Alon, and Eran Yahav. Neural re- information in stripped binaries. In Conference on Com-
verse engineering of stripped binaries using augmented puter and Communications Security, CCS, 2018.
control flow graphs. Proceedings of the ACM on Pro-
gramming Languages, 4(OOPSLA):1–28, 2020. [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition.
[9] Premkumar Devanbu. New initiative: The naturalness In IEEE Conference on Computer Vision and Pattern
of software. In International Conference on Software Recognition, CVPR, pages 770–778, 2016.
Engineering, ICSE, pages 543–546, 2015.
[20] Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Miltiadis Allamanis. Deep learning type inference. In
Kristina Toutanova. BERT: Pre-training of deep bidi- Joint Meeting of the European Software Engineering
rectional transformers for language understanding. In Conference and the Symposium on the Foundations of
Annual Conference of the North American Chapter of Software Engineering, ESEC/FSE, 2018.
the Association for Computational Linguistics, NAACL-
HLT, pages 4171–4186, 2019. [21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear
units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander
Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
14
[22] Hex-Rays. The hex-rays decompiler, 2019. URL https://www.hex-rays.com/products/decompiler/.
[23] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In International Conference on Software Engineering, ICSE, pages 837–847, 2012.
[24] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[25] Alan Jaffe, Jeremy Lacomis, Edward J. Schwartz, Claire Le Goues, and Bogdan Vasilescu. Meaningful variable names for decompiled code: A machine translation approach. In International Conference on Program Comprehension, ICPC, pages 20–30, 2018.
[26] Deborah S. Katz, Jason Ruchti, and Eric Schulte. Using recurrent neural networks for decompilation. In International Conference on Software Analysis, Evolution and Reengineering, SANER, pages 346–356, 2018.
[27] Omer Katz, Yuval Olshaker, Yoav Goldberg, and Eran Yahav. Towards neural decompilation. arXiv preprint arXiv:1905.08325, 2019.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, ICLR, 2015.
[29] Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. DIRE: A neural approach to decompiled identifier naming. In International Conference on Automated Software Engineering, ASE, pages 628–639, 2019.
[30] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What's in a name? A study of identifiers. In International Conference on Program Comprehension, ICPC, pages 3–12, 2006.
[31] JongHyup Lee, Thanassis Avgerinos, and David Brumley. TIE: Principled reverse engineering of types in binary programs. In Network and Distributed System Security Symposium, NDSS, 2011.
[32] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics, ACL, pages 7871–7880, 2020.
[33] Zhiqiang Lin, Xiangyu Zhang, and Dongyan Xu. Automatic reverse engineering of data structures from binary execution. In CERIAS Annual Security Symposium, CERIAS, 2010.
[34] Alwin Maier, Hugo Gascon, Christian Wressnegger, and Konrad Rieck. TypeMiner: Recovering types in binary programs using machine learning. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, DIMVA, pages 288–308, 2019.
[35] Paul Michel and Graham Neubig. Extreme adaptation for personalized neural machine translation. In Annual Meeting of the Association for Computational Linguistics (Short Papers), ACL, pages 312–318, 2018.
[36] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In SIGNLL Conference on Computational Natural Language Learning, CoNLL, pages 280–290, 2016.
[37] Hermann Ney, Dieter Mergel, Andreas Noll, and Annedore Paeseler. A data-driven organization of the dynamic programming beam search for continuous speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 833–836, 1987.
[38] Vikram Nitin, Anthony Saieva, Baishakhi Ray, and Gail Kaiser. DIRECT: A transformer-based model for decompiled identifier renaming. In Workshop on Natural Language Processing for Programming, NLP4Prog, 2021.
[39] Matthew Noonan, Alexey Loginov, and David Cok. Polymorphic type inference for machine code. In Conference on Programming Language Design and Implementation, PLDI, pages 27–41, 2016.
[40] Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gordon, and Charles Sutton. OptTyper: Probabilistic type inference by optimising logical and natural constraints. arXiv preprint arXiv:2004.00348, 2020.
[41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems, NeurIPS, pages 8024–8035, 2019.
[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.
[43] Eric Schulte, Jason Ruchti, Matt Noonan, David Ciarletta, and Alexey Loginov. Evolving exact decompilation. In Workshop on Binary Analysis Research, BAR, 2018.
[44] Edward J. Schwartz, Cory F. Cohen, Michael Duggan, Jeffrey Gennari, Jeffrey S. Havrilla, and Charles Hines. Using logic programming to recover C++ classes and methods from compiled executables. In Conference on Computer and Communications Security, CCS, 2018.
[45] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[46] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. (State of) The art of war: Offensive techniques in binary analysis. In Symposium on Security and Privacy, SP, pages 138–157, 2016.
[47] Asia Slowinska, Traian Stancescu, and Herbert Bos. Howard: A dynamic excavator for reverse engineering data structures. In Network and Distributed System Security Symposium, NDSS, 2011.
[48] Katerina Troshina, Yegor Derevenets, and Alexander Chernov. Reconstruction of composite types for decompilation. In Working Conference on Source Code Analysis and Manipulation, SCAM, pages 179–188, 2010.
[49] Michael James van Emmerik. Static Single Assignment for Decompilation. PhD thesis, University of Queensland, 2007.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems, NeurIPS, pages 6000–6010, 2017.
[51] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010.
[52] Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. LambdaNet: Probabilistic type inference using graph neural networks. In International Conference on Learning Representations, ICLR, 2020.
[53] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, ICML, pages 2048–2057, 2015.
[54] Khaled Yakdan, Sebastian Eschweiler, Elmar Gerhards-Padilla, and Matthew Smith. No more gotos: Decompilation using pattern-independent control-flow structuring and semantics-preserving transformations. In Network and Distributed System Security Symposium, NDSS, 2015.
[55] Khaled Yakdan, Sergej Dechand, Elmar Gerhards-Padilla, and Matthew Smith. Helping Johnny to analyze malware: A usability-optimized decompiler and malware analysis user study. In Symposium on Security and Privacy, SP, pages 158–177, 2016.
[56] Fangke Ye, Jisheng Zhao, and Vivek Sarkar. Advanced graph-based deep learning for probabilistic type inference. arXiv preprint arXiv:2009.05949, 2020.
[57] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.
[58] Z. Zhang, Y. Ye, W. You, G. Tao, W. Lee, Y. Kwon, Y. Aafer, and X. Zhang. OSPREY: Recovery of variable and data structure via probabilistic analysis for stripped binary. In Symposium on Security and Privacy, SP, pages 872–891, 2021.
[59] Zhuo Zhang, Wei You, Guanhong Tao, Guannan Wei, Yonghwi Kwon, and Xiangyu Zhang. BDA: Practical dependence analysis for binary executables by unbiased whole-program path sampling and per-path abstract interpretation. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–31, 2019.
A Experimental Setup

Hyperparameter Configurations. Our detailed hyperparameters are shown in Table 10. We use a six-layer Transformer encoder for the code encoder, a three-layer Transformer encoder for the data layout encoder, and a six-layer Transformer decoder for the type decoder. We set the number of attention heads to 8. The input embedding dimension and hidden size $d_{model}$ are set to 512 for the code encoder and 256 for the data layout encoder. Following prior work [50], we empirically set the inner dimension $d_{ff}$ of the position-wise feed-forward layers to four times the hidden size $d_{model}$. We use the gelu activation function [21] rather than the standard relu, following BERT [10]. During training, we use a batch size of 64 and a learning rate of $1 \times 10^{-4}$, with the Adam optimizer [28] configured with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 1 \times 10^{-8}$. We clip gradients by value to the range $[-1, 1]$ and apply a dropout rate of 0.1 for regularization. We train the model for 15 epochs. At inference time, we use beam search with a beam size of 5 to predict the types for each function.

Hyperparameter               DIRTY    DIRTY_S
Max sequence length          512      512
Encoder/decoder layers       6/6      3/3
Hidden units per layer       512      256
Attention heads              8        4
Layout encoder layers        3        3
Layout encoder hidden units  256      128
Batch size                   64       64
Training epochs              15       30
Learning rate                10^-4    10^-4
Adam ε                       10^-8    10^-8
Adam β1                      0.9      0.9
Adam β2                      0.999    0.999
Gradient clipping            1.0      1.0
Dropout rate                 0.1      0.1
Number of parameters         167M     40M

Table 10: Summary of the hyperparameters of DIRTY and the smaller DIRTY_S.
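To make this configuration concrete, the sketch below shows how the DIRTY column of Table 10 could be assembled in PyTorch [41]. It is a minimal illustrative sketch, not DIRTY's actual implementation: nn.Transformer stands in for the model's code encoder, data layout encoder, and type decoder, and data_loader and compute_loss are hypothetical placeholders for the training data pipeline and objective.

    import torch
    import torch.nn as nn

    # Hyperparameter values mirror the DIRTY column of Table 10;
    # nn.Transformer is only a stand-in for the real architecture.
    model = nn.Transformer(
        d_model=512,              # code-encoder embedding/hidden size
        nhead=8,                  # attention heads
        num_encoder_layers=6,     # code encoder layers
        num_decoder_layers=6,     # type decoder layers
        dim_feedforward=4 * 512,  # d_ff = 4 * d_model, following [50]
        dropout=0.1,              # dropout regularization
        activation="gelu",        # gelu rather than relu, following BERT [10]
    )

    # Adam optimizer with the learning rate and moment settings above [28].
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8
    )

    def train_one_epoch(data_loader, compute_loss):
        # One pass over the training data in batches of 64 examples;
        # compute_loss is a hypothetical placeholder for the objective.
        for batch in data_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            # Clip gradients by value to [-1, 1], as described above.
            nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
            optimizer.step()

Note that this sketch omits the separate three-layer data layout encoder with hidden size 256 listed in Table 10, and that at inference time greedy decoding would be replaced by beam search with a beam size of 5, as described above.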