
LLM4Decompile: Decompiling Binary Code with Large Language Models

Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang


Southern University of Science and Technology
The Hong Kong Polytechnic University

arXiv:2403.05286v1 [cs.PL] 8 Mar 2024

Abstract

Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs, ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile.

[Figure 1: Pipeline to evaluate the decompilation. The original source code (src) is compiled into a binary, the binary is disassembled into assembly code (asm), the assembly is decompiled back into source code (src′), and src′ is then recompiled and executed against test assertions to measure re-compilability and re-executability.]

1 Introduction

Decompilation is the process of converting compiled machine code or bytecode back into a high-level programming language. This is often done to analyze the workings of software when its source code is not accessible (Brumley et al., 2013; Katz et al., 2018; Hosseini and Dolan-Gavitt, 2022; Xu et al., 2023; Armengol-Estapé et al., 2023; Jiang et al., 2023; Wong et al., 2023). There have been numerous tools developed for decompilation, such as Ghidra (Ghidra, 2024) and IDA Pro (Hex-Rays, 2024). Although these tools have the capability to revert binary code to source code in specific scenarios, they often fall short in producing code that is easily readable by humans. The inherent difficulty of decompilation lies in the inability to fully recreate the source, especially finer details like variable names (Lacomis et al., 2019) and the primary structure (Wei et al., 2007), such as loops and conditional statements, which are often lost during the compilation process.

Recent advances in large language models (LLMs) have led researchers to approach programming languages as distinct linguistic systems, using pre-trained code LLMs for various coding tasks (Lippincott, 2020; Rozière et al., 2023; Guo et al., 2024). These models have shown impressive performance improvements over traditional techniques (Zeng et al., 2022; Xu et al., 2022), which leads us to the possibility of applying LLMs to cope with the decompilation challenge.
To illustrate, Transformer-based models such as Slade (Armengol-Estapé et al., 2023) and Nova (Jiang et al., 2023) have showcased the potential of using language models to turn binary code back into source code that is much closer to the original in readability and structure. However, the scope of their models is somewhat constrained, at 200M and 1B parameters for Slade and Nova respectively, which could result in a reduced capacity for learning and generalization, whereas larger models typically exhibit a marked improvement in these areas by leveraging their extensive parameters to process and integrate a broader range of information (Rozière et al., 2023; OpenAI, 2023). Moreover, their lack of public availability limits their contribution to promoting further progress in this domain. Furthermore, to the best of our knowledge, no standardized benchmark dataset exists for evaluating and comparing decompilation techniques. Researchers tend to employ different datasets (da Silva et al., 2021; Collie et al., 2020; Tan et al., 2017) to evaluate their results, making direct comparison difficult. Therefore, there is a strong need for a benchmark of decompilation performance, which can significantly facilitate the formulation of coherent and standard evaluation criteria for the decompilation domain.

Thus, our objective is to create and release the first open-source LLM dedicated to decompilation, and to assess its capabilities by constructing the first decompilation benchmark focused on re-compilability and re-executability. We start by compiling a million C code samples from AnghaBench (da Silva et al., 2021) into assembly code using GCC (Stallman et al., 2003) with different configurations, forming a dataset of assembly-source pairs with 4 billion tokens. We then fine-tune the DeepSeek-Coder model (Guo et al., 2024), a leading-edge code LLM, on this dataset, and construct the evaluation benchmark, Decompile-Eval, based on HumanEval (Chen et al., 2021) questions and test samples. Specifically, we formulate the evaluation from two perspectives: whether the decompiled code can recompile successfully, and whether it passes all assertions in the test cases. Figure 1 presents the steps involved in our decompilation evaluation. First, the source code (denoted as src) is compiled by the GCC compiler with specific parameters, such as optimization levels, to produce the executable binary. This binary is then disassembled into assembly language (denoted as asm) using the objdump tool. The assembly instructions are subsequently decompiled to reconstruct the source code in a format that is readable to humans (denoted as src′). To assess the quality of the decompiled code (src′), it is tested for its ability to be recompiled with the original GCC compiler (re-compilability) and for its functionality through test assertions (re-executability).
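To make the first half of this pipeline concrete, the sketch below compiles a C file with GCC at a chosen optimization level and disassembles the object file with objdump, mirroring the src -> binary -> asm steps of Figure 1. The exact flags, file naming, and any assembly post-processing are our own illustrative assumptions, not settings reported by the authors.

import subprocess

def source_to_asm(src_path: str, opt_level: str = "O0") -> str:
    """Compile a C file and disassemble it, mirroring the src -> binary -> asm
    steps of Figure 1. Flag choices here are illustrative assumptions."""
    obj_path = src_path.replace(".c", f"_{opt_level}.o")
    # Compile to an object file at the requested optimization level (O0-O3).
    subprocess.run(["gcc", f"-{opt_level}", "-c", src_path, "-o", obj_path], check=True)
    # Disassemble the object file, as the paper does with objdump.
    result = subprocess.run(["objdump", "-d", obj_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example usage: asm = source_to_asm("trun_num.c", opt_level="O2")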
On Decompile-Eval, the LLM4Decompile models demonstrated promising results in their ability to decompile binaries, with an impressive 90% of the decompiled code being recompilable using the same settings in the GCC compiler, signifying a solid understanding of code structure and syntax. As for the ability to execute the code, 21% of the code decompiled by the 6B version successfully captures the semantics of a program and passes all the test cases.

In conclusion, our contributions are twofold:

• We provide the first open-source LLMs, ranging from 1B to 33B, tailored for decompilation, which also facilitate compilation and diverse binary tasks.

• We construct the first decompilation benchmark targeting re-compilation and re-execution rates, which indicate syntax recovery and semantic preservation—both essential for usable and robust decompilation.

2 Related Work

2.1 Decompilation

The practice of reversing executable binaries to their source code form, known as decompilation, has been researched for decades. Traditional decompilation relies on analyzing the control and data flows of a program (Brumley et al., 2013) and on pattern matching, as seen in tools like Hex-Rays IDA Pro (Hex-Rays, 2024) and Ghidra (Ghidra, 2024). These systems attempt to identify patterns within a program's control-flow graph (CFG) that correspond to standard programming constructs such as conditional statements or loops. However, crafting these rule-based systems can be challenging and prone to mistakes, as the rules are complex to create, often only partially cover the CFG, and take extensive time to develop (Armengol-Estapé et al., 2023; Jiang et al., 2023). They are particularly weak when facing code that has been optimized, which is a common practice for commercially compiled software.
The code output from such decompilation processes tends to be a source-code-like representation of assembly code, including direct translations of variables to registers, use of gotos, and other low-level operations instead of the original high-level language constructs. This output, while often functionally similar to the original code, is difficult to understand and may not be efficient enough to allow for recompilation (Liu and Wang, 2020).

Drawing inspiration from neural machine translation, researchers have reformulated decompilation as a translation exercise, converting machine-level instructions into readable source code. Initial attempts in this area utilized recurrent neural networks (RNNs) (Katz et al., 2018) for decompilation, complemented by error-correction techniques to enhance the outcomes. Nonetheless, these efforts were constrained in their effectiveness.

The latest advancements in natural language processing (NLP) have enabled the use of pre-trained language models (LMs) for coding-related tasks (Rozière et al., 2023; Lippincott, 2020; Guo et al., 2024). These models generally incorporate the Transformer architecture (Vaswani et al., 2017), using self-attention mechanisms, and are pre-trained on extensive text datasets. This approach allows LMs to capture contextual nuances and aids in the acquisition of general language understanding. In the realm of binary decompilation, BTC (Hosseini and Dolan-Gavitt, 2022) was one of the first to fine-tune an LM for this purpose. Following this, Slade (Armengol-Estapé et al., 2023) utilized the BART model and trained an LM-based decompiler with 200 million parameters, while Nova (Jiang et al., 2023) developed a binary LM with 1 billion parameters starting from the StarCoder checkpoint and fine-tuned it for decompilation. Although these models show potential in decompilation, they are limited in size; for comparison, the Code Llama model (Rozière et al., 2023), launched in 2023, has at least 7 billion parameters.

2.2 Evaluation

There is a notable gap in the field of decompilation: a lack of a unified, accepted benchmark for measuring the quality of decompilation tools. Various sources are used for evaluation purposes. BTC (Hosseini and Dolan-Gavitt, 2022), for example, leverages web data including interview-style coding problems and the extensive Debian Linux repository to evaluate decompilation accuracy. Meanwhile, Slade (Armengol-Estapé et al., 2023) is tested using both Synth, a synthetic code generation framework, and a subset of ExeBench (da Silva et al., 2021; Collie et al., 2020), a benchmark consisting of executable C programs. Nova (Jiang et al., 2023) evaluates its decompilation capabilities on a synthetic dataset that it created, in addition to the CodeFlaws (Tan et al., 2017) dataset, which is designed to identify common coding errors.

The metrics employed for these evaluations predominantly focus on N-gram similarity, with the use of BLEU or Token Accuracy, as well as Edit Similarity (ES). Slade (Armengol-Estapé et al., 2023) goes a step further by incorporating Input-Output (IO) Accuracy (Le et al., 2014; Liu and Wang, 2020) into its evaluation framework. This metric assesses semantic equivalence through behavioral equality, meaning it checks whether the decompiled code and the original code produce the same outputs when given the same inputs. However, IO Accuracy relies on external processes to generate input and output samples for comparison. The generation of these samples often involves randomness, leading to non-deterministic results and making it difficult to consistently assess the performance of a decompiler.

Consequently, our goal is to develop and release the first open-source large language model (LLM) tailored for decompilation. We also aim to establish the first benchmark for re-compilability and re-executability to set a standard for performance evaluation in the field of decompilation.

3 LLM4Decompile

In this section, we describe the pre-training data, present different model configurations, and discuss the pre-training objectives involved in pre-training the LLM.

3.1 Pre-training Data

We construct asm-source pairs based on AnghaBench (da Silva et al., 2021), a public collection of one million compilable C files. Following the practice of previous works (Armengol-Estapé et al., 2023), we first compile the source code into a binary object file, disassemble the object file into assembly code, and pair it with the source code; we only consider the x86 Linux platform. In real deployment, programmers choose different compiler optimization flags to optimize execution performance.
Compiler optimization refers to the process of tweaking and transforming source code to generate faster and more efficient machine code. It involves techniques like eliminating redundant instructions, better register allocation, and loop transformations. The different optimization levels trade off compile time against execution time and debugging ability. The key optimization levels range from O0 (the default, no optimization) to O3 (aggressive optimizations that are compilation-time consuming). We compile the source code at all four stages, i.e., O0, O1, O2, and O3, and pair each of them with the source code. To inform the model of the optimization stage, we use the following prompt: # This is the assembly code with [optimization state] optimization: [asm code] # What is the source code?.
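As a concrete illustration of this format, the sketch below assembles one training pair with the optimization tag filled in. The field names and exact whitespace are our own assumptions rather than the released data layout.

def build_training_pair(asm_code: str, src_code: str, opt_state: str) -> dict:
    """Format one assembly-source pair with the optimization-tagged prompt from
    Section 3.1. Field names and spacing are illustrative assumptions."""
    prompt = (
        f"# This is the assembly code with {opt_state} optimization:\n"
        f"{asm_code}\n"
        f"# What is the source code?\n"
    )
    return {"prompt": prompt, "completion": src_code}

# Example usage: pair = build_training_pair(asm_o2, c_source, "O2")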
3.2 Model Configurations

Our LLM4Decompile uses the same architecture as DeepSeek-Coder, and we initialize our model with the corresponding DeepSeek-Coder checkpoints. As a neural translator, the training objectives can be categorized into two categories, as shown in Figure 2.

[Figure 2: Training objectives. Under next token prediction (language modeling), the model is trained over the concatenated assembly and source tokens; under sequence-to-sequence prediction (translation), the model is trained to emit the source code given the assembly code as input.]

1) Next token prediction (NTP), or language modeling, is the pre-training objective of most LLMs and targets predicting the next token given the previous inputs. As depicted in Equation 1, it minimizes the negative log probability of the ground-truth token x_i:

L = -\sum_i \log P_i(x_i \mid x_1, x_2, \ldots, x_{i-1}; \theta)    (1)

where the conditional probability P is modeled by the LLM4Decompile model with parameters θ. These parameters are optimized by applying gradient descent algorithms (Ruder, 2016) with respect to the input sequence x_1, x_2, \ldots, x_{i-1} preceding the given token x_i.

2) Sequence-to-sequence prediction (S2S) is the training objective adopted in most neural machine translation models, which aims to predict the output given the input sequence. As depicted in Equation 2, it minimizes the negative log probability of the C code tokens x_i, \ldots, x_j:

L = -\sum_i \log P_i(x_i, \ldots, x_j \mid x_1, \ldots, x_{i-1}; \theta)    (2)

where the loss is calculated only for the output sequence x_i, \ldots, x_j, i.e., the C code.

The main difference lies in whether the input assembly code is included in the training loss: under S2S it is excluded, whereas in language modeling all the inputs are included in the loss calculation. We conduct ablation studies on these two training objectives to explore their effectiveness in decompilation.
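For a decoder-only model such as DeepSeek-Coder, one common way to realize the two objectives is to mask the prompt (assembly) positions out of the loss for S2S while keeping them for NTP. The sketch below shows this label construction; it is our own illustration of the idea, not the authors' training code.

import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def make_labels(prompt_ids: list[int], source_ids: list[int], objective: str = "S2S"):
    """Build labels for one concatenated (assembly prompt + C source) sample.
    Under S2S (Eq. 2) only the C tokens contribute to the loss, so prompt
    positions are masked; under NTP (Eq. 1) every position contributes."""
    input_ids = prompt_ids + source_ids
    if objective == "S2S":
        labels = [IGNORE_INDEX] * len(prompt_ids) + source_ids
    else:  # "NTP": plain language modeling over the whole sequence
        labels = list(input_ids)
    return torch.tensor(input_ids), torch.tensor(labels)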
4 Experiment Setups

4.1 Decompile-Eval Benchmark

Currently, there appears to be no benchmark for decompilation evaluation that considers whether code can be recompiled or executed correctly. When assessing decompilation model performance, researchers rely on metrics that measure N-gram similarity (such as BLEU or CodeBLEU) or edit similarity (ES). However, these metrics, commonly used in machine translation and text generation, fail to adapt to the evaluation of programming languages.

Programming languages are highly structured and logical, insensitive to the naming of functions and variables, yet very sensitive to the flow of data and logic. Changing variable or function names does not affect the meaning of a program, but a single logical error can alter its entire function and purpose.
As illustrated in Figure 3, the use of BLEU and ES in evaluating code similarity is problematic. For src1, the variation from the original src is confined to the type conversion of variable num, which leads to high BLEU and ES scores. However, this alteration completely changes the intent of the code. Similarly, src2 achieves high BLEU and ES scores, yet the semantics of the function are lost. Conversely, src3 undergoes normalization of function and variable names, causing no semantic shift yet scoring zero in BLEU against the original code. The example of src4 is more extreme: if the program logic is broken down into multiple lines, the ES drops to 41.4%, falsely indicating a low similarity. However, during compilation, names are typically standardized by the compiler, and source code is often broken down into basic operations depending on optimization. For this reason, the ability to recompile and execute the code is far more indicative than N-gram or edit similarity for evaluating decompilation efficacy.

[Figure 3: Limitation of using BLEU and Edit Similarity for evaluating decompilation results. For the reference function float trun_num(float num) { return num - (int)num; }, the semantically incorrect variants src1 (return (int)num - (int)num;) and src2 (return num - num;) score high (BLEU 73.3 / ES 91.7 and BLEU 67.6 / ES 90.9), while the semantically equivalent variants src3 (float func(float x) { return x - int(x); }) and src4 (float func(float f) { int i = (int)f; return f - i; }) score only BLEU 0.0 / ES 69.1 and BLEU 0.0 / ES 41.4.]
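To make the ES numbers above concrete, the snippet below implements one common definition of edit similarity (1 minus the normalized Levenshtein distance). The paper does not spell out its exact ES variant, so this is an illustrative assumption, and the values it produces need not match the figure exactly.

def edit_similarity(a: str, b: str) -> float:
    """Character-level edit similarity: 1 - Levenshtein(a, b) / max(len(a), len(b)).
    A common definition; the paper's exact ES variant is not specified."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1])) # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n, 1)

ref  = "float trun_num(float num) { return num - (int)num; }"
src4 = "float func(float f) { int i = (int)f; return f - i; }"
# Renaming and restructuring lower the score even though semantics are preserved.
print(edit_similarity(ref, src4))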
To address the gap in decompilation assessment, we introduce Decompile-Eval, the first benchmark to evaluate the re-compilability and re-executability of decompilation systems. This benchmark is derived from HumanEval (Chen et al., 2021), the leading benchmark for code generation assessment, which includes 164 programming challenges with accompanying Python solutions and assertions. We converted these Python solutions and assertions into C, making sure that they compile with the GCC compiler using standard C libraries and pass all the original assertions.

In our evaluation process (Figure 1), the C source code is first compiled into a binary, then disassembled into assembly code, and finally fed into the decompilation system to be reconstructed back into C source code. This regenerated C code is compiled with GCC to test re-compilability and combined with the original assertions to check if it can successfully execute and pass those assertions. Re-compilability and re-executability serve as critical indicators in validating the effectiveness of a decompilation process. When decompiled code can be recompiled, it provides strong evidence of syntactic integrity: it ensures that the decompiled code is not just readable, but also adheres to the structural and syntactical standards expected by the compiler. However, syntax alone does not guarantee semantic equivalence to the original pre-compiled program. Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess whether the decompilation preserved the program logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation—both essential for usable and robust decompilation.
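The check itself can be scripted directly. The sketch below recompiles a decompiled function together with an assertion harness and treats a successful compile as re-compilability and a zero exit code as re-executability; the file layout, compiler flags, and harness are illustrative assumptions, not the benchmark's exact harness.

import os
import subprocess
import tempfile

def evaluate_decompiled(decompiled_c: str, harness_c: str):
    """Return (re_compilable, re_executable) for one decompiled function.
    harness_c is assumed to hold a main() with the test assertions."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        exe = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(decompiled_c + "\n" + harness_c)
        # Re-compilability: does GCC accept the regenerated code?
        if subprocess.run(["gcc", src, "-o", exe, "-lm"]).returncode != 0:
            return False, False
        # Re-executability: do all assertions pass (exit code 0)?
        try:
            return True, subprocess.run([exe], timeout=10).returncode == 0
        except subprocess.TimeoutExpired:
            return True, False

# Example harness (illustrative):
# harness = '#include <assert.h>\nint main(void){ assert(func(3.5f) == 0.5f); return 0; }'
# ok_compile, ok_run = evaluate_decompiled(decompiled_code, harness)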
In alignment with established evaluation practices, and following Slade (Armengol-Estapé et al., 2023), we also partition 1,000 samples from AnghaBench into a test set and use BLEU and ES as the primary metrics for that assessment.

4.2 Baselines

To benchmark against SOTA decompilers, we selected two key baselines. First, GPT-4 represents the most capable LLMs, providing an upper bound on LLM performance. As one of the largest language models, GPT-4 significantly surpasses previous LLMs across modalities. Second, DeepSeek-Coder is selected as the current SOTA open-source code LLM. It represents the forefront of publicly available models specifically tailored for coding tasks. While recent academic works like BTC (Hosseini and Dolan-Gavitt, 2022) and Slade (Armengol-Estapé et al., 2023) showcase LLMs for decompilation, these models present significant integration challenges, such as complex pre-processing settings, non-standardized tokenizer and model loading, and the significant effort needed to modify and adapt them. We thus selected GPT-4 and DeepSeek-Coder as representative cutting-edge and open-source baselines accessible for evaluation.
Table 1: Evaluation Results on Decompile-Eval

Model              | Re-compilability: O0 / O1 / O2 / O3 / Avg.  | Re-executability: O0 / O1 / O2 / O3 / Avg.
GPT-4              | 0.92 / 0.94 / 0.88 / 0.84 / 0.895           | 0.1341 / 0.1890 / 0.1524 / 0.0854 / 0.1402
DeepSeek-Coder-33B | 0.0659 / 0.0866 / 0.1500 / 0.1463 / 0.1122  | 0.0000 / 0.0000 / 0.0000 / 0.0000 / 0.0000
LLM4Decompile-1B   | 0.8780 / 0.8732 / 0.8683 / 0.8378 / 0.8643  | 0.1573 / 0.0768 / 0.1000 / 0.0878 / 0.1055
LLM4Decompile-6B   | 0.8817 / 0.8951 / 0.8671 / 0.8476 / 0.8729  | 0.3000 / 0.1732 / 0.1988 / 0.1841 / 0.2140
LLM4Decompile-33B  | 0.8134 / 0.8195 / 0.8183 / 0.8305 / 0.8204  | 0.3049 / 0.1902 / 0.1817 / 0.1817 / 0.2146

Table 2: Evaluation Results on AnghaBench

Model              | BLEU: O0 / O1 / O2 / O3 / Avg.              | Edit Similarity: O0 / O1 / O2 / O3 / Avg.
DeepSeek-Coder-33B | 0.0362 / 0.0367 / 0.0306 / 0.0313 / 0.0337  | 0.1186 / 0.1196 / 0.1124 / 0.1133 / 0.116
LLM4Decompile-1B   | 0.5099 / 0.493 / 0.487 / 0.4835 / 0.4934    | 0.6223 / 0.5946 / 0.5825 / 0.5822 / 0.5954
LLM4Decompile-6B   | 0.8219 / 0.8246 / 0.8143 / 0.8148 / 0.8189  | 0.8562 / 0.8551 / 0.8422 / 0.8453 / 0.8497
LLM4Decompile-33B  | 0.7724 / 0.7477 / 0.7485 / 0.7514 / 0.755   | 0.8252 / 0.7974 / 0.7993 / 0.8056 / 0.8069

4.3 Implementation

We use the Python implementation of the DeepSeek-Coder models (1.3B, 6.7B, and 33B) obtained from Hugging Face (Wolf et al., 2019). We set a global batch size of 2048 and a learning rate of 2e-5, and train the models with the AdamW optimizer (Loshchilov and Hutter, 2019) for 2 epochs. During evaluation, we set max_new_tokens to 512. To ensure fairness in the analysis of time and space complexity, all experiments are performed on a cluster equipped with 8 NVIDIA A100-80GB GPUs.
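For reference, inference with such a checkpoint follows the standard Hugging Face causal-LM pattern; the sketch below uses the Section 3.1 prompt and the max_new_tokens=512 setting mentioned above. The model id and assembly string are placeholders (the released checkpoint names live in the project repository), so treat this as an illustrative sketch rather than the authors' exact script.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "llm4decompile-6b"  # placeholder; see the project repository for released ids
asm_code = "endbr64\npush %rbp\nmov %rsp,%rbp\n..."  # disassembled function (truncated)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = (f"# This is the assembly code with O0 optimization:\n{asm_code}\n"
          "# What is the source code?\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# max_new_tokens=512 matches the evaluation setting described above.
outputs = model.generate(**inputs, max_new_tokens=512)
decompiled = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(decompiled)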
5 Experiment Results

5.1 Main Results

Table 1 presents the primary findings of our study. Initially, the base version of DeepSeek-Coder was unable to accurately decompile binaries. It could generate code that seemed correct and was sometimes compilable, but failed to retain the original program semantics. After fine-tuning, the LLM4Decompile models demonstrated a significant improvement in their ability to decompile binaries, with roughly 90% of the code being compilable, an impressive result signifying a solid understanding of code structure and syntax. As for the ability to execute the code, the 6B version of LLM4Decompile shows a remarkable advantage over the 1B version: 21% of the code decompiled by the 6B version successfully captures the semantics of a program and passes all the test cases, while for the 1B version only 10% can be re-executed. This improvement highlights the benefits of larger model sizes in capturing the semantics of a program. Nonetheless, the increase in model size to 33B yields only a marginal improvement, with an average increase of less than one percentage point in re-executability. This plateau may be due to the challenge of tuning the 33B model.

Table 2 summarizes the results on AnghaBench, where LLM4Decompile shows notably high BLEU and ES scores; for example, the 6B model achieves a 0.82 BLEU score, almost identical to the source code. This outstanding performance suggests a significant data leakage issue within the test set. Decompiled code, with its variables normalized, should not realistically allow for such high N-gram/ES scores. This anomaly underscores the importance of establishing an independent, reliable benchmark for decompilation evaluation, as similarly high BLEU and ES scores have been reported in prior research.

5.2 Ablations

As discussed in Section 3.2, our LLM4Decompile model adopts a sequence-to-sequence (S2S) prediction approach, which outperforms the other training techniques for several reasons. In this training methodology, the input, specifically the assembly code, is not included in the calculation of the loss function. This allows the model to focus solely on generating accurate output source code, enabling it to better understand the underlying patterns and structures of the decompiled code. In contrast, integrating the assembly code into the training process, as in the next token prediction (NTP) task, encompasses both the input assembly code and the output source code, which can decrease performance by around 4 points, as shown in Table 3.
Table 3: Ablation study on training methodology.

Model    | Re-compilability: O0 / O1 / O2 / O3 / Avg.  | Re-executability: O0 / O1 / O2 / O3 / Avg.
S2S      | 0.8817 / 0.8951 / 0.8671 / 0.8476 / 0.8729  | 0.3000 / 0.1732 / 0.1988 / 0.1841 / 0.2140
NTP      | 0.8329 / 0.8598 / 0.8317 / 0.8329 / 0.8393  | 0.2805 / 0.1390 / 0.1573 / 0.1341 / 0.1777
NTP+S2S  | 0.8963 / 0.8598 / 0.8963 / 0.8720 / 0.8811  | 0.3232 / 0.1463 / 0.1951 / 0.1707 / 0.2088

The complexity of assembly code is another factor: being inherently complex and low-level, assembly code makes it harder for the model to learn meaningful patterns when it is included in the training process. By excluding the assembly code from the loss calculation, the S2S approach enables the model to avoid this complexity and concentrate on high-level source code patterns. An alternative strategy involves an initial training step with both assembly and C code followed by fine-tuning focused on the translation task (NTP+S2S), but this approach still does not perform as well as S2S.

6 Conclusions

We presented the first open-source decompilation-focused LLM and a standardized re-compilability/re-executability benchmark. Analyses on this diverse compiled C code dataset revealed promising capabilities: our 6B LLM4Decompile achieved 87% re-compilability, indicating syntactic understanding, and 21% re-executability, suggesting semantic preservation. As an initial exploration into data-driven decompilation, our work establishes an open benchmark to motivate future efforts. The public dataset, model, and analyses represent encouraging first steps toward enhancing decompilation through novel techniques.

7 Limitations

The scope of this research is limited to the compilation and decompilation of the C language targeting the x86 platform. While we are confident that the methodologies developed here could be adapted to other programming languages and platforms with relative ease, these potential extensions have been reserved for future investigation. Additionally, our current study is constrained to the decompilation of single functions, without taking into account factors such as cross-references and external type definitions. This presents a simplified view of the decompilation process, omitting the complexities introduced by these elements. Addressing these aspects would provide a more comprehensive understanding of decompilation across a broader spectrum of scenarios and is an important avenue for subsequent research.

References

Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, and Michael F. P. O'Boyle. 2023. Slade: A portable small language model decompiler for optimized assembler. CoRR, abs/2305.12520.

David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. 2013. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Proceedings of the 22nd USENIX Security Symposium, Washington, DC, USA, August 14-16, 2013, pages 353–368. USENIX Association.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.

Bruce Collie, Jackson Woodruff, and Michael F. P. O'Boyle. 2020. Modeling black-box components with probabilistic synthesis. In GPCE '20: Proceedings of the 19th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, Virtual Event, USA, November 16-17, 2020, pages 1–14. ACM.

Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. 2021. AnghaBench: A suite with one million compilable C benchmarks for code-size reduction. In IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021, Seoul, South Korea, February 27 - March 3, 2021, pages 378–390. IEEE.

Ghidra. 2024. Ghidra software reverse engineering framework.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, et al. 2024. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196.

Hex-Rays. 2024. IDA Pro: a cross-platform multi-processor disassembler and debugger.

Iman Hosseini and Brendan Dolan-Gavitt. 2022. Beyond the C: retargetable decompilation using neural machine translation. CoRR, abs/2212.08950.

Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang. 2023. Nova+: Generative language models for binaries. CoRR, abs/2311.13721.

Deborah S. Katz, Jason Ruchti, and Eric M. Schulte. 2018. Using recurrent neural networks for decompilation. In 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, pages 346–356. IEEE Computer Society.

Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. 2019. DIRE: A neural approach to decompiled identifier naming. In 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019, pages 628–639. IEEE.

Vu Le, Mehrdad Afshari, and Zhendong Su. 2014. Compiler validation via equivalence modulo inputs. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom, June 09-11, 2014, pages 216–226. ACM.

Thomas Lippincott. 2020. Starcoder: A general neural ensemble technique to support traditional scholarship, illustrated with a study of the post-Atlantic slave trade. In 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, Ottawa, Canada, July 20-25, 2020, Conference Abstracts.

Zhibo Liu and Shuai Wang. 2020. How far we have come: testing decompilation correctness of C decompilers. In ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, pages 475–487. ACM.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code. CoRR, abs/2308.12950.

Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Richard M. Stallman et al. 2003. Using the GNU compiler collection. Free Software Foundation, 4(02).

Shin Hwei Tan, Jooyong Yi, Yulis, Sergey Mechtaev, and Abhik Roychoudhury. 2017. Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 180–182. IEEE Computer Society.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Tao Wei, Jian Mao, Wei Zou, and Yu Chen. 2007. A new algorithm for identifying loops in decompilation. In Static Analysis, 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007, Proceedings, volume 4634 of Lecture Notes in Computer Science, pages 170–183. Springer.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Wai Kin Wong, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2023. Refining decompiled C code with large language models. CoRR, abs/2310.06530.

Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, pages 1–10, New York, NY, USA. Association for Computing Machinery.

Xiangzhe Xu, Zhuo Zhang, Shiwei Feng, Yapeng Ye, Zian Su, Nan Jiang, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. 2023. LmPa: Improving decompilation by synergy of large language model and program analysis. CoRR, abs/2306.02546.

Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. 2022. An extensive study on pre-trained models for program understanding and generation. In ISSTA '22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18-22, 2022, pages 39–51. ACM.
