[Figure 2: Training objectives. The figure pairs assembly code (endbr64; push %rbp; mov %rsp,%rbp; ...) with its C source (float trun_num(float num) { return num - (int)num; }) and contrasts the two training objectives for LLM4Decompile: next token prediction (language modeling) and sequence-to-sequence prediction.]

forming source code to generate faster and more efficient machine code. It involves techniques such as eliminating redundant instructions, better register allocation, and loop transformations. The different optimization levels trade off compilation time against execution time and debugging ability. The key optimization levels range from O0 (the default, no optimization) to O3 (aggressive optimizations at the cost of compilation time). We compile the source code at all four optimization levels, i.e., O0, O1, O2, and O3, and pair each resulting assembly with the source code. To inform the model of the optimization level, we use the following prompt: # This is the assembly code with [optimization state] optimization: [asm code] # What is the source code?.
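Concretely, the pair construction can be sketched as follows. This is a minimal illustration rather than the paper's released pipeline: the helper build_pairs and the SAMPLE_C snippet are hypothetical, gcc and objdump are assumed to be on the PATH, and the raw objdump output is used without the cleaning a real corpus would likely apply.

import subprocess
import tempfile
from pathlib import Path

# Hypothetical example function; any compilable C snippet would do.
SAMPLE_C = "float trun_num(float num) { return num - (int)num; }\n"

PROMPT = ("# This is the assembly code with {opt} optimization:\n"
          "{asm}\n# What is the source code?\n")

def build_pairs(c_source: str):
    """Compile one C snippet at O0-O3 and pair its disassembly with the source."""
    pairs = []
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.c"
        src.write_text(c_source)
        for opt in ("O0", "O1", "O2", "O3"):
            obj = Path(tmp) / f"sample_{opt}.o"
            subprocess.run(["gcc", f"-{opt}", "-c", str(src), "-o", str(obj)],
                           check=True)
            # objdump -d prints the disassembled text of the object file.
            asm = subprocess.run(["objdump", "-d", str(obj)],
                                 capture_output=True, text=True, check=True).stdout
            pairs.append({"prompt": PROMPT.format(opt=opt, asm=asm),
                          "output": c_source})
    return pairs

if __name__ == "__main__":
    for pair in build_pairs(SAMPLE_C):
        print(pair["prompt"][:120], "...")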
3.2 Model Configurations

Our LLM4Decompile uses the same architecture as DeepSeek-Coder, and we initialize our model with the corresponding DeepSeek-Coder checkpoints. Since the model is trained as a neural translator, the training objectives can be categorized into two types, as shown in Figure 2.

1) Next token prediction (NTP), or language modeling, which is the pre-training objective of most LLMs. As shown in Equation 1, it minimizes the negative log probability of each token x_i given the preceding tokens:

    L = -\sum_i \log P_i(x_i \mid x_1, x_2, \ldots, x_{i-1}; \theta)    (1)

2) Sequence-to-sequence prediction (S2S), which, as shown in Equation 2, minimizes the negative log probability only for the C code tokens x_i, ..., x_j:

    L = -\sum_i \log P_i(x_i, \ldots, x_j \mid x_1, \ldots, x_{i-1}; \theta)    (2)

where the loss is calculated only for the output sequence x_i, ..., x_j, i.e., the C code. The main difference lies in whether the input sequence, i.e., the assembly code, is included in the training loss: in language modeling all input tokens contribute to the loss, whereas in S2S they do not. We conduct ablation studies on these two training objectives to explore their effectiveness for decompilation.
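In practice, the difference between the two objectives is only which token positions contribute to the cross-entropy loss. The following sketch illustrates this with the common Hugging Face convention that label positions set to -100 are ignored; it is an illustrative fragment, not the paper's training code, and make_labels is a hypothetical helper.

import torch

IGNORE_INDEX = -100  # positions with this label are excluded from cross-entropy

def make_labels(input_ids: torch.Tensor, n_prompt_tokens: int, objective: str):
    """Build label tensors for the two training objectives.

    input_ids:        prompt (assembly) tokens followed by C source tokens.
    n_prompt_tokens:  number of leading tokens that belong to the assembly prompt.
    objective:        "ntp" trains on every token; "s2s" trains on the C code only.
    """
    labels = input_ids.clone()
    if objective == "s2s":
        # Mask the assembly/prompt portion so only the C tokens x_i..x_j incur loss.
        labels[:n_prompt_tokens] = IGNORE_INDEX
    return labels

# Toy example: 5 prompt (assembly) tokens followed by 3 C-code tokens.
ids = torch.tensor([101, 102, 103, 104, 105, 7, 8, 9])
print(make_labels(ids, 5, "ntp"))   # loss on all 8 tokens
print(make_labels(ids, 5, "s2s"))   # loss on the last 3 tokens only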
4 Experiment Setups

4.1 Decompile-Eval Benchmark

Currently, there appears to be no benchmark for decompilation evaluation that considers whether code can be recompiled or executed correctly. When assessing decompilation model performance, researchers rely on metrics that measure N-gram similarity (such as BLEU or CodeBLEU) or edit similarity (ES). However, these metrics, commonly used in machine translation and text generation, fail to adapt to the evaluation of programming languages.

Programming languages are highly structured and logical, insensitive to the naming of functions and variables, yet very sensitive to the flow of data and logic. Changing variable or function names does not affect the meaning of a program, but a single logical error can alter its entire function and purpose. As illustrated in Figure 3, the use of BLEU and ES in evaluating code similarity is problematic.

[Figure 3: Limitation of using BLEU and Edit Similarity for evaluating decompilation results. The original src is float trun_num(float num) { return num - (int)num; }. Four variants are scored against it: src1, return (int)num - (int)num; (BLEU 73.3, ES 91.7); src2, return num - num; (BLEU 67.6, ES 90.9); src3, float func(float x) { return x - (int)x; } (BLEU 0.0, ES 69.1); and src4, float func(float f) { int i = (int)f; return f - i; } (BLEU 0.0, ES 41.4).]

For src1, the variation from the original src is confined to a type conversion of the variable num, which leads to high BLEU and ES scores. However, this alteration completely changes the intent of the code. Similarly, src2 achieves high BLEU and ES scores, yet the semantics of the function are lost. Conversely, src3 undergoes normalization of function and variable names, causing no semantic shift, yet scores zero in BLEU against the original code. The example of src4 is more extreme: when the program logic is broken down into multiple lines, the ES drops to 41.4%, falsely indicating low similarity. However, during compilation, names are typically standardized by the compiler, and source code is often broken down into basic operations depending on the optimization. For this reason, the ability to recompile and execute the code is far more indicative than N-gram or edit similarity for evaluating decompilation efficacy.
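The contrast in Figure 3 is easy to reproduce approximately. The snippet below is a rough re-creation rather than the paper's scoring script: it uses NLTK's sentence-level BLEU and a difflib match ratio as a stand-in for edit similarity, so the absolute numbers will not match the figure, but the pattern (inflated scores for the semantically broken src1 and src2, deflated scores for the harmless rewrites src3 and src4) is similar.

from difflib import SequenceMatcher
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

src = "float trun_num(float num) { return num - (int)num; }"
variants = {
    "src1 (semantics changed)": "float trun_num(float num) { return (int)num - (int)num; }",
    "src2 (semantics changed)": "float trun_num(float num) { return num - num; }",
    "src3 (renamed only)":      "float func(float x) { return x - (int)x; }",
    "src4 (multi-line form)":   "float func(float f) { int i = (int)f; return f - i; }",
}

smooth = SmoothingFunction().method1  # avoid zero BLEU on very short snippets

for name, cand in variants.items():
    # Token-level BLEU over whitespace tokens; the paper may tokenize differently.
    bleu = sentence_bleu([src.split()], cand.split(), smoothing_function=smooth)
    # Edit similarity approximated by difflib's character-level match ratio.
    es = SequenceMatcher(None, src, cand).ratio()
    print(f"{name:28s} BLEU={bleu:.3f}  ES={es:.3f}")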
To address the gap in decompilation assessment, we introduce Decompile-Eval, the first benchmark to evaluate the re-compilability and re-executability of decompilation systems. This benchmark is derived from HumanEval (Chen et al., 2021), the leading benchmark for code generation assessment, which includes 164 programming challenges with accompanying Python solutions and assertions. We converted these Python solutions and assertions into C, making sure that they compile with the GCC compiler using standard C libraries and pass all the original assertions. In our evaluation process (Figure 1), the C source code is first compiled into a binary, then disassembled into assembly code, and finally fed into the decompilation system to be reconstructed back into C source code. This regenerated C code is compiled with GCC to test re-compilability and combined with the original assertions to check whether it can successfully execute and pass those assertions. Re-compilability and re-executability serve as critical indicators in validating the effectiveness of a decompilation process. When decompiled code can be recompiled, it provides strong evidence of syntactic integrity: it ensures that the decompiled code is not just readable, but also adheres to the structural and syntactic standards expected by the compiler. However, syntax alone does not guarantee semantic equivalence to the original pre-compiled program. Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess whether the decompilation preserved the program logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation, both essential for usable and robust decompilation.

In alignment with established evaluation practices, following Slade (Armengol-Estapé et al., 2023), we also partition 1,000 samples from AnghaBench into a test set and utilize BLEU and ES as the primary metrics for that assessment.
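Both indicators reduce to two subprocess checks per sample: does GCC accept the decompiled code, and does the resulting binary run and pass the original assertions? Below is a minimal sketch of such a harness; it assumes each test case provides the decompiled function plus a main() containing the converted assertions, and the names evaluate_sample, combined.c, and the example harness are illustrative, not taken from the released benchmark.

import subprocess
import tempfile
from pathlib import Path

def evaluate_sample(decompiled_c: str, assertion_harness_c: str, timeout: int = 10):
    """Return (recompilable, re_executable) for one decompiled function.

    decompiled_c:        C code produced by the decompilation system.
    assertion_harness_c: a main() containing the original assertions, which
                         calls the function defined in decompiled_c.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "combined.c"
        exe = Path(tmp) / "combined"
        src.write_text(decompiled_c + "\n" + assertion_harness_c)

        # Re-compilability: GCC must accept the regenerated code.
        compiled = subprocess.run(["gcc", str(src), "-o", str(exe), "-lm"],
                                  capture_output=True)
        if compiled.returncode != 0:
            return False, False

        # Re-executability: the binary must run and every assert() must hold
        # (a failed assert aborts the process with a non-zero exit status).
        try:
            ran = subprocess.run([str(exe)], capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return True, False
        return True, ran.returncode == 0

# Illustrative usage with the running example from Figure 3.
decompiled = "float trun_num(float num) { return num - (int)num; }"
harness = """
#include <assert.h>
float trun_num(float num);
int main(void) { assert(trun_num(3.75f) > 0.74f && trun_num(3.75f) < 0.76f); return 0; }
"""
print(evaluate_sample(decompiled, harness))  # expected: (True, True)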
4.2 Baselines

To benchmark against SOTA decompilers, we selected two key baselines. First, GPT-4 represents the most capable LLMs, providing an upper bound on LLM performance; as one of the largest language models, GPT-4 significantly surpasses previous LLMs across modalities. Second, DeepSeek-Coder is selected as the current SOTA open-source Code LLM, representing the forefront of publicly available models specifically tailored for coding tasks. While recent academic works like BTC (Hosseini and Dolan-Gavitt, 2022) and Slade (Armengol-Estapé et al., 2023) showcase LLMs for decompilation, these models present significant integration challenges, such as complex pre-processing settings, non-standardized tokenizer and model loading, and the substantial effort required to modify and adapt them. We thus selected GPT-4 and DeepSeek-Coder as representative cutting-edge and open-source baselines accessible for evaluation.
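For the open-source baseline, querying the model with the decompilation prompt is straightforward with the Hugging Face transformers library. The snippet below is an illustrative sketch, not the evaluation configuration used in the paper: the checkpoint name, generation settings, and the placeholder assembly string are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the baseline or an LLM4Decompile model to test.
MODEL_ID = "deepseek-ai/deepseek-coder-1.3b-base"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

asm = "...objdump output for the O0 binary goes here..."
prompt = (f"# This is the assembly code with O0 optimization:\n{asm}\n"
          "# What is the source code?\n")

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Keep only the newly generated tokens, i.e., the predicted C source.
decompiled = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(decompiled)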
[Table 1: Evaluation Results on Decompile-Eval.]