[Figure 2: Training objectives. The figure pairs assembly code (endbr64; push %rbp; mov %rsp,%rbp; ...) with its C source (float trun_num(float num) { return num - (int)num; }) and contrasts the two training objectives for LLM4Decompile: next token prediction (language modeling) and sequence-to-sequence prediction.]

forming source code to generate faster and more efficient machine code. It involves techniques such as eliminating redundant instructions, better register allocation, and loop transformations. The different optimization levels trade off compilation time against execution time and debugging ability. The key optimization levels range from O0 (the default, no optimization) to O3 (aggressive optimizations at the cost of compilation time). We compile the source code at all four optimization levels, i.e., O0, O1, O2, and O3, and pair each resulting assembly with the source code. To inform the model of the optimization level, we use the following prompt: # This is the assembly code with [optimization state] optimization: [asm code] # What is the source code?.
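Concretely, the pair construction can be sketched as follows. This is a minimal illustration rather than the paper's released pipeline: the helper build_pairs and the SAMPLE_C snippet are hypothetical, gcc and objdump are assumed to be on the PATH, and the raw objdump output is used without the cleaning a real corpus would likely apply.

import subprocess
import tempfile
from pathlib import Path

# Hypothetical example function; any compilable C snippet would do.
SAMPLE_C = "float trun_num(float num) { return num - (int)num; }\n"

PROMPT = ("# This is the assembly code with {opt} optimization:\n"
          "{asm}\n# What is the source code?\n")

def build_pairs(c_source: str):
    """Compile one C snippet at O0-O3 and pair its disassembly with the source."""
    pairs = []
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.c"
        src.write_text(c_source)
        for opt in ("O0", "O1", "O2", "O3"):
            obj = Path(tmp) / f"sample_{opt}.o"
            subprocess.run(["gcc", f"-{opt}", "-c", str(src), "-o", str(obj)],
                           check=True)
            # objdump -d prints the disassembled text of the object file.
            asm = subprocess.run(["objdump", "-d", str(obj)],
                                 capture_output=True, text=True, check=True).stdout
            pairs.append({"prompt": PROMPT.format(opt=opt, asm=asm),
                          "output": c_source})
    return pairs

if __name__ == "__main__":
    for pair in build_pairs(SAMPLE_C):
        print(pair["prompt"][:120], "...")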
3.2 Model Configurations

Our LLM4Decompile uses the same architecture as DeepSeek-Coder, and we initialize our model with the corresponding DeepSeek-Coder checkpoints. Since the model is trained as a neural translator, the training objectives can be categorized into two types, as shown in Figure 2.

1) Next token prediction (NTP), or language modeling, which is the pre-training objective of most LLMs. As shown in Equation 1, it minimizes the negative log probability of each token x_i given the preceding tokens:

    L = -\sum_i \log P_i(x_i \mid x_1, x_2, \ldots, x_{i-1}; \theta)    (1)

2) Sequence-to-sequence prediction (S2S), which, as shown in Equation 2, minimizes the negative log probability only for the C code tokens x_i, ..., x_j:

    L = -\sum_i \log P_i(x_i, \ldots, x_j \mid x_1, \ldots, x_{i-1}; \theta)    (2)

where the loss is calculated only for the output sequence x_i, ..., x_j, i.e., the C code. The main difference lies in whether the input sequence, i.e., the assembly code, is included in the training loss: in language modeling all input tokens contribute to the loss, whereas in S2S they do not. We conduct ablation studies on these two training objectives to explore their effectiveness for decompilation.
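In practice, the difference between the two objectives is only which token positions contribute to the cross-entropy loss. The following sketch illustrates this with the common Hugging Face convention that label positions set to -100 are ignored; it is an illustrative fragment, not the paper's training code, and make_labels is a hypothetical helper.

import torch

IGNORE_INDEX = -100  # positions with this label are excluded from cross-entropy

def make_labels(input_ids: torch.Tensor, n_prompt_tokens: int, objective: str):
    """Build label tensors for the two training objectives.

    input_ids:        prompt (assembly) tokens followed by C source tokens.
    n_prompt_tokens:  number of leading tokens that belong to the assembly prompt.
    objective:        "ntp" trains on every token; "s2s" trains on the C code only.
    """
    labels = input_ids.clone()
    if objective == "s2s":
        # Mask the assembly/prompt portion so only the C tokens x_i..x_j incur loss.
        labels[:n_prompt_tokens] = IGNORE_INDEX
    return labels

# Toy example: 5 prompt (assembly) tokens followed by 3 C-code tokens.
ids = torch.tensor([101, 102, 103, 104, 105, 7, 8, 9])
print(make_labels(ids, 5, "ntp"))   # loss on all 8 tokens
print(make_labels(ids, 5, "s2s"))   # loss on the last 3 tokens only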
4 Experiment Setups

4.1 Decompile-Eval Benchmark

Currently, there appears to be no benchmark for decompilation evaluation that considers whether code can be recompiled or executed correctly. When assessing decompilation model performance, researchers rely on metrics that measure N-gram similarity (such as BLEU or CodeBLEU) or edit similarity (ES). However, these metrics, commonly used in machine translation and text generation, fail to adapt to the evaluation of programming languages.

Programming languages are highly structured and logical, insensitive to the naming of functions and variables, yet very sensitive to the flow of data and logic. Changing variable or function names does not affect the meaning of a program, but a single logical error can alter its entire function and purpose. As illustrated in Figure 3, the use of BLEU and ES in evaluating code similarity is problematic.

[Figure 3: Limitation of using BLEU and Edit Similarity for evaluating decompilation results. The original src is float trun_num(float num) { return num - (int)num; }. Four variants are scored against it: src1, return (int)num - (int)num; (BLEU 73.3, ES 91.7); src2, return num - num; (BLEU 67.6, ES 90.9); src3, float func(float x) { return x - (int)x; } (BLEU 0.0, ES 69.1); and src4, float func(float f) { int i = (int)f; return f - i; } (BLEU 0.0, ES 41.4).]

For src1, the variation from the original src is confined to a type conversion of the variable num, which leads to high BLEU and ES scores. However, this alteration completely changes the intent of the code. Similarly, src2 achieves high BLEU and ES scores, yet the semantics of the function are lost. Conversely, src3 undergoes normalization of function and variable names, causing no semantic shift, yet scores zero in BLEU against the original code. The example of src4 is more extreme: when the program logic is broken down into multiple lines, the ES drops to 41.4%, falsely indicating low similarity. However, during compilation, names are typically standardized by the compiler, and source code is often broken down into basic operations depending on the optimization. For this reason, the ability to recompile and execute the code is far more indicative than N-gram or edit similarity for evaluating decompilation efficacy.
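The contrast in Figure 3 is easy to reproduce approximately. The snippet below is a rough re-creation rather than the paper's scoring script: it uses NLTK's sentence-level BLEU and a difflib match ratio as a stand-in for edit similarity, so the absolute numbers will not match the figure, but the pattern (inflated scores for the semantically broken src1 and src2, deflated scores for the harmless rewrites src3 and src4) is similar.

from difflib import SequenceMatcher
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

src = "float trun_num(float num) { return num - (int)num; }"
variants = {
    "src1 (semantics changed)": "float trun_num(float num) { return (int)num - (int)num; }",
    "src2 (semantics changed)": "float trun_num(float num) { return num - num; }",
    "src3 (renamed only)":      "float func(float x) { return x - (int)x; }",
    "src4 (multi-line form)":   "float func(float f) { int i = (int)f; return f - i; }",
}

smooth = SmoothingFunction().method1  # avoid zero BLEU on very short snippets

for name, cand in variants.items():
    # Token-level BLEU over whitespace tokens; the paper may tokenize differently.
    bleu = sentence_bleu([src.split()], cand.split(), smoothing_function=smooth)
    # Edit similarity approximated by difflib's character-level match ratio.
    es = SequenceMatcher(None, src, cand).ratio()
    print(f"{name:28s} BLEU={bleu:.3f}  ES={es:.3f}")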
To address the gap in decompilation assessment, we introduce Decompile-Eval, the first benchmark to evaluate the re-compilability and re-executability of decompilation systems. This benchmark is derived from HumanEval (Chen et al., 2021), the leading benchmark for code generation assessment, which includes 164 programming challenges with accompanying Python solutions and assertions. We converted these Python solutions and assertions into C, making sure that they compile with the GCC compiler using standard C libraries and pass all the original assertions. In our evaluation process (Figure 1), the C source code is first compiled into a binary, then disassembled into assembly code, and finally fed into the decompilation system to be reconstructed back into C source code. This regenerated C code is compiled with GCC to test re-compilability and combined with the original assertions to check whether it can successfully execute and pass those assertions. Re-compilability and re-executability serve as critical indicators in validating the effectiveness of a decompilation process. When decompiled code can be recompiled, it provides strong evidence of syntactic integrity: it ensures that the decompiled code is not just readable, but also adheres to the structural and syntactic standards expected by the compiler. However, syntax alone does not guarantee semantic equivalence to the original pre-compiled program. Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess whether the decompilation preserved the program logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation, both essential for usable and robust decompilation.

In alignment with established evaluation practices, following Slade (Armengol-Estapé et al., 2023), we also partition 1,000 samples from AnghaBench into a test set and utilize BLEU and ES as the primary metrics for that assessment.
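Both indicators reduce to two subprocess checks per sample: does GCC accept the decompiled code, and does the resulting binary run and pass the original assertions? Below is a minimal sketch of such a harness; it assumes each test case provides the decompiled function plus a main() containing the converted assertions, and the names evaluate_sample, combined.c, and the example harness are illustrative, not taken from the released benchmark.

import subprocess
import tempfile
from pathlib import Path

def evaluate_sample(decompiled_c: str, assertion_harness_c: str, timeout: int = 10):
    """Return (recompilable, re_executable) for one decompiled function.

    decompiled_c:        C code produced by the decompilation system.
    assertion_harness_c: a main() containing the original assertions, which
                         calls the function defined in decompiled_c.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "combined.c"
        exe = Path(tmp) / "combined"
        src.write_text(decompiled_c + "\n" + assertion_harness_c)

        # Re-compilability: GCC must accept the regenerated code.
        compiled = subprocess.run(["gcc", str(src), "-o", str(exe), "-lm"],
                                  capture_output=True)
        if compiled.returncode != 0:
            return False, False

        # Re-executability: the binary must run and every assert() must hold
        # (a failed assert aborts the process with a non-zero exit status).
        try:
            ran = subprocess.run([str(exe)], capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return True, False
        return True, ran.returncode == 0

# Illustrative usage with the running example from Figure 3.
decompiled = "float trun_num(float num) { return num - (int)num; }"
harness = """
#include <assert.h>
float trun_num(float num);
int main(void) { assert(trun_num(3.75f) > 0.74f && trun_num(3.75f) < 0.76f); return 0; }
"""
print(evaluate_sample(decompiled, harness))  # expected: (True, True)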
4.2 Baselines

To benchmark against SOTA decompilers, we selected two key baselines. First, GPT-4 represents the most capable LLMs, providing an upper bound on LLM performance; as one of the largest language models, GPT-4 significantly surpasses previous LLMs across modalities. Second, DeepSeek-Coder is selected as the current SOTA open-source Code LLM, representing the forefront of publicly available models specifically tailored for coding tasks. While recent academic works like BTC (Hosseini and Dolan-Gavitt, 2022) and Slade (Armengol-Estapé et al., 2023) showcase LLMs for decompilation, these models present significant integration challenges, such as complex pre-processing settings, non-standardized tokenizer and model loading, and the substantial effort required to modify and adapt them. We thus selected GPT-4 and DeepSeek-Coder as representative cutting-edge and open-source baselines accessible for evaluation.
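For the open-source baseline, querying the model with the decompilation prompt is straightforward with the Hugging Face transformers library. The snippet below is an illustrative sketch, not the evaluation configuration used in the paper: the checkpoint name, generation settings, and the placeholder assembly string are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the baseline or an LLM4Decompile model to test.
MODEL_ID = "deepseek-ai/deepseek-coder-1.3b-base"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

asm = "...objdump output for the O0 binary goes here..."
prompt = (f"# This is the assembly code with O0 optimization:\n{asm}\n"
          "# What is the source code?\n")

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Keep only the newly generated tokens, i.e., the predicted C source.
decompiled = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(decompiled)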
[Table 1: Evaluation Results on Decompile-Eval.]