LLM4Decompile: Decompiling Binary Code With Large Language Models
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute.

Motivated by the success of Large Language Models (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024), researchers have employed LLMs for decompilation, primarily through two approaches: Refined-Decompile and End2end-Decompile. In particular, Refined-Decompile prompts the LLMs to refine results from traditional decompilation tools like Ghidra or IDA Pro. For instance, DeGPT (Hu et al., 2024) enhances Ghidra's readability by reducing cognitive load by 24.4%, while DecGPT (Wong et al., 2023) increases IDA Pro's re-executability rate to over 75% by integrating error messages into its refinement process. These approaches, however, largely ignore the fact that LLMs are designed primarily for high-level programming languages (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024), and their effectiveness on binary files is not well established. End2end-Decompile, on the other hand, fine-tunes LLMs to decompile binaries directly. Early open-source models like BTC (Hosseini and Dolan-Gavitt, 2022) and the more recent Slade (Armengol-Estapé et al., 2023) adopt language models with around 200 million parameters (Lewis et al., 2020) and fine-tune them for decompilation, while Nova (Jiang et al., 2023), which is not open-sourced, develops a binary LLM with 1 billion parameters and fine-tunes it for decompilation. Consequently, the largest open-source model in this domain is limited to 200M parameters, whereas using larger models trained on broader datasets has been shown to substantially improve performance (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023).

Therefore, our objective is to present the first and most extensive open-source LLM4Decompile series, aimed at comprehensively advancing the decompilation capability of LLMs. Initially, we optimize the End2end-Decompile approach to train LLM4Decompile-End, demonstrating its effectiveness in directly decompiling binary files. Subsequently, we enhance the Refined-Decompile framework to integrate LLMs with Ghidra, augmenting traditional tools for optimal effectiveness.

3 LLM4Decompile

First, we introduce our strategy for optimizing LLM training to directly decompile binaries; the resulting models are named LLM4Decompile-End. Following this, we detail our efforts for enhancing the Refined-Decompile approach; the corresponding fine-tuned models are referred to as LLM4Decompile-Ref, which can effectively refine the decompiled results from Ghidra.

3.1 LLM4Decompile-End

In this section, we describe the general End2end-Decompile framework and present details on our strategy to optimize the training of LLM4Decompile-End models.

3.1.1 The End2end-Decompile Framework

Figure 2: End2end-Decompile framework. The source code (SRC) is compiled to binary, disassembled to assembly instructions (ASM), and decompiled by LLM4Decompile to generate SRC'. Loss is computed between SRC and SRC' for training.

Figure 2 illustrates the End2end-Decompile framework from compilation to decompilation. During compilation, the Preprocessor processes the source code (SRC) to eliminate comments and expand macros or includes. The cleaned code is then forwarded to the Compiler, which converts it into assembly code (ASM). This ASM is transformed into binary code (0s and 1s) by the Assembler. The Linker finalizes the process by linking function calls to create an executable file. Decompilation, on the other hand, involves converting binary code back into a source file. LLMs, being trained on text, lack the ability to process binary data directly; therefore, binaries must first be disassembled by Objdump into assembly language (ASM). It should be noted that binary and disassembled ASM are equivalent and can be interconverted, and thus we refer to them interchangeably. Finally, the loss is computed between the decompiled code (SRC') and the source code (SRC) to guide the training.
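As a minimal sketch of the compile-then-disassemble step described above (the gcc/objdump flags and the lack of any post-processing of the disassembly are illustrative assumptions, not the paper's exact pipeline):

import os
import subprocess
import tempfile

def compile_and_disassemble(src_path: str, opt: str = "-O0") -> str:
    """Compile a C file to an object file with gcc, then return the
    disassembled assembly produced by objdump -d."""
    with tempfile.TemporaryDirectory() as tmp:
        obj = os.path.join(tmp, "sample.o")
        subprocess.run(["gcc", "-c", opt, src_path, "-o", obj], check=True)
        result = subprocess.run(["objdump", "-d", obj],
                                capture_output=True, text=True, check=True)
        return result.stdout

# A (SRC, ASM) training pair is then the original source text plus this output;
# the training loss is computed between SRC and the model's prediction SRC'.
# "func0.c" is a placeholder path for a single-function C file.
asm = compile_and_disassemble("func0.c", "-O0")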
3.1.2 Optimize LLM4Decompile-End

We optimize the training of LLM4Decompile-End models through three key steps: 1) augmenting the training corpus, 2) improving the quality of the data, and 3) incorporating two-stage training.

Training Corpus. As indicated by the Scaling Law (Hoffmann et al., 2024; Kaplan et al., 2020), the effectiveness of an LLM heavily relies on the size of the training corpus. Consequently, our initial step focuses on augmenting the training corpus.
Table 1: Main comparison of End2end-Decompile approaches for re-executability rates on evaluation benchmarks.

Table 2: Ablation study on training dataset. The "Compilable" models are trained on 7.2M non-executable functions, while the "Executable" models are trained on 1.6M executable functions.

Model/Benchmark    HumanEval-Decompile                         ExeBench
                   O0      O1      O2      O3      AVG         O0      O1      O2      O3      AVG
Compilable-1.3B    0.4268  0.1646  0.1646  0.1707  0.2317      0.0568  0.0446  0.0416  0.0443  0.0468
Compilable-6.7B    0.5183  0.3354  0.3232  0.3232  0.3750      0.0752  0.0649  0.0671  0.0660  0.0683
Executable-1.3B    0.1951  0.1280  0.1280  0.1159  0.1418      0.2194  0.1946  0.1931  0.1950  0.2005
Executable-6.7B    0.3720  0.1829  0.2256  0.1707  0.2378      0.2938  0.2598  0.2591  0.2549  0.2669
…2019). We train our models using the LLaMA-Factory library (Zheng et al., 2024). For the 1.3B and 6.7B models, we set a batch size of 2048 and a learning rate of 2e-5 and train the models for 2 epochs (15B tokens). Experiments are performed on NVIDIA A100-80GB GPU clusters. Fine-tuning the 1.3B and 6.7B LLM4Decompile-End models takes 12 and 61 days, respectively, on 8 × A100. Limited by resources, we train the 33B model for only 200M tokens. For evaluation, we use vllm (Kwon et al., 2023) to accelerate the generation (decompilation) process and employ greedy decoding to minimize randomness.
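To make the evaluation setup concrete, the following is a minimal sketch of greedy decoding with vllm; the model path, prompt text, and generation length are placeholders rather than the paper's exact settings.

from vllm import LLM, SamplingParams

# Hypothetical model identifier; the real prompt template is the one
# described in Section 4.1.1 and is not reproduced here.
llm = LLM(model="LLM4Decompile-End-6.7B")
greedy = SamplingParams(temperature=0.0, max_tokens=2048)  # temperature 0 => greedy decoding

prompts = ["# This is the assembly code:\n<asm here>\n# What is the source code?\n"]
outputs = llm.generate(prompts, greedy)
print(outputs[0].outputs[0].text)  # decompiled C source predicted by the model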
4.1.2 Experimental Results

Main Results. Table 1 presents the re-executability rates under different optimization levels for the models we study. The base version of DeepSeek-Coder-33B is unable to accurately decompile binaries: it can generate code that seems correct but fails to retain the original program semantics. GPT-4o shows notable decompilation skills; it is capable of decompiling non-optimized (O0) code with a success rate of 30.5%, though the rate drops significantly to about 11% for optimized code (O1-O3). The LLM4Decompile-End models, on the other hand, demonstrate excellent decompilation abilities. The 1.3B version successfully decompiles and retains the program semantics in 27.3% of cases on average, whereas the 6.7B version reaches a success rate of 45.4%. This improvement underscores the advantages of using larger models to capture a program's semantics more effectively. While attempting to fine-tune the 33B model, we encountered substantial challenges related to the high communication loads, which significantly slowed the training process and restricted us to using only 200M tokens (Section 4.1.1). Despite this limitation, the 33B model still outperforms the 1.3B model, reaffirming the importance of scaling up the model size.

Ablation Study. As discussed in Section 4.1.1, our training data comprises two distinct sets: 7.2 million compilable (non-executable) functions and 1.6M executable functions. We conducted an ablation study using these datasets, and the results are displayed in Table 2. Here, "Compilable" denotes models trained solely on compilable data, while "Executable" indicates models trained exclusively on executable data. Notably, the binary objects from compilable functions lack links to function calls, which makes them similar in text distribution to the HumanEval-Decompile data, consisting of single functions that depend only on standard C libraries. Consequently, the 6.7B model trained only on compilable data successfully decompiled 37.5% of HumanEval-Decompile functions, but only 6.8% on ExeBench, which features real functions with extensive user-defined functions. On the other hand, the 6.7B model trained solely on executable data achieved a 26.7% re-executability rate on the ExeBench test set but faced challenges with single functions, reaching only a 23.8% success rate on HumanEval-Decompile due to the smaller size of the training corpus. Limited by space, we present further analysis in Appendix B.
Table 3: Main comparison of Refined-Decompile approaches for re-executability rate and Edit Similarity on the HumanEval-Decompile benchmark. "+GPT-4o" refers to enhancing the Ghidra results with GPT-4o; "+LLM4Decompile-Ref" means refining Ghidra results with the fine-tuned LLM4Decompile-Ref models.

Model/Metrics              Re-executability Rate                       Edit Similarity
                           O0      O1      O2      O3      AVG         O0      O1      O2      O3      AVG
LLM4Decompile-End-6.7B     0.6805  0.3951  0.3671  0.3720  0.4537      0.1557  0.1292  0.1293  0.1269  0.1353
Ghidra
  Base                     0.3476  0.1646  0.1524  0.1402  0.2012      0.0699  0.0613  0.0619  0.0547  0.0620
  +GPT-4o                  0.4695  0.3415  0.2866  0.3110  0.3522      0.0660  0.0563  0.0567  0.0499  0.0572
  +LLM4Decompile-Ref-1.3B  0.6890  0.3720  0.4085  0.3720  0.4604      0.1517  0.1325  0.1292  0.1267  0.1350
  +LLM4Decompile-Ref-6.7B  0.7439  0.4695  0.4756  0.4207  0.5274      0.1559  0.1353  0.1342  0.1273  0.1382
  +LLM4Decompile-Ref-33B   0.7073  0.4756  0.4390  0.4146  0.5091      0.1540  0.1379  0.1363  0.1307  0.1397
4.2 LLM4Decompile-Ref

4.2.1 Experimental Setups

Experimental Datasets. The training data is constructed using ExeBench, with Ghidra Headless employed to decompile the binary object files. Due to constraints in computational resources, only 400K functions at each optimization level from O0 to O3 (1.6M samples, 1B tokens) are used for training, and the evaluation is conducted on HumanEval-Decompile. The models are trained using the same template described in Section 4.1.1. In addition, following previous work (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023), we assess the readability of decompiled results in terms of the Edit Similarity score.
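The exact Edit Similarity formulation follows the prior work cited above; purely as an illustration of the idea (an assumption, not necessarily the metric implementation used here), a character-level similarity ratio between source and decompiled code can be computed as follows.

import difflib

def edit_similarity(source: str, decompiled: str) -> float:
    """Illustrative similarity score in [0, 1]; 1.0 means identical text.
    Evaluations of this kind often normalize an edit (Levenshtein) distance."""
    return difflib.SequenceMatcher(None, source, decompiled).ratio()

# Example: near-identical code scores close to 1.0, unrelated code close to 0.0.
print(edit_similarity("int f(int a){return a+1;}", "int f(int x){return x+1;}"))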
Implementation. Configuration settings for the models are consistent with those in Section 4.1.1. For the 1.3B and 6.7B models, the fine-tuning process involves 2B tokens over 2 epochs and requires 2 and 8 days, respectively, on 8 × A100. Limited by resources, we train the 33B model for only 200M tokens. For evaluation, we first assess the re-executability rate of Ghidra to establish a baseline. Subsequently, GPT-4o is used to enhance Ghidra's decompilation results with the prompt "Generate linux compilable C/C++ code of the main and other functions in the supplied snippet without using goto, fix any missing headers. Do not explain anything.", following DecGPT (Wong et al., 2023). Finally, we use the LLM4Decompile-Ref models to refine Ghidra's output.
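A minimal sketch of the GPT-4o refinement step described above, assuming the OpenAI Python client; the model name, message layout, and decoding settings are illustrative rather than the paper's exact configuration, while the prompt text is the one quoted above.

from openai import OpenAI

PROMPT = ("Generate linux compilable C/C++ code of the main and other functions "
          "in the supplied snippet without using goto, fix any missing headers. "
          "Do not explain anything.")

def refine_with_gpt4o(ghidra_pseudo_code: str) -> str:
    # Hypothetical call shape; assumes an OPENAI_API_KEY in the environment.
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{PROMPT}\n{ghidra_pseudo_code}"}],
        temperature=0,
    )
    return resp.choices[0].message.content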
4.2.2 Experimental Results

The results for the baselines and Refined-Decompile approaches are summarized in Table 3. For the pseudo-code decompiled by Ghidra, which is not optimized for re-execution, only an average of 20.1% passes the test cases. GPT-4o assists in refining this pseudo-code and enhancing its quality. The LLM4Decompile-Ref models offer substantial improvements over Ghidra's outputs, with the 6.7B model yielding a 160% increase in re-executability. Similar to the discussion in Section 4.1.2, the 33B model outperforms the 1.3B model even though it used considerably less training data, and it achieves performance that is only 3.6% below the 6.7B model, which benefited from ten times more training data. When compared to LLM4Decompile-End-6.7B, the LLM4Decompile-Ref-6.7B model, though trained on just 10% of the data used for the End models, shows a 16.2% performance increase, suggesting a greater potential for the Refined-Decompile approach.

An analysis of readability across the different methods is also conducted and presented in Table 3, with examples shown in Figure 4. In terms of text similarity, all decompiled outputs diverge from the original source code, with Edit Similarity ranging from 5.7% to 14.0%, primarily because the compilation process removes variable names and optimizes the logic structure. Ghidra generates pseudo-code that is particularly hard to read, with 6.2% Edit Similarity on average. Interestingly, with refinement from GPT (Ghidra+GPT-4o), there is a marginal decrease in Edit Similarity. GPT assists in fixing type errors like undefined4 and ulong (Figure 4); however, it struggles to accurately reconstruct for loops and array indexing. In contrast, both LLM4Decompile-End and LLM4Decompile-Ref generate outputs that are more aligned with the format of the source code and easier to comprehend. To summarize, domain-specific fine-tuning is crucial for enhancing the re-executability and readability of decompilation outputs.
Table 4: Re-executability rates of different approaches on the HumanEval-Decompile benchmark under obfuscation. Compared to Table 3, the decompilation success rates drop significantly, by over 70%.

Model/Obfuscation          Control Flow Flattening                     Bogus Control Flow
                           O0      O1      O2      O3      AVG         O0      O1      O2      O3      AVG
LLM4Decompile-End-6.7B     0.0427  0.0488  0.0488  0.0305  0.0427      0.0976  0.0732  0.0793  0.0976  0.0869
Ghidra                     0.1220  0.0671  0.0610  0.0671  0.0793      0.0610  0.0427  0.0305  0.0427  0.0442
+LLM4Decompile-Ref-6.7B    0.0671  0.0366  0.0488  0.0549  0.0519      0.1585  0.1402  0.0854  0.0793  0.1159
Iman Hosseini and Brendan Dolan-Gavitt. 2022. Beyond the C: Retargetable decompilation using neural machine translation. CoRR, abs/2212.08950.

Peiwei Hu, Ruigang Liang, and Kai Chen. 2024. DeGPT: Optimizing decompiler output with LLM. In Proceedings of the 2024 Network and Distributed System Security Symposium (NDSS 2024).

Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang. 2023. Nova+: Generative language models for binaries. CoRR, abs/2311.13721.

Pascal Junod, Julien Rinaldini, Johan Wehrli, and Julie Michielin. 2015. Obfuscator-LLVM – software protection for the masses. In Proceedings of the IEEE/ACM 1st International Workshop on Software Protection, SPRO'15, Firenze, Italy, May 19th, 2015, pages 3–9. IEEE.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. Preprint, arXiv:2001.08361.

Deborah S. Katz, Jason Ruchti, and Eric M. Schulte. 2018. Using recurrent neural networks for decompilation. In 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, pages 346–356. IEEE Computer Society.

Omer Katz, Yuval Olshaker, Yoav Goldberg, and Eran Yahav. 2019. Towards neural decompilation. ArXiv, abs/1905.08325.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! Preprint, arXiv:2305.06161.

Zhibo Liu and Shuai Wang. 2020a. How far we have come: Testing decompilation correctness of C decompilers. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2020, pages 475–487, New York, NY, USA. Association for Computing Machinery.

Zhibo Liu and Shuai Wang. 2020b. How far we have come: Testing decompilation correctness of C decompilers. In ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, pages 475–487. ACM.

Jerome Miecznikowski and Laurie J. Hendren. 2002. Decompiling Java bytecode: Problems, traps and pitfalls. In International Conference on Compiler Construction.
Steven S. Muchnick. 1997. Advanced Compiler Design and Implementation.

Vikram Nitin, Anthony Saieva, Baishakhi Ray, and Gail Kaiser. 2021. DIRECT: A transformer-based model for decompiled identifier renaming. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), pages 48–57, Online. Association for Computational Linguistics.

Godfrey Nolan. 2012. Decompiling Android. Apress.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code. CoRR, abs/2308.12950.

Richard M. Stallman et al. 2003. Using the GNU Compiler Collection. Free Software Foundation, 4(02).

Ruoyu Wang, Yan Shoshitaishvili, Antonio Bianchi, Aravind Machiry, John Grosen, Paul Grosen, Christopher Kruegel, and Giovanni Vigna. 2017. Ramblr: Making reassembly great again. In NDSS.

Tao Wei, Jian Mao, Wei Zou, and Yu Chen. 2007. A new algorithm for identifying loops in decompilation. In Static Analysis, 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007, Proceedings, volume 4634 of Lecture Notes in Computer Science, pages 170–183. Springer.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Wai Kin Wong, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2023. Refining decompiled C code with large language models. CoRR, abs/2310.06530.

Xiangzhe Xu, Zhuo Zhang, Shiwei Feng, Yapeng Ye, Zian Su, Nan Jiang, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. 2023. LmPa: Improving decompilation by synergy of large language model and program analysis. CoRR, abs/2306.02546.

Xiangzhe Xu, Zhuo Zhang, Zian Su, Ziyang Huang, Shiwei Feng, Yapeng Ye, Nan Jiang, Danning Xie, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. 2024. Leveraging generative models to recover variable names from stripped binary. Preprint, arXiv:2306.02546.

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. 2024. LlamaFactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372.

A ExeBench Setups

For every sample in ExeBench's executable splits, assembly code from the *.s file—a compiler's intermediate output, as discussed in Section 3.1 and Figure 1—is required to compile the sample into a binary. The specific compilation settings and processing details, however, are not provided by the authors. Consequently, we choose to compile the code in a standard way and manage to compile only half of the samples. This leaves us with 443K out of 797K samples for the executable training set and 2,621 out of 5,000 samples for the executable test set. Accordingly, we train our model on the 443K samples and conduct the re-executability evaluation on the 2,621 samples; the results are shown in Table 1.

The researchers from Slade (Armengol-Estapé et al., 2023), who also developed ExeBench (Armengol-Estapé et al., 2022), have published their decompilation findings on ExeBench. They chose to decompile the intermediate output, i.e., the assembly code from the *.s file, directly without further compilation into binaries; in practice, such intermediate output is rarely released by software developers. Their reported results, as seen in Table 5, differ significantly from ours. Their version of ChatGPT achieved a re-executability rate of 22.2% and an edit similarity of 44.0% under O0 optimization, whereas our GPT-4o model only reached a 4.4% re-executability rate and a 7.9% edit similarity. The approach taken by Slade involves settings that are not commonly available in practical decompilation scenarios, which explains why their results vary significantly from ours. We adhere to a more realistic setting, decompiling binary files based solely on their intrinsic data, without any external information.
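The "standard way" of compiling mentioned above is not fully specified; as a rough, hypothetical illustration of such a filtering step, one could attempt to build each sample at every optimization level and keep only those that compile (the compiler and flags here are placeholders, not the paper's exact settings).

import os
import subprocess
import tempfile

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def compiles_cleanly(c_source: str) -> bool:
    """Return True if the sample builds at every optimization level with plain gcc."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.c")
        with open(src, "w") as f:
            f.write(c_source)
        for opt in OPT_LEVELS:
            obj = os.path.join(tmp, f"sample{opt}.o")
            result = subprocess.run(["gcc", "-c", opt, src, "-o", obj],
                                    capture_output=True)
            if result.returncode != 0:
                return False  # drop samples that fail under these settings
    return True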
Table 5: Re-executability and Edit Similarity on ExeBench.

Model/Metrics    Re-executability    Edit Similarity
Optimization     O0      O3          O0      O3
Slade            59.5    52.2        71.0    60.0
ChatGPT          22.2    13.6        44.0    34.0
GPT-4o (ours)    4.4     3.4         7.9     6.6

Figure 5: Decompilation results of GPT-4o on an ExeBench test case.

Source Code:
void StateIdle(Ltc4151State next, Ltc4151 *device) {
    device->state = next;
}

ASM:
<StateIdle>:
    endbr64
    push %rbp
    mov  %rsp,%rbp
    mov  %edi,-0x4(%rbp)
    mov  %rsi,-0x10(%rbp)
    mov  -0x10(%rbp),%rax
    mov  -0x4(%rbp),%edx
    mov  %edx,(%rax)
    nop
    pop  %rbp
    retq

GPT-4o:
void StateIdle(int a, int *b) {
    *b = a;
}

Figure 6: Re-executability rate with the growth of input length (1.3B and 6.7B performance on HumanEval-Decompile at O0-O3). The 6.7B model is more robust against input length.
To further illustrate our settings, Figure 5 offers an example where the source function includes specific user-defined types like Ltc4151State, Ltc4151, and device. However, these types are completely lost after compilation, i.e., no information related to these user definitions can be found in the binary (disassembled ASM code). Consequently, GPT-4o is unable to reconstruct these types based purely on the ASM (the realistic setting), instead converting them to default types such as int or pointers, producing non-executable code. This issue was pervasive across the ExeBench test set, leading to the failure of the GPT-4o models in decompiling the ExeBench samples in a realistic setting.
B Further Analysis of LLM4Decompile-End

Figure 7: Types of errors identified in the two benchmarks: LLM4Decompile-End-6.7B faces issues with logical errors in HumanEval-Decompile and with user-defined components in ExeBench.

Figure 6 illustrates that the re-executability rate decreases as the input length increases, and there is a marked decline in performance at higher levels of code optimization, highlighting the difficulties in decompiling long and highly optimized sequences. Importantly, the performance difference between the 1.3B and 6.7B models showcased in the figure emphasizes the advantages of larger models in such tasks. Larger models, with their expanded computational resources and deeper learning capabilities, are inherently better at resolving the challenges posed by complex decompilations.
The error analysis presented in Figure 7 for LLM4Decompile-End-6.7B indicates that logical errors are prevalent in the HumanEval-Decompile scenarios, with 64% of errors due to assertions that the decompiled code does not pass. In the ExeBench dataset, which features real functions with user-defined structures and types, the major challenges are related to reclaiming these user-specific components: 50% of the errors come from undeclared functions, and 28% from improper use of structures. Given that these user-defined details are typically lost during the compilation process, reconstructing them can be particularly challenging. Integrating techniques like Retrieval-Augmented Generation might supplement the decompilation process with the necessary external information.

C Obfuscation Techniques

We provide the details of two classic obfuscation techniques suggested in Obfuscator-LLVM; a compile-time sketch covering both passes appears at the end of this appendix.

Control Flow Flattening enhances the security of software by transforming its straightforward, hierarchical control flow into a more complex, flattened structure. The workflow involves breaking a function into basic blocks, arranging these blocks at the same level, and encapsulating them within a switch statement inside a loop.
Bogus Control Flow modifies a function's execution sequence by inserting an additional basic block prior to the existing one. This added block includes an opaque predicate, followed by a conditional jump that leads back to the original block. Additionally, the original basic block is polluted with randomly selected, meaningless instructions.
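As a rough sketch under stated assumptions (the Obfuscator-LLVM install path is hypothetical; -mllvm -fla and -mllvm -bcf are the flags the Obfuscator-LLVM project documents for the two passes above), obfuscated objects for the benchmark could be produced as follows.

import subprocess

# Hypothetical location of the Obfuscator-LLVM clang binary.
OLLVM_CLANG = "/opt/obfuscator-llvm/bin/clang"

def compile_obfuscated(src: str, out: str, opt: str = "-O0",
                       flatten: bool = True, bogus: bool = True) -> None:
    """Compile a C file with O-LLVM obfuscation passes enabled.

    -mllvm -fla : control flow flattening
    -mllvm -bcf : bogus control flow
    """
    cmd = [OLLVM_CLANG, opt, "-c", src, "-o", out]
    if flatten:
        cmd += ["-mllvm", "-fla"]
    if bogus:
        cmd += ["-mllvm", "-bcf"]
    subprocess.run(cmd, check=True)

# Example: compile_obfuscated("func0.c", "func0_obf.o", "-O0")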