
LLM4Decompile: Decompiling Binary Code with Large Language Models

Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang


Southern University of Science and Technology
The Hong Kong Polytechnic University

arXiv:2403.05286v2 [cs.PL] 19 Jun 2024

Abstract

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100%. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over LLM4Decompile-End. LLM4Decompile¹ demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results.

¹ https://github.com/albertan017/LLM4Decompile

[Figure 1: Illustration of compiling source code to binary, disassembling binary to assembly code (ASM), and decompiling ASM to pseudo-code with Ghidra. The pseudo-code is hard to read and not executable.]

1 Introduction

Decompilation, the reverse process of converting machine code or binary code into a high-level programming language, facilitates various reverse engineering tasks such as vulnerability identification, malware research, and legacy software migration (Brumley et al., 2013; Katz et al., 2018; Hosseini and Dolan-Gavitt, 2022; Xu et al., 2023; Armengol-Estapé et al., 2023; Jiang et al., 2023; Wong et al., 2023; Hu et al., 2024). Decompilation is challenging due to the loss of information inherent in the compilation process, particularly finer details such as variable names (Lacomis et al., 2019) and fundamental structures like loops and conditionals (Wei et al., 2007). To address these challenges, numerous tools have been developed for decompilation, with Ghidra (Ghidra, 2024) and IDA Pro (Hex-Rays, 2024) being the most commonly used. Although these tools have the capability to revert binary code to high-level pseudo-code, the outputs often lack readability and re-executability (Liu and Wang, 2020a; Wang et al., 2017), which are essential for applications like legacy software migration and security instrumentation tasks (Wong et al., 2023; Dinesh et al., 2020).

Figure 1 illustrates the transformation from the source C code to a binary file, to assembly code (ASM), and to pseudo-code decompiled from Ghidra. In this pseudo-code, the original nested for structure is replaced with a less intuitive combination of a do-while loop inside another while loop. Furthermore, array indexing like num[i] is decompiled into complicated pointer arithmetic such as *(float *)(param_2 + (long)local_24 * 4). The decompiled output also exhibits syntactical errors, with the function return type being converted to undefined4. Overall, traditional decompilation tools often strip away the syntactic clarity provided by high-level languages and do not ensure the correctness of syntax, posing significant challenges even for skilled developers trying to reconstruct the algorithmic logic (Wong et al., 2023; Hu et al., 2024).
Recent advancements in Large Language Models (LLMs) have greatly improved the process of decompiling code. There are two primary approaches to LLM-based decompilation: Refined-Decompile and End2end-Decompile. In particular, Refined-Decompile prompts LLMs to refine the results from traditional decompilation tools (Hu et al., 2024; Wong et al., 2023; Xu et al., 2023). However, LLMs are primarily optimized for high-level programming languages and may not be as effective with binary data. End2end-Decompile fine-tunes LLMs to decompile binaries directly. Nevertheless, previous open-source applications of this approach were limited by the use of smaller models with only around 200 million parameters and a restricted training corpus (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023; Jiang et al., 2023). In contrast, utilizing larger models trained on broader datasets has been shown to substantially improve performance (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023; OpenAI, 2023).

To address the limitations of previous studies, we propose LLM4Decompile, the first and largest open-source LLM series, with sizes ranging from 1.3B to 33B parameters, specifically trained to decompile binary code. To the best of our knowledge, no previous study has attempted to improve the capability of LLM-based decompilation in such depth or to incorporate such large-scale LLMs. Based on the End2end-Decompile approach, we introduce three critical steps: data augmentation, data cleaning, and two-stage training, to optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. Specifically, our LLM4Decompile-End-6.7B model demonstrates a successful decompilation rate of 45.4% on HumanEval (Chen et al., 2021) and 18.0% on ExeBench (Armengol-Estapé et al., 2022), far exceeding Ghidra or GPT-4o by over 100%. Additionally, we improve the Refined-Decompile strategy by examining the efficiency of Ghidra's decompilation process, augmenting and filtering data to fine-tune the LLM4Decompile-Ref models, which excel at refining Ghidra's output. Experiments suggest a higher performance ceiling for the enhanced Refined-Decompile approach, with a 16.2% improvement over LLM4Decompile-End. Finally, we assess the risks associated with the potential misuse of our model under obfuscation conditions commonly used in software protection. Our findings indicate that neither our approach nor Ghidra can effectively decompile obfuscated code, mitigating concerns about unauthorized use for infringement of intellectual property.

In summary, our contributions are as follows:

• We introduce the LLM4Decompile series, the first and largest open-source LLMs (ranging from 1.3B to 33B parameters) fine-tuned on 15 billion tokens for decompilation.

• We optimize the LLM training process and introduce the LLM4Decompile-End models, which set a new performance standard for direct binary decompilation, significantly surpassing GPT-4o and Ghidra by over 100% on the HumanEval and ExeBench benchmarks.

• We improve the Refined-Decompile approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled results from Ghidra and achieve a further 16.2% improvement over LLM4Decompile-End.

2 Related Work

The practice of reversing executable binaries to their source code form, known as decompilation, has been researched for decades (Miecznikowski and Hendren, 2002; Nolan, 2012; Katz et al., 2019). Traditional decompilation relies on analyzing the control and data flows of a program (Brumley et al., 2013) and employing pattern matching, as seen in tools like Hex-Rays IDA Pro (Hex-Rays, 2024) and Ghidra (Ghidra, 2024). These systems attempt to identify patterns within a program's control-flow graph (CFG) that correspond to standard programming constructs such as conditional statements or loops. However, the output from such decompilation processes tends to be a source-code-like representation of assembly code, including direct translations of variables to registers, use of gotos, and other low-level operations instead of the original high-level language constructs. This output, while often functionally similar to the original code, is difficult to understand and may not be re-executable (Liu and Wang, 2020b; Wong et al., 2023). Drawing inspiration from neural machine translation, researchers have reformulated decompilation as a translation exercise, converting machine-level instructions into readable source code (Katz et al., 2019). Initial attempts in this area utilized recurrent neural networks (RNNs) (Katz et al., 2018) for decompilation, complemented by error-correction techniques to enhance the outcomes.
Motivated by the success of Large Language Models (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024), researchers have employed LLMs for decompilation, primarily through two approaches: Refined-Decompile and End2end-Decompile. In particular, Refined-Decompile prompts the LLMs to refine results from traditional decompilation tools like Ghidra or IDA Pro. For instance, DeGPT (Hu et al., 2024) enhances the readability of Ghidra's output by reducing cognitive load by 24.4%, while DecGPT (Wong et al., 2023) increases IDA Pro's re-executability rate to over 75% by integrating error messages into its refinement process. These approaches, however, largely ignore the fact that LLMs are designed primarily for high-level programming languages (Li et al., 2023; Rozière et al., 2023; Guo et al., 2024), and their effectiveness with binary files is not well established. End2end-Decompile, on the other hand, fine-tunes LLMs to decompile binaries directly. Early open-source models like BTC (Hosseini and Dolan-Gavitt, 2022) and the recent Slade (Armengol-Estapé et al., 2023) adopt language models with around 200 million parameters (Lewis et al., 2020) and fine-tune them for decompilation, while Nova (Jiang et al., 2023), which is not open-sourced, develops a binary LLM with 1 billion parameters and fine-tunes it for decompilation. Consequently, the largest open-source model in this domain is limited to 200M parameters, whereas utilizing larger models trained on broader datasets has been shown to substantially improve performance (Hoffmann et al., 2024; Kaplan et al., 2020; Rozière et al., 2023).

Therefore, our objective is to present the first and most extensive open-source LLM4Decompile series, aiming at comprehensively advancing the decompilation capability of LLMs. Initially, we optimize the End2end-Decompile approach to train LLM4Decompile-End, demonstrating its effectiveness in directly decompiling binary files. Subsequently, we enhance the Refined-Decompile framework to integrate LLMs with Ghidra, augmenting traditional tools for optimal effectiveness.

3 LLM4Decompile

First, we introduce our strategy for optimizing LLM training to directly decompile binaries; the resulting models are named LLM4Decompile-End. Following this, we detail our efforts for enhancing the Refined-Decompile approach; the corresponding fine-tuned models are referred to as LLM4Decompile-Ref, which can effectively refine the decompiled results from Ghidra.

[Figure 2: End2end-Decompile framework. The source code (SRC) is compiled to binary, disassembled to assembly instructions (ASM), and decompiled by LLM4Decompile to generate SRC'. Loss is computed between SRC and SRC' for training.]

3.1 LLM4Decompile-End

In this section, we describe the general End2end-Decompile framework and present details on our strategy to optimize the training of the LLM4Decompile-End models.

3.1.1 The End2end-Decompile Framework

Figure 2 illustrates the End2end-Decompile framework from compilation to decompilation. During compilation, the Preprocessor processes the source code (SRC) to eliminate comments and expand macros or includes. The cleaned code is then forwarded to the Compiler, which converts it into assembly code (ASM). This ASM is transformed into binary code (0s and 1s) by the Assembler. The Linker finalizes the process by linking function calls to create an executable file. Decompilation, on the other hand, involves converting binary code back into a source file. LLMs, being trained on text, lack the ability to process binary data directly. Therefore, binaries must first be disassembled by Objdump into assembly language (ASM). It should be noted that binary and disassembled ASM are equivalent; they can be interconverted, and thus we refer to them interchangeably. Finally, the loss is computed between the decompiled code and the source code to guide the training.
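The compile-and-disassemble step above can be reproduced with the standard GCC/binutils toolchain. The following sketch is illustrative only (it is not part of our released pipeline) and assumes an x86 Linux host with gcc and objdump installed; the helper name compile_and_disassemble is our own.

    import os
    import subprocess
    import tempfile

    def compile_and_disassemble(c_source: str, opt_level: str = "O0") -> str:
        """Compile one C function to a binary object and return its textual disassembly."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "func.c")
            obj = os.path.join(tmp, "func.o")
            with open(src, "w") as f:
                f.write(c_source)
            # Compile without linking at the requested optimization level.
            subprocess.run(["gcc", f"-{opt_level}", "-c", src, "-o", obj], check=True)
            # Disassemble the binary object back into assembly text.
            out = subprocess.run(["objdump", "-d", obj],
                                 capture_output=True, text=True, check=True)
            return out.stdout

The model therefore never consumes raw bytes; it is fine-tuned on the textual objdump output produced by a step of this kind.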
3.1.2 Optimizing LLM4Decompile-End

We optimize the training of the LLM4Decompile-End models through three key steps: 1) augmenting the training corpus, 2) improving the quality of the data, and 3) incorporating two-stage training.

Training Corpus. As indicated by the scaling laws (Hoffmann et al., 2024; Kaplan et al., 2020), the effectiveness of an LLM heavily relies on the size of the training corpus. Consequently, our initial step in training optimization involves incorporating a large training corpus. We construct asm-source pairs based on ExeBench (Armengol-Estapé et al., 2022), which is the largest public collection of C functions, with five million functions. To further expand the training data, we consider the compilation optimization states frequently used by developers. Compilation optimization involves techniques like eliminating redundant instructions, better register allocation, and loop transformations (Muchnick, 1997), which act as a natural form of data augmentation for decompilation. The key optimization levels range from O0 (default, no optimization) to O3 (aggressive optimizations). We compile the source code at all four levels, i.e., O0, O1, O2, and O3, and pair each of them with the source code.
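As a sketch of this augmentation, each source function is simply compiled and disassembled once per optimization level, reusing the compile_and_disassemble helper sketched in Section 3.1.1; the dictionary layout of the resulting pairs is illustrative, not our exact data format.

    OPT_LEVELS = ["O0", "O1", "O2", "O3"]

    def make_training_pairs(c_source: str):
        """Pair one source function with its disassembly at every optimization level."""
        pairs = []
        for level in OPT_LEVELS:
            asm = compile_and_disassemble(c_source, opt_level=level)
            pairs.append({"asm": asm, "src": c_source, "opt": level})
        return pairs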
Data Quality. Data quality is critical in training an effective model (Li et al., 2023). Therefore, our second step is to clean the training set. We follow the guidelines of StarCoder (Li et al., 2023) by computing MinHash signatures (Broder, 2000) for the code and utilizing Locality-Sensitive Hashing (LSH) to remove duplicates. We also exclude samples that are shorter than 10 tokens.
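One way to realize this cleaning step, assuming the datasketch library as a stand-in for the actual deduplication tooling, is sketched below; the signature size (128 permutations) and the similarity threshold (0.85) are illustrative values, not the exact settings used in training.

    from datasketch import MinHash, MinHashLSH

    def minhash_of(code: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        for token in code.split():          # simple whitespace tokenization for illustration
            m.update(token.encode("utf-8"))
        return m

    def deduplicate(samples, threshold: float = 0.85):
        """Keep one representative per cluster of near-duplicates; drop tiny samples."""
        lsh = MinHashLSH(threshold=threshold, num_perm=128)
        kept = []
        for idx, code in enumerate(samples):
            if len(code.split()) < 10:      # exclude samples shorter than 10 tokens
                continue
            sig = minhash_of(code)
            if lsh.query(sig):              # a near-duplicate was already kept
                continue
            lsh.insert(str(idx), sig)
            kept.append(code)
        return kept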
Two-Stage Training. Our final step for training optimization aims to equip the model with binary knowledge and consists of two-stage training. In the first stage, we train the model with a large corpus of compilable but not linkable (i.e., not executable) data. Note that it is significantly easier to extract C code that is compilable but not linkable (da Silva et al., 2021; Armengol-Estapé et al., 2022). Such non-executable binary object code closely resembles its executable version except that it lacks linked addresses for external symbols. Therefore, in the first stage, we use the extensive compilable code to ground our model in binary knowledge. In the second stage, we refine the model using executable code to ensure its practical applicability. We also conduct an ablation study of the two-stage training in Section 4.1.2.

[Figure 3: Refined-Decompile framework. It differs from End2end-Decompile (Figure 2) only in the LLM's input, which is pseudo-code decompiled from Ghidra.]

3.2 LLM4Decompile-Ref

We now examine how the conventional decompilation tool, Ghidra, can be significantly improved by integrating it with LLMs. Note that our approach aims at refining entire outputs from Ghidra, offering a broader strategy than merely recovering names or types (Nitin et al., 2021; Xu et al., 2024). We begin by detailing the general Refined-Decompile framework and then discuss our strategy to enhance Ghidra's output with LLM4Decompile-Ref.

3.2.1 The Refined-Decompile Framework

The Refined-Decompile approach is shown in Figure 3. It differs from the approach in Figure 2 only in terms of the LLM's input, which in the case of Refined-Decompile comes from Ghidra's decompilation output. Specifically, Ghidra is used to decompile the binary, and the LLM is then fine-tuned to enhance Ghidra's output. While Ghidra produces high-level pseudo-code that may suffer from readability issues and syntax errors, it effectively preserves the underlying logic. Refining this pseudo-code significantly mitigates the challenges associated with understanding the obscure ASM.

3.2.2 Refine Ghidra by LLM4Decompile-Ref

Decompiling using Ghidra. Decompiling the executable code with Ghidra (Figure 3) is time-consuming due to the complex nature of the executables in ExeBench, which include numerous external functions and IO wrappers. Ghidra Headless requires 2 seconds per sample even with 128-core multiprocessing. Given such a high computational load, and the high similarity between non-executable and executable binaries, we choose to decompile the non-executable files using Ghidra. This choice significantly reduces the time to 0.2 seconds per sample, enabling us to efficiently gather large amounts of training data.
Optimization Strategies. Similar to Section 3.1.2, we augment our dataset by compiling with optimization levels O0, O1, O2, and O3. We further filter the dataset using LSH to remove duplicates. As shown in Figure 1, Ghidra often generates overly long pseudo-code. Consequently, we filter out any samples that exceed the maximum length accepted by our model.

4 Experiments

In this section, we discuss the experimental setups and results for LLM4Decompile-End and LLM4Decompile-Ref, respectively.

4.1 LLM4Decompile-End

4.1.1 Experimental Setups

Training Data. As discussed in Section 3.1.2, we construct asm-source pairs based on the compilable and executable datasets from ExeBench (Armengol-Estapé et al., 2022), where we only consider the decompilation of GCC-compiled (Stallman et al., 2003) C functions under the x86 Linux platform. After filtering, our refined compilable training dataset includes 7.2 million samples, containing roughly 7 billion tokens. Our executable training dataset includes 1.6 million samples, containing roughly 572 million tokens. To train the model, we use the following template: # This is the assembly code: [ASM code] # What is the source code? [source code], where [ASM code] corresponds to the disassembled assembly code from the binary, and [source code] is the original C function. Note that the template choice does not impact performance, since we fine-tune the model to produce the source code.
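Concretely, one asm-source pair is turned into a single training text as in the sketch below; the exact placement of newlines is our assumption, since only the template wording is specified above.

    def build_training_text(asm_code: str, source_code: str) -> str:
        """Format one asm-source pair with the template described above."""
        return (
            "# This is the assembly code:\n"
            f"{asm_code}\n"
            "# What is the source code?\n"
            f"{source_code}"
        )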
Evaluation Benchmarks and Metrics. To evaluate the models, we introduce the HumanEval (Chen et al., 2021) and ExeBench (Armengol-Estapé et al., 2022) benchmarks. HumanEval is the leading benchmark for code generation assessment and includes 164 programming challenges with accompanying Python solutions and assertions. We converted these Python solutions and assertions into C, making sure that they can be compiled with the GCC compiler using standard C libraries and pass all the assertions, and name the result HumanEval-Decompile. ExeBench consists of 5000 real-world C functions taken from GitHub with IO examples. Note that HumanEval-Decompile consists of individual functions that depend only on the standard C library. In contrast, ExeBench includes functions extracted from real-world projects with user-defined structures and functions.

As for the evaluation metrics, we follow previous work and calculate the re-executability rate (Armengol-Estapé et al., 2023; Wong et al., 2023). During evaluation, the C source code is first compiled into a binary, then disassembled into assembly code and fed into the decompilation system to be reconstructed back into C code. This decompiled C code is then combined with the assertions to check whether it can successfully execute and pass those assertions.
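A minimal sketch of this re-executability check is given below, assuming the assertions are packaged as a C main() harness that aborts on failure; the file names, the -lm flag, and the timeout are illustrative choices rather than the exact evaluation harness.

    import os
    import subprocess
    import tempfile

    def passes_assertions(decompiled_c: str, test_harness_c: str) -> bool:
        """Compile the decompiled function with its assertion-based main() and run it."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "candidate.c")
            exe = os.path.join(tmp, "candidate")
            with open(src, "w") as f:
                f.write(decompiled_c + "\n" + test_harness_c)
            build = subprocess.run(["gcc", src, "-o", exe, "-lm"], capture_output=True)
            if build.returncode != 0:        # the candidate does not even compile
                return False
            try:
                run = subprocess.run([exe], timeout=10)
            except subprocess.TimeoutExpired:
                return False
            return run.returncode == 0       # assert() aborts with a non-zero status on failure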
Model Configurations. LLM4Decompile uses the same architecture as DeepSeek-Coder (Guo et al., 2024), and we initialize our models with the corresponding DeepSeek-Coder checkpoints. We employ sequence-to-sequence (S2S) prediction, the training objective adopted in most neural machine translation models, which aims to predict the output given the input sequence. As illustrated in Equation 1, it minimizes the negative log likelihood of the source code tokens x_i, ..., x_j:

    L = - \sum_i \log P(x_i, ..., x_j \mid x_1, ..., x_{i-1}; \theta)    (1)

where the loss is calculated only for the output sequence x_i, ..., x_j, i.e., the source code.
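In practice, restricting the loss to the source-code tokens can be implemented by masking the prompt positions in the label sequence. The sketch below assumes a Hugging Face-style causal-LM setup in which positions labeled -100 are ignored by the cross-entropy loss; it illustrates the objective and is not our exact training code.

    def build_labels(prompt_ids, target_ids, ignore_index=-100):
        """Concatenate prompt and target; mask the prompt so the loss covers only the source code."""
        input_ids = list(prompt_ids) + list(target_ids)
        labels = [ignore_index] * len(prompt_ids) + list(target_ids)
        return input_ids, labels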
Baselines. We selected two key baselines for comparison. First, GPT-4o (OpenAI, 2023) represents the most capable LLMs, providing an upper bound on LLM performance. Second, DeepSeek-Coder (Guo et al., 2024) is selected as the current SOTA open-source code LLM, representing the forefront of publicly available models specifically tailored for coding tasks. While the recent work Slade (Armengol-Estapé et al., 2023) fine-tunes a language model for decompilation, it relies on intermediate compiler outputs, specifically the *.s files. In practice, however, such intermediate files are rarely released by developers. Therefore, we focus on a more realistic approach and consider decompilation only from the binaries; for further discussion please refer to Appendix A.
Implementation. We use the DeepSeek-Coder models obtained from Hugging Face (Wolf et al., 2019). We train our models using the LLaMA-Factory library (Zheng et al., 2024). For the 1.3B and 6.7B models, we set the batch size to 2048 and the learning rate to 2e-5 and train the models for 2 epochs (15B tokens). Experiments are performed on NVIDIA A100-80GB GPU clusters. Fine-tuning the 1.3B and 6.7B LLM4Decompile-End models takes 12 and 61 days on 8 x A100, respectively. Limited by resources, for the 33B model we only train on 200M tokens. For evaluation, we use vllm (Kwon et al., 2023) to accelerate the generation (decompilation) process. We employ greedy decoding to minimize randomness.
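Decompilation with vLLM under greedy decoding can be sketched as follows; the model path and the maximum generation length are placeholders.

    from vllm import LLM, SamplingParams

    # Greedy decoding: temperature 0 removes sampling randomness.
    params = SamplingParams(temperature=0.0, max_tokens=2048)
    llm = LLM(model="path/to/llm4decompile-end-6.7b")  # placeholder checkpoint path

    def decompile(asm_prompts):
        outputs = llm.generate(asm_prompts, params)
        return [o.outputs[0].text for o in outputs]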
Model/Benchmark            HumanEval-Decompile                       ExeBench
                           O0     O1     O2     O3     AVG           O0     O1     O2     O3     AVG
DeepSeek-Coder-6.7B        0.0000 0.0000 0.0000 0.0000 0.0000        0.0000 0.0000 0.0000 0.0000 0.0000
GPT-4o                     0.3049 0.1159 0.1037 0.1159 0.1601        0.0443 0.0328 0.0397 0.0343 0.0378
LLM4Decompile-End-1.3B     0.4720 0.2061 0.2122 0.2024 0.2732        0.1786 0.1362 0.1320 0.1328 0.1449
LLM4Decompile-End-6.7B     0.6805 0.3951 0.3671 0.3720 0.4537        0.2289 0.1660 0.1618 0.1625 0.1798
LLM4Decompile-End-33B      0.5168 0.2556 0.2415 0.2475 0.3154        0.1886 0.1465 0.1396 0.1411 0.1540

Table 1: Main comparison of End2end-Decompile approaches for re-executability rates on the evaluation benchmarks.

Model/Benchmark            HumanEval-Decompile                       ExeBench
                           O0     O1     O2     O3     AVG           O0     O1     O2     O3     AVG
Compilable-1.3B            0.4268 0.1646 0.1646 0.1707 0.2317        0.0568 0.0446 0.0416 0.0443 0.0468
Compilable-6.7B            0.5183 0.3354 0.3232 0.3232 0.3750        0.0752 0.0649 0.0671 0.0660 0.0683
Executable-1.3B            0.1951 0.1280 0.1280 0.1159 0.1418        0.2194 0.1946 0.1931 0.1950 0.2005
Executable-6.7B            0.3720 0.1829 0.2256 0.1707 0.2378        0.2938 0.2598 0.2591 0.2549 0.2669

Table 2: Ablation study on the training dataset. The "Compilable" models are trained on 7.2M non-executable functions, while the "Executable" models are trained on 1.6M executable functions.

4.1.2 Experimental Results

Main Results. Table 1 presents the re-executability rates under different optimization levels for the studied models. The base version of DeepSeek-Coder-33B is unable to accurately decompile binaries: it generates code that seems correct but fails to retain the original program semantics. GPT-4o shows notable decompilation skills; it is capable of decompiling non-optimized (O0) code with a success rate of 30.5%, though the rate decreases significantly to about 11% for optimized code (O1-O3). The LLM4Decompile-End models, on the other hand, demonstrate excellent decompilation abilities. The 1.3B version successfully decompiles and retains the program semantics in 27.3% of cases on average, whereas the 6.7B version has a success rate of 45.4%. This improvement underscores the advantages of using larger models to capture a program's semantics more effectively. While attempting to fine-tune the 33B model, we encountered substantial challenges related to high communication loads, which significantly slowed the training process and restricted us to using only 200M tokens (Section 4.1.1). Despite this limitation, the 33B model still outperforms the 1.3B model, reaffirming the importance of scaling up the model size.

Ablation Study. As discussed in Section 4.1.1, our training data comprises two distinct sets: 7.2 million compilable (non-executable) functions and 1.6 million executable functions. We conducted an ablation study using these datasets, and the results are displayed in Table 2. Here, "Compilable" denotes the models trained solely on compilable data, while "Executable" indicates the models trained exclusively on executable data. Notably, the binary object from compilable functions lacks links to function calls, which is similar in text distribution to the HumanEval-Decompile data, consisting of single functions dependent only on standard C libraries. Consequently, the 6.7B model trained only on compilable data successfully decompiled 37.5% of HumanEval-Decompile functions, but only 6.8% on ExeBench, which features real functions with extensive user-defined functions. On the other hand, the 6.7B model trained solely on executable data achieved a 26.7% re-executability rate on the ExeBench test set but faced challenges with single functions, with only a 23.8% success rate on HumanEval-Decompile due to the smaller size of the training corpus. Limited by space, we present further analysis in Appendix B.
Model/Metrics                     Re-executability Rate                     Edit Similarity
                                  O0     O1     O2     O3     AVG           O0     O1     O2     O3     AVG
LLM4Decompile-End-6.7B            0.6805 0.3951 0.3671 0.3720 0.4537        0.1557 0.1292 0.1293 0.1269 0.1353
Ghidra (Base)                     0.3476 0.1646 0.1524 0.1402 0.2012        0.0699 0.0613 0.0619 0.0547 0.0620
Ghidra +GPT-4o                    0.4695 0.3415 0.2866 0.3110 0.3522        0.0660 0.0563 0.0567 0.0499 0.0572
Ghidra +LLM4Decompile-Ref-1.3B    0.6890 0.3720 0.4085 0.3720 0.4604        0.1517 0.1325 0.1292 0.1267 0.1350
Ghidra +LLM4Decompile-Ref-6.7B    0.7439 0.4695 0.4756 0.4207 0.5274        0.1559 0.1353 0.1342 0.1273 0.1382
Ghidra +LLM4Decompile-Ref-33B     0.7073 0.4756 0.4390 0.4146 0.5091        0.1540 0.1379 0.1363 0.1307 0.1397

Table 3: Main comparison of Refined-Decompile approaches for re-executability rate and Edit Similarity on the HumanEval-Decompile benchmark. "+GPT-4o" refers to enhancing the Ghidra results with GPT-4o; "+LLM4Decompile-Ref" means refining the Ghidra results with the fine-tuned LLM4Decompile-Ref models.

4.2 LLM4Decompile-Ref

4.2.1 Experimental Setups

Experimental Datasets. The training data is constructed using ExeBench, with Ghidra Headless employed to decompile the binary object files. Due to constraints on computational resources, only 400K functions at each optimization level from O0 to O3 (1.6M samples, 1B tokens) are used for training, and the evaluation is conducted on HumanEval-Decompile. The models are trained using the same template described in Section 4.1.1. In addition, following previous work (Hosseini and Dolan-Gavitt, 2022; Armengol-Estapé et al., 2023), we assess the readability of the decompiled results in terms of the Edit Similarity score.
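The exact Edit Similarity formula is not spelled out here; a common definition in prior decompilation work, and the assumption behind the sketch below, is one minus the Levenshtein distance normalized by the length of the longer string.

    def edit_similarity(decompiled: str, source: str) -> float:
        """1 - (Levenshtein distance / length of the longer string)."""
        m, n = len(decompiled), len(source)
        if max(m, n) == 0:
            return 1.0
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if decompiled[i - 1] == source[j - 1] else 1
                curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
            prev = curr
        return 1.0 - prev[n] / max(m, n)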

Implementation. Configuration settings for the models are consistent with those in Section 4.1.1. For the 1.3B and 6.7B models, the fine-tuning process involves 2B tokens over 2 epochs and requires 2 and 8 days on 8 x A100, respectively. Limited by resources, for the 33B model we only train on 200M tokens. For evaluation, we first assess the re-executability rate of Ghidra to establish a baseline. Subsequently, GPT-4o is used to enhance Ghidra's decompilation result with the prompt, Generate linux compilable C/C++ code of the main and other functions in the supplied snippet without using goto, fix any missing headers. Do not explain anything., following DecGPT (Wong et al., 2023). Finally, we use the LLM4Decompile-Ref models to refine Ghidra's output.

4.2.2 Experimental Results

The results for the baselines and the Refined-Decompile approaches are summarized in Table 3. For the pseudo-code decompiled by Ghidra, which is not optimized for re-execution, only an average of 20.1% passes the test cases. GPT-4o assists in refining this pseudo-code and enhancing its quality. The LLM4Decompile-Ref models offer substantial improvements over Ghidra's outputs, with the 6.7B model yielding a 160% increase in re-executability. Similar to the discussion in Section 4.1.2, the 33B model outperforms the 1.3B model even though it used considerably less training data, and it achieves performance that is only 3.6% below the 6.7B model, which benefited from ten times more training data. When compared to LLM4Decompile-End-6.7B, the LLM4Decompile-Ref-6.7B model, though trained on just 10% of the data used for the LLM4Decompile-End models, shows a 16.2% performance increase, suggesting a greater potential for the Refined-Decompile approach.

An analysis of readability across the different methods is also presented in Table 3, with examples shown in Figure 4. In terms of text similarity, all decompiled outputs diverge from the original source code, with Edit Similarity ranging from 5.7% to 14.0%, primarily because the compilation process removes variable names and optimizes the logic structure. Ghidra generates pseudo-code that is particularly hard to read, with 6.2% Edit Similarity on average. Interestingly, with refinement from GPT (Ghidra+GPT-4o), there is a marginal decrease in Edit Similarity. GPT assists in fixing type errors like undefined4 and ulong (Figure 4); however, it struggles to accurately reconstruct for loops and array indexing. In contrast, both LLM4Decompile-End and LLM4Decompile-Ref generate outputs that are more aligned with the format of the source code and easier to comprehend. To summarize, domain-specific fine-tuning is crucial for enhancing the re-executability and readability of decompilation outputs.
Model/Obfuscation                 Control Flow Flattening                   Bogus Control Flow
                                  O0     O1     O2     O3     AVG           O0     O1     O2     O3     AVG
LLM4Decompile-End-6.7B            0.0427 0.0488 0.0488 0.0305 0.0427        0.0976 0.0732 0.0793 0.0976 0.0869
Ghidra                            0.1220 0.0671 0.0610 0.0671 0.0793        0.0610 0.0427 0.0305 0.0427 0.0442
Ghidra +LLM4Decompile-Ref-6.7B    0.0671 0.0366 0.0488 0.0549 0.0519        0.1585 0.1402 0.0854 0.0793 0.1159

Table 4: Re-executability rates of different approaches on the HumanEval-Decompile benchmark under obfuscation. Compared to Table 3, the decompilation success rates drop by over 70%.

[Figure 4: Decompilation results of different approaches on the func0 example from Figure 1. The GPT-4o output is plausible yet fails to recover the array dimension (incorrectly producing the 2D array arr[outer][inner]). Ghidra's pseudo-code is notably less readable, as discussed in Figure 1. The GPT-refined Ghidra result (Ghidra+GPT-4o) marginally enhances readability but fails to correctly render for loops and array indexing. Conversely, LLM4Decompile-End and LLM4Decompile-Ref produce accurate and easy-to-read outputs.]
Ghidra’s pseudo-code is notably less readable as 6 Conclusions
discussed in Figure 1. GPT-refined Ghidra re-
sult (Ghidra+GPT-4o) marginally enhances readabil-
ity but fails to correctly render for loops and ar- We propose LLM4Decompile, the first and largest
ray indexing. Conversely, LLM4Decompile-End and open-source LLM series with sizes ranging from
LLM4Decompile-Ref produce accurate and easy-to- 1.3B to 33B trained to decompile binary code.
read outputs. Based on the End2end-Decompile approach, we
optimize the LLM training process and introduce
5 Obfuscation Discussion the LLM4Decompile-End models to decompile bi-
nary directly. The resulting 6.7B model shows a
The process of decompilation aims at revealing the decompilation accuracy of 45.4% on HumanEval
source code from binaries distributed by develop- and 18.0% on ExeBench, surpassing existing tools
ers, presenting a potential threat to the protection like Ghidra and GPT-4o over 100%. Addition-
of intellectual property. To resolve the ethical con- ally, we improve the Refined-Decompile strategy to
cerns, this section accesses the risks of the possible fine-tune the LLM4Decompile-Ref models, which
misuse of our decompilation models. excel at refining the Ghidra’s output, with 16.2%
In software development, engineers typically im- improvement over LLM4Decompile-End. Finally,
plement obfuscation techniques before releasing we conduct obfuscation experiments and address
binary files to the public. This is done to protect concerns regarding the misuse of LLM4Decompile
the software from unauthorized analysis or modifi- models for infringement of intellectual property.
Limitations

The scope of this research is limited to the compilation and decompilation of C code targeting the x86 platform. While we are confident that the methodologies developed here could be easily adapted to other programming languages and platforms, these potential extensions have been reserved for future investigation. Furthermore, our research is limited by financial constraints, with a budget equivalent to using 8 x A100 GPUs for one year, which includes all trials and iterations. As a result, we have only managed to fully fine-tune models up to 6.7B, and conducted initial explorations on the 33B models with a small dataset, leaving the exploration of 70B and larger models to future studies. Nonetheless, our preliminary tests confirm the potential advantages of scaling up model sizes and suggest a promising direction for future decompilation research into larger models.

Ethics Statement

We have evaluated the risks of possible misuse of our decompilation models in Section 5. Basic obfuscation methods such as Control Flow Flattening and Bogus Control Flow have been empirically tested and shown to protect against unauthorized decompilation by both traditional tools like Ghidra and advanced models like LLM4Decompile. This built-in limitation ensures that while LLM4Decompile is a powerful tool for legitimate uses, it does not facilitate the infringement of intellectual property.

In practical applications in industry, software developers typically employ a series of complex obfuscation methods before releasing their software. This practice adds an additional layer of security and intellectual property protection against decompilation. LLM4Decompile's design and intended use respect these measures, ensuring that it serves as an aid in legal and ethical scenarios, such as understanding legacy code or enhancing cybersecurity defenses, rather than undermining them.

The development and deployment of LLM4Decompile are guided by strict ethical standards. The model is primarily intended for use in scenarios where permission has been granted or where the software is not protected by copyright. This includes academic research, debugging, learning, and situations where companies seek to recover lost source code of their own software.

References

Jordi Armengol-Estapé, Jackson Woodruff, Alexander Brauckmann, José Wesley de Souza Magalhães, and Michael F. P. O'Boyle. 2022. ExeBench: An ML-scale dataset of executable C functions. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, pages 50-59, New York, NY, USA. Association for Computing Machinery.

Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, and Michael F. P. O'Boyle. 2023. Slade: A portable small language model decompiler for optimized assembler. CoRR, abs/2305.12520.

Andrei Z. Broder. 2000. Identifying and filtering near-duplicate documents. In Annual Symposium on Combinatorial Pattern Matching, pages 1-10. Springer.

David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. 2013. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Proceedings of the 22nd USENIX Security Symposium, Washington, DC, USA, August 14-16, 2013, pages 353-368. USENIX Association.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.

Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. 2021. ANGHABENCH: A suite with one million compilable C benchmarks for code-size reduction. In IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021, Seoul, South Korea, February 27 - March 3, 2021, pages 378-390. IEEE.

Sushant Dinesh, Nathan Burow, Dongyan Xu, and Mathias Payer. 2020. RetroWrite: Statically instrumenting COTS binaries for fuzzing and sanitization. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1497-1511.
Ghidra. 2024. Ghidra software reverse engineering framework.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, et al. 2024. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. arXiv preprint arXiv:2401.14196.

Hex-Rays. 2024. IDA Pro: a cross-platform multi-processor disassembler and debugger.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2024. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA. Curran Associates Inc.

Iman Hosseini and Brendan Dolan-Gavitt. 2022. Beyond the C: Retargetable decompilation using neural machine translation. CoRR, abs/2212.08950.

Peiwei Hu, Ruigang Liang, and Kai Chen. 2024. DeGPT: Optimizing decompiler output with LLM. In Proceedings of the 2024 Network and Distributed System Security Symposium (NDSS 2024).

Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang. 2023. Nova+: Generative language models for binaries. CoRR, abs/2311.13721.

Pascal Junod, Julien Rinaldini, Johan Wehrli, and Julie Michielin. 2015. Obfuscator-LLVM - software protection for the masses. In Proceedings of the IEEE/ACM 1st International Workshop on Software Protection, SPRO'15, Firenze, Italy, May 19th, 2015, pages 3-9. IEEE.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. Preprint, arXiv:2001.08361.

Deborah S. Katz, Jason Ruchti, and Eric M. Schulte. 2018. Using recurrent neural networks for decompilation. In 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, pages 346-356. IEEE Computer Society.

Omer Katz, Yuval Olshaker, Yoav Goldberg, and Eran Yahav. 2019. Towards neural decompilation. ArXiv, abs/1905.08325.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. 2019. DIRE: A neural approach to decompiled identifier naming. In 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019, pages 628-639. IEEE.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871-7880, Online. Association for Computational Linguistics.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! Preprint, arXiv:2305.06161.

Zhibo Liu and Shuai Wang. 2020a. How far we have come: testing decompilation correctness of C decompilers. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2020, pages 475-487, New York, NY, USA. Association for Computing Machinery.

Zhibo Liu and Shuai Wang. 2020b. How far we have come: testing decompilation correctness of C decompilers. In ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, pages 475-487. ACM.

Jerome Miecznikowski and Laurie J. Hendren. 2002. Decompiling Java bytecode: Problems, traps and pitfalls. In International Conference on Compiler Construction.

Steven S. Muchnick. 1997. Advanced Compiler Design and Implementation.

Vikram Nitin, Anthony Saieva, Baishakhi Ray, and Gail Kaiser. 2021. DIRECT: A transformer-based model for decompiled identifier renaming. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), pages 48-57, Online. Association for Computational Linguistics.

Godfrey Nolan. 2012. Decompiling Android. Apress.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. CoRR, abs/2308.12950.

Richard M. Stallman et al. 2003. Using the GNU Compiler Collection. Free Software Foundation, 4(02).

Ruoyu Wang, Yan Shoshitaishvili, Antonio Bianchi, Aravind Machiry, John Grosen, Paul Grosen, Christopher Kruegel, and Giovanni Vigna. 2017. Ramblr: Making reassembly great again. In NDSS.

Tao Wei, Jian Mao, Wei Zou, and Yu Chen. 2007. A new algorithm for identifying loops in decompilation. In Static Analysis, 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007, Proceedings, volume 4634 of Lecture Notes in Computer Science, pages 170-183. Springer.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Wai Kin Wong, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2023. Refining decompiled C code with large language models. CoRR, abs/2310.06530.

Xiangzhe Xu, Zhuo Zhang, Shiwei Feng, Yapeng Ye, Zian Su, Nan Jiang, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. 2023. LmPa: Improving decompilation by synergy of large language model and program analysis. CoRR, abs/2306.02546.

Xiangzhe Xu, Zhuo Zhang, Zian Su, Ziyang Huang, Shiwei Feng, Yapeng Ye, Nan Jiang, Danning Xie, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. 2024. Leveraging generative models to recover variable names from stripped binary. Preprint, arXiv:2306.02546.

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. 2024. LlamaFactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372.
A ExeBench Setups

For every sample in ExeBench's executable splits, assembly code from the *.s file (a compiler's intermediate output, as discussed in Section 3.1 and Figure 1) is required to compile the sample into a binary. The specific compilation settings and processing details, however, are not provided by the authors. Consequently, we choose to compile the code in a standard way and manage to compile only half of the samples. This leaves us with 443K out of 797K samples for the executable training set and 2621 out of 5000 samples for the executable test set. Accordingly, we train our model on the 443K samples and conduct the re-executability evaluation on these 2621 samples; the results are shown in Table 1.

Model/Metrics      Re-executability (O0 / O3)     Edit Similarity (O0 / O3)
Slade              59.5 / 52.2                    71.0 / 60.0
ChatGPT            22.2 / 13.6                    44.0 / 34.0
GPT-4o (ours)       4.4 /  3.4                     7.9 /  6.6

Table 5: Re-executability and Edit Similarity on ExeBench.

The researchers from Slade (Armengol-Estapé et al., 2023), who also developed ExeBench (Armengol-Estapé et al., 2022), have published their decompilation findings on ExeBench. They chose to decompile the intermediate output, i.e., the assembly code from the *.s file, directly, without further compilation into binaries; in practice, such intermediate output is rarely released by software developers. Their reported results, shown in Table 5, differ significantly from ours. Their version of ChatGPT achieved a re-executability rate of 22.2% and an edit similarity of 44.0% under O0 optimization. On the other hand, our GPT-4o baseline only reached a 4.4% re-executability rate and a 7.9% edit similarity. The approach taken by Slade involves settings not commonly available in practical decompilation scenarios, which explains why their results vary significantly from ours. We adhere to a more realistic setting, decompiling binary files based solely on their intrinsic data, without any external information.
To further illustrate our setting, Figure 5 offers an example where the source function includes specific user-defined types like Ltc4151State, Ltc4151, and device. However, these types are completely lost after compilation, i.e., no information related to these user definitions can be found in the binary (disassembled ASM code). Consequently, GPT-4o is unable to reconstruct these types based purely on the ASM (the realistic setting), instead converting them to default types such as int or pointers, producing non-executable code. This issue was pervasive across the ExeBench test set, leading to the failure of GPT-4o in decompiling the ExeBench samples in a realistic setting.

[Figure 5: Decompilation results of GPT-4o on an ExeBench test case. The source function StateIdle uses the user-defined types Ltc4151State and Ltc4151, which are absent from the ASM; GPT-4o falls back to int and int* parameters.]

[Figure 6: Re-executability rate with the growth of input length for the 1.3B and 6.7B models on HumanEval-Decompile (O0-O3). The 6.7B model is more robust against input length.]

HumanEval-Decompile ExeBench
B Further Analysis of
LLM4Decompile-Ref Figure 7: Types of errors identified in the two bench-
marks: LLM4Decomile-End-6.7B faces issues with log-
Figure 6 illustrates that the re-executability rate de- ical errors in HumanEval-Decompile and user-defined
creases as the input length increases, and there is a components in ExeBench.
marked decline in performance at higher levels of
code optimization, highlighting the difficulties in
decompiling long and highly optimized sequences. structures. Given that these user-defined details are
Importantly, the performance difference between typically lost during the compilation process, re-
the 1.3B and 6.7B models showcased in the figure constructing them can be particularly challenging.
emphasizes the advantages of larger models in such Integrating techniques like Retrieval Augmented
tasks. Larger models, with their expanded compu- Generation might supplement the decompilation
tational resources and deeper learning capabilities, process with necessary external information.
are inherently better at resolving the challenges
posed by complex decompilations. C Obfuscation Techniques
The error analysis presented in Figure 7 for We provide the details of two classic obfuscation
LLM4Decompile-End-6.7B indicates that logical techniques suggested in Obfuscator-LLVM.
errors are prevalent in the HumanEval-Decompile
scenarios, with 64% of errors due to assertions that Control Flow Flattening enhances the security
the decompiled codes do not pass. In the ExeBench of software by transforming its straightforward,
dataset, which features real functions with user- hierarchical control flow into a more complex, flat-
defined structures and types, the major challenges tened structure. The workflow involves breaking a
are related to reclaiming these user-specific com- function into basic blocks, arranging these blocks
ponents. Where 50% of the errors come from un- at the same level, and encapsulating them within a
declared functions, and 28% from improper use of switch statement inside a loop.
Bogus Control Flow modifies a function’s ex-
ecution sequence by inserting an additional basic
blockprior to the existing one. This added block
includes an opaque predicate, followed by a con-
ditional jump that leads back to the original block.
Additionally, the original basic block is polluted
with randomly selected, meaningless instructions.
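For completeness, obfuscated binaries of this kind are typically produced by compiling with an Obfuscator-LLVM build of clang; according to the Obfuscator-LLVM documentation, -mllvm -fla enables Control Flow Flattening and -mllvm -bcf enables Bogus Control Flow. The sketch below is illustrative, and the clang path is a placeholder for a local Obfuscator-LLVM installation.

    import subprocess

    OBFUSCATIONS = {
        "cff": ["-mllvm", "-fla"],   # Control Flow Flattening
        "bcf": ["-mllvm", "-bcf"],   # Bogus Control Flow
    }

    def compile_obfuscated(src: str, out: str, mode: str, opt_level: str = "O0") -> None:
        """Compile with an Obfuscator-LLVM clang so the resulting binary hides its control flow."""
        cmd = ["/path/to/ollvm/bin/clang", f"-{opt_level}", "-c", src, "-o", out]
        cmd += OBFUSCATIONS[mode]
        subprocess.run(cmd, check=True)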
