0% found this document useful (0 votes)

9 views15 pages

No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations

The paper introduces D REAM, a novel decompiler that eliminates goto statements from decompiled code, providing a structured output suitable for human analysis and program analysis techniques. It employs a pattern-independent control-flow structuring algorithm and semantics-preserving transformations to recover high-level control constructs from binary programs. The authors demonstrate D REAM's effectiveness by comparing it to leading decompilers, showing significant improvements in code structure and compactness.

Uploaded by

zhangrunze60

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views15 pages

No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations

Uploaded by

zhangrunze60

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring

and Semantics-Preserving Transformations

Khaled Yakdan∗ , Sebastian Eschweiler† , Elmar Gerhards-Padilla† , Matthew Smith∗

∗ University
of Bonn, Germany
{yakdan, smith}@cs.uni-bonn.de
† Fraunhofer FKIE, Germany

{sebastian.eschweiler, elmar.gerhards-padilla}@fkie.fraunhofer.de

Abstract—Decompilation is important for many security appli- effective countermeasures and mitigation strategies requires a
cations; it facilitates the tedious task of manual malware reverse thorough understanding of functionality and actions performed
engineering and enables the use of source-based security tools on by the malware. Although many automated malware analysis
binary code. This includes tools to find vulnerabilities, discover techniques have been developed, security analysts often have
bugs, and perform taint tracking. Recovering high-level control to resort to manual reverse engineering, which is difficult and
constructs is essential for decompilation in order to produce
structured code that is suitable for human analysts and source-
time-consuming. Decompilers that can reliably generate high-
based program analysis techniques. State-of-the-art decompilers level code are very important tools in the fight against malware:
rely on structural analysis, a pattern-matching approach over they speed up the reverse engineering process by enabling
the control flow graph, to recover control constructs from malware analysts to reason about the high-level form of code
binary code. Whenever no match is found, they generate goto instead of its low-level assembly form.
statements and thus produce unstructured decompiled output.
Those statements are problematic because they make decompiled Decompilation is not only beneficial for manual analy-
code harder to understand and less suitable for program analysis. sis, but also enables the application of a wealth of source-
based security techniques in cases where only binary code
In this paper, we present D REAM, the first decompiler is available. This includes techniques to discover bugs [5],
to offer a goto-free output. D REAM uses a novel pattern- apply taint tracking [10], or find vulnerabilities such as RICH
independent control-flow structuring algorithm that can recover
[7], KINT [38], Chucky [42], Dowser [24], and the property
all control constructs in binary programs and produce structured
decompiled code without any goto statement. We also present graph approach [41]. These techniques benefit from the high-
semantics-preserving transformations that can transform unstruc- level abstractions available in source code and therefore are
tured control flow graphs into structured graphs. We demonstrate faster and more efficient than their binary-based counterparts.
the correctness of our algorithms and show that we outperform For example, the average runtime overhead for the source-
both the leading industry and academic decompilers: Hex-Rays based taint tracking system developed by Chang et al. [10] is
and Phoenix. We use the GNU coreutils suite of utilities as a 0.65% for server programs and 12.93% for compute-bound
benchmark. Apart from reducing the number of goto statements applications, whereas the overhead of Minemu, the fastest
to zero, D REAM also produced more compact code (less lines of binary-based taint tracker, is between 150% and 300% [6].
code) for 72.7% of decompiled functions compared to Hex-Rays
and 98.8% compared to Phoenix. We also present a comparison One of the essential steps in decompilation is control-flow
of Hex-Rays and D REAM when decompiling three samples from structuring, which is a process that recovers the high-level
Cridex, ZeusP2P, and SpyEye malware families. control constructs (e.g., if-then-else or while loops) from
the program’s control flow graph (CFG) and thus plays a vital
I. I NTRODUCTION role in creating code which is readable by humans. State-of-
the-art decompilers such as Hex-Rays [22] and Phoenix [33]
Malicious software (malware) is one of the most serious employ structural analysis [31, 34] (§II-A3) for this step. At a
threats to the Internet security today. The level of sophistication high level, structural analysis is a pattern-matching approach
employed by current malware continues to evolve significantly. that tries to find high-level control constructs by matching
For example, modern botnets use advanced cryptography, com- regions in the CFG against a predefined set of region schemas.
plex communication and protocols to make reverse engineering When no match is found, structural analysis must use goto
harder. These security measures employed by malware authors statements to encode the control flow inside the region. As a
are seriously hampering the efforts by computer security result, it is very common for the decompiled code to contain
researchers and law enforcement [4, 32] to understand and many goto statements. For instance, the de facto industry stan-
take down botnets and other types of malware. Developing dard decompiler Hex-Rays (version v2.0.0.140605) produces
1,571 goto statements for a peer-to-peer Zeus sample (MD5
hash 49305d949fd7a2ac778407ae42c4d2ba) that consists of
Permission to freely reproduce all or part of this paper for noncommercial
purposes is granted provided that copies bear this notice and the full citation
997 nontrivial functions (functions with more than one basic
on the first page. Reproduction for commercial purposes is strictly prohibited block). The decompiled malware code consists of 49,514 lines
without the prior written consent of the Internet Society, the first-named author of code. Thus, on average it contains one goto statement for
(for reproduction of an entire paper only), and the author’s employer if the each 32 lines of code. This high number of goto statements
paper was prepared within the scope of employment. makes the decompiled code less suitable for both manual
NDSS ’15, 8-11 February 2015, San Diego, CA, USA
Copyright 2015 Internet Society, ISBN 1-891562-38-X and automated program analyses. Structured code is easier to
http://dx.doi.org/10.14722/ndss.2015.23185 understand [16] and helps scale program analysis [31]. The
research community has developed several enhancements to
c1
structural analysis to recover control-flow abstractions. One of c1 c1
the most recent and advanced academic tools is the Phoenix c1
I F T HEN
decompiler [33]. The focus of Phoenix and this line of research ¬c1 c2
S EQUENCE
c2 ¬c2 ¬c1 W HILE
in general is on correctly recovering more control structure and
reducing the number of goto statements in the decompiled n2 n1
n2
code. While significant advances are being made, whenever n2

no pattern match is found, goto statements must be used

and this is hampering the time-critical analysis of malware.
This motivated us to develop a new control-flow structuring Fig. 2: Example of structural analysis.
algorithm that relies on the semantics of high-level control
constructs rather than the shape of the corresponding flow
graphs.
• We use D REAM to decompile three malware samples from
In this paper, we overcome the limitations of structural Cridex, ZeusP2P and SpyEye and compare the results
analysis and improve the state of the art by presenting a novel with Hex-Rays.
approach to control-flow structuring that is able to recover
all high-level control constructs and produce structured code II. BACKGROUND & P ROBLEM D EFINITION
without a single goto statement. To the best of our knowledge,
this is the first control-flow structuring algorithm to offer a In this section, we introduce necessary background con-
completely goto-free output. The key intuition behind our cepts, define the problem of control-flow structuring and
approach is based on two observations: (1) high-level control present our running example.
constructs have a single entry point and a single successor
point, and (2) the type and nesting of high-level control A. Background
constructs are reflected by the logical conditions that determine We start by briefly discussing two classic representations
when CFG nodes are reached. Given the above intuition, of code used throughout the paper and provide a high-level
we propose a technique, called pattern-independent control overview of structural analysis. As a simple example illustrat-
flow structuring, that can structure any region satisfying the ing the different representations, we consider the code sample
above criteria without any assumptions regarding its shape. shown in Figure 1a.
In case of cyclic regions with multiple entries or multiple
successors, we propose semantics-preserving transformations 1) Abstract Syntax Tree (AST): Abstract syntax trees are
to transform those regions into semantically equivalent single- ordered trees that represent the hierarchical syntactic structure
entry single-successor regions that can be structured by our of source code. In this tree, each interior node represents an
pattern-independent approach. operator (e.g., additions, assignments, or if statements). Each
child of the node represents an operand of the operator (e.g.,
We have implemented our algorithm in a decompiler called constants, identifiers, or nested operators). ASTs encode how
D REAM (Decompiler for Reverse Engineering and Analysis statements and expressions are nested to produce a program.
of Malware). Based on the implementation, we measure our As an example, consider Figure 1b showing an abstract syntax
results with respect to correctness and compare D REAM to two tree for the code sample given in Figure 1a.
state-of-the-art decompilers: Phoenix and Hex-Rays.
2) Control Flow Graph (CFG): A control flow graph of
In summary, we make the following contributions: a program P is a directed graph G = (N, E, nh ). Each node
n ∈ N represents a basic block, a sequence of statements
that can be entered only at the beginning and exited only
• We present a novel pattern-independent control-flow at the end. Header node nh ∈ N is P ’s entry. An edge
structuring algorithm to recover all high-level control e = (ns , nt ) ∈ E represents a possible control transfer from
structures from binary programs without using any goto ns ∈ N to nt ∈ N . A tag is assigned to each edge to represent
statements. Our algorithm can structure arbitrary control the logical predicate that must be satisfied so that control
flow graphs without relying on a predefined set of region is transferred along the edge. We distinguish between two
schemas or patterns. types of nodes: code nodes represent basic blocks containing
• We present new semantics-preserving graph restructuring program statements executed as a unit, and condition nodes
techniques that transform unstructured CFGs into a se- represent testing a condition based on which a control transfer
mantically equivalent form that can be structured without is made. We also keep a mapping of tags to the corresponding
goto statements.
logical expressions. Figure 1c shows the CFG for the code
• We implement D REAM, a decompiler containing both sample given in Figure 1a.
the pattern-independent control-flow structuring algo-
rithm and the semantics-preserving graph restructuring 3) Structural Analysis: At a high level, the traditional
techniques. approach of structural analysis relies on a predefined set of
• We demonstrate the correctness of our control-flow struc- patterns or region schemas that describe the shape of high-
turing algorithm using the joern C/C++ code parser and level control structures (e.g., while loop, if-then-else
the GNU coreutils. construct). The algorithm iteratively visits all nodes of the CFG
• We evaluate D REAM against the Hex-Rays and Phoenix in post-order and locally compares subgraphs to its predefined
decompilers based on the coreutils benchmark. patterns. When a match is found, the corresponding region is

2
SEQ
int i = 0

DECL WHILE RETURN

1 int foo(){ c: i < MAX
2 int i = 0; int = < SEQ i
c
3 while(i < MAX){
4 print(i); i 0 i MAX CALL = print(i)
¬c
5 i = i + 1; i = i + 1
6 } print ARG i +
7 return i;
8 } i i 1 return i

(a) Exemplary code sample (b) Abstract Syntax Tree (c) Control Flow Graph

Fig. 1: Code representations.

collapsed to one node of corresponding type. If no match is TABLE I: AST nodes that represent high-level control con-
found, goto statements are inserted to represent the control structs
flow. In the literature, acyclic and cyclic subgraphs for which
no match is found are called proper and improper intervals, AST Node Description
respectively. For instance, Figure 2 shows the progression of
Sequence of nodes [n1 , . . . , nk ] executed in
structural analysis on a simple example from left to right.
Seq [ni ]i∈1..k order. Sequences can also be represented as
In the initial (leftmost) graph nodes n1 and c2 match the Seq [n1 , . . . , nk ].
shape of a while loop. Therefore, the region is collapsed
into one node that is labeled as a while region. The new If construct with a condition c, a true branch nt
Cond [c, nt , nf ] and a false branch nf . It may have only one
node is then reduced with node c1 into an if-then region branch.
and finally the resulting graph is reduced to a sequence. This
series of reductions are used to represent the control flow as Loop of type τ ∈ {τwhile , τdowhile , τendless }
Loop [τ, c, nb ]
if (c1 ) {while (¬c2 ) {n1 }} ; n2 with continuation condition c and body nb .
Switch construct consisting of a variable v, a
B. Problem Definition list of cases C = [(V1 , n1 ) , . . . , (Vk , nk )], and
Switch [v, C, nd ]
a default node nd . Each case (Vi , ni ) represents
Given a program P in CFG form, the problem of control- a node ni that is executed when v ∈ Vi
flow structuring is to recover high-level, structured control con-
structs such as loops, if-then and switch constructs from
the graph representation. An algorithm that solves the control-
flow structuring problem is a program transformation function
fP that returns, for a program’s control flow graph PCFG , a use to illustrate different parts of our structuring algorithm. R1
semantically equivalent abstract syntax tree PAST . Whenever represents a loop that contains a break statement resulting in
fP cannot find a high-level structured control construct it an exit from the middle of the loop to the successor node.
will resort to using goto statements. In the context of this R2 is a proper interval (also called abnormal selection path).
paper, we denote code that does not use goto statements as In this region, the subgraph headed at b1 cannot be structured
structured code. The control-flow of P can be represented in as an if-then-else region due to an abnormal exit caused
several ways, i.e., several correct ASTs may exist. In its general by the edge (b2 , n6 ). Similarly, the subgraph with the head
form structural analysis can and usually does contain goto at b2 cannot be structured as if-then-else region due to
statements to represent the control flow. Our goal is to achieve an abnormal entry caused by the edge (n4 , n5 ). Due to this,
fully structured code, i.e., code without any goto. For this, we structural analysis represents at least one edge in this region
restrict the solution space to structured solutions. That is, all as a goto statement. The third region, R3 , represents a loop
nodes n ∈ PAST representing control constructs must belong with an unstructured condition, i.e., it cannot be structured by
to the set of structured constructs shown in Table I. The table structural analysis. These three regions where chosen such that
does not contain for loops since these are not needed at this the difficulty for traditional structuring algorithms increases
stage of the process. for loops are recovered during our post- from R1 to R3 . The right hand side of Figure 5 shows how
structuring optimization step to enhance readability (§VI). the structuring algorithm of Hex-Rays structures this CFG.
For comparison, the left hand side shows how the algorithms
developed over the course of this paper structure the CFG.
C. Running Example
As can be seen for the three regions, the traditional approach
As an example illustrating a sample control flow graph and produces goto statements and thus impacts readability. Even
running throughout this paper, we consider the CFG shown in in this toy example a non-negligible amount of work needs to
Figure 3. In this graph, code nodes are denoted by ni where i be invested to extract the semantics of region R3 . In contrast,
is an integer. Code nodes are represented in white. Condition using our approach, the entire region is represented by a
nodes are represented in blue and labeled with the condition single while loop with a single clear and understandable
tested at that node. The example contains three regions that we continuation condition.

3
¶ Control-Flow Structuring · Post-Structuring Optimzations
IF
IF
Pattern-Independent Control Constructs Simplification
!= WHILE
Structuring != CALL
...
v n < String Functions Outlining
v n foo
Semantics-Preserving i MAX
Transformations Variable Renaming

Fig. 4: Architecture of our approach. We first compute the abstract syntax tree using pattern-independent structuring and semantics-
preserving transformations (¶). Then, we simplify the computed AST to improve readability (·).

R2 ¬A A i f (A)
b1 A while ( 1 )
¬b1 b1 R1 w h i l e (c1 )
c1 n1
n4 b2 i f (c2 )
¬b2 ¬c1 c1 R1 break
b2
n3
c2 n1 i f (¬c3)
n5 n6
c3 g o t o LABEL_4
¬c2 c2
n2
n7 i f (A) else
n3 n2
do i f (¬b1 )
w h i l e (c1 ) n4
R3 n1 g o t o LABEL_1
d1
c3 i f (c2 ) i f (¬b2 )
d1 ¬d1 R1 n2 R2 LABEL_1 :
¬c3 break n5
d3 d2 n3 g o t o LABEL_2
¬d2 w h i l e (c3 ) n6
d3
d2 else LABEL_2 :
n8 ¬d3
i f (¬b1 ) n7
n9
n4 w h i l e (d1 )
i f (b1 ∧ b2 ) i f (¬d3 )
Fig. 3: Running example. Sample CFG that contains three R2 n6 g o t o LABEL_4
regions: a while loop with a break statement (R1 ), a proper else R3 LABEL_3 :
n5 n8
interval (R2 ), and a loop with unstructured condition (R3 ). n7 i f ( d2 )
w h i l e ((d1 ∧ d3 ) ∨ (¬d1 ∧ d2 )) g o t o LABEL_3
R3 n8 LABEL_4 :
n9 n9

III. D REAM OVERVIEW Fig. 5: Decompiled code generated by D REAM (left) and by
Hex-Rays (right). The arrows represent the jumps realized by
D REAM consists of several stages. First, the binary file is goto statements.
parsed and the code is disassembled. This stage builds the
CFG for all binary functions and transforms the disassembled
code into D REAM’s intermediate representation (IR). There
are several disassemblers and binary analysis frameworks that A high-level overview of our approach is presented in
already implement this step. We use IDA Pro [2]. Should the Figure 4. It comprises two phases: control-flow structuring,
binary be obfuscated tools such as [27] and [43] can be used and post-structuring optimizations. The first phase is our
to extract the binary code. algorithm to recover control-flow abstractions and computes
the corresponding AST. Next, we perform several optimization
The second stage performs several data-flow analyses in- steps to improve readability.
cluding constant propagation and dead code elimination. The
third stage infers the types of variables. Our implementation Our control-flow structuring algorithm starts by performing
relies on the concepts employed by TIE [29]. The forth and a depth-first traversal (DFS) over the CFG to find back
last phase is control-flow structuring that recovers high-level edges which identify cyclic regions. Then, it visits nodes in
control constructs from the CFG representation. The first three post-order and tries to structure the region headed by the
phases rely on existing work and therefore will not be covered visited node. Structuring a region is done by computing the
in details in this paper. The remainder of this paper focuses on AST of control flow inside the region and then reduce it
the novel aspects of our research concerning the control-flow into an abstract node. Post-order traversal guarantees that all
structuring algorithm. descendants of a given node n are handled before n is visited.

4
When at node n, our algorithm proceeds as follows: if n is Algorithm 1: Graph Slice
the head of an acyclic region, we compute the set of nodes
Input : Graph G = (N, E, h); source
dominated by n and structure the corresponding region if it
node ns ; sink node ne
has a single successor (§IV-B). If n is the head of a cyclic
Output: SG (ns , ne )
region, we compute loop nodes. If the corresponding region d1
has multiple entry or successor nodes, we transform it into a 1 SG ← ∅;
d1 ¬d1
semantically equivalent graph with a single entry and a single 2 dfsStack ← {ns };
successor (§V) and structure the resulting region (§IV-C). 3 while E has unexplored edges do
d3 d2
The last iteration reduces the CFG to a single node with the 4 e := DFSNextEdge(G);
program’s AST. 5 nt := target(e); ¬d3 ¬d2
6 if nt is unvisited then
n9
7 dfsStack.push(nt );
Pattern-independent structuring. We use this approach to
8 if nt = ne then
compute the AST of single-entry and single-successor regions Fig. 6:
9 AddPath(SG , dfsStack)
in the CFG. The entry node is denoted as the region’s header. SG (d1 , n9 )
10 end
Our approach to structuring acyclic regions proceeds as fol- of running
11 else if nt ∈ SG ∧ nt ∈/ dfsStack
lows: first, we compute the lexical order in which code nodes example
then
should appear in the decompiled code. Then, for each node
12 AddPath(SG , dfsStack)
we compute the condition that determines when the node is
13 end
reached from the region’s header (§IV-A), denoted by reaching
14 RemoveVisitedNodes()
condition. In the second phase, we iteratively group nodes
15 end
based on their reaching conditions and reachability relations
into subsets that can be represented using if or switch
constructs. In the case of cyclic regions, our algorithm first A. Reaching Condition
represents edges to the successor node by break statements.
It then computes the AST of the loop body (acyclic region). In In this section, we discuss our algorithm to find the
the third phase, the algorithm finds the loop type and condition condition that takes the control flow from node ns to node
by first assuming an endless loop and then reasoning about the ne , denoted by reaching condition cr (ns , ne ). This step is es-
whole structure. The intuition behind this approach is that any sential for our pattern-independent structuring and guarantees
loop can be represented as endless loop with additional break the semantics-preserving property of our transformations (§V).
statements. For example, a while loop while (c) {body;} can 1) Graph Slice: We introduce the concept of the graph
be represented by while (1) {if (¬c) {break;} body;}. slice to compute the reaching condition between two nodes.
We define the graph slice of graph G (N, E, nh ) from a source
Semantics-preserving transformations. We transform cyclic node ns ∈ N to a sink node ne ∈ N , denoted by SG (ns , ne ),
regions with multiple entries or multiple successors into se- as the directed acyclic graph Gs (Ns , Es , ns ), where Ns is
mantically equivalent single-entry single-successor regions. the set of nodes on simple paths from ns to ne in G and
The key idea is to compute the unique condition cond (n) Es is the set of edges on simple paths from ns to ne in G.
based on which the region is entered at or exited to a given We only consider simple paths since the existence of cycles
node n, and then redirect corresponding edges into a unique on a path between two nodes does not affect the condition
header/successor where we add a series of checks that take based on which one is reached from the other. Intuitively,
control flow from the new header/successor to n if cond (n) we are only interested in the condition that causes control
is satisfied. to leave the cycle and get closer to the target node. A path
p that includes a cycle can be decomposed into two disjoint
components: simple-path component ps and cycle component
Post-structuring optimizations. After having recovered the pc . The target node is reached if only ps is followed (cycle is
control flow structure represented by the computed AST, we not executed) or if ps and pc are traversed (cycle is executed).
perform several optimization steps to improve readability. Therefore, the condition represented by p is cond (p) =
These optimizations include simplifying control constructs cond (ps )∨[cond (ps ) ∧ cond (pc )]. The last logical expres-
(e.g., transforming certain while loops into for loops), out- sion can be rewritten as cond (ps ) ∧ [1 ∨ cond (pc )] which
lining common string functions, and giving meaningful names finally evaluates to cond (ps ).
to variables based on the API calls.
Algorithm 1 computes the graph slice by performing depth-
first traversal of the CFG starting from the source node. The
slice is augmented whenever the traversal discovers a new
IV. PATTERN -I NDEPENDENT C ONTROL -F LOW simple path to the sink. The algorithm uses a stack data
S TRUCTURING structure, denoted by dfsStack, to represent the currently
explored simple path from the header node to the currently
In this section we describe our pattern-independent struc- visited node. Nodes are pushed to dfsStack upon first-time
turing algorithm to compute the AST of regions with a single visit (line 7) and popped when all their descendants have been
entry (h) and single successor node, called region header discovered (line 14). In each iteration of edge exploration,
and region successor. The first step necessary is to find the the current path represented by dfsStack is added to the
condition that determines when each node is reached from the slice when traversal reaches the sink node (line 9) or when it
header. discovers a simple path to a slice node (line 12). The last step

5
is justified by the fact that any slice node n has a simple path At a high level, our refinement steps iterate over the
to the sink node. The path represented by dfsStack and the children of each sequence node V and choose a subset Vc ∈ V
currently explored edge e is simple if the target node of e is that satisfies a specific criterion. Then, we construct a new
not in dfsStack. compound AST node vc that represents control flow inside
Vc and replaces it in a way that preserves the topological
We extend Algorithm 1 to calculate the graph slice from
order of V . That is, vc is placed after all nodes reaching it
a given node to a set of sink nodes. For this purpose, we
and before all nodes reached from it. Note that we define
first create a virtual sink node nv , add edges from the sink
reachability between two AST nodes in terms of corresponding
set to nv , compute SG (ns , nv ), and finally remove nv and
basic blocks in the CFG, i.e., let u, v be two AST nodes, u
its incoming edges. Figure 6 shows the computed graph slice
reaches v if u contains a basic block that reaches a basic block
between nodes d1 and n9 in our running example. The slice
contained in v.
shows that n9 is reached from d1 if and only if the condition
(d1 ∧ ¬d3 ) ∨ (¬d1 ∧ ¬d2 ) is satisfied. Condition-based Refinement. Here, we use the observation
2) Deriving and Simplifying Conditions: After having com- that nodes belonging to the true branch of an if construct with
puted the slice SG (ns , ne ), the reaching conditions for all slice condition c is executed (reached) if and only if c is satisfied.
nodes can be computed by one traversal over the nodes in That is, the reaching condition of corresponding node(s) is an
their topological order. This guarantees that all predecessors AND expression of the form c ∧ R. Similarly, nodes belonging
of a node n are handled before n. To compute the reaching to the false branch have reaching conditions of the form ¬c∧R.
condition of node n, we need the reaching conditions of its This refinement step chooses a condition c and divides children
direct predecessors and the tags of incoming edges from these nodes into three groups: true-branch candidates Vc , false-
nodes. Specifically, we compute the reaching conditions using branch candidates V¬c , and remaining nodes. If the true-branch
the formula: and false-branch candidates contain more than two nodes, i.e.,
|Vc | + |V¬c | ≥ 2, we create a condition node vc for c with
_ children {Vc , V¬c } whose conditions are replaced by terms R.
cr (ns , n) = (cr (ns , v) ∧ τ (v, n)) Obviously, the second term of logical AND expressions (c or
v∈Preds(n) ¬c) is implied by the conditional node.
where Preds (n) returns the immediate predecessors of node
The conditions that we use in this refinement are chosen
n and τ (v, n) is the tag assigned to edge (v, n). Then, we
as follows: we first check for pairs of code nodes (ni , nj )
simplify the logical expressions.
that satisfy cr (h, ni ) = ¬cr (h, nj ) and group according
to cr (h, ni ). These conditions correspond to if-then-else
B. Structuring Acyclic Regions
constructs, and thus are given priority. When no such pairs
The key idea behind our algorithm is that any directed can be found, we traverse all nodes in topological order
acyclic graph has at least one topological ordering defined by (including conditional nodes) and check if nodes can be
its reverse postordering [14, p. 614]. That is, we can order structured by the reaching condition of the currently visited
its nodes linearly such that for any directed edge (u, v), u node. Intuitively, this traversal mimics the nesting order by
comes before v in the ordering. Our approach to structuring visiting the topmost nodes first. Clustering according to the
acyclic region proceeds as follows. First, we compute reaching corresponding conditions allows to structure inner nodes by
conditions from the region header h to every node n in the removing common factors from logical expressions. Therefore,
region. Next, we construct the initial AST as sequence of we iteratively repeat this step on all newly created sequence
code nodes in topological order associated with corresponding nodes to find further nodes with complementing conditions.
reaching conditions, i.e., it represents the control flow in-
side the region as if (cr (h, n1 )) {n1 } ; . . . ; if (cr (h, nk )) {nk }. In our running example, when the algorithm structures the
Obviously, the initial AST is not optimal. For example, acyclic region headed at node b1 (region R2 ), it computes
nodes with complementary conditions are represented as two the initial AST as shown in Figure 7. Condition nodes are
if-then constructs if (c) {nt } if (¬c) {nf } and not as one
represented by white nodes with up to two outgoing edges
if-then-else construct if (c) {nt } else {nf }. Therefore, in
that represent when the condition is satisfied (black arrowhead)
the second phase, we iteratively refine the initial AST to find or not (white arrowhead). Sequence nodes are depicted by
a concise high-level representation of control flow inside the blue nodes. Their children are ordered from left to right in
region. topological order. Leaf nodes (rectangles) are the basic blocks.
The algorithm performs a condition-based refinement wrt.
1) Abstract Syntax Tree Refinement: We apply three re- condition b1 ∧ b2 since nodes n5 and n6 have complementary
finement steps to AST sequence nodes. First, we check if conditions. This results in three clusters Vb1 ∧b2 = {n6 },
there exist subsets of nodes that can be represented using V¬(b1 ∧b2 ) = {n5 }, and Vr = {n4 } and leads to creating a
if-then-else. We denote this step by condition-based re- condition node. At this point, no further condition-based re-
finement since it reasons about the logical expressions rep- finement is possible. Cifuentes proposed a method to structure
resenting nodes’ reaching conditions. Second, we search for compound conditions by defining four patterns that describe
nodes that can be represented by switch constructs. Here, the shape of subgraphs resulting from short circuit evaluation
we also look at the checks (comparisons) represented by of compound conditions [11]. Obviously, this method fails if
each logical variable. Hence, we denote it by condition-aware no match to these patterns is found.
refinement. Third, we additionally use the reachability relations
among nodes to represent them as cascading if-else con- Condition-aware Refinement. This step checks if the child
structs. The third step is called reachability-based refinement. nodes, or a subset of them, can be structured as a switch

6
construct cascading condition nodes to represent them. That
SEQ SEQ
SEQ is, for each node ni ∈ N , we construct a condition node with
condition ci whose true branch is node ni and the false branch
n4 b1 ∧b2 n7 ¬b1 b1 ∧b2 n7
n6 n4 n5 n7 is the next condition node for ci+1 (if i < k − 1) or nk (if
b1 ∧b2 ¬b1 ¬b1 ∨¬b2
¬b1 i = k − 1).
n5 n6 n4 n5 n6
We iteratively process sequence nodes and construct clus-
ters Nr that satisfy the above conditions. In each iteration,
Fig. 7: Development of the initial AST when structuring the we initialize Nr to contain the last sequence node with a
region R2 in the running example. The initial AST (left) nontrivial reaching condition and traverse the remaining nodes
is refined by a condition-based refinement with respect to backwards. A node u is added to Nr if ∀n ∈ Nr : u 9 n since
condition b1 ∧ b2 (middle). Finally, a condition node is created the topological order implies that no node in Nr has a path
for n4 (right). to n (this would cause this node to be before n in the order).
We stop when the logical OR of reaching conditions evaluates
to true. Since nodes in Nr are unreachable from each other,
any ordering of them is a valid topological order. With the
construct. We apply this refinement when no further progress goal of producing well-readable code, we sort nodes in Nr by
can be made by condition-based refinement and the AST has increasing complexity of the logical expressions representing
sequence nodes with more than two children. Here, we use their reaching conditions defined as the expression’s number of
the observation that in a switch construct with variable x, terms. Finally, we build the corresponding cascading condition
reaching conditions of case nodes are comparisons of x with nodes.
scalar constants. A given case node is reached if x is equal to
the case value or the preceding case node does not end with
a break statement. As a result, the reaching condition is an C. Structuring Cyclic Regions
?
equality check x = c where c is a scalar constant or a logical A loop is characterized by the existence of a back edge
OR expression of such checks. The reaching condition for the (nl , nh ) from a latching node nl into loop header node
default case node, if it exists, can additionally contain checks nh . With the aim of structuring cyclic regions in a pattern-
for x such as ≥ with constants. independent way, we first compute the set of loop nodes, re-
Our approach is to first search for a switch candidate structure the cyclic region into a single-entry single-successor
node whose reaching condition is a comparison of a variable region if necessary, compute the AST of the loop body, and
with a constant. We then cluster the remaining nodes in the finally infer the loop type and condition by reasoning about the
sequence based on the type of their reaching conditions into computed AST. Our CFG traversal guarantees that we handle
three groups: case candidates Vc , default candidates Vd , and inner loops before outer ones and thus we can assume that
remaining items Vr . If at least two case nodes are found, i.e., when structuring a cyclic region it does not contain nested
|Vc | + |Vd | ≥ 3, we construct a switch node vs that replaces loops.
Vc ∪Vd in the sequence. We compute the values associated with 1) Initial Loop Nodes and Successors: We first determine
each case and determine whether the case ends with a break the set of initial loop nodes Nloop , i.e., nodes located on a path
statement depending on the corresponding node’s reaching from the header node to a latching node. For this purpose, we
condition. For this purpose, we traverse case candidate nodes compute the graph slice SG (nh , Nl ) where Nl is the set of
in topological order which defines the lexical order of cases latching nodes. This allows to compute loop nodes even if
in the switch construct. When at node n, we check if the they are not dominated by the header node in the presence
reaching condition of a subsequent case node v is a logical OR of abnormal entries. Abnormal entries are defined as ∃n ∈
expression of the form cr (h, v) = cr (h, n) ∨ Rn . This means Nloop \ {nh } : Preds (n) 6⊂ Nloop . If the cyclic region has
that if n is reached, then v is also reached and thus n does abnormal entries, we transform it into a single-entry region
not end with a break statement. The set of values associated (§V-A). We then identify the set of initial exit nodes Nsucc ,
to case node n is Vn \ Vp where Vn is the set of constants i.e., targets of outgoing edges from loop nodes not contained
checked in the reaching condition of node n and Vp is the set in Nloop . These sets are denoted as initial because they are
of values of previous cases. refined by the next step to the final sets.
Reachability-based Refinement. This is the last refinement 2) Successor Refinement and Loop Membership: In order
that we apply when no further condition-based and condition- to compute the final sets of loop nodes and successor nodes,
aware refinements are possible. Intuitively, a set of nodes N = we perform a successor node refinement step. The idea is
{n1 , . . . , nk } with nontrivial reaching conditions {c1 , . . . , ck }, that certain initial successor nodes can be considered as loop
i.e. ∀i ∈ [1, k] : ci 6= true, can be represented as cas- nodes, and thus we can avoid prematurely considering them
cading if-else constructs if the following conditions are as final successor nodes and avoid unnecessary restructuring.
satisfied: First, there exists no path between any two nodes For example, a while loop containing break statements
in N . Second, the ORWexpression of their reaching conditions proceeded by some code results in multiple exits from the loop
evaluates to true, i.e., 1≤i≤k ci = true. These nodes can be that converge to the unique loop successor. This step provides
represented as if (c1 ) {n1 } . . . else if (ck−1 ) {nk−1 } else {nk }. a precise loop membership definition that avoids prematurely
This eliminates the need to explicitly include condition ck in analyzing the loop type and identifying the successor node
the decompiled code as it is implied by the last else. The based on initial loop nodes which may lead to suboptimal
main idea is to group nodes that satisfy these conditions and structuring. Algorithm 2 provides an overview of the successor

7
refinement step. The algorithm iteratively extends the current while (1)
while (1)
if (c1 ) do
set of loop nodes by looking for successor nodes that have all n1
while (c1 )
while (c1 )
their immediate predecessors in the loop and are dominated C OND T O S EQ n1 D OW HILE
else → → n1
by the header node. When a successor node is identified as ...
... ...
loop node, its immediate successors that are not currently if (¬c3 )
if (¬c3 ) while (c3 )
loop nodes are added to the set of successor nodes. The break
break
algorithm stops when the set of successor nodes contains
at most one node, i.e., the final unique loop successor is Fig. 9: Example of loop type inference of region R1 .
identified, or when the previous iteration did not find new
successor nodes. If the loop still has multiple successors after
refinement, we select from them the successor of the loop body’s AST, i.e., n` = Loop [τendless , −, nb ]. Our assumption is
node with smallest post-order as the loop final successor. The justified since all exits from the loop are represented by break
remaining successors are classified as abnormal exit nodes. statements. Finally, we infer the loop type and continuation
We then transform the region into a single-successor region condition by reasoning about the structure of loop n` .
as will be described in Section V-B. For instance, when
structuring region R1 in our running example (Figure 3), the Inference rules. We specify loop structuring rules as inference
algorithm identifies the following initial loop and successor rules of the form:
nodes Nloop = {c1 , n1 , c2 , n3 , c3 }, Nsucc = {n2 , n9 }. Next,
P1 P 2 . . . P n
node n2 is added to the set of loop nodes since all its prede-
cessors are loop nodes. This results in a unique loop node and C
the final sets Nloop = {c1 , n1 , c2 , n3 , c3 , n2 }, Nsucc = {n9 }. The top of the inference rule bar contains the premises
P1 , P2 , . . . , Pn . If all premises are satisfied, then we can
Algorithm 2: Loop Successor Refinement conclude the statement below the bar C. Figure 8 presents our
loop structuring rules. The first premise in our rules describes
Input : Initial sets of loop nodes Nloop and successor the input loop structure, i.e., loop type and body structure.
nodes Nsucc ; loop header nh The remaining premises describe additional properties of loop
Output: Refined Nloop and Nsucc body. The conclusion is described as a transformation rule of
1 Nnew ← Nsucc ; the form n ; ń. Inference rules provide a formal compact
2 while |Nsucc | > 1 ∧ Nnew 6= ∅ do notation for single-step inference and implicitly specify an
3 Nnew ← ∅; inference algorithm by recursively applying rules on premises
4 forall the n ∈ Nsucc do until a fixed point is reached. We denote by Br a break
5 if preds(n) ⊆ Nloop then statement, and by Brc a condition node that represents the
6 Nloop ← Nloop ∪ {n}; statement if (c) {break}, i.e., Brc = Cond [c, Seq [Br ] , −]. We
7 Nsucc ← Nsucc \ {n}; represent by n ⇓ Br the fact that a break statement is attached
8 Nnew ← Nnew ∪ to each exit from P the control construct represented by node n.
{u : u ∈ [succs(n) \ Nloop ] ∧ dom(nh ,u)}; The operator returns the list of statements in a given node.
In our running example, computing the initial loop structure
9 end
for region R1 results in the first (leftmost) code in Figure 9.
10 end
The loop body consists of an if statement with break state-
11 Nsucc ← Nsucc ∪ Nnew
ments only in its false branch. This matches the C OND T O S EQ
12 end
rule, which transforms the loop body into a sequence of a
while loop and the false branch of the if statement. The rule
Phoenix [33] employs a similar approach to define loop states that in this case the true branch of the if statement (n1 )
membership. The key difference to our approach is that is continuously executed as long as the condition c1 is satisfied.
Phoenix assumes that the loop successor is either the imme- Then, control flows to the false branch. This is repeated until
diate successor of the header or latching node. For example, the execution reaches a break statement. The resulting loop
in case of endless loops with multiple break statements or body is a sequence that ends with a conditional break Br¬c3
loops with unstructured continuation condition (e.g., region that matches the D OW HILE rule. The second transformation
R3 ), the simple assumption that loop successor is directly results in the third (rightmost) loop structure. At this point the
reached from loop header or latching nodes fails. In these inference algorithm reaches a fixed point and terminates.
cases Phoenix generates an endless loop and represents exits
To give an intuition of the unstructured code produced by
using goto statements. In contrast, our successor refinement
structural analysis when a region in the CFG does not match its
technique described above does not suffer from this problem
predefined region schemas, we consider the region R3 in our
and generates structured code without needing to use goto
running example. Computing the body’s AST of the loop in re-
statements.
gion R3 and assuming an endless loop results in the loop repre-
3) Loop Type and Condition: In order to identify loop type sented as while (1) {if ((¬d1 ∧ ¬d2 ) ∨ (d1 ∧ ¬d3 )) {break;} . . .}.
and condition, we first represent each edge to the successor The loop’s body starts with a conditional break and
node as a break statement and compute the AST of the hence is structured according to the W HILE rule into
loop body after refinement nb . Note that the loop body is while ((d1 ∧ d3 ) ∨ (¬d1 ∧ d2 )) {. . .}. We wrote a small function
an acyclic region that we structure as explained in §IV-B. that produces the same CFG as the region R3 and decompiled
Next, we represent the loop as endless loop with the computed it with D REAM and Hex-Rays. Figure 11 shows that our

8
h i h i
n` = Loop τendless , −, Seq [ni ]i∈1..k n1 = Brc n` = Loop τendless , −, Seq [ni ]i∈1..k nk = Brc
h i W HILE h i D OW HILE
n` ; Loop τwhile , ¬c, Seq [ni ]i∈2..k n` ; Loop τdowhile , ¬c, Seq [ni ]i∈1..k−1
h i
n` = Loop τendless , −, Seq [ni ]i∈1..k
P
∀i ∈ 1..k − 1 : Br ∈
/ [ni ] nk = Cond [c, nt , −]
" h i # N ESTED D OW HILE
i∈1..k−1
n` ; Loop τendless , −, Seq Loop τdowhile , ¬c, Seq [ni ] , nt
h i
n` = Loop τendless , −, Seq [ni ]i∈1..k nk = n´k ⇓ Br
h i L OOP T O S EQ
n` ; Seq n1 , . . . , nk−1 , n´k
h i P P
n` = Loop τendless , −, Cond [c, nt , nf ] Br ∈ / [nt ] Br ∈ [nf ]
h i C OND T O S EQ
n` ; Loop τendless , −, Seq Loop [τwhile , c, nt ] , nf
h i P P
n` = Loop τendless , −, Cond [c, nt , nf ] Br ∈ [nt ] Br ∈
/ [nf ]
h i C OND T O S EQ N EG
n` ; Loop τendless , −, Seq Loop [τwhile , ¬c, nf ] , nt

Fig. 8: Loop structuring rules. The input to the rules is a loop node n` .

1 signed int __cdecl loop(signed int a1) D. Side Effects

2 {
3 signed int v2; // [sp+1Ch] [bp-Ch]@1 Our structuring algorithm may result in the same condition
4 appearing multiple times in the computed AST. For example,
5 v2 = 0;
6 while ( a1 > 1 ){
structuring region R2 in the running example leads to the AST
7 if ( v2 > 10 ) shown in Figure 7 where condition b1 is tested twice. If the
8 goto LABEL_7; variables tested by condition b1 are modified in block n4 ,
9 LABEL_6: the second check of b1 in the AST would not be the same
10 printf("inside_loop"); as the first check. As a result, the code represented by the
11 ++v2;
12 --a1; computed AST would not be semantically equivalent to the
13 } CFG representation.
14 if ( v2 <= 100 )
15 goto LABEL_6; To guarantee the semantics-preserving property of our
16 LABEL_7: algorithm, we first check if any condition is used multiple
17 printf("loop_terminated"); times in the computed AST. If this is the case, we check if any
18 return v2; of the variables used in the test is changed on an execution
19 }
path between any two uses. This includes if the variable is
assigned a new value, used in a call expression, or used in
Fig. 10: Decompiled code generated by Hex-Rays. reference expression (its address is read). If a possible change
is detected, we insert a Boolean variable to store the initial
value of the condition. All subsequent uses of the condition
1 int loop(int a){ are replaced by the inserted Boolean variable.
2 int b = 0;
3 while((a <= 1 && b <= 100)||(a > 1 && b <= 10)){
4 printf("inside_loop"); E. Summary
5 ++b;
6 --a; In this section, we have discussed our approach to creating
7 } an AST for single-entry and single-successor CFG regions.
8 printf("loop_terminated"); The above algorithm can structure every CFG except cyclic
9 return b; regions with multiple entries and/or multiple successors. The
10 }
following section discusses how we handle these problematic
regions.
Fig. 11: Decompiled code generated by D REAM.
V. S EMANTICS -P RESERVING C ONTROL -F LOW
T RANSFORMATIONS
approach correctly found the loop type and continuation con- In this section, we describe our method to transform
dition. In comparison, Hex-Rays produced unstructured code cyclic regions into semantically equivalent single-entry single-
with two goto statements as shown in Figure 10; one goto successor regions. As the only type of regions that cannot be
statement jumps outside the loop and the other one jumps back structured by our pattern-independent structuring algorithm are
in the loop. cyclic regions with multiple entries or multiple successors, we

9
p0 i=0 n0
p0 n0

n0 c0 n1
p1 i=1 n1
.. c0 ..
p1 .. .nk−1
. ¬c0 s1
.. pk−1 i=k−1
n0 . nk
n1 c1 ..
. c1 nk−1 .
.. ..
pk−1 . . n1
pk i=k sk−1
i=0 ¬c1
c1 c1
nk−1 ck−1 nk
ck−1 ¬ck−1
ck−1 ck−1
s1
pk ¬ck−1 nk−1
sk ck sk−1
nk i=0 ¬ck ck
nk
i=0 ns ns sk

Fig. 12: Transforming abnormal entries: multi-entry loops Fig. 13: Transforming abnormal exits: loops with multiple
(left) are transformed into semantically equivalent single-entry successors (left) are transformed into semantically equivalent
loops (right). Tags cn represent the logical predicates i = n. single-successor loops (right).

B. Restructuring Abnormal Exits

apply the proposed transformations on those regions. Based on
the previous steps we know the following information about The high-level approach to structuring abnormal exits (cf.
the cyclic region: a) region nodes Nloop , b) normal entry nh , IV-C2) is illustrated in Figure 13. Our approach computes
and c) successor node ns . for each exit the unique condition that causes the control-
flow to choose that exit and redirects all exit edges to a new
successor node. Here, we insert cascading condition nodes that
successively check the exit conditions and transfer control to
A. Restructuring Abnormal Entries the original exit if the corresponding condition is satisfied or to
the next check (or the last exit node) otherwise. We restructure
The high-level approach to structuring abnormal entries abnormal exits after restructuring abnormal entries. Therefore,
(cf. IV-C1) is illustrated in Figure 12. The underlying idea at this stage the loop successor is known and the loop has a
is to insert a structuring variable (i in Figure 12) that takes unique entry node dominating all loop nodes.
different values based on the node at which the loop is
entered. We then redirect all loop entries to a new header We start by computing the set of edges that exit the
node (c0 ) where we insert cascading condition nodes that test cyclic region to a node other than the successor node Eout =
equality of the structuring variable to the values representing {(n, u) ∈ E : n ∈ Nloop ∧ u ∈ / Nloop ∪ {ns }}. Then, we com-
the different entries. Each condition node transfers control to pute nearest common dominator (NCD) for the set of source
the corresponding entry node if the check is satisfied and to nodes for edges in Eout , denoted nncd . In a graph G (N, E),
the next check (or the last entry node) otherwise. All incoming a node d ∈ N is the nearest common dominator of a set
edges to the original header n0 are directed to the new header of nodes U ⊆ N if d dominates all nodes of U and there
c0 . We preserve semantics by inserting assignments of zero to exists no node d´ 6= d that dominates all nodes of U and is
the structuring variable at the end of each abnormal entry so strictly dominated by d. Since the loop header dominates all
that the next loop iteration is executed normally. loop nodes (after restructuring abnormal entries), the NCD of
any subset of loop nodes is also a loop node. The basic idea
For each loop node n ∈ Nloop with incoming edges from here is that any change in the control flow to a given exit does
outside the loop, we first compute the set of corresponding not happen before nncd . Thus, we need to compute the set
abnormal entries En = {(p, n) ∈ E : p ∈ / Nloop }. Then, we of reaching conditions starting from nncd , i.e., we compute
create a new code node consisting of assignment of the reaching conditions cr (nncd , u) to the target nodes of edges
structuring variable to a unique value and redirect edges in in Eout .
En into the newly created node. Finally, we add an edge from
the new code node to the new loop header. We represent the C. Summary
normal entry to the loop by assigning zero to the structuring At this point we transformed the CFG to an AST that
variable. In order to produce well-readable decompiled code, contains only high-level control constructs and no goto state-
we strive to keep the changes caused by our transformations ments. As a final step, we introduce several optimizations to
minimal. For this reason, the first check we make at the new improve the readability of the decompiled output.
loop header is whether the loop is entered normally. In this
case, we transfer control to the original header. This has the
VI. P OST-S TRUCTURING O PTIMIZATIONS
advantage of preserving loop type and minimally modifying
the original condition. For example, restructuring a while After having computed the abstract syntax tree, we perform
loop while (c) {. . .} with abnormal entries results in a while several optimizations to improve code readability. Specifically,
loop whose condition contains additional term representing the we perform three optimizations: control constructs simplifica-
abnormal entries while (c ∨ i 6= 0) {. . .}. tion, outlining certain string functions, and variable renaming.

10
We implement several transformations that find simpler A. Metrics
forms for certain control constructs. For instance, we trans-
We evaluate our approach with respect to the following
form if statements that assign different values to the same
quantitative metrics.
variable into a ternary operator. That is, code such as
if (c) {x = vt } else {x = vf } is transformed into the equivalent • Correctness. Correctness measures the functional equiv-
form x = c ? vt : vf . Also, we identify while loops that can alence between the decompiled output and the input code.
be represented as for loops. for loop candidates are while More specifically, two functions are semantically equiv-
loops that have a variable x used both in their continuation alent if they follow the same behavior and produce the
condition and the last statement in their body. We then check same results when they are executed using the same set
if a single definition of x reaches the loop entry that is only of parameters. Correctness is a crucial criterion to ensure
used in the loop body. We transform the loop into a for loop that the decompiled output is a faithful representation of
if the variables used in the definition are not used on the way the corresponding binary code.
from the definition to the loop entry. These checks allow us • Structuredness. Structuredness measures the ability of
to identify for loops even if their initialization statements are a decompiler to recover high-level control flow structure
not directly before the loop. and produce structured decompiled code. Structuredness
is measured by the number of generated goto statements
Several functions such as strcpy, strlen, or strcmp in the output. Structured code is easier to understand [16]
are often inlined by compilers. That is, a function call is and helps scale program analysis [31]. For this reason, it is
replaced by the called function body. Having several duplicates desired to have as few goto statements in the decompiled
of the same function results in larger code and is detrimental code as possible. These statements indicate the failure to
to manual analysis. D REAM recognizes and outlines several find a better representation of control flow.
functions. That is, it replaces the corresponding code by the • Compactness. For compactness we perform two mea-
equivalent function call. surements: first, we measure the total lines of code gen-
erated by each decompiler. This gives a global picture on
For the third optimization, we leverage API calls to assign the compactness of decompiled output. Second, we count
meaningful names to variables. API functions have known for how many functions each decompiler generated the
signatures including the types and names of parameters. If fewest lines of code compared to the others. If multiple
a variable is used in an API call, we give it the name of decompilers generate the same (minimal) number of lines
corresponding parameter if that name is not already used. of code, that is counted towards the total of each of them.

B. Experiment Setup & Results

VII. E VALUATION To evaluate our algorithm on the mentioned metrics, we
conducted two experiments.
In this section, we describe the results of the experiments
we have performed to evaluate D REAM. We base our evalua- 1) Correctness Experiment: We evaluated the correctness
tion on the technique used to evaluate Phoenix by Schwartz et of our algorithm on the GNU coreutils 8.22 suite of
al. [33]. This evaluation used the GNU coreutils to evaluate utilities. coreutils consist of a collection of mature pro-
the quality of the decompilation results. We compared our grams and come with a suite of high-coverage tests. We
results with Phoenix [33] and Hex-Rays [22]. We included followed a similar approach to that proposed in [33] where
Hex-Rays because it is the leading commercial decompiler and the coreutils tests were used to measure correctness. Also,
the de facto industry standard. We tested the latest version of since the coreutils source code contains goto statements,
Hex-Rays at the time of writing, which is v2.0.0.140605. We this means that both parts of our algorithm are invoked;
picked Phoenix because it is the most recent and advanced the pattern-independent structuring part and the semantics-
academic decompiler. We did not include dcc [11], DISC [28], preserving transformations part. Our goal is to evaluate the
REC [1], and Boomerang [17] in our evaluation. The reason control-flow structuring component. For this, we computed
is that these projects are either no longer actively maintained the CFG for each function in the coreutils source code
(e.g., Boomerang) or do not support x86 (e.g., dcc). However, and provided it as input to our algorithm. Then, we replaced
most importantly, they are outperformed by Phoenix. The the original functions with the generated algorithm output,
implementation of Phoenix is not publicly available yet. How- compiled the restructured coreutils source code, and finally
ever, the authors kindly agreed to share both the coreutils executed the tests. We used joern [41] to compute the CFGs.
binaries used in their experiments and the raw decompiled Joern is a state-of-the-art platform for analysis of C/C++ code.
source code produced by Phoenix to enable us to compute our It generates code property graphs, a novel graph representation
metrics and compare our results with theirs. We very much of code that combines three classic code representations;
appreciate this good scientific practice. This way, we could ASTs, CFGs, and Program Dependence Graphs (PDG). Code
ensure that all three decompilers are tested on the same binary property graphs are stored in a Neo4J graph database. More-
code base. We also had the raw source code produced by all over, a thin python interface for joern and a set of useful
three decompilers as well, so we can compare them fairly. In utility traversals are provided to ease interfacing with the graph
addition to the GNU coreutils benchmark we also evaluated database. We iterated over all parsed functions in the database
our approach using real-world malware samples. Specifically, and extracted the CFGs. We then transformed statements in
we decompiled and analyzed ZeusP2P, SpyEye, Cridex. For the CFG nodes into D REAM’s intermediate representation. The
this part of our evaluation we could only compare our approach extracted graph representation was then provided to our struc-
to Hex-Rays since Phoenix is not yet released. turing algorithm. Under the assumption of correct parsing, we

11
Considered Functions F |F | Number of gotos the result of the same library function being statically
linked to several binaries, i.e., its code is copied into
Functions after preprocessor 1,738 219 the binary. Depending on the duplicate functions this can
Functions correctly parsed by joern 1,530 129 skew the results. Thus, we wrote a small IDAPython script
Functions passed tests after structuring 1,530 0 that extracts the assembly listings of all functions and
then computed the SHA-512 hash for the resulting files.
TABLE II: Correctness results. We found that of the 14,747 functions contained in the
coreutils binaries, only 3,141 functions are unique, i.e.,
78.7% of the functions are duplicates. For better com-
parability, we report the results both on the filtered and
can attribute the failure of any test on the restructured functions unfiltered function lists. However, for future comparisons
to the structuring algorithm. To make the evaluation tougher, we would argue that filtering duplicate functions before
we used the source files produced by the C-preprocessor, since comparison avoids skewing the results based on the same
depending on the operating system and installed software, code being included multiple times.
some functions or parts of functions may be removed by the • Also in the original Phoenix evaluation only recompilable
preprocessor before passing them to the compiler. That in turn functions were considered in the goto test. In the context
would lead to potential structuring errors to go unnoticed if of coreutils, this meant that only 39% of the unique
the corresponding function is removed by the preprocessor. functions decompiled by Phoenix were considered in the
We got the preprocessed files by passing the --save-temps goto experiment. We extend these tests to consider the
to CFLAGS in the configure script. The preprocessed source intersection of all functions produced by the decompilers,
code contains 219 goto statements. since even non-recompilable functions are valuable and
important to look at, especially for malware and security
2) Correctness Results: Table II shows statistics about the analysis. For instance, the property graph approach [41]
functions included in our correctness experiments. The pre- to find vulnerabilities in source code does not assume that
processed coreutils source code contains 1,738 functions. the input source code is compilable. Also, understanding
We encountered parsing errors for 208 functions. We excluded the functionality of a sample is the main goal of manual
these functions from our tests. The 1,530 correctly parsed malware analysis. Hence, the quality of all decompiled
functions were fed to our structuring algorithm. Next, we code is highly relevant and thus included in our evalua-
replaced the original functions in coreutils by the structured tion. For completeness, we also present the results based
code produced by our algorithm. The new version of the on the functions used in the original evaluation done by
source code passed all coreutils tests. This shows that Schwartz et al.
our algorithm correctly recovered control-flow abstractions
from the input CFGs. More importantly, goto statements in 4) Structuredness & Compactness Results: Table III sum-
the original source code are transformed into semantically marizes the results of our second experiment. For the sake of
equivalent structured forms. completeness, we report our results in two settings. First, we
consider all functions without filtering duplicates as was done
The original Phoenix evaluation shows that their control-
in the original Phoenix evaluation. We report our results for the
flow structuring algorithm is correct. Thus, both tools correctly
functions considered in the original Phoenix evaluation (i.e.,
structure the input CFG.
only recompilable functions) (T1) and for the intersection of
3) Structuredness and Compactness Experiment: We tested all functions decompiled by the three decompilers (T2). In the
and compareed D REAM to Phoenix and Hex-Rays. In this second setting we only consider unique functions and again
experiment we used the same GNU coreutils 8.17 binaries report the results only for the functions used in the original
used in Phoenix evaluation. Structuredness is measured by the Phoenix study (T3) and for all functions (T4). In the table |F |
number of goto statements in code. These statements indicate denotes the number of functions considered. The following
that the structuring algorithm was unable to find a structured three columns report on the metrics defined above. First, the
representation of the control flow. Therefore, structuredness is number of goto statements in the functions is presented. This
inversely proportional to the number of goto statements in is the main contribution of our paper. While both state-of-
the decompiled output. To measure compactness, we followed the-art decompilers produced thousands of goto statements
a straightforward approach. We used David A. Wheeler’s for the full list of functions, D REAM produced none. We
SLOCCount utility to measure the lines of code in each believe this is a major step forward for decompilation. Next,
decompiled function. To ensure fair comparison, the Phoenix we present total lines of code generated by each decompiler in
evaluation only considered functions that were decompiled the four settings. D REAM generated more compact code overall
by both Phoenix and Hex-Rays. We extend this principle to than Phoenix and Hex-Rays. When considering all unique
only consider functions that were decompiled by all the three functions, D REAM’s decompiled output consists of 107k lines
decompilers. If this was not done, a decompiler that failed to of code in comparison to 164k LoC in Phoenix output and
decompile functions would have an unfair advantage. Beyond 135k LoC produced by Hex-Rays. Finally, the percentage of
that, we extend the evaluation performed by Schwartz et al. functions for which a given decompiler generated the most
[33] in several ways. compact function is depicted. In the most relevant test setting
T4, D REAM produced the minimum lines of code for 75.2% of
• Duplicate functions. In the original Phoenix evaluation the functions. For 31.3% of the functions, Hex-Rays generated
all functions were considered, i.e., including duplicate the most compact code. Phoenix achieved the best compactness
functions. It is common to have duplicate functions as in 0.7% of the cases. Note that the three percentages exceed

12
Considered Functions F |F | Number of goto Statements Lines of Code Compact Functions

D REAM Phoenix Hex-Rays D REAM Phoenix Hex-Rays D REAM Phoenix Hex-Rays

coreutils functions with duplicates

T1 : Fpr ∩ Fhr 8,676 0 40 47 93k 243k 120k 81.3% 0.3% 32.1%
T2 : Fd ∩ Fp ∩ Fh 10,983 0 4,505 3,166 196k 422k 264k 81% 0.2% 30.4%

coreutils functions without duplicates

T3 : Fpr ∩ Fhr 785 0 31 28 15k 30k 18k 74.9% 1.1% 36.2%
T4 : Fd ∩ Fp ∩ Fh 1,821 0 4,231 2,949 107k 164k 135k 75.2% 0.7% 31.3%

Malware Samples
ZeusP2P 1,021 0 N/A 1,571 42k N/A 53k 82.9% N/A 14.5%
SpyEye 442 0 N/A 446 24k N/A 28k 69.9% N/A 25.7%
Cridex 167 0 N/A 144 7k N/A 9k 84.8% N/A 12.3%

TABLE III: Structuredness and compactness results. For the coreutiles benchmark, we denote by Fx the set of functions
decompiled by compiler x. Fxr is the set of recompilable functions decompiled by compiler x. d represents D REAM, p represents
Phoenix, and h represents Hex-Rays.

100% due to the fact that multiple decompilers could generate from the CFG representation, namely interval analysis and
the same minimal number of lines of code. In a one on one structural analysis. Originally, these techniques were devel-
comparison between D REAM and Phoenix, D REAM scored oped to assist data flow analysis in optimizing compilers.
98.8% for the compactness of the decompiled functions. In Interval analysis [3, 13] deconstructs the CFG into nested
a one on one comparison with Hex-Rays, D REAM produced regions called intervals. The nesting structure of these regions
more compact code for 72.7% of decompiled functions. helps to speed up data-flow analysis. Structural analysis [34] is
a refined form of interval analysis that is developed to enable
5) Malware Analysis: For our malware analysis, we picked the syntax-directed method of data-flow analysis designed for
three malware samples from three families: ZeusP2P, Cridex, ASTs to be applicable on low-level intermediate code. These
and SpyEye. The results for the malware samples shown algorithms are also used in the context of decompilation to
in Table III are similarly clear. D REAM produces goto-free recover high-level control constructs from the CFG.
and compact code. As can be seen in the Zeus sample,
Hex-Rays produces 1,571 goto statements. These statements Prior work on control-flow structuring proposed several
make analyzing these pieces of malware very time-consuming enhancement to vanilla structural analysis. The goal is to
and difficult. While further studies are needed to evaluate if recover more control structure and minimize the number
compactness is always an advantage, the total elimination of of goto statements in the decompiled code. Engel et. al.
goto statements from the decompiled code is a major step [18] extended structural analysis to handle C-specific control
forward and has already been of great benefit to us in our statements. They proposed a Single Entry Single Successor
work analyzing malware samples. (SESS) analysis as an extension to structural analysis to handle
the case of statements that exist before break and continue
Due to space constraints, we cannot present a com- statements in the loop body.
parison of the decompiled malware source code in this
paper. For this reason, we have created a supplemen- These approaches share a common property; they rely on a
tal document which can be accessed under the follow- predefined set of region patterns to structure the CFG. For this
ing URL: https://net.cs.uni-bonn.de/fileadmin/ag/martini/Staff/ reason, they cannot structure arbitrary graphs without using
yakdan/code_snippets_ndss_2015.pdf. Here we present listings goto statements. Our approach is fundamentally different in
of selected malware functions so that the reader can get a that it does not rely on any patterns.
personal impression on the readability improvements offered Another related line of research lies in the area of elimi-
by D REAM compared to Hex-Rays. nating goto statements at the source code level such as [19]
and [39]. These approaches define transformations at the AST
VIII. R ELATED W ORK level to replace goto statements by equivalent constructs. In
some cases, several transformations are necessary to remove a
There has been much work done in the field of de- single goto statement. These approaches increase the code
compilation and abstraction recovery from binary code. In size and miss opportunities to find more concise forms to
this section, we review related work and place D REAM in represent the control-flow. Moreover, they may insert unneces-
the context of existing approaches. We start by reviewing sary Boolean variables. For example, these approaches cannot
control-flow structuring algorithms. Next, we discuss work in find the concise form found by D REAM for region R3 in our
decompilation, binary code extraction and analysis. Finally, running example. These algorithms do not solve the control-
techniques to recover type abstractions from binary code are flow structuring problem as defined in Section II-B.
discussed.
Decompilers. Cifuentes laid the foundations for modern de-
Control-flow structuring. There exist two main approaches compilers. In her PhD thesis [11], she presented several tech-
used by modern decompilers to recover control-flow structure niques to decompile binary code into a high-level language.

13
These techniques were implemented in dcc, a decompiler for multiple disassembly rounds with data-flow analysis to achieve
Intel 80286/DOS to C. The structuring algorithm in dcc [12] accurate and complete CFG extraction. Zeng et el. presented
is based on interval analysis. She also presented four region trace-oriented programming (TOP) [43] to reconstruct pro-
patterns to structure regions resulted from the short-circuit gram source code from execution traces. The executed instruc-
evaluation of compound conditional expressions, e.g., x ∨ y. tions are translated into a high-level program representation
using C with templates and inlined assembly. TOP relies on
Van Emmerik proposed to use the Static Single Assignment
dynamic analysis and is therefore able to cope with obfuscated
(SSA) form for decompilation in his PhD thesis [17]. His
binaries. With the goal of achieving high coverage, an offline
work demonstrates the advantages of the SSA form for several
combination component combines multiple runs of the binary.
data flow components of decompilers, such as expression
BitBlaze [37] is a binary analysis platform. The CMU Binary
propagation, identifying function signatures, and eliminating
Analysis Platform (BAP) [8] is successor to the binary analysis
dead code. His approach is implemented in Boomerang, an
techniques developed for Vine in the BitBlaze project.
open-source decompiler. Boomerang’s structuring algorithm is
based on parenthesis theory [35]. Although faster than interval Type recovery. Reconstructing type abstractions from binary
analysis, it recovers less structure. code is important for decompilation to produce correct and
Chang et. el. [9] demonstrated the possibility of applying high-quality code. This includes both elementary and complex
source-level tools to assembly code using decompilation. For types. Several prominent approaches have been developed
this goal, they proposed a modular decompilation architecture. in this field including Howard [36], REWARDS [30], TIE
Their architecture consists of a series of decompilers connected [29], and [23]. Other work [15, 20, 21, 25] focused on C++
by intermediate languages. For their applications, no control- specific issues, such as recovering C++ objects, reconstructing
flow structuring is performed. class hierarchy, and resolving indirect calls resulting from
virtual inheritance. Since our work focuses on the control flow
Hex-Rays is the de facto industry standard decompiler. It structuring we do not make a contribution to type recovery but
is built as plugin for the Interactive Disassembler Pro (IDA). we based our type recovery on TIE [29].
Hex-Rays is closed source, and thus little is known about its
inner workings. It uses structural analysis [22]. As noted by
Schwartz et el. in [33], Hex-Rays seems to use an improved IX. C ONCLUSION
version of vanilla structural analyses. In this paper we presented the first control-flow struc-
Yakdan et al. [40] developed REcompile, a decompiler turing algorithm that is capable of recovering all control
that employs interval analysis to recover control structure. The structure and thus does not generate any goto statements. Our
authors also proposed node splitting to reduce the number of novel algorithm combines two techniques: pattern-independent
goto statements. Here, nodes are split into several copies. structuring and semantics-preserving transformations. The key
While this reduces the amount of goto statements, it increases property of our approach is that it does not rely on any
the size of decompiled output. patterns (region schemas). We implemented these techniques
in our D REAM decompiler and evaluated the correctness
Phoenix is the state-of-the-art academic decompiler [33]. of our control-flow structuring algorithm. We also evaluated
It is built on top of the CMU Binary Analysis Platform our approach against the de facto industry standard decom-
(BAP) [8]. BAP lifts sequential x86 assembly instructions piler, Hex-Rays, and the state-of-the-art academic decompiler,
into an intermediate language called BIL. It also uses TIE Phoenix. Our evaluation shows that D REAM outperforms both
[29] to recover types from binary code. Phoenix enhances decompilers; it produced more compact code and recovered
structural analysis by employing two techniques: first, iterative the control structure of all the functions in the test without any
refinement chooses an edge and represents it using a goto goto statements. We also decompiled and analyzed a number
statement when the algorithm cannot make further progress. of real-world malware samples and compared the results to
This allows the algorithm to find more structure. Second, Hex-Rays. Again, D REAM performed very well, producing
semantics-preserving ensures correct control structure recov- goto-free and compact code compared to Hex-Rays, which
ery. The authors proposed correctness as an important metric had one goto for every 32 lines of code. This represents
to measure the performance of a decompiler. a significant step forward for decompilation and malware
The key property that all structuring algorithms presented analysis. In future work, we will further examine the quality
above share is the reliance on pattern matching, i.e, they use of the code produced by D REAM specifically concerning the
a predefined set of region schemas that are matched against compactness. Our experience based on the malware samples
regions in the CFG. This is a key issue that prevents these we analyzed during the course of this paper suggests that more
algorithms from structuring arbitrary CFGs. This leads to compact code is better for human understanding. However, it
unstructured decompiled output with goto statements. Our is conceivable that in some cases less compact code is easier
algorithm does not rely on such patterns and is therefore able to to understand. This will require further research and potential
produce well-structured code without a single goto statement. optimization of the post-processing step.

Binary code extraction. Correctly extracting binary code ACKNOWLEDGEMENTS

is essential for correct decompilation. Research in this field
is indispensable for decompilation. Kruegel et al. presented We are grateful to Fabian Yamaguchi, the author of joern,
a method [27] to disassemble x86 obfuscated code. Jakstab who was very helpful and created several patches to improve
[26] is a static analysis framework for binaries that follows the parsing of the coreutils. We sincerely thank Edward
the paradigm of iterative disassembly. That is, it interleaves J. Schwartz for sharing the Phoenix experiments results. We

14
would also like to thank the anonymous reviewers for their [23] I. Haller, A. Slowinska, and H. Bos, “MemPick: High-Level Data
valuable feedback. Structure Detection in C/C++ Binaries,” in Proceedings of the 20th
Working Conference on Reverse Engineering (WCRE), 2013.
[24] I. Haller, A. Slowinska, M. Neugschwandtner, and H. Bos, “Dowsing
for Overflows: A Guided Fuzzer to Find Buffer Boundary Violations,”
R EFERENCES in Proceedings of the 22nd USENIX Security Symposium, 2013.
[1] REC Studio 4 - Reverse Engineering Compiler. http://www.backerstreet. [25] W. Jin, C. Cohen, J. Gennari, C. Hines, S. Chaki, A. Gurfinkel,
com/rec/rec.htm. Page checked 7/20/2014. J. Havrilla, and P. Narasimhan, “Recovering C++ Objects From Binaries
[2] The IDA Pro disassembler and debuger. http://www.hex-rays.com/ Using Inter-Procedural Data-Flow Analysis,” in Proceedings of ACM
idapro/. SIGPLAN on Program Protection and Reverse Engineering Workshop
(PPREW), 2014.
[3] F. E. Allen, “Control Flow Analysis,” in Proceedings of ACM Sympo-
[26] J. Kinder and H. Veith, “Jakstab: A Static Analysis Platform for
sium on Compiler Optimization, 1970.
Binaries,” in Proceedings of the 20th International Conference on
[4] D. Andriesse, C. Rossow, B. Stone-Gross, D. Plohmann, and H. Bos, Computer Aided Verification (CAV), 2008.
“Highly Resilient Peer-to-Peer Botnets Are Here: An Analysis of
[27] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna, “Static Disassembly
Gameover Zeus,” in Proceedings of the 8th IEEE International Confer-
of Obfuscated Binaries,” in Proceedings of the 13th Conference on
ence on Malicious and Unwanted Software (MALWARE), 2013.
USENIX Security Symposium, 2004.
[5] A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri- [28] S. Kumar. DISC: Decompiler for TurboC. http://www.debugmode.com/
Gros, A. Kamsky, S. McPeak, and D. Engler, “A Few Billion Lines of dcompile/disc.htm. Page checked 7/20/2014.
Code Later: Using Static Analysis to Find Bugs in the Real World,”
Communications of the ACM, vol. 53, no. 2, pp. 66–75, Feb. 2010. [29] J. Lee, T. Avgerinos, and D. Brumley, “TIE: Principled Reverse En-
gineering of Types in Binary Programs,” in Proceedings of the 18th
[6] E. Bosman, A. Slowinska, and H. Bos, “Minemu: The World’s Fastest Network and Distributed System Security Symposium (NDSS), 2011.
Taint Tracker,” in Proceedings of the 14th International Conference on
Recent Advances in Intrusion Detection (RAID), 2011. [30] Z. Lin, X. Zhang, and D. Xu, “Automatic Reverse Engineering of Data
Structures from Binary Execution,” in Proceedings of the 17th Annual
[7] D. Brumley, T. Chiueh, R. Johnson, H. Lin, and D. Song, “RICH: Network and Distributed System Security Symposium (NDSS), 2010.
Automatically Protecting Against Integer-Based Vulnerabilities,” in
Proceedings of the 14th Network and Distributed System Security [31] S. S. Muchnick, Advanced Compiler Design and Implementation. San
Symposium (NDSS), 2007. Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997.
[8] D. Brumley, I. Jager, T. Avgerinos, and E. J. Schwartz, “BAP: A Binary [32] C. Rossow, D. Andriesse, T. Werner, B. Stone-Gross, D. Plohmann,
Analysis Platform,” in Proceedings of the 23rd International Conference C. J. Dietrich, and H. Bos, “P2PWNED: Modeling and Evaluating the
on Computer Aided Verification (CAV), 2011. Resilience of Peer-to-Peer Botnets,” in Proceedings of the 34th IEEE
Symposium on Security and Privacy (S&P), 2013.
[9] B.-Y. E. Chang, M. Harren, and G. C. Necula, “Analysis of Low-
[33] E. J. Schwartz, J. Lee, M. Woo, and D. Brumley, “Native x86 Decom-
level Code Using Cooperating Decompilers,” in Proceedings of the 13th
pilation using Semantics-Preserving Structural Analysis and Iterative
International Conference on Static Analysis (SAS), 2006.
Control-Flow Structuring,” in Proceedings of the 22nd USENIX Security
[10] W. Chang, B. Streiff, and C. Lin, “Efficient and Extensible Security Symposium, 2013.
Enforcement Using Dynamic Data Flow Analysis,” in Proceedings of
[34] M. Sharir, “Structural Analysis: A New Approach to Flow Analysis
the 15th ACM Conference on Computer and Communications Security
in Optimizing Compilers,” Computer Languages, vol. 5, no. 3-4, pp.
(CCS), 2008.
141–153, Jan. 1980.
[11] C. Cifuentes, “Reverse Compilation Techniques,” Ph.D. dissertation, [35] D. Simon, “Structuring Assembly Programs,” Honours thesis, Univer-
Queensland University of Technology, 1994. sity of Queensland, 1997.
[12] ——, “Structuring Decompiled Graphs,” in Proceedings of the 6th [36] A. Slowinska, T. Stancescu, and H. Bos, “Howard: A Dynamic Excava-
International Conference on Compiler Construction (CC), 1996. tor for Reverse Engineering Data Structures,” in Proceedings of the 18th
[13] J. Cocke, “Global Common Subexpression Elimination,” in Proceedings Annual Network and Distributed System Security Symposium (NDSS),
of the ACM Symposium on Compiler Optimization, 1970. 2011.
[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction [37] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang,
to Algorithms, 3rd ed. The MIT Press, 2009. Z. Liang, J. Newsome, P. Poosankam, and P. Saxena, “BitBlaze: A New
[15] D. Dewey and J. T. Giffin, “Static detection of C++ vtable escape Approach to Computer Security via Binary Analysis,” in Proceedings
vulnerabilities in binary code,” in Proceedings of the 19th Network and of the 4th International Conference on Information Systems Security
Distributed System Security Symposium (NDSS), 2012. (ICISS), 2008.
[16] E. W. Dijkstra, “Letters to the Editor: Go to Statement Considered [38] X. Wang, H. Chen, Z. Jia, N. Zeldovich, and M. F. Kaashoek, “Improv-
Harmful,” Communications of the ACM, vol. 11, no. 3, pp. 147–148, ing Integer Security for Systems with KINT,” in Proceedings of the 10th
Mar. 1968. USENIX Conference on Operating Systems Design and Implementation
(OSDI), 2012.
[17] M. J. V. Emmerik, “Static Single Assignment for Decompilation,” Ph.D.
[39] M. H. Williams and G. Chen, “Restructuring Pascal Programs Contain-
dissertation, University of Queensland, 2007.
ing Goto Statements,” The Computer Journal, 1985.
[18] F. Engel, R. Leupers, G. Ascheid, M. Ferger, and M. Beemster,
[40] K. Yakdan, S. Eschweiler, and E. Gerhards-Padilla, “REcompile: A De-
“Enhanced Structural Analysis for C Code Reconstruction from IR
compilation Framework for Static Analysis of Binaries,” in Proceedings
Code,” in Proceedings of the 14th International Workshop on Software
of the 8th IEEE International Conference on Malicious and Unwanted
and Compilers for Embedded Systems (SCOPES), 2011.
Software (MALWARE), 2013.
[19] A. Erosa and L. J. Hendren, “Taming Control Flow: A Structured [41] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and
Approach to Eliminating Goto Statements,” in Proceedings of 1994 Discovering Vulnerabilities with Code Property Graphs,” in Proceedings
IEEE International Conference on Computer Languages, 1994. of the 35th IEEE Symposium on Security and Privacy (S&P), 2014.
[20] A. Fokin, E. Derevenetc, A. Chernov, and K. Troshina, “SmartDec: [42] F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, “Chucky:
Approaching C++ Decompilation,” in Proceedings of the 2011 18th Exposing Missing Checks in Source Code for Vulnerability Discov-
Working Conference on Reverse Engineering (WCRE), 2011. ery,” in Proceedings of the 20th ACM Conference on Computer and
[21] A. Fokin, K. Troshina, and A. Chernov, “Reconstruction of Class Communications Security (CCS), 2013.
Hierarchies for Decompilation of C++ Programs,” in Proceedings of the [43] J. Zeng, Y. Fu, K. A. Miller, Z. Lin, X. Zhang, and D. Xu, “Obfuscation
14th European Conference on Software Maintenance and Reengineering Resilient Binary Code Reuse Through Trace-oriented Programming,” in
(CSMR), 2010. Proceedings of the 20th ACM Conference on Computer and Communi-
[22] I. Guilfanov, “Decompilers and Beyond,” in Black Hat, USA, 2008. cations Security (CCS), 2013.

Accelerated Computing with HIP
From Everand
Accelerated Computing with HIP
Yifan Sun
4.5/5 (2)
A Comb For Decompiled C Code
No ratings yet
A Comb For Decompiled C Code
15 pages
Ahoy SAILR! There Is No Need To DREAM of C: A Compiler-Aware Structuring Algorithm For Binary Decompilation
No ratings yet
Ahoy SAILR! There Is No Need To DREAM of C: A Compiler-Aware Structuring Algorithm For Binary Decompilation
18 pages
Daemon Architecture and Implementation: Definitive Reference for Developers and Engineers
From Everand
Daemon Architecture and Implementation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Helping Johnny To Analyze Malware
No ratings yet
Helping Johnny To Analyze Malware
20 pages
Mastering System Programming with C: Files, Processes, and IPC
From Everand
Mastering System Programming with C: Files, Processes, and IPC
Larry Jones
No ratings yet
JFrog Solutions in Modern DevOps: Definitive Reference for Developers and Engineers
From Everand
JFrog Solutions in Modern DevOps: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
C Decompilation PDF
No ratings yet
C Decompilation PDF
15 pages
Binsok
No ratings yet
Binsok
19 pages
C++ Regular Expressions Simplified: A Practical Guide with Examples
From Everand
C++ Regular Expressions Simplified: A Practical Guide with Examples
William E. Clark
No ratings yet
Mastering the Art of Nix Programming: Unraveling the Secrets of Expert-Level Programming
From Everand
Mastering the Art of Nix Programming: Unraveling the Secrets of Expert-Level Programming
Steve Jones
No ratings yet
Truffle for Blockchain Development: Definitive Reference for Developers and Engineers
From Everand
Truffle for Blockchain Development: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Reverse Engineering Tools Review
No ratings yet
Reverse Engineering Tools Review
53 pages
Mastering the Art of C# Programming: Unraveling the Secrets of Expert-Level Programming
From Everand
Mastering the Art of C# Programming: Unraveling the Secrets of Expert-Level Programming
Steve Jones
No ratings yet
Valgrind Essentials: Definitive Reference for Developers and Engineers
From Everand
Valgrind Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Report On Decompilation
No ratings yet
Report On Decompilation
4 pages
Terraform for Developers, Second Edition
From Everand
Terraform for Developers, Second Edition
Kimiko Lee
No ratings yet
Ganache for Ethereum Development: Definitive Reference for Developers and Engineers
From Everand
Ganache for Ethereum Development: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Terraform for Developers, Second Edition: Essentials of Infrastructure Automation and Provisioning
From Everand
Terraform for Developers, Second Edition: Essentials of Infrastructure Automation and Provisioning
Kimiko Lee
No ratings yet
Comprehensive Guide to HashiCorp Technologies: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to HashiCorp Technologies: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Advanced Metaprogramming Techniques: Definitive Reference for Developers and Engineers
From Everand
Advanced Metaprogramming Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Binja Symbexec SENinja
No ratings yet
Binja Symbexec SENinja
7 pages
Lexicon of Programming Terminology: Lexicon of Tech and Business, #17
From Everand
Lexicon of Programming Terminology: Lexicon of Tech and Business, #17
Mustafa Al-Dori
5/5 (1)
Concurrency in C++: Writing High-Performance Multithreaded Code
From Everand
Concurrency in C++: Writing High-Performance Multithreaded Code
Robert Johnson
No ratings yet
Expert Guide to Eclipse CDT: Definitive Reference for Developers and Engineers
From Everand
Expert Guide to Eclipse CDT: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Cifuentes 95 Decompilation
No ratings yet
Cifuentes 95 Decompilation
19 pages
Master PDF
No ratings yet
Master PDF
334 pages
Code Generation Techniques and Applications: Definitive Reference for Developers and Engineers
From Everand
Code Generation Techniques and Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Portnoy
No ratings yet
Portnoy
5 pages
Mastering the Craft of C++ Programming: Unraveling the Secrets of Expert-Level Programming
From Everand
Mastering the Craft of C++ Programming: Unraveling the Secrets of Expert-Level Programming
Steve Jones
No ratings yet
Podman Essentials: Definitive Reference for Developers and Engineers
From Everand
Podman Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Decompilation
No ratings yet
Decompilation
6 pages
C# Fundamentals Made Simple: A Practical Guide with Examples
From Everand
C# Fundamentals Made Simple: A Practical Guide with Examples
William E. Clark
No ratings yet
Accelerated Computing With HIP: Second Edition
From Everand
Accelerated Computing With HIP: Second Edition
Yifan Sun
No ratings yet
Embedded Systems Programming with C: Writing Code for Microcontrollers
From Everand
Embedded Systems Programming with C: Writing Code for Microcontrollers
Larry Jones
No ratings yet
Efficient Deployment Automation with Fabric: Definitive Reference for Developers and Engineers
From Everand
Efficient Deployment Automation with Fabric: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Litmus Chaos Experiments in Practice: The Complete Guide for Developers and Engineers
From Everand
Litmus Chaos Experiments in Practice: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Binary Code Obfuscation Through C++ Template Metaprogramming
No ratings yet
Binary Code Obfuscation Through C++ Template Metaprogramming
12 pages
C++ Basics for New Programmers: A Practical Guide with Examples
From Everand
C++ Basics for New Programmers: A Practical Guide with Examples
William E. Clark
No ratings yet
Embedded Systems Programming with C++: Real-World Techniques
From Everand
Embedded Systems Programming with C++: Real-World Techniques
Robert Johnson
No ratings yet
Building Scalable Systems with C: Optimizing Performance and Portability
From Everand
Building Scalable Systems with C: Optimizing Performance and Portability
Larry Jones
No ratings yet
GDB Fundamentals and Techniques: Definitive Reference for Developers and Engineers
From Everand
GDB Fundamentals and Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Effective Workflow in PyCharm: Definitive Reference for Developers and Engineers
From Everand
Effective Workflow in PyCharm: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Concurrency and Multithreading in C: POSIX Threads and Synchronization
From Everand
Concurrency and Multithreading in C: POSIX Threads and Synchronization
Larry Jones
No ratings yet
Essential Hardhat Development: Definitive Reference for Developers and Engineers
From Everand
Essential Hardhat Development: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Sysdig Falco Rules in Practice: The Complete Guide for Developers and Engineers
From Everand
Sysdig Falco Rules in Practice: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
SystemTap Essentials: Definitive Reference for Developers and Engineers
From Everand
SystemTap Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Tcpdump in Depth: Definitive Reference for Developers and Engineers
From Everand
Tcpdump in Depth: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Learning PyTorch 2.0, Second Edition
From Everand
Learning PyTorch 2.0, Second Edition
Matthew Rosch
No ratings yet
Learning PyTorch 2.0, Second Edition: Utilize PyTorch 2.3 and CUDA 12 to experiment neural networks and deep learning models
From Everand
Learning PyTorch 2.0, Second Edition: Utilize PyTorch 2.3 and CUDA 12 to experiment neural networks and deep learning models
Matthew Rosch
No ratings yet
Hadolint Essentials: Definitive Reference for Developers and Engineers
From Everand
Hadolint Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Mastering the Craft: Unleashing the Art of Software Engineering
From Everand
Mastering the Craft: Unleashing the Art of Software Engineering
Kiran Nagesh
No ratings yet
CodePipeline in Depth: Definitive Reference for Developers and Engineers
From Everand
CodePipeline in Depth: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Mastering Object-Oriented Design Patterns in Modern C++: Unlock the Secrets of Expert-Level Skills
From Everand
Mastering Object-Oriented Design Patterns in Modern C++: Unlock the Secrets of Expert-Level Skills
Larry Jones
No ratings yet
Algorithms Made Simple: Understanding the Building Blocks of Software
From Everand
Algorithms Made Simple: Understanding the Building Blocks of Software
William E. Clark
No ratings yet
C# OOP Step by Step: A Practical Guide with Examples
From Everand
C# OOP Step by Step: A Practical Guide with Examples
William E. Clark
No ratings yet
Image Collection Exploration: Unveiling Visual Landscapes in Computer Vision
From Everand
Image Collection Exploration: Unveiling Visual Landscapes in Computer Vision
Fouad Sabry
No ratings yet
Deno KV for Scalable, Distributed Applications: The Complete Guide for Developers and Engineers
From Everand
Deno KV for Scalable, Distributed Applications: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
ChaosBlade in Practice: The Complete Guide for Developers and Engineers
From Everand
ChaosBlade in Practice: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills
From Everand
Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills
Larry Jones
No ratings yet
Advanced Bash Shell Scripting Guide - Reference Cards
No ratings yet
Advanced Bash Shell Scripting Guide - Reference Cards
5 pages
Autosar Sws Lindriver
No ratings yet
Autosar Sws Lindriver
67 pages
IT3401 - WE Lesson Plan
No ratings yet
IT3401 - WE Lesson Plan
6 pages
Bash Script
No ratings yet
Bash Script
6 pages
Abstract:: Summary Report Multi-Layer Perceptron Training Optimization Using Nature Inspired Computing
No ratings yet
Abstract:: Summary Report Multi-Layer Perceptron Training Optimization Using Nature Inspired Computing
2 pages
Spring Kafka Reference
No ratings yet
Spring Kafka Reference
241 pages
Operating System 1.1
No ratings yet
Operating System 1.1
14 pages
Engineering - Ebook - PDF - Matlab Programming
No ratings yet
Engineering - Ebook - PDF - Matlab Programming
283 pages
Javascript While Loop: ! - Var Mybankbalance 0
No ratings yet
Javascript While Loop: ! - Var Mybankbalance 0
12 pages
Victaulic Grooved Aluminum Installation
No ratings yet
Victaulic Grooved Aluminum Installation
3 pages
Emitir SOM No ABAP
No ratings yet
Emitir SOM No ABAP
3 pages
HTML - CSS - Input Forms 2 PDF
No ratings yet
HTML - CSS - Input Forms 2 PDF
58 pages
Introduction To Maven
No ratings yet
Introduction To Maven
30 pages
Cws 1
No ratings yet
Cws 1
15 pages
Entity Translator
No ratings yet
Entity Translator
6 pages
Classification Classify Images of Clothing - ALI LAZIM
No ratings yet
Classification Classify Images of Clothing - ALI LAZIM
21 pages
CIT304 Chapter 2 Y24
No ratings yet
CIT304 Chapter 2 Y24
18 pages
Loop
No ratings yet
Loop
74 pages
Spring Boot
No ratings yet
Spring Boot
2 pages
Spring Boot Notes 1
No ratings yet
Spring Boot Notes 1
83 pages
RCSTM8
No ratings yet
RCSTM8
214 pages
Long Palindrome Text
No ratings yet
Long Palindrome Text
1 page
Readme
No ratings yet
Readme
3 pages
Ims HSSR
No ratings yet
Ims HSSR
642 pages
Yokohama Now Platform Administration Create A Table 2025-04!29!20!35!58
No ratings yet
Yokohama Now Platform Administration Create A Table 2025-04!29!20!35!58
14 pages
Oracle Database 11g: SQL Fundamentals I: D49996GC10 Edition 1.0 August 2007 D52128
No ratings yet
Oracle Database 11g: SQL Fundamentals I: D49996GC10 Edition 1.0 August 2007 D52128
14 pages
Algorithms, Part I: Exercises: Analysis of Algorithms
No ratings yet
Algorithms, Part I: Exercises: Analysis of Algorithms
2 pages
Designing and Building Parallel Programs
No ratings yet
Designing and Building Parallel Programs
371 pages
Chapter - 5 Array
No ratings yet
Chapter - 5 Array
23 pages
Assignment-3 Ch-1 & 2
No ratings yet
Assignment-3 Ch-1 & 2
3 pages

No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations

Uploaded by

No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations

Uploaded by

No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring

and Semantics-Preserving Transformations

Khaled Yakdan∗ , Sebastian Eschweiler† , Elmar Gerhards-Padilla† , Matthew Smith∗

no pattern match is found, goto statements must be used

DECL WHILE RETURN

Fig. 1: Code representations.

1 signed int __cdecl loop(signed int a1) D. Side Effects

B. Restructuring Abnormal Exits

B. Experiment Setup & Results

D REAM Phoenix Hex-Rays D REAM Phoenix Hex-Rays D REAM Phoenix Hex-Rays

coreutils functions with duplicates

coreutils functions without duplicates

Binary code extraction. Correctly extracting binary code ACKNOWLEDGEMENTS

You might also like