by
Doctor of Philosophy
at
The last quote could have been describing the process of decompilation: machine specific
detail is discarded, leading to the essence of the program (source code), from which it
Produced with LyX and LaTeX.
Statement of Originality
I declare that the work presented in the thesis is, to the best of my knowledge and
belief, original and my own work, except as acknowledged in the text, and that the
material has not been submitted, either in whole or in part, for a degree at this or any
other university.
Acknowledgements
I would like to firstly thank my associate advisor, Cristina Cifuentes, for her inspiration,
hard work finding grant money to pay me during the binary translation years (1997
Professors John Gough and Bill Caelli, who came up with the idea of using decompila-
tion for security analysis, and as a PhD topic. My primary advisor, Prof. Paul Bailes,
was also very helpful, particularly for showing me the value of writing thesis material
In the first years of the IBM PC and clones (early 1980s), Ron Harris and I would
disassemble scores of Z80 and 8086 programs. Thanks, Ron; they were the best years!
By the late 1980s, Glen McDiarmid was inspiring me (and the rest of the world) with a
truly great interactive disassembler. Thanks also to Glenn Wallace, who in about 1992
started a pattern-based decompiler called dc, which we then both worked on.
I am indebted to several people for their helpful discussions. Ilfak Guilfanov, author of
IDA Pro, was one of these, and was kind enough to drive from a neighbouring country
for one such discussion. Alan Mycroft's paper "Type-Based Decompilation" was an
early inspiration for investigating type analysis [Myc99]. Two of Alan's students were
also influential. Eben Upton pointed out to me that type analysis for aggregates is
a different problem to analysis of the more elementary types. The other was Jeremy
Singer, whose ideas on Static Single Information triggered many positive ideas.
The theory in this thesis has been tested on an open source decompiler platform called
Boomerang. Trent Waddington wrote the pre-SSA data flow code, early preservation code (his proof engine), the
idea of propagating %flags, early Statement class hierarchy, and much else. His code
helped make it possible for me to test the theory without having to write a complete
decompiler from scratch. He demonstrated what could be achieved with ad hoc type
analysis, and also helped by participating in long discussions. I'd like to thank also those
open source developers who helped test and maintain Boomerang, particularly Gerard
Krol, Emmanuel Fleury and his students [BDMP05], Mike Melanson, Mike "tamlin"
Nordell, and Luke "indel" Dunstan. Thanks, guys, it's much easier to maintain a big
which was funded in large part by Sun Microsystems. UQBT benefited from the work of
many students. In particular, I'd like to single out Doug Simon, for the early parameter
identification work, and Shane Sendall, for the Semantic Specification Language and
supporting code.
Email correspondents were also helpful with more general discussions about decompila-
tion, particularly Jeremy Smith and Raimar Falke. Ian Peake showed how to implement
the one-sentence summaries, while Daniel Jarrott was very helpful with LaTeX and LyX
problems, and generally with discussions.
Gary Holden and Paul Renner of LEA Detection System, Inc. were the first clients
this period.
I am indebted to Prof. Simon Kaplan for supporting my scholarship, even though the
combined teaching and research plan did not work out in the long term.
Thanks to Jens Tröger and Trent Waddington for reviewing drafts of this thesis; I'm
In any long list of acknowledgements, some names are inevitably left out. If you belong
on this list and have not been mentioned, please don't feel hurt; your contributions are
appreciated.
Last but by no means least, thanks to my wife Margaret and daughter Felicity for
List of Publications
M. Van Emmerik and T. Waddington. Using a Decompiler for Real-World Source Recovery.
C. Cifuentes and M. Van Emmerik. Recovery of Jump Table Case Statements from Binary Code.
C. Cifuentes and M. Van Emmerik. UQBT: Adaptable Binary Translation at Low Cost.
Experiences with the Use of the UQBT Binary Translation Framework. In Proceedings
of the Workshop on Binary Translation, Newport Beach, Oct 16, 1999.
C. Cifuentes and M. Van Emmerik. Recovery of Jump Table Case Statements from Binary Code.
Abstract
Static Single Assignment enables the efficient implementation of many important de-
when it is not available, it can be worthwhile deriving it from the executable form of
computer programs through the process of decompilation. There are many applications
for decompiled source code, including inspections for malware, bugs, and vulnerabilities;
interoperability; and the maintenance of an application that has some or all source code
Java and similar platforms, have significant deficiencies. These include poor recovery
of parameters and returns, poor handling of indirect jumps and calls, and poor to
nonexistent type analysis. It is shown that use of the Static Single Assignment form
(SSA form) enables many of these deficiencies to be overcome. SSA enables or assists
with
of individual instruction semantics to be combined into more complex, high level state-
ments. Parameters, returns, and types are features of high level languages that do
not appear explicitly in machine code programs, hence their recovery is important for
readability and the ability to recompile the generated code. In addition, type analysis
simple propagation of types from library function calls. The analysis of indirect jumps
and calls is important for nding all code in a machine code program, and enables
the translation of important high level program elements such as switch statements,
assigned gotos, virtual function calls, and calls through function pointers.
Because of these challenges, machine code decompilers are the most interesting case.
Existing machine code decompilers are weak at identifying parameters and returns,
particularly where parameters are passed in registers, or the calling convention is non
devices such as Collectors. These analyses become more complex in the presence of
recursion. The elimination of redundant parameters and returns is shown to be a global
analysis, implying that for a general decompiler, procedures can not be finalised until
Full type analysis is discussed, where the semantics of individual instructions, as well as
information from library calls, contribute to the solution. A sparse, iterative, data flow
based approach is compared with the more common constraint based approach. The
former requires special functions to handle the multiple constraints that result from
overloaded operators such as addition and subtraction. Special problems arise with
Indirect branch instructions are often handled at instruction decode time. Delaying
analysis until the program is represented in SSA form allows more powerful techniques
analysis, at the cost of having to throw away some results and restart some analyses. It
is shown that this technique easily extends to handling Fortran assigned gotos, which
can not be effectively analysed at decode time. The analysis of indirect call instructions
has the potential for enabling the recovery of object oriented virtual function calls.
Many of the techniques presented in this thesis have been verified with the Boomerang
open source decompiler. The goal of extending the state of the art of machine code
decompilation has been achieved. There are of course still some areas left for future
work. The most promising areas for future research have been identified as range
Quotations ii
Acknowledgements iv
List of Publications vi
Abstract vii
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxviii
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xl
Summary xliii
1 Introduction 1
1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.4.1 Disassemblers . . . . . . . . . . . . . . . . . . . . . . . 23
1.7 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2 Decompiler Review 33
2.6.1 Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.6.11.1 LLVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.6.11.2 SUIF2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.6.11.3 COINS . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.6.11.4 SCALE . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6.11.6 Phoenix . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6.11.7 Open64 . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7.1 Sub-fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4 SSA Form 97
7 Results 223
8 Conclusion 243
Bibliography 249
List of Figures
1.2 An obfuscated C program; it prints the lyrics for The Twelve Days of Christmas. From [Ioc88].
1.3 Part of a traditional reverse engineering (from source code) of the Twelve Days of Christmas obfuscated program of Figure 1.2. From [Bal98].
1.4 The machine code decompiler and its relationship to other tools and
used, and whether the user modifies the automatically generated output. 9
pointers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.8 Assembly language output for the underlined code of Figure 1.6(a), pro-
1.9 Disassembly of the underlined code from Figure 1.6(a) starting with ob-
ject code. Intel syntax. Compare with Figure 1.6(b), which started with
machine code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 LLVM can be used to compile and optimise a program at various stages
2.3 Overview of the COINS compiler infrastructure. From Fig. 1 of [SFF+05]. 55
2.5 The various IRs used in the GCC compiler. From a presentation by D.
Novillo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Overview of the Phoenix compiler infrastructure and IR. From [MRU07]. 57
2.7 The various levels of the WHIRL IR used by the Open64 compiler in-
3.3 First part of the compiled machine code for procedure comb of Figure 3.2. 62
3.5 Two machine instructions referring to the same memory location us-
3.8 The subtract immediate from stack pointer instruction from register esp
from Figure 3.4(a) including the side effect on the condition codes. . . . 71
3.10 Code from the running example where the carry flag is used explicitly. . 74
3.10 (continued). Code from the running example where the carry flag is used
explicitly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.11 80386 code for the floating point compare in the running example. . . . 76
a called procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1 The main loop of the running example and its equivalent SSA form. . . 99
4.2 A propagation not normally possible is enabled by the SSA form. . . . 101
4.3 Part of the main loop of the running example, after the propagation of
4.4 The IR of Figure 4.3 after transformation out of SSA form. . . . . . . . 104
4.5 The code of Figure 4.3, with the loop condition optimised to num>r. . . 105
4.6 Two possible transformations out of SSA form for the code of Figure 4.5. 105
4.7 A version of the running example where an unused definition has not
4.8 Incorrect results from translating the code of Figure 4.7 out of SSA form,
4.10 The code from Figure 4.1 after exhaustive expression propagation, show-
4.11 Generated code from a real decompiler with extraneous variables for the
IR of Figure 4.10. Copy statements inserted before the loop are not shown. 111
4.12 Live ranges for x2 and x3 when x3 := af(x2 ) is propagated inside a loop. 111
4.15 A recursive version of comb from the running example, where the frame
pointer (ebp) has not yet been shown to be preserved because of a re-
4.17 The example of Figure 4.14 with expression propagation before renaming
4.18 The effect of ignoring a restored location. The last example uses a call-
4.19 Pseudo code for a procedure. It uses three xor instructions to swap
4.20 Pseudo code for the procedure of Figure 4.19 in SSA form. Here it
overwritten. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.21 The procedure of Figure 4.19 with two extra statements. . . . . . . . . 122
4.22 A version of the running example with the push and pop of ebx removed,
illustrating how preservation analysis handles φ-functions. . . . . . . . 123
4.23 Analysing preserved parameters using propagation and dead code elimi-
nation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.25 A small part of the call graph for the 253.perlbmk SPEC CPU2000 bench-
4.26 A call graph illustrating the algorithm for finding the correct ordering
4.27 A simplified control flow graph for the program of Figure 4.15. . . . . . 132
4.29 Example program illustrating that not all parameters to recursive calls
4.30 Use of collectors for call bypassing, caller and callee contexts, arguments
(only for childless calls), results, defines (also only for childless calls),
4.31 The weak update problem for malloc blocks. From Fig. 1 of [BR06]. . . 140
4.34 IR of the optimised machine code output from Figure 4.33. . . . . . . . 144
4.35 A comparison of IRs for the program of Figure 4.1. Only a few def-use
5.2 Organisation of the CodeSurfer/x86 and companion tools. From [RBL06]. 152
5.3 Elementary and aggregate types at the machine code level. . . . . . . . 156
5.5 A program referencing two different types from the same pointer. . . . 157
5.8 A simple program fragment typed using constraints. From [Myc99]. . . 164
5.9 Constraints for the two instruction version of the above. Example from
[Myc99]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.10 A program fragment illustrating how a pointer can initially appear not
5.16 Source code for accessing the first element of an array with a nonzero
5.17 Equivalent programs which use the representation m[pl + K]. . . . . . 184
5.20 A program with colocated variables and taking the address. . . . . . . 189
6.4 Output for the program of Figure 6.2 when the switch expression is not
6.6 Tree of φ-statements and assignments to the goto variable from Figure 6.5. 203
6.7 Decompiled output for the program of Figure 6.5. Output has been
6.8 Source code for a short switch statement with special case values. . . . 204
6.9 Direct decompiled output for the program of Figure 6.8. . . . . . . . . 205
6.11 Control Flow Graph for the program of Figure 6.10 (part 1). . . . . . . 206
6.11 Control Flow Graph for the program of Figure 6.10 (part 2). . . . . . . 207
6.12 Typical data layout of an object ready to make a virtual call such as
p->draw(). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.15 Source code for a simple program using shared multiple inheritance. . . 212
6.16 Machine code for the start of the main function of Figure 6.15. . . . . . 213
7.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7.6 Original C source code for function test in the Boomerang SPARC
minmax2 test program. The code was compiled with Sun's compiler using
-xO2 optimisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
7.7 Disassembled SPARC machine code for the program fragment of Fig-
ure 7.6. Note the subcc (subtract and set condition codes) and subx
(subtract extended, i.e. with carry) instructions. . . . . . . . . . . . . . 229
7.8 Decompiled output for the program fragment of Figure 7.6, without ex-
7.9 Decompiled output for the program fragment of Figure 7.6, but with
7.10 A copy of the output of Figure 4.11 with local variables named after the
7.12 Assembly language source code for part of the Boomerang restoredparam
test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
7.13 Intermediate representation for the code of Figure 7.12, just before dead
7.14 Boomerang output for the code of Figure 7.12. The parameter and return
7.15 Original source code for a Fibonacci function. From [Cif94]. . . . . . . 232
7.16 Disassembly of the modied Fibonacci program adapted from [Cif94]. . 233
7.17 Output from the dcc decompiler for the program of Figure 7.16. . . . . 234
7.18 Output from the REC and Boomerang decompilers for the program of
7.21 Debug output from Boomerang while finding that esi (register esi) is
7.22 Call graph for the Boomerang test program test/pentium/recursion2. 238
7.24 The code generated for procedure b for the program test/pentium/recursion2.
The Boomerang -X option was used to remove extraneous variables, as
discussed in Section 7.3. The code has been modified by hand (under-
7.25 Assembler source code for a modication of the Fibonacci program shown
7.26 IR part way through the decompilation of the program of Figure 7.25. . 242
7.27 Generated code for the program of Figures 7.25 and 7.26. Redundant
List of Tables
1.3 Limitations for the two most capable preexisting machine code decompilers. . 27
6.1 High level expressions for switch expressions in the Boomerang decompiler. . 199
7.1 Complexity metrics for the code in Figures 7.1 - 7.5. . . . . . . . . . . . . . . . . . . . . . . . . 228
List of Algorithms
Intel syntax)
AXP appellation for DEC Alpha architecture
BB Basic Block
MGM Metro-Goldwyn-Mayer
NAN Not A Number (an infinity, the result of an overflow or underflow, etc.)
OO Object Oriented
PC Personal Computer
SR Status Register
suif.stanford.edu
TA Type Analysis
VFT Virtual Function Table (also virtual table or VT or virtual method table)
VT Virtual Table (also virtual function table or VFT or virtual method table)
K. Bennett)
Glossary
As a result, certain commonly used terms such as source code carry a bias towards
the forward engineering direction. A few terms will be used here with slightly special
meanings.
For example, a loop index and a running array pointer are usually related this
way.
in call statements, and the term parameter is used for formal or dummy pa-
rameters in callees.
• An assembler translates assembly language into object code, suitable for linking.
Occasionally, the term refers to the assembler proper and linker, i.e. a translator
that take exactly two operands are also known as binary operators, forming binary
expressions.
• The borrow flag (also called a condition code bit or status register bit) is the
bit used to indicate a borrow from one subtract operation to the next. In most
processors, it is the same register bit as the carry flag; in some processors, the
case of the stack pointer, the effect of the procedure could be to have a constant
value added to it.) The use is modified to become defined by the definition of the
• Callers are procedures that call callees. Callees are sometimes known as called
procedures.
• The carry flag (also called a condition code bit or status register bit) is the bit
used to indicate a carry (in the ordinary arithmetic sense) from one addition,
shift, or rotate operation to the next. For example, after adding 250 to 10 in an
8-bit add instruction, the result is 4 with a carry (representing one unit of 256
that has to be taken into account by adding one to the high order byte of the
result, if there is one). Adding 240 and 10 results in 250 with the carry flag being
cleared.
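As a minimal illustrative sketch (not part of the original text; names invented), the 8-bit addition above can be reproduced in C, where the carry corresponds to the true sum exceeding the 8-bit range:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t a = 250, b = 10;
        uint8_t sum = a + b;                   /* 8-bit result wraps: 260 mod 256 = 4 */
        int carry = ((unsigned)a + b) > 0xFF;  /* carry: the true sum exceeded 8 bits */
        printf("sum=%u carry=%d\n", (unsigned)sum, carry);  /* prints: sum=4 carry=1 */
        return 0;
    }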
• A call graph is the directed graph of procedures with edges directed from callers
to callees. Self recursion causes cycles of length 1 in the call graph; mutual
recursion causes cycles of longer length. Often, a call graph is connected, i.e. the
there will be multiple entry points, and the possibility exists that the call graph
• The canonical form of an expression is the most favoured form. For example,
m[esp0 -8] is preferred to m[ebp+8-12], where esp0 is the value of the stack pointer
register on entry to the procedure. With completely equivalent expressions such
as a+8 and 8+a, one is arbitrarily chosen as canonical. The process of converting an expression to its canonical form is called canonicalisation.
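A hedged sketch of one such canonicalisation rule follows; the Exp type and function name are illustrative, not Boomerang's actual IR:

    /* For a commutative operator such as +, place any constant operand
       second, so that a+8 and 8+a normalise to the same expression. */
    typedef struct Exp {
        char op;               /* '+', 'c' (constant), 'v' (variable), ... */
        int val;               /* constant value when op == 'c' */
        struct Exp *lhs, *rhs;
    } Exp;

    void canonicaliseAdd(Exp *e) {
        if (e->op == '+' && e->lhs->op == 'c' && e->rhs->op != 'c') {
            Exp *tmp = e->lhs; /* swap so the constant ends up on the right */
            e->lhs = e->rhs;
            e->rhs = tmp;
        }
    }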
• A childless call is one where the callee does not yet have a data ow summary
available, either because it is a recursive call, or because the call is indirect and
not yet analysed. In the call graph, childless calls have no child nodes that have
call summaries.
variables of the ranges do not overlap. Such variables might have different types,
compiler is object code, however some compilers emit assembly code which has to
be assembled to object code. Sometimes the term compiler includes the linking
stage, in which case the output would be machine code. The term can therefore
apply to the compiler proper, compiler and assembler, compiler and linker, or
• Condition codes (also called flags or status bits) are a set of single bit registers
that represent such conditions as the last arithmetic operation resulting in a
zero result, negative result, overflow, and so on. These bits are usually collected
condition codes, though the "big four" of Zero, Sign, Overflow and Carry are
things depending on the type. A constant could be any integer, floating point,
string, or other fixed value, e.g. 99, 0x8048ca8, -5.3, or "hello, world". Pointers
to procedures or global data objects are also constants with special types. There
are also type constants, e.g. in the constraint T (x) = α | int; the int is a type
constant.
ferent form in the caller context as opposed to the callee context. For example,
the first parameter of a procedure might take the form m[esp0 +4], while in the
caller context it might take the form m[esp0 -20]. Both expressions refer to the
same physical memory word, but esp0 (the stack pointer value on entry to the
here. In CPS machine code, the address of the machine code to continue with after
to rather than called. At the end of the procedure, the usual return instruction is
replaced with an indirect jump to the address given by the continuation parameter.
of a graph could be contracted by replacing them with a single vertex labelled {x,
• A Cygwin program is one which calls Unix library functions, but using a special
library, is able to run on a Windows operating system. A version of the GCC
elimination of dead code (dead code elimination, or DCE) does not affect the
for Flash movies, database formats, and so on. Such usage is not considered here.
• The defines of a call statement are the locations it defines (modifies). Not nec-
essarily all of these may end up being declared as results of the call.
output.
• A definition is said to dominate a use if for any path from the entry point to the
use, the path includes the dominating definition.
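A small C illustration (invented for this glossary entry, not from the thesis):

    int f(int p) {
        int x = p * 2;   /* definition D1 of x                                */
        if (p > 0)
            x = x + 1;   /* definition D2; its use of x is dominated by D1    */
        return x;        /* D1 dominates this use: every path from the entry  */
                         /* passes through D1; D2 does not dominate it, since */
                         /* the path where p <= 0 bypasses D2                 */
    }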
pointer thus moves down the class hierarchy. The reciprocal process is called
emitting machine code, for example adding a constant to the pointer, which should
• The endianness of an architecture is either little endian (little end first, at lowest
memory address) or big endian (big end first). Data in a file will be represented
in the endianness of the architecture for which the file was created.
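A minimal C sketch (illustrative only) that reveals the endianness of the machine it runs on:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t v = 0x11223344;
        unsigned char *p = (unsigned char *)&v;
        /* A little endian machine stores the little end first: 44 33 22 11.
           A big endian machine stores the big end first:       11 22 33 44. */
        printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
        return 0;
    }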
• The address of a local variable is said to escape the current procedure, if the
address is passed as an argument to a procedure call, is the return expression of
• An executable file is a general class of files that could be decompiled. Here, the
term includes both machine code programs, and programs compiled to a virtual
machine form. The term native distinguishes machine code executables from
• The type float is the C floating point type, equivalent to real or shortreal types in
some other languages. It is often assumed that the size of a float and the size of
For example, three xor (exclusive or) instructions can be used to swap the values
of two registers without requiring a third register or memory location.
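The swap can be sketched in C as follows (an illustration added here; compilers typically emit the three xor instructions directly):

    #include <stdio.h>

    int main(void) {
        unsigned a = 5, b = 9;
        a ^= b;   /* a now holds a^b                */
        b ^= a;   /* b = b ^ (a^b) = the original a */
        a ^= b;   /* a = (a^b) ^ a = the original b */
        printf("a=%u b=%u\n", a, b);   /* prints: a=9 b=5 */
        return 0;
    }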
use of the memory location. Implicit references may be required to prevent the
• The input program is the machine code or assembly language program (etc.)
that the decompiler reads. Terms such as source program create confusion.
• itof is the integer to float operator. For example, itof(-5) yields the floating
point value -5.00. In practice, the operator may take a pair of precisions, e.g.
itof(-5, 32, 64) to convert the 32-bit integer -5 to the 64-bit floating point
value -5.00.
dictated by the current virtual program counter. Such a compiler achieves the
same result as an interpreter for the VM, but usually has better performance.
• The term lattice usually refers to a lattice of types; see Notation on page xl.
• A linker is a program that combines one or more object files into an executable,
machine code file.
unconditional assignment to x.
• The live range of a variable is the set of program points from one or more
definitions to one or more uses, where the variable is live. If the program is
represented in Static Single Assignment form, the live ranges associated with
• A local variable is a variable located in the stack frame such that it is only visible
(in scope) in the current procedure. Its address is usually of the form sp±K, where
sp is the stack pointer register, and K is a constant. Stack parameters usually also
have the same form, and are sometimes considered to be local variables.
• A location is a register or memory word which can be assigned to (it can ap-
pear on the left hand side of an assignment). It can be used as a value (it can
appear on the right hand side of assignments, and in other kinds of statements).
Temporaries, array elements, and structure elements are special cases of regis-
ters or variables, and are therefore also locations. A location in machine code
• Machine code (also called machine language) refers to native executable pro-
grams, i.e. instructions that could be executed by a real processor. Java bytecodes
• When subtracting two numbers, the minuend is the number being subtracted
from, i.e. difference = minuend - subtrahend.
• The modifieds of a procedure are the locations that are modified by that proce-
dure. A subset of these become returns. A few special locations (e.g. the stack
pointer register and program counter) which are modifieds are not considered
returns.
• The term name is sometimes used in the aliasing sense, e.g. *p and q can be
different names (or aliases) for the same location if p points to q. In the context
of the SSA form, the term is used in a different sense. Locations (including those
them or giving them different names, in order to create unique definitions for each
location.
object files (.o or .obj files on most machines). These files contain incomplete
able to link the object file with others to form an executable machine code file.
• An offset pointer is a pointer to other than the start of a data object, obtained by
adding an original pointer to an offset or displacement. Offset pointers frequently
arise as a result of one or more array indexes having a non zero lower bound.
• The original compiler is the one presumed to have created the input executable
program.
• An original pointer is a pointer to the start of a data object, e.g. the first element
of an array (even if that represents a non zero index), or the first element of a
structure.
• Original source code is the original, usually high level source code that the
program was written in.
• Pentium refers to the Intel IA32 processor architecture. The term x86 is more
appropriate, since Pentium is no longer used by Intel for their latest IA32 proces-
sors, and other manufacturers have compatible products with different names.
procedures called by the current procedure). Most commonly it is saved near the
start of the procedure and restored near the end. In rare cases, a location may be
considered preserved if it always has a constant added to its initial value through
the procedure. For example, a CISC procedure where the arguments are removed
by the caller might always increase the stack pointer register by 4, representing the
return address popped by the return instruction (call instructions always subtract
• The term procedure is used interchangeably with the term function; whether
a procedure returns a value or values or nothing at all is not known until quite
x and y are simple variables, not expressions) are propagated. This is because
simple expressions that match better with machine code instructions. Compilers
contrast prefer more complex expressions, as these are generally more readable,
• Range analysis is an analysis that attempts to find the possible runtime values
of locations, usually as a range or a set of ranges. The ranges may or may not be
strided, that is, the possible values may be a minimum value plus a multiple of a
stride, particularly for pointers. For function pointers or assigned goto pointers,
term recompiled will still be used to indicate compiling of the source code gen-
erated by the decompiler. The ability to compile the decompiler's output is called
recompilability.
• The term record is sometimes used synonymously with structure, i.e. an ag-
and c calls a.
• In the SSA form, locations are renamed to effectively create unique definitions
for each location. For example, two definitions of register r8 could be renamed to
r8₁ and r8₂.
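A minimal sketch (subscripts written as suffixes; the C form is illustrative only, not the thesis's notation):

    void example(void) {
        /* Before renaming, r8 is defined twice:
               r8 := 10
               r8 := r8 + 1
           After SSA renaming, each definition is unique: */
        int r8_1 = 10;        /* first definition of r8                  */
        int r8_2 = r8_1 + 1;  /* second definition; the use on the right */
                              /* now refers unambiguously to r8_1        */
        (void)r8_2;
    }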
• The results of a call statement are those defines that are used before being
redefined after the call. In a decompiler, calls are treated as special assignment
statements, where several locations can be the results of a call. For example, a is
• The returns from a procedure are the locations that are defined by that procedure
and in addition are used by at least one caller before being redefined. Each return
has a corresponding result in at least one call. In a high level program, usually
only one value is returned, and the location (register or memory) where that value
• A return filter is a filter to remove locations, such as global variables, that can
not be used as a return location for a procedure.
by a compiler, and uses reverse compilation for the more general problem of
rewriting hand-written code as high level source. Whether the original program
was machine compiled or not should make little difference to a general decompiler.
form of reverse engineering; there are also source to source forms such as archi-
tecture extraction. It could be argued that binary to binary translations are also
reverse engineering. Popular usage and some authors consider only the process of
processor's instruction cache does not produce unwanted effects. It also makes
variable would have been in the scope of the procedure. For the lifetime of the
which are used in decompilation. Usually, as used here, the term is a synonym
for prototype, in the sense of a declaration of a function with the names and
types for its parameters, and also its return type. The other meaning is as a bit
pattern used to recognise the binary form of the function as it appears statically
in detail here.
character type. For example, the C type char is usually but not always regarded
as signed.
• Source code is a term very firmly associated with a high level program rep-
resentation, and is used here despite the fact that it is usually the output of a
decompilation.
• The term sparse as applied to a data flow analysis implies a minimisation of the
stores type information for each variable at each program point (each basic block
or sometimes even each statement). By contrast, sparse type analysis stores type
information for each variable, and if the program is represented in SSA form, a
degree of flow sensitivity is achieved if the variable can be assumed to keep the
same type from each denition to the end of its live range.
• When subtracting two numbers, the subtrahend is the number being subtracted,
i.e. difference = minuend - subtrahend.
• The type x is a subtype of type y (x < y ) if for every place that type y is valid,
• Locations are here labelled suitable if they meet the criteria of a parameter filter
or a return filter.
• The terms target and retargeting are firmly associated with machines, and
are best avoided. The decompiled output is often referred to as source code
• Unreachable code is code which can not be reached by any execution path of
the program.
• A value is a location or a constant, i.e. an elementary item that can appear on the
right hand side of an assignment, either by itself or combined with other values
and operators into an expression. The leaves of expression trees are values. The
term is sometimes also used to refer to one or more runtime values of a location.
• Value analysis is an analysis which attempts to find the set of possible runtime
values of locations, e.g. function pointers. When run on ordinary variables, it
tends to be called range analysis, since the values of ordinary variables tend to
refer to a modern version of the x86 processor family capable of running in 64-bit
mode. Sometimes EM64T or AMD64 are used to imply the slightly different
Intel and AMD versions of this processor, respectively. Note that IA64 refers to
• x86 is the term referring to the family of Intel and compatible processors which
maintain binary instruction set compatibility to the original 8086 processor. The
name came from the original numbering system of 8086 and 80186 through 80486.
The term i686 is still used by some software to denote a modern member of this
series, even though there has never been a processor numbered 80586 or 80686.
Here, this term implies a 32-bit processor, which means at least an 80386. Also
Notation
• locn represents a version of the location loc. In the SSA form, all definitions of
• φ(a1 , a2 ) represents a phi-function. See section 4.1 for details. It should not be
τ , but commonly needed constraints of the type e1 has the same type as e2
cannot readily be expressed with this notation. No standard has emerged: Tip
et al. use [e] for the type of e [TKB03], Palsberg uses [[e]] [PS91], while Hendren
• Type constants such as int are printed in sans serif font. Type variables (i.e. vari-
ables whose possible values are types) are written as lower case Greek letters such
as α.
• ⊤ (top) represents the top element of a lattice of types or other data flow infor-
mation, representing (for types) no type information. Analogously, ⊥ (bottom)
are thought of as a set of possible types for a location, ⊤ represents the set of all
possible types and ⊥ the empty set. Type information progresses from the top of
the lattice downwards. Authors in semantics and abstract interpretation use the
opposite convention, i.e. types are only propagated up the lattice [App02].
• ⊓ represents the meet operator; a ⊓ b is the lattice node that is the greatest
lower bound of a and b. Some authors use ∧ for greater generality. A type
lattice contains elements that are not comparable, hence the ⊓ symbol is preferred
because it implies a partial ordering.
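As a small added illustration, assume a toy three-level lattice with ⊤ above the incomparable types int and float, which are in turn above ⊥ (this is not the actual type lattice used in this thesis):

    int ⊓ ⊤ = int        (⊤ carries no information)
    int ⊓ int = int
    int ⊓ float = ⊥      (incomparable types meet at the bottom)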
• ⊔ represents the join operator; a ⊔ b is the lattice node that is the least upper
comparable (where X and Y are the sets of values that variables of type x and y
vation analysis, type analysis, and the analysis of indirect jumps and
calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
Machine code decompilers have the ability to provide the key to software evolution:
and ending where traditional reverse engineering starts (i.e. with source
1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Decompilers have many useful applications, broadly divided into those based on
tool, and those requiring the ability to compile the decompiler's output. 8
When viewed as program browsers, decompilers are useful tools that focus
parison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
The user may choose to accept the default, automatically generated out-
If sufficient manual effort is put into the decompilation process, the gen-
The state of the art as of 2002 needed improvement in many areas, including
the recovery of parameters and returns, type analysis, and the handling
Various reverse engineering tools are compared in terms of the basic prob-
lems that they need to solve; the machine code decompiler has the largest
traversal and the analysis of indirect jumps and calls, both of which are
analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Separating original pointers from those with offsets added (offset pointers)
tion distance from source code to the input code increases; hence assem-
bly language decompilers face relatively few problems, and machine code
Disassemblers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
solved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Assembly Decompilers . . . . . . . . . . . . . . . . . . . . . . . 23
ers, however, they only face about half the problems of a machine code
decompiler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
lems that they face between assembly decompilers and machine code de-
ers are very successful because their metadata-rich executable file formats
ensure that they only face two problems, the solutions for which are well
known. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
a correct but less optimised program, the result for a decompiler ranges
There are sufficient important and legal uses for decompilation to warrant
this research, and decompilation may facilitate the transfer of facts and
1.7 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
The main goals are better identification of parameters and returns; recon-
structing types, and correctly translating indirect jumps and calls; all are
and show how many of their limitations in the areas of data flow analysis,
type analysis, and the translation of indirect jumps and calls are solved
2 Decompiler Review 33
Existing decompilers have evaded many of the issues faced by machine code decom-
A surprisingly large number of machine code decompilers exist, but all suffer
Object code decompilers have advantages over machine code decompilers, but
are less common, presumably because the availability of object code with-
such as names and types, making decompilers for these platforms much
Since Java decompilers are relatively easy to write, they first started ap-
pearing less than a year after the release of the Java language. . . . . . 42
2.6.1 Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
some of the same problems as decompilation, but the results are more
but declining use, and has several unique problems that are not considered
here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Static Binary Translators that emit C produce source code from binary
code, but since they do not understand the data, the output has very
low readability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
static interpretation of the input program inlined into a large source file,
diate representations that are similar to those that can be created for
ing with source code, while decompilation provides little high level com-
Several compiler infrastructures exist with mature tool sets which, while
LLVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
SUIF2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
COINS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
SCALE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Phoenix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Open64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Static Single Assignment form assists with most data flow components of decom-
ers, and there are two simple rules, yet difficult to check, for when it can
be applied. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Dead code elimination is facilitated by storing all uses for each definition
(definition-use information). . . . . . . . . . . . . . . . . . . . . . . . 69
tails such as those revealed by older x86 floating point compare instruction
sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
The effects of calls are best summarised by the locations modified by the callee,
The semantics of call statements and their side effects necessitate ter-
Three propositions determine how registers and global variables that are
equations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
The stack pointer, and occasionally other special pointers, can appear to
Decompilers could treat the whole program as one large, global (whole-program)
data flow problem, but the problems with such an approach may outweigh
the benefits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.7.1 Sub-fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
The related work confirms that the combination of expression propagation and
4 SSA Form 97
Static Single Assignment form assists with most data flow components of decom-
flow information, and is strong enough to solve problems that most other
The SSA form makes propagation very easy; initial parameters are readily
The conversion from SSA form requires the insertion of copy statements;
Unused but not eliminated definitions with side effects can cause
problems with the translation out of SSA form, but there is a simple
solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
but one due to Sreedhar et al. appears to be most suitable for decompila-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
ing in a compiler have some similarities, but there are enough significant
piler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
which are safe to propagate, and those which must not be propagated at
all. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Alias-induced problems are more common at the machine code level, aris-
causes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
cient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
the decompiled output easier to read, and also to prevent data flow anoma-
lies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
The rule that a location which is live at the start of a function is a param-
Collectors, a contribution of this thesis, extend the sparse data flow infor-
mation provided by the Static Single Assignment form in ways that are
calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
With suitable modications to handle aggregates and aliases well, the SSA
form obviates the need for the complexity of techniques such as recency
abstraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
The VDG and other representations abstract away the control flow graph,
for the Static Single Assignment form in machine code decompilers. . . 145
The Static Single Assignment form has been found to be a good fit for the
The SSA form enables a sparse data flow based type analysis system, which is well
The work of Mycroft and Reps et al. has some limitations, but they laid the
Type information encapsulates much that distinguishes low level machine code
tions, aggregate types must at times be discovered through stride analysis. 155
While running pointers and array indexing are equivalent in most cases,
Type information arises from machine instruction opcodes, from the signatures
Constants have types just as locations do, and since constants with the same
numeric value are not necessarily related, constants have to be typed in-
dependently. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Finding types for variables and constants in the decompiled output can be
rectly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Type analysis for decompilers where the output language is statically type
Since types are hierarchical and some type pairs are disjoint, the rela-
leaves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
tions require special type functions in a data flow based type analysis. . 177
A small set of high level patterns can be used to represent global variables, local
variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A few special types are needed to cater for certain machine language details,
Most related work is oriented towards compilers, and hence does not address
While good progress has been made, much work remains before type analysis
to prepare memory expressions for high level pattern analysis, and the
While indirect jumps and calls have long been the most problematic of instruc-
Special processing is needed since the most powerful indirect jump and call
plete control flow graph (CFG), but the CFG is not complete until the
(case) statements and assigned goto statements, and tail-optimised calls. 198
facilitated by the SSA form, enables a very simple way to improve the
not emit the compare and branch that usually sets the size of the jump
table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
have long been the most difficult to translate, but value analysis combined
with the assigned goto switch statement variant can be used to represent
number of cases, and subtract instructions may replace the usual compare
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Indirect calls implement calls through function pointers and virtual function
calls; the latter are a special case which should be handled specially for
readability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
pilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Use of the SSA form helps considerably with virtual function an-
pointers that are offsets from other pointers, necessitating some special
allows the comparison of VTs which may give clues about the original
Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
one of the few cases where correct output is not possible in general. . . 219
7 Results 223
Several techniques introduced in earlier chapters were verified with a real decom-
piler, and show that good results are possible with their use. . . . . . . 223
Common Subexpression Elimination does not solve the problem of excessive ex-
When the techniques of Section 4.1.3 are applied to the running example, the
Preserved locations appear to be parameters, when usually they are not, but
Most components of the preservation process are facilitated by the SSA form. 234
returns in a test program, improving the readability of the generated code. 240
8 Conclusion 243
The solutions to several problems with existing machine code decompilers are
This thesis advances the state of the art of machine code decompilation through
While the state of the art of decompilation has been extended by the techniques
decompilers that handles aggregates well, and supports alias and value
analyses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Chapter 1
Introduction
Machine code decompilers have the ability to provide the key to software evolution:
source code. Before their considerable potential can be realised, however, several prob-
Computers are ubiquitous in modern life, and there is a considerable investment in the
its life, to correct errors, improve security, and adapt to changing requirements. Source
code, usually written in a high level language, is the key to understanding and modifying
any program. Decompilers can generate source code where the original source code is
not available.
Since the first compilers in the 1950s, the prospect of converting machine code into
high level language has been intriguing. Machine code decompilers were rst used in
the 1960s to assist porting programs from one computer model to another [Hal62]. In
In this figure, the compiler, assembler, and linker are lumped together for simplicity.
Compilers parse source code into an intermediate representation (IR), perform various
analyses, and generate machine code, as shown in the left half of Figure 1.1. Decompilers
decode binary instructions and data into an IR, perform various analyses, and generate
source code, as shown in the right half of the figure. Both compilers and decompilers
transform the program from one form into another. Both also use an IR to represent
the program, however the IR is likely to be more low level for a compiler and high level
for a decompiler, reflecting the different target languages. The overall direction is from
source to machine code (high level to low level) in a compiler, but from machine to
source code (low level to high level) in a decompiler. Table 1.1 compares the features
Despite the complete reversal of direction, compilers and decompilers often employ
similar techniques in the analysis phases, such as data flow analysis. As an example, a
register-to-register copy is represented internally as a simple copy operation, even though
a compiler may eventually emit an instruction such as or r3,0,r5 to copy register r3 to
register r5 if the target instruction set lacks a move instruction.
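On the decompiler side, such an instruction can be recognised and reduced to a copy. A hedged C sketch follows; the Exp type and function are illustrative, not the IR of any particular decompiler:

    #include <stddef.h>

    typedef enum { OP_OR, OP_CONST, OP_REG } Op;
    typedef struct Exp { Op op; int val; struct Exp *lhs, *rhs; } Exp;

    /* Reduce r5 := r3 | 0 to the copy r5 := r3, using the identity x | 0 == x. */
    Exp *simplifyOr(Exp *e) {
        if (e->op == OP_OR && e->rhs != NULL &&
            e->rhs->op == OP_CONST && e->rhs->val == 0)
            return e->lhs;   /* an OR with zero is just a copy of the other operand */
        return e;
    }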
Similarly, both compilers and decompilers need to know the definition(s) of a variable, whether a
The compiler process of parsing the source code to an intermediate representation (IR)
corresponds roughly with the decompiler process of decoding instructions into its inter-
Compilation adds machine details to a program, such as which registers to use and
loop, what address to use for a variable, and so on. Compilation removes comprehension
aids such as comments and meaningful names for procedures and variables; these aids
the machine dependent detail, and recovers information such as the nesting of loops,
function parameters, and variable types from the instructions and data emitted by the
compiler.
Section 1.1 shows the importance of source code, while Section 1.2 relates decompilers
to other reverse engineering tools. Several of the applications for decompilation are
enumerated in Section 1.3, indicating that many benefits would result from solving some
of the current problems. The goal of this thesis is to solve several of these problems,
in large part by using the Static Single Assignment form, as used in many optimising
compilers. The current state of the art in machine code decompilation is poor, as will
Section 1.5 discusses the problems faced by a variety of binary analysis tools: various
existing and general classes of decompiler, and the ideal disassembler. It is found that
a good decompiler has to solve about twice the number of fundamental problems of any
existing binary analysis tool. Two existing machine code decompilers are included in
the comparison, to show in greater detail the problems that need to be solved. Legal
issues are briefly mentioned in Section 1.6, and Section 1.7 summarises the main goals
of the research.
A program's source code specifies to a computer the precise steps required to achieve
the functionality of the program. When that program is compiled, the result is an
executable file of verbose, machine specific, and minute detail with the precise steps
required to perform the steps of the program on the target machine. The executable
version has the same essential steps, except in much greater detail. For example, the
machine registers and/or memory addresses are given for each primitive step.
The source code and the machine code for the same program are equivalent, in the sense
that each is a specication of how to achieve the intended functionality of the program.
In other words, the program, in both source and machine code versions, conveys how to
perform the program's function. Comments and other design documents can optionally,
and to varying extents, convey what the program is doing, and why. The computer has
Computers with different instruction sets require different details about how to perform
the program; hence two executable files for the same program compiled for different in-
struction sets will in general be completely different. However, the essential components
of how to perform the program are in all versions of the program: the original source
code, infinitely many variants of the original source code that also perform the same
functionality, and their infinitely many compilations. The task of a decompiler is es-
sentially to find one of the variants of the original source code that is semantically
equivalent to the machine code program, and hence to the original source code.
It could be argued that the output of a decompiler, typically C or C++ with generated
variable names and few if any comments, is not high level source code. Despite this,
• source code is much more compact than any machine code representation;
• irrelevant machine details are not present, so the reader does not have to be
• high level (more abstract) source code features such as loops and conditionals are
easier to grasp at a glance than code with compares, branches and other low level
instructions.
Since the output of a decompiler is much more readable and compact than the machine
code it started with, such output will be referred to as high level language.
Source code does not necessarily imply readability, even though it is a large step in
the right direction. Figure 1.2 shows the C source code for a program which is almost
any decompiler could produce readable source code for this program, since by design its
operation is hidden. Source code comprehension (even with meaningful variable names
What is desired is fully reverse engineered source code, such as shown in Figure 1.3. This
was achieved using traditional reverse engineering techniques, starting with the source
code of Figure 1.2. Techniques used include pretty printing (replacing white space),
transforming ternary operator expressions (?:) into if-then-else statements, path pro-
filing, and replacing most recursive function calls with conventional calls [Bal98]. From
this it can be seen that low readability source code has an intrinsic value: it can be
This example also illustrates the difference between decompilation (which might pro-
duce source code similar to that of Figure 1.2) and traditional reverse engineering (which
main(t,_,a)
char *a;
{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,
main(-86, 0, a+1 )+a)):1,t<_?main(t+1, _, a ):3,main ( -94, -27+t, a
)&&t == 2 ?_<13 ?main ( 2, _+1, "%s %d %d\n" ):9:16:t<0?t<-72?main(_,
t,"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l,+,/n{n+\
,/+#n+,/#;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l q#'+d'K#!/\
+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# ){n\
l]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#\
n'wk nw' iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;\
#'rdq#w! nr'/ ') }+}{rl#'{n' ')# }'+}##(!!/")
:t<-50?_==*a ?putchar(a[31]):main(-65,_,a+1):main((*a == '/')+t,_,a\
+1 ):0<t?main ( 2, 2 , "%s"):*a=='/'||main(0,main(-61,*a, "!ek;dc \
i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}
Figure 1.2: An obfuscated C program; it prints the lyrics for The Twelve Days of
Christmas (all 64 lines of text). From [Ioc88].
Figure 1.3: Part of a traditional reverse engineering (from source code) of the
Twelve Days of Christmas obfuscated program of Figure 1.2. From [Bal98].
Source code with low readability is also useful simply because it can be compiled. It
could be linked with other code to keep a legacy application running, compiled for a
different target machine, optimised for a particular machine, and so on. Where main-
tainability is required, meaningful names and comments can be added via an interactive
• Well written, documented source code, written by the best human programmers.
It contains well written comments and carefully chosen variable names for func-
• Human written source code that, while readable, is not as well structured as the
with the program. Functions and variables have names, but are not as well chosen
• Source code with few to no comments, mainly generic variable and function names,
but otherwise no strange constructs. This is probably the best source code that
• Source code with few to no comments, mainly generic variable and function names,
and occasional strange constructs such as calling an expression where the original
• Source code with no translation of the data section. Accesses to variables and
registers in the original program are represented by generic local variables such
as v3. This is the level of source code that might be emitted by a static binary
• As above, but even the program counter is visible, and the program is in essence a
large switch statement with one arm for the various original instructions. This is
the level of source code emitted by Instruction Set Simulation techniques [MAF91].
Decompilation can be thought of as ending where traditional reverse engineering starts (i.e. with source code that does not yet have comments or meaningful identifiers).
Figure 1.4 shows the relationship between various engineering and reverse engineering tools and processes. Some compilers generate assembly language for an assembler, while others generate object code directly. However, the latter can be thought of as generating assembly language internally, which is then assembled to object code. Compilers for virtual machines (e.g. Java bytecode compilers) generate code that is roughly the equivalent of object code. The main tools available for manipulating machine code are:
• Disassemblers: these produce an assembly language version of the program; the output may not be able to be assembled without modification. The user has to understand assembly language in order to make use of the output.
Figure 1.4: Forward engineering processes (compiler, assembler, linker) and reverse engineering tools (post decompilation editing, assisted machine code decompiler, assembly decompiler, machine code decompiler, disassembler and debugger), arranged from concepts and source code with comments and meaningful names (more abstract) down through source code without comments, assembly language, and object code to machine code (less abstract).
• Debuggers: these also work at the assembly language level, but have the advantage that the program is running, so that values for registers and memory locations can be examined to help understand the program's operation. Source level debuggers would offer greater advantages, but they require source code or debug information, which is assumed absent here.
• Decompilers: these produce a high level source code version of the program. The user does not have to understand assembly language, and the output is an order of magnitude shorter than a disassembly. The output will in general not be the same as the original source code, and may not even be in the same language. Usually, there will be few, if any, comments or meaningful variable names, except for library function names. Some decompilers read machine code, some read object (pre-linker) code, and some read assembly language.
Clearly, a good decompiler has significant advantages over the other tools. Note also that reverse engineering need not stop at source code; for example, Ward has transformed assembly language all the way to formal specifications [War00]. However, reverse engineering from source code entails the comprehension of concepts of arbitrary complexity, whereas reverse engineering to source code from lower level forms needs only to recognise such basic patterns as loops, conditionals, and expressions.
1.3 Applications

Decompilers have many useful applications, broadly divided into those based on browsing parts of a program, those providing the foundation for an automated tool, and those requiring recompilable output.
The first uses for decompilation were to aid migration from one machine to another, or to recover source code. As decompiler abilities have increased, software has become more complex, and the costs of software maintenance have increased, a wide range of potential applications has emerged. Whether these are practical or not hinges on several key questions. The first of these is: is the output to be browsed, is an automated tool required, or should the output be complete enough to recompile the program? If the output is to be recompiled, the next key question is: is the user prepared to put significant effort into assisting the decompilation process, or will the automatically generated code be used largely as-is?
Browsing Applications

For these applications, a decompiler is used to browse part of a program. Not all of the output code needs to be read or even generated. Browsing high level source code is considerably easier than browsing a disassembly.
Figure 1.5: Decompilation applications, grouped by whether the output is browsed, used by an automated tool, or recompiled.
Interoperability

Interoperability involves determining the design principles of a program or system, for the purposes of enabling some other program or system to operate with it. These design principles are ideas, not implementations, and are therefore not able to be protected by patent or copyright. Reverse engineering techniques are lawful for such uses, as shown by Sega v. Accolade [Seg92] and again by Atari v. Nintendo [Ata92]. See also the Australian Copyright Law Review Committee report [Com95], and the discussion of legal issues in Section 1.6.
Decompilation is a useful reverse engineering tool for interoperability where binary code
is involved, since the user sees a high level description of that code.
Learning Algorithms
The copyright laws in many countries permit a person in lawful possession of a program
to observe, study, or analyse the program in order to understand the ideas and principles
underlying the program (adapted from [CAI01]). The core algorithm of a program typically occupies a very small fraction of the whole program. The lower volume of code to read, and the greater ease of understanding of high level code, confer an advantage on the user of a decompiler over one using a disassembler.
Code Checking
Browsing the output of a decompiler could assist with the manual implementation of code checks, or provide the basis for an automated tool, if available. This possibility is suggested by the left-most dotted line of Figure 1.1. Tools that operate on native executable code are becoming more common, especially in the security field.
Finding Bugs
Sometimes software users have a program which they have the rights to use, but it is not supported. This can happen if the vendor goes out of business, for example. Another scenario is where the vendor cannot reproduce a bug, and travelling to the user's premises is not practical. If a bug becomes evident under these conditions, decompilation may be the best tool to help an expert user to fix that bug. The only alternative is likely to be a disassembler. It will be difficult to find the bug in either case, but with the considerably smaller volume of high level code to work through, the expert with the decompiler should in general take less time than the expert working with disassembly. A tool with a decompiler integrated with a debugger would obviously be more useful than a stand-alone decompiler.
The bugs could be fixed by patching the binary program, or source code fragments illustrating the bug could be sent to the vendor for maintenance. With more user effort, the bug could be fixed in the saved source code and the program rebuilt, as discussed later in this section. An automated tool could also search for sets of commonly found bugs, such as dereferences of possibly null pointers.
Finding Vulnerabilities
Again, the lower volume of output to check would in general make the decompiler a far more effective tool. It should be noted that the actual vulnerability will sometimes have to be checked at the disassembly level, since some vulnerabilities can be extremely machine specific. Others, such as buffer overflow errors, may be effectively checked using the high level language output of the decompiler. For the low level vulnerabilities, decompilation may still save considerable time by allowing the expert user to navigate quickly to those areas of the program that are most likely to harbour problems.
Finding Malware
Finding malware (behaviour not wanted by the user, e.g. viruses, keyboard logging, etc.) is similar in approach to finding vulnerabilities. Hence, the same comments apply here.
Verification

Verification checks that the software build process has produced correct code, and that no changes have been made to the machine code file since it was created. This is essentially a special case of comparison, discussed next.

Comparison

It is sometimes necessary to compare the high level semantics of one piece of binary code with another piece of binary code, or its source code. If both are in binary form, these pieces of code could be expressed in machine code for different machines. Decompilation provides a way of abstracting away details of the machine language (e.g. which compiler, optimisation level, etc.), so that more meaningful comparisons can be made. Some idiomatic patterns used by compilers could result in equivalent code that is very different at the machine code level, and the differences may be visible in the decompiled output. Such differences could be reduced by normalising transformations.
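As a small illustration (hypothetical x86 idioms; the exact choices vary by compiler and optimisation level), a normalising decompiler can map several machine level forms of the same computation back to one readable expression:

    #include <stdio.h>

    int main(void)
    {
        int x = 21;
        /* Three common machine level idioms for doubling a value:
         *   shl eax, 1          ; shift left by one
         *   add eax, eax        ; add the value to itself
         *   lea eax, [eax+eax]  ; address arithmetic
         * All three can be normalised to the single C expression below,
         * so two binaries built by different compilers can still be
         * compared meaningfully. */
        int doubled = x * 2;
        printf("%d\n", doubled);
        return 0;
    }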
Ideally, when decompilers are applied to a machine code program, they should be able to produce source code which can be modified, and a working program should result from the modified source code. This ideal can currently only be approached for Java and CLI (.NET) bytecode programs, but applications can be found even for lower quality output.
Automatically generated output will be low quality in the sense that there will be few meaningful identifier names, and few if any comments. This will not be easy code to maintain; however, recompilable source code of any quality is useful for the following applications.
Many Personal Computer (PC) applications are compiled in such a way that they will still run on an 80386 processor, even though only a few percent of programs would be running on such old hardware. Widely distributed binary programs have to be built for the expected worst case hardware. A similar situation exists on other platforms, e.g. SPARC V7 versus V9. There is a special case with Single Instruction Multiple Data (SIMD) instructions, where the performance benefits are so decisive that vendors find it worth rewriting very small pieces of code several times for several different variations of the instruction set. Apart from this exception, the advantages of using the latest machine instruction set are often not realised. A decompiler can realise those advantages by producing source code which is then fed into a suitable compiler, optimising for the exact processor the user has. The resulting executable will typically perform noticeably better than the distributed program.
Cross Platform

Programs written for one platform are commonly ported to other platforms. Obviously, source code is required, and decompilation can supply it. Two other issues arise when porting to other platforms: the issues of libraries and system dependencies in the programs being ported. The latter issue has to be handled by hand editing of the source code. The libraries required by the source program may already exist on the target platform, they may have to be rewritten, or a compatibility library (if one exists for the target platform) could be used.
With the introduction of 64-bit operating systems, another opportunity for decompilation arises. 32-bit drivers will typically not work with 64-bit operating systems. In these cases, either all drivers will have to be rewritten, or some form of compatibility layer provided. Such a layer will typically perform worse than native drivers, and of course will not provide the benefits of 64-bit addresses (i.e. being able to address more than 4GB of memory). There will be some hardware for which there is no vendor support, yet the hardware is still useful. In these cases, a 64-bit driver could be rewritten from decompiled source code for the 32-bit driver. This would allow maximum performance and provide full 64-bit addressing.
Machine code drivers are commonly the only place that low level hardware details are visible. As a result, porting drivers across operating systems when the associated hardware is not documented may even be feasible with the aid of decompilation to expose the necessary interoperability information. Drivers also have the advantage of being relatively small pieces of code, which typically call a relatively small set of operating system services. Automatically generated code may also be suitable for fixing bugs and adding features, as discussed below. The dotted lines near the right end of Figure 1.5 indicate this possibility.
In contrast with creating source code for the above cases, creating maintainable code requires considerable user input. The process could be compared to using a disassembler with commenting and variable renaming abilities. The user has to understand what the program is doing, and enter comments and change variable names in such a way that others reading the resultant code can more easily understand the program. There is still a significant advantage over rewriting the program from scratch, however: the program is already debugged.
The result will probably not be bug free, since no significant program is. However, the effort of writing and debugging a new program to the same level of bugs as the input program is often substantially larger than the effort of adding enough comments and meaningful identifiers via decompilation. (This assumes that the decompiler does not introduce bugs of its own.)
It is easier to optimize correct code than to correct optimized code. Bill Harlan [Har97].

In decompilation terms, optimisation is actually the creation of the most readable representation of the program; the optimum program is the most readable one. The above quote, obviously aimed at the forward engineering process, applies equally to decompilation.
Decompiling for maintainability is a program understanding problem where the only program documents available are end-user documents.
Fix Bugs

Once maintainable source code is available for an application, the user can fix bugs by editing the source code and rebuilding the application. It could be argued that fixing bugs is possible with automatically generated code, because the changes could be made in the absence of comments or meaningful names. However, it will be easier to effect the change and to maintain it (bug fixes need maintenance, like all code), if at least the code near the bug is well commented. This is the reason for the dotted line between auto generated and fix bugs in Figure 1.5.
Many programs ship without debugging information, and often link with third party code. Sometimes when the program fails in the field, a version of the failing program with debugging support turned on will not exhibit the fault. In addition, developers are reluctant to allow debug versions of their program into the field. Existing tools for this situation are inadequate. A high level debugger, essentially a debugger with a decompiler in place of the usual disassembler, could ease the support burden, even for companies that have the source code (except perhaps to third party code) [CWVE01].
Add Features

Similarly to the above, once maintainable source code is available, the user can add new features to the program. Also similarly to the above, it could be argued that this is possible with automatically generated code.

Legacy applications, where the source code (or at least one of the source files) is missing, are obvious candidates for decompilation to maintainable source code. Variations include applications where there is source code but the compiler is no longer available, or where there is source code but it is missing some of the desirable features of modern languages. An example of the latter is the systems language BCPL, which is essentially untyped. Sometimes there is source code, but it is known to be out of date and missing some features of a particular binary file. Finally, one or more source code files may be written in a language for which no suitable compiler is available.
The state of the art as of 2002 needed improvement in many areas, including the recovery of parameters and returns, type analysis, and the handling of indirect jumps and calls.

At the start of this research in early 2002, machine code decompilers could automatically generate source code for simple machine code programs that could be recompiled with some effort (e.g. adding appropriate #include statements). Statically compiled library functions could be recognised to prevent decompiling them, give them correct names, and (in principle) infer the types of parameters and the return value [VE98, Gui01]. The problem of transforming unstructured code into high level language constructs such as conditionals and loops is called structuring. This problem is largely solved; see [Cif94] for details.
Unlike bytecode programs, machine code programs rarely contain the names of functions or variables, with the exception of the names of dynamically linked library functions. Hence, part of producing maintainable code involves manually changing the names of functions and variables, and adding comments. At the time, few decompilers performed type recovery, so the types for variables usually had to be manually entered as well. Enumerated types (e.g. enum colour {red, green, blue};) that do not interact with library functions can never be recovered automatically from machine code, since they are indistinguishable from integers.
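For instance, in this minimal sketch the two functions compile to identical machine code, so no analysis of the binary alone can recover the enumerated type:

    enum colour { red, green, blue };

    /* Once compiled, the enum parameter is just an integer in a register
       or stack slot; these two functions are indistinguishable. */
    int uses_enum(enum colour c) { return c == green; }
    int uses_int (int c)         { return c == 1; }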
Many pre-existing decompilers have weak analyses for indirect jump instructions, and even weaker analyses of indirect call instructions. When an indirect jump or call is not analysed, it becomes likely that the decompilation will not only fail to translate that instruction, but the whole separation of code and data could be incomplete, possibly resulting in large parts of the output being either incorrect or missing altogether. Where such instructions were analysed at all, there were often severe limitations on this identification (e.g. the decompiler might handle only the jump table patterns emitted by particular compilers).
Existing decompilers can be used for a variety of real-world applications, if the user is prepared to put in significant effort to manually overcome the deficiencies of the decompiler. Sometimes circumstances would warrant such effort; see for example [VEW04]. However, such situations are rare. There would be far more applications for decompilation if the amount of effort needed to correct deficiencies in the output could be reduced. At the start of this research, the main deficiencies were:
• There was poor handling of indirect jump instructions, such as those compiled from switch statements.
• There was even poorer handling of indirect call instructions, e.g. indirect calls through function pointers.
• The identification of parameters and returns made assumptions that were sometimes invalid.
• Some decompilers produced output that resembled high level source code, but which was not recompilable.
Finally, meaningful names for functions and variables are not generated. Certainly, the original names are not recoverable, assuming that debug symbols are not present. Similarly, enumerated types (not associated with library functions) are not generated where a human programmer would use them. Unless some advanced artificial intelligence techniques become feasible, these aspects will never be satisfactorily generated automatically.
Various reverse engineering tools are compared in terms of the basic problems that they need to solve; the machine code decompiler has the largest number of such problems.

Reverse Engineering Tools (RETs) that analyse binary programs face a number of common difficulties, which will be examined in this section. Most of the problems stem from two kinds of loss:

1. Some information is not present in some executable file formats, e.g. variable and procedure names, comments, parameters and returns, and types. These are losses of information, and the missing items must either be regenerated or done without.

2. Some information is mixed together in some executable file formats, e.g. code and data; integer and pointer calculations. Pointers and offsets can be added together by the compiler and linker, producing constants that are difficult to separate. These are losses of separation, and require analyses to make the separation; where an analysis falls short, the output may be incomplete or not correct.
By contrast, forward engineering tools have available all the information needed. Programming languages, including assembly languages, are designed with forward engineering in mind, e.g. most require the declaration of the names and types of variables.

There are some advantages to reverse engineering over forward engineering. For example, usually the reverse engineering tool will have the entire program available for analysis, whereas compilers often only read one module (part of a program) in isolation. The linker sees the whole program at once, but usually the linker will not perform any significant analysis. This global visibility for reverse engineering tools can potentially make data flow and alias analysis more precise and effective for reverse engineering tools than for the original compiler. This advantage has a cost: keeping the IR for the whole program in memory at once can consume considerable resources.
Another advantage of reverse engineering from an executable file is that the reverse engineering tool sees exactly what the processor will execute. At times, the compiler may make decisions that surprise the programmer [BRMT05]. As a result, security analysis tools increasingly work directly from executable files, and hopefully decompilers will increasingly be used the same way.

Table 1.2 shows the problems to be solved by various reverse engineering tools. The tools compared include a machine code disassembler and decompilers for assembly language, object code, Java bytecode, and CLI. Two columns deserve explanation:
• the ideal machine code decompiler: it transforms any machine code program to recompilable high level source code; and
• the problems that were solved by two specific machine code decompilers, dcc and REC, both reviewed in Chapter 2.
Table 1.2: The problems to be solved by various reverse engineering tools.

Problem                          | Ideal machine code decompiler | dcc  | REC  | Object code decompiler | Assembly decompiler | Machine code disassembler | Java bytecode decompiler | CLI decompiler
Separate code from data          | yes | some | some | some | no   | yes | no   | no
Separate pointers from constants | yes | no   | no   | easy | no   | yes | no   | no
Separate original from offset pointers | yes | no | no | easy | no | yes | no   | no
Declare data                     | yes | no   | no   | yes  | easy | yes | easy | easy
Recover parameters and returns   | yes | yes  | most | yes  | yes  | no  | no   | no
Analyse indirect jumps and calls | yes | no   | no   | yes  | yes  | yes | no   | no
Type analysis                    | yes | no   | no   | yes  | some | no  | most local variables | no
Merge instructions               | yes | yes  | yes  | yes  | yes  | no  | yes  | yes
Structure loops and conditionals | yes | yes  | yes  | yes  | yes  | no  | yes  | yes
Total                            | 9   | 3½   | 3¼   | 6½   | 4½   | 5   | 2½   | 2
The most difficult of these problems include type analysis and the analysis of indirect jumps and calls, both of which are addressed in this thesis.
Recompilability requires the solution of several major problems. The first of these is the separation of code from data. For assembly, Java bytecode, and CLI decompilers, this problem does not exist. For the other decompilers, the problem stems from the fact that in most machines, data and code can be stored in the same memory space. The general solution to this separation has been proved to be equivalent to the halting problem [HM79]. The fact that many native executable file formats have sections with names such as .text and .data does not alter the fact that compilers and programmers often put data constants such as text strings and switch jump tables into the same section as code, and occasionally put executable code into data segments. As a result, the section names cannot be relied upon.
There are a number of ways to attack this separation problem; a good survey can be found in [VWK+03]. The most powerful technique available to a static decompiler is the data flow guided recursive traversal. With this technique, one or more entry points are followed to discover all possible paths through the code. This technique relies on all paths being valid, which is unlikely to be true for obfuscated code. It also relies on the ability to find suitable entry points, which can be a problem in itself. Finally, it relies on the ability to analyse indirect jump and call instructions using data flow analysis.
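The following sketch illustrates the recursive traversal idea on a toy instruction set (entirely hypothetical; a real implementation would decode a genuine architecture and, as noted above, would need data flow analysis to resolve indirect jumps and calls):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy instruction set: 0x00 = halt, 0x01 nn = jump nn,
       0x02 nn = call nn; anything else is a one-byte instruction. */
    enum { IMAGE_SIZE = 16 };
    static const unsigned char image[IMAGE_SIZE] = {
        0x02, 0x08,   /* 0: call 8                     */
        0x01, 0x06,   /* 2: jmp 6 (skips the data)     */
        0xAA, 0xBB,   /* 4: data bytes between code    */
        0x00,         /* 6: halt                       */
        0xCC,         /* 7: more data                  */
        0x05, 0x00,   /* 8: instruction; 9: halt       */
    };
    static bool is_code[IMAGE_SIZE];  /* unmarked bytes remain "data" */

    /* Follow every direct control flow path from addr, marking the
       bytes it reaches as code. */
    static void traverse(unsigned addr)
    {
        while (addr < IMAGE_SIZE && !is_code[addr]) {
            unsigned char op = image[addr];
            unsigned len = (op == 0x01 || op == 0x02) ? 2 : 1;
            for (unsigned i = 0; i < len && addr + i < IMAGE_SIZE; i++)
                is_code[addr + i] = true;
            if (op == 0x01 || op == 0x02)
                traverse(image[addr + 1]); /* follow the branch target */
            if (op == 0x00 || op == 0x01)
                return;                    /* no fall through          */
            addr += len;
        }
    }

    int main(void)
    {
        traverse(0);                       /* single known entry point */
        for (unsigned a = 0; a < IMAGE_SIZE; a++)
            printf("%2u: %s\n", a, is_code[a] ? "code" : "data");
        return 0;
    }

Note that the data bytes at addresses 4, 5 and 7 are never reached and so remain classified as data, as does the unused tail of the image.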
Part of the problem of separating code from data is identifying the boundaries of procedures. For programs that use standard call and return instructions this is usually straightforward. Tail call optimisations replace call and return instruction pairs with jump instructions. These can be detected easily where the jump is to the start of a function that is also the destination of a call instruction. However, some compilers such as MLton [MLt02] do not use conventional call and return instructions for the vast majority of the program. MLton uses Continuation Passing Style (CPS), which replaces the usual call and return mechanism. Decompilation of programs compiled from functional programming languages is left for future work, but it does not appear at present that finding procedure boundaries poses any particularly difficult additional problems.
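For example, in the following C fragment the call to g is in tail position, so a compiler may emit it as a jump rather than a call/return pair; the jump target can still be recognised as a procedure entry if g is also called conventionally elsewhere in the program:

    int g(int x) { return x + 1; }

    int f(int x)
    {
        return g(x * 2);    /* tail call: may compile to "jmp g" */
    }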
The second major problem is the separation of pointers from constants. For each immediate operand in the program (and each constant data item), there is the choice of representing the immediate value as a constant (integer, character, or other type), or as a pointer to some data in memory (it could point to any type of data). Since addresses are always changed between the original and the recompiled program, only those immediate values identified as pointers should change value between the original and recompiled programs; making this identification requires analysis.
Reverse engineering tools that read native executable files have to separate original pointers (i.e. pointers to the start of a data object) from offset pointers (e.g. pointers to the middle of arrays, or outside the array altogether). Note that if the symbol table was present, it would not help with this problem. Figure 1.6 shows a simple program illustrating the problem, in C source code and x86 machine code disassembled with IDA Pro [Dat98]. There is no need to understand much about the machine code, except to note that the machine code uses the same constant (identified in this disassembly as str) to access the three arrays. Interested readers may refer to the x86 assembly language overview in Table 3.1 on page 63, but note that this example is in Intel assembly language syntax.

Figure 1.6: A program illustrating the problem of separating original and offset pointers.
In this program, there are two arrays which use negative indexes. Similar problems result from programs written in languages such as Pascal which support arbitrary array bounds. In this program, the symbol table was not stripped, allowing the disassembler to look up the value of the constants involved in the array fetch instructions. However, the disassembler assumes that the pointers are original, i.e. that they point to the start of a data object, in this case the string pointer str. The only way to find out that the first two constants are actually offset pointers is to analyse the range of possible values for the index variable i (register edx in the machine code contains i*4).

Note that decompiling the first two array elements as ((int*)(str-16))[i] and ((float*)(str-8))[i] respectively may under some circumstances even compile and run correctly, but it would be very poor code quality. It is difficult to read, but most importantly the compiler of the decompiled output is free to position the data objects of the program in any order and with any alignment that may be required by the target architecture, probably resulting in incorrect behaviour.
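To make the danger concrete, the following sketch (hypothetical; it assumes a data layout in which the arrays a and b happen to precede str, in the spirit of Figure 1.6) contrasts the naive and the desired decompilations. The naive forms appear only in comments, because they rely on a layout that the recompiling compiler is free not to reproduce:

    #include <stdio.h>

    int   a[2] = { 10, 20 };     /* assumed to start 16 bytes before str */
    float b[2] = { 1.5f, 2.5f }; /* assumed to start 8 bytes before str  */
    char  str[] = "hi";          /* the only symbol in the disassembly   */

    int main(void)
    {
        for (int i = 0; i < 2; i++) {
            /* Naive decompilation, using one original pointer (str)
               plus offset pointers:
                 int   av = ((int   *)(str - 16))[i];
                 float bv = ((float *)(str -  8))[i];
               Desired decompilation, with three separate objects:     */
            printf("%d %.1f\n", a[i], b[i]);
        }
        return 0;
    }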
Offset pointers also cause problems for data flow analysis: it may not be certain which data object is accessed in memory expressions, so that not all definitions are known. In addition, when the address of a data object is taken, it is not known which object's address is taken, which could also lead to a definition not being identified. As a result, some analyses must be conservative in the presence of offset pointers.
The problems of separating pointers from constants and original pointers from offset pointers could be combined. The combined problem would be the problem of separating constants, original pointers, and offset pointers. These problems are not combined here because although they stem from the same cause (the linker adding together quantities that it is able to resolve into constants), the problems are somewhat distinct. Also, separating original from offset pointers requires an extra analysis that separating constants from pointers does not. In general, the number of problems grows as the distance from source code to the input code increases; hence assembly language decompilers face relatively few problems, and machine code decompilers face the most.
The information and separation losses discussed above occur at different stages of the compilation process, as summarised in Figure 1.7.

Figure 1.7 shows that no separations are lost in the actual compilation process (here excluding assembly and linking). In principle, procedure level comments could be present at the assembly language level (and even at the level of major loops, before and after procedure calls, etc.). Hence, only statement level detailed comments are lost at this stage. The entry structured statements indicates that several structured statements are lost, such as conditionals, loops, and multiway branch (switch or case) statements. However, the switch labels persist at the assembly language level. The original expressions from the source code are not directly present in the assembly language, however the elementary components (e.g. individual add and divide operations) are present in individual instructions. Some higher level types may still be present, e.g. structure and array types: the declaration of an array of floating point numbers starts with a label and has assembly language statements that reserve space for the array. The elementary type of the array elements is absent, however its name and size (in machine words or bytes, if not in elements) is given.
Figure 1.7: Information and separations lost at various stages in the compilation of a machine code program.
This shows that decompiling from assembly language to source code is not as difficult, in several respects, as decompiling machine code.
After assembly, all comments, types, and data declarations are lost, as are switch labels. At this stage, the first separation is lost: code and data are mixed in the executable sections of the object code file. Even at the object code level, however, there are clues provided by the relocation information. For example, a table of pointers will have pointer relocations and no gaps between the relocations; this can not be executable code. However, a data structure with occasional pointers is still not readily distinguished from code. For this reason, the entry for separating code from data in the object code column of Table 1.2 is only some.
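A minimal sketch of this clue (assumed input: the sorted relocation offsets that fall within a candidate region, and the pointer size of the target machine):

    #include <stdbool.h>
    #include <stddef.h>

    /* A run of adjacent pointer-sized relocations with no gaps cannot be
       executable code, so it is very likely a table of pointers (data). */
    bool looks_like_pointer_table(const size_t *reloc_offsets, size_t n,
                                  size_t ptr_size)
    {
        if (n < 2)
            return false;                 /* too little evidence        */
        for (size_t i = 1; i < n; i++)
            if (reloc_offsets[i] != reloc_offsets[i - 1] + ptr_size)
                return false;             /* a gap: code or mixed data  */
        return true;
    }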
After linking, all variable and procedure names are lost, assuming the usual case that symbols and debug information are stripped. The very useful relocation information is also lost. It would be convenient for decompilation if processors had separate integer and pointer processing units, just as they now have separate integer and floating point units. Since they do not, an analysis is required to separate integer constants from pointer constants. Object code provides ready clues via relocation information to make this analysis easy. From machine code, the analysis has to determine whether the result of a computation is used as an integer or as a pointer, and if a pointer, whether the pointer is original or offset. These are more difficult problems that cannot be solved by simple inspection; data flow analyses are required.
1.5.4.1 Disassemblers

Machine code decompilers face four additional problems to those of an ideal disassembler. As shown in Table 1.2, these are:
• recovering parameters and returns, which is already solved by Java and CLI decompilers, and is facilitated by the data flow techniques of Chapter 3;
• type analysis, covered in Chapter 5;
• merging instruction semantics into expressions; and
• structuring loops and conditionals, which are already solved by Java and CLI decompilers.

In other words, if an ideal disassembler could be written, the techniques now exist to build a good machine code decompiler on top of it.
Good results have been achieved for assembly language decompilers; however, they only address a subset of the problems that machine code decompilers face.
Some decompilers take as input assembly language programs rather than machine code; such tools are called assembly decompilers. Assembly decompilers have an easier job in several senses. An obvious advantage is the likely presence of comments and meaningful identifier names in the assembly language input. Other advantages include the nonexistence of the three separation problems of disassemblers (separating code from data, pointers from constants, and original from offset pointers); type analysis is much less work; and declaring data is easy. Assembly language decompilers have the names and sizes of data objects, hence there is no problem separating original and offset pointers. Figure 1.8 shows the assembly language code for the underlined code of Figure 1.6(a), with explicit references to the symbols a, b, and str.
Assembly decompilers still have some of the major problems of decompilers as shown in Table 1.2: identifying function parameters and return values, merging instruction semantics into expressions, and structuring into high level language features such as loops. The symbols in assembly language typically make compound types such as arrays and structures explicit, but recovering the types of the array or structure elements typically still requires type analysis, hence the entry some under type analysis for assembly decompilers in Table 1.2. Overall, assembly decompilers have slightly fewer problems than object code decompilers.
The above considerations apply to normal assembly language, with symbols for each data element. Some generated assembly language, where the addresses of data items are forced to agree with the addresses of some other program, would be much more difficult to generate recompilable high level source code for. The assembly language output of a disassembler falls into this category. Applications for the decompilation of such programs would presumably be quite rare, in any case.
Object code decompilers are intermediate in the number of problems that they face between assembly decompilers and machine code decompilers, but the existence of relocation information makes several of the separation problems easy.
A few decompilers take as input linkable object code (.o or .obj files). Object files are interesting in that they contain all the information contained in machine code files, plus more symbols and relocation information designed for communication with the linker. The extra information makes the separation of pointers from constants easy, and the separation of original pointers from offset pointers as well. Figure 1.9 shows the disassembly for the underlined code of Figure 1.6(a), starting from the object file, again with explicit references to the symbols a, b, and str. However, there are few circumstances under which a user would have access to object files but not the source files.
Most virtual machine decompilers such as Java bytecode decompilers are very successful because their metadata-rich executable file formats ensure that they face only a few of the problems of Table 1.2.
Some decompilers take as input virtual machine executable files, e.g. Java bytecodes. Despite the fact that the Java bytecode file format was designed for programs written in Java, compilers exist which compile several languages other than Java to bytecodes (e.g. Component Pascal [QUT99] and Ada [Sof96]; for a list see [Tol96]). Like assembly decompilers, most virtual machine decompilers have an easier job than machine code decompilers, because of the extensive metadata present in the executable program files. For example, Java bytecode decompilers have the following advantages over their machine code counterparts:

• Separating pointers (references) from constants is easy, since there are bytecode opcodes that deal exclusively with object references (e.g. getfield, new).

• There is no need to decode global data, since all data are contained in classes. Class member variables can be read directly from the bytecode file.

• Type analysis is not needed for member variables and parameters, since the types are explicit in the bytecode file. However, type analysis is needed for most local variables. In the Java Virtual Machine, local variables are divided by opcode group into the broad types integer, reference, float, long, and double. There is no finer type information for local variables in the bytecode file, and it is possible for a local variable slot to take on more than one type, even though this is not allowed at the Java source code level [GHM00]. This requires the different type usages of the local variable to be split into different local variables.

The main remaining problems are therefore type analysis for local variables, and structuring the code (eliminating gotos, and generating conditionals, loops, break statements, etc.).
Bytecode decompilers have one minor problem that most others do not: they need to flatten the stack oriented instructions of the Java (or CIL) Virtual Machine into the conventional instructions that real processors have. While this process is straightforward, it requires temporary variables representing values at or near the top of stack (sometimes called stack variables). A few bytecode decompilers assume that the stack height is the same before and after any Java statement (i.e. values are never left on the stack for later use). These fail with optimised bytecode.
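As a small example (standard JVM bytecode shown in the comments; the scheme of one temporary per stack position is a common flattening approach, not any particular decompiler's):

    #include <stdio.h>

    int main(void)
    {
        /* JVM bytecode for  z = x + y * 3;  with x, y, z in slots 1-3:
             iload_1  iload_2  iconst_3  imul  iadd  istore_3          */
        int x = 4, y = 5, z;
        int s0, s1, s2;       /* one stack variable per stack position */
        s0 = x;               /* iload_1  : push x                     */
        s1 = y;               /* iload_2  : push y                     */
        s2 = 3;               /* iconst_3 : push 3                     */
        s1 = s1 * s2;         /* imul     : y * 3                      */
        s0 = s0 + s1;         /* iadd     : x + y * 3                  */
        z  = s0;              /* istore_3 : pop into z                 */
        /* Expression propagation then rebuilds  z = x + y * 3;        */
        printf("%d\n", z);    /* prints 19                             */
        return 0;
    }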
The advantages of virtual machine decompilers are due exclusively to the metadata present in their executable file formats. A decompiler for a hypothetical virtual machine whose executable files are not metadata rich would be faced with the same problems as a machine code decompiler. Because the number of problems that bytecode decompilers have to solve is small, the barrier to writing them is comparatively small, many are available, and some perform quite well. A few of them perform type analysis of local variables. For this reason, several Java decompilers are reviewed in the next chapter.
The limitations of existing machine code decompilers include the size of the input program, the identification of parameters and returns, the handling of indirect jumps and calls, and type analysis.
Of the handful of machine code decompilers in existence at the start of this research in early 2002, only two had achieved any real success. These are the dcc decompiler [Cif96], and the Reverse Engineering Compiler (REC) [Cap98], both summarised in Section 2.1 on page 35. Dcc is a research decompiler, written to validate the theory given in a PhD thesis [Cif94]. It decompiles only 80286 DOS programs, and emits C. REC is a non commercial decompiler recognising several machine code architectures and binary file formats, and produces C-like output. Considerable hand editing is required to convert REC output to compilable C code. Table 1.2 showed the problems solved by these decompilers, compared with the problems faced by various other tools, while Table 1.3 summarises their main limitations.
Table 1.3: Limitations for the two most capable preexisting machine code decompilers.
Theoretical limits force both compilers and decompilers to make conservative approximations, but while the result for a compiler is a correct but less optimised program, the result for a decompiler ranges from a correct but less readable program to one that is incorrect.

Many operations fundamental to decompilation (such as separating code from data) are equivalent to the halting problem [HM79], and are therefore undecidable. By Rice's theorem [Ric53], all non-trivial properties of computer programs are undecidable [Cou99], hence compilers and other program-related tools are affected by theoretical limits as well.
A compiler can always avoid the worst outcome of its theoretical limitations (incorrect output). For example, if it cannot prove that a value is constant, there is a simple conservative option: the constant propagation is not applied in that instance. The result is a program that is correct; the cost of the theoretical limit is that the program may run more slowly or consume more memory than if the optimisation had been applied. Note that any particular instance of a compiler's theoretical limitation can be overcome with a sufficiently powerful analysis; the theoretical limit implies only that no compiler can ever produce optimal output for all possible programs.
This contrasts with a decompiler which, due to similar theoretical limitations, cannot prove that an immediate value is the address of a procedure. The conservative behaviour in this case would be to treat the value as an integer constant, yet somehow ensure that all procedures in the decompiled program start at the same address as in the input program, or that there are jump instructions at the original addresses that redirect control flow to the decompiled procedures. If in fact the immediate value is used as a procedure pointer, correct behaviour will result, and obviously if the value was actually an integer constant, no harm is done. Note that such a solution is quite drastic, compared to the small loss of performance suffered by the compiler in the previous example. To avoid this drastic measure, the decompiler will have to choose between an integer and procedure pointer type for the constant (or leave the choice to the user); if the incorrect choice is made, the decompiled program will be incorrect.
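The following sketch (with a hypothetical address and function name) shows the two readings between which the decompiler must choose:

    #include <stdio.h>

    int func(void) { return 42; }  /* hypothetical procedure assumed to
                                      start at address 0x080483F0      */

    int main(void)
    {
        /* Suppose the input program contains  mov eax, 0x080483F0.
           Reading 1: a plain integer constant.                        */
        long as_integer = 0x080483F0L;
        /* Reading 2: a procedure pointer. Addresses move when the
           program is recompiled, so the constant must be replaced by
           a symbol for the program to remain correct.                 */
        int (*as_proc)(void) = func;
        printf("%ld %d\n", as_integer, as_proc());
        return 0;
    }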
A similar situation will be discussed in Section 3.6 with respect to calls where the parameters cannot be fully analysed. A correct but less readable program can be generated by passing all live locations as parameters. In the worst case, a decompiler could fall back to binary translation techniques, whose output (if expressed in a language like C) is correct but very difficult to read. Since the original program is (presumed to be) correct, the decompiled program can always mimic the operation of the original program. A decompiler could also warn the user about decisions that have been made which may prove to be incorrect.
There are sufficient important and legal uses for decompilation to warrant this research, and decompilation may facilitate the transfer of facts and functional concepts to the public domain.

A decompiler is a powerful software analysis tool. Section 1.3 detailed many important and legal uses for decompilation, but there are obviously illegal uses as well. It is acknowledged that the legal implications of decompilation, and reverse engineering tools in general, are important. However, these issues are not considered in detail here. For an analogy, consider the following quote:
'But just as gun owners who defend the legal use of guns are not endorsing cop killers, or free speech activists who attack overly broad restrictions on [...]' [Les05]
The last part could be replaced with 'so too is the defence of decompilation research not an endorsement of those who would use decompilation illegally'.
It is possible that decompilation may eventually help to attain the original goals of copyright law. As the court observed in Sega v. Accolade [Seg92]:

'... the fact that computer programs are distributed for public use in object code form often precludes public access to the ideas and functional concepts contained in those programs, and thus confers on the copyright owner a de facto monopoly over those ideas and functional concepts. That result defeats the fundamental purpose of the Copyright Act: to encourage the production of original works while leaving the ideas, facts, and functional concepts in the public domain for others to build on.'
The legal position of a tool which could be used for copyright infringement is unclear. Prior to MGM v. Grokster, the landmark Sony Betamax case stated in effect that manufacturers of potentially infringing devices (in this case, Sony's Video Cassette Recorder) can not be held responsible for infringing users, as long as there are substantial non-infringing uses [Son84]. However, the ruling in MGM v. Grokster was:

'We hold that one who distributes a device with the object of promoting its use to infringe copyright, as shown by clear expression or other affirmative steps taken to foster infringement, is liable for the resulting acts of infringement by third parties.'
1.7 Goals

The main goals are better identification of parameters and returns, the reconstruction of types, and the correct translation of indirect jumps and calls; all are facilitated by the Static Single Assignment form.
It is true that not all machine code programs can successfully be decompiled. The interest of this research is in improving the state of the art of the machine code decompiler for real-world programs. The objective is to overcome the following problems:

• Correctly identifying the parameters and returns of procedures.

• Inferring types for variables, parameters, and function returns. If possible, the more complex types such as arrays, structures, and unions should be correctly inferred.

• Correctly analysing indirect jumps and calls, using the power of expression propagation and type analysis. If possible, calls to object oriented virtual functions should be recognised as such, and the output should make use of classes.
This thesis shows that the above items are all facilitated by one technique: the use of
the Static Single Assignment (SSA) form. SSA is a representation commonly used in
optimising compilers, but until now not in a machine code decompiler. SSA is discussed
in detail in Chapter 4.
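As a brief illustration (standard SSA, written as C for concreteness): renaming each assignment and inserting a phi-function at the join point makes every use-definition relationship explicit, and separates unrelated uses of the one register:

    #include <stdio.h>

    int main(void)
    {
        int cond = 1;
        /* Original code reuses the one register eax throughout.
           In SSA form, each assignment creates a new version:         */
        int eax1 = 1;                  /* eax := 1                     */
        int eax2 = 2;                  /* eax := 2 (other branch)      */
        int eax3 = cond ? eax1 : eax2; /* eax3 := phi(eax1, eax2)      */
        printf("%d\n", eax3);
        int eax4 = 100;                /* unrelated reuse of eax: a new
                                          version, so it can become a
                                          separate local variable      */
        printf("%d\n", eax4);
        return 0;
    }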
It is expected that the use of SSA in a machine code decompiler would enable at least simple programs with the following problems, not handled well by current decompilers, to be handled correctly:

• register and stack-based parameters that do not comply with the Application Binary Interface (ABI) of the platform,

• switch statements should be recognised despite code motion and other optimisations, and

• at least the simpler indirect call instructions should be converted to high level equivalents.
Following chapters review the limitations of existing machine code decompilers, and show how many of their limitations in the areas of data flow analysis, type analysis, and the translation of indirect jumps and calls are solved with the Static Single Assignment form.
1.8 Thesis Structure

This section gives an overview of the whole thesis. Throughout the thesis, one sentence summaries precede most sections. These can be used to obtain an overview of a part of the thesis. The index of summaries on page xliii is therefore an overview of the whole thesis.
Chapter 2 reviews the state of the art of various decompilers, from machine code through to virtual machine decompilers. It reviews what has been done to date, with an emphasis on their shortcomings.

Data flow analysis is one of the most important components of a decompiler, as shown in Chapter 3. Data flow establishes the relationship between definitions and uses, which is important for expression propagation and eliminating condition codes from machine code. The data flow effects of call statements are particularly important; equations are defined to summarise the various elements. While traditional data flow techniques can be used for much of this work, they have limitations.
Chapter 4 introduces the Static Single Assignment (SSA) representation; the use of SSA for decompilation forms the core of this thesis. SSA is a well known technique in compilation, used to perform optimisations that require data flow analysis. The advantages of SSA over other data flow implementations for the very important technique of expression propagation are shown, and a solution to the problem of identifying parameters and returns is given. Also shown are the solutions to two other problems which are enabled by the use of SSA: determining whether a procedure preserves a location, and handling overlapped registers. Some problems caused by recursion are discussed. Methods for safely dealing with the inevitable imprecision of data flow information are also given.
Chapter 5 covers type analysis for decompilers. While type analysis is often treated as a constraint satisfaction problem, it can also be cast as a data flow problem. The use of a representation based on SSA is useful, since SSA provides a convenient pointer from every use of a location to its definition; the definition becomes a convenient place to sparsely store type information. After a brief comparison of the constraint based and data flow based implementations, details are presented for handling arrays and structures. Such compound types have received little attention in previous decompilers.
Chapter 6 describes techniques for analysing indirect jumps and calls. These instructions are the most difficult to resolve, and are not well handled by current decompilers. The solution is based on using the power of propagation and high level pattern matching. However, there are problems caused by the fact that before these instructions are analysed, the control flow graph is incomplete. When correctly analysed, indirect jumps can be converted to switch-like statements. Fortran style assigned gotos can also be converted to such statements. With the help of type analysis, indirect calls can be converted to calls through function pointers or virtual functions.

The Boomerang decompiler is used to test the theory on actual programs. Chapter 7 demonstrates results for various topics discussed throughout the other chapters.
Chapter 8 contains the conclusion, where the main contributions and future work are
summarised.
Chapter 2
Decompiler Review
Existing decompilers have evaded many of the issues faced by machine code decompilers, or have been deficient in the areas detailed in Chapter 1. Related work has also not addressed these deficiencies.

The history of decompilers stretches back more than 45 years. In the sections that follow, salient examples of existing decompilers and related tools and services are reviewed. Peripherally related work, such as binary translation and obfuscation, is also considered.
A surprisingly large number of machine code decompilers exist, but all suffer from the problems discussed in Chapter 1.

Machine code decompilation has a surprisingly long history. Halstead [Hal62] reports that the Donnelly-Neliac (D-Neliac) decompiler was producing Neliac (an Algol-like language) code from machine code in 1960. This is only a decade after the first compilers. Cifuentes [Cif94] gives a very comprehensive history of decompilers from 1960 to 1994. The decompilation Wiki page [Dec01] reproduces this history, and extends the history to the present. A brief summary of the more relevant decompilers follows; most are discussed in more detail in these histories.
There was a difference of mindset in the early days of decompilation. For example, Stockton Gaines worries about word length (48 bits vs 36 was common), the representation of negative integers and floating point numbers, self modifying code, and side effects such as setting the console lights. Most of these problems have vanished, with the exception of self modifying code, and even this is much less common now. Gaines also has a good explanation of idioms [Gai65]. Early languages such as Fortran and Algol do not have pointers, which also marks a change between early and modern decompilation.
• D-Neliac decompiler, 1960 [Hal62], and Lockheed Neliac decompiler, 1963-7. These produced Neliac code (an Algol-like language) from machine code programs. They could even convert non-Neliac machine code to Neliac. Decompilers at this time were pattern matching, and left more difficult cases to the programmer to perform manually [Hal70].

• Hollander's decompiler, 1973. His approach, while novel for the time, is essentially still pattern matching. Hollander may have been the first to use a combination of data flow and control flow techniques to guide decompilation.
• The Piler System, 1974. Barbe's Piler system was a first attempt to build a general decompiler. The system was able to read the machine code of several different machines, and generate code for several different high level languages. Only one input phase was written (for the GE/Honeywell 600 machine) and only two output language modules were completed.

• Hopwood's decompiler, 1978. Hopwood translated assembly language into an artificial language called MOL620, which features machine registers. This choice of target language made the decompiler easier to write, however the result is not really high level code. He also chose one instruction per node of his control flow graph, instead of the now standard basic block, so his decompiler wasted memory (relatively much more precious at that time). He was able to successfully translate one large program with some manual intervention; the resultant program was reportedly better documented than the original [Hop78].
• Exec-2-C, 1990. This was an experimental project by the company Austin Code Works, which was not completed. Intel 80286/DOS executables were disassembled and converted to a low level C in which machine details such as registers and condition codes were visible in the output. Some recovery of high level C (e.g. if-then, loops) was performed. Data flow analysis would be needed to improve this decompiler. The output was approximately three times the size of the assembly language file, when it should be more like three times smaller. It can be downloaded from [Dec01], which also contains some tests of the decompiler.
• Fuan and Zongtian's decompiler, 1991-93. This decompiler uses a library of patterns (some compiler specific) to reduce unwanted output from the decompiler. It also has rule-based recognition of data types such as arrays and pointers to structures, though the papers give little detail on how this is done [FZL93, FZ91, HZY91].
• Cifuentes' dcc, 1994. Parameters and returns were identified using data flow analysis, and control flow analysis was used to structure the output. Cifuentes demonstrated her work with a research decompiler called dcc [Cif96], which reads small Intel 80286/DOS executables. The only types dcc could identify were 3 sizes of integers (8, 16, and 32 bit), and string constants (only for arguments to recognised library functions). Arrays were emitted as memory expressions (e.g. *(arg2 + (loc3 << 1))). dcc can be considered a landmark work, in that it was the first machine code decompiler to have solved the basic decompiler problems, excluding the recovery of complex types.
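For example (a sketch assuming dcc's 16-bit integer size): the shift left by one scales the index by the element size, so such a memory expression is really an array access that type analysis failed to recover:

    #include <stdio.h>

    int main(void)
    {
        short arr[4] = { 10, 20, 30, 40 };
        short *arg2 = arr;
        int loc3 = 2;
        /* dcc's form: a byte address computed as arg2 + (loc3 << 1),
           i.e. the index scaled by sizeof(short):                     */
        short v = *(short *)((char *)arg2 + (loc3 << 1));
        /* The readable equivalent that type analysis should recover:  */
        short w = arg2[loc3];
        printf("%d %d\n", v, w);       /* prints 30 30                 */
        return 0;
    }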
The dcc decompiler was modified to read 32-bit Windows programs in 2002; see the ndcc entry below.

• The Reverse Engineering Compiler (REC) [Cap98]. REC extends dcc's work in several small ways, but the output is less readable since it generates a C-like output with registers (but not condition codes) still present. It is able to decompile executable files for several processors (e.g. Intel 386, Motorola 68K), and handles multiple executable file formats (e.g. ELF, Windows PE, etc). It is able to use debugging information in the input file, if present, to name functions and variables as per the original source code. Variable arguments to library functions such as printf are handled well. Complex types such as array references are not analysed. REC is not open source software, however several binary distributions are available. The decompiler engine does not appear to have been updated after 2001.
• UQBT, the University of Queensland Binary Translator, 1997-2001. This binary translator uses a standard C compiler as the back end; in other words, it emits C source code. The output is not intended to be readable, and in practice is very difficult to read. However, the output is compilable, so UQBT could be used for some of the recompilation applications of Section 1.3 (e.g. Section 1.3.3). Work on UQBT was not completed, however it was capable of producing low level source code for moderate sized programs, such as the smaller SPEC [SPE95] benchmarks [CVE00, CVEU+99, UQB01].
• Dynamic decompilation, 2001. Decompilation performed at run time could be a powerful tool for security work. The main idea is that the security analyst is only interested in one small piece of code at one time, and so high level code could be generated on the fly. One problem with traditional (static) decompilation is that the values of registers and memory locations are not available; by contrast, a dynamic decompiler can provide at least one value (the current value) for any location.
• Type Propagation in the IDA Pro Disassembler, 2001. Guilfanov describes the type propagation system in the popular disassembler IDA Pro [Dat98]. The types of parameters to library calls are captured from system header files. The parameter types for commonly used libraries are saved in files called type libraries. Assignments to parameter locations are annotated with comments with the name and type of the parameter. This type information is propagated to other parts of the program; however, no attempt is made to infer the types for other variables not associated with the parameters of any library calls [Gui01].
• DisC, by Satish Kumar, 2001. This decompiler is designed to read only programs compiled by one particular compiler (Turbo C), making it a compiler specific decompiler. With only one compiler generating the input programs, simpler pattern matching is more effective. DisC does not handle floating point instructions, and for loops are translated to while loops. It is an interesting observation that since most aspects of decompilation are ultimately pattern matching in some sense, the difference between compiler specific and general decompilers is largely one of degree.
• ndcc decompiler, 2002. André Janz modified the dcc decompiler to read 32-bit Windows Portable Executable (PE) files. The intent was to use the modified decompiler to analyse malware. The author states that a rewrite would be needed to fully implement the 80386 instruction set. Even so, reasonable results were achieved.

• Anatomizer decompiler. Conditionals, loops, and switch statements are handled well. When run on a Cygwin program, it failed to find arguments for a call to printf, possibly because the compiler translated it to a call to Red_puts(). Anatomizer does not appear to handle any floating point instructions, and overlapping registers (e.g. BL and EBX) are treated as unrelated. No attempt is made to recover parameters: in procedures where registers are used before definition, they are not identified as parameters. It is difficult to evaluate this tool much further, since no source code appears to be available, and activity on the web site (which is in Japanese) appears to have ceased in 2005.
• Tröger and Cifuentes, 2002. Tröger and Cifuentes show a method of analysing indirect call instructions. If such a call implements a virtual function call, the various aspects of the call are extracted. The technique as presented is limited to one basic block; as a result, it fails for some less common cases. The intended application was binary translation.
• Boomerang, 2002 onwards. This open source decompiler has several front ends (two are well developed) and a C back end. Boomerang has been used to demonstrate many of the techniques from this thesis. At the time of writing, it was just becoming able to handle larger binary programs. Its many limitations are discussed where relevant in later chapters.
• Desquirr, 2002. This is an IDA Pro plug-in, written by David Eriksson as part of his Masters thesis. It decompiles one function at a time to the IDA output window. It shows what can be done with the help of a powerful disassembler and about 5000 lines of C++ code. Because a disassembler does not carry semantics for machine instructions, the plug-in itself encodes knowledge of instruction semantics and addressing modes. The x86 and ARM processors are supported. Conditionals and loops are emitted as gotos, there is some simple switch analysis, and some simple recovery of parameters.
• Yadec decompiler, 2004. Raimar Falke's Diploma thesis (written in German) describes the development of a type analysis system for a decompiler, including the handling of type conflicts [Fal04].
• Andromeda decompiler, 2004. The decompiler itself was never publicly released, but a GUI program for interacting with the IR generated by the decompiler proper is available from the website [And04]. The author claims that the decompiler is designed to be universal, although only an x86 front end and C/C++ back end are written at present. The GUI program comes with one example of decompiler output. It is not possible to analyse arbitrary programs with the available GUI program, so the correct types might for example be the result of extensive manual editing. As of March 2007, the web page has been inactive since May 2005.
• Hex-Rays decompiler for 32-bit Windows, 2007. At the time of writing, author Ilfak Guilfanov had just released a decompiler plugin for the IDA Pro disassembler. The decompiler view shows one function at a time in a format very close to the C language; the user can click on the name of a called function to jump to the view for that function. Small functions are translated in a fraction of a second. The output shows structured conditionals (including compound predicates with operators such as ||) and loops (for and while loops, including break statements). Parameters and returns were present in all functions. There is an emphasis on fast, interactive operation. The author stressed that the results are for visualisation, not for recompilation [Gui07a, Gui07b].
Object code decompilers have advantages over machine code decompilers, but are less common, presumably because the availability of object code without source code is low.

As noted earlier, the term object code is here used strictly to mean linkable, not executable, code. Object code contains relocation information and symbol names for objects that can be referenced from other object files, including functions and global variables.
• Schneider and Winiger, 1974. Here the contrived grammar for a compiler was inverted to produce a matching decompiler [SW74]. This works only for a particular compiler, and only under certain circumstances; it was demonstrated in the paper only for a simple example.

• Decomp, 1988. Reuter wrote a quick decompiler for a specific purpose (to port a game from one platform to another, without the original source code). Decomp was an object code decompiler, and produced files that needed significant hand editing before they could be recompiled. No data flow analysis was performed, so the output remained at a low level.
Some of the early decompilers (e.g. those by W. Sassaman [Sas66] and Ultrasystems [Hop78]) read assembly language, because there was a pressing need to convert assembly language programs (second generation languages) to high level languages (third generation languages). This is a somewhat easier task than machine code decompilation, as evidenced by the number of problems listed in Table 1.2 on page 18 (4½ versus 9).
• Housel's decompiler, 1973. This decompiler translated from a toy assembly language (MIXAL) to PL/1 code. Six small programs were tested, all from Knuth's Art of Computer Programming Vol. 1 [Knu69]. Of these six, two were correct with no user intervention, and most of the generated PL/1 statements were correct.

• Friedman's decompiler, 1974. This decompiler was applied to low level operating system code, to port it from one machine to another. It highlighted the problems that such low level code can cause: considerable manual intervention was required on average, and the final program had almost three times the number of instructions [Fri74].
• Zebra, 1981. Zebra was a prototype decompiler developed at the Naval Underwater Systems Center [Bri81]. It was another assembly decompiler. The project found that fully automatic recovery of the semantics of a program was not economically practical, but that it was useful as an aid to manual conversion. Later work considered structuring assembly language programs, and also found ways to improve the performance of interval based algorithms as used in the dcc decompiler. The conclusion was that with these improvements, there was no decisive advantage of one algorithm over the other; intervals produced slightly better quality output than the parenthesis based alternative, which had other advantages.
• Glasscock's diploma project, An 80x86 to C reverse compiler, 1998. This project handled a small subset of integer instructions. Test programs were very simple, consisting of only integer variables, string constants used only for the printf library function, and no arrays or structures. The goal was only to decompile 5 programs, so even if-then-else conditionals were not needed. Data flow analysis was used to merge instruction semantics and eliminate condition codes.
• Mycroft's type-based decompilation, 1999. Mycroft showed how machine instructions (in the form of Register Transfer Language, RTL) could be decompiled to C [Myc99]. RTL is at a slightly higher level than machine code; the author states that it is assumed that data and code are separated, and that procedure boundaries are known. It is not stated how these requirements are fulfilled; perhaps there is a manual component to the preparation of the RTLs. One of the drivers for Mycroft's work was a large quantity of legacy BCPL code, for which compilers are no longer readily available. He uses the SSA form of the RTL instructions, and unification, to distill type information. Only registers, not memory locations, are typed. He considers code with pointers, structures, and arrays. Sometimes there is more than one solution to the constraint equations, and he suggests user intervention to choose the best solution.
• Ward's FermaT transformation system, 1999. Ward has been working on program transformations for well over a decade. His FermaT [War01] system is capable of transforming from assembly language all the way up to specifications, with some human intervention [War00]. He uses the Wide Spectrum Language WSL as his intermediate representation; transformations are applied to the WSL program, and finally (if needed) the result is translated to the target language. FermaT is released under the GPL license and can be downloaded from the author's web page [War01]. He also founded the company Software Migrations Ltd. In one project undertaken by this company, half a million lines of 80186 assembler code were migrated to a high level language.
• An assembly decompiler for DSP code, 2000. An assembly language decompiler for Digital Signal Processing (DSP) code was written in a compiler-compiler called rdp [JSW00]. The authors note that DSP is one of the last areas where assembly language is still commonly used. This decompiler faces problems unique to DSP processors, as noted in Section 2.6.2; however, it demonstrates that decompilation is possible even for such code. The authors doubt the usefulness of decompiling from binary files. See also [JS04].
• Proof-directed decompilation. The input is Jasmin, essentially Java assembly language. The output is an ML-like simply typed functional language. Their example shows an iterative implementation of the factorial function transformed into two functions (an equivalent of the original loop expressed with recursion). While not equivalent to a general decompiler, their work may have application where proof of correctness is required. In [Myc01], Mycroft compares his type-based decompilation with this work. He finds that the two systems produce very similar results in the areas where they overlap.
2.4 Decompilers for Virtual Machines

Virtual machine specifications (like Java bytecodes) are rich in information such as names and types, making decompilers for these platforms much easier to write; however, good decompilers are still not trivial. As a group, the only truly successful decompilers to date have been those which are specific to a particular virtual machine standard (e.g. Visual Basic or Java bytecodes). Executables designed to run on virtual machines typically are rich in information such as names and types. Because of the large number of virtual machine decompilers, only the more interesting ones will be described here. For more details, and a comparison of some decompilers, see [Dec01].
As indicated in Table 1.2 on page 18, the types of most local variables can only be found with type analysis. Some of the simpler Java decompilers, including some commercial decompilers, do not perform type analysis, rendering the output unsuitable for recompilation.
• McGill's Dava Decompiler. The Sable group at McGill University, Canada, have been developing a framework for manipulating Java bytecodes called Soot. The main purpose of Soot is optimisation of bytecodes, but they have also built a decompiler called Dava [Dec01] on top of Soot. With Dava, they have been concentrating on decompiling arbitrary bytecode, not just the output of standard Java compilers. Miecznikowski and Hendren found that four commonly used Java decompilers were confused by peephole optimisation of the bytecode. The goal of their Dava decompiler is to produce correct, recompilable Java source code. Three problems have been overcome to achieve that goal.
Finding types for all local variables is more difficult than might be imagined. The authors found that a three stage algorithm was needed. The first stage solves constraints; the second is needed when the types of objects created differ depending on the run-time control flow path. The third stage, not needed so far in their extensive tests with actual bytecode, essentially types references as type Object, and introduces casts where necessary.
Optimised bytecode often keeps intermediate values on the stack: instead of saving such variables to locals, they are left on the stack to be used as needed. The decompiler has to create local variables in the generated source code for these stack variables, otherwise the code would not function. Optimisation of bytecodes can also result in reusing a bytecode local variable for two or more source code variables, and these may have different types. It is important for the decompiler to separate these distinct uses of one bytecode local variable into distinct source variables.
The third problem is converting such arbitrary bytecodes into readable, correct high level source code (Java in this case). Miecznikowski and Hendren give examples where four other decompilers fail at all three of the above problems, and their own Dava decompiler succeeds [MH02].
Van Emmerik [Dec01] shows that one of the more recent Java decompilers (JODE, described below) handles many such cases correctly.
While Dava usually produced correct Java source code, it was often difficult to read. Some high level patterns and flow analysis were used in later work to improve the readability, even from obfuscated code [NH06]. Compound predicates (using the && and || operators), rarely handled by decompilers, are included in the patterns.
• JODE. This open source decompiler applies a type inference system, similar to the Java runtime verifier, that attempts to find type information from other class files. JODE is able to correctly infer types of local variables, and is able to transform code into a more readable format, closer to the way Java is naturally written.
• Jad. Jad is freely available, but source code is not available. Since it is written in C++, it is relatively fast [Kou99]. It is in the top three of nine decompilers tested in [Dec01], but is confused by some optimised bytecode.
• JReversePro. It seems immature compared with JODE, which is similarly open source. It fails a number of tests that most of the other Java decompilers pass. However, it does attempt to type local variables. For example, in the Sable test from [Dec01], it correctly infers that the local variable should be declared with type Drawable, when most other decompilers use the most generic reference type, Object (and emit casts as needed) [Kum01a].
The situation with Microsoft's CLI (.NET) is slightly different to that of Java. In CLI bytecode files, the types of all local variables are stored, in addition to all the information that Java bytecode files contain. This implies that local variable slots can only be reused by other variables of the same type, so that no type problems arise from such sharing. No type analysis is needed. It is therefore possible to write a very good decompiler for MSIL.
• Anakrino. This decompiler is available under an open source license [Ana01]. As of this writing, the decompiler does not decompile classes, only methods, necessitating some hand editing for recompilation. It exited with an error on some test programs.
• Reflector for .NET. Reflector is a class browser for CLI/.NET components and assemblies; it includes a decompiler view that can show methods in C# and other languages.
A few commercial companies offer decompilation services instead of, or in addition to, decompilers themselves. Despite the lack of capable, general purpose decompilers, a few companies specialise in decompilation services. It is possible that they have capable decompilers that they are not selling to the public. It is more likely, however, that they have an imperfect decompiler that needs significant expert intervention, and they see more value to their company in selling that expertise than in selling the decompiler itself and providing support for others to use it. This is similar to some other reverse engineering tools, whose results cannot be guaranteed. Only a few commercial decompilation services are listed here.
• Software Migrations Ltd [SML01]. Their services are based on "ground breaking research work first undertaken during the 1980s at the Universities of Oxford and Durham in England" (from their web page). They specialise in mainframe assembler comprehension and migration. See also the entry above on the company founder's FermaT transformation system.
• The Source Recovery Company and ESTC. The Source Recovery Company [SRC96, FC99], and their marketing partner ESTC [EST01], offer a service of recovering COBOL and Assembler source code from MVS load modules. They use a low level pattern matching technique, but this seems to be suitable for the mainframe COBOL market. The service has been available for many years, and appears to be successful. It is possible that this and related services are relatively more successful than others in part because COBOL does not manipulate pointers.
• JuggerSoft [Jug05], also known as SST Global [SST03] and Source Recovery [SR02a] (not related to The Source Recovery Company above). This company sells decompilation services and binary translators. They have a good collection of decompilers for legacy platforms. Their web pages indicate that this company guarantees success (provided the customer has enough money, presumably), and will write a custom decompiler if necessary. They claim that even unusual legacy platforms can be handled.
• Dot4.com. This company offers the service of translating assembly language code to a high level language using in-house tools. Comments and comment blocks are maintained from the assembly source.
• MicroAPL. This company offers services for the translation of one assembly language to another. Several machines are supported, including 68K, DSP, 80x86, Z8K, PowerPC, and ColdFire cores [Mic97].
• Decompiler Technologies, formerly Visual Basic Right Back, offered a Visual Basic decompilation service; a machine code decompilation service was announced in mid 2005. Recompilability was not guaranteed; output was generated automatically using proprietary, not-for-sale decompilers and other tools.
2.6 Related Work

This related work faces a subset of the problems of decompilation, or features techniques that could also be applied in decompilers.
2.6.1 Disassembly
Disassembly achieves similar results to decompilation, and encounters some of the same problems, but the results are more verbose and machine specific.
Disassemblers can be used to solve some of the same problems that decompilers solve. They have three major drawbacks compared to decompilers: their output is machine specific, their output is large compared to high level source code, and users require a good knowledge of the machine's assembly language. Decompilers would be preferred over disassemblers, if available at the same level of functionality, for the same reasons that high level languages are preferred over assembly language. As an example, Java disassemblers are rarely used (except perhaps to debug Java compilers), because good Java decompilers are available.
The most popular disassembler is probably IDA Pro [Dat98]. It performs some automatic analysis of the input program, but also offers interactive commands to override the automatic analysis results if necessary, to add comments, declare structures, etc. Separation of pointers from constants, and of original from offset pointers, is completely manual. IDA Pro has the ability to generate assembly language in various dialects; routine level visualiser plugins are available, and a pseudocode view was being demonstrated at the time of writing. Disassembly of obfuscated or protected code is a more specialised use, and has several unique problems that are not considered here.
2.6.2 Decompilation for DSP Processors

Current Digital Signal Processors (DSPs) are so complex that they rely on advanced compiler technology for their performance. As a result, there is a need for conversion of existing assembly language code to high level code, which can be achieved by assembly decompilation. Decompilation of DSP assembly language has extra challenges over and above the usual problems, including:
• saturated arithmetic, and
• sticky overflow flags.
Link-time optimisers share several problems with decompilers. Muth et al. [MDW01] found the following problems with the alto optimiser for the Compaq Alpha:
• There can be no reliance on ABI (calling convention) compliance, since the compiler or linker could perform optimisations that violate such conventions.
• Constant propagation is very useful for optimisations and for analysing indirect control transfer instructions; the authors measured its applicability at an average of about 18%.
Analysis is easier at the object code (link-time) level, because of the presence of relocation information. Optimisers have a very simple fallback for when a proposed transformation cannot be shown to be safe: do not perform it. alto is able to force conservative assumptions for indirect control transfers by using the special control flow graph nodes Bunknown (for indirect branches) and Funknown (for indirect calls), which have worst case behaviour (all registers are used and defined). This conservatism is reduced with the help of relocation information. For example, there are control flow edges from Bunknown to only those basic blocks in the procedure whose addresses have relocation entries associated with them (implying that there is at least one pointer in the program to that code). Without relocation information, Bunknown would have edges to the beginning of all basic blocks, or possibly to all instructions.
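To make the use of relocation information concrete, here is a small hedged sketch (invented names throughout: has_reloc, add_edge and BasicBlock are not from alto) of how the conservative edges for Bunknown might be built:

    /* Connect Bunknown only to basic blocks whose start address has a
     * relocation entry, i.e. blocks that are address-taken somewhere.
     * All identifiers here are hypothetical, for illustration only. */
    typedef struct BasicBlock { unsigned addr; } BasicBlock;

    extern int  has_reloc(unsigned addr);                   /* relocation table lookup */
    extern void add_edge(BasicBlock *from, BasicBlock *to); /* CFG edge insertion */

    void connect_bunknown(BasicBlock *bunknown, BasicBlock *blocks, int n) {
        for (int i = 0; i < n; i++)
            if (has_reloc(blocks[i].addr))      /* at least one pointer to this code */
                add_edge(bunknown, &blocks[i]);
        /* Without relocation information, every block (or even every
         * instruction) would need an edge from Bunknown. */
    }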
Programs are sometimes implemented in hardware such as field programmable gate arrays (FPGAs). Hardware compilers can read source code such as C, or they can read executable programs. Better performance can be obtained if basic control and data flow information is provided, allowing the use of advanced memory structures such as smart buffers [SGVN05]. As noted in this work, decompilation techniques could provide such information, or provide source code that, even if not suitable for software maintenance, is adequate for hardware compilation. The ability to synthesise to hardware is becoming more important with the availability of chips that incorporate a microprocessor and configurable logic on one chip. When such tools read executable programs directly, they do not understand the data, and the output has very low readability.
UQBT, the University of Queensland Binary Translator, is a static binary translator that uses a C compiler as its back end. It is able to translate moderately sized programs (e.g. the SPEC CPU95 [SPE95] benchmark go) into very low level C code [UQB01, CVE00, CVEU+99, BT01]. The semantics of each instruction are visible, and all control flow is via goto statements and labels. Parameters and returns are identified, although actual parameters are always in the form of simple variables, not expressions.
There is no concept of the data item that is being manipulated, apart from its size. To ensure that the address computed by each instruction will read the intended data, binary translators have to force the data section of the input binary program to be copied exactly to the data section of the output program, at the same virtual address. In essence, the target program is replicating the bit manipulations of the source program. The data section is regarded as a group of addressable memory bytes, not as a collection of typed data objects.
This is a subtle but important point. The compiler (if there was one) that created the input program inserted instructions to manipulate high level data objects defined in the source program. The compiler has an intermediate representation of the data declared in the source program, as well as of the imperative statements of the program. The binary translator has no such representation of the data. It blindly replicates the bit manipulations from the source program, without the benefit of this data representation. It ends up producing a program that works, but only because the input program worked. It makes no attempt to analyse what the various bit manipulations do. In effect, the binary translator is relying on the symbol table of the original compiler to ensure that data items do not overlap. The source code emitted by such a translator is at the second or third lowest level (no translation of the data section), as described in Section 1.1 on page 6.
This contrasts with the code quality that a decompiler should produce. A decompiler should emit source code from which readers can understand most of the program's behaviour, just as they would if they were to read the original source code with the comments blanked out. In other words, decompilers should strive for source code that is two levels higher up the list of Section 1.1.
The above discussion concerns static binary translation, where no attempt is made to execute the input program. Dynamic binary translation does run the program, and achieves two of the goals of decompilation. The first is the ability to run a legacy application with no source code on a new platform. The second is the ability to optimise the running program (e.g. Dynamo [BDB00]). However, dynamic binary translation in the usual configuration does not allow maintenance of the program except by patching the executable file, and most programs require modification over their lifetimes. However, a modified dynamic binary translator might be able to find values for important pointers or other data items that defeat static analysis. Some translators emit a low level representation of the input program inlined into a large source file, relying heavily on compiler optimisation to reduce the bulk.
Instruction Set Simulation is a technique where each instruction of the original machine code is replaced by source code (or macro calls) that emulates it. It is usually a static translation (however there are also dynamic versions, e.g. Shade [CK94]), and the result is source code, like the output of a binary translator, except that even the original program's program counter is visible. This is the lowest level of source code listed in Section 1.1. When macros are used, the source code resembles assembly language.
Most analyses in a decompiler compute a result that is incomplete in some sense, e.g. a subset of the values that a register could take. In other words, these analyses are abstract evaluations of a program. Cousot and Cousot [CC76] show that provided certain conditions are met, correctness and termination can be guaranteed. This result gives confidence that the various problems facing decompilers can eventually be overcome.
Proof-carrying code combines machine code with machine-checkable proofs [AF00]. The proof-carrying code can be at the machine code level [SA01] or at the assembly language level [MWCG98]. While these systems work with types for machine or assembly language, they do not derive the types from the low level code. Rather, extra information in the form of checkable proofs and types is supplied alongside the code. It is interesting to observe that most attempts to increase program safety seem to increase the amount of information available, leading to representations that are similar to those that can be created for programs written in a high-level language.
Some security checking tools read machine code directly, without relying on (or in some cases even having access to) source code. Xu describes safety checking, which encompasses the types and also the states (e.g. readable, executable) of machine code operands [Xu01]. He uses symbol table information (not usually available in commercial programs), and claims to be able to deal with indirect calls using a simple intraprocedural constant propagation algorithm.
Also from the University of Wisconsin-Madison, Christodorescu and Jha discuss a Static Analyser for Executables (SAFE) [CJ03]. The binary is first disassembled with the IDA Pro disassembler. A plug-in for IDA Pro called the Connector, provided by GrammaTech Inc [Gra97], interfaces to the rest of the tool. Value-Set Analysis (VSA) is used in conjunction with a tool called CodeSurfer (sold by GrammaTech Inc). CodeSurfer is a tool for program understanding and code inspection, including slicing and chopping. It appears that the executable loader used in this work was an early version that relied on IDA Pro to (possibly manually) analyse indirect jump and call instructions. They were able to detect obfuscated versions of four viruses with no false positives or negatives, where a commercial virus scanner was defeated by the obfuscations.
A later paper from the same university names the composite tool for analysis and inspection of executables: CodeSurfer/x86. It builds representations that are similar to those that can be created for a program written in a high-level language. Limitations imposed by IDA Pro (e.g. not analysing all indirect branches and calls) are mitigated by value-set analysis and also Affine Relations Analysis (ARA). This work appears to make the assumption that absolute addresses and offsets indicate the starting address of program variables. In other words, it does not perform separation of original and offset pointers (see Section 1.5.3). Various safety queries can be made via an API. Memory consumption is high, e.g. 737MB for analysing winhlp32.exe, which is only 265KB in size, and this was while a temporary expedient for calls to library functions was in use.
Starting in 2005, papers from the same university have been published featuring a third major analysis [BR05, RBL06]. Aggregate Structure Identification (ASI) provides the ability to recover record and array structure. A technique the authors call recency-abstraction avoids problems of low precision [BR06]; see also Section 4.6 on page 139. Decompilation is mentioned as a potential application for their tool, but does not appear to have been attempted. Traditional reverse engineering provides high level comprehension, starting with source code, while decompilation provides little high level comprehension, starting with machine code.
It might appear that decompilation therefore has much in common with traditional reverse engineering, but this is not the case. Traditional reverse engineering is performed on source code, usually with the aim of recovering a design or specification level view. Decompilation provides the basis for comprehension, maintenance and new development, i.e. source code, but any high-level comprehension is provided by the reader. For example, a decompiler can state that a variable is incremented by one at this point in the program, and that its type is unsigned integer; given the store a one (knowing a value is already zero) abstraction, traditional reverse engineering could add high-level comprehension such as the reason for the store.
A decompiler can readily deduce that a value is set to one by the combination of clearing and incrementing it; this is an example where decompilation provides a limited kind of comprehension. It is a mechanical deduction, but quite useful even so. Such mechanical deductions can be surprising at times (see for example the triple xor idiom in Section 4.3). If sufficient identities are preprogrammed into a decompiler, it seems plausible that quite subtle idioms could be recognised, e.g. that a value is tested for evenness when the LSB is isolated with an appropriate AND instruction and the result is tested against zero. For simpler cases, emitting mem++ or the like seems appropriate, leaving the determination of the higher level meaning to the reader.
2.6.11 Compiler Infrastructures

Compilers are becoming so complex, and the need for performance is so great, that several compiler infrastructures have appeared in the last decade. They offer the ability to research a small part of the compiler chain without having to write the whole compiler, along with visualisation tools, testing frameworks, and so on. None of these are designed with decompilation in mind, but they are all flexible systems. It is worthwhile therefore to consider whether any of them could become the basis for a machine code decompiler. Presumably, much less work would be required to get to the point where some code could be generated, compared to starting from scratch. Parts that could be used with little modification include the IR design, the C generators, simplification passes, translation into and out of SSA form, constant and copy propagation passes (which could be extended to expression propagation), dead code elimination, and probably many data flow analyses such as reaching definitions. Even the front ends could be used as a quick way to generate IR for experimentation.
2.6.11.1 LLVM
The Low Level Virtual Machine is an infrastructure for binary manipulation tools such as compilers and optimisers. It has also been used for binary to binary tools, such as optimisers. Its IR is in Static Single Assignment form (SSA form). Memory objects are handled only through load and store instructions, with an encoding scheme that allows an infinite number of virtual registers while retaining good code density. There are a few concessions for type and exception handling information, but fundamentally this IR is not suitable for representing the complex expressions needed for decompilation. It is possible that complex expressions could be generated only in the back ends, but this would duplicate the propagation code in all back ends. The propagation could be applied immediately before language specific back ends are invoked, but this would mean that the back ends require a different IR than the rest of the decompiler. Also, earlier parts of a decompiler (e.g. analysing indirect jump and call instructions) rely on propagation having generated complex expressions.
Figure 2.1: LLVM can be used to compile and optimise a program at various stages of
its life. From [Lat02].
2.6.11.2 SUIF2

The SUIF IR represents high level program concepts such as expressions and statements, and it is extensible, so it should be able to handle e.g. SSA phi nodes. However, authors of a competing infrastructure state that "SUIF [17] has very little support of the SSA form" [SNK+03]. There are tools to convert the IR to C, front ends for various languages, and various optimisation passes. At any stage, the IR can be serialised to disk, or several passes can operate in memory.

[Figure 2.2: Overview of the SUIF compiler infrastructure; the recoverable fragments mention Fortran, C and Java front ends, interprocedural analysis, parallelization and locality optimisation passes, Alpha and x86 back ends, and a note that the C++ OSUIF front end is incomplete.]
2.6.11.3 COINS

The COINS infrastructure has both a high level and a low level IR, and C code can be generated from either. The low level IR (LIR) appears to be at a suitable level for representing instruction semantics. LIR is not SSA based, but there is support to translate into and out of SSA form.
Operands are typed, but currently the supported types are basically only a size and a flag indicating whether the operand is an integer (sign is not specified) or floating point.
There is some extensibility in the IR design, so COINS may be able to be expanded beyond these basic types.

[Figure 2.3 content: the High-level Intermediate Representation (HIR) and symbol table feed a basic optimizer (data flow analysis, common subexpression elimination, dead code elimination), a parallelization module (loop analysis, coarse grain and loop parallelization) and an advanced optimizer (alias analysis, loop optimisation, partial redundancy elimination), before HIR is lowered to LIR.]
Figure 2.3: Overview of the COINS compiler infrastructure. From Fig. 1 of [SFF+05].
2.6.11.4 SCALE
Scale is a flexible, high performance research compiler for C and Fortran, written in Java [CDC+04, Sca01]. The data flow diagram of Figure 2.4 gives an overview. Again, there is a high and a low level IR; the low level IR is in SSA form. Expressions in the SSA CFG are tree based, hence it may be possible to adapt this infrastructure for research in decompilation.
[Figure 2.4: Overview of Scale: C source is parsed (from file or directly) to a control flow graph (CFG); alias analysis is performed; the CFG is converted to Static Single Assignment form; and optimisations are applied to the SSA CFG.]
2.6.11.5 GCC

As of GCC version 4.0, the GNU compiler collection includes the GENERIC and GIMPLE IRs, in addition to the RTL IR that has always been a feature of GCC [Nov03, Nov04]. RTL is a very low level IR, not suitable for decompilation. GENERIC is a tree IR where arbitrarily complex expressions are allowed. Compiler front ends can emit GENERIC, and a translator (the gimplifier) converts this to GIMPLE, a lowered version of GENERIC with three address representations for expressions. Alternatively, front ends can convert directly to GIMPLE. All optimisation is performed at the GIMPLE level, which is probably too low level for decompilation. GENERIC shows some promise, but few of the tools support it. Figure 2.5 shows the relationship of the various IRs.
[Figure 2.5 content: C and Java front ends produce language-specific trees, which are genericized to GENERIC and lowered to GIMPLE for the Tree SSA framework.]
Figure 2.5: The various IRs used in the GCC compiler. From a presentation by D.
Novillo.
There is a new SSA representation for aggregates called Memory SSA [Nov06], which is planned for GCC version 4.3. While primarily aimed at saving memory in the GCC compiler itself, it may also be relevant to the SSA representation of memory locations in a decompiler.
2.6.11.6 Phoenix
Figure 2.6: Overview of the Phoenix compiler infrastructure and IR. From [MRU07].
Phoenix is Microsoft's compiler infrastructure, intended as the basis of future compiler products. Figure 2.6 gives an overview, and an example of IR showing one line of source code split into two IR instructions. Phoenix has the ability to read native executable programs, but does not (as of this writing) come with a C generator. IR exists at three levels (high, medium, and low), plus an even lower level for representing data. Unfortunately, even at the highest level (HIR), expression operands can not be other expressions, only variables, memory locations, constants, and so on. The three main IR levels are kept as similar as possible to make it easier for phases to operate at any level. As a result, Phoenix would probably not be a suitable basis for a decompiler.
2.6.11.7 Open64
Open64 is a compiler infrastructure originally developed for the IA64 (Itanium) architecture. It has since been broadened to target x86, x64, MIPS, and other architectures. The IR is called WHIRL, and exists at five levels, as shown in Figure 2.7.
[Figure 2.7 content: front ends emit Very High WHIRL; the VHO/standalone inliner lowers aggregates, un-nests calls and lowers COMMAs/RCOMMAs to High WHIRL, where IPA, PREOPT and LNO operate; further lowering of ARRAYs, complex numbers, high level control flow, IO and bit-fields (and spawning nested procedures for parallel regions) produces Mid WHIRL for WOPT and RVI1; lowering intrinsics to calls, mapping data to segments, lowering loads/stores to final form and exposing code sequences for constants, addresses, $gp and static links produces Low WHIRL for RVI2; finally CG produces the machine instruction representation.]
Figure 2.7: The various levels of the WHIRL IR used by the Open64 compiler
infrastructure. From [SGI02].
At all IR levels, the IR is a tree, which is suitable for expressing high level language constructs. The higher two levels can be translated directly to C or Fortran by provided tools. Many of the compiler optimisations use the SSA form, hence there are facilities for transforming into and out of SSA form. This infrastructure would therefore appear to be a reasonable candidate for adaptation to decompilation. Open64 is free software licensed under the GPL, and can be downloaded from SourceForge.
One of the transformations needed by compilers, decompilers, and other tools is the simplification of expressions into canonical or simpler forms. Dolzmann and Sturm [DS97] give ideas on what constitutes simplest: few atomic formulae, small satisfaction sets, etc. They point out that some goals are contradictory, so at times a user may have to select from a set of options, or select general goals such as minimum generated output size or maximum readability. The field of decompilation probably has much to learn from such related work, to fine tune the simplification of its output.
Various forms of program protection aim to make reverse engineering more difficult, including decompilation; in most cases, such protection prevents effective decompilation. Obfuscation stores a program in machine code form in such a way as to make understanding of the program more difficult. (There is also source code obfuscation [Ioc88], but at present this appears to be mainly of recreational interest.)
A typical technique is to add code not associated with the original program, confusing the reverse engineering tool with invalid instructions, self modifying code, and the like. Branches controlling entry to the extra code are often controlled by opaque predicates, which are always or never taken, without it being obvious that this is the case. Another technique is to modify the control flow in various ways, e.g. turning the program into an interpreter for a custom instruction set. Protection, by contrast, is designed to hide the executable code of a program. Techniques include encrypting the instructions and/or data, using obscure instructions, self modifying code, and branching to the middle of instructions. Attempts to thwart dynamic debuggers are also common, but these do not affect static decompilation except to add unwanted code to the output.
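As a minimal illustration (not taken from any particular obfuscator; all names are invented), an opaque predicate can be as simple as the following, where the condition is always true but not obviously so:

    extern unsigned get_input(void);   /* hypothetical */
    extern void real_work(void);       /* hypothetical */
    extern void junk(void);            /* hypothetical */

    void obfuscated(void) {
        unsigned x = get_input();
        /* x*(x+1) is a product of two consecutive integers, hence always
         * even: the test is always true, but not obviously so to a tool.
         * Unsigned arithmetic keeps the parity argument valid on wrap. */
        if (((x * (x + 1u)) & 1u) == 0)
            real_work();               /* the real code, always taken */
        else
            junk();                    /* never executed */
    }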
The result of decompiling an obfuscated or protected program will depend on the nature of the modifications:
• If compiler specific patterns of machine code are merely replaced with different but equivalent instructions, the result will probably decompile perfectly well with a general decompiler.
• Where extraneous code is inserted without special care, the result could well be a program where much of the extra code is removed automatically as dead or unreachable code.
• Where extraneous code is inserted with special care to disguise its lack of reachability, the result will likely be a program that is correct and usable, but difficult to understand and costly to maintain unless most of the obfuscations are manually removed. Even though such removal could be performed at the source code level, any removal is likely to be both tedious and costly. Some obfuscations (e.g. creating irreducible control flow) are particularly resistant to removal.
• Where the majority of the program is no longer in normal machine language, but is instead encrypted, the result is likely to be only valid for the decrypting routine, and the rest will be invalid. Where several levels of encryption and/or protection are involved, the result is likely to be valid only for the first level of decryption.
• Where the majority of the program has been translated into instructions suitable for a custom interpreter, or a Just In Time (JIT) compiler is part of the executable program, the result will likely be valid only for the interpreter or JIT compiler, not for the program it executes.
As a research area, obfuscation has only received serious attention since the mid 1990s. Collberg et al. have published the seminal paper in this area, A Taxonomy of Obfuscating Transformations.
The difficulties presented by most forms of executable program protection indicate that decompilation of such programs is only practical where maintenance of the protected program is required (e.g. source code is lost, and the only available form of the program is a protected executable). Otherwise, program protection is largely unrelated to decompilation.
Chapter 3

Data Flow Analysis
Static Single Assignment form assists with most data flow components of decompilers, including identifying parameters and return values, deciding if locations are preserved, and eliminating dead code.
Data flow analysis and control flow analysis are two of the main classes of analyses for machine code decompilers, as shown in Figure 3.1. Control flow analysis, where the program is structured into conditionals and loops, is a largely solved problem. However, data flow analysis, where instruction semantics are transformed into more complex expressions, parameters and return values are identified, and types are recovered, still has significant potential for improvement.
#include <stdio.h>
/*              n      n!         n(n-1)...
 * Calculate   C  = --------- = --------------- (n-r terms top and bottom)
 *              r    r!(n-r)!   (n-r)(n-r-1)... */
int comb(int n, int r) {
if (r >= n)
r = n; /* Make sure r <= n */
double res = 1.0;
int num = n;
int denom = n-r;
int c = n-r;
while (c-- > 0) {
res *= num--;
res /= denom--;
}
int i = (int)res; /* Integer result; truncates */
if (res - i > 0.5)
i++; /* Round up */
return i;
}
int main() {
int n, r;
printf("Number in set, n: "); scanf("%d", &n);
printf("Number to choose, r: "); scanf("%d", &r);
printf("Choose %d from %d: %d\n", r, n, comb(n, r));
return 0;
}
Figure 3.2: Original source code for the combinations program.
Table 3.1: Overview of the x86 registers and instructions used in the examples.

%eax ... %edx, %esi,   Registers. %esp is the stack pointer; %ebp is the base
%edi, %ebp, %esp       pointer, usually pointing to the top of the stack frame.
%st, %st(1)            Top of the floating point register stack (often implied),
                       and next on the stack.
(r)                    Memory pointed to by register r.
disp(r)                Memory whose address is disp + the value of register r.
push r                 Push the register r to the stack.
                       Afterwards, the stack pointer esp points to r.
pop r                  Pop the value currently pointed to by esp to register r.
                       Afterwards, esp points to the new top of stack.
mov src,dest           Move (copy) the register or immediate value src to
                       register or memory dest.
cmp src,dr             Compare the register dr with the register or immediate
                       value src. Condition codes (e.g. carry, zero) are affected.
jle dest               Jump if less or equal to address dest. Uses condition
                       codes set by an earlier instruction.
lea mem,rd             Load the effective address of mem to register rd.
jmp dest               Jump to address dest. Equivalent to loading the program
                       counter with dest.
call dest              Call the subroutine at address dest. Equivalent to
                       pushing the program counter and jumping to dest.
test rs,rd             Bitwise and of the register or immediate value rs with
                       the register or memory rd; the result is not stored.
                       Condition codes are affected.
sub rs,rd              rd := rd - rs. Condition codes are affected.
sbb rs,rd              rd := rd - rs - carry/borrow. Condition codes are affected.
fidiv mem              The value at the top of the floating point stack (FPTOS)
                       is divided by the integer in memory location mem.
leave                  Equivalent to mov %ebp,%esp; pop %ebp.
ret                    Subroutine return. Equivalent to pop temp; jmp temp.
dec rd                 Decrement register rd, i.e. rd := rd - 1.
nop                    No operation (do nothing).
In the IR, a conditional branch instruction is represented by a branch statement with an expression such as m[esp-8] > 0 representing the condition for which the branch will be taken, and another expression (often a constant) which represents the destination of the branch in the original program's address space. esp is the name of a register (the x86 stack pointer), and m[esp-8] represents the memory location whose address is the result of subtracting 8 from the value of the esp register. Register and memory locations are generally the only entities that can appear on the left hand side of assignments. When locations appear in the decompiled output, they are converted to variables, named according to their form and how they are used. Expressions consist of locations, constants, or operators applied to other expressions. Most operators correspond to high level language operators such as addition and bitwise or. There are also a few low level operators, such as those which convert various sizes of integers to and from floating point representations.
Data flow analysis concerns the definitions and uses of locations. A definition of a location is where it is assigned to, usually on the left hand side of an assignment. One or more locations could be assigned to as a side effect of a call statement. A use of a location is where the value of the location affects the execution of a statement, e.g. in the right hand side of an assignment, or in the condition of a conditional branch. Just as compilers require data flow analysis to perform optimisation and good code generation, decompilers require data flow analysis for the various transformations described in this chapter.
To demonstrate the various uses of data flow analysis in decompilers, the running example of Figure 3.2 will be used for most of this and the following chapters. It is a simple program for calculating the number of combinations of r objects from a set of n objects:
$$ {}^{n}C_{r} \;=\; \binom{n}{r} \;=\; \frac{n!}{r!\,(n-r)!} \;=\;
\begin{cases}
  \dfrac{n(n-1)\cdots}{(n-r)(n-r-1)\cdots} & \text{if } n > r \quad \text{($n-r$ terms top and bottom)} \\[1ex]
  1 & \text{if } n = r \\
  \text{undefined} & \text{if } n < r
\end{cases} $$
Figure 3.3 shows the first part of a possible compilation of the original program for the x86 architecture. For readers not familiar with the x86 assembly language, an overview is given in Table 3.1. The disassembly is shown in AT&T syntax, where the last operand is the destination.
Data flow analysis is involved in many aspects of decompilation, and is a well known area of compiler theory [ASU86, App02]. Other tools such as link time optimisers also use data flow analysis [Fer95, SW93]. Early decompilers made use of data flow analysis (e.g. [Hol73, Fri74, Bar74]).
Expression propagation, introduced in Section 3.1, combines the semantics of several machine instructions into high level expressions. Too much propagation, however, can cause problems, as Section 3.2 shows. Dead code elimination, introduced in Section 3.3, reduces the bulk of the decompiled output; low bulk is one of the advantages of high level source code. One set of the machine code details that are eliminated includes the setting and use of condition codes; Section 3.3.1 shows how these are combined. The x86 architecture has an interesting special case with floating point compares, discussed in Section 3.3.2. One of the most important aspects of a program's representation is the way that calls are summarised. Section 3.4 covers the special terminology related to calls and the necessity of changing between caller and callee contexts, and discusses how parameters, returns, and global variables interact when locations are defined along only some paths. The various elements of call summaries are enumerated in Section 3.4.4 in the form of equations. Section 3.5 considers whether data flow analysis could or should be performed over the whole program at once, while Section 3.6 discusses safety issues related to the summary of data flow information.
Finally, architectures that have overlapped registers present problems that can be overcome with data flow analysis.

3.1 Expression Propagation

Expression propagation combines the effects of several simple statements into one more complex statement; there are two simple rules, yet difficult to check, for when it can be applied.
The IR for the first seven instructions of Figure 3.3 is shown in Figure 3.4(a). Note how statement 2 uses (in m[esp]) the value of esp computed in statement 1.
Let esp0 be a special register for the purposes of the next several examples. An extra
statement, statement 0, has been inserted at the start of the procedure to capture the
initial value of the stack pointer register esp. This will turn out to be a useful thing
to do, and the next chapter will introduce an intermediate representation that will do
this automatically.
Statement 0 can be propagated into every other statement of the example; this will not always be the case. For example, statement 1 becomes esp := esp0 - 4, and statement 2 becomes m[esp0-4] := ebp.
The rules for when propagation is allowed are well known. A propagation of x from an assignment x := e into a using statement u is valid only if that assignment is the only definition of x reaching u, and no component of e is redefined along any path from the assignment to u.
0 esp0 := esp
80483b0 1 esp := esp0 - 4
2 m[esp0-4] := ebp
80483b1 3 ebp := esp0-4
80483b3 4 esp := esp0 - 8
5 m[esp0-8] := esi
80483b4 6 esp := esp0 - 12
7 m[esp0-12] := ebx
80483b5 8 esp := esp0 - 16
9 m[esp0-16] := ecx
80483b6 10 tmp1 := esp0 - 16
11 esp := esp0 - 24
80483b9 13 edx := m[esp0+4] ; Load n to edx
(b) After expression propagation
Figure 3.4: IR for the rst seven instructions of the combinations example.
In a traditional IR, two separate analyses are required to check these rules. The first uses use-definition chains (ud-chains), based on reaching definitions, a forward-flow, any-path data flow analysis. The second requires a special purpose analysis called rhs-clear, a forward-flow, all-paths analysis [Cif94]. The next chapter will introduce an intermediate representation which makes both checks trivial.
Statement 3 also uses the value calculated in statement 1, so statement 1 is also propagated into statement 3. Statement 13 uses the value calculated in statement 3, and the conditions are still met, so statement 3 can be propagated into statement 13. The result is shown in Figure 3.4(b).
Note that the quantity being propagated in these cases is of the form esp0-K, where K is a constant; the quantity being propagated is an expression. This is expression propagation, and differs from the constant propagation and copy propagation performed in compilers, where the propagated statement has the form x := y and y is either a constant or another variable, but not a more complex expression. Performing expression propagation tends to make the result more complex, copy propagation leaves the complexity the same, and constant propagation usually makes the result simpler. Compilers transform complex source expressions into sequences of simple machine instructions. In other words, the overall progression is from the complex to the simple. Decompilers do the opposite; the progression is from the simple to the complex. Expression propagation rebuilds complex expressions from simple statements, in terms of fewer variables (locations). Copy and constant propagation are also useful for decompilers, since propagating them tends to reduce the need for temporary variables.
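The three kinds of propagation can be contrasted with a small invented C-like fragment (not from the running example); in each pair, the first statement is propagated into the second:

    x = 5;        y = x + 1;    /* constant propagation:   y = 5 + 1         */
    x = z;        y = x + 1;    /* copy propagation:       y = z + 1         */
    x = z*2 + w;  y = x + 1;    /* expression propagation: y = (z*2 + w) + 1 */

Only the last case makes the using statement more complex, which is why it is the variant that matters most for decompilation.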
Expression propagation, when applied repeatedly, tends to result in expressions that are in terms of the input values of the procedure. This is already evident in Figure 3.4(b), where the uses of seven different values for esp are replaced by constant offsets from one value (esp0). This tendency is very useful in a decompiler, because memory locations expressed canonically in terms of input values can readily be compared for equality.
Figure 3.5: Two machine instructions referring to the same memory location using different registers. /f is the floating point division operator, and (double) is the integer to floating point conversion operator.
Figure 3.5 shows two instructions from inside the while loop, which implement the division. In Figure 3.5(a), m[esp] is an alias for m[ebp-24]; however, this fact is not evident from the instructions themselves. After propagation, both memory expressions are in terms of esp0, so the two instructions are seen to refer to the same memory location, as they should. Treating them as separate locations could result in errors in the data flow information. This form of aliasing is more prevalent at the machine code level than in source code.
3.2 Limiting Expression Propagation

Figure 3.6 shows decompiler output where too much propagation has been performed. The result is needlessly difficult to read, because a large expression has been duplicated. The reader has to try to find a meaning for the complex expression twice, and it is not obvious from the code whether the condition is the same as the right hand side of the assignment on line 25. Figure 3.7 shows how the problem arises in general.
Figure 3.7: The circumstance where limiting expression propagation results in more
readable decompiled code.
Common subexpression elimination (CSE) might appear to solve this problem. However, implementing CSE everywhere possible effectively undoes all expression propagation. Instead, it may be desirable to undo certain propagations, e.g. where specified manually. For those cases, CSE could be used to undo specific propagations. While compilers perform CSE for more compact and faster machine code, decompilers limit or undo expression propagation to reduce the bulk of the decompiled code and improve its readability.
The problem with Figure 3.6 is that a large expression has been propagated to more than one use. Where this expression is small, e.g. max-1, this is no problem, and propagation is desirable. Similarly, when a complex expression has only one use, it is also not a problem. However, when a complex expression would be propagated to several uses in the decompiled code, it may be worth preventing the propagation. The threshold for such prevention depends on how complexity is measured. The main problematic propagations are therefore those which propagate complex expressions to more than one use. A simple measure of complexity will probably suffice, e.g. the number of operators is easy to calculate and is effective.
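A hedged sketch of such a metric, assuming a hypothetical binary expression tree type (not Boomerang's actual classes):

    /* Count the operators in an expression tree. Propagation into more
     * than one use could be suppressed when this count is too high. */
    typedef struct Exp {
        int op;                    /* 0 for leaves (locations, constants) */
        struct Exp *left, *right;  /* children; NULL when absent */
    } Exp;

    int count_operators(const Exp *e) {
        if (e == NULL || e->op == 0)
            return 0;
        return 1 + count_operators(e->left) + count_operators(e->right);
    }

A test such as count_operators(rhs) > 3 (the threshold is a tuning choice) would then veto propagation of a right hand side to multiple uses.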
3.3 Dead Code Elimination

Dead code elimination is facilitated by storing all uses for each definition (definition-use information).
Dead code consists of assignments for which the definition is no longer used; its elimination is called Dead Code Elimination (DCE). Dead code contrasts with unreachable code, which can be any kind of statement, to which there is no valid path from the entry point of the program. Propagation often leads to dead code. In Figure 3.4(b), half the statements are calculations of intermediate results that will not be needed in the decompiled output, and are therefore eliminated as shown in Figure 3.4(c). In fact, all the statements in this short example will ultimately be eliminated as dead code. DCE is also performed by compilers, but in a decompiler it removes considerable bulk and greatly increases the readability of the generated output.
As the name implies, dead code elimination depends on the concept of liveness information. If a location is live at a program point, changing its value (e.g. by overwriting it or eliminating its definition) will affect the execution of the program. Hence, a live variable can not be eliminated. In traditional data flow analysis, every path from a definition to the end of the procedure has to be checked to ensure that on that path the location is redefined before it is used again. The next chapter will introduce an intermediate representation with the property that any use anywhere in the program implies that the associated definition is live.
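With definition-use counts available, the core of DCE can be sketched as follows; the statement type and helpers are hypothetical, not an actual decompiler's API:

    /* Hypothetical statement node; real decompilers keep richer data. */
    typedef struct Stmt {
        struct Stmt *next;
        int is_assign;          /* nonzero for assignments */
        int num_uses;           /* uses of this statement's definition */
        int has_side_effects;   /* e.g. calls, stores through pointers */
    } Stmt;

    extern void decrement_operand_uses(Stmt *s);  /* hypothetical helper */
    extern void remove_stmt(Stmt *s);             /* hypothetical helper */

    /* Repeatedly remove assignments whose definitions are never used and
     * which have no side effects. Each removal decrements the use counts
     * of the operands, possibly exposing more dead code, so iterate to a
     * fixed point. Assume a sentinel head node that is never removed. */
    void eliminate_dead_code(Stmt *head) {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (Stmt *s = head->next; s != NULL; ) {
                Stmt *next = s->next;         /* s may be unlinked below */
                if (s->is_assign && s->num_uses == 0 && !s->has_side_effects) {
                    decrement_operand_uses(s);
                    remove_stmt(s);
                    changed = 1;
                }
                s = next;
            }
        }
    }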
Some instructions typically generate effects that are not used. For example, a divide instruction usually computes both the quotient and the remainder in two machine registers. Most high level languages have operators for division and remainder, but typically only one or the other operation is specified. As a result, divide and similar instructions routinely produce partially dead results.
Another machine specific detail which must usually be eliminated is the condition code register, also called the flags register or status register. Individual bits of this register are called condition codes, flags, or status bits. Condition codes are a feature of machine code but not of high level source code, so the actual assignments and uses of the condition codes should be eliminated. In some cases, it may be necessary for a flag to be visible, e.g. the carry flag as a local Boolean variable CF. In general, such cases should be avoided if at all possible, since they represent very low level details that programmers normally don't think about, and hence do not appear in normal source code.
In the example IR of Figure 3.4, the instruction at address 80483b6, a subtract immediate instruction, has the side effect of setting the condition codes. There is a third statement for this instruction (statement 12, omitted earlier for brevity) representing this side effect as an assignment to the abstract location %flags. As shown in Figure 3.8(b), the semantics implied by the SUBFLAGS macro are complex. Decompilers do not usually need the detail of the macro expansion. Other macro calls are used to represent the effect on condition codes after other classes of instructions, e.g. floating point instructions, or logical instructions such as a bitwise and. For example, add and subtract instructions affect the carry and overflow condition codes differently on most architectures.
In this case, none of the four assignments to condition codes are used. In a typical x86 program, there are many modifications of the condition codes that are not used; the semantics of these modifications are removed by dead code elimination. Naturally, some assignments to condition codes are not dead code, and are used by subsequent instructions.
Separate instructions set condition codes (e.g. compare instructions) and use them (e.g. conditional branch instructions). Specific instructions which use condition codes use different combinations of the condition code bits, e.g. one might branch if lower (carry set) or equal (zero set), while another might set a register to one if the sign bit is set. Hence the overall effect is the result of combining the semantics of the instruction which sets the condition codes, and the instruction which uses them. Often, these instructions are adjacent or near each other, but it is quite possible that they are separated by many instructions.

(a) Complete semantics for the instruction using the SUBFLAGS macro call.
Figure 3.8: The subtract immediate from stack pointer instruction (register esp) from Figure 3.4(a), including the side effect on the condition codes.
Data flow analysis makes it easy to find the condition code definition associated with each condition code use, and hence the semantics of that use. This combination can be used to define the semantics of the instruction using the condition codes (e.g. the branch condition is a ≤ b). After this is done, the condition code assignments are no longer necessary, and are eliminated as dead code. Condition code settings are represented in the IR as assignments to the %flags abstract location, as shown earlier in Figure 3.8(a).
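As a small worked example in the IR notation of this chapter (the statement numbers and registers are invented for illustration):

    21  %flags := SUBFLAGS(eax, ebx, tmp)   ; from cmp %ebx,%eax (tmp discarded)
    22  branch to dest if %flags            ; from jle dest

Propagating statement 21 into statement 22 and applying the JLE semantics (branch if signed less than or equal) gives

    22' branch to dest if eax <= ebx

after which statement 21 is dead, and is removed by dead code elimination.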
In general, after an arithmetic instruction (e.g. add, decrement, compare), there will be an assignment to %flags of a suitable macro call. Add instructions use the ADDFLAGS macro, which has slightly different semantics to the SUBFLAGS macro. Compare instructions do not store their result, so a temporary location is passed as the last operand of the SUBFLAGS macro. After a logical instruction such as a bitwise and, the assignment is %flags := LOGICALFLAGS(res).
Each branch statement carries a condition expression, essentially the expression that would be emitted within the parentheses of an if statement when decompiling to C. For integer branches, this expression is initially set to %flags, so that expression propagation will automatically propagate the correct setting of the integer condition codes to the branch statement. Floating point branches initially set the condition to %fflags (representing the floating point flags), so that interleaved integer and floating point condition code producers (e.g. compare instructions) and consumers do not interfere with one another. During propagation, simplification rules such as x + 0 = x are applied. Combining condition code definitions and uses can thus be treated as ordinary expression propagation.
The general method of using propagation to combine the setting and use of condition codes to form conditional expressions is due to Trent Waddington, who first implemented the method in the Boomerang decompiler [Boo02]. It is able to handle various control flow configurations, such as condition code producers and consumers appearing in different basic blocks.
There is a special case where the carry flag can be used explicitly, as part of a subtract with borrow instruction. Figure 3.10 from the running example shows a subtract with borrow instruction used to compute a conditional value without branching. The essence of the sequence is to perform a compare or subtract which affects the condition codes, followed by a subtract with borrow of a register from itself. The carry flag (which also stores borrows from subtract instructions) is set if the first operand of the compare or subtract is unsigned less than the second operand. The result of the subtract with borrow is therefore either zero (if no carry/borrow) or -1 (if carry/borrow was set). This value can be anded with an expression to compute conditional values without branch instructions. In the example of Figure 3.10, ebx is set to r-n, and the carry flag is set if r<n and cleared otherwise. Register eax is subtracted from itself with borrow, so eax becomes -1 if r<n, or 0 otherwise. After eax is anded with ebx, the result is r-n if r<n, or 0 otherwise. Finally, n is added to the result, giving r if r<n, or n otherwise.
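The whole sequence can be summarised as a hedged C equivalent (variable names follow the registers; this is an illustration, not actual decompiler output):

    int clamp(int r, int n) {          /* illustrative name */
        int ebx = r - n;               /* cmp/sub: borrow (carry) set iff r <u n */
        int eax = (r < n) ? -1 : 0;    /* sbb %eax,%eax: all ones iff borrow set */
        return (eax & ebx) + n;        /* (r-n if r<n, else 0) + n == min(r, n)  */
    }

so the sequence computes r < n ? r : n without any branch instruction.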
The carry flag can be treated as a use of the %flags abstract location, so that appropriate condition code operations will propagate their operands into the %CF location. During propagation, a special test is made for the combination of propagating %flags = SUBFLAGS(op1, op2, res) into %CF; the condition is then set to op1 <u op2. The result is correct, compilable, but difficult to read code, as shown in Figure 3.10(e).
Readability can be restored with simplification rules such as (p ? x : y) + c ⇒ p ? x+c : y+c.
The above technique is not suitable for multi-word arithmetic, which also uses the carry flag.
if (r > n)
    r = n; /* Make sure r <= n */
(a) Source code
Figure 3.10: Code from the running example where the carry flag is used explicitly.
Figure 3.10: (continued). Code from the running example where the carry flag is used explicitly.

The reader may question why the semantics for flags are not substituted directly into the expressions for the branch conditions, deriving the high level branch condition from first principles each time. For example, the instruction commonly known as branch if greater or equals actually branches if the sign flag is equal to the overflow flag. It is possible to substitute the expressions for %NF and %OF from Figure 3.8(b) and to prove, with enough simplification machinery, that the combination of op1 - op2 (assuming flags are set) and JGE implies that the branch is taken exactly when op1 ≥ op2. After the substitution, %flags is unused, so that standard dead code elimination will remove it. In summary, expression propagation makes it easy to combine condition code setting and use. Special propagation or simplification rules extract the semantics of the condition code uses.
3.3.2 Floating Point Compares

Modern processors in the x86 series are instruction set compatible back to the 80386, and even the 80286 and 8086, where a separate, optional numeric coprocessor was used (the 80387, 80287 or 8087 respectively). Because the CPU and coprocessor were in separate physical packages, there was limited access from one chip to the other. In particular, floating point branches were performed with the aid of the fnstsw (floating point store status word) instruction, which copied the floating point status word to the ax register of the CPU. Machine code for the floating point compare of Figure 3.2 is shown in Figure 3.11.
The upper half of the 16-bit register ax is accessible as the single byte register ah. The floating point status bits are masked off with the test $0x45,%ah instruction.
Figure 3.11: 80386 code for the floating point compare in the running example.
The expression propagated into the branch is (SETFFLAGS(op1, op2, res) & 0x45) != 0. Similar logic to the integer propagation is used to modify this to op1 >=f op2, where >=f is the floating point greater-or-equals operator. The result is a quite general solution that eliminates most of the machine dependent details of the above floating point compare code. The test $0x45,%ah is often replaced with sequences such as and $0x45,%ah; xor $0x40,%ah or and $1,%ah. Some compilers emit a jp instruction (jump if the parity of the result is even) after a test or and instruction. This can save a compare instruction. For example, after test $0x41,%ah and jp dest, the branch is not taken if the floating point result is less (result of the test instruction is 0x01) or equal (0x40), but not uncomparable (0x41). (Two floating point numbers are uncomparable if one of them is a special value called a NaN (Not a Number).) The exact sequence of instructions found depends on the compiler and on the floating point comparison operator being implemented.
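For reference, a short sketch of the relevant status bits as they appear in %ah after fnstsw %ax (the standard 8087 condition code layout; the macro names in the comments are invented):

    /* After fcom ; fnstsw %ax, the FPU condition codes land in %ah as: */
    #define FP_C0 0x01    /* set when op1 <  op2                        */
    #define FP_C2 0x04    /* set when unordered (a NaN was compared)    */
    #define FP_C3 0x40    /* set when op1 == op2                        */
    /* test $0x45,%ah masks C3|C2|C0: the result is zero iff op1 > op2,
     * which is what (SETFFLAGS(op1, op2, res) & 0x45) != 0 captures
     * before simplification. */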
3.4 Summarising Calls

The effects of calls are best summarised by the locations modified by the callee, and the locations it uses as parameters.
If data flow analysis is performed on the whole program at once, the effects of called procedures are automatically incorporated into their callers; Section 3.5 will consider this possibility. Assuming that the whole program is not analysed at once, a summary of the data flow effects of each call is needed. Information has to be transmitted from a caller to and from a callee (e.g. the number and types of parameters), and this information may change as the decompilation proceeds. This raises some issues with respect to the context of callers and callees, which are discussed below. Finally, the various call-related quantities of interest can be concisely defined by a set of data flow equations.
Call instructions, and the call statements that result from them in a decompiler's intermediate representation, carry a surprising amount of data flow information.
By comparing the source and machine code in Figures 3.2 and 3.3, the instructions at addresses 80483b9 and 80483bc are seen to load parameters from the stack and from a register respectively. To prevent confusion, a strict distinction is made between the call-related terms defined below.
Since both edx and eax are modified by the procedure, they are called modifieds. The value or values returned by a procedure are called returns (used as a noun). It is possible at the machine language level that more than one value is returned by a procedure, hence the plural form. Unlike the return values of high level languages, at the machine code level, returns are assignments. Thus, there are return locations (usually registers) and return expressions (which compute the return values). In this program, only the value in eax is used, so only eax is called a return of the procedure comb and a result of the call to it. If and only if a call has a result will there be either an assignment to the result in the decompiled output, or the call will be used as part of an expression. Figure 3.12 shows IR for part of the running example, illustrating several of the terms defined above.
Much of this structure is entirely nonexistent at the machine code level. The two loads from the parameters are similar to a load from a local variable and a register to register copy. Assignments to return locations are indistinguishable from ordinary assignments. The fact that register eax is used by some callers, and register edx is not used by any callers, is not visible at all by considering only the IR of the called procedure. The locations used to store parameters and return locations are a matter of convention, called the Application Binary Interface (ABI); there are many possible conventions. While most programs will follow one of several ABI conventions, some will not follow any, so a decompiler cannot blindly rely on them.
Most of the time, a location that is used before it is defined implies that it is a parameter. However, in the example of Figure 3.3, registers esi and ebx are used by the push instructions, yet they are not parameters of the procedure, since the behaviour of the program does not depend on the values of those registers at the procedure's entry point. The registers esi and ebx are said to be preserved by the procedure. Register ecx is also preserved, however the procedure does depend on its value; in other words, ecx is a preserved parameter. Care must be taken to separate those locations used before being defined which are preserved, parameters, or both. This can be surprisingly difficult. For example, consider statements 5, 7 and 9 of Figure 3.4. These save the preserved registers, and are ultimately dead code. They are needed to record the fact that comb preserves esi, ebx and ecx, but are not wanted in the decompiled output. When the preservation code is eliminated, the only remaining locations that are used before being defined are the true parameters.
Reference parameters are an unusual case. Consider for example a procedure that takes
only r as a reference parameter. At the machine code level, its address is passed in
a location, call it ar. In a sense, the machine code procedure has two parameters, ar
(since the address is used before definition) and r itself, or *ar (since the referenced
object and its elements are used before being defined). However, the object is by
definition in memory. There is no need to consider the data flow properties of memory
objects; the original compiler (or programmer) has done that. Certainly it is important
to know whether *ar is an array or structure, so that its elements can be referenced
appropriately, but this can be done by defining ar with the appropriate type (e.g. as
being of type Employee* rather than void*). It is quite valid for the IR to keep the
parameter as a pointer until very late in the decompilation. It could be left to the
back end or even to a post decompilation step to convert the parameter to an actual
reference parameter.
Parameters and returns are expressed in terms of locations in the callee, so the difference
in context requires some substitutions to obtain expressions which are valid
in the caller, and vice versa.
When the callee for a given caller is known, the parameters are initially the set of
variables live at the start of the callee. These locations are relative to the callee, and
not all can be used at the caller without translation to the context there. Figure 3.13
shows the relevant IR: in main, argument n is stored to a stack location and argument r is passed
in register ecx. In the callee, parameter n is read from m[esp0+4], while r is read
from register ecx. (The memory location m[esp0] holds the return address in an x86
procedure.) Note that these memory expressions are in terms of esp0, the stack pointer
on entry to the procedure, but the two procedures (main and comb) have different
initial values for the stack pointer. Without expression propagation to canonicalise the
memory expressions in the callee, the situation is even worse, since parameters could be
accessed at varying offsets from an ever changing stack pointer register, or sometimes
at offsets from the stack pointer register and at other times at offsets from the frame
pointer register.
Since it will be necessary to frequently exchange location expressions between callers and
callees, care needs to be taken that this is always possible. For example, if register ecx
in main is to be converted to a local variable, say local2, it should still be accessible as the
original register ecx, so that it can be matched with a parameter in callees such as comb.
Similarly, if ecx in comb is to be renamed to local0, it should still be accessible as ecx for
the purposes of context conversion. Similar remarks apply to memory parameters such
as m[esp0-64] in main and m[esp0+4] in comb. One way to accomplish this is to not
rename locations in the IR at all, but instead to maintain a mapping between the register or memory location and the local variable
name to be used at code generation time (very late in the decompilation). Most of
the time the IR is used directly, but at code generation time, and possibly in some IR
displays, the mapping is applied, as sketched below.
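A minimal sketch of this renaming-at-the-boundary policy (hypothetical names; not Boomerang's actual data structures):

    # IR locations keep their machine names; a side table supplies the
    # local variable name only when emitting code.
    names = {"ecx": "local2", "m[esp0-64]": "n"}

    def render(loc):
        # fall back to the machine name if no mapping exists
        return names.get(loc, loc)

    print(render("ecx"))   # emitted as local2 in the output code
    print(render("edx"))   # still edx, both in the IR and when matching

Because the IR itself is never renamed, matching a caller's ecx against a callee's ecx parameter remains a simple equality test.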
Note that while arguments and the values returned by procedures can be arbitrary
expressions, the locations that they are passed in are usually quite standardised. For
example, parameters are usually passed in registers or on the stack; return values are
usually located in registers. Hence, as long as registers are not renamed, the main
problem is with the stack parameters, which are usually expressions involving the stack
pointer register.
Consider argument n in main in Figure 3.13. It has the memory expression m[esp0-64].
Suppose we know the value of the stack pointer at the call to comb; here it is espcall
= esp0-68 (in the context of main; in other words, the value of the stack pointer at
the call is 68 bytes less than the initial value at the start of main). This is reaching
definitions information; it turns out to be useful to store this information at calls (for
this and several other purposes). It is assumed that if the caller pushes the return
address, this is part of the semantics of the caller, and has already happened by the
time of the call. The call is then purely a transfer of control, and control is expected
to return to the instruction following the call.
Hence, esp0 (at the start of comb) = espcall+68. Substituting this equation into
the memory expression for n in main (m[esp0-64]) yields m[espcall + 68 - 64] =
m[espcall + 4]. Since the value of the stack pointer at the call is the same as the value
at the top of comb, this is the memory expression for n in terms of esp0 in comb; in
other words, in the context of comb this expression is m[esp0+4], just as the IR in
Figure 3.13(b) indicates. Put another way, it is possible to generate the expression
(here m[esp0+4]) that is equivalent to the expression valid in main (here m[esp0-64])
for a stack location, so that it is possible to search through the IR for a callee to find
important information. A similar substitution can be performed to map from the callee
context to the caller context. When the number of parameters in a procedure changes,
machine code decompilers need to perform this context translation to adjust the number
of arguments at each call site; the arithmetic is sketched below.
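The substitution itself is simple offset arithmetic. A minimal sketch (hypothetical helper names; the constants are those of the running example):

    # m[esp0_caller + off] in the caller corresponds to
    # m[esp0_callee + off - sp_delta] in the callee, where
    # sp_delta = espcall - esp0_caller at the call (-68 for main).
    def caller_to_callee(off, sp_delta):
        return off - sp_delta

    def callee_to_caller(off, sp_delta):
        return off + sp_delta

    print(caller_to_callee(-64, -68))  # 4:  m[esp0-64] in main is m[esp0+4] in comb
    print(callee_to_caller(4, -68))    # -64: and back again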
Figure 3.14 shows parts of the running example extended with some assignments and
uses of global variables. It shows a potential problem that arises when locations are
defined by a callee along only some paths.
In this example, g and edi are global variables, with edi being a register reserved for this
global variable. Both g and edi are found to be modified by comb. If the assumption
is made that locations defined by a call kill reaching definitions and livenesses, the
definitions of g and edi in main no longer reach the use in the printf call. As a result,
the definitions are dead and would be removed as dead code. This is clearly wrong,
since if n ≠ r and c ≠ 0, then the program should print g = 2, edi = 3 rather than
undefined values.
In the callee, the register edi is naturally represented by a
local variable. It could also be represented by a global variable, but at the machine code
level, there are few clues that the register behaves like a global variable. The two ways
that the modification of a local variable can be communicated to its callers are through
a reference parameter, or through a return; a reference parameter is
equivalent to the location being both a parameter and a return of the procedure. In
the IR used here, arguments of procedure calls are pure uses, and only returns define locations at a call
statement.
Location edi therefore becomes a return of procedure comb. Since comb already returns
i, this will be the second return for it. Local variable edi is not defined before use on
the path where n=r; in other words, the value returned in edi by comb depends on the
value before the call, hence edi also becomes a parameter of comb.
Conveniently, making edi an actual argument of comb means that the definition of edi
is no longer unused (it is used by one of the arguments of the call to comb).
Unfortunately, this solution is not suitable for the global variable, g. In the example,
there would be code similar to g = comb(g, ...), and the parameter would shadow
the global inside the procedure comb. This has the following problems:

• It implies extra copies: from the global to the parameter, from the
parameter to the return location, and from the return location back to the global.
Programmers do not write code this way, so the result is less readable.

• The global may be aliased with some other expression (e.g. another reference
to the same memory location), so the extra copies could change the program's behaviour.
The problem can be solved by adopting a policy of never allowing globals to become
parameters or returns, and never eliminating definitions to global variables. The latter is
probably required for alias safety in any case. The above discussion can be summarised
as:

Proposition 3.1: Locations assigned to on only some paths become parameters and
returns.
Using these extra constraints causes the definitions for g and edi of Figure 3.14 to
remain, although for different reasons. The program decompiles correctly, as shown in
Figure 3.15.
The decompiled program does not retain edi as a global register. It could be argued
that a decompiler should recognise that such variables are global variables. The only
clue at the machine code level that a register is global would be if it is not defined
before being passed to every procedure that takes it as a parameter, perhaps initialised
in a compiler generated function not reachable from main. In the above example, this
does not apply since the parameter is defined before the only call to comb.
Decompilation can sometimes expose errors in the original program. For example, a
program could use an argument passed to a procedure call after the call, assuming that
the argument is unchanged by the call. This could arise through a compiler error, or
a portion of the program written in assembly language. The decompiled output will
make the parameter a reference parameter, or add an extra return, which is likely to
make the error more visible than in the original source code. In a sense, there is no such
thing as an error in the original program as far as a decompiler is concerned; its job is
to reproduce the original program, including all its faults, in a high level language.
Proposition 3.2 states that global memory variables are not suitable as parameters and
returns. This proposition motivates the definition of filters for parameters and returns.
These filters can be thought of as sets of expressions representing locations, which can
be intersected with other sets of locations; locations that satisfy the filter criteria (i.e.
are not filtered out) are referred to as suitable.
Local variables are not filtered from potential parameters for the following reason.
Some parameters are passed on the stack; architectures that pass several parameters
in registers pass subsequent parameters, if any, on the stack. Stack locations used
for parameters are equivalent to local variables in the caller. They are both located
at offsets (usually negative) from the stack pointer value on entry to the procedure
(esp0 in the running example). A machine code decompiler will typically treat local
variables and stack locations used to pass procedure arguments the same, and eliminate
assignments to the latter as dead code. For simplicity, no distinction will be made
between them here.
The main data flow sets of interest concerning procedures are as follows.

• Potential parameters of a procedure are the suitable locations that are used before
being defined in the procedure. They are potential parameters, because preserved locations often appear to be parameters
when they are not. Such locations will be removed from the parameters
once they are proved to be preserved.

• Initial argument locations of a call c are the potential parameters of the callee,
translated into the context of the caller.
Arguments are best represented as a kind of assignment, with a left hand side
(the argument location) and a right hand side (the actual argument expression).
The argument location has to be retained for translating to and from the callee
context; parameters, by contrast, are represented by only
one location, the register or memory location that they are passed in.

• Modifieds of a procedure are the definitions reaching the exit, except for preserved
locations. The modifieds are a superset of returns: not all defined locations will be used before definition in any
caller.

• Defines of a call are the definitions reaching the exit of the callee, translated into
the context of the caller.

• Results of a call are the suitable defines that are also live at the call.

• Returns of a procedure are the suitable locations modified by the procedure which
are live at some call to it:

return-locations(p) = modifieds(p) ∩ ret-filter ∩ ⋃_{c calls p} live(c)        (3.11)

Returns, like arguments, are best represented as assignments, with a left hand side
containing the return location, and the right hand side being the actual return
expression. Results, like parameters, are represented by only one expression, the
location that the result is passed in.

In all but the first case, context translation from the callee to the caller or vice versa is
required; an executable sketch of these sets is given below.
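The following sketch (hypothetical data structures, with locations named by strings and the filters reduced to a single unsuitable set; values are those of the running example) shows these definitions in executable form:

    live_on_entry = {"comb": {"ecx", "m[esp0+4]", "esp"}}
    reaching_exit = {"comb": {"eax", "edx", "esi", "ebx", "ecx", "esp"}}
    preserved     = {"comb": {"esi", "ebx", "ecx", "esp"}}
    live_at_call  = {"call1": {"eax"}}     # locations live after each call
    UNSUITABLE    = {"esp"}                # stand-in for the filters

    def suitable(locs):
        return {l for l in locs if l not in UNSUITABLE}

    def potential_parameters(p):
        return suitable(live_on_entry[p])

    def modifieds(p):
        return reaching_exit[p] - preserved[p]

    def return_locations(p, calls):        # Equation 3.11
        live_any = set().union(*(live_at_call[c] for c in calls))
        return suitable(modifieds(p)) & live_any

    print(sorted(potential_parameters("comb")))         # ['ecx', 'm[esp0+4]']
    print(sorted(return_locations("comb", ["call1"])))  # ['eax']

Note that the preserved parameter ecx survives as a potential parameter, as described above, until preservation analysis removes it.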
Potential return locations are based on reaching definitions at the exit of the procedure.
It might be thought that available definitions (statements whose left hand side is defined
on all paths) should be used instead of reaching definitions, since it seems reasonable
that a return value should be defined along all paths in a procedure. However, as shown
in the example of Figures 3.14 and 3.15, suitable locations that are defined along only
some paths must still be considered as potential returns.
The above equations assume that a callee procedure can be found for each call. In some
cases, this is not true, at least temporarily. For example, an indirect call instruction may
not yet be resolved, or recursion may prevent knowledge of every procedure's callees.
In these cases, the conservative approximation is made that the callee uses and defines
every location of interest; for example, the defines of such a call can be taken as
live-at-call(cc), where live-at-call(cc) is the set of locations live at the childless call statement cc
(a call whose callee is not yet known).
The stack pointer register is used before it is defined in all but the most trivial of
procedures. For all other locations, use before definition of the location (after dead
code is eliminated) implies that the location is a parameter of the procedure. In the
decompiled output, the stack pointer must not appear as a parameter, since high level
languages do not mention the stack pointer. It is constructive to consider why the stack
pointer is an exception to the general parameter rule, and whether there may be other
exceptions.
The value of the stack pointer on entry to a procedure affects only one aspect of the
program: which set of memory addresses are used for the stack frame (and the stack
frames for all callees). By design, stack memory is not initialised, and after the procedure
exits, the values left on the stack are never used again. The stack is therefore a
variable length array of temporary values. The value of the stack pointer at any instant
is not important; only the relative addresses (offsets) within the stack array are important,
to keep the various temporary values separated from one another, preventing
unintended overwriting.
The program will run correctly with any initial stack pointer value, as long as it points
to an appropriately sized block of available memory. In this sense, the program does
not depend on the value of the stack pointer. This is the essence of the exception to
the general parameter rule.
Some programs reference initialised global memory through offsets from a register. One
example of this is Motorola 68000 series programs, which use register A5 as the pointer
to this data. Another example is x86 programs using the ELF binary file format.
These programs use the ebx register to point to a data section known as the Global
Offset Table (GOT). These special registers would also appear to be parameters of all
procedures, yet the specific value of these registers is unimportant to the running of the
program, and should not appear in the decompiled output. In these cases, however, it is
probably better for the front end of the decompiler to assign an arbitrary value to these
special registers. Doing this ensures that all accesses to scalar global variables are of
the form m[K] where K is a constant. If this is done, the question of whether the special
register is a parameter does not arise, because expression propagation
will transform all the scalar memory addresses to the m[K] form.
It might be considered that giving the stack pointer an initial arbitrary value would
avoid the need for any exception. However, when a procedure is called from different
points in the program, it can be passed different values for that procedure's initial stack
pointer. Worse, in a program with recursion, the value of the stack pointer depends on
the dynamic behaviour of the program. Rather than causing the stack pointer to take
on simple constant values, such a scheme would, for many programs, generate extremely
complex stack pointer expressions.
In summary, the stack pointer is the only likely exception to the general rule that a
location used before definition after dead code elimination in a procedure implies that
the location is a parameter.
Decompilers could treat the whole program as one large, global (whole-program) data
flow problem, but the problems with such an approach may outweigh the benefits.
Some analyses are approximate until global analyses can be performed over the IR for
all procedures. In the running example, the procedure comb modifies registers eax and
edx. It is not possible to know whether these modified registers are returning a value
until all callers are examined. Imagine that comb is used in a much larger program,
and there are a hundred calls to it. No matter what order the callers are processed, it
could be the case that the last caller processed will turn out to be the first one to use
the result computed in (e.g.) register edx. Other analyses also require the IR for all
procedures. For example, type analysis is global in the sense that a change to the type
of one location can affect types in other procedures.
Since the IR for all procedures has to be available for a complete decompilation, one
approach would be to treat the whole program as a single whole-program data flow
problem, with interprocedural edges from callers to callees, and from the exit of callees
back to the basic block following the caller. The solution of data flow problems is usually
iterative, so a global data flow analysis would automatically incorporate definitions and
uses from callees into their callers.
As noted by Srivastava and Wall [SW93], a two phase algorithm is required, to prevent
impossible data flow (a kind of leakage of data flow information from one call to
another). So-called context insensitive interprocedural data flow ignores this problem,
and suffers a lack of precision as a result [YRL99]. Figure 3.16 shows a program with
two procedures similar to main in the running example (main1 and main2) calling
comb, which is the same as in the running example. This program illustrates context
sensitivity.
Two callers are passing different arguments, say comb(6, 4) and comb(52, 5). The
second argument, passed in register ecx, is 4 for the first caller, and 5 for the second.
Without the two stage approach, all interprocedural edges are active at once, and it is
possible for data flow information to traverse from basic block b1 (which assigns 4 to
ecx), through procedure comb, to basic block b4 in caller 2, which prints among other
information the value of ecx; caller 2 should see only the definition in b2 reaching that
use (with the value 5). Similarly, data flow information from b2 can imprecisely flow to b3.
The solution to this lack of precision requires that the data flow be split into two phases.
Before the start of the first phase, some of the interprocedural edges are disabled.
Between the first and second phases, these edges are enabled and others are disabled.
In [SW93], only backward-flow problems were considered, but the technique can be
extended to forward-flow problems. The whole-program approach has several disadvantages:

• Resources: the interprocedural edges consume memory, and their activation and
deactivation consumes time. In addition, there are many more definitions for call
parameters, since they are defined at every call. Every parameter will typically
have from two to several or even hundreds of calls. Special locations such as the
stack pointer are defined at virtually every call.

• There is no need for the two phase algorithm in the non-global analysis, with its
associated costs.

• Since some programs to be decompiled require a large machine to execute them, it
may become desirable to partition the decompilation of large
programs so that parts of it can be run in parallel with cluster or grid computing.

• Procedures are inherently designed to be separate from their callers; the non-global
approach naturally mirrors this separation.

• The number of definitions for a parameter can become very large if a procedure
is called from many points in the program. So there can be problems scaling the
global approach to large programs.
Factors that do not significantly favour one approach over the other include:

• With the non-global approach, parameters are identified as locations used
that have no definition. However, with the global approach, parameters could be
identified as locations that have some or all definitions in procedures other than
the one being analysed.

• The saving and restoration of preserved locations become dead code when context
is taken into account, preventing preserved
locations becoming extra parameters and returns, although recursion causes extra
problems (Sections 4.3 and 4.4). With the global approach, definitions only used
outside the current procedure remain visible as live (returns are used by the return statement in the
caller's context).
Factors favouring the global approach include:

• Simplicity: the data flow problems are largely solved by one, albeit large, analysis.

• There is no need to summarise the effects of calls, and any imprecision that may
arise from such summaries is avoided.

• There are problems that result from recursion, which do not arise with the global
approach. However, these are solvable for the non-global approach, as will be
shown in Sections 4.3 and 4.4.

• Information has to be transmitted from the callee context to the caller context,
and vice versa, in the non-global approach. Section 3.4.2 showed that this is easily
performed.
The safe direction of approximation depends on the
application; for decompilers, it is safe to overestimate both definitions and uses, with
a few exceptions discussed below.
Some aspects of static program analysis are undecidable or difficult to analyse, e.g.
whether one branch of an if statement will ever be taken, or whether one expression
aliases with another. As a result, there will always be a certain level of imprecision
in any data flow information. Where no information is available about a callee, the
lack of precision of data flow information near the associated call instructions could
temporarily be extensive, e.g. every location could be assumed to have been assigned
to by the call.
Safety here has the sense of ensuring that approximations lead at worst to lower readability,
and not to incorrect output (programs whose external behaviour differs from that of the input binary
program).
Less readable output is obviously to be preferred over incorrect output. In many cases,
if the analyses are sophisticated enough, low quality output will be avoided, as temporarily
imprecise information is refined, redundant code
can be removed by a later analysis, and so on. Where information is incomplete, approximations
are necessary to prevent errors. Errors usually cannot be corrected later, but
overestimation of the elementary data flow quantities (e.g. definitions, parameters, etc.) is safe. For example,
definitions over which there is some uncertainty could be counted as actual definitions,
or definitions could be divided into three categories (definite, possible, and absent).
Using the three categories is more precise, but consumes more memory and complicates
the analyses.
Recall that an expression can safely be propagated from a source statement to a destination
statement if there are no definitions to locations that are components of the source
statement (Section 3.1). If definitions are underestimated (the algorithm is not aware
of some definitions), then a propagation could be performed which does not satisfy this
condition, producing an incorrect program; such propagations must be prevented,
or at least be delayed until later changes are made to the IR of the program. This
costs at worst some readability.
Definitions that reach the exit of a procedure are potential return locations. Underestimating
definitions would cause some such definitions to not become returns, resulting
in some semantics of the original program not being contained in the decompiled
output. Hence, underestimating definitions is unsafe.
Locations that are live (used before being defined) are generally parameters of a procedure
(the stack pointer exception was discussed earlier). Underestimating uses would result in
too few parameters for procedures, hence underestimating uses is unsafe. Overestimating
uses may result in overestimating livenesses, which may result in unnecessary
parameters, at a cost only in readability.
Dead code elimination depends on an accurate list of uses of a definition. If uses are
overestimated, then some statements that could be deleted would be retained, resulting
in lower quality but correct decompilations. If uses are underestimated, definitions could
be eliminated while they are still used, resulting in incorrect output. Hence, underestimation
of uses is unsafe here as well.
Unfortunately, overestimation of one data flow quantity can cause the underestimation
of another. In particular, definitions kill liveness (the state of being live), leading to
potentially underestimated uses. Figure 3.17(a) shows a definition and use with a call
between them; the call has an imprecise analysis which finds that x may be defined
by the call (i.e. its definitions are overestimated). The arrows represent the liveness of x flowing from
the use to the definition. This situation arises commonly when it is not known whether
a location is preserved or not, either because the preservation analysis has failed, or
because the analysis can not be performed yet (e.g. as a result of recursion).
Figure 3.17: A definition and a use of x separated by a call that may define x. In (a), x appears to be unused, underestimating the uses of x; in (b), x is used.
With the data flow as shown in Figure 3.17(a), the definition of x appears to be unused,
hence it appears to be safe to eliminate the definition as dead code. In reality, x is used,
and the removal would be incorrect. The solution to this problem is as follows: inside
calls, every non-global location whose definition is overestimated (such as x in the
example) is made an argument of the call as well as a return; hence the original definition of x is still used, and it will
not be eliminated. A sketch of this fix follows.
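A minimal sketch of the fix (hypothetical call representation using plain sets; not Boomerang's actual code):

    def patch_call(call, overestimated_defs, globals_):
        # every non-global location whose definition by the call is only
        # a guess becomes both an argument (use) and a return (definition)
        for loc in overestimated_defs:
            if loc not in globals_:
                call["uses"].add(loc)     # keeps the earlier definition live
                call["defines"].add(loc)  # conservatively redefines it below

    call = {"uses": set(), "defines": set()}
    patch_call(call, {"x"}, globals_=set())
    print(call)   # {'uses': {'x'}, 'defines': {'x'}}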
This is similar to the problem of Figure 3.14; in that example, the location was known
to be defined, but only along some paths. If x is a global variable, the above would not
apply, and the uses of x would in fact be underestimated above the call. However, by
Proposition 3.3, definitions of globals are never eliminated, hence the underestimation is
safe. Global variables must not be propagated past calls to procedures that may define
them, but this is guaranteed by the fact that such globals appear in the modifieds list
for the call. However, these modifieds never become returns because of the return filter
of Equation 3.5.
Parameters act as definitions when there is no explicit definition before a use in a procedure.
As shown in Figure 3.18, a similar situation exists with procedure parameters,
and the solution is the same: overestimate the uses as well as the definitions in the call.
This adds some clutter to the intermediate representation, but representing explicit side effects produces a correct program, and dead code
elimination later removes most of the unwanted statements.
CISC architectures (Complex Instruction Set Computers) often provide the ability to
operate on operands of various sizes, e.g. 8-, 16-, 32- and even 64-bit values. Figure
3.19 shows the x86 register eax, whose lower half can be accessed as the register ax,
and both halves of ax can be accessed as al (low half) and ah (high half). In the
x64 architecture, all these registers overlap with the 64-bit rax register. Three other
registers (ebx, ecx, and edx) have similar overlaps, while other registers have overlaps
only between the 32-bit register and its lower half (e.g. ebp and bp).
Figure 3.19: Overlapping x86 registers. Bits 31-0 of eax (here holding 0x12345678) contain ax in bits 15-0 (0x5678), which in turn contains ah in bits 15-8 (0x56) and al in bits 7-0 (0x78).
The Motorola 68000 family of processors has a similar situation, although only the
lower parts (lower 16 bits, or lower 8 bits) can be accessed. In the 6809 and 68HC11
series, both halves of the D (double byte) register can be accessed as the A (upper) and
B (lower) accumulators.
Some RISC architectures have a similar problem. For example, SPARC double precision
registers use pairs of single precision registers, and typical 32-bit SPARC programs use
pairs of 32-bit loads and stores when loading or storing a 64-bit floating point register.
However, this problem is different, in that the smaller register is the machine word size,
and bit manipulation such as shifting and masking cannot be performed on floating
point registers.
Since there are distinct names for the overlapped registers at the assembly language
level, it seems natural to represent these registers with separate names in the intermediate
representation (IR). However, this leads to a kind of aliasing at the register level;
treating the registers as totally independent leads to errors in the program semantics.
The alternative of representing all subregisters with the one register in the IR also
has problems. For example, al and ah may be used to implement two distinct 8-bit
variables in the original source program. When accessing the variable assigned to ah,
awkward syntax will result, e.g. local3 >> 8 or local3 & 0xFF. Even worse, the data
flow for the two variables will become conflated, so that an assignment to one variable will
appear to affect the other. The conflation could be expressed with
source code constructs such as C unions or Fortran common statements. However, such
output is quite unnatural, and represents low level information that does not belong
in the decompiled output. Such a scheme is more suitable for a binary translator, where
readability of the generated code is not a concern.
The best solution seems to be to treat the overlapped registers as separate, but to
make the side effects explicit. This was first suggested for the Boomerang decompiler
by Trent Waddington [Boo02]. Most such side effects can be added at the instruction
decoding level, which is source machine dependent already. (An exception is results
from call statements, which are not explicit at the assembly language level, and which
may be added later in the analysis.) After any assignment to an overlapped register, the
decoder can emit assignments representing the side effects. Considering the example
of Figure 3.19, after an assignment to eax the decoder can emit:

ax := truncate(eax)
al := truncate(eax)
ah := eax@[8:15] /* Select bits 8-15 from eax */
Many of these have the potential of leading to complex generated source code, but
dead code elimination will remove almost all such code. For example, where al and
ah are used as independent variables, the larger registers ax and eax will not be used,
so dead code elimination will simply eliminate the complex assignments to ax and
eax. When operands are constants, the simplification process reduces something like
0x12345678@[8:15] to 0x56, or plain decimal 86. Where the complex manipulations are
not eliminated, the original source program must in fact have been performing complex
manipulations on the register, so the complexity is genuine. A sketch of such a decoder
extension follows.
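The table-driven nature of this step is easy to see in a minimal sketch (hypothetical overlap table, with IR statements represented as strings):

    # register -> list of (subregister, low bit, high bit) overlaps
    OVERLAPS = {"eax": [("ax", 0, 15), ("al", 0, 7), ("ah", 8, 15)]}

    def side_effects(reg):
        # IR assignments to emit after any assignment to reg
        return ["%s := %s@[%d:%d]" % (sub, reg, lo, hi)
                for sub, lo, hi in OVERLAPS.get(reg, [])]

    for stmt in side_effects("eax"):
        print(stmt)
    # ax := eax@[0:15]
    # al := eax@[0:7]
    # ah := eax@[8:15]

An analogous table would drive the reverse direction (assignments to a subregister updating the enclosing register).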
3.7.1 Sub-fields

Sub-fields present similar problems to those of overlapped registers.
Some high level languages, such as C, allow the packing of several structure fields into
one machine word. Despite the fact that the names of such fields are related, in data
flow terms the variables are completely independent. As mentioned above, current data
flow analyses will not treat them as independent, leading to less than ideal generated
code. There is no reason that compilers for languages such as Pascal, which support
range variables (e.g. 0..255 or 100..107), could not pack two or more such variables
into a machine word. Such variables will also not be handled properly by a standard
data flow analysis.
The PowerPC architecture has a 32-bit register containing 8 fields of four condition
codes (8 condition code registers, named cr0 - cr7). Instructions using the condition
codes register specify which condition code set they will refer to. This poses a similar
problem, except that the way the condition code register is represented in a decompiler
is already a special case.

The related work confirms that the combination of expression propagation and dead code
elimination is powerful.
Johnstone and Scott describe a DSP assembly language to C reverse compiler called
asm21toc [JS04]. They find that the combination of propagation and dead code elimination
is also useful for their somewhat unusual application (DSP processors have some
unusual features compared to ordinary processors). They do not appear to use any
special IR; they parse the assembly language to produce low-level C, perform data
flow analyses using standard techniques (presumably similar to those of [ASU86]), and
use reductions to structure some high level constructs. Their output typically requires
editing before it can be compiled. They found that some 85% of status (flags)
related code could be removed by these techniques.
SSA Form

Static Single Assignment form assists with most data flow components of decompilers:
propagating expressions, identifying parameters and
return values, deciding if locations are preserved, and eliminating dead code.
Chapter 3 listed many applications for data flow analysis, several of which were somewhat
difficult to apply. This chapter introduces the Static Single Assignment form (SSA
form), which dramatically reduces these difficulties. The SSA form also forms the basis
for type analysis in Chapter 5 and the analysis of indirect jumps and calls in Chapter
6.
The SSA form is introduced in Section 4.1, showing that the implementation of many of
the data flow operations in a machine code decompiler is simpler. Some problems that
arise with the propagation of memory expressions are discussed in Section 4.2, while
Section 4.3 introduces the important concept of preserved locations. The analysis of
these is complicated by recursion, hence Section 4.4 discusses the problems caused by
recursion. At several points in the decompilation it is
convenient to store a snapshot of definitions or uses, and these are provided by collectors,
discussed in Section 4.5. Sections 4.6 and 4.7 discuss related work and alternative representations,
respectively.
SSA form vastly simplifies expression propagation, provides economical data flow information,
and is strong enough to solve problems that most other analysis techniques
cannot solve.
The SSA form is a program representation which maintains the property that each variable or location is defined only once in
the program. Maintaining the program in this form has several advantages: analyses
are simpler in this form, particularly expression propagation. For most programs, the
size of the SSA representation is roughly linear in the size of the program, whereas
alternatives such as explicit def-use chains can grow much faster.
In order to make each definition of a location unique, the variables are renamed. For
example, three definitions of register edx could be renamed to edx1, edx2, and edx3.
The subscripts are often assigned sequentially as shown, but any renaming scheme
that makes the names unique would work. In particular, the statement number or
memory address of the statement could be used. SSA form assumes that all locations
are initialised at the start of the program (or program fragment or procedure if analysing
in isolation). The initial values of locations are usually given the subscript 0, e.g. edx0
for the initial value of edx.
Since each use in a program has only one definition, a single pointer can be used from
each use to the definition, effectively providing reaching definitions information (the one
definition that reaches the use is available by following the pointer). Much of the time,
the original definition does not have to be literally renamed; it can be assumed that
each definition is renamed with the address of the statement as the subscript. (For
convenience, display of the location could use a statement number as the subscript.)
Sassa et al. report that renaming the variables accounts for 60-70% of the total time
to translate into SSA form [SNK+03]. Implicit assignments are required for parameters
and other locations that have uses but no explicit assignments; these are comparatively
few.
The algorithms for transforming ordinary code into and out of SSA form are somewhat
complex, but well known and efficient [BCHS98, App02, Mor98, Wol96]. Figure 4.1
shows the main loop of the running example and its equivalent in SSA form. Statements
irrelevant to the example have been removed for clarity (e.g. assignments to unused
temporaries, unused flag macros, and unused assignments to the stack pointer).
Note how the special variable esp0 introduced earlier is now simply the SSA name
for the initial value of the stack pointer (esp subscripted with 0), and there is no need to explicitly add
an assignment from esp0 to esp. The initial value for other locations is also handled
automatically. Special statements called φ-functions are required where control
flow merges, i.e. where a basic block has more than one in-edge. A φ-function is needed
for location a only if more than one definition of a reaches the start of the basic block.
The top of the loop in Figure 4.1 is an example; locations m[esp0 -28], st, edx, esi,
and ebx are all modified in the loop, and also have definitions prior to the loop, but
esp0 itself requires no φ-function. st1 := φ(st0 , st3 ) means that the value of st1 ,
loop1:
80483d8 46 m[esp] := edx
80483d9 47 st := st *f (double)m[esp]
80483dc 49 edx := edx - 1 // Decrement numerator
80483dd 51 m[esp] := esi
80483e0 52 st := st /f (double)m[ebp-24] // m[ebp-24] = m[esp]
80483e6 57 esi := esi - 1 // Decrement denominator
80483e7 60 ebx := ebx - 1 // Decrement counter
80483e8 62 goto loop1 if ebx >= 0
80483f5 65 m[esp] := (int)st
80483fb 67 edx := m[esp]
(a) Original program. *f is the floating point multiply operator, and /f
denotes floating point division.
loop1:
m[esp0 -28]1 := φ(m[esp0 -28]0 , m[esp0 -28]3 )
st1 := φ(st0 , st3 )
edx1 := φ(edx0 , edx2 )
esi1 := φ(esi0 , esi2 )
ebx1 := φ(ebx0 , ebx2 )
46 m[esp0 -28]2 := edx1
47 st2 := st1 *f (double)m[esp0 -28]2
49 edx2 := edx1 - 1 ; Decrement numerator
51 m[esp0 -28]3 := esi1
52 st3 := st2 /f (double)m[esp0 -28]3
57 esi2 := esi1 - 1 ; Decrement denominator
60 ebx2 := ebx1 - 1 ; Decrement counter
62 goto loop1 if ebx2 >= 0
65 m[esp0 -20]1 := (int)st3
67 edx3 := m[esp0 -20]1
(b) SSA form.
Figure 4.1: The main loop of the running example and its equivalent SSA form.
treated as a separate variable from st0 and st3, takes the value st0 if control flow
enters via the first basic block in-edge (falling through from before the loop), and the
value st3 if control flow enters through the second in-edge (from the branch at the end
of the loop). In general, a φ-function at the top of a basic block with n in-edges will
have n operands.
A relation called the dominance frontier efficiently determines where φ-functions are
necessary: for every node n defining location a, every basic block in the dominance
frontier of n requires a φ-function for a. The dominance frontier for a program can
be computed efficiently; a sketch of the placement step follows.
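A simplified sketch of the standard placement algorithm (following Cytron et al.), assuming the dominance frontier DF of each block has already been computed; all names here are hypothetical:

    def place_phis(defsites, DF):
        # defsites: location -> set of blocks that define it
        # returns:  location -> set of blocks needing a phi for it
        phis = {}
        for loc, sites in defsites.items():
            work, placed = list(sites), set()
            while work:
                n = work.pop()
                for m in DF[n]:
                    if m not in placed:
                        placed.add(m)       # block m needs a phi for loc
                        if m not in sites:  # the phi is itself a definition,
                            work.append(m)  # so m's frontier must be visited
            phis[loc] = placed
        return phis

    # Loop of Figure 4.1: the loop head is in the frontier of the loop body.
    print(place_phis({"st": {"entry", "body"}},
                     {"entry": set(), "body": {"head"}, "head": set()}))
    # {'st': {'head'}}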
When translating out of SSA form, φ-functions must be implemented
(made executable) by inserting copy statements. For example, st1 := φ(st0 , st3 )
could be implemented by inserting st1 := st0 just before the loop (i.e. at the end of
the basic block which is the first predecessor of the current basic block), and inserting
st1 := st3 at the end of the loop (which is the end of the basic block which is the
second predecessor of the current basic block). In many instances the copy statements
can later be removed by coalescing. A sketch of this lowering follows.
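A minimal sketch of the lowering (hypothetical Block objects holding IR statements as strings):

    class Block:
        def __init__(self, name):
            self.name, self.statements = name, []

    def lower_phi(dest, operands):
        # operands: one (predecessor block, source variable) pair per in-edge
        for pred, src in operands:
            pred.statements.append("%s := %s" % (dest, src))
        # the phi-function itself can then be deleted

    pre, loop_end = Block("pre"), Block("loop_end")
    lower_phi("st1", [(pre, "st0"), (loop_end, "st3")])
    print(pre.statements, loop_end.statements)
    # ['st1 := st0'] ['st1 := st3']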
The main property of a program in SSA form is that each use has effectively only one
definition. If in the pre-SSA program a use had multiple definitions, the definition
will be a φ-function that has as operands the original definitions, or other φ-functions
leading to them. A secondary property is that definitions dominate all uses. That is, any
path that reaches a use of a location is guaranteed to pass through the single definition
for that location. For example st is defined at statement 52 in Figure 4.1(a), after a use
in statement 47. However, in the equivalent program in SSA form in Figure 4.1(b), there
is a φ-function at the top of the loop which dominates the use in statement 47. One of
the operands of that φ-function references the definition in statement 52 (i.e. st3 ).
4.1.1 Benefits

The SSA form makes propagation very easy; initial parameters are readily identified,
and use-definition information requires essentially no maintenance.
Both requirements for expression propagation (Section 3.1 on page 65) are automatically
fulfilled by programs in SSA form. The first condition (that s must be the only definition
of x to reach u) is satisfied because definitions are unique. The second condition (that
no component of the right hand side may be redefined between s and u) is also satisfied:
any path from s to u can only pass through the unique definitions for st1, esp0, and m[esp0 -28], the three subscripted
locations on the right hand side, before s itself. Because all definitions are unique, any path from s to
u cannot redefine them. Where the right hand side contains a memory or
array expression, there is a similar requirement for subscripted locations in the address
expression: there can be no assignment to esp0 or esi1 between s2 and u, and s2 must be the only
definition of m[esp0 -28] to reach u. These are again automatically satisfied, except that
the SSA form has no special way of knowing whether expressions other than precisely
m[esp0 -28] alias to the same memory location. For example, if m[ebp1 -4] aliases with
m[esp0 -28] and there is an assignment to m[ebp1 -4], the assumption of definitions
being unique is broken. This issue will be discussed in greater detail later.
After statement 46 is propagated into statement 47, the IR is as shown in Figure 4.2.
Note that the propagation from statement 47 to statement 52 is one that would not
normally be possible, since there is a definition of edx between them. In SSA terms,
however, the definition is of edx2, whereas edx1, treated as a different variable, is used
by statement 47. The cost of this flexibility is that when the program is translated back
out of SSA form, an extra variable may be needed to store the old value of edx (edx1).
In SSA form, propagation itself is
performed very easily. Most uses can be replaced by the right hand side of the expression
indicated by the subscript. In practice, some statements are not propagated, e.g. φ-functions
and call statements. φ-functions must remain in a special position in the
control flow graph. Call statements may have side effects, but could be propagated
under some circumstances. If the resultant statement after a propagation still has
subscripted components, and those definitions are suitable for propagation, the process
can be repeated.
Note that this propagation has removed the machine code detail that the floating point
instructions used require the operands to be pushed to the stack; these instructions can
not reference integer registers such as edx directly. All three uses in this expression
now refer directly to their unique definitions.
Since expression propagation is simple in SSA form, analyses relying on it (e.g. constant
propagation) are also simplified.
While SSA form inherently provides only use-definition information (the one definition
for this use, and other definitions via the φ-functions), definition-use information (all
uses for a given definition) can be calculated from this in a single pass. The IR can
alternatively record definition-use
information upon translation into SSA form (so that the arcs of the SSA graph point from
definitions to uses).
The algorithm for transforming programs into SSA form assumes that every variable is
implicitly defined at the start of the program, at an imaginary statement zero. This
means that when a procedure is transformed into SSA form, all its potential parameters
(locations used before being defined) appear with a subscript of zero, and are therefore
easily identified, as the sketch below illustrates.
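For example (hypothetical IR with uses represented as (location, version) pairs), the potential parameters fall out of a one-line filter:

    # every use whose SSA version is 0 is used before being defined
    uses = {("m[esp0+4]", 0), ("ecx", 0), ("edx", 2), ("st", 1)}
    potential_params = sorted(name for name, version in uses if version == 0)
    print(potential_params)   # ['ecx', 'm[esp0+4]']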
The implicit use-definition information provided by the SSA form requires no maintenance
as statements are modified and moved. There is never an extra definition created
for any use, and a definition should never be deleted while there is any use of that
definition. The information remains valid when expressions are propagated (some uses
are added, others are removed), and when dead code is eliminated (uses are removed).
Deciding whether a location is preserved by a
procedure is also facilitated by the way that the SSA form separates different versions
of the same location. Firstly, it is easy to reason with equations such as esi99 =
esi0, rather than having to distinguish the definition of esi that reaches the exit, the
original value of esi, and all its intermediate definitions. The equation can be simplified
step by step. Secondly, starting with
the definition reaching the exit, preceding definitions can be accessed directly without
any searching.
With the aid of some simple identities, SSA form is able to convert some sequences of
complex code into a form which greatly assists their understanding.
Finally, the SSA form is essential to the operation and use of collectors, discussed in
Section 4.5.
From the above, it is clear that the SSA form is very useful for decompilers.
The Static Single Assignment form is not directly suitable for converting to high level
source code, since φ-functions are not directly executable. φ-functions could effectively
be executed by inserting one copy statement for each operand of the φ-function. Every
definition of every original variable would then effectively be the definition of a separate
variable. This would clutter the decompiled output with variable declarations and copy
statements, making it very difficult to read. Hence, the intermediate representation has
to be translated out of SSA form before code generation.
Several mapping policies from SSA variables to generated code variables are possible.
For example, the original names could be ignored, and new variables could be allocated
as needed. However, this discards some information that was present in the input
binary: the allocation of original program variables to stack locations and to registers.
While optimising compilers do merge several variables into the one machine location,
and sometimes allocate different registers to the same variable at different points in the
program, they may not do this, especially if register pressure is low. Attempting to
minimise the number of generated variables could lead to the sharing of more than one
unrelated original variable in one output variable. While the output program would be
correct, it would be less readable.
The simplest mapping policy is to keep the original allocation of program variables to
machine locations, and the simplest implementation of
this policy is to simply remove the subscripts and φ-functions. However, there are two
circumstances where this does not work.
The first is where the binary program uses the location in such a way that different
types will have to be declared in the output program. In the running example, register
location edx is used to hold three original program variables: num, n, and i. All of these
were declared to be the same type, int, but if one was of a different fundamental type
(e.g. char*), then the location edx would have to be split into two output variables,
each with a different type.
The second circumstance where subscripts cannot simply be removed is that there may
be an overlap of live ranges between two or more versions of a location in the current
(transformed) IR. When two names for the same original variable are live at the same
point in the program, one of them needs to be copied to a new local variable. Figure 4.3
shows part of the main loop of the running example, after the propagations of Section
4.1.1. Note how the multiply operation, previously before the decrement of edx, now
follows it, and the version of edx used (edx1) is the version before the decrement.
The seven columns at the left of the diagram represent the live ranges of the SSA
variables. None of the various versions of the variables overlap, except for edx1 and
edx2, indicated by the shaded area. When transforming out of SSA form, a temporary
[Live range columns for edx0, edx1, edx2, esi1, esi2, st1 and st3 appear at the left of the figure; only edx1 and edx2 overlap.]
loop1:
edx1 := φ(edx0, edx2)
st1 := φ(st0, st3)
esi1 := φ(esi0, esi2)
edx2 := edx1 - 1
st3 := st1 *f (double)edx1 /f (double)esi1
esi2 := esi1 - 1
goto loop1 if ...
edx3 := (int)st3
Figure 4.3: Part of the main loop of the running example, after the propagation
of the previous section and dead code elimination.
variable is required to hold either edx1 or edx2. Figure 4.4(a) shows the result of
replacing edx1 with local1, and Figure 4.4(b) the result of replacing edx2 with local2.

Figure 4.4: The IR of Figure 4.3 after transformation out of SSA form: (a) replacing edx1 with local1; (b) replacing edx2 with local2.

In the second case, the definition for edx1 became φ(edx0, local2), which is implemented by inserting
copy statements at the end of the predecessor basic blocks.
Note that in this second case, it is now possible to propagate the assignment to local2
into its only use, thereby removing the need for a local variable at all. In addition,
the combination of the SSA form and expression propagation has achieved code motion:
the decrement of edx has moved past the multiply (as in the original source code)
and past the divide (which it did not pass in the original source code). Unfortunately, this is
not always possible, as shown in the next example. Figure 4.5 shows the code of Figure
4.3, with the propagation applied also to the loop condition.
Expression propagation has here been unhelpful, changing the loop condition from the
original edx2 > r to edx1 -1 > r. Now it is no longer possible to move the definition of
[Live range columns as in Figure 4.3; edx1 and edx2 again overlap.]
loop1:
edx1 := φ(edx0, edx2)
st1 := φ(st0, st3)
esi1 := φ(esi0, esi2)
edx2 := edx1 - 1
st3 := st1 *f (double)edx1 /f (double)esi1
esi2 := esi1 - 1
goto loop1 if edx1-1 > ecx0
edx3 := (int)st3
Figure 4.5: The code of Figure 4.3, with the loop condition optimised to num>r.
edx2 to below the last use of edx1, since edx1 is used in the loop termination condition.
Figure 4.6 shows two possible transformations out of SSA form for the code of Figure
4.5.
(a) Replacing edx1 with local1:

loop1:
local1 := edx
edx := local1 - 1
st := st *f (double)local1 /f (double)esi
esi := esi - 1
goto loop1 if local1-1 > ecx
edx := (int) st

(b) Replacing edx2 with local2, then of necessity edx1 with local1:

loop1:
local2 := edx - 1
st := st *f (double)edx /f (double)esi
esi := esi - 1
local1 := edx
edx := local2
goto loop1 if local1-1 > ecx
edx := (int) st

Figure 4.6: Two possible transformations out of SSA form for the code of Figure 4.5.
Note that the second case could again be improved if expression propagation is performed
on local2. Unfortunately, at this point, the program is no longer in SSA form,
so the conditions for propagation must be verified explicitly. However, the copies inserted by the transformation
out of SSA form are only needed across short distances (within a basic block).
Hence it may be practical to test the second condition for propagation by testing each
statement between the source and destination. This condition (see Section 3.1) states
that components of the expression being propagated should not be redefined on any
path from the source to the destination.
In this case, not propagating into the loop expression would have avoided the problem.
Techniques for minimising the number of extraneous variables and copy statements will
be discussed below; note, however, that choosing the best version
of a variable to replace with a new variable, effectively performing code motion, is NP-complete [Upt03].
The above examples show that the number of copy statements and extra local variables
is affected by many factors. In the worst case, it is possible for a φ-function with
n operands to require n copy statements to implement it, if every operand has been
assigned a different output variable. Section 4.1.3 will discuss how to minimise the
copies.
Unused but not eliminated definitions with side effects can cause problems with the
translation out of SSA form.
Figure 4.7 shows a version of the main loop of the running example where an assignment
to st2, even though unused, is not eliminated before transformation out of SSA form.
This can occur with assignments to global variables, which should never be eliminated.
loop1:
st1 := φ(st0 , st3 )
edx1 := φ(edx0 , edx2 )
esi1 := φ(esi0 , esi2 )
st2 := st1 *f (double)edx1 ; Unused definition
edx2 := edx1 - 1 ; Decrement numerator
st3 := st1 *f (double)edx1 /f (double)esi1
esi2 := esi1 - 1 ; Decrement denominator
goto loop1 if ...
edx3 := (int) st3
Figure 4.7: A version of the running example where an unused definition has
not been eliminated.
The definition of st2 no longer has any uses, so it has no live range, and hence does
not interfere with the live ranges of other versions of st. As a result, considering only
the liveness of variables, it appears safe simply to remove the subscripts, producing
the program of Figure 4.8.
This program is clearly wrong; in terms of the original program variables, res *= num
is being performed twice for each iteration of the loop, where the original program
performs this step only once per iteration. The solution to this problem is to treat
definitions of versions of variables as interfering with any existing liveness of the same
variable, even if they cannot themselves contribute new livenesses. The result is to
force a copy to a fresh variable, yielding a correct program.
loop1:
st := st *f (double)edx ; Unused code with side effects
local1 := edx
edx := edx - 1
st := st *f (double)local1 /f (double)esi
esi := esi - 1
goto loop1 if ...
edx := (int) st
Figure 4.8: Incorrect results from translating the code of Figure 4.7 out of SSA
form, considering only the liveness of variables.
Several algorithms for SSA back translation exist in the literature; one due to Sreedhar et al. appears to be the most suitable for decompilation.
Because of the usefulness of the SSA form, and the cost of transforming out of SSA
form, decompilers benefit from keeping the intermediate representation in SSA form for
as long as possible.
The process of translating out of SSA form is also known as SSA back translation in
the literature. The original algorithm for translation out of SSA form by Cytron et
al. [CFR+91] has been shown to have several critical problems, including the so-called
lost copy problem. Solutions to the problem have been given by several authors; the
salient papers are by Briggs et al. [BCHS98] and Sreedhar et al. [SDGS99]. Sassa et al.
attempted to improve on these solutions, but concluded that the algorithm by Sreedhar
et al. was superior.
While the application of these algorithms to decompilation remains the subject of future
work, it would appear that the algorithm by Sreedhar et al. offers the best opportunity
for decompilation. Sreedhar uses both liveness information and the interference graph
in a coalescing algorithm to coalesce pairs of variables (make them use the same name). Unique to
Sreedhar's algorithm is the ability to safely coalesce the variables of copy statements
such as x=y where the locations interfere with each other, provided that the coalesced
variable does not cause any new interference.
Figure 4.9 illustrates the coalescing phase. Figure 4.9(a) shows a program in SSA form.
Since none of the operands of the φ-functions interfere with each other, the subscripts
can be dropped as shown in Figure 4.9(b). The standard coalescing algorithm, due to
Chaitin, can not eliminate the copy x=y, since x and y interfere with each other [Cha82].
Figure 4.9: (a) Translated SSA form: y1=30, then y2=10 and x2=20 on another path, the copy x1=y1, and φ-functions y3=φ(y1,y2) and x3=φ(x1,x2) feeding foo(y3) and foo(x3). (b) Chaitin coalescing: the subscripts are dropped, but the copy x=y remains. (c) Sreedhar coalescing: x1, x2, y1 and y2 become the single variable xy (taking the values 30, 10 and 20 on the respective paths) and the copy disappears.
Although the variables x1 and y1 interfere with each other, Sreedhar's algorithm is able
to identify that if x1 and y1 are coalesced, the resulting variable does not cause any
new interferences. Figure 4.9(c) shows the example code after x1, x2, y1, and y2 are
coalesced into one variable xy, and the copy statement x1=y1 is removed.
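The following sketch (hypothetical liveness data; a deliberate simplification of Sreedhar's actual condition) captures the intuition: two variables may share a name if, wherever their live ranges overlap, they are known to hold the same value, as x1 and y1 do after the copy x1=y1:

    live = {"x1": {3, 4}, "y1": {2, 3, 4}}   # program points where each is live
    same_value = {("x1", "y1"): {3, 4}}      # points where provably equal

    def may_coalesce(a, b):
        overlap = live[a] & live[b]
        return overlap <= same_value.get((a, b), set())

    print(may_coalesce("x1", "y1"))   # True: no new interference results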
Translation out of SSA form in a decompiler and register colouring in a compiler have
some similarities, but there are enough significant differences that register colouring
algorithms cannot be used unchanged.
The role of the register allocator in a compiler, usually implemented by a register colouring
algorithm, is to allocate registers to locations in the IR. The number of registers
is fixed, and depends on the target machine. It is important that where live ranges
overlap, distinct registers are allocated.
One of the main tasks when translating out of SSA form in a decompiler is to allocate
local variables to locations in the SSA-based IR. The number of local variables has
no fixed limit; however, for readability, it is important to use as few local variables as
possible. It is important that where live ranges overlap, distinct local variables are
allocated.
From the above two paragraphs, it can be seen that there are similarities and differences
between register colouring and the translation out of SSA form. The similarities include:

• Both make use of liveness information, in the form of an interference graph.

• For both compilers and decompilers, φ-functions indicate sets of locations that
it would be beneficial to allocate the same resource to. These relations could be
recorded in the same way.

The differences include:

• The number of registers is fixed, but the number of local variables has no fixed
limit.

• For a compiler, a small set of interference graph nodes are pre-coloured, because
some registers must be used for parameter passing or the return value, or because
of instruction requirements (e.g. on x86, the result of a multiplication appears in edx:eax).
For a decompiler, assuming that the original mapping
of variables to machine locations is kept, every node is effectively pre-coloured.
• For a compiler, the interference graph is always consistent, but until finished it
is not fully coloured. For a decompiler, the graph is always fully coloured but
possibly inconsistent; the task is not to colour the graph, but to remove the
inconsistencies. A decompiler can rename versions of locations to new local variables,
and attempt to minimise the copy statements with the coalescing phase of the
register colouring algorithm. The φ-functions or copy statements guide the allocation.
Removing an inconsistency will sometimes
split a group of nodes that currently uses the same local variable, so that
some of the members of the group no longer use the same local variable as the
others. The aim of coalescing is to reduce copy statements; the cost of splitting
is extra local variables.

The differences are significant enough that only interference graphs and the broad concept
of colouring can be adapted from register colouring to allocating local variables in
a decompiler.
Certain patterns of code induce one or more extraneous local variables, which reduces readability.
Consider the IR fragment of Figure 4.1 from the running example. After exhaustive
expression propagation, the result is as shown in Figure 4.10.
loop1:
st1 := φ(st0, st3)
edx1 := φ(edx0, edx2)
esi1 := φ(esi0, esi2)
ebx1 := φ(ebx0, ebx2)
edx2 := edx1 - 1
st3 := st1 *f (double)edx1 /f (double)esi1
esi2 := esi1 - 1
ebx2 := ebx1 - 1
goto loop1 if ebx1-1 >= 0
edx3 := (int)(st1 *f (double)edx1 /f (double)esi1)
Figure 4.10: The code from Figure 4.1 after exhaustive expression propagation, showing
the overlapping live ranges.
Note the multiple shaded areas indicating an overlap of live ranges of different versions
of the same location. Now the φ-function for edx1 requires a copy, since edx1 can now become edx0 or
local8 depending on the path taken. This necessitates a copy of edx2 at the end of
the loop into local12, and back to edx1 (allocated to local11) at the start of the
loop. Similarly, the live range overlaps of ebx, esi, and st (st is the register representing
the top of the floating point stack) necessitate copies to variables local10
and others, producing the code shown in Figure 4.11.
As well as the extra local variables, the loop condition is changed by one (local10 >= 1
compared with the original c > 0). In other programs, this can be more obvious,
do {
local11 = local12;
local10 = local15;
local17 = local13;
local7 = local18;
local8 = local11 - 1;
local18 = local7 * (float)local11 / (float)local17;
local9 = local17 - 1;
local15 = local10 - 1;
local12 = local8;
local13 = local9;
} while (local10 >= 1);
local14 = (int)(local7 * (float)local11 / (float)local17);
Figure 4.11: Generated code from a real decompiler with extraneous variables
for the IR of Figure 4.10. Copy statements inserted before the loop
are not shown.
e.g. i < 10 becomes old_i < 9. The floating point multiply and divide operations
are repeated after the loop, adding more unnecessary volume to the generated code.
Finally, the last iteration of the loop, computing the result in local18, is unused. As
a result of all these effects, the output program is potentially less efficient than the
original. A very good optimising compiler used on the decompiled output might be
able to undo some of the damage.
The code where these extraneous local variables are being generated contains statements
of the form shown in Figure 4.12.
loop:                      loop:
  x2 := φ(x1, x3)            x2 := φ(x1, x3)
  ...                        ...
  x3 := af(x2)               x3 := af(x2)
  ...                        ...
  print x3                   print af(x2)
(a) Before propagation.    (b) After propagation of x3.
Figure 4.12: Live ranges for x2 and x3 when x3 := af(x2) is propagated inside
a loop.
When the right hand side of x3 := af(x2) is propagated to a use later in
the loop, e.g. to the print statement, x2 will be live after the assignment to x3, as shown
in Figure 4.12(b). The assignment to x3 cannot be eliminated, since it still has a use;
in other words, x3 is used by the φ-function, hence it is never unused. Overlapping live
ranges for the different versions of the same base variable will result in extra variables in
the decompiled output, as noted above. Only assignments of this sort, where the same
location appears on the left and right hand sides, have this property. These assignments
will be called overwriting statements.
Note that the opportunity for propagation in this situation is unique to renaming
schemes such as the SSA form. In normal code, overwriting statements cannot be
propagated, since a component of the right hand side (here x2) is modified along all
paths from the source to the destination. Propagation of this sort in SSA form is pos-
sible only as a result of the assumption that different versions of the same variable are
distinct.
Propagating overwriting statements where the definition is not in a loop is usually not
a problem. The reason is that the propagated expression usually becomes eliminated
as dead code, so that the new version of x does not interfere with other versions of x.
When inside a loop, the new version of x is always used by the φ-function at the top of
the loop.
The problem also occurs when propagating expressions that carry a component of the
overwriting statement across the overwriting statement. Figure 4.13 shows an example:
there is a component of the right hand side that is defined by a φ-function, one of whose
operands is the location defined by the statement. φ-functions reference earlier values
of their operands, so in effect there is an overwriting statement between the definition
of y1 and its use.
Algorithm 1 gives the detail of how to prevent extraneous local variables as discussed
in this section. It has been implemented in the Boomerang decompiler; for results, see Chapter 7.
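As a concrete illustration, the following C++ sketch shows the essence of such a guard; the names Assign and blockPropagation are hypothetical, not Boomerang's actual classes. An assignment is an overwriting statement when its own base location appears on its right hand side, and propagating it is blocked when its definition lies inside a loop.

#include <algorithm>
#include <string>
#include <vector>

// A statement reduced to the data needed for the check. The "base" of a
// location is its name ignoring the SSA subscript, e.g. "x" for x2 or x3.
struct Assign {
    std::string lhsBase;                 // base of the defined location
    std::vector<std::string> rhsBases;   // bases of locations used on the RHS
    bool insideLoop;                     // from loop nesting information
};

// x3 := af(x2) is overwriting: the same base appears on both sides. When it
// is defined inside a loop, the old version stays live around the back edge
// (it feeds the φ-function), so propagation would overlap the live ranges.
bool blockPropagation(const Assign& a) {
    bool overwriting = std::find(a.rhsBases.begin(), a.rhsBases.end(),
                                 a.lhsBase) != a.rhsBases.end();
    return overwriting && a.insideLoop;
}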
4.2 Applying SSA to Memory
As a result of alias issues, memory expressions must be divided into those which can safely be renamed and propagated, and those which cannot.
Memory expressions are subject to aliasing; in other words, there can be more than
one way to refer to the location, effectively creating more than one name for the same
location. The meaning of the term name here is different to the one used in the
context of SSA renaming or subscripting. Aliasing can occur both at the source code
and the machine code levels.
At the source code level, the names for a location might include g (a global variable),
*p (a dereference of a pointer holding the address of g), and a reference parameter aliased to g.
At the machine code level, all these aliases persist. Reference parameters appear as
call-by-value pointers at the machine code level, but this merely adds another * (deref-
erence) to the aliasing name. In addition, a global variable might have the names
m[10 000] (the memory at address 10 000), m[r1] where r1 is a machine register and r1
could take the value 10 000, m[r2+r3] where r2+r3 could sum to 10 000, m[m[10 008]]
where m[10 008] has the value 10 000, and so on. There will also be several equivalents
involving the stack and frame pointers. As a result of these additional aliasing opportunities,
aliasing is more common at the machine code level than at the source code level.
The main causes of machine code aliasing are the manipulation of the stack and frame
pointers, and the propensity for compilers to use registers as internal pointers to objects
(pointers inside multi-word data items, not pointing to the first word of the item).
Internal pointers are particularly common with object oriented code, since objects are
usually accessed through pointers, often offset to reach members.
The following sections will detail the problems and offer some solutions.
Figure 4.14 illustrates the problem posed by internal pointers with a version of the
running example.
Statement 3 causes one register to become a linear function of another register, making
edi3 an internal pointer into the object allocated at statement 1. The
assignments of statements 14 and 15 are to the same memory location, but the memory
expressions appear to be different in Figure 4.14(b). When the memory variables are
placed into SSA form, as shown in Figure 4.14(c), they are treated as independent
variables. As a result, the wrong value (exp2 from statement 14, rather than exp3 from
statement 15) is propagated, and the decompilation is incorrect.
In this example, statements 2 and 3 have not been propagated into statements 13-
15 for some reason; perhaps they were propagated but too late to treat the memory
expressions of statements 14 and 15 as the same location. If they had been propagated,
the last two memory expressions would have been m[eax1 + 4], and the problem would
not exist. This is an example where too little propagation causes problems.
int main() {
pair<int>* n_and_r = new pair<int>;
...
printf("Choose %d from %d: %d\n", n_and_r->first,
n_and_r->second, comb(n_and_r));
(a) Original source code
1 eax1 := malloc(8)
2 esi2 := eax1
3 edi3 := eax1 + 4
...
13 m[esi2 ]13 := exp1
14 m[esi2 +4]14 := exp2
15 m[edi3 ]15 := exp3
26 printf ..., m[esi2 ]13 →exp1 , m[esi2 +4]14 →exp2, ...
(c) IR with memory expressions in SSA form
Figure 4.14: An internal pointer (statement 3) creates two names for the same memory
location, so the wrong value is propagated to the printf.
Alias-induced problems can also arise from a lack of preservation information. Preser-
vation analysis will be discussed more fully in Section 4.3. Figure 4.15(a) shows code
where a recursive call (statement 30) prevents knowledge of whether the stack frame
pointer ebp is preserved.
Preservation analysis requires complete data flow analysis, and complete data flow
analysis requires preservation analysis for the callee (which, because of the recursion,
is the current procedure, and has not yet had its data flow summarised). As a result of
the lack of preservation knowledge, the frame pointer is conservatively treated as being
defined by the call. Later analysis reveals that ebp is not affected by the call, i.e. ebp30
= ebp3 = esp0 -4, and the memory expression in statement 79 is in fact the parameter
n; until then, it is not known whether the memory expression is in some sense safe to use.
Another issue is that the data flow analysis of the whole procedure is not complete
until it is known which memory locations the call defines and uses.
Figure 4.15: A recursive version of comb from the running example, where
the frame pointer (ebp) has not yet been shown to be preserved because
of a recursive call.
The problems with correctly renaming memory expressions lead to the question of
whether memory expressions should be renamed at all. For example, the Low Level
Virtual Machine (LLVM, [LA04]) compiler optimisation framework uses SSA as the
basis for its intermediate representation; however, memory locations in LLVM are not
in SSA form (not subscripted). To address this question, consider Figure 4.16.
Figure 4.16(a) shows the main loop of the running example with one line added; this
time variables c and denom are allocated to local memory. Figure 4.16(b) shows the
program in SSA form, ignoring the alias implications of statement 82. The program is
correctly represented, except for one issue. If c and *p could be aliases, the program is
incorrect. Hence, care must be taken if the address of c is taken, and could be assigned
to p.
c = n-r;
*p = 7; /* Debug code */
while (c-- > 0) {
res *= num--;
res /= denom--;
}
(a) Source code.
(b) Subscripting and propagating memory expressions:
80  c80 := n0 - r0
81  esi81 := p0
82  m[esi81 ] := 7                 // *p = 7
84  dl84 := (n0 -r0 > 0)
85  c85 := c80 - 1
86  goto after_loop if dl84 = 0    → n0 -r0 <= 0
loop:
(c) Error: no propagation of memory expressions:
80  c := n - r
81  esi81 := p
82  m[esi81 ] := 7                 // *p = 7
84  dl84 := (c > 0)
85  c := c - 1
86  goto after_loop if dl84 = 0    → c <= 0
loop:
Figure 4.16: The loop of the running example with c and denom in local memory,
(b) with and (c) without renaming and propagation of memory expressions.
Figure 4.16(c) shows the result of propagating only non-memory expressions; the while
loop now tests c after the decrement, which is incorrect.
Note that although no definition of a memory location was propagated, dl (an 8-bit
register, the low byte of register edx) was propagated, and carried the old value of c
(from statement 84) across a definition of c (statement 85). LLVM appears to avoid this
problem by making memory locations accessible only through special load and store
bytecode instructions; all expressions are in terms of virtual registers, never memory
locations directly. Load and store instructions are carefully ordered so that the original
semantics are preserved (e.g. to test a memory location, LLVM would first load it into a
virtual register and compare that against 0). Decompilers cannot avoid converting some
memory expressions (e.g. m[sp-K]) to variables, but the fact that a variable originated
from memory expressions can be recorded.
One solution would be to not allow propagation of any statements that contain (or
contained in the past) any memory expression components on the right hand side.
However, this has the drawback of making programs difficult to read. For example,
consider the source statement a[i] = b[g] + c[j], where g is a global variable.
The IR and the generated code, if not propagating non-local memory expressions, would
be similar to
r1 := m[G]                     r1 = g;
r2 := m[B+r1]                  r2 = b[r1];
r3 := m[C+rj]   # rj holds j   r3 = c[j];
r4 := r2 + r3
m[A+ri] := r4   # ri holds i   a[i] = r2+r3;
The reader would rather see a[i] = b[g] + c[j]; it takes alias analysis to verify
that these lines can be combined. Naturally, this must be done with alias safety.
4.2.3 Solution
A solution to the memory expression problem is given, which involves propagating only
suitable local variables, and delaying their subscripting until after propagation of non-
memory locations.
The solution to the problem of conservatively propagating memory locations has several
facets. Firstly, only stack local variables whose addresses do not escape the current
procedure are propagated at all. Such local variables are where virtual registers in a
compiler's IR are allocated, apart from those that are allocated to registers. All other
memory expressions, including global variables, array element accesses, and dereferences
of pointers, are not propagated.
Naturally, with a suitable analysis of where the address of a local variable is used, the
restriction on propagating local variables whose address escapes the local procedure
could be relaxed. Also, where the address is used inside the current procedure, care
must be taken not to propagate a local variable across an aliased definition. Escape
analysis is harder at the machine code level than at the source code level. To
take the address of local variable foo at the source code level, the expression &foo
or equivalent is always involved. If the address of foo at the machine code level is
sp-12, the expression sp-12 may be involved, or it may actually be used to find the
address of the variable at sp-4, or the expression sp-24 might be used to forge the
address of foo by adding 12 to the address of a
structure whose address is sp-24. Escape analysis is the subject of future work.
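A minimal sketch of such an escape test follows, with hypothetical structures; it only catches the simplest case (a bare sp-K appearing outside a dereference), and, as noted above, a fully safe implementation must also account for addresses forged by offsetting a different bare stack address.

#include <vector>

// One IR statement, reduced to the stack offsets it mentions: "bare" uses
// are expressions of the form sp-K appearing outside any m[...] dereference.
struct Stmt {
    std::vector<int> bareStackOffsets;
};

// The local at sp-K may be renamed and propagated only if its address never
// appears as a bare value; a bare sp-K could be stored, passed to a call,
// or have an offset added to it. (A fully safe test would also treat any
// bare sp-J with J > K as potentially forging the address of sp-K.)
bool addressEscapes(const std::vector<Stmt>& proc, int K) {
    for (const Stmt& s : proc)
        for (int off : s.bareStackOffsets)
            if (off == K)
                return true;        // address taken: do not propagate
    return false;
}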
For those memory locations that are identified as able to be propagated, their sub-
scripting and hence propagation is delayed until the first round of propagations and
preservation analysis is complete. This requires a Boolean variable, one for each pro-
cedure, which is set after these propagations and analyses. Before this variable is set,
no attempt is made to subscript or create φ-functions for memory locations. After the
Boolean is set, the subscripting and φ-function inserting routines are called again. Only
the memory locations identified as suitable are renamed.
For propagation over short distances to avoid temporary variables as noted at the end
of the last section, a crude propagation could be attempted one statement at a time,
checking each component propagated for alias hazards associated with the locations
defined in the statement being propagated across. This would allow propagation of
array elements, as per the example at the end of the last section.
Figure 4.17 shows the result of applying the solution of this section to the problem
of Figure 4.14.
1 eax1 := malloc(8)
2 esi2 := eax1
3 edi3 := eax1 + 4
...
13 m[eax1 ]13 := exp1
14 m[eax1 +4]14 := exp2
15 m[eax1 +4]15 := exp3
26 printf ..., m[eax1 ]13 →exp1 , m[eax1 +4]15 →exp3
Figure 4.17: The example of Figure 4.14 with expression propagation before
renaming memory expressions.
Now, expression propagation has replaced esi2, esi2 +4, and edi3 in statements 13-15 with
the more fundamental expressions involving eax1. As a result, when SSA subscripting is
re-run, the two assignments in statements 14 and 15 are treated correctly as assignments
to the same memory location.
4.3 Preserved Locations
Preservation analysis is needed because preserved locations are an exception to the usual rule that definitions kill other definitions.
Consider two static definitions of the same location. There is a change of the expression
currently assigned to the location defined in the first definition, which normally means
that this first value is lost. However, this is not always the case. A common example
is in a procedure, where callee-save register locations are saved near the start of the
procedure, and restored near the end. The register is said to be preserved. Between
the save and the restore, the procedure uses the register for various purposes, usually
unrelated to the use of that register in its callers.
Which locations a procedure preserves is usually specified by the
application binary interface (ABI). However, there are many possible ABIs, and the one
in use can depend on the compiler and the calling convention of a particular procedure.
In addition, a compiler may change the calling convention for calls that are not visible
outside a compilation module, e.g. to use more registers to pass parameters. Assembler
code or binary patches may also have violated the convention. This is particularly
important if the reason for decompiling is security related. Hence, the safest policy is
to prove which locations are preserved from the code itself.
A pop instruction, or load from stack memory in a RISC-like architecture, which re-
stores the register is certainly a definition of the register. The unrelated definitions of
the register between the save and the restore are certainly killed by the restore. How-
ever, any definitions reaching the start of the procedure in reality reach the end of the
procedure. Since the save and restore are on all possible paths through the procedure,
definitions available at the start of the procedure are also available at the end.
a := 5        a := 5;
x := a        x := 5;         x := 5;           x := 5;
call proc1    proc1();        a := proc1(5);    proc1(a); // ref param
y := a        y := 5;         y := a;           y := a;
(a) Input     (b) Ideal       (c) Decompilations resulting from ignoring
    program       decompilation   the restoration of variable a in proc1
Figure 4.18: The effect of ignoring a restored location. The last example uses a call-
by-reference parameter.
Either the decompiled output has extra returns, or extra assignments. More impor-
tantly, the propagation of constants and other transformations which improve the out-
put are prevented.
It may be tempting to reason that the saving and restoring of registers at the beginning
and end of procedures could be recognised as a special case.
However, compilers may spill (save) registers at any time, and assembler code could
save and restore arbitrary locations (such as local or global variables). Treating push
and pop instructions as special cases in the intermediate representation is adequate for
architectures featuring such instructions (e.g. the dcc decompiler [Cif94] does this), but
it is not a general solution.
Saving of locations could occur without writing and reading memory, as shown in Figure
4.19. In this contrived example, an idiomatic code sequence of three xor instructions is
used to swap two registers.
Figure 4.19: Pseudo code for a procedure. It uses three xor instructions to
swap registers a and b at the beginning and end of the procedure. Effectively,
register a is saved in register b during the execution of the procedure.
By itself, expression propagation cannot show that the procedure of Figure 4.19 pre-
serves variable a. The first statement could be substituted into the next two, yielding
a=b and b=a, except that a is redefined on the path from the first to the third statement.
A temporary location is required to represent a swap as a sequence of assignments. The
combination of expression propagation and SSA form can be used to analyse this triplet
of xor instructions, as shown in Figure 4.20.
b1 = a0 ^ b0   ; b1 = a0 ^ b0
a1 = a0 ^ b1   ; a1 = a0 ^ (a0 ^ b0) = b0
b2 = a1 ^ b1   ; b2 = b0 ^ (a0 ^ b0) = a0
a2 = 5         ; define a (new live range)
print(a2)      ; use a                                     print(5)
b3 = a2 ^ b2   ; b3 = 5 ^ a0
a3 = a2 ^ b3   ; a3 = 5 ^ (5 ^ a0) = a0 : preserved
b4 = a3 ^ b3   ; b4 = a0 ^ (5 ^ a0) = 5 : overwritten      b4 = 5
ret            ; return to caller                          return b4
(a) Intermediate representation                    (b) After prop. & DCE
Figure 4.20: Pseudo code for the procedure of Figure 4.19 in SSA form. Here it is
obvious (after expression propagation) that a is preserved, but b is overwritten.
In this example, it is clear that a is preserved (the final value for a3 is a0), but b is
overwritten (the final value for b4 is 5, independent of b0). Effectively, b is used to
return the value 5, despite the fact that the assignment of that value was to the variable a. If no callers use b, it
can be removed from the set of returns, but not from the set of parameters. Figure 4.21
shows the procedure with two extra statements added.
With propagation and dead code elimination, it is obvious that a and b are now pa-
rameters of the procedure (since they are variables with zero subscripts).
b = xor a, b        b1 = a0 ^ b0
c = b               c1 = a0 ^ b0
a = xor a, b        a1 = b0
b = xor a, b        b2 = a0
a = 5               a2 = 5
print(a)            print(5)          print(5)
print(c)            print(a0 ^ b0)    print(a0 ^ b0)
b = xor a, b        b3 = 5 ^ a0
a = xor a, b        a3 = a0
b = xor a, b        b4 = 5            b4 = 5
ret                 ret               return b4
(a) Original code   (b) SSA form      (c) After propagation and DCE
Figure 4.21: The procedure of Figure 4.19 with two extra statements.
The above example illustrates the power of the SSA form, and how the combination of
SSA with propagation and dead code elimination resolves many issues.
All procedures can trivially be transformed to only have one exit, and a definition
collector (see Section 4.5) can be used to find the version of all variables reaching the
exit. This makes it easy to find the preservation equation for variable x: it is simply
x0 = xe, where e is the definition reaching the exit, provided by the definition collector
in the return statement. With some manipulation and a lot of simplification rules, this
equation can be resolved to true or false. This algorithm for determining the preservation
of locations using the SSA form was first implemented in the Boomerang decompiler.
In the absence of control flow joins, expression propagation will usually make the deci-
sion trivial, as in Figure 4.21, where the preservation equation for a is a3 = a0, and the
definition of a3 simplifies directly to a0. Where there are control flow joins, there
will be φ-functions; the equation is checked for each operand of the φ-function.
Consider Figure 4.22, where the definition reaching the return is ebx2 := ebx1 - 1, so
that after one substitution the equation is ebx0 = ebx1 - 1. If the definition for
ebx1 happened to be ebx0 +1, ebx would indeed be
preserved by procedure comb. However, the definition in the example is the φ-function
φ(ebx0, ebx2). The current equation ebx0 = ebx1 -1 is therefore split into two equa-
tions, one for each operand of the φ-function, substituting into ebx1 of the current
equation. The resultant equations are ebx0 = ebx0 -1 and ebx0 = ebx2 -1, both of which
have to simplify to true for ebx to be analysed as preserved by comb. In this case, the
first equation evaluates to false, so there is no need to evaluate the second.
comb: ...
loop:
ebx1 := φ(ebx0 , ebx2 )
...
ebx2 := ebx1 - 1
goto loop1 if ebx2 >= 0
...
return {Reaching definitions: ebx2 , ...}
Figure 4.22: A version of the running example with the push and pop of ebx
removed, illustrating how preservation analysis handles φ-functions.
Note that if the second equation had to be evaluated, there is a risk of infinite recursion that
has to be carefully avoided, since ebx2 is defined in terms of ebx1 and ebx1 is partly
defined in terms of ebx2. It is possible to encounter φ loops that involve several
φ-functions, and which are not as easily recognised as in this example.
Section 7.5 on page 234 describes the implementation of an equation solver in the
Boomerang decompiler which uses the above techniques. The presence of recursion
complicates preservation analysis considerably, as discussed in Section 4.4.
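The following C++ sketch gives the flavour of such a solver for the simple case of copies with constant offsets; all names and structures are hypothetical, and Boomerang's real solver handles far more general expressions. The 'seen' set plays the role of the premise mechanism described in Section 4.4: a preservation is assumed to hold until refuted, which safely breaks φ loops.

#include <map>
#include <set>
#include <string>
#include <vector>

// Each SSA name is defined either by a φ-function or by name := copyOf + addend;
// names ending in "0" stand for the value on entry to the procedure.
struct Def {
    bool isPhi = false;
    std::vector<std::string> phiOperands;  // operands, if isPhi
    std::string copyOf;                    // source location otherwise
    int addend = 0;                        // constant added to it
};
std::map<std::string, Def> defs;           // filled by earlier analysis

// Try to prove the equation x0 = name + accum by substituting definitions.
// A φ-function splits the equation into one sub-equation per operand, all of
// which must hold. Re-visiting a name returns true provisionally (a premise).
bool preserved(const std::string& name, int accum, std::set<std::string>& seen) {
    if (name.back() == '0')                // entry value reached:
        return accum == 0;                 //   x0 = x0 + accum iff accum == 0
    if (!seen.insert(name).second)
        return true;                       // assumed preserved until refuted
    const Def& d = defs.at(name);
    if (d.isPhi) {
        for (const std::string& op : d.phiOperands)
            if (!preserved(op, accum, seen))
                return false;              // any failing path fails the proof
        return true;
    }
    return preserved(d.copyOf, accum + d.addend, seen);
}

For Figure 4.22, defs would map ebx2 to ebx1 - 1 and ebx1 to φ(ebx0, ebx2); the query preserved("ebx2", 0, seen) substitutes to ebx0 = ebx0 - 1 along the first operand and correctly fails.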
Just as preserved locations appear to be used when for all practical purposes they are
not, there are several algebraic identities which appear to use their parameters, but
since the result is a constant, do not in reality use them. These include:
x - x = 0      x ^ x = 0  (^ = xor)      x | ~x = -1
x & 0 = 0      x & ~x = 0                x | -1 = -1
For each of these, a naive data flow analysis will erroneously conclude that x is used by
the left hand side. The exclusive-or and subtract versions are commonly emitted by
compilers to zero a register, so a decompiler must be
aware of such identities to prevent needless overestimating of uses. The constant result
is shorter than the original expression, thereby making the decompiled output easier
to read. Often the constant result can combine with other expressions to trigger more
simplifications. It is therefore best to replace such expressions with their
constant values before data flow analysis is performed. These changes can be made as
part of the typical set of simplifications that can be performed on expressions, such
as x + 0 = x.
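A sketch of how such identities might be applied during simplification follows; the helper and its parameters are hypothetical, and a real simplifier pattern-matches on the expression tree instead.

#include <optional>
#include <string>

// Return the constant value of a binary expression when one of the above
// identities applies, or nothing otherwise. The caller supplies structural
// facts about the operands that it has already computed.
std::optional<long> constantIdentity(const std::string& op,
                                     bool operandsEqual,          // x op x
                                     bool rhsIsZero,              // x op 0
                                     bool rhsIsMinus1,            // x op -1
                                     bool rhsIsComplementOfLhs) { // x op ~x
    if (operandsEqual && (op == "-" || op == "^")) return 0;  // x-x, x^x
    if (op == "&" && (rhsIsZero || rhsIsComplementOfLhs)) return 0;
    if (op == "|" && (rhsIsMinus1 || rhsIsComplementOfLhs)) return -1;
    return std::nullopt;
}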
Equation 3.6 on page 84 states that the initial parameters are the locations live on entry
at the start of the procedure after intersecting with a parameter filter. Two exceptions
have been seen. Firstly, preserved locations are often not parameters, although they
could be. Figure 4.23 shows how propagation and dead code elimination combine to
eliminate such false parameters.
Figure 4.23: Analysing preserved parameters using propagation and dead code
elimination.
Secondly, locations involved in one of the above identities are never parameters. For
example, if the first use of register eax in a procedure is in an instruction
which performs an exclusive-or of eax with itself, e.g. eax := eax ^ eax, then
eax is not a parameter. Application of the identity x ^ x = 0 solves this problem.
Hence, the equation for final parameters is in fact the same as that for initial parameters
(Equation 3.6), but parameters are only final after propagation, dead code analysis, and
the application of identities such as these.
Figure 4.24 shows the IR for a fragment of the recursive version of the running example,
before and after call bypassing.
3 ebp3 := esp0 - 4
11 goto L2 if eax7 > m[r280 + 8]
12 m[esp0 - 32] := 1
13 goto L2
L1:
14 eax14 := m[esp0 +4]0 ; n
23 m[esp0 -48]23 := m[esp0 +8] ; argument r
25 m[esp0 -52]25 := eax14 -1 ; argument n-1
30 eax30 , ebp30 , m[ebp30 +8]30 := CALL rcomb ; Recurse
Reaching definitions: ... esp = esp0 - 56, ebp = esp0 - 4, ...
79 st79 := st76 *f (double) m[ebp30 +8]30 ; res *= n
L2:
191 ebp191 := φ(ebp3 , ebp30 )
184 esp184 := ebp191 + 8
185 return
(b) Before bypassing. Many statements are removed for simplicity, including
those which preserve and restore ebp.
...
30 eax30 := CALL rcomb ; Recurse
79 st79 := st76 *f (double) m[esp0 +4]0 ; res *= n
L2:
191 ebp191 := esp0 - 4 ; φ-function now an assignment
184 esp184 := ebp191 + 8 → esp0 + 4 ; esp is preserved
185 return
(c) After bypassing.
Note that the preservation of the stack pointer esp depends on the preservation of
another register, the frame pointer ebp. The fact that the call is recursive is ignored
for the present; the effect of recursion on preservation will be addressed soon. For
the purposes of this example, it will be assumed that after preservation analysis, it is
known that ebp is preserved, and that esp is preserved apart from a net increment of
4. (This increment comes about from the fact that the stack is balanced throughout the procedure except
for the return statement at the end, which pops the 32-bit (4-byte) return address from
the stack. X86 call statements have a corresponding decrement of the stack pointer
by 4, where this return address is pushed. Hence, incrementing esp by 4 is the x86
equivalent of preservation.)
The φ-function now has two operands, ebp3 and ebp30. Ebp30 is the value of ebp after
the call; since ebp was found to be preserved by the call, all references to ebp30 can be
replaced by the value that reaches the call, which is esp0 -4 (reaching definitions are
stored in calls by a collector; see Section 4.5). Hence, both operands of the φ-function
evaluate to esp0 -4. There is no longer any need for the φ-function, so it is replaced
by an assignment as shown. Effectively, the value of ebp has bypassed the call (it is as
if the call did not exist) and also bypassed the φ-function caused by the control flow
merge at L2.
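In code, the bypassing step amounts to a substitution; the following C++ fragment sketches it with hypothetical types (the real implementation rewrites expression trees rather than strings):

#include <map>
#include <string>

// Reaching definitions at the call, recorded by its definition collector,
// e.g. "ebp" -> "esp0 - 4".
struct Call {
    std::map<std::string, std::string> reachingDefs;
};

// A use of the post-call version of a location (e.g. ebp30) is redirected to
// the definition reaching the call once the callee is proven to preserve it.
std::string bypass(const Call& call, const std::string& base,
                   bool calleePreserves, const std::string& postCallName) {
    if (calleePreserves)
        return call.reachingDefs.at(base);  // e.g. ebp30 becomes esp0 - 4
    return postCallName;                    // otherwise still defined by call
}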
Following subsections discuss how to determine the best order to process procedures in
when they are involved with recursion, and how to perform the preservation analysis.
A final subsection discusses the problem of removing unused parameters and returns,
which recursion also complicates.
Decompilers have the whole input binary program available at once, unlike compilers,
which typically see source code for only one of potentially many source modules. A
decompiler therefore potentially has access to the IR of the whole program at once.
Definitions and uses in a procedure depend on those in all child procedures, if any, with
the exception that a procedure's returns depend on the liveness of all callers. As has
been mentioned earlier, this must be done after all procedures are analysed. It makes
sense therefore to process procedures in a depth first ordering of the call graph. With this
ordering, child nodes are processed before parent nodes, so the callers have the data
flow summary of callees (as per Section 3.4) available to them. The main summary
information stored in callees that is of interest to callers are the modifieds set, possibly
stored in the return statement of the callee, and the set of live locations at the callee
entry point.
However, it is only possible to access the summary information of every callee if the
call graph is a tree, i.e. there are no cycles in the call graph introduced by recursive
procedures. Many programs have at least a few recursive procedures, and sometimes
there is mutual recursion, which causes larger cycles in the call graph. Figure 4.25
shows an example of a program call graph with many cycles; there is self recursion and
mutual recursion.
Figure 4.25: A small part of the call graph for the 253.perlbmk SPEC CPU2000
benchmark program.
One cycle due to mutual recursion is 2 → 3 → 4 → 6 → 2.
These cycles imply that during analysis of these procedures, an approximation has to be
made of the definitions and uses made by the callees in recursive calls. Since unanalysed
indirect calls also have no callee summary available, they are treated similarly. Both
types of calls will be referred to as childless calls. Section 3.6 concluded that it is
safe to overestimate the definitions and uses of a procedure when there is incomplete
information about its actual definitions and uses. Therefore, it can be assumed that all
definitions reaching a recursive call are used by the callee (all locations whose definitions
reach the call are considered live), and all live locations at the call are defines (defined
by the callee), according to Equations 3.12 and 3.13 respectively on page 85. In effect,
all childless calls (calls for which the callee is not yet analysed, including recursive calls
and unanalysed indirect calls) use <all> and define <all>,
where <all> is a special location representing all live locations (the leftmost <all>)
and all reaching definitions (the rightmost <all>).
This is a quite coarse approximation, but it is safe. Preservation analysis is the main
analysis that must be performed considering all the procedures involved in mutual re-
cursion together. In effect, preservation and bypassing refine the initially coarse assumption that
everything is defined in the callee. For example, in Figure 4.24, the initial assumption
that register ebp is defined at the recursive call is removed, once it is determined by
preservation analysis that ebp is preserved by the call, and is therefore effectively not
defined by it.
Similarly, before preservation analysis, it is assumed that register ebx is used by the
procedure rcomb, i.e. it is live at the call to rcomb. In reality, ebx is pushed at the
start of rcomb and popped at the end, and not used as a parameter. In other words,
it is not actually used by the call, so that definitions of ebx before the call, not used
except by the call, are in reality dead code. After preservation analysis and dead code
elimination, the real parameters of rcomb can be found (recall that preserved locations
appear to be parameters until after dead code elimination). Similarly to the situation
with definitions, the coarse assumption of all locations being used by the recursive call
is refined.
All the procedures involved in a recursion cycle can have the usual data flow analyses
(expression propagation, preservation analysis, call bypassing, etc.) applied with the
approximation of childless calls using and defining everything, which will result in a
summary for the procedures (locations modified and used by the procedures).
Figure 4.26 shows another call graph. In this example, i should be processed before h,
and the cycle consisting of f and g can be
processed independently of the other cycles. Both f and g are processed together using
the approximations mentioned above, and the data flow analyses are repeated until
there is no change.
However, procedures such as j and k, while part of only one cycle, must be processed
with the larger group of nodes, including b, c, d, e, j, k, and l. The difference arises
because f and g depend only on each other, and so once processed together, all infor-
mation about f and g is known before c is processed. Nodes j and k, however, are not
[Call graph drawing; lower nodes in two rows: d f h j l / e g i k]
Figure 4.26: A call graph illustrating the algorithm for finding the correct
ordering for processing procedures.
independent in this way: they share procedures with the cycle through c, d, and e, and there is
also l, which depends on b as well. The order of processing should be as follows: visit
i and h first; process f and g as a group; process b, c, d, e, j, k, and l as a group; and finally finish a.
The procedures that have to be processed as a group are those involved in cycles in
the call graph, or sets of cycles with shared nodes. For these call graph nodes, there
is a path from one procedure in the group to all the others, and from all others in the
group to the first. In other words, each recursion group forms a strongly connected
component of the call graph. In the example graph, there is a path from c to f, but
none from f to c (the call graph is directed), so f is not part of the strongly connected
component associated with c. Nodes l and e are part of the same strongly connected
component, because there is a path from each to the other (via their cycles and the
shared node c).
Finding these groups is the standard problem of finding the strongly connected components of
a graph. It has been shown that this algorithm is linear in the size of the graph
[Gab00]. Decompilers typically do not store call graphs directly, but the algorithm can
be adapted for decompilation by only notionally contracting vertices of the graph that
are found to be in the same group.
Algorithm 3 shows an algorithm for finding the recursion groups and processing them
appropriately. When the child procedure does not cause a new cycle, the recursive call to decompile
performs a depth first search of the call graph. In the example call graph of Figure
4.26, procedure i is decompiled first, followed by h. When the algorithm recurses back
to c, cycles are detected. For example, when child f of c is examined, no new cycles are
found, so f is merely visited. When child g of f is examined, the first and only child
f is found to have been already visited. At this point, path contains a-b-c-f-g. The
fact that f is in path indicates a new cycle. The set child, initially empty, is unioned
with the set of all procedures from f to the end of path (i.e. f and g). As a result, g
is not decompiled yet, and decompile returns with the set {f, g}. At the point where
the current procedure is f, the condition at Note 1 is true, so that f and g are analysed
as a group. (The group analysis will be described more fully in Section 4.4.2 below.)
Note that the cycle involving c and six other procedures is still being built. Assuming
that c's children are visited in left to right order, a, b, c, d, and e have been visited but
not yet completely processed.
When j's child e is considered, it has already been visited, but is not part of path
(which then contains a-b-c-j-k). However, e has already been identified as part of a
cycle c-d-e-c, so e's cycleGrp contains the set {c, d, e}. The first element of path that
is in this set is c, so all procedures after c to the end of path (i.e. {j, k}), are added to
child, which becomes {c, d, e, j, k}. Finally, when l's child b is found to be in path,
l joins the group as well, and all procedures in
this cycle have their cycleGrp members point to the same set.
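The following C++ sketch captures the path-based cycle discovery just described; the structures are hypothetical, and Algorithm 3 interleaves this walk with the actual decompilation work.

#include <algorithm>
#include <memory>
#include <set>
#include <vector>

// 'path' is the current DFS path through the call graph. When a child is
// already on the path, every procedure from that child to the end of the
// path belongs to one recursion group; groups meeting at a shared node are
// merged by pointing all members at the same set (cycleGrp).
struct Proc {
    std::vector<Proc*> callees;
    bool visited = false;
    std::shared_ptr<std::set<Proc*>> cycleGrp;   // non-null once in a group
};

void visit(Proc* p, std::vector<Proc*>& path) {
    p->visited = true;
    path.push_back(p);
    for (Proc* c : p->callees) {
        auto it = std::find(path.begin(), path.end(), c);
        if (it == path.end() && c->visited && c->cycleGrp)
            // Child already in a group: that group may extend this cycle,
            // so scan for the earliest group member on the current path.
            it = std::find_if(path.begin(), path.end(), [&](Proc* q) {
                     return c->cycleGrp->count(q) != 0; });
        if (it != path.end()) {
            // All procedures from *it to the end of path join one group.
            auto grp = (*it)->cycleGrp ? (*it)->cycleGrp
                                       : std::make_shared<std::set<Proc*>>();
            for (auto j = it; j != path.end(); ++j) {
                grp->insert(*j);
                (*j)->cycleGrp = grp;
            }
        } else if (!c->visited) {
            visit(c, path);
        }
    }
    path.pop_back();
    // When p is the shallowest member of its group on the path, the group
    // is complete and can be analysed together (Section 4.4.2).
}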
The analysis of a complete recursion group will now be described.
Consider first a procedure with only self recursion, i.e. a procedure which calls itself
directly. When deciding whether a location is preserved in some self recursive proce-
dure p, there is the problem that at the recursive calls, preservation for location l will
usually depend on whether l is preserved by the call. There may be copy or swap
instructions such that l is preserved if and only if some other location m is preserved.
This will be considered below. The location l will be preserved by the call if the whole
procedure preserves l, but the whole procedure's preservation depends on many parts
of the procedure, including preservation of l at the call. This is a chicken and egg prob-
lem; the infinite recursion has to be broken by making some valid assumptions about
the recursive calls. A location is preserved by a procedure
only if it is preserved along all possible paths in the procedure. When the preservation
analysis follows the various paths, all of them
have to succeed for the overall preservation to succeed. A failure along any path will
result in the failure of the overall preservation. Figure 4.27 shows a simplified control
flow graph for the recursive version of the running example program, which illustrates
the situation.
[Simplified CFG: entry proc rcomb; branch n <= r?; recursive call rcomb (marked ?); return block]
Figure 4.27: A simplified control flow graph for the program of Figure 4.15.
The ticked basic blocks are those which are on a path for which a particular location is
known to be preserved. In some cases, several of the blocks have to combine, e.g. with
a save of the location in the entry block and a restoration in the return block. However,
the result is the same. The only block with a question mark is the recursive call. If
that recursive call preserves the location, then the location is preserved for the overall
procedure. However, the recursive call does not add any new control flow paths; it will
only lead to ticked blocks or the recursive call itself. The call itself does not prevent
the preservation.
Hence, for the purpose of preservation analysis, the original premise can safely be
assumed to succeed until shown otherwise. In effect, the problem "is l preserved in
p?" becomes "assuming that l is preserved by calls to p in p, is the location l preserved in p?".
This assumption breaks the infinite recursion.
Consider now the situation where the preservation of l in p depends on the preservation
of some other location m in p. This happens frequently, for example, when a stack
pointer register is saved in a frame pointer register, so that preservation of the stack
pointer depends on preservation of the frame pointer.
The analysis recurses with the new premise, i.e. that both l and m are preserved. The
difference now is that a failure of either part of the premise (i.e. if either l or m are
found not to be preserved) causes the outer premise to fail (i.e. l is not preserved).
There may be places where the preservation of m depends on whether l is preserved in
p; if so, this can safely be assumed, as described above. There may also be places where
the preservation of m depends on m itself, leading
to infinite recursion again. However, given that the current goal is to prove that m is
preserved, this too can safely be assumed until shown otherwise.
In order to keep track of the assumptions that may safely be made, the analysis main-
tains a set of required premises. This set could be thought of as a stack, with one
premise pushed at the beginning of every analysis, and popped at the end. A premise is
not valid until the outer preservation analysis is complete; in other words, each premise
is a necessary but not sufficient condition for the preservation to succeed. Since the
number of procedures and locations involved in mutual recursion is finite, this analysis
must terminate.
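A sketch of the premise bookkeeping follows, with hypothetical names; the actual path checking is elided behind a callback.

#include <functional>
#include <string>
#include <utility>
#include <vector>

// A premise is a (procedure, location) pair assumed preserved. The stack
// holds all premises of the enclosing analyses; a query matching any of
// them succeeds provisionally, as argued above.
using Premise = std::pair<std::string, std::string>;

struct PremiseStack {
    std::vector<Premise> premises;
    bool assumed(const Premise& p) const {
        for (const Premise& q : premises)
            if (q == p) return true;
        return false;
    }
};

bool provePreserved(const std::string& proc, const std::string& loc,
                    PremiseStack& ps,
                    const std::function<bool(PremiseStack&)>& checkAllPaths) {
    Premise p{proc, loc};
    if (ps.assumed(p))
        return true;                 // safe assumption; breaks the recursion
    ps.premises.push_back(p);        // push at the start of this analysis
    bool ok = checkAllPaths(ps);     // every path must preserve the location
    ps.premises.pop_back();          // pop at the end
    return ok;
}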
Consider now mutual recursion, where p calls q and possibly other procedures, one or
more of which eventually calls p. When the preservation analysis requires that m is
preserved by q, this premise is added to the stack of required premises, as above. The
difference now is that the elements of the stack have two components: the location that
is assumed preserved, and the procedure to which this assumption applies. In all other
respects, the analysis proceeds as for self recursion.
Note how the preservation analysis requires the intermediate representation for each
procedure in the recursion group to be available. In the example
of Figure 4.26, when checking for a location in procedure f, there will at some point
be a need to examine the IR of g, and vice versa. To accommodate this in the presence of
recursion, the analyses are arranged as follows. When Algorithm 3 finds a complete
recursion group, early analyses are performed separately on each of the procedures
in the recursion group. This decodes each procedure, inserting definitions and uses at
childless calls as described earlier. Once each procedure has been analysed to this stage,
middle analyses are performed on each procedure, with a repeat until no change over the
whole group of middle analyses. Preservation analysis is the main analysis applied during
the middle analyses. When this process is complete, each procedure in the recursion group
is finished off with late analyses, involving dead code elimination and the removal of
unused parameters and returns.
The repeat until no change aspect of the middle analyses suggests that perhaps preservation
analysis could be viewed as a fixedpoint dataflow algorithm. The fit is not a good one,
but two possible lattices for this process are shown in Figure 4.28.
Figure 4.28(b) is reminiscent of the textbook lattices for constant propagation as a fixed-
point data flow transformation, or Fig. 1 of [Moh02]. Pres. +0 indicates that the
location of interest has been found to be preserved along some paths with nothing added
to it; Pres. +4 indicates that it is preserved but has 4 added to it, and so on. If along
some path the location is preserved with a
different constant than the current state indicates, the location's state moves to the
bottom of the lattice (not preserved).
Section 7.5.1 on page 235 demonstrates the overall preservation analysis algorithm (not
the fixedpoint formulation) in operation.
After dead code elimination, the combination of propagation and dead code elimination
will have removed uses and definitions of preserved locations that are not parameters.
As a result, no locations live at the entry are falsely identified as parameters due to saves
of preserved locations. However, there may still be some parameters whose only use
in the whole program is to supply arguments to recursive calls. Consider the example
of Figure 4.29.
Procedures a and b are mutually recursive, and the recursion is controlled entirely in b.
The question is what algorithm to use to decide whether any of the parameters of a or b
are redundant. Parameter p is a parameter
of both a and b, but q is not. Parameter p is used only to pass an argument to the
other procedure's recursive call; if
the parameter p is removed from the definitions of a and b and the calls to them, the program is
unchanged. Care is needed, however,
and only locations whose only uses in the whole program are as arguments to calls to
procedures in the current recursion group can be considered redundant. For example,
consider procedure a in Figure 4.29. It currently has two parameters, p and q. When
a recursive call is encountered, the analysis needs to consider the callee (here b) and if
necessary all its callees, for calls to a. During this analysis, parameter p is found to be used only as
an argument to such calls, so it is redundant; q has another use,
so it is not redundant.
However, while s is used only by return statements in recursive procedures (the return
statements of c and d), r has other uses; hence s is a
redundant parameter but r is not. For each procedure involved in recursion, each return
component must be considered (for c and d separately), recursive calls have to be followed, and only return components
defined by calls to other procedures in the current cycle group that are otherwise unused
can be removed.
Once the intermediate representation for all procedures is available (this is a late
analysis, in the sense of Section 4.4.2), unused returns can be removed by searching
through all callers for a liveness (use before definition) for each potential return location.
Livenesses from arguments to recursive calls or return statements in recursive calls are
ignored for this search. When the union of all considered livenesses does not include
the location, the return can be removed from the callee. Removing the returns removes
some uses in the program, so that some statements and some parameters may become
unused as a result.
For example, when a newly removed return is the only use of a define in a call statement,
it is possible that the last use of that define has now been removed; this will happen
if no other call to that callee uses that location. Hence the unused return analysis
has to be repeated for that callee. In this way, changes resulting from unused returns
propagate down the call graph. When parameters are removed, the arguments of all
callers are reduced. When arguments are reduced, uses are removed from the caller's
data flow. As a result of this, the changes initiated from removing unused parameters
propagate up the call graph. Hence, there is probably no best ordering to perform these
removals; they are simply repeated until there is no further change.
Section 7.6 on page 240 demonstrates this algorithm in practice, using the Boomerang
decompiler.
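To make the iteration concrete, the following C++ sketch (hypothetical structures, not Boomerang's) shows unused-return removal as a worklist run until quiescent; it assumes the liveness sets are kept up to date by the surrounding data flow machinery as returns disappear.

#include <deque>
#include <set>
#include <string>
#include <vector>

struct ProcInfo {
    std::set<std::string> returns;
    std::vector<ProcInfo*> callees;      // procedures this one calls
    // Locations live after at least one call to this procedure, over all
    // callers; livenesses from recursive calls and returns are ignored.
    std::set<std::string> liveAfterCalls;
};

// Remove returns that no caller uses. Removing a return deletes uses inside
// this procedure, so a call result within it may lose its last use; the
// callees are therefore re-queued, and changes ripple down the call graph.
// (Parameter removal, not shown, ripples up through the callers instead.)
void removeUnusedReturns(std::deque<ProcInfo*> worklist) {
    while (!worklist.empty()) {
        ProcInfo* p = worklist.front();
        worklist.pop_front();
        bool changed = false;
        for (auto it = p->returns.begin(); it != p->returns.end(); )
            if (p->liveAfterCalls.count(*it) == 0) {
                it = p->returns.erase(it);
                changed = true;
            } else
                ++it;
        if (changed)
            for (ProcInfo* c : p->callees)
                worklist.push_back(c);
    }
}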
4.5 Collectors
Collectors, a contribution of this thesis, extend the sparse data flow information provided
by the Static Single Assignment form in ways that are useful for decompilers, by taking
snapshots of data flow information at specific program points.
The SSA form provides use-def information (answering the question what is the defini-
tion for this use?) and if desired also def-use information (answering the question what
are the uses of this definition?). Sometimes reaching definitions are required, which
answer what locations reach this program point, and what are their definitions?. The
SSA form by itself can answer the second part of the question, but not the first part.
During the conversion of a program to SSA form, however, this information is readily
available.
Decompilers only need reaching definitions at two sets of program points: the end of
procedures, and at call sites. Live variables are also needed at calls, and this information
is also available during conversion. Despite this, SSA is generally considered in the
literature to be unsuitable for backwards data flow information such as live variables
[CCF91, JP93].
These observations are the motivation for definition collectors, which capture reaching
definitions, and use collectors, which capture live variables. The collectors store in-
formation available during the standard variable renaming algorithm, when it iterates
through all uses in a statement. In effect, collectors are exceptions to the usually sparse
storage of data flow information in SSA form. The additional storage requirements are
modest because they are only needed at specific program points: the end of procedures,
and call sites.
Figure 4.30: Use of collectors for call bypassing, caller and callee contexts,
arguments (only for childless calls), results, defines (also only for childless calls),
and modifieds.
Algorithm 4 shows the algorithm for renaming variables, as per [App02], modified for
updating collectors, and for subscripting with pointers to defining statements rather
than with version numbers.
Where the algorithm reads if can rename a, it is in the sense given in Section 4.2.3,
i.e. a is not a memory location, or it is a suitable memory location and memory locations
are being renamed at this stage.
Recall that at childless calls (call statements whose destination is unknown, or has not
had complete data flow analysis performed to summarise the definitions and uses), all
locations are assumed to be defined and used. To implement this successfully, Appel's
algorithm is modified so that at a childless call a special definition (<all>) is
pushed to all elements of Stacks (note that Stacks is an array of stacks, one stack for
each location seen so far). This slightly complicates the removal of these elements,
adding a little bookkeeping.
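The collector update itself is a simple snapshot of the renaming state. A sketch follows, using hypothetical types; Boomerang subscripts with statement pointers rather than strings.

#include <map>
#include <string>
#include <vector>

// Rename stacks from Appel's algorithm: one stack of SSA names per base
// location. For a childless call, a special <all> name is pushed onto
// every stack, implementing the "defines everything" approximation.
using Stacks = std::map<std::string, std::vector<std::string>>;

struct DefCollector {
    std::map<std::string, std::string> reaching;  // base -> reaching def
};

// Called when renaming reaches a call or a return statement: the top of
// each stack is exactly the definition reaching this program point.
void collectReachingDefs(const Stacks& stacks, DefCollector& col) {
    for (const auto& [base, stack] : stacks)
        if (!stack.empty())
            col.reaching[base] = stack.back();
}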
Collectors complicate the handling of statements containing them slightly. For example,
when performing a visitor pattern on a call statement, which contains two collectors,
should the expressions in the collectors be visited? At times this is wanted, but not at
others. For example, call bypassing must be able to update the
right hand side of definition collectors, so in this case the right hand sides of definition
collectors are treated as uses. However, when deciding if a definition is dead, these uses
should be ignored.
As covered in Section 4.3.3, the definitions of locations reaching calls are needed for call
bypassing. Recall from Section 3.4.2 that an expression in the caller context such as m[esp0 -64] is
converted to the callee context (as m[esp0 +4]) using the value espcall of the stack pointer
at the call statement. The definition collector in the call collects reaching definitions,
including that of the stack pointer. Arguments at a childless call are initially all locations
whose definitions reach the call; these
are provided by the definition collector at the call. Similarly, defines at a childless call
are initially all locations live at the call. These are provided by the use collector at the
call.
Call results are computed late in the decompilation process. For each call, the results
are the intersection of the live variables just after the call (in the use collector of the
call) and the defines of the call (which are the modifieds of the callee translated to the
context of the caller). The final returns of the callee depend on the union of livenesses
(obtained from the use collectors in the callers) over all calls. Thus, if a procedure
defines a location that is never used by any caller, it is removed from the returns of the
callee.
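The result computation is a straightforward set intersection, sketched here with hypothetical set-based types:

#include <set>
#include <string>

// Results of a call: the locations the call defines (the callee's modifieds,
// translated to the caller's context) that are live just after the call
// (taken from the call's use collector).
std::set<std::string> callResults(const std::set<std::string>& liveAfterCall,
                                  const std::set<std::string>& callDefines) {
    std::set<std::string> results;
    for (const std::string& loc : callDefines)
        if (liveAfterCall.count(loc))
            results.insert(loc);
    return results;
}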
With suitable modifications to handle aggregates and aliases well, the SSA form obviates
the need for weak updates in value analysis.
[Figure drawing not reproduced: part (a) initialises *pp along only one path; part (b) along both paths.]
Figure 4.31: The weak update problem for malloc blocks. From Fig. 1 of [BR06].
The problem is that in representations without the single assignment property, infor-
mation about locations such as *pp in the example is held in summary nodes; they must
summarise all assignments to *pp in the program. To ensure soundness, value analysis
must always over-approximate the set of possible values, so the summary nodes are
initialised for all variables to the special value ⊤, representing all possible values. All
updates to summary nodes must be weak updates, so the values of the nodes never
change. If the information for *pp was initialised to ∅ (the empty set), information
such as *pp points to a can be established, but in the case of Figure 4.31(a) the fact
that *pp is not initialised along some paths has been lost, and this is the very kind of
error that the analysis is intended to find.
The authors use the elaborate recency-abstraction to overcome this problem while main-
taining soundness. However, ignoring alias issues for a moment, single assignment rep-
resentations such as the SSA form only have one assignment to each variable (or virtual
variable, such as *pp in the example). In such a system, all updates can be strong,
since the one and only assignment (statically) to each variable is guaranteed to change
it to the value or expression in the assignment. For example, let pp1 be the variable
assigned to at the malloc call. pp1 always points to the last heap object allocated at
this site. Following (*pp1 )1 = &a and (*pp1 )2 = &b, the two versions of (*pp1 ) always
point to a and b respectively.
Information about each variable can be initialised to ⊤ as required for soundness. Fol-
lowing a heap allocation, all elements of the structure (in this case there is only one,
*pp) are assigned ⊤ (e.g. (*pp1 )0 = ⊤). In Figure 4.31(a), the version of *pp used when
10 is assigned through it will be the result of a φ-function with (*pp1 )1 and (*pp1 )0 as
operands; the value sets for these are {&a} and ⊤ respectively. The resultant variable will
have a value set of ⊤ (i.e. possibly undefined). In Figure 4.31(b), the equivalent value
for *pp will be the result of a φ-function with (*pp1 )1 and (*pp1 )2 , which have value
sets {&a} and {&b} respectively. The result will have the value set {&a, &b} (i.e. *pp could
point to a or to b). Figure 4.32 shows the example of Figure 4.31(a) in SSA form.
void foo() {
    int **pp, a;
    while(...) {
        pp1 = (int**)malloc(sizeof(int*));
        (*pp1 )0 = ⊤;                       // Value set = {⊤}
        if(...)
            (*pp1 )1 = &a;                  // Value set = {&a}
        else {
            // No initialization of *pp
        }
        (*pp1 )2 = φ((*pp1 )0 , (*pp1 )1 ); // Value set = {⊤}
        *(*pp1 )2 = 10;
    }
}
Figure 4.32: The code of Figure 4.31(a) in SSA form.
This considerable advantage comes at a cost; information is stored about every version
of every variable. While the SSA form solves the simple example above, as usually
applied in compilers the SSA form needs to be extended to handle aggregates well.
More importantly, variables such as *pp1 are only uniquely assigned to if no other
expression both aliases to *pp1 and is assigned to. If the SSA form can be extended to
accommodate the complications of aliasing and aggregates, it shows great promise for
value analysis.
4.7 Other Representations
Many intermediate representations (IRs) for program transformations have been sug-
gested in the literature. The Static Single Assignment
form (SSA form) is one such IR, and there have been several extensions to the basic
SSA form.
The Gated Single Assignment form (GSA form) [TP95] is an extension of the SSA form
designed to make possible the interpretation of programs in that form. This is not
required in a decompiler, since copy statements can be added to make the program
executable. However, it is possible that GSA or similar forms might reduce the number
of copy statements needed when translating out of SSA form.
There is a series of intermediate representations that are based on the idea of abstracting
away the Control Flow Graph (CFG). A notable example is the Value Dependence
Graph (VDG) [WCES94]. The key difference between the VDG and CFG-based IRs is that
the VDG is a parallel representation that specifies a partial order on the operations in the
computation, whereas the CFG imposes an arbitrary total order. The authors claim
that the CFG, for example by naming all values, gets in the way of analysing the
underlying computation.
The running example in their paper shows seven different types of optimisations, all
expressed on the VDG. Some optimisations
such as loop invariant code motion do not apply to decompilation; the best position
for an assignment in source code is the position that maximises readability. Decompilers
based on the IR described in this chapter are already inherently name insensitive, in
the sense that variables such as acopy (a copy of variable a) in their example are auto-
matically replaced with the original definition by the process of expression propagation.
Several of the other optimisations are likewise subsumed
by expression propagation.
VDG does reveal opportunities for parallelisation that might be useful for future de-
compilers that emit source code in some language that expresses parallelism directly.
However, it could be argued that parallelisation opportunities are the job of the com-
piler, not the programmer, and therefore do not belong in source code.
The Static Single Information form (SSI form) was considered for the
intermediate representation for decompilation, but the benefits do not outweigh the costs.
Using the SSI form, as opposed to the SSA form, improves a decompiler's type analysis in
very rare circumstances. One such case involves a pair of load instructions which are
hoisted ahead of a compare and branch. Figure 4.33 shows the source and machine
code of an example.
#define TAG_INT 1
#define TAG_PTR 2

typedef struct {
    int tag;
    union {
        int a;
        int *p;
    } u;
} tagged_union;

int f(tagged_union *tup) {
    int x;
    if (tup->tag == TAG_INT)
        x = tup->u.a;
    else
        x = *(tup->u.p);
    return x;
}
(a) Source code
The problem is that the hoisted load instruction loads either a pointer or an integer,
depending on the path taken in the if statement. In the non-hoisted version, while r0
is sometimes used as a pointer and sometimes as an integer, there are separate definitions for the
two different types. With its concept of having only one definition of a variable, SSA
is very good at separating the different uses of the register, and sensible output can be
emitted, with no type conflicts. The decompiled program will require casts or unions,
but it is correct.
In the compilation with the hoisted load, the one (underlined) load instruction (and
therefore one definition) produces either an integer or a pointer, and there is a path in
the program where that value is returned. In essence, the machine code conflates the
two loads, and uses extra information (the tag field) to perform the right operations
on the result. When the value loaded is a pointer, control always proceeds to the last
load instruction, which dereferences the pointer and loads an integer. It appears that
the program could take the other path, thereby returning the pointer, but this never
happens.
By itself, the SSA form combined with any type analysis is not powerful enough to infer
that the type of the result is always an integer. This is indicated by the type equation
for the return location r0d of Figure 4.34(a), which is T(r0d) = α ∨ α*. This equation
could be read as the type of variable r0d is either an alpha, or a pointer to an alpha.
(α is a type variable, i.e. a variable whose values are types like float or int*. It appears
here because until the return location is actually used, there is nothing to say what type
is stored at offset 4 in the structure; as soon as the return value is used as an integer,
for example, α resolves to int.)
Figure 4.34: IR of the optimised machine code output from Figure 4.33.
As with all three representations, T(r1a) = int, and r0a has the type
struct{int, union{α, α*}}*; in other words, r0a points to a structure with an int followed
by a union of an α and a pointer to an α.
When the program is converted to SSI form, as shown in Figure 4.34(b), the decompiler
spends extra time and space splitting the live range of every live variable at every control
flow split (branch or switch instruction). For example, r0b is split into r0x and
r0y, gambling that something different will happen to the split variables to justify this
extra effort. In this case the gamble succeeds, because r0y is used as a pointer while
r0x remains used only as an integer. This allows each renamed variable to be typed
separately, and the final type for r0d is α, as it should be. Hence this program is one
case where the SSI form has an advantage over the SSA form.
While SSI splits all live variables at every control flow split, propagation removes the
use of r0b as a temporary pointer value, placing the memory indirection directly where
it is used. In other words, the single memory-of operator from the single load instruc-
tion is copied into two separate expressions, neatly undoing the hoisting, and neatly
avoiding the consequent typing problem. The variables r0b, r0c, and most importantly
r0d all have a consistent type, α, as expected.
Once again, the combination of the SSA form and expression propagation is found to
be quite useful for a decompiler. The extra overhead of the SSI form (the time and
space to create the σ -functions) has not been found to be useful for decompilation in
this case.
One area where the extra information could potentially be useful for a decompiler is in
the analysis of array sizes. For example, if an array index takes the values 0 through 9
in a loop, then the version of the index variable in the loop can be shown to take the
range 0-9, while the same variable outside the loop could take other values (typically
only one value, 10, after the loop exits). This information could potentially be used to
correctly emit the initialised values for an initialised array. However, decompilers need
to be able to do the same thing even if the index variable has been optimised away,
effectively deducing the existence of an induction variable. The SSA form appears to
be sufficient for this purpose.
The Dependence Flow Graph (DFG) can be thought of as an extension of the Static
Single Assignment form [JP93]. In this form, the emphasis is on recording dependencies,
both from uses to definitions and vice versa. Figure 4.35 shows the main loop of the
running example in three representations.
Note that the arrows are reversed for the SSA form, since in that form, the most
compact encoding of the dependencies is for each use to point to the unique definition.
This encoding is used in [KCL+99]. However, the dependencies could be recorded as a
list of uses for each definition, as per the earlier SSA papers (e.g. [CFR+91, WZ91]), or both. In
the DFG form, uses could point to definitions (sometimes via merge operators), and/or
definitions could point to their uses (sometimes via switch operators or multiedges).
Switch operators differ from multiedges in the DFG form in that there is a second
input (not shown in Figure 4.35) that carries information about the uses at the target
of the switch operator. For example, at the true output of the switch operator for ebx,
it is known that ebx>0. For the other switch operators, the switch predicate does not
give directly useful information, although it may be possible for example to deduce that
[Legend: control flow; dependence (def-use); dependence (use-def); φ-function; merge operator; switch operator; multiedge]
(a) CFG with def-use chains   (b) SSA form   (c) DFG form
Figure 4.35: A comparison of IRs for the program of Figure 4.1. Only a few def-use
chains are labelled, for simplicity. After Figure 1 of [JP93].
esi is always greater than zero, so that a divide by zero error or exception will never
occur.
The presence of switch operators and multiedges makes the DFG more suitable for
direct execution than the SSA form.
The original control flow graph still exists in the DFG form, so no extra step is necessary
to construct a CFG for high level code generation, which would normally require one.
The DFG form can also readily be converted to a dataflow graph for execution on a
dataflow machine.
The availability of uses for each definition, albeit at some cost for maintaining this
information, could also be useful.
4.8 Summary
The Static Single Assignment form has been found to be a good fit for the intermediate
representation of a decompiler.
It has been demonstrated that data flow analysis is very important for decompilation,
and that the Static Single Assignment form makes most of these analyses considerably easier.
The propagation of memory expressions has been shown to be difficult, but adequate
solutions have been presented.
Preservation analysis was found to be quite important, and in the presence of recursion,
surprisingly difficult. An algorithm has been given which visits the procedures involved
in recursion in a suitable order.
Definition and use collectors have been introduced to take a snapshot of already com-
puted data flow information.
Overall, the SSA form has been found to be a good fit for the intermediate representa-
tion of a machine code decompiler. The main reasons are the extreme ease of propaga-
tion, the simplicity of preservation
analysis, simplicity of dead code elimination, and a general suitability for identifying
parameters and returns.
The SSA form is conventionally used only on scalar variables. Although numerous
attempts have been made to extend it to handle arrays and structures, an extension
that works well for machine code programs has yet to be found.
Chapter 5

Type Analysis for Decompilers
The SSA form enables a sparse data flow based type analysis system, which is well suited
to decompilation.
Compilers perform type analysis to reject invalid programs early in the software
development process, avoiding more costly correction later (e.g. in debugging, or after
deployment). Some source languages such as C and Java require type definitions for
all variables. For these languages, the compiler checks the mutual consistency of
statements which have type implications at compile time, in a process called type checking.
Since much type information is required by the compiler of the decompiled source code,
and no explicit type information exists in a typical machine code program, a machine
code decompiler has considerable work to do to recover the types of variables. This
recovery is the subject of this chapter.
Other languages such as Self and Smalltalk require few type definitions; variables
can hold different types throughout the program, and most type checking is performed
at runtime. Such languages are dynamically type checked. Static (compile
time) type analysis is often still performed for these languages, to discover what types
a variable could take at runtime for optimisation. This process, called type inferencing
or type reconstruction, is harder for a compiler than type checking [KDM03]. Since C-
like languages are more common, this more difficult case will not be considered further
here.
The type analysis problem for decompilers is to associate each piece of data with a
high-level type. The program can reasonably be expected to be free of type errors, even
though some languages such as C allow casting from one type to almost any other.
Various different pieces of data require typing: initialised and uninitialised global data,
stack locations, registers, and constants.
Type analysis is conceptually performed after data flow analysis and before control flow
analysis (structuring), as shown in Figure 5.1.

Figure 5.1: The phases of a typical machine code decompiler. The input binary file is
processed by the loader and decoder (the front end), followed by data flow analysis,
type analysis, control flow analysis (structuring), and finally code generation (the
back end).
Type analysis, even more so than data flow analysis, is vastly more heavily discussed
in the literature from the point of view of compilers. Detailed consideration is given
here to the nature of types in machine code programs, and the requirements for types
in the decompiled output.
The type information present in machine code programs can be regarded as a set of
constraints, which can either be solved directly, or treated as a problem for
data flow analysis. This chapter presents a sparse data flow based type analysis based
on the Static Single Assignment form (SSA form). Since type information can be stored
with each definition (and constant), there is ready access to the current type from each
use of a location. The memory savings from this sparse representation are considerable.
Addition and subtraction instructions, which generate three constraints each, require
a different approach in a data flow based type analysis system. Similarly, arbitrary
expressions deserve special attention in such a system, as there is nowhere to store types
for their intermediate subexpressions.
In any type analysis system for machine code decompilers, there is a surprisingly large
variety of memory expressions that represent data objects from simple variables to array
and structure members. These will be enumerated in detail for the first time.
Section 5.1 lists previous work that forms the basis for this chapter. Section 5.2 in-
troduces the nature of types from a machine code point of view, and why they are so
important. The sources of type information are listed in Section 5.3. Constants require
typing as well as locations, as shown in Section 5.4. Section 5.5 discusses the limitations
of constraint based type analysis. The special problems of addition and subtraction
instructions are considered in Section 5.6, and revisited in Section 5.7.4. An iterative
data flow based solution is proposed in Section 5.7, and the benefits of an SSA version
are given in Section 5.7.2. The large number of memory expression patterns is
considered in Section 5.8. Data needs to be analysed as well as code, and this process
goes hand in hand with type analysis, as discussed in Section 5.9. Section 5.10 mentions
some special types useful in type analysis, Section 5.11 discusses related work, and
Section 5.12 lists work that remains for the future. Finally, Section 5.13 summarises
the contributions of the SSA form to type analysis.
5.1 Previous Work

The work of Mycroft and Reps et al. has some limitations, but they laid the foundation
for type analysis of machine code programs.
Mycroft's paper [Myc99] was the first to seriously consider type analysis for decompila-
tion. He recognised the usefulness of the SSA form to begin to undo register colouring.
However, his scheme has difficulty with array indexing where the compiler generates
more than one instruction for the array access.
Mycroft's work has a number of other limitations, e.g. global arrays do not appear
to have been considered. However, the work in this chapter was largely inspired by
Mycroft's paper.
Reps et al. describe an analysis framework for x86 executables [RBL06, BR05]. Their
goal is to produce an IR that is similar to that which could be produced from source
code, but with low-level elements important to security analysis. They ultimately use
tools designed to read source code for browsing and safety queries. Figure 5.2 gives an
overview.
One of the stated goals is type information, but the papers do not spell out where this
information comes from, apart from a brief mention of propagating information from
library calls.
Three main analyses are used: value set analysis (VSA), a form of value analysis; affine
relation analysis (ARA); and aggregate structure identification (ASI). VSA finds an
overapproximation of the values that a location could take at a given program point.
ARA is a source code analysis modified by the authors for use with executable programs.
It finds relationships between locations, such as index variables and running pointers.
ASI recovers the structure of aggregates such as structures and arrays, including arrays
of structures.
Figure 5.2: Organisation of the CodeSurfer/x86 and companion tools. From
[RBL06].
The authors report results that are reasonable, particularly compared to the total failure
of most tools to analyse complex data structures in executable programs. However, only
55% of virtual functions were analysed successfully [BR06], and 60% of the programs
tested have one or more virtual functions that could not be analysed [BR06, BR05].
72% of heap allocated structures were analysed correctly (i.e. the calculated structure
agreed with the debugging information from the compiler), but for 20% of the programs,
12% or less were correct [BR05]. As will be shown later, the form of memory expressions
found in machine code programs is surprisingly complex. In particular, it is not clear
that their analyses can separate original from offset pointers (Section 1.5.3 on page 19).
It is hoped that an analysis that takes into account all the various possible memory
expression forms will be able to correctly analyse a larger proportion of data accesses,
while also finding types for each data item.
5.2 Type Analysis for Machine Code

Type information encapsulates much that distinguishes low level machine code from high
level source code.

The nature and uses of types will be considered in the next sections from a machine
code point of view. In a statically type checked language, assertions are made about
what values the variables can take, and what operations can usefully be performed on
them. Part of the domain of program semantics applies to variables of type integer
(e.g. a shift left by 2 bits), and others (e.g. a call to method getName()) apply to
objects of type class Customer.
When a program is found to be applying the wrong semantics to a variable, it can be
shown that the program violates the typing rules of the language, so an error message
can be emitted. Types include information about the size of a data object. Types
partition the data sections from blocks of bytes to sets of distinct objects with known
properties.
In a machine code program, type declarations usually have been removed by the
compiler. (An exception is debug information, which may still be present in some
cases.) When decompiling the program, therefore, types have to be inferred from the
way data is used. For example, the fact that a variable is at some point shifted left by
two is a strong indication that it is an integer, and the bytes it was loaded from can
then be distinguished from other data objects in the data sections. It is unlikely that
these four bytes are part of another data object, a string for example, that happens to
precede it in the data section. In this way, the decompiler can effectively reconstruct
the mapping from addresses to typed data objects.
The view of types as assertions leads to the idea of generating constraints for each
variable whenever it is used in a context that implies something about its type. Not
all uses imply full type information. For example, a 32-bit copy (register to register
move) instruction only constrains the size of the type, not whether it is integer, float,
or pointer. The above assumes that pointers are 32 bits in size. Programs generally
use one size for pointers, which can be found by examining the binary file format.
The examples in this chapter will assume 32-bit pointers; obviously other sizes can be
accommodated.
Some operations imply a basic type, such as integer, but no detail of the type, such
as its signedness (signed versus unsigned). The shift left operator is an example of
this. Operators such as shift right arithmetic imply the basic type (integer), and the
signedness (in this case, signed). The partial nature of some type information leads to
the notion of a hierarchy or lattice of types. The type unsigned short is more precise than
short integer (where the signedness is not known), which is more precise than size16,
which in turn is more precise than ⊤ (no type information at all).
In object oriented programs, an important group of types are the class types. It is
desirable to know, for example, whether some pointer is of type Customer* or type
Employee*. Such type information does not come directly from the semantics of
individual machine instructions.
A program expressed in a high level language must contain some type declarations to
satisfy the rules of the language. However, it is possible to not use the type system ex-
cept at the most trivial level. Consider for example a binary translator which makes no
attempt to comprehend the program it is translating, and copies the data section from
the original program as one monolithic block of unstructured bytes. Suppose also that
it uses the C language as a target machine independent assembler; the UQBT binary
translator operates this way [CVE00, CVEU+99, UQB01]. Such low level C emitted
by the translator will contain type-related constructs, since the C language requires
that each variable definition is declared with a type, but all variables are declared
as integer, and casts are inserted as needed to reproduce the semantics of the original
program. One of the main features absent from this kind of code is real type information
for the data.
Types are needed in decompiled output because they aid readability, encapsulate
knowledge, separate pointers from numeric constants, and enable the removal of checks
inserted by the compiler.
From the above discussion, it is clear that types are an essential feature of code written
in high level languages, and without them, readability will be extremely poor.
The separation of pointers and numeric constants is implicit in deciding types for
variables and constants: an item is clearly a pointer or a numeric constant once assigned
a type. Recall that in Section 1.5, separating pointers from numeric constants is one of
the fundamental problems of reverse engineering.
One of the differences between machine code programs and their high level language
equivalents is that features such as null pointer checks and array bounds checks are
usually implicitly applied by the compiler. In other words, they are present at the
machine code level, but a good decompiler should remove them. Type analysis is
needed for this removal; for a null pointer check to be removed, the analysis must find
that a variable is a pointer. Similarly, arrays and array indexes must be recognised
before array bounds checks can be removed. It could be argued that if the null pointer
or array bounds checks are removed by a decompiler but were present in the original
source code, information has been lost; however, such checks are usually generated by
the compiler rather than written explicitly by the programmer.
Recursive types are types where part of a type expression refers to itself. For example,
a linked list may have an element named Next which is a pointer to another element of
the same linked list. Care must be taken when dealing with such types. For example,
consider a naive algorithm for emitting a type expression: for a
pointer, emit a star and recurse with the child of the pointer type. This works for
non-recursive types, but fails for the linked list example (it will attempt to emit an
infinite sequence of stars).
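A sketch of one way to avoid the infinite recursion, assuming a hypothetical Type representation (these are not Boomerang's actual class names): structures are always emitted by name, so the recursion through a pointer back to the same structure terminates.

    #include <stdio.h>

    typedef struct Type {
        enum { T_INT, T_PTR, T_STRUCT } kind;
        struct Type *child;             /* the pointee, for T_PTR */
        const char  *name;              /* the tag, for T_STRUCT  */
    } Type;

    void emitType(const Type *t) {
        switch (t->kind) {
        case T_INT:    printf("int"); break;
        case T_PTR:    emitType(t->child); printf("*"); break; /* star and recurse */
        case T_STRUCT: printf("struct %s", t->name); break;    /* by name: stops   */
        }
    }

The Next field of a linked list is then emitted as struct list*, and the recursion stops at the structure name.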
Most machine instructions deal with elementary (simple) types. Such types include
integers and floating point numbers. Enumerations and function pointers are represented
at the machine code level as integers. Aggregate types are combinations of elementary
types, such as arrays and structures.
The other main class of types (other than elementary and aggregate types) are the
pointers. These will be considered in more detail in the sections that follow.
Aggregate types are usually handled at the machine code level one element at a time.
The few machine instructions which deal with aggregate types, such as block move or
block set instructions, can be broken down into more elementary instructions in a loop.
While machine instruction semantics will often determine the type of an elementary data
item, aggregate types can generally only be discovered as emerging from the context
of several instructions.
Figure 5.3 shows source and machine code that uses elementary and aggregate types.
Note how on a complex instruction set such as the Intel x86, objects of elementary types
are accessed with simple addressing modes such as m[r1], while the aggregate types
use more complex addressing modes such as m[r1+r2*S] (addressing mode (r1,r2,S))
and m[r1+K] (r1 and r2 are registers, and S and K are constants).
Figure 5.3: Elementary and aggregate types at the machine code level.
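A hypothetical C fragment exercising both kinds of addressing mode might look like this:

    int g;                           /* elementary: accessed as m[K]             */
    int a[100];                      /* array: a[i] is m[K + r*4], mode (K,r,4)  */
    struct rec { char c; int n; } s; /* structure: s.n is m[K + 4]               */

    int f(int *p, int i) {
        return *p                    /* m[r1]: simple addressing mode            */
             + a[i]                  /* m[r1 + r2*4]: scaled index addressing    */
             + s.n;                  /* m[r1 + 4]: base plus constant offset     */
    }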
C programmers are aware that it is possible to access an array using either indexing
or by incrementing a pointer through the array. Generally, pointers are more efficient,
so code written with array indexing may be converted by an optimising compiler to
equivalent code that manipulates pointers. This implies that a decompiler is free to
represent array handling code either way. Users may prefer one representation over the
other. It could be argued that the representation can safely be chosen arbitrarily: one
representation could always be produced, and a post decompilation transformation
phase used if the other is preferred.
The use of running pointers on arrays has an important implication for the recovery of
pre-initialised arrays. Simplistic type analysis expecting to see indexing semantics for
arrays (e.g. [Myc99]) may correctly decompile the code section of a program, but fail
to recover the initial values for the array. Figure 5.4 illustrates the problem.
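Figure 5.4 is not reproduced here, but the two forms it contrasts are in essence as follows (addresses and data assumed, for illustration only):

    void process(char c);
    char a[10] = "ABCDEFGHI";        /* initialised data section array          */

    void fa(void) {                  /* (a) indexing: a and its 10 elements are */
        int i;                       /*     evident from the code               */
        for (i = 0; i < 10; i++)
            process(a[i]);
    }

    void fb(void) {                  /* (b) running pointer, after the compiler */
        char *p = (char*)10000;      /*     substituted a's address (10 000)    */
        while (p < (char*)10010)     /* 10 010 appears only as a char* bound    */
            process(*p++);
    }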
In Figure 5.4(a), the type analysis discovered an array, and in order to declare the array,
its size is needed. If the array lies in a read-only section of the binary file, its size and
type can be used to declare initial values for the array (not shown in the example). In
Figure 5.4(b), the type of p is char*, and the type of the constants is also char*. The
fact that address 10 000 is used as a char may prompt one char to be declared as a
character, and if in a read-only section of the binary file, it may be given an initial
value. Note that there is no prompting to declare the other nine values as type char.
Worse, the constant 10 010 is used as type char*, which may prompt the type analysis
to declare the object at address 10 010 to be of type char, when in fact this is just the
next data item in the data section after array a. That object could have any type, or
be unused. This demonstrates that whenever constant K is used with type Y*, it does
not always follow that location *K (i.e. m[K]) is used with type Y.
The size of the array in Figure 5.4(a) is relatively easy to determine. In other cases,
it may be difficult or impossible to determine the length of the array. For example, a
special value may be used to terminate the loop. The loop termination condition could
be arbitrarily complex, necessitating in effect that the program be executed to find the
terminating condition. Even executing the program may not determine the full size of
the array, since for each execution, only part of the array may be accessed. In these
cases, type analysis may determine that there is an initialised array, but fail to compute
its size.
(a) Original source code:

    typedef struct {
        int   i;
        float f;
    } s;
    s as[10];
    s* ps = as;
    while (ps < as+10) {
        processInt(ps->i);
        processFloat(ps->f);
        ++ps;
    }

(b) Equivalent of the machine code, with a running void* pointer:

    void* p = (void*) 10000;
    while (p < (void*) 10080) {
        processInt(*(int*)p);
        p += 4;
        processFloat(*(float*)p);
        p += 4;
    }
Figure 5.5: A program referencing two different types from the same pointer.
Another problem arises with the program shown in Figure 5.5(a). Here the structure
contains one integer and one float; inside the loop, the pointer references alternately an
integer and a float. The original compiler has rearranged the code to increment the
pointer by the size of the integer and the size of the float after each reference, to make
the machine code more compact and efficient. As shown in Figure 5.5(b), the type of
the pointer is now void*, since it is used as two different types.
Programs such as shown in Figures 5.4(b) and 5.5(b) are less readable but correct (as-
suming suitable casts and allocating memory at the appropriate addresses) if the array
is not initialised. However, the only way to make them correct if the array is initialised
is to use binary translation techniques: force the data to the original addresses, and
reverse the data sections if the endianness of source and target machines is different.
Needless to say, the results are far from readable, and translation to machines with
different pointer sizes (among other important characteristics) is simply not feasible.
To avoid this extreme unreadability, it is necessary to analyse the loops containing the
running pointers, effectively finding the ranges that the pointers can take.
Pointer range analysis for decompilers is not considered here; analyses such as the value
set analysis of Reps et al. [BR05] may be applicable.
5.3 Sources of Type Information

Type information arises from machine instruction opcodes, from the signatures of library
functions, to a limited extent from the values of some constants, and occasionally from
debugging information.
One of the best sources of type information in a machine code program is the set of
calls to library functions. Generally, the signature of a library function is known, since
the library's interface is published in header files or documentation. A
signature consists of the function name, the number and type of parameters, and the
return type (if any) of the function. This information can be stored in a database of
signatures derived from parsing header files, indexed by the function name (if
dynamically linked), or found with signature recognition techniques (if statically linked)
[FLI00, VE98].
A limited form of type information comes from the value of some constants. In many
architectures, certain values can be ruled out as pointers (e.g. values less than about
0x100). One of the major decisions of type analysis is whether a constant is a pointer
or not.
All machine instructions will imply a size for each non-immediate operand. Registers
have a definite size, and all memory operations have a size encoded into the opcode
of the instruction. For instructions such as some load, store and move instructions,
there is no information about the type other than the size, and that the type of the
source and destination are related by T(dest) ≥ T(src). (T(x) denotes the type of x.
The inequality only arises with class pointers; the destination could point to at least
as many things as the source.) The same pointer-sized move instruction could be used
to move a pointer, integer, floating point value, etc. As a result, there needs to be a
representation for a type whose only information is the size, e.g. size16 for any 16-bit
quantity.
In some machine code programs, there is runtime type information (sometimes called
RTTI or runtime type identification). This is typically available where the original
program used the C++ dynamic_cast operation or similar. Even if the original source
code did not use such operations explicitly, the use of certain libraries such as the
Microsoft Foundation Classes (MFC) may introduce RTTI implicitly [VEW04]. When
RTTI is present in the input program, the names and possibly also the hierarchy of
classes is available. Class hierarchy provides type information; for example, if class A is
derived from class B, a pointer to B may actually point to an object of class A.
Finally, it is possible that the input program contains debug information or symbols.
When debug information is available in the input program, the names of all procedures
are usually available, along with the names and types of parameters. The types of
function returns and local variables may also be available, depending on the details of
the debug information present.
Most type information travels from callee to caller, which is convenient for decompi-
lation because most of the other information (e.g. information about parameters and
returns) also travels from callee to caller. However, some type information travels in
the opposite direction. Calls connect arguments to parameters and returns to
results, with values sent from argument to parameter and from return to result. The
direction that type information from the sources discussed above travels depends on
whether the library function, constant, machine instruction, etc. resides in the caller
or callee.
Consider the simple example of a function that returns the sum of its two 32-bit (pointer-
sized) parameters. The only type information known about the parameters of this
function, taken in isolation, is that both parameters and the return are of size 32 bits,
and not of any floating point type. In other words, several combinations of pointer and
integer are possible. If types are found for either of the callers' actual arguments or for
the callers' uses of the returned result, the types of the parameters and return can be
constrained further.
This example also highlights a problem with type analysis in general: there may be
several possible solutions, all valid. The programs for adding a pointer and an integer,
an integer and a pointer, and two integers, could all compile to identical machine
code programs. This problem is most acute with small procedures, however, since the
chance of encountering a use that implies a specific type increases as larger procedures
are considered.
When type analysis is implemented as a data flow problem, neither forward nor reverse
traversal of the flow graph will result in dramatically improved performance. The
bidirectionality has its roots in the fact that most type defining operations affect a
destination (a definition) and some operands (uses). The type effects of definitions flow
forward to uses of that definition; the type effects of uses flow backwards to definition(s)
of those uses. Library function calls similarly have bidirectional effects. Call by value
actual arguments to the library calls are uses of locations defined earlier (type
information flows backwards). Return value types and call by reference arguments are
definitions whose type effects flow forwards. As a result, type analysis is one of few
problems that is naturally expressed in data flow terms as a bidirectional analysis.
Comparison operations result in type constraints that may not be intuitive. The equal-
ity relational operators (= and ≠) imply only that the types of the variables being
compared are comparable. (If two types are comparable, it means that they are com-
patible in the C sense; int and char* are not compatible so they can't be compared.)
The type of either operand could be greater or equal to the type of the other operand.
The ordering relational operators (e.g. <, ≥), when applied to pointers, imply that
both operands point into the same array of uniformly typed objects, since high level
languages do not allow assumptions about the relative addresses of other objects.
Figure 5.7(b) shows an example where the constant a+10 should be typed as int*, even
though the same numeric constant may be used elsewhere as the address of the float f.
These operators therefore imply that the types of the two pointer operands are the
same.
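A fragment in the spirit of Figure 5.7(b), with a hypothetical memory layout assumed for illustration:

    int   a[10];
    float f;                          /* may happen to be placed just after a      */

    int sumA(void) {
        int s = 0, *p = a;
        while (p < a + 10)            /* both operands are int*: pointers into a   */
            s += *p++;                /* the bound a+10 may equal &f numerically,  */
        return s;                     /* but must not be typed as float*           */
    }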
Some relational operators imply a signedness for the operands (e.g. signed less than,
unsigned greater or equal). Such implications are often weak, hence signed integers,
unsigned integers, and integers of unknown sign should not be treated as incompatible;
the final declared signedness of a variable can be decided by a heuristic such as the
signedness implied by the majority of its uses.
Usually, a different comparison operator is used when comparing integers and floating
point numbers, so the basic type is implied by these operators. Fixed point numbers
will often be manipulated by integer operators (e.g. comparison operators, add and
subtract), with only a few operations (e.g. fixed point multiply) identifying the operands
as fixed point. It may therefore be possible to promote integers to fixed point numbers
when such operations are encountered.
5.4 Typing Constants

Constants have types just as locations do, and since constants with the same numeric
value need not share a type, each occurrence of a constant must be typed separately.

In high level languages, constants have implied types. For example, the type of 2.0 is
double; 'a' is char, 0x61 is integer, and so on. However, in decompilation, this does
not apply. Constants require typing just as locations do. Depending on the type of the
constant, the same immediate value (machine language constant) might end up emitted
quite differently in the decompiled output, so each occurrence must be typed
independently. For example, a statement such as m[1000₁] := m[1000₂] + 1000₃
contains three constants with the same value, each of which could take a different type,
where each α represents an arbitrary type, but each appearance of α implies the same
arbitrary type as the other αs. In the third instance, 1000₃ is a pointer to an α, which
implies that a cast must have been used in the original program for that solution to be
valid, and a cast will be needed in the decompiled output.
Note also that the value in a register can be an intermediate value that has no type.
The value 0x10000 in register %o0 has no standard type, although it appears to be used
as part of two different types, char* and integer. It must not appear in the decompiled
output. The intermediate value is a result of a common feature of RISC machines: since
RISC instructions are usually one word in length, two instructions are needed to produce
most word-length constants. Constant propagation and dead code elimination, both
facilitated by the SSA form, readily solve this problem by removing the intermediate
constants.
5.5 Type Constraint Satisfaction

Finding types for variables and constants in the decompiled output can be treated as a
Constraint Satisfaction Problem (CSP) [VH89, Bar98]. The domain of type variables is
relatively small; among the possible values are:
• an elementary type, such as an integer, floating point number, or pointer
• a structure or class
• an enumerated type
Figure 5.8 shows a program fragment in source code, machine code (with SSA trans-
formations applied), and the type constraints generated for each instruction.
In the second instruction, the register r1a (first SSA version of register r1) is set to 0.
Since zero could be used as an integer but also as a NULL pointer, the constraints
are that r1a could be an integer (t1a = int) or a pointer to something, call it α1 (t1a =
ptr(α1)). The constraints generated by the add instruction are more complex, reflecting
the fact that there are three possibilities: pointer + integer = pointer, integer + pointer
= pointer, and integer + integer = integer respectively. The constraints are normally
solved with a standard constraint solver algorithm, but parts of a simple problem such
as that of Figure 5.8 can be solved by eye. For example, the constraint for the load
instruction has only one possibility, t0b = ptr(mem(0 : t2a)) (meaning that r0b points
to a structure in memory with type t2a at offset 0 from where r0b is pointing). This
can be substituted into the constraints for the instruction with the label 3F2:, so the
disjunction there is reduced.
Continuing the constraint resolution, two solutions are found. In most cases, there
will be only one solution. Parasitic solutions are unlikely with larger programs, and
Mycroft suggests ways to avoid them in his paper.
Because constants are typed independently (previous section), and expressions (includ-
ing constants) require types, the constraint system must identify individual constants
in some way. For example, constants could be subscripted much as SSA variables are
renamed, e.g. 1000₃ as has already been seen. The type of each version of a constant
is then a separate type variable.
Type constants (values for type variables which are constant) are common. For exam-
ple, an xor instruction implies integer operands and result; a sqrt instruction implies
floating point operands and result; an itof instruction implies integer operand and
floating point result. A library function call results in type constants for each
parameter and for the return value.
Choosing a value for a type variable (in CSP terminology) is often implicit in the
constraint. For example, the add instruction of Figure 5.8, add r2a,r1b,r1c, results
in the constraints

    t2a = ptr(α3), t1b = int, t1c = ptr(α3) ∨
    t2a = int, t1b = ptr(α4), t1c = ptr(α4) ∨
    t2a = int, t1b = int, t1c = int
    f:    tf = t0 → t99
          mov  r0,r0a           t0 = t0a
          mov  #0,r1a           t1a = int ∨ t1a = ptr(α1)
          cmp  #0,r0a           t0a = int ∨ t0a = ptr(α2)
          beq  L4F2
    L3F2: mov  φ(r0a,r0c),r0b   t0b = t0a, t0b = t0c
          mov  φ(r1a,r1c),r1b   t1b = t1a, t1b = t1c
          ld.w 0[r0b],r2a       t0b = ptr(mem(0 : t2a))
          add  r2a,r1b,r1c      t2a = ptr(α3), t1b = int, t1c = ptr(α3) ∨
                                t2a = int, t1b = ptr(α4), t1c = ptr(α4) ∨
                                t2a = int, t1b = int, t1c = int
          ld.w 4[r0b],r0c       t0b = ptr(mem(4 : t0c))
          cmp  #0,r0c           t0c = int ∨ t0c = ptr(α5)
          bne  L3F2
    L4F2: mov  φ(r1a,r1c),r1d   t1d = t1a, t1d = t1c
          mov  r1d,r0d          t0d = t1d
          ret                   t99 = t0d

Figure 5.8: A simple program fragment typed using constraints. From [Myc99].
Here, the constraints are expressed as a disjunction (or-ing) of conjuncts (some terms
and-ed together; the commas here imply and). Finding values for t2a, t1b, and t1c
happens simultaneously, by choosing one of the three conjuncts. The choice is usually
made by rejecting conjuncts that conflict with other constraints. Hopefully, at the end
of the process, there is one set of constraints that represents the solution to the type
analysis problem.
Some constraint solvers are based on constraint propagation (e.g. the forward check-
ing technique). Others rely more on checking for conflicts (e.g. simple backtracking
or generate and test). Taken together, the above factors (small domain size, constants
and equates being common, and so on) indicate that constraint propagation will quickly
prune branches of the search tree that would lead to failure. Therefore, it would appear
that constraint propagation techniques would suit the problem of type constraint
satisfaction. Constraint solving algorithms are incomplete, i.e. a given algorithm may find
one or more solutions, or prove that the constraints cannot be solved (prove they are
inconsistent), or it may not be able to do either.
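A minimal sketch of constraint propagation over small type domains, using bitmask sets (all names hypothetical; a real solver would handle many more operators):

    #include <stdio.h>

    enum { T_INT = 1, T_FLT = 2, T_PTR = 4 };

    /* For c := a + b, (a,b) may be (int,int), (ptr,int) or (int,ptr);
       prune any domain value with no supporting value in the other domain. */
    static int prune(unsigned *a, unsigned *b) {
        unsigned na = 0, nb = 0;
        if ((*a & T_INT) && (*b & (T_INT | T_PTR))) na |= T_INT;
        if ((*a & T_PTR) && (*b & T_INT))           na |= T_PTR;
        if ((*b & T_INT) && (*a & (T_INT | T_PTR))) nb |= T_INT;
        if ((*b & T_PTR) && (*a & T_INT))           nb |= T_PTR;
        int changed = (na != *a) || (nb != *b);
        *a = na; *b = nb;
        return changed;
    }

    int main(void) {
        unsigned a = T_INT | T_FLT | T_PTR;  /* nothing known about a     */
        unsigned b = T_PTR;                  /* b is known to be a pointer */
        while (prune(&a, &b))                /* propagate to a fixpoint   */
            ;
        printf("a=%u b=%u\n", a, b);         /* prints a=1 b=4: a is int  */
        return 0;
    }

Because the domains are so small, each constraint application is cheap and the fixpoint is reached quickly, which is why propagation prunes failing branches early.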
Mycroft proposed a constraint based type analysis system for decompilation of machine
code programs (in Register Transfer Language form, or RTL) [Myc99]. He generates
constraints for individual instructions, and solves the constraints to type the variables
of the program being decompiled. He assumes the availability and use of double register
addressing mode instructions to signal the use of arrays; for example, Section 4 of his
paper gives a load instruction using such a mode, whose constraints involve
ptr(array(t3)).
However, some machine architectures do not support two-register indexing, and even if
they did, a compiler may for various reasons decide to perform the addition separately
from the load or store instruction. Hence, the above instruction may be emitted as two
instructions, with the constraints of Figure 5.9.

Figure 5.9: Constraints for the two instruction version of the above. Example
from [Myc99].
The last (underlined) conjunct for the first instruction is immediately removed, since r1
is used as a pointer in the second instruction. The final constraints are now in terms
of ptr(t3), rather than ptr(array(t3)). These are equivalent in the C sense, but the
fact that an array is involved is not apparent. In other words, considering individual
instructions by themselves is not enough (in at least some cases) to analyse aggregate
types. Either some auxiliary rule has to be added outside of the constraint system,
or expression propagation can be used in conjunction with a high level type pattern.
(In Section 4.3 of his paper, Mycroft seems to suggest considering pairs of instructions
to work around this problem.) This is another area where expression propagation and
high level patterns prove useful together.
5.6 Addition and Subtraction

Compilers implicitly use pointer-sized addition instructions for structure member access,
leading to an exception to the general rule that adding an integer to a pointer of type
α* yields another pointer of type α*.

Addition and subtraction instructions require special treatment in any type analysis for
machine code, since they can be used on pointers or integers. Mycroft [Myc99] states
the type constraints for the instruction add a,b,c (where c is the destination, i.e.
c := a + b) to be:

    T(a) = int, T(b) = int, T(c) = int ∨
    T(a) = ptr(α), T(b) = int, T(c) = ptr(α) ∨
    T(a) = int, T(b) = ptr(α), T(c) = ptr(α)
where again T(x) represents the type of x (Mycroft uses tx) and ptr(α) represents
a pointer to any type, with the type variable α representing that type. Mycroft uses
the C pointer arithmetic rule, where adding an integer to a variable of type α* will
always result in a pointer of type α*. However, the C definition does not always
apply at the machine code level. Compilers emit add instructions to implement the
addition operator in source programs, but also for two other purposes: array indexing
and structure member access. For array indexing, the C pointer rules apply. The base
of an array of elements of type α is of type α*, the (possibly scaled) index is of type
integer, and the result is a pointer to the indexed element, which is of type α*.
However, structure member access does not follow the C rule. Consider a structure of
type Σ with an element ε of type E at offset K. The address of the structure could be of
type Σ* (a pointer to the structure) or of type E₀* (a pointer to the first element ε₀ of
the structure), depending on how it is used. At the machine code level, Σ and Σ.ε₀ have
the same address and are not distinguishable except by the way that they are used. K is
added to this pointer to yield a pointer to the element ε, which is of type E*, in general
a different type to Σ* or E₀*.
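A hypothetical C fragment makes the exception concrete:

    struct point { int x; int y; };   /* y is at offset 4 */

    int getY(struct point *p) {
        return p->y;                  /* compiled as an add of 4, then a load m[p+4] */
    }

    /* The add takes a struct point* and a constant 4, and yields an int*:
       the C rule "pointer + int = pointer of the same type" does not hold. */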
Unfortunately, until type analysis is complete, it is not known whether any particular
pointer will turn out to be a structure pointer or not. Figure 5.10 gives an example.

    void foo(void* p) {
        m[p] := ftoi(3.00);      // Use p as int*
        ...
        m[p+4] := itof(-5);      // Use p as pointer to struct with float at offset 4
    }

Figure 5.10: A program fragment illustrating how a pointer can initially appear
not to be a structure pointer, but is later used as a structure pointer.
Initially, parameter p is known only to be a pointer. After processing the first statement
of the procedure, p is used with type int*, and so T(p) is set to int*. In the second
statement (in general, any arbitrary time later), it is found that p points to a struct
with an int at offset 0 and a float at offset 4. If the C rule was used that the sum of p
(then of type int*) and 4 results in a pointer of the same type as p, then in the second
statement p is effectively used as both type int* and type float*. This could lead to p
being given a conflicting type. The solution is to treat specially the addition of a
structure pointer and a constant integer; compilers know the offset of structure members
and keep this information in the symbol table for the structure. Hence, if the integer is
not a constant, the types of the result and the input pointer can be constrained to be
the same; if the integer is a constant, the result can only be constrained to be void*.
Here, void* represents a variable known to be a pointer, but the type pointed to is
unknown (output) or is ignored (input). Where there is a conflict between void* and an
α*, α* would be used by preference. Note that this is already suggesting a hierarchy
of types, with void* being preferred to no type at all, but any α* preferred to void*.
Subtraction from or by pointers is not performed implicitly by compilers, so Mycroft-like
constraints following the C rules are adequate for subtraction.
5.7 Data Flow Based Type Analysis

Type analysis for decompilers where the output language is statically type checked can
be performed with a sparse data flow algorithm, enabled by the SSA form.
Types can be thought of as sets of possible values for variables. The smaller the set of
possible values, the more precise and useful the type information. These sets form a
natural hierarchy: the set of all possible values, the set of all possible 32-bit values, the
set of signed or unsigned 32-bit integers, the set of signed 32-bit integers, a subrange
of the signed integers, and so on. The effect of machine instructions is often to restrict
the range of possible values, e.g. from all 32-bit values to 32-bit integers, or from 32-bit
integers to signed 32-bit integers.
This effect suggests that a natural way to reconcile the various restrictions and con-
straints on the types of locations is to use an iterative data flow framework [MR90,
KU76]. The data that are flowing are the type restrictions and constraints, and the re-
sults are the most precise types for the locations in the program given these restrictions
and constraints.
Data flow based type analysis has some advantages over the constraint-based type
analysis outlined in Section 5.5. Since there are no explicit constraints to be solved,
all that remains is the need to distinguish constants that happen to have coinciding
values. Constraints are difficult to solve, sometimes there is more than one solution,
and at other times there is no solution at all. By contrast, the solution of data flow
equations is generally quite simple. Two exceptions to this simplicity are the integer
addition and subtraction instructions; as will be shown, these are more complex than
other instructions, but the complexity is moderate.
Types can be thought of as sets of possible values, e.g. the type integer could be thought
of as the infinite set of all possible integers. In this chapter, the two views will be
used interchangeably. The set of integers is a superset of the set of unsigned integers
(counting numbers). There are more elements in the set of integers than the set of
unsigned integers, or integer ⊃ unsigned integer. The set of integers is therefore in a
meaningful way greater than the set of unsigned integers; hence there is an ordering
of types.
The ordering is not complete, however, because some type pairs are disjoint, i.e. they
cannot be compared sensibly. For example, the floating point numbers, integers, and
pointers are not compatible with each other at the machine code level, and so are
incomparable: none of these types implies more information than the others. To indicate
that the ordering is partial, the square versions of the usual set operator symbols are
used: ⊂, ⊃, ∩, and ∪ become ⊏, ⊐, ⊓, and ⊔ respectively. Hence int ⊐ unsigned int,
or int is a supertype of unsigned int.
These types are incompatible at the machine code level despite the fact that a math-
ematical number such as 1027 is at once an integer, a real number, and potentially the
address of some object in memory. Floating point numbers have a different bit repre-
sentation to integers, and are therefore used differently. Pointer variables should only
be used by pointer-capable instructions, integer variables only by integer instructions
(integer add, shift, bitwise or), and floating point variables should only be used by
floating point instructions (floating point add, square root, etc.).
Two exceptions to this neat segregation of types with classes of instructions are the
integer add and subtract instructions. In most machines, these instructions can be used
on integer variables and also pointer variables. Section 5.6 discussed the implications
of this exception in more detail. It is the fact that objects of different types usually
must be used with different classes of instructions that makes it so important that type
analysis separates them correctly.
Often only partial type information is available, leading to temporary types
such as size32 (a 32-bit quantity, whose basic type is not yet known, yet the size is
known), or the type pointer-or-integer. These are temporary types that should be elimi-
nated before the end of the decompilation. As a contrived example, consider a location
referenced by three instructions: a 32-bit test of the sign bit, an integer add instruc-
tion, and an arithmetic shift right. Before the test instruction, nothing is known about
the location at all. The type system can assign it a special value, called ⊤ (top). ⊤
represents all possible types, or the universal set U, or equivalently, no type information
(since everything is a member of U). After the test instruction, all that is known is
that it is a 32-bit signed quantity, so the type analysis can assign it the temporary type
size32.
The type hierarchy so far can be considered an ordered list, with ⊤ at one end, and int
at the other end. It is desirable for the type of a location to move in one direction (with
one exception, described below), towards the most constrained types (those with fewer
possible values, away from ⊤). It is also possible that later information will conflict
with the currently known type. For example, suppose that a fourth instruction used
the location as a pointer. Logically, this could be represented by another special value,
⊥ (bottom), the type with no possible values. This could occur as a result of a genuine
inconsistency in the original program (e.g. due to a union of types, or a typecast), or
from some limitation of the type analysis system.
In practice, for decompilation, it is preferable never to assign the value ⊥ to the type
of a location. Instead, decompilers would either retain the current type (int) or assign
the new type (the pointer), and conflicting uses of the location would be marked in such
a way that a cast or union would be emitted into the decompiled output. (If a cast or
union is not allowed in the output language, a warning comment may be the best that
can be done. Some languages are more suitable as the output language of a decompiler
than others.)
Continuing the example of the location referenced by three instructions, the list of types
can be represented vertically, with ⊤ at the top and the types size32, pointer-or-int,
and int below it in that order, as in Figure 5.11(a).
Since another instruction (e.g. a square root instruction) in place of the add instruction
would have determined the location to be of type float instead, there is a path from
size32 to float, and no path from int to float, since these are incompatible.

Figure 5.11: (a) A simple lattice: ⊤ (no type information) at the top, then size32,
branching to pointer-or-int and float-or-int, down to ⊥ (type conflict) at the bottom.
(b) An abstract lattice with elements a, b, c, d, and e, illustrating greatest lower
bounds.
So far, when a location is used as two types a and b, the lower of the two types in the
lattice becomes the new type for the location. However, consider if a location had been
used with types pointer-or-int and float-or-int. (The latter could come about through
being assigned a value known not to be valid for a pointer.) The resultant type cannot
be either of these; it should in fact be the new type int, which is the greatest type less
than both pointer-or-int and float-or-int. In other words, the general result of using a
location with types a and b is the greatest lower bound of a and b, also known as the
meet of a and b, written a ⊓ b.
Note the similarity with the set intersection symbol ∩. The result of meeting types a
and b is basically a ∩ b, where a, b, and the result are thought of as sets of possible
values. For example, if the current type is ?signed-int (integer of unknown sign), it could
be thought of as the set {sint, uint}. If this type is met with unsigned-int ({uint}), the
result is {sint, uint} ∩ {uint} = {uint}, i.e. unsigned-int.
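A minimal sketch of the meet operation, treating types as bitmask sets of basic possibilities (hypothetical names; the class pointer exception discussed below is deliberately omitted):

    #include <stdio.h>

    enum { SINT = 1, UINT = 2, FLT = 4, PTR = 8 };
    #define TOP (SINT | UINT | FLT | PTR)  /* no information: all values */
    #define BOT 0u                         /* type conflict: no values   */

    static unsigned meet(unsigned a, unsigned b) { return a & b; }

    int main(void) {
        printf("%u\n", meet(SINT | UINT, UINT)); /* ?signed-int meet unsigned-int = 2 (uint) */
        printf("%u\n", meet(SINT | UINT, FLT));  /* int meet float = 0 (bottom: conflict)    */
        return 0;
    }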
Figure 5.11(b) shows why it is the greatest lower bound that is required. When meeting
types a and c, the result should be d or e, which are lower bounds of a and c. In fact,
it should be d, the greatest lower bound, since there is no justification (considering only
the meet operation of a and c) for selecting the more specialised type e. For example,
if the result is later met with b, the result has to be d, since types b and e are not
comparable in the lattice (meaning that they are not compatible types in high level
languages).
Figure 5.12 shows a more practical type lattice, showing the relationship of the numeric
types.

Figure 5.12: A lattice of numeric and pointer types, with pointer-or-int (π) near the
top, dividing into the pointer types (pointer to class, pointer to other) and the
arithmetic types (long long, long, signed and unsigned int, short, and char; double,
long double, and float).
Earlier it was mentioned that when considering the types of locations in a program, the
types do not always move from top to bottom in the lattice. The exception concerns
pointers and references to class and structure objects.

Figure 5.13: An example class hierarchy (Sender and Receiver, with Transceiver
inheriting from both, and GoldCustomer) and the corresponding lattice of pointers to
those classes (Sender*, Receiver*, GoldCustomer*, Transceiver*), with void* as ⊤.
Consider the example class hierarchy and lattice shown in Figure 5.13. The similarity
between the class hierarchy and the hierarchy of pointers to those classes is evident. A
pointer could be assigned the type Sender* in one part of the program, and the type
Receiver* in another part. If there is a control flow merge from these two parts, the
pointer will have been used as both Sender* and Receiver*. The rules so far would result
in the type Transceiver*, which is a pointer to a type that has multiple inheritance
from classes Sender and Receiver. However, this may be overkill, if the program is
only referring to those methods and/or members which are common to Senders and
Receivers, i.e. methods and members of the common ancestor Communicator. Also,
the original program may contain no class that inherits from more than one class, and
to generate one in the decompiled output would be incorrect.
This example illustrates that sometimes the result of relating two types is higher up the
lattice than the types involved. In these cases, relating types α* and β* results in the
type (α⊔β)*, where ⊔ is the join operator, and α⊔β results in the least upper bound
of α and β.
This behaviour occurs only with pointers and references to class or structure objects.
For pointers to other types of objects, the types of the pointer or reference and the
pointed-to objects are related with the meet operator as usual.
There is an additional difference brought about by class and structure pointers and
references. With an ordinary assignment a := b, the types of a and b are both affected
by each other; the type of both is the type of a met with the type of b. However, with
p := q where p and q are both class or structure pointers or references, p may point to
a larger set of objects than q, and this broadening of the type of p does not affect q.
Hence, after such an assignment, the type of p becomes a pointer to the join of p's base
type and q's base type, but the type of q is not affected. Only assignments of this form
have this exception, and only for class and structure pointers and references.
For the purposes of type analysis, procedure parameters are effectively assigned to by
all corresponding actual argument expressions at every call to the procedure. The
parameter could take on values from any of the actual argument expressions. For
parameters whose types are not class or structure pointers or references, the types of
all the arguments and that of the parameter have to be the same. However, if the type
of the parameter is such a pointer or reference, the type of the parameter has to be the
join of the types of all the arguments.
In the beginning of Section 5.7, it was stated that the decompiler type recovery problem
can be solved as a data flow problem, just as compilers can implement type checking for
statically checked languages that way. Traditional data flow analysis, often implemented
using bit vectors, can be used [ASU86]. However, this involves storing information
about all live variables for each basic block of the program (or even each statement,
depending on the implementation). Assuming that the decompiler will generate code for
a statically typed language, each SSA variable will retain the same type, so that a more
sparse representation is possible. Each use of an SSA location is linked to its statically
unique definition, so the logical place to store the type information for variables is in
the assignment statement associated with that definition. For parameters and other
locations defined outside the procedure, the type can be stored with the procedure's
signature.
This framework is sparse in the sense that type information is located largely only where
needed (one type variable per definition and constant, instead of one per variable and
program point).
The SSA form of data flow based type analysis is not flow sensitive, in the sense that
the computed type is summarised for the whole program. However, since the type for
the location is assumed to be the same throughout the program, this is not a limitation.
Put another way, using the SSA form allows the same precision of result with a less
costly, sparse representation.
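In outline, the storage scheme might look like this (hypothetical structures, not Boomerang's actual ones):

    typedef struct Type   Type;
    typedef struct Assign Assign;

    struct Assign {
        void *lhs, *rhs;       /* the defined location and its value      */
        Type *type;            /* the one type for this SSA definition    */
    };

    typedef struct Use {
        Assign *def;           /* the statically unique definition (SSA)  */
    } Use;

    Type *typeOfUse(Use *u) {
        return u->def->type;   /* no per-block or per-statement tables    */
    }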
Early in the decompilation process, assignments are often in a simple form such as
c := a ⊕ b, where a and b are locations or constants. However, some instructions are
more complex than this, and after propagation, expressions can become arbitrarily
complex. Figure 5.14 shows a simple example. Types have to be found for every part
of an expression such as a[2000] + s.b (except for the destination of an assignment,
which is always a location).
In the example, the only type information known from other statements is that a is
an array of character pointers (i.e. a currently has type char*[]).
Type analysis for the expression starts at the bottom of the expression tree, in a process
which will be called ascend-type. In this example, the algorithm starts with
subexpression a[2000]. The array-of operator is one of only a few operators where the
type of one operand (here a, not 2000), if known, affects the result, and the type of the
result, if known, affects that operand. Since the type of a is known, the type of a[2000]
can be calculated; in this case it is char*. This subexpression type is not stored; it is
calculated on demand. The integer addition operator is a special case, where if one
operand is known to be a pointer, the result is a pointer type, because adding a pointer
and an integer results in a pointer. Hence the type of the overall expression is calculated
to be void*. (Adding an integer to a char* does not always result in another char*,
hence the result has type void*.) Again, this type is not stored. No type information is
gained from s, b, or s.b during this phase.
Next, a second phase begins, which will be called descend-type. Now type information
flows down the expression tree, from the root to the leaves. In order to find the type
that is pushed down the expression tree to the right of the addition operator, ascend-
type is called on its left operand. This will result in the type char*, as before. This
type, in conjunction with the type for the result of the addition, is used to find the type
for the right subexpression of the add. Since pointer plus integer equals pointer, the
type found is integer. The structure membership operator, like the array-of operator,
can transmit type information up or down the expression tree. In this case, it causes
the type for b to be set to integer, and the type for s to be set to a structure with an
integer member.
When the process is repeated for the left subexpression of the add node, the result is
type void*, which implies a type void*[ ] for a. However, when this is met with the more
precise existing type char*[ ], the type of a remains as char*[ ]. The type for the constant
2000 is set to integer.
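The two phases might be sketched as follows (Exp, Type and the helper functions are hypothetical names, and only a few operators are shown):

    typedef struct Type Type;
    typedef struct Exp Exp;
    struct Exp {
        enum { OP_LOC, OP_CONST, OP_ADD, OP_ARRAY, OP_MEMBER } op;
        Exp  *left, *right;
        Type *type;                      /* stored only at locations and constants */
    };

    Type *meetTypes(Type *a, Type *b);   /* greatest lower bound (Section 5.7.1)   */
    Type *elementOf(Type *arrayType);    /* element type of an array type          */
    Type *intType(void), *voidPtrType(void);
    int   isPointer(Type *t);

    /* ascend-type: compute a type bottom-up; nothing is stored */
    Type *ascendType(Exp *e) {
        switch (e->op) {
        case OP_LOC: case OP_CONST: return e->type;    /* may be unknown (NULL)  */
        case OP_ARRAY: return elementOf(ascendType(e->left));
        case OP_ADD:                                   /* ptr + int = some ptr   */
            return isPointer(ascendType(e->left)) ? voidPtrType()
                                                  : ascendType(e->right);
        default: return 0;
        }
    }

    /* descend-type: push a type top-down, refining the leaves with meet */
    void descendType(Exp *e, Type *t) {
        switch (e->op) {
        case OP_LOC: case OP_CONST:
            e->type = meetTypes(e->type, t);           /* store at the leaf      */
            break;
        case OP_ADD:          /* if the left side is a pointer, the right is int */
            descendType(e->right,
                        isPointer(ascendType(e->left)) ? intType() : t);
            descendType(e->left, t);
            break;
        default: break;
        }
    }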
In the example above, the initial type for location a came from type information else-
where in the program, reached through SSA links. Constants have no such links to other
parts of the program, as shown in Section 5.4. As a result, constants are typed only by
the descend-type phase.
In general, type information has to be propagated up the expression tree, then down
again. Table 5.1 shows the type relationships between operand(s) and results for various
operators and constants. Most operators are in the first group, where operand and
result types are fixed. For the other, less common operators and constants, the full
ascend-type and descend-type processing is required.
In the example of Figure 5.14, the type for a[2000] was calculated twice; once during
ascend-type, and again for descend-type. For more complex expressions, descend-type
may call ascend-type on a significant fraction of the expression tree many times. Figure
5.15(a) shows a worst-case example for an expression with four levels of binary oper-
ators. At the leaves, checking the type of the location (if present) could be considered
one operation. This expression tree has 16 leaf nodes, for a total of 16 operations. One
level up the tree, the type of the parent nodes is checked, using information from the
two child nodes, for a total of three operations each. There are eight such parent nodes,
for a total of 24 operations at this level. Similarly, at the top level, 31 operations
(almost 2^(h+1), where h is the height of the expression tree) are performed by
descend-type.
Figure 5.15: Worst-case operation counts for descend-type. (a) A binary tree of
height h=4: 1×31, 2×15, 4×7, 8×3, and 16×1 operations at successive levels. (b) A
ternary tree of height h=3: 1×40, 3×13, 9×4, and 27×1 operations.
Figure 5.15(b) shows a tree with all ternary operators (e.g. the C ?: operator). Such
a tree would never be seen in a real-world example, but it illustrates the worst-case
complexity of the descend-type algorithm. Here, of the order of 3^(h+1) operations are
performed. This potential cost offsets the space savings of storing type information
only sparsely.
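Under the assumption that descend-type re-runs ascend-type over the entire subtree below each node it visits, the counts of Figure 5.15 are the sum of subtree sizes over all nodes of a full b-ary tree of height h:

\[
\sum_{k=0}^{h} b^{k}\,\frac{b^{\,h-k+1}-1}{b-1} \;=\; O\!\left(h\,b^{h}\right)
\]

For b=2, h=4 this gives 1×31 + 2×15 + 4×7 + 8×3 + 16×1 = 129 operations, and for b=3, h=3 it gives 1×40 + 3×13 + 9×4 + 27×1 = 142, matching the figure.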
Section 5.6 showed a modification of Mycroft's constraints that takes into account the
structure pointer exception. To use Mycroft's constraints in a data flow based analysis,
some extra types and operators are required. Let π represent a type that is
pointer-or-integer or higher in the lattice. It is hoped that all occurrences of π will
eventually be replaced with a lower (more precise) type such as integer or a specific
pointer type. The integer is understood to be the same size as a pointer in the original
program's architecture. In the lattice of Figure 5.12, π could have the values
pointer-or-integer, size32, or ⊤. The data flow equations associated with add a, b, c
(i.e. c := a+b), with the structure pointer exception, can be restated as:

    T(c) = Σs(T(a), T(b))                (5.1)
    T(b) = Σa(T(a), T(c))                (5.2)
    T(a) = Σa(T(b), T(c))                (5.3)
where Σa and Σs (a stands for addend or augend, s for sum) are special functions
defined by the following tables. For brevity, ptr(α) is written as α*; the prefixes var-
and const- record whether an item is a variable or a constant.

Σa (the type of one addend, from the type of the other addend and of the sum c):

                         T(c) = β*    int    var-π    const-π
    T(other) = α*  :            int    ⊥     var-int  var-int
               int :            β*    int    var-π    const-π
               π   :            β*    int    π        π

Σs (the type of the sum c, from the types of the addends a and b):

                   T(b) = γ*   var-int  const-int  var-π  const-π
    T(a) = β*  :          ⊥    β*       void*      β*     void*
           int :          γ*   int      int        π      π
           π   :          γ*   π        π          π      π

The void* entries reflect the structure pointer exception: a constant added to a pointer
may be a structure member offset, so only void* can be claimed for the sum.
As an example, consider p := q+r, with q known to have type char*, and the type of r
wanted. Since the type of p is not known yet, the type of p remains at its initial
value of var-π. p, q, and r are substituted for c, a, and b respectively in equation 5.2.
This equation uses function Σa, defined above. Since T(c) = var-π,
the third column is used, and since T(other) = α* with α = char, the first row is used.
The intersection of the third column and first row contains var-int, so T(b) = T(r) =
int.
Similarly, the data flow equations associated with sub a, b, c (i.e. c := a−b) can be
restated as:

    T(c) = ∆d(T(a), T(b))                (5.4)
    T(a) = ∆m(T(b), T(c))                (5.5)
    T(b) = ∆s(T(a), T(c))                (5.6)

where ∆m, ∆s and ∆d (m stands for minuend (the item being subtracted from), s for
subtrahend (the item being subtracted), and d for difference (the result)) are special
functions defined as follows:

∆m (the type of a):                    ∆s (the type of b):

              T(c) = β*   int   π                  T(c) = β*   int   π
    T(b) = α* :      ⊥    α*    α*       T(a) = α* :      int  α*    π
           int :     β*   int   π               int :     ⊥    int   int
           π   :     β*   int   π               π   :     int  π     π

∆d (the type of c):

              T(a) = α*   int   π
    T(b) = β* :      int  ⊥     int
           int :     α*   int   π
           π   :     π    int   π

Compilers do not implicitly use subtraction for structure member access, so this time
there is no need to distinguish between var-int and const-int, or var-π and const-π.
5.8 Type Patterns

A small set of high level patterns can be used to represent global variables, local
variables, array elements, and structure members.

A sequence of machine instructions for accessing a global, local, or heap variable, array
element, or structure element will in general result in a memory expression which can
be put in the canonical form

    m[ sp0 + pl + Σ(j=1..n) Sj·iej + K ]                (5.7)

in which any of the components may be absent, and
where
• sp0 represents the value of the stack pointer register at the start of the procedure.
• pl is a nonconstant location used as a pointer (although nonconstant as an
expression, it may turn out to hold a constant value).
• The iej are nonconstant integer expressions that do not include the stack pointer
register, and are known not to be of the form x+C, where x is a location and C is a
constant. Constants could appear elsewhere in an iej, e.g. it could be 4*r1*r2.
• K is an integer constant.
Memory expressions are simplified and canonicalised, e.g. from m[(r1+1000)*4]
to m[r1*4 + 4000], before attempting to match to the above equation. Expression
propagation makes the complete memory expression available for this matching.
The sum of the terms inside the m[...] must be a pointer expression by definition.
Pointers cannot be added, adding two or more integers does not result in a pointer, and
the result of adding a pointer and an integer is a pointer. It follows that exactly one of
the terms is a pointer, and the rest must be integers. Since sp0 is always a pointer, sp0
and pl cannot appear together, and if sp0 or pl are present, K cannot be a pointer.
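Hypothetical C declarations illustrate the main instantiations of Equation 5.7:

    int g;                           /* g:     m[K], K the address of g          */
    int a[100];                      /* a[i]:  m[4*ie + K], ie = i, S = 4        */
    struct { int x, y; } s;          /* s.y:   m[K + 4], offsets folded into K   */

    int f(int i, int *pl) {
        int local = i;               /* local: m[sp0 + K'] for some offset K'    */
        return g + a[i] + s.y
             + pl[i];                /* pl[i]: m[pl + 4*ie], pointer pl present  */
    }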
It could be argued that since pl or K could be negative, all three of sp0, pl, and K
could be present, with two pointers being subtracted from each other, resulting in a
constant. However, such combinations would require the negation of a pointer, which
has no sensible high level equivalent, and so can be ignored.
Initially, it may not be possible to distinguish pl from an iej with Sj=1, so temporary
expressions such as m[l1 + K] or m[l1 + l2 + K] may be needed, until it becomes clear
which of l1, l2 and K is the pointer.
When present, sp0 indicates that the memory expression represents a stack allocated
object: a local variable, parameter, array, or structure.
Table 5.2 shows a range of patterns that may be found in the intermediate representation
after propagation and dead code elimination, and the propositions which discuss them
over the next several pages. It is evident that there are a lot of possible patterns, and
distinguishing among them is not a trivial task. Most of this distinguishing is left as
future work.
Structure members are selected with constant offsets, so variable indexing into
structures is not possible. Arrays can therefore be indexed with a constant or variable
index expression; index expressions are of integer type. Structure elements can only be
accessed at constant offsets.
The only place where a non-constant integer expression can appear is therefore as an
array index. Hence, when present, the iej indicate array indexing, and the overall
memory expression references an array element. For an array element access with m
dimensions, n of which are non-constant, there will be at least n such terms. (One or
more index expressions could be of the form j+k where j and k are locations, hence n
is a minimum.)
For single dimension arrays whose elements occupy more than one byte, there will be
a scale factor Sj equal to the element size. Multidimensional arrays are in effect
arrays of subarrays, so the higher order index expressions are scaled by the size of the
subarray. The sizes of arrays and subarrays are constant, so the Sj will be constant for
any particular array. Variations in Sj between two array accesses indicate either that
different arrays are being accessed, or that what appears to be scaling is in at least one
of the accesses an ordinary multiplication.
Proposition 5.9: (K present) When present, the constant K of Equation 5.7 could be composed of one or more of the following:
(a) The address of a global array, global structure, or global variable (sp0 and pl
not present). This component of K may be bounded: the front end of the decom-
piler may be able to provide limits on addresses that fall within the read-only or
read/write data sections.
(b) The (possibly zero) offset from the initial stack pointer value to a local variable or aggregate (sp0 present).
(d) Constants arising from the constant term(s) in array index expressions. For
example, if in a 10×10 array of 4-byte integers, the index expressions are a*b+4
and c*d+8, the expression for the offset to the array element is (a*b+4)*40 +
(c*d+8)*4, which will canonicalise to 40*a*b + 4*c*d + 192. In the absence of
other constants, K will then be 192, which comes in part from the constants 4 and
8 in the index expressions, as well as the size of the elements (S2=4) and the size
of the subarrays (S1=40).
(e) Offsets arising from the lower array bound not being zero (for example, a: array
[-20 .. -11] of real). Where more than one dimension of a multidimensional
array has a lower array bound that is non zero, several such offsets will be lumped
together. In the C language, arrays always start at index zero, but it is possible
to construct pointers into the middle of arrays or outside the extent of arrays, to
achieve a similar effect, as shown in Figure 5.16. Note that in Figure 5.16(b), it is
possible to pass a0+100 to another procedure, which accesses the array a0 using
appropriately offset indexes.
(f) Where nested structures exist, several structure member offsets could be lumped
together to produce the offset from the start of the provided object to the start of
the structure member involved. For example, in s.t.u[i], K would include the
offsets from the start of s to the start of u, or equivalently the sum of the offsets
from the start of s to the start of t and the start of t to the start of u.
Figure 5.16: Source code for accessing the first element of an array with a
nonzero lower index bound.
Many combinations are possible; e.g. options (d) and (f) could be combined if an array
is accessed inside a structure with at least one constant index, e.g. s.a[5] or s.b[5, y]
where s is a structure, a is an array, and b is a two-dimensional array.
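To make option 5.9(d) concrete, here is a small C++ fragment matching the 10×10 example above (names are arbitrary; a 4-byte int is assumed):

int arr[10][10];   // 10x10 array of 4-byte integers

int f(int a, int b, int c, int d) {
    // Offset of the element: (a*b+4)*40 + (c*d+8)*4
    //                      = 40*a*b + 160 + 4*c*d + 32
    //                      = 40*a*b + 4*c*d + 192, so K includes 192.
    return arr[a*b + 4][c*d + 8];
}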
The iej terms represent the variable part of the index expressions, with the constant
part of index expressions split off as part of K, as shown above at option 5.9(d). To save
space, statements such as "ie represents the variable part of the index expression" will
be shortened to "ie represents the index expression", with the understanding that the
constant parts of the index expressions are actually lumped in with other terms into K.
It is apparent that many patterns could be encountered in the IR of a program to be
typed, and these make different assertions about the types of the various subexpressions
involved. The following propositions summarise the patterns that may be encountered.
The pattern m[K] (only K present) represents a global variable, global structure member, or
a global array element with constant index(es), possibly inside a global structure.
K is the sum of options 5.9(a) and 5.9(c)-(f). As noted in Section 3.4.4 on page 83,
for architectures where global variables are accessed as offsets from a
register reserved for this purpose, it is assumed that the register is initialised with a constant value by
the decompiler front end, to ensure that all global variable accesses (following constant
propagation) are of this form. Since this pattern can represent either a global variable,
structure element, or an array element with fixed offset(s), elementary types can be
promoted to structure or array elements. To fit this into the lattice of types concept, an
element of an array whose elements have type α is written ξ(array(α)), and ξ(array(α)) v α
for a variable of type α. This relation could be read as: the type `element of an array of int' is a subtype
of or equal to type `variable of int' (the former, occurring less frequently, is in a sense
more constrained than the latter). The same applies for structure elements; an element
of a structure σ whose type is β is written ξ(σ)(β), and where the structure type is
unknown, ξ(struct)(β). These relations are summarised in the following proposition:
ξ(array(α)) v α, and ξ(σ)(β) v β.
The above proposition will be improved below. It should be noted that the types α
and ξ(array(α)) are strictly speaking the same types, so the v relation is really =
(equality), but expressing the various forms of α in this way allows these forms to
become part of the lattice of types. When various triggers are found (e.g. K is used as a
pointer to an array elsewhere in the program), the type associated with a location can
be moved down the lattice (e.g. from α to ξ(array(α))). This makes the handling of
array and structure members more uniform with the overall process of type analysis.
The pattern m[sp0 + K] represents a local variable, local structure
member, or local array element with constant index(es), possibly inside a local structure.
K is the sum of options 5.9(b)-(f), and sp0 marks this as a stack based access.
As more information becomes available, this expression can be refined to one of the following patterns. Since l+K is used as a
pointer, and adding two terms implies that one of the two terms is an integer and the
other a pointer, then if the value of K is such that it cannot be a pointer, K must be
an integer and l a pointer. As pointed out in Section 5.4, constants are independent,
so other uses of the same constant elsewhere do not affect whether l or K is a pointer.
Hence, the only factors affecting whether l or K is the pointer in m[l + K] are whether
l is used elsewhere as a pointer or integer, and whether K has a value that precludes it
from being a pointer.
In the pattern m[l + ie + K], K represents the sum of options 5.9(a)
and 5.9(c)-(f), and ie is the possibly scaled index expression. There could be
more than one such index term.
Here pl is a pointer to the array or structure (global, local, or heap allocated), and
K represents the sum of options 5.9(c)-(f). For example, m[pl + K] could represent
s.m, where s is a variable of type structure (represented by pl), and m is a member
of s (K represents the offset from the start of s to m). It could also represent a[C],
where a is an array pointed to by pl and C is a constant index.
(a) Structure:
struct {
    COLOUR c1;
    COLOUR c2;
    COLOUR c3;
} colours;
colours.c1 = Red;
colours.c2 = Green;
colours.c3 = Blue;
process(colours.c2);

(b) Array:
COLOUR colours[3];
colours[0] = Red;
colours[1] = Green;
colours[2] = Blue;
process(colours[1]);
Figure 5.17: Equivalent programs which use the representation m[pl + K].
The two representations for m[pl + K] are equivalent, as shown in Figure 5.17. Hence,
this pattern could be considered a structure reference unless and until an array
reference with a nonconstant index is found. In either case, since an array reference
with a nonconstant index could be found elsewhere at any time, the array element is in
a sense more constrained than the structure element. Again abusing terminology, this
can be expressed as the following proposition:
Proposition 5.16:
ξ(structure-containing-ξ(array(α))) v ξ(array(α)) v ξ(structure-containing-α) v α.
The above example illustrates the point that the lattice of types for decompilation is
based at least in part not on the source language or how programs are written, but on
how data is accessed at the machine code level.
[Lattice fragment: α, above ξ(structure-containing-α), above ξ(array(α)), above ξ(structure-containing-ξ(array(α))).]
Figure 5.18: A type lattice fragment relating structures containing array elements, array elements, structure members, and plain variables.
There is no pattern m[sp0 + pl + K] since sp0 and pl are both pointers, and pointers are
assumed never to be added.
The pattern m[l1 + l2 + K] where l1 and l2 are locations is interesting because both
locations could match either ie or pl. Until other uses of l1 or l2 are found that
provide type information about the locations, it is not known which of the locations or
K is a pointer. The expression can be refined when, for example,
l2 is used elsewhere as a pointer, or K has a value that cannot be a pointer and either
l1 or l2 is used elsewhere as an integer, leading to the
following pattern:
If both l1 and l2 are used elsewhere as integers, then Proposition 5.18 can be refined
further. The non-constant index expression is ie1 + ie2, and K represents the sum of options 5.9(a) and 5.9(c)-(f).
Another special case of the pattern in Proposition 5.18 is when K=0. One of l1 and
l2 must be an integer, and the other a pointer. From Proposition 5.8, this implies that
the integer expression represents array indexing. If one of l1 or l2 is used elsewhere
as a pointer or as an integer, the roles of both locations become known.
This pattern represents a (possibly multidimensional) global array element access. Here the iej are the index expressions, S1..Sn are
scaling constants, and K lumps together options 5.9(a) and 5.9(c)-(f). If there are
index expressions with more than one location, then m could be less than n. If there
are constant index expressions, then m could exceed n. These two factors could cancel
out.
Here pl points to the array or structure containing the array, the iej are the index
expressions, S1..Sn are scaling constants, and K lumps together options 5.9(a) and
5.9(c)-(f).
Here the li are the index expressions, S1..Sn are scaling constants, and K lumps together
options 5.9(b)-(f). Examples in the C language include (*p)[i] and *(ptrs[j]), where the above
expressions have been applied to part of the overall expression to yield the array indexes.
Decompilers need a data structure comparable to the compiler's symbol table (which
maps symbols to addresses and types) to map addresses to symbols and types.
Section 5.2.3 on page 155 noted that aggregate types are usually manipulated with a
series of machine code instructions, rather than individual instructions. For example, to
sum the contents of an array, a loop is usually employed. If a structure has five elements
of elementary types, there will usually be at least five separate pieces of code to access
all the elements. In other words, aspects of aggregate types such as the number of
elements and the total size emerge as the result of many instructions, not individual
instructions. This contrasts with the type and size of elementary types, which are
usually evident from individual instructions.
Hence a data structure is required to build up the picture of how the data section is
composed. This process could be thought of as partitioning the data section into the
various variables of various types. This data structure is in a sense the equivalent of the
symbol table in the compiler or assembler which allocated addresses to data originally.
In a compiler or assembler, the symbol table is essentially a map from a symbolic name
to a type and a data address. In a decompiler, the appropriate data structure, which
could be called a data map, is a map from data address to a symbolic name and type.
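As a concrete illustration, a data map might be sketched as follows in C++ (the type names and fields here are hypothetical, not those of any particular decompiler):

#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

using Address = std::uint64_t;

struct DataMapEntry {
    std::string name;   // synthetic or user supplied symbolic name
    std::string type;   // e.g. "int32" or "array(float64, 10)"
    std::size_t size;   // bytes occupied, once known
};

// One map per address space (global, stack local, heap object).
using DataMap = std::map<Address, DataMapEntry>;

// Find the entry covering addr, if any: the entry with the greatest
// start address not exceeding addr, provided addr falls inside it.
const DataMapEntry* lookup(const DataMap& dm, Address addr) {
    auto it = dm.upper_bound(addr);
    if (it == dm.begin()) return nullptr;
    --it;
    return addr < it->first + it->second.size ? &it->second : nullptr;
}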
At least two address spaces need to be considered: the global address space, containing
global variables, and the stack local address space, containing local variables. There
is also the heap address space, where variables, usually aggregates, are created with a
language keyword such as new or a call to a heap allocating library function such as
malloc. However, allocations for heap objects are usually for one object at a time, and
the addresses by design do not overlap. A separate data map would be used for each
address space.
As a space saving optimisation, compilers may allocate more than one variable to the
same address. Once this has been determined to be safe, such colocation has little cost
for a compiler; it merely has some entries in the symbol table that have the same or
overlapping values for the data addresses. For a decompiler, however, the issue is more
complex. The data address does not uniquely identify a data map entry, as shown in
Figure 5.19.
[Figure 5.19: Definitions (D), uses (U), and φ-functions (o) for a variable at one address, in four cases (a)-(d). In (a) and (b), all definitions and uses are connected through φ-functions; in (c) and (d), there are breaks in the data flow, giving more than one live range.]
In Figure 5.19(a) and (b), although there are multiple definitions for the variable, the
definitions and uses of the variable are united by φ-functions. In cases (c) and (d), there
is more than one live range for the variable, made obvious by the breaks in data flow
from one live range to the next. Where there is more than one live range for a variable
and the types of the variable are different in the live ranges, two or more variables must
be emitted, as the compiler has clearly colocated unrelated variables, and each named
variable can then be given its own type.
Although more than one name must be generated, there is still the option of uniting the
addresses of those names (e.g. with a C union), or separating them out as independent
variables, which the compiler of the generated code could assign to different addresses.
In cases where the types for the various live ranges agree, however, it is not possible to
decide in general whether the compiler has colocated unrelated variables that happen
to have the same type, or if the original source code used one variable in several live
ranges. An example of where the latter might happen is where an array index is re-used
in a second loop in the program. Always separating the live ranges into multiple
variables could clutter the generated source code with excess variable declarations.
Never separating the live variables into separate variables could result in confusion
when the variable is given a meaningful name. The name that is meaningful for one
live range could be incorrect for other live ranges. This is a case where an expert user
may profitably override the default decompiler behaviour.
Programs can also take the address of a variable or aggregate element. At the high level, this may be explicit, as with the &
unary operator in C, or implicit, as in Java when an object is referenced (references
are pointers at the bytecode level). This taking of the address may be far away from the
eventual use of that variable or aggregate element. The type patterns for the address
of a variable or aggregate element are as per Section 5.8, but with the outer m[...]
operation removed. In some cases, these will be very common expressions such as K or
ie + K. Clearly, not all such patterns represent the address of a global variable, or the
address of a global aggregate element.
These patterns are therefore employed after type analysis, and are only used when type
analysis shows that the pattern (excluding the m[...]) is of pointer type. Obviously, the
patterns of Section 5.8 with the m[...] guarantee that the inner expression is used as a
pointer.
When the address of a variable is taken, the variable is referenced, but it is not directly
defined or used. Data flow information ultimately comes from definitions or uses, but
when the reference is passed to a library function, the usage information is usually confined
to what the prototype implies. A constant
reference implies that the location being referenced is used but not defined. A non
constant reference could imply various definition and use scenarios, so the conservative
summary is may define and may use. The purpose of the library function may carry
more data flow information than its prototype. For example, the this parameter of
CString::CString() (the constructor procedure for a string class) does not use the
referenced location (*this) before defining it. This extra information could help separate
the live ranges of variables whose storage is colocated with other objects. While
it may be tempting to use types to help separate colocated variables, the possibility
that the original program used casts makes types less reliable for this purpose. In the
case of the CString constructor, a new live range is always being started, but this fact
is not evident from the prototype alone.
int i;
void (*p)(void);
char c, *edi;

mov  $5,-16(%esp)      ; Define                           i = 5;
print -16(%esp)        ; Use as int                       print(i);
lea  -16(%esp),%edi    ; Take the address, -> edi         edi = &c;
call proc1                                                proc1();
...                                                       ...
mov  $proc2,-16(%esp)  ; Define                           p = proc2;
call -16(%esp)         ; Use as proc*                     (*p)();
...                                                       ...
mov  $'a',-16(%esp)    ; Define                           c = 'a';
putchar -16(%esp)      ; Use as char                      putchar(c);
...                                                       ...
process((%edi))        ; Use saved address (as char)      process(*edi);

(a) Machine code (left) and (b) decompiled output (right).
Figure 5.20: A program with colocated variables and taking the address.
Taking the address of a variable that has other variables colocated can cause an ambiguity
as to which object's address is being taken. For example, consider the machine
code program of Figure 5.20(a). The location m[esp-16] has three separate variables
sharing the location. Suppose that it is known that proc1 and proc2 preserve and do
not use edi. The definition of m[esp-16] with the address of proc2 completely kills
the live range of the integer variable, yet the address taken early in the procedure turns
out to be used later in the program, at which point only the char definition is live.
Definitions of m[esp-16] do not kill the reach of the separate variable edi, which
continues to hold the value esp-16. It is the interpretation of what esp-16 represents
that changes with definitions of m[esp-16], from &i to &proc2 to the address of a
character. Type information about edi is linked to the data flow information about
m[esp-16].
In this case, if there were no other uses of edi and no other instructions took the
address of m[esp-16], the three variables could be separated in the decompiled output,
as shown in Figure 5.20(b). However, if the address of m[esp-16] escaped the procedure
(e.g. edi is passed to proc1), this separation would not in general be safe. For example,
it may not be known whether the procedure the address escapes to (here proc1) defines
the original variable or not. If it only used the original variable, then it would be used as
an integer, and the reference is to i in Figure 5.20(b). However, it could define and use
the location with any type. Finally, it could copy the address to a global variable used
by some procedure called before the variable goes out of scope. In this case, the type
passed to proc1 depends on which of the colocated variables is live when the location
is actually used. (The compiler would have to be very smart to arrange this safely, but
it is possible.) The safe option is then to emit
the three variables as a union, just as the machine code in effect does. Such escape
analysis is therefore needed before colocated variables can safely be separated.
If proc1 in the example of Figure 5.20 was a call to a library function to which
m[esp-16] was a non-constant reference parameter, the same ambiguity arises. The
type of the parameter will be a big clue, but because of the possibility of casts, using
type information may lead to an incorrect decompilation. Hence, while library functions
are an excellent source of type information, they are not good sources of the data flow
information that can help separate colocated variables. This could be a case where it
is appropriate for an expert user to guide the decompiler.
Colocated variables are a situation where one original program address represents more
than one original variable. Figure 1.6 on page 20 showed a program where three arrays
are accessed at the machine code level using the same immediate values (one original
pointer and two offset pointers). This is a different situation, where although the three
variables are located at different addresses, the same immediate constant is used to refer
to these with indexes of different ranges. It illustrates the problem of how the constant
K of Equation 5.7 could have many components, in particular that of Proposition 5.9(e).
Figure 5.21 shows the declaration of three nested structures. The address of the first element
a is also the address of all three structures, illustrating the same kind of ambiguity.
struct {
struct {
struct {
int a;
...
} small;
...
} medium;
...
} large;
Figure 5.21: Nested structures.
A few special types are needed to cater for certain machine language details, e.g. upper(float64).
In addition to the elementary types, aggregates, and pointers, a few special types are
useful in decompilation. There is an obvious need for no type information (> in a type
lattice, or the C type void), and possibly overconstraint (⊥ in a type lattice). Many
machines implement double word types with pairs of locations (two registers, or two
word sized memory locations). It can therefore be useful to define upper(τ) and lower(τ),
where τ is a type variable, to refer to the upper or lower half respectively of τ. For
larger types, these could be combined, e.g. lower(upper(float128)) for bits 64 through
95 of a 128-bit floating point type. For example, size32 would be compatible with
upper(float64). Pairs of upper and lower types can be coalesced into the larger type
in appropriate circumstances, e.g. where a double word value is passed to a function
at the machine language level in two word sized locations, this can be replaced by one
double word argument.
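For illustration, the following C++ sketch shows the relationship between a 64-bit double and its two 32-bit halves, the values that upper(float64) and lower(float64) describe (a little-endian word order is assumed here):

#include <cstdint>
#include <cstring>

struct Halves { std::uint32_t lower, upper; };

// What the machine code does: pass a double as two 32-bit words.
Halves split(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return { static_cast<std::uint32_t>(bits),           // lower(float64)
             static_cast<std::uint32_t>(bits >> 32) };   // upper(float64)
}

// What the decompiler's coalescing step must undo.
double join(Halves h) {
    std::uint64_t bits =
        (static_cast<std::uint64_t>(h.upper) << 32) | h.lower;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}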
Most related work is oriented towards compilers, and hence does not address some of the problems specific to machine code programs.
Using iterative data flow equations to solve compiler problems has been discussed by
many authors, starting with Allen and Cocke [All70, AC72] and also Kildall [Kil73].
The theoretical properties of these systems were proved by Kam and Ullman [KU76].
Khedker, Dhamdhere and Mycroft argued for a more complex data flow analysis framework,
but they attempted to solve a more difficult type inferencing problem.
Guilfanov [Gui01] discussed the problem of propagating types from library functions
through the IR for an executable program. However, he did not attempt to infer types
intrinsic in instructions.
Data flow problems can be solved in an even more sparse manner than that enabled
by the SSA form, by constructing a unique evaluation graph for each variable [CCF91].
However, this approach suffers from the space and time cost of generating the evaluation
graph for each variable, and some other tables required by this framework.
Guo et al. reported on a pointer analysis for assembly language, which they suggested
could be extended for use at run-time [GBT+05]. In their treatment of addition instructions,
they assumed that the address expressions for array and structure element
accesses are a simple sum of a pointer, a scaled index i*l, and a constant, where
l is the size of the array elements, and c is a constant. This contrasts with the more
complex expression of Equation 5.7. The latter is more complex mainly to express
a[x*10+y], where 10 is the number of elements in the row. It seems possible that machine
code arrays could be analysed like this, and converted back to a[x][y] or a[x, y]
in suitable languages.
While good progress has been made, much work remains before type analysis for machine code programs can be considered a solved problem.
Most of the ideas presented in this chapter have been at least partially implemented in
the Boomerang decompiler [Boo02]. The basic principles such as the iterative data flow
based solution to the type analysis problem work well enough to type simple programs.
However, experimental validation is required for several of the more advanced aspects,
including:
• the more unusual cases involving pointer-sized add and subtract instructions;
• splitting the value of K into its various possible origins (Proposition 5.9 on
page 181), which will probably require range analysis for pointers and index variables;
and
• separating colocated variables and escape analysis (Section 5.9.1 on page 187).
Range analysis of pointers is also important for initialised aggregate data, as noted in
Section 5.2.4.
Object oriented languages such as C++ introduce more elements to be considered, such
as member pointers, class hierarchies, and so on. Some of these features are discussed
later in this thesis.
Expression propagation, enabled by the SSA form, combines with simplification to prepare
memory expressions for high level pattern analysis, and the SSA form allows a
sparse representation of type information.
The high level patterns of Section 5.8 on page 178 require the distinction of integer
constants from other integer expressions. This is readily achieved by the combination
of expression propagation and simplification that are enabled by the Static Single Assignment
form. These also eliminate partial constants generated by RISC compilers,
which would have been awkward to deal with in any type analysis system.
The SSA form also allows a sparse representation of type information at the definitions
of locations. One type storage per SSA definition contrasts with the requirement of one
type storage per live variable per basic block, as would be required by a traditional bit
vector based data flow analysis.
While indirect jumps and calls have long been the most problematic of instructions for
reverse engineering of executable files, their analysis, facilitated by SSA, yields high level
constructs such as switch statements and virtual function calls.
When decoding an input executable program, indirect jump and call instructions are
the most problematic. If not for these instructions, a recursive traversal of the program
from all known entry points would in most cases visit every instruction of the program,
thereby separating code from data [VWK+03]. This assumes that all branch instruction
targets are valid, there is no self modifying code, and the program is well behaved in
the sense of call and return instructions doing what they are designed to do.
Indirect jumps and calls are decoded after loading the program file, and before the data
flow based analyses, as shown in Figure 6.1.
[Figure 6.1: Decompiler component graph. The input binary file passes through the Loader and Decoder (front end) into the Intermediate Representation (IR); Control Flow Analysis, Data Flow Analysis, Type Analysis, and Code Generation (structuring) follow, with the back end emitting the output source file. If an indirect jump or call is resolved, control returns from Data Flow Analysis to the Decoder.]
It is interesting to note that it is indirect jump and call instructions that are most
problematic when reconstructing the control flow graph, and it is indirect memory
operations (e.g. m[m[x]]) which cause the most problems (in the form of aliases) in
data flow analysis [CCL+96].
The following sections describe various analyses, most facilitated by the static single
assignment form (SSA form), which convert indirect jumps and calls to the appropriate
high level constructs.
Special processing is needed since the most powerful indirect jump and call analyses rely
on expression propagation, which in turn relies on a complete control flow graph (CFG),
but the CFG is not complete until the indirect transfers are analysed.
The analysis of both indirect jumps and indirect calls shares a common problem. It
is necessary to find possible targets and other relevant information for the location(s)
involved in the jump or call, and the more powerful techniques such as expression
propagation, and value and/or range analysis, have the best chance of computing this
information. These techniques rely heavily on data flow analyses, which in turn rely
on having a complete CFG. Until the analysis of indirect jumps and calls is completed,
however, the CFG is not complete. This type of problem is often termed a phase
ordering problem. While initially it would appear that this chicken and egg problem
cannot be resolved, consider that each indirect jump or call can only depend on locations
defined before the jump or call is taken, and consequently only instructions from the start of
the procedure up to the indirect jump or call are needed for its analysis.
Figure 6.2 illustrates the problem. It shows a simple program containing a switch
statement, with one print statement in each arm of the switch statement (including the
default arm, which is executed if none of the switch cases is selected). Part (b) of the
figure shows the control flow graph. Before the n-way branch is analysed, the greyed
basic blocks are not part of the graph (they are code that is not yet discovered). As a
result, the data flow logic is able to deduce that the definition of the print argument
in block 8 (the string "Other!") can be propagated into the print statement in block
9. In fact, basic blocks 8 and 9 are not yet separate at this stage. However, one of
the rules for safe propagation is that there are no other definitions of the components
of the right hand side of the assignment to be propagated which reach the destination
(Section 3.1 on page 65). Once the n-way branch is analysed, however, it is obvious
that this rule is violated, and the propagation was unsafe.
One way to correct this problem would be to force conservative behaviour (in this case,
disallowing the propagation).
[Figure 6.2(b): Control flow graph for the switch program. Block 0 tests eax >u 5 (true branch to the default case); block 1 is the n-way branch; the shaded blocks, including block 8 (fall through), block 9 (call to printf), and block 10 (return), are not discovered until the n-way (switch) block is analysed.]
It should be noted, however, that restricting the propagation too much will result in the failure of the n-way
branch analysis.
Another way would be to store undo information for propagations, so that propagations
found to be invalid after an indirect branch could be reversed. At minimum, the original
expression before propagation would need to be stored, possibly in the form of the
original SSA reference to its definition. Dead code elimination is not run until much
later, so the definitions needed to undo propagations remain available.
The simplest method, costly in space and time, is to make a copy of the procedure's IR
just after decoding, and to restart the data flow analysis for the whole procedure after every
indirect jump is analysed (of course, making use of the new jump targets). Where nested switch
statements occur, several iterations of this may be needed; in practice, more than one
or two such iterations would rarely be required. This is the approach taken by the Boomerang
decompiler [Boo02]. It causes a loop in the top level of the decompiler component graph
of Figure 6.1.
Simpler techniques not requiring propagation could be used for the easier cases, to avoid the cost of restarting the analysis.
Indirect calls that are not yet resolved have to assume the worst case preservation
information. It is best to recalculate the data flow information for these calls after
finding a new target, so that less pessimistic propagation information can be used.
Whether the analysis of the whole procedure needs to be restarted after resolving only indirect
calls, as opposed to indirect jumps, is a matter for experimentation.
Indirect jump instructions are used in executable programs to implement switch (case)
statements, unless the number of cases is very small, or the case values are sparse.
They can also be used to implement assigned goto statements, and possibly exception
handling [SBB+00]. Finally, any call at the end of a procedure can be tail-call optimised
to a jump. It is assumed that jumps whose targets are the beginning of functions have
already been converted to tail calls, a conversion that is also aided by expression propagation.
The details of the implementations of switch statements are surprisingly varied. Most
implementations store direct code pointers in a jump table associated with the indirect
jump instruction. Some store offsets in the table, usually relative to the start of the
table, rather than the target addresses themselves. This is presumably done to minimise
the number of entries in the relocation tables in the object files. Earlier work found
switch targets by slicing backwards
from the indirect jump until one of a few normal forms is found [CVE01].
Table 6.1 shows the high level expressions for various forms of switch statement for
a 32-bit architecture, with a table entry size of 4 bytes. %pc represents the program counter
(Boomerang is not very precise about what point in the program %pc represents, but it
does not need to be).
These expressions are very similar to those from Figure 7 of [CVE01], which builds up
the expression for the target of the indirect branch by slicing backwards through the
program at decode time. This approach was used initially in the Boomerang decompiler;
however, the variations seen in real programs caused the code to become very
complex and difficult to maintain. Since a decompiler needs to perform constant and
other propagation, it seems natural to use this powerful technique instead of slicing, despite
the necessity of restarting analyses. By delaying the analysis of indirect branches
until after the IR is in SSA form and expression propagation has been performed, the
expression for the destination of the branch appears in the IR for the indirect branch with
all the components needed for matching.
This delayed approach has the advantage that the analysis can bypass calls if necessary,
which is not practical when decoding. There is no need to follow control flow edges, and
the analysis automatically spans multiple basic blocks if necessary. Figure 6.3 shows
the IR for the program of Figure 6.2; the result is still a slice in
a sense, but it is simpler because the branch expression has been propagated into the
condition of the branch instruction at the end of the 2-way basic block.
1 m[esp0 - 4] := ebp0
3 ebp3 := esp0 - 4
4 eax4 := m[esp0 + 4]0
6 eax6 := eax4 - 2
10 BRANCH to 0x804897c if (eax4 - 2) >u 5
11 CASE [m[((eax4 - 2) * 4) + 0x8048934]]
Figure 6.3: IR for the program of Figure 6.2.
In this case, the branch expression is (eax4 - 2) >u 5, where (eax4 - 2) matches expr. The number
of cases is readily established as six (branch to the default case if expr is greater than
5).
Where there is a minimum case value, expression propagation, facilitated by the SSA
form, enables a very simple way to improve the readability of the generated switch
statement.
In the program of Figure 6.2, the case values start at 2, so the compiled code has to
subtract 2 from the switch variable before comparing the result against the number of
case values, including any gaps. Hence the comparison in block 0 is against 5, not 7.
Without checking the case expression, this would result in the program of Figure 6.4.
Figure 6.4: Output for the program of Figure 6.2 when the switch expression is not
checked for subtract-like expressions.
This program, while correct, is less readable than the original. There is a simple check
that can be performed to increase readability: if the switch expression is of the form
l-K, where l is a location and K is an integer constant, simply add K to all the switch
case values, and emit the switch expression as l instead of l-K. This simple expedient
relies on the expression propagation and simplification that
are facilitated by the SSA form. It saves a lot of checking of special cases, following
possibly multiple in-edges to find a subtract statement, and so on.
In the example of Figure 6.3, l is eax4 and K is 2. The generated output is essentially
the same as the original source code.
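A minimal sketch of this renumbering in C++ follows (the IR types are invented for illustration; Boomerang's actual classes differ):

#include <vector>

// Simplified switch description: the expression is either a bare
// location l, or l - K for an integer constant K.
struct SwitchDesc {
    bool minusConst = false;      // true when the form is l - K
    int  K = 0;                   // the constant K
    std::vector<int> caseValues;  // e.g. 0, 1, ..., 5
};

// If the switch expression is l - K, emit it as l and add K to all
// the case values (0..5 become 2..7 in the Figure 6.2 example).
void normaliseSwitch(SwitchDesc& sw) {
    if (sw.minusConst) {
        for (int& v : sw.caseValues) v += sw.K;
        sw.minusConst = false;
        sw.K = 0;
    }
}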
There are three special cases where an optimising compiler does not emit the compare
and branch that usually establishes the size of the jump table.
Occasionally, the maximum value of a switch variable is evident from the switch expression.
Examples include l % K, l & (N-1), and l | (-N), where K is any nonzero integer constant
(e.g. 5), and N is an integral power of 2 (e.g. 8). For example, l | (-8) will always
yield a value in the range -1 to -8. In these cases, an optimising compiler may omit the
comparison against the maximum switch case value, if it is redundant (i.e. the highest
case value cannot be exceeded). In the second special case, the constraint on the switch
variable is known to be met by all callers. In these cases, range analysis will be needed to find the
size of the jump table. The third special case is the assigned goto statement of Fortran.
Figure 6.5 shows a simple program containing an assigned goto statement. Many modern
languages do not have a direct equivalent for this relatively unstructured form of
statement, including very expressive languages such as ANSI C. The similarity to the
switch program in Figure 6.2 suggests that the switch statement could be used to express
this program. For n assignments to the goto variable, there are n possible destinations
of the indirect jump instruction, which can be expressed with a switch statement containing
n cases. The case items are somewhat artificial, being the native addresses of
the jump targets.
The SSA form enables these targets to be found efficiently. Figure 6.6 shows the
detail. In effect, the φ-statements form a tree, with assignments to constants (represented
here by the letter L and their Fortran labels) at the leaves. It is therefore
straightforward to find the list of targets for the indirect jump instruction.
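A sketch of this leaf collection in C++ (the node types here are invented for illustration; real SSA IR nodes carry much more):

#include <set>
#include <vector>

// An SSA definition: either a constant (a jump target address) or a
// phi-function over earlier definitions. As the text notes, these
// form a tree, so no cycle detection is needed here.
struct Def {
    bool isConst = false;
    unsigned addr = 0;               // target address when isConst
    std::vector<const Def*> phiArgs; // non-empty for a phi-function
};

// Gather the constant leaves reachable through phi-functions.
void collectTargets(const Def* d, std::set<unsigned>& targets) {
    if (d->isConst) { targets.insert(d->addr); return; }
    for (const Def* arg : d->phiArgs)
        collectTargets(arg, targets);
}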
(a) Fortran source:

      program asgngoto
      integer num, dest
      print*, 'Input num:'
      read*, num
      assign 10 to dest
      if (num .eq. 2) assign 20 to dest
      if (num .eq. 3) assign 30 to dest
      if (num .eq. 4) assign 40 to dest
*     The computed goto:
      goto dest, (20, 30, 40)
10    print*, 'Input out of range'
      return
20    print*, 'Two!'
      return
30    print*, 'Three!'
      return
40    print*, 'Four!'
      return
      end

(b) IR for the assignments to the goto variable:

70  local2 := 134514528
73  if (local1 ≠ 2) goto L1
74  local2 := 134514579
L1:
85  local2 := φ(local270, local274)
77  if (local1 ≠ 3) goto L2
78  local2 := 134514627
L2:
86  local2 := φ(local285, local278)
81  if (local1 ≠ 4) goto L3
82  local2 := 134514675
L3:
87  local2 := φ(local286, local282)
84  goto local287

Figure 6.5: A program with an assigned goto: (a) Fortran source; (b) IR.
Statement 84 of Figure 6.5(b) shows the indirect goto that will become the case statement.
Figure 6.6: Tree of φ-statements and assignments to the goto variable from
Figure 6.5.
Figure 6.7 shows the resultant decompiled program. The output is not ideal, in that
aspects of the original binary program (the values of the labels) are visible in the
decompiled output. However, the computed goto is represented in the output language,
and is correct.
Figure 6.7: Decompiled output for the program of Figure 6.5. Output has been
edited for clarity.
Indirect jump instructions that do not match any known pattern have long been the
most difficult to translate, but value analysis combined with the assigned goto style of
switch statement makes translating them possible.
The above assigned goto analysis has generated code which would be suitable for an
indirect jump instruction which does not match any of the high level patterns; all that
is needed is the set of possible jump instruction targets. There may be cases where the
jump instruction targets are not available via a tree of φ-statements as is the case in the
program of Figure 6.5. An analysis that provides an approximation of the possible values
for a location (e.g. the value set analysis (VSA) of Reps et al. [BR04, RBL06]) would
provide the required targets. The imprecision of the value analysis may cause some emitting of irrelevant
output, or omission of required output, but this is still a considerable advance over not
translating the indirect jump at all.
Branch trees or chains are found in switch statements with a small number of cases,
and subtract instructions may replace the usual compare instructions, necessitating some
simple transformations to recover the original case values.
Branch trees are also found in other situations, such as a switch statement with a small
number of cases. Figure 6.8 shows a program fragment with such an example. In this
case, with only three cases, there isn't a tree as such, but the implementation has an
equivalent chain of branches.
Although the three switch values are 2, 15, and 273, the values compared to are 2, 13,
and 258. This is because instead of three compare and branch instruction pairs, the
compiler has chosen to emit three subtract and branch pairs. (Perhaps the reasoning
is that the differences between cases are usually less than the case values themselves,
and the x86 target has more compact instruction forms for smaller immediate values.)
Expression propagation and simplification readily yield param5 == 2 for the first comparison.
if (param5 == 2) { // 2 = WM_DESTROY
    PostQuitMessage(0);
} else {
    if (param5 - 2 == 13) { // 13 = WM_PAINT - WM_DESTROY
        BeginPaint(param4, &param2); ...
    } else {
        if (param5 - 15 == 258) { // 258 = WM_COMMAND - WM_PAINT
            if ((param6 & 0xffff) == 104) {...
            }
        } else {
            DefWindowProcA(param4, param5, param6, param7);
        }
    }
}
Figure 6.9: Direct decompiled output for the program of Figure 6.8.
Care needs to be taken to ensure that similar transformations are applied to the other two relational expressions, so that the true
switch values are obtained. Yet again, these manipulations are easier after expression
propagation and simplification.
int main() {
int n; printf("Input a number, please: "); scanf("%d", &n);
switch(n) {
case 2: printf("Two!\n"); break;
case 20: printf("Twenty!\n"); break;
case 200: printf("Two Hundred!\n"); break;
case 2000: printf("Two thousand!\n"); break;
case 20000: printf("Twenty thousand!\n"); break;
case 200000: printf("Two hundred thousand!\n"); break;
case 2000000: printf("Two million!\n"); break;
case 20000000: printf("Twenty million!\n"); break;
case 200000000: printf("Two hundred million!\n"); break;
case 2000000000: printf("Two billion!\n"); break;
default: printf("Other!\n");
}
return 0;
}
Figure 6.10: Source code using a sparse switch statement.
[Figure 6.11: Control flow graph for the program of Figure 6.10 (parts 1 and 2). There is no indirect jump; the compiler emits a binary search tree of compare and branch blocks (local0 = 20K, local0 > 20K, local0 = 20M, local0 > 20M, local0 = 20, local0 > 20, and so on, with leaves testing local0 = 2, 2K, 2M, 2B, 200, 200K, 200M), each leaf calling puts, followed by a final return block.]
While Figure 6.6 shows a tree of values leading to an indirect jump instruction, trees
also feature in the control flow graph of sparse switch statements. Compilers usually
emit a tree of branch instructions to implement sparse switch statements. Figure 6.10
shows C source code for such a program, and Figure 6.11 shows its control flow graph.
Note that no indirect jump instruction is generated; a jump table would be highly
inefficient. In this case, almost two billion entries would be needed, with only ten of
these actually used. The compiler emits a series of branches such that a binary search
of the sparse switch case space is performed. Without special transformations, the
decompiled output would be a series of nested if statements, which
would be correct, but less readable than the original code with a switch statement.
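The shape of the emitted branch tree can be sketched as follows (pivot values and nesting are assumptions for illustration; real compilers choose their own pivots):

#include <cstdio>

void dispatch(int n) {
    if (n == 20000) std::puts("Twenty thousand!");
    else if (n > 20000) {
        if (n == 20000000) std::puts("Twenty million!");
        else if (n > 20000000) { /* 200000000, 2000000000 elided */ }
        else { /* 200000, 2000000 elided */ }
    } else {
        if (n == 20) std::puts("Twenty!");
        else if (n > 20) { /* 200, 2000 elided */ }
        else if (n == 2) std::puts("Two!");
        else std::puts("Other!");
    }
}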
It should be possible to recognise this high level pattern in the CFG. There are no control
flow edges from other parts of the program to the conditional tree, and no branches
outside what will become the switch statement. The follow node (the first node that
will follow the generated switch statement) is readily found as the post dominator of
the nodes in the conditional tree. The switch cases are readily found by searching
the conditional tree for instructions comparing with a constant value. Range analysis
would be better, since it would allow for ranges of switch cases (e.g. case 20: case
21: ... case 29: print("Twenty something")).
The lcc compiler can generate a combination of branch trees and possibly short jump
tables [FH91, FH95]. Again, it should be possible to recognise this high level pattern
in the CFG.
An older version of the Sun C compiler, for sparse switch statements, compiled in a
simple hash function on the switch variable, and the jump table consisted of a pointer
and the hash key (switch value). A loop was required to account for collisions. Several
different simple hash functions have been observed; the compiler seemed to try several
hash functions and to pick the one with the best performance for the given set of switch
values. Suitable high level patterns can transform even highly unusual code like this
back into a readable switch statement.
Indirect calls implement calls through function pointers and virtual function calls; the
latter are a special case which should be handled specially for readability.
Compilers emit indirect call instructions to implement virtual function calls (VFCs),
and calls through function pointers (e.g. (*fp)(x), where fp is a function pointer and
x is an argument). Unlike switch statements with their varied
jump tables, virtual function calls are usually implemented using a few fixed patterns,
with some variation amongst compilers. There is an extra level of indirection with
VFCs, because all objects of the same class share the same virtual function table (VFT
or VT). The VT is often stored in read-only memory, since it never changes; the objects
themselves reside in writable memory.
Indirect function calls that do not match the patterns for VFCs can be handled as
indirect function pointer calls. In fact, VFT calls could be handled as indirect function
pointer calls, but there would be needless and complex detail in the generated code. For
example, the generated code could contain something like (*obj->vt[3])(obj, x),
where again x is the argument to the call, compared to the expected obj->method3(x).
[Figure diagram: an object whose hidden VT pointer leads to the virtual function table (VT); the VT holds downcast offsets (offset 1, offset 2) and function pointers (function 1, function 2, function 3), each function pointer leading to the code of a virtual function such as draw().]
Figure 6.12: Typical data layout of an object ready to make a virtual call such
as p->draw().
It is common in object oriented programs to find function calls whose destination depends
on the run-time class of the object making the call. In languages like C++, these
are called virtual function calls. The usual implementation includes a table of function
pointers, called the virtual function table (VFT, VT, or virtual method table), reached
through a hidden member variable. Figure 6.12 shows a typical object layout. The VT pointer is
not necessarily the first member variable. Figure 6.13 shows a typical implementation
of a virtual function call.
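Since Figure 6.13 is not reproduced here, the following C++ sketch shows the general shape of such an implementation (layout and the VT index are illustrative only):

struct Object;
using VFunc = void (*)(Object*);

struct Object {
    VFunc* vt;      // hidden VT pointer (not necessarily first)
    int member1;
    int member2;
};

// What the compiler emits for p->draw(), assuming draw() is the
// third entry in the VT:
void callDraw(Object* p) {
    VFunc* vt = p->vt;  // load the VT pointer from the object
    VFunc f = vt[2];    // load the method pointer at its VT offset
    f(p);               // indirect call, passing 'this'
}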
Where multiple inheritance is involved, the VT may include information about how to
cast a base class pointer to a derived class pointer (a process known as downcasting).
Such casting usually involves adding an offset to the pointer, but the required constant
often depends on the current (runtime) type of the original pointer. Sometimes this
constant is stored at negative offsets from the VT pointer, as shown dotted in Figure 6.12.
Tröger and Cifuentes [TC02] report that static analysis can identify virtual function calls
using high level patterns similar to those used for switch statements. It can determine
the location the object pointer is read from, the offset to the VT, and the offset to the
method pointer. To find the actual method being called, however, requires a run-time
value of an object pointer. As an example, the analysis might reveal where the object
pointer is read from, that the VT pointer is at some fixed offset in
the object, and the function pointer is at offset 8 in the VT. To find one of the methods
actually being called, the analysis needs to know that the object pointer could take on
the value allocated at the call to malloc at address 0x80487ab. The authors imply that
this analysis is possible only in a dynamic tool (such as a dynamic binary translator),
since only in a dynamic tool would such run-time information be available in general.
However, once the object pointer and offset of the VT pointer in the object is found from
the above analysis, finding the VT associated with the object is essentially equivalent
to performing value analysis (like range analysis but expecting sets of singleton values
rather than ranges of values) on the VT pointer member. In the compiler world, this is
called type determination. Pande and Ryder [PR94, PR96] prove that this is NP-hard
in general, but in practice many such object pointer values can be found. Finding all such objects allows examination of
all possible targets, eliminating arguments that are not used by any callees, and allowing
precise preservation analysis. The latter prevents all live locations from having to be
passed as reference parameters (or as parameters and returns) of the call, as discussed
in Section 3.4.3 on page 81. Value analysis may be able to take advantage of trees of
φ-statements, similar to those of Figure 6.6.
Once the arguments and object pointer are determined, the indirect call instruction can
be replaced by a suitable high level IR construct. The rest of the code implementing
the indirect call will then be eliminated as dead code, leaving only the high level call
in the decompiled output.
If the above could be achieved for most object values, then it would be possible to identify
most potential targets of virtual calls. This would enable decoding of instructions
reachable only via virtual functions, achieving more of the goal of separating code from
data.
Use of the SSA form helps considerably with virtual function analysis, which is more
complex than switch analysis, by mitigating alias problems, and because SSA relations
apply everywhere.
While the techniques of Tröger and Cifuentes [TC02] can be used to find the object
pointer, VT pointer offset, and method offset, they do not find the actual VT associated
with a given class object. Finding the VT associated with the object is equivalent to
finding a class type for the object. In some cases, the VT will even have a pointer to
the original class name in the executable [VEW04]. This section considers two analyses
for finding the targets of virtual function calls, the first without using the SSA form,
and the second with it.
Figure 6.15 shows a C++ program using shared multiple inheritance. The class A is
shared between B and C (as opposed to being replicated inside both B and C). In the
source code, the underlined virtual keywords make this choice. Machine code for this
program is shown in Figure 6.16. Unfortunately, simpler examples do not
contain aliases, and hence do not illustrate the point that SSA form has
real advantages for this analysis.
Consider the call at address 804884c, implementing c->foo() in the source code. Suppose
first that this call is being analysed instruction by instruction without the SSA form or
#include <iostream>
class X {
public:
    int x1, x2;
    X(void) { x1=100; x2=101; }
    virtual void foo(void) {
        cout << "X::foo(" << hex << this << ")" << endl; }
};
class A: public X {
public:
    int a1, a2;
    A(void) { a1=1; a2=2; }
    virtual void foo(void) {
        cout << "A::foo(" << hex << this << ")" << endl; }
};
class B: public virtual A {
public:
    int b1, b2;
    B(void) { b1=3; b2=4; }
    virtual void bar(void) {
        cout << "B::bar(" << hex << this << ")" << endl; }
};
class C: public virtual A {
public:
    int c1, c2;
    C(void) { c1=5; c2=6; }
    virtual void foo(void) {
        cout << "C::foo(" << hex << this << ")" << endl; }
};
class D: public B, public C {
public:
    int d1, d2;
    D(void) { d1=7; d2=8; }
    virtual void foo(void) {
        cout << "D::foo(" << hex << this << ")" << endl; }
    virtual void bar(void) {
        cout << "D::bar(" << hex << this << ")" << endl; }
};
int main(int argc, char *argv[]) {
    D* d = new D(); d->foo();
    B* b = (B*) d; b->bar();
    C* c = (C*) d; c->foo();
    A* a = (A*) b; a->foo();
    ....
Figure 6.15: Source code for a simple program using shared multiple inheritance.
the benefit of any data flow analysis. By examining the preceding three instructions,
it is readily determined that the object pointer is eax at instruction 8048846, and the VT
pointer is loaded from m[eax+8].
A simplified algorithm for determining the value of the VT pointer associated with the
call is as follows. Throughout the analysis, one or more expression(s) of interest is (are)
maintained. Analysis begins with the expression of interest set to the expression for
the VT pointer, in this case m[eax+8] (with eax taking the value it has at instruction
8048846). First, analysis proceeds backwards through the program. For each assignment
whose left hand side is a component of the current expression
of interest, the old value of that component is overwritten at this instruction, and cannot affect the indirect
call from that point back. The new expression of interest is found by replacing the
left hand side of the assignment (here eax) with the right hand side of the assignment.
An exception is code implementing a
null-preserved pointer, discussed below. This phase of the algorithm terminates when
the expression of interest would contain the result of a call to the operator new library
function.
Now a second phase begins where analysis proceeds forwards from this point. In this
phase, when an assignment is reached that produces an affine relation for one of the
components, the analysis continues with an extra expression of interest involving the new
locations. For example, if the expression of interest is ...m[eax+16]... and the assignment
esi := eax is encountered, the new expressions of interest are ...m[eax+16]...
and ...m[esi+16]... . If the assignment was instead ebx := eax+8, an expression for
eax is derived (here eax := ebx-8), and this is substituted into the existing expression.
In this case, the resulting expressions of interest would be ...m[eax+16]... and
...m[ebx+8]... . The affine assignment is the troublesome case, because
it marries the locations eax and ebx to each other. Such affine related locations,
if involved in memory expressions, produce aliases. This phase of the analysis terminates
when the indirect call instruction is reached again.
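The affine substitution step can be sketched in C++ as follows (flat base+offset expressions only, for illustration; real IR expressions are trees):

#include <cassert>
#include <cstring>

// m[base + offset], with the base named by a register string.
struct MemExp { const char* base; int offset; };

// Apply the assignment dst := src + delta to an expression of
// interest whose base is src: since src = dst - delta,
// m[src + off] aliases m[dst + (off - delta)].
MemExp substitute(MemExp e, const char* src, const char* dst, int delta) {
    assert(std::strcmp(e.base, src) == 0);
    return { dst, e.offset - delta };
}

// substitute({"eax", 16}, "eax", "ebx", 8) yields m[ebx + 8].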
Table 6.2 shows this process in detail for the program of Figures 6.15 and 6.16. When
the first phase of the analysis winds back to the call to operator new, the expression of
interest is equivalent to m[m[base+16]+8], where base is the start of the memory allocated for the object.
The backwards phase passes the instruction that assigns the VT pointer without recognising it, because at
the time it was not known that m[ebx] and m[esi+16] were aliases. This is the reason
for the second phase of the algorithm. Further forward progress reaches instruction
8048818: movl $0x8049c10,8(%eax), which gives the address of the VT for this indirect
call. However, the instructions at addresses 80487bf and 80487e9 also match
the final expression of interest, and these can only be reached by changing analysis
direction yet again. (In this case, they are not needed, as they are overwritten by the
assignment at 8048818, but this may not always be the case.) Virtual calls with the
same VT pointer call methods of the same class, thereby giving information about the
classes that methods belong to. It is now a simple matter of looking up the VT at offset
8 (the method pointer offset, here coincidentally the same as the VT pointer offset) to
find the target of the virtual call.
Note that in this example, there is only one possible target for the call, so the call
could have been optimised to a direct call if the compiler had optimised better. The
more common case is that there will be several possible objects pointed to, and each of
these could have different types, and hence different VT pointer values. The simplified
algorithm needs to be extended to take this into account, and since object pointers
can also reside in global and heap objects, these need to be considered as well. Since
global objects are created in special initialisation functions that are
specific to the executable file format, this code needs to be examined as well.
Note also that for this example, the standard preservations apply (e.g. esi and ebx
are preserved by the calls at 8048829 and 8048835). Alternatively, each call could be
processed in sequence, strictly checking each preservation as targets are found. However,
this will only strictly be correct if for every call, all possible targets are discovered, since
an undiscovered target could invalidate the preservation analysis.
This example provides a taste of the complexity caused by aliases, which are common
at the machine code level.
Consider now the alternative of waiting until data flow analysis has propagated, canonicalised,
and simplified expressions in the IR, and further that the IR is based on the SSA
form, as shown in Figure 6.17. For reasons of alias safety, propagation of memory expressions
has not been performed at this stage. (If expression propagation could have
been performed with complete alias safety, which may be possible some day, the analysis
of the call at 804884c would be simpler still.)
Table 6.3 shows the equivalent analysis with these assumptions. In the SSA form,
assignments such as esi19 := eax18 are valid everywhere that they could appear. Unfortunately,
although aliasing is mitigated (witness all the memory expressions that are
expressed in terms of eax18 in Figure 6.16), aliasing can still exist because of statements
such as eax59 := m[edi57]?, where the data flow analysis has not been able to determine
where m[edi57] has been assigned to. Hence, with SSA form, there is no need for strict
direction reversals as with the first algorithm, but sometimes more than one statement
has to be considered. Even so, the advantage of the SSA form version is apparent by
comparing the two analyses.
Compilers sometimes emit code that preserves the nullness of pointers that are offsets
into a memory allocation that may have failed; such derived pointers must remain NULL
if the allocation failed.
For example, if the return value from operator new at address 80487ab is NULL,
then the value of esi at address 804883a will be zero. If so, the sete (set if equal)
instruction will set register al (and hence eax) to 1, which will set edi to 0 at 804883f.
When this is anded with ebx at 8048842, the result will be 0 (NULL). Any nonzero
value returned from operator new will cause the sete instruction to assign 0 to eax,
-1 to edi, and after the and instruction, edi will have a copy of ebx (which holds the
derived pointer). After propagation, the IR for this sequence is
edi57 := ((0 | (esi49 = 0)) - 1) & ebx49. This could be overcome with a simplification
such as ((x = 0) − 1) & y → y. This construct could occur in the original
program's source code, so this simplification rule should be restricted to this sort of
analysis.
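The idiom itself can be written out in C++ as follows (a branch-free sketch of what the compiler emits; the names are illustrative):

#include <cstdint>

// Null-preserving derived pointer: returns 0 when p is NULL,
// otherwise returns 'derived' (p plus some offset, already computed).
// Mirrors the sete/and sequence ((p == 0) - 1) & derived.
std::uintptr_t nullPreserve(std::uintptr_t p, std::uintptr_t derived) {
    std::uintptr_t mask = static_cast<std::uintptr_t>(p == 0) - 1;
    // mask is 0 when p is NULL, all ones otherwise
    return mask & derived;
}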
The code sequence at 8048852-8048856 of Figure 6.16 implements a similar construct for
memory references inside the potentially invalid memory block. (The idea seems to be
to make sure that any fault should happen in user code, not in compiler generated code.)
A similar special case simplification handles this construct in the
analysis. Note that the branch in this sequence causes extra basic blocks that would
not otherwise be present, and this simplification avoids the complexity of following more
than one in-edge for the basic block currently being analysed.
Value analysis on the VT pointer member, discussed in earlier sections, allows the
comparison of VTs, which may give clues about the original class hierarchy.
Ideally, the decompiler output would reproduce as closely as possible the class hierarchy of the original source code.
If objects of classes A and B are ever associated with the same VT,
then A and B must be related in the class hierarchy of the original program. This
is valuable information that comes from performing value analysis on the VT pointer
members.
If class B is derived from class A, it generally follows that the VT for B has at least
the same number of methods as the VT for A. In other words, if the VT for B is at
least as large as the VT for A, then B could be derived from A: any
of the methods of B could override the methods of A, and in addition B may have no
new virtual methods that A does not. In fact, B could override all the methods of A, in
which case it is not possible to determine from the VTs alone whether B is derived from
A or A from B. In this case, the size of the object itself (which indicates the number
of data members) could be used to determine which is the superclass. In the case that
the objects are of equal size, then in reality it does not matter which is declared as the
superclass and which as the subclass.
Value analysis applied to function pointers yields the set of possible targets, avoiding readability reductions that these pointers would otherwise
cause.
Indirect call instructions that do not implement virtual function calls are most likely calls through function pointers.
Value analysis applied to a function pointer would yield the set of possible targets. As
mentioned above in relation to virtual function calls, the precise solution to this is an
NP-hard problem. Hence, approximate solutions (a subset of all possible targets) are
the best that can be expected in general.
As with virtual function pointers, if target information is not available, the possible
consequences are missing functions, excess arguments for the pointer indirect calls, and
excess reference parameters and returns.
6.3.3.1 Correctness
Imprecision in the list of possible targets for indirect calls leads to one of the few cases where decompiled output can be incorrect.
Since missing functions obviously imply an incorrect output, the inability to find the
complete set of targets for indirect calls is one of the very few reasons why decompiled
output cannot be correct in the general case. Recall that for most other limitations of
decompilation, the consequence is reduced readability rather than incorrectness.
While the missing functions could be found manually or by using dynamic techniques,
there must still be at least one error in the code for each missing function, since the
original source code must have assigned the address of the function to some variable
or data structure. If the pointer analysis was able to identify this value as a function
pointer, then the function would not be missing. So somewhere in the decompiled code
is the use of a constant (function pointers are usually constants) in an expression, and
this constant will be expressed as an integer when it should be the name of a function.
When the decompiled output is recompiled, the integer will no longer point to a valid
function, and certainly not to the missing function, so the program will not have the
same behaviour as the original.
In order for this incorrect output to occur, the necessary conditions are a function that
is reachable only through one or more function pointers, and value analysis that fails
to find the values of those pointers. A recovered target value may also end up
pointing to the middle of an existing function; this situation can be handled by splitting
the function.
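A contrived sketch of the failure mode (hypothetical names; the fragment is illustrative,
not the output of any particular decompiler):

#include <cstdio>

// handler is reachable only through a function pointer in a table.
static void handler() { std::printf("handled\n"); }
using Fn = void (*)();
static Fn table[1] = { handler };

int main() {
    table[0]();  // indirect call; value analysis must recover &handler
}

If value analysis fails here, the body of handler never appears in the decompiled
output, and the initialiser of table is emitted as a cast integer constant such as
(Fn)0x8048f2c instead of a function name, so the recompiled program cannot behave as
the original.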
When either virtual function analysis or function pointer analysis is successful, a set
of call targets becomes available. An indirect branch may implement a
tail call optimised call to another function, so it is also possible for an indirect branch
to lead to the discovery of a new function. The tail call optimisation replaces a call
and return sequence with a branch; this saves time, uses fewer instructions, and most
importantly saves stack space. In highly recursive programs, the tail call optimisation
can be essential to avoid exhausting the stack.
Occasionally, the newly discovered function target could be in the middle of an existing
function. Balakrishnan and Reps treat this as an error and merely report a message
[BR04]. The function may however be split into two, with a tail call at the end of the
first part to preserve the semantics. This will obviously not be practical if there are
many control flow edges crossing the proposed split point.
[Figure 6.18: A function being split; part (a) before and part (b) after the discovery
of a call target at address 1200.]
Figure 6.18 shows an example of a function being split. In part (a), the indirect call
instruction has not yet been analysed, and procedure A consists of block1 and block2.
In part (b), a new call target of 1200 has been found, which is in the middle of the
existing procedure A, and only one control flow edge crosses the address 1200. If the
address 1200 was in the middle of a basic block, it could be split, as routinely happens
elsewhere in decompilation.
The instructions from address 1200 to the end of the function become a new procedure;
a call to the new procedure is
inserted at the end of block1, and a return statement is inserted after the call. Thus,
after executing block 1 in A, block 2 is executed, control returns to the return statement
in A, and A returns as normal to its caller. This is effectively the reverse of the tail call
optimisation.
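In outline, the transformation looks like this (a schematic sketch with stub procedures,
not decompiler output):

void block1();  // instructions before address 1200
void block2();  // instructions from address 1200 onwards

// Before the split, the discovered target 1200 lies in the middle of A:
void A_before() {
    block1();
    block2();
}

// After the split, block 2 becomes a new procedure (hypothetical name),
// and A gains a call plus a return: the reverse of a tail call optimisation.
void proc_1200() {
    block2();
}
void A_after() {
    block1();
    proc_1200();
}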
6.4 Related Work

De Sutter et al. observe a similar problem to the phase ordering problem of Section
6.1 when constructing the ICFG (Interprocedural Control Flow Graph) from sets of
object files [SBB+00]. They solve it by using a so-called hell node in the ICFG, which
unknown control flows from indirect jumps or calls lead to. They construct control
flow edges also from the hell node to all possible successors of indirect jumps or calls.
For their domain, where relocation information is available, all possible successors can
be enumerated.
Tröger and Cifuentes implemented a way of analysing virtual function calls to find the
object pointer, virtual table pointer offset, and method offset [TC02]. They relied on
dynamic techniques to find actual values for the object pointer. They used a simple
algorithm that was limited to the basic block that the virtual call is in, and did not have
the advantages of expression propagation. Expression simplification was used, but only
in a limited way.
Vinciguerra et al. surveyed the various techniques available to disassemblers for finding
the code in an executable program [VWK+03]. The same techniques are used in the
front end of a machine code decompiler. The analysis of indirect jumps
and calls is cited as the problem which has dominated much of the work on C/C++
disassembly tools. Slicing and data flow guided techniques are mentioned, but not
with the advantages of expression propagation. The authors also mentioned that the
effect of limitations in this process is, at best, a less readable but correct program.
Reps et al. perform security analysis on x86 executables [BR04]. To cope with indirect
jumps and calls, they use a combination of the heuristics in IDA Pro and their Value
Set Analysis (VSA). VSA has already been mentioned as being needed for the solution
of several problems in type analysis and the analysis of indirect jumps and calls. They
do not split functions when a new target is in the middle of an existing function, but
treat this situation as an error.
Harris and Miller describe a set of analyses that split a machine code executable into
functions [HM05]. They claim that compiler independent techniques are not necessary,
because in practice simple machine and compiler dependent methods have proved ef-
fective at recovering jump table values. They do not appear to analyse indirect calls,
relying instead on various heuristics to find procedures in the gaps between regions of
known code.
Results
Several techniques introduced in earlier chapters were verified with a real decompiler,
and the results show that good output is possible with their use.

Many of the analyses described in earlier chapters have been tested on the Boomerang
decompiler [Boo02]. Simpler machine code decompilers, such as REC, can be relied on
to produce passable output on programs of almost any size, but Boomerang specialises
in correct, recompilable output for a wide range of small programs. Boomerang's test
suite consists of some 75 small programs, of which about a dozen fail for various reasons.
The results in this chapter relate to material in earlier chapters. Section 7.1 describes
an industry case study which confirms the limitations of current decompilers stated in
the first two chapters, but also shows how unlimited propagation can in some circumstances
lead to poor readability. Section 7.2 shows that common subexpression elimination is not
the answer, but a simple heuristic is quite effective. The usefulness of the SSA form is
demonstrated throughout; however, without care when translating out of SSA form in a
loop, extraneous variables can be emitted. Section 7.3 gives the results of applying the
techniques of Section 4.1.3 to this problem.
Section 4.3.2 showed how a preserved location could be a parameter; Section 7.4 presents
an example and results. Preservation analysis is shown in detail in Section 7.5. In the
presence of recursion, preservation analysis requires the theory of Section 4.4.2;
Section 7.5.1 gives detailed results of applying that theory. Recursion also complicates
the removal of redundant parameters and returns, as Section 4.4.3 indicated. Section 7.6
presents an example which shows good results.
When a Windows program was decompiled for a client, the deficiencies of existing de-
compilers were confirmed, and the importance of recovering structures was highlighted.
Chapter 1 discussed the problems faced by machine code decompilers, and Chapter 2
reviewed the limitations of existing decompilers. This section confirms those limitations
with an industry case study.
The author of this thesis and a colleague were asked to attempt a partial decompilation
of a 670KB Windows program [VEW04]. Source code was available, but it was for a
considerably earlier version of the program. The clients were aware of the limitations
of decompilation technology at the time.

While there were a few surprising results, in general the case study confirmed the
findings of Chapters 1 and 2. At the beginning of the study, Boomerang was broadly
comparable in capabilities with other existing machine code decompilers. Type analysis
was ad hoc, and structures were not supported. The parameter recovery logic assumed
that all parameters were passed on the stack, which was often not the case in the C++
program under study.

As a result of this study, the salient problems facing machine code decompilers were
confirmed to be the identification of parameters and returns, type analysis (including
the recovery of structures), and the handling of indirect jumps and calls.
One speculation from the paper [VEW04] has since been overturned. Section 5.2 of
the paper, referring to the problem of excessive propagation, states that "some form of
common subexpression elimination could solve this problem". As shown below, this is
not the case.
7.2 Limiting Expression Propagation

Common Subexpression Elimination does not solve the problem of excessive expression
propagation; the solution lies with limiting propagation of complex expressions to more
than one destination.
Section 3.2 on page 68 states that Common Subexpression Elimination (CSE) does
not solve the problem of excessive propagation, but preventing propagation of complex
expressions to more than one destination does. Results follow that verify this statement.
Figure 7.1 shows part of the sample output of Figure 8 in [VEW04]. It shows two cases
where excessive propagation has made the output less readable than it could be: the
semantics of many individual instructions have been merged into a few very complex
expressions.
Value numbering was used to generate a table of subexpressions whose values were
available on the right hand sides of assignments. Because of the simple nature of ma-
chine instructions, almost all subexpressions are available at some point in the program.
CSE was performed before any dead code was eliminated, so all propagated expressions
still had their originating assignments available. A C++ map was used in place of the
usual hash table, to avoid issues arising from collisions. Statements were processed in
order from top to bottom of the procedure. Figure 7.2 shows the results of applying
CSE to the code of Figure 7.1.
Line 10 is now simplified as desired (it corresponds to lines 18-21 of Figure 7.1), and
similar simplifications occur elsewhere (compare Figure 7.1). However, there are now far
too many local variables, and the semantics of
individual instructions is again evident. What is desired for ease of reading is propaga-
tion where possible, except in cases where an already complex expression (as measured
e.g. by the number of subexpressions) would be propagated to more than one destina-
tion. In other words, it is better to prevent excessive propagation, rather than trying
to repair its effects afterwards.
The propagation algorithm of Boomerang has been updated as follows. Before each
propagation pass, a map is now created of expressions that could be propagated to. The
map records a count of each unique expression that could be propagated to (an SSA
subscripted location). In the main expression propagation pass, before each expression
is propagated from the right hand side of an assignment to a use, this map is consulted.
If the number recorded in the map is greater than 1, and if the complexity of the
expression to be propagated is more than a given threshold (set with a command line
switch), the propagation is not performed.
Figure 7.3: Propagation limited to below complexity 2, applied to the same code as
Figure 7.1.

Figure 7.4: Propagation limited to below complexity 3, applied to the same code as
Figure 7.1.

The complexity of an expression is measured as the number of operators in the
expression (binary and ternary operators, and the memory-of operator).
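A minimal sketch of the heuristic (simplified data structures and names; Boomerang's
actual implementation differs):

#include <map>
#include <vector>

// One candidate propagation: a definition (an SSA-subscripted location)
// whose right hand side would replace one of its uses.
struct Candidate {
    int defId;          // identifies the defining statement (the source)
    int rhsComplexity;  // operator count of the RHS to be propagated
};

// Decide which candidate propagations to perform in this pass: a complex
// expression may not be propagated to more than one destination.
std::vector<bool> planPropagation(const std::vector<Candidate>& cands,
                                  int threshold) {
    // First pass: count the destinations each definition would reach.
    std::map<int, int> destCount;
    for (const Candidate& c : cands)
        ++destCount[c.defId];

    // Second pass: allow a propagation unless the definition reaches more
    // than one destination and its RHS exceeds the complexity threshold.
    std::vector<bool> allow;
    allow.reserve(cands.size());
    for (const Candidate& c : cands)
        allow.push_back(!(destCount[c.defId] > 1 &&
                          c.rhsComplexity > threshold));
    return allow;
}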
Performing this limited expression propagation on the same code from Figure 7.1 re-
sults in the output shown in Figures 7.3 - 7.5. The difference between these is that
the first prevents propagation of expressions of complexity 2 or more, the second with
complexity 3 or more, and the third with complexity 4 or more. The if statements are
again simplified, but complex expressions are retained for
the other statements, for maximum readability. Although the number of lines of code
only reduces from 14 to 11 or 12, the readability of the code is considerably improved.
Readability was measured with three metrics: the character count (excluding multi-
ple spaces, newlines, and comments), the Halstead difficulty metric, and the Halstead
program length metric [Hal77]. The Halstead program length metric is the sum of the
total number of operators and the total number of operands.
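For reference, the standard Halstead definitions, with n1 and n2 the numbers of distinct
operators and operands and N1 and N2 their total occurrence counts, are:

    N = N1 + N2                  (program length)
    D = (n1 / 2) * (N2 / n2)     (difficulty)

These are the standard formulas from [Hal77], restated here for convenience.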
Metrics based on control flow complexity (e.g. McCabe Cyclomatic Complexity [McC76])
will not find any difference due to expression propagation or the lack thereof, since the
control flow graph is not affected by propagation. Table 7.1 shows the results for the
selected metrics. The Halstead metrics were calculated with a public domain program
[Uti92].
Table 7.1: Complexity metrics for the code in Figures 7.1 - 7.5.
Varying the expression depth limit causes minor variations in the output, as can be
seen by inspecting the decompiled output of Figures 7.3 - 7.5. The metrics vary in an
unpredictable way with this limit, because of a kind of interference effect. For example,
a higher limit may block an early propagation into a statement, reducing that statement's
complexity, but this statement may now be able to be propagated to a later statement.
If the early propagation had succeeded, the later propagation may have failed. Whether the
overall metrics improve or not ends up depending on chance factors. While the effect of
the propagation limit is not very large (the Halstead Difficulty varies from 22.7 to 28.3,
a 25% difference), it does suggest that propagation could be one area where user input
will be required for the most readable result (according to the tastes of the reader).
The low value for the Halstead length metric in the CSE case is because of the extreme
re-use of each subexpression; nothing is calculated more than once. The fact that
there are many local variables and assignments is not reflected in this metric; the left
hand sides of each assignment are ignored. This metric is therefore not measuring the
distraction and difficulty of dealing with a needlessly large number of local variables,
and hence this metric does not show the dramatic difference that either the character
count or the Halstead difficulty metric shows.

The source code for a second example, a min/max test function, is shown in Fig-
ure 7.6. As shown in Figure 7.7, the compiler uses an idiomatic sequence involving
the subtract with carry instruction. Output from the Boomerang decompiler without
expression limiting is shown in Figure 7.8. Boomerang has special code to recognise
the use of the carry flag in this idiom, but not to emit the conditional assignments that
the idiom represents. The result is completely incomprehensible code, but it is at least
correct. Figure 7.9 shows the same code with expression propagation limiting in effect.
While still difficult to comprehend, the result is much better laid out.
int test(int i) {
if (i < -2) i = -2;
if (i > 3) i = 3;
printf("MinMax result %d\n", i);
}
Figure 7.6: Original C source code for function test in the Boomerang SPARC
minmax2 test program. The code was compiled with Sun's compiler using -xO2
optimisation.
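The subtract with carry idiom computes a branchless select. The following sketch shows
the kind of clamping computation the idiom performs (a hand-written illustration, not
the compiler's exact instruction sequence):

#include <cstdio>

// Branchless clamp of i to [-2, 3] in the style of carry-flag idioms:
// mask is all-ones when the condition holds and 0 otherwise, so each
// step selects between the old value and the bound without branching.
int clampBranchless(int i) {
    int lo = -(i < -2);          // all-ones if i < -2, else 0
    i = (i & ~lo) | (-2 & lo);
    int hi = -(i > 3);           // all-ones if i > 3, else 0
    i = (i & ~hi) | (3 & hi);
    return i;
}

int main() {
    std::printf("%d %d %d\n",
                clampBranchless(-5), clampBranchless(0), clampBranchless(9));
    // prints: -2 0 3
}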
When the techniques of Section 4.1.3 are applied to the running example, the generated
output improves considerably.
Figure 4.11 on page 111 showed output from an early version of Boomerang which had
no logic for limiting extraneous variables. It is reproduced in Figure 7.10 with the local
variables renamed as far as possible to the same names as the original registers. Variable
st represents the stack top register of the x86 processor. This is done to facilitate
comparison with later output, which does this automatically. There are 10 statements
inside the loop.
do {
edx_1 = edx;
ebx_1 = ebx;
esi_1 = esi;
st_1 = st;
edx_2 = edx_1 - 1;
st = st_1 * (float)edx_1 / (float)esi_1;
esi_2 = esi_1 - 1;
ebx = ebx_1 - 1;
edx = edx_2;
esi = esi_2;
} while (ebx_1 >= 1);
st_2 = (int)(st_1 * (float)edx_1 / (float)esi_1);
Figure 7.10: A copy of the output of Figure 4.11 with local variables named
after the registers they originated from.
Figure 7.11(a) shows the output from a more recent version of Boomerang, which
incorporates the expression limiting heuristic (with expression depth limited to 3), and
makes better choices about which version of variable to allocate to a new variable.
Already, it is a vast improvement. In this version, there are 5 statements inside the
loop.
Figure 7.11(b) shows output from Boomerang using the -X (experimental) flag. This flag
enables experimental handling of overwriting statements in a loop. A side effect is that
the assignment to st is split onto
two lines, so there are still five statements in the loop, but no extraneous variables.
With support for the C post decrement operator (e.g. edx--), output similar to the
original source code could be generated.

As of April 2007, the Boomerang source code implementing the algorithm of Section
4.1.3 is flawed, using the invalid concept of dominance numbers. It will fail in some
circumstances.
(a)
do {
    edx_1 = edx;
    edx = edx_1 - 1;
    st = st * (float)edx_1 / (float)esi;
    esi = esi - 1;
    ebx = ebx - 1;
} while (ebx >= 0);
edx = (int)st;

(b)
do {
    st = st * (float)edx;
    edx = edx - 1;
    st = st / (float)esi;
    esi = esi - 1;
    ebx = ebx - 1;
} while (ebx >= 0);
edx = (int)st;

Figure 7.11: Output from a more recent version of Boomerang: (a) with the expression
limiting heuristic; (b) with the -X flag.
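With support for the post decrement operator, as mentioned above, the loop of
Figure 7.11(b) could collapse further. A hand-written sketch of such output (not
produced by Boomerang; the initial value of st is assumed):

// Variables as in Figure 7.11(b); float stands in for the x86 stack top.
int loopWithPostDecrement(int edx, int esi, int ebx) {
    float st = 1.0f;                 // assumed initialisation
    do {
        st = st * (float)edx--;      // use edx, then decrement
        st = st / (float)esi--;
    } while (--ebx >= 0);            // ebx = ebx - 1; loop while ebx >= 0
    return (int)st;                  // edx = (int)st in the figure
}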
7.4 Preserved Parameters

Preserved locations appear to be parameters; usually they are not, but sometimes
they are.
Figure 7.12 shows the x86 assembly language for a simple function that takes one
parameter and returns a result, where the parameter is a register that is saved, used as
a parameter, and restored.

Figure 7.13 shows the intermediate representation for the program, after expression
propagation, but before dead code elimination. Note how the save instruction becomes
dead code, the restore instruction becomes a null statement, and the use of the param-
eter is propagated through to the return.
After dead code elimination, the only statement remaining is the return statement. The
resulting output is shown in Figure 7.14.

Figure 7.13: Intermediate representation for the code of Figure 7.12, just before
dead code elimination.

Figure 7.14: Boomerang output for the code of Figure 7.12. The
parameter and return are identified correctly.
Figure 7.15 shows the source code for a Fibonacci function, and Figure 7.16 shows
the disassembly of a hand-modified version of it, originally a test program for the dcc
decompiler [Cif94]. Where the original program saved the SI
register with a PUSH instruction and then loaded the parameter from the stack, this
version passes the parameter in SI yet still PUSHes and POPs the register. The register
SI is therefore both preserved and a parameter. The resultant program still runs
on a 32-bit personal computer, despite the age of the instruction set (from 1978), and
performs the same operation as the original program in fewer instructions (not counting
NOPs). It is therefore a valid program that machine code decompilers should be able to
handle.

Figure 7.15: Original source code for a Fibonacci function. From [Cif94].
035B 55 PUSH BP
035C 8BEC MOV BP,SP
035E 56 PUSH SI
035F 909090 NOP ; Was MOV SI,[BP+4]
0362 83FE02 CMP SI,+02
0365 7E1C JLE 0383
0367 4E DEC SI ; Was MOV AX,SI
0368 9090 NOP ; Was DEC AX
036A 90 NOP ; Was PUSH AX
036B E8EDFF CALL 035B
036E 5E POP SI ; Was POP CX; get original parameter
036F 56 PUSH SI
0370 50 PUSH AX
0371 83C6FE ADD SI,-02 ; Was MOV AX,SI
0374 90 NOP ; Was ADD AX,-2
0375 90 NOP ; Was PUSH AX
0376 E8E2FF CALL 035B
0379 90 NOP ; Was POP CX
037A 8BD0 MOV DX,AX
037C 58 POP AX
037D 03C2 ADD AX,DX
037F EB07 JMP 0388
0381 EB05 JMP 0388
0383 B80100 MOV AX,0001
0386 EB00 JMP 0388
0388 5E POP SI
0389 5D POP BP
038A C3 RET
Figure 7.16: Disassembly of the modied Fibonacci program adapted from [Cif94].
As shown in Figures 7.17 and 7.18(a), the dcc and REC decompilers do not produce
valid source code. Neither decompiler identifies any parameters, although REC
passes two arguments to one call and none to the other. Both emit invalid C code for
the POP instructions.
The Boomerang decompiler uses propagation and dead code elimination to identify the
preserved parameter.¹ The result of decompiling the 32-bit equivalent of the program
of Figure 7.16 is shown in Figure 7.18(b). This program also demonstrates the removal
of all trace of the preservation code.
¹ The 32-bit version was not completely equivalent. The 32-bit version originated from the source
code of Figure 7.19. The two minor differences cause fib(0) to return 0, as is usually considered correct.
int proc_1 ()
/* Takes no parameters.
* High-level language prologue code.
*/
{
int loc1;
int loc2; /* ax */
if (loc1 > 2) {
loc1 = (loc1 - 1);
POP loc1
loc1 = (loc1 + 0xFFFE); /* 0xFFFE = -2 */
loc2 = (proc_1 () + proc_1 ());
} else { ...
Figure 7.17: Output from the dcc decompiler for the program of Figure 7.16.
REC output:

L0000025b()
{
    /* unknown */ void si;
    if(si <= 2) {
        ax = 1;
    } else {
        si = si - 1;
        (restore)si;
        si = si - 2;
        dx = L0000025B(
            L0000025B(), si);
        (restore)ax;
        ax = ax + dx;
    }
}

Boomerang output:

int fib(int param1) {
    __size32 eax;
    __size32 eax_1; // eax30
    int local2; // m[esp - 12]
    if (param1 <= 1) {
        local2 = param1;
    } else {
        eax = fib(param1 - 1);
        eax_1 = fib(param1 - 2);
        local2 = eax + eax_1;
    }
    return local2;
}

Figure 7.18: Output from the REC and Boomerang decompilers for the program of
Figure 7.16 and its 32-bit equivalent respectively.
7.5 Preservation
Most components of the preservation process are facilitated by the SSA form.
The Boomerang machine code decompiler has an equation solver. It was added when it
was found necessary to determine whether registers are preserved or not in the presence
of recursion. Figure 7.21 shows the output of Boomerang's solver, finding that esi (an
x86 register) is preserved in the recursive Fibonacci function of Figures 7.19 (source
code) and 7.20.

It begins with the premise esi35 = esi0 (i.e. esi as last defined is the same as esi
on entry; the return statement contains information about which locations reach the
exit). Various rules are applied to the current equation: propagation into the left hand
side, adding constants to both sides, using the commutation property of equality (swap
left and right sides), and so on. For example on the second line, the LHS is esi35
(esi as defined at line 35). In Boomerang, the subscript on esi is actually a pointer
to the statement at line 35, so again the propagation is very easy. The RHS has esi0,
and the proof engine actually searches for statements (such as statement 5) which save
esi. Memory expressions are handled by proving equality of the addresses on
both sides (i.e. to prove that m[a] = m[b], it is sufficient to show that a = b). When
the left hand side could be propagated to by a φ-function, then the current equation
has to be proved for each operand. In the example, the current equation becomes two
equations, one of them being ebp3 = (esp4 + 4). (Control flow could be such that ebp
could be defined at statements on more than one path; each definition must
prove the premise, i.e. that esi is preserved through the whole function.) Each of these
equations is proved in turn, as the rest of Figure 7.21 shows.

Without the SSA form, it would be necessary to separate various versions of each
location, such as esi on entry, esi at the procedure exit, and esi after statement
26.
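The flavour of the solver can be sketched as rewriting the left hand side until it
matches the right (a toy illustration with a fixed substitution table, not Boomerang's
proof engine):

#include <cstdio>
#include <map>
#include <string>

// Toy proof that esi35 = esi0: repeatedly substitute each SSA definition
// into the left hand side until it syntactically matches the right.
int main() {
    std::map<std::string, std::string> defs = {
        {"esi35", "m[esp4 - 4]"},  // esi restored from the stack slot
        {"m[esp4 - 4]", "esi0"},   // statement 5 saved esi0 in that slot
    };
    std::string lhs = "esi35";
    const std::string rhs = "esi0";
    while (lhs != rhs) {
        auto it = defs.find(lhs);
        if (it == defs.end()) { std::printf("not proven\n"); return 1; }
        std::printf("%s = %s (substitute definition)\n",
                    lhs.c_str(), rhs.c_str());
        lhs = it->second;  // propagate the definition into the LHS
    }
    std::printf("%s = %s -- premise proven\n", lhs.c_str(), rhs.c_str());
}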
A test program was written to exercise the conditional preservation analysis of Sec-
tion 4.4.2 on page 130. Figure 7.22 shows the call graph for the program, deliberately
designed to copy Figure 4.26 on page 129. An outline of the test program's source code
is given in Figure 7.23.

The global variable res is used to ensure that each procedure is called the correct number
of times. Each procedure increments res by a prime number, so that if the program
outputs the correct total, the control flow is very likely correct. Global variables with
[Figure 7.22: Call graph for the test program: main calls b, which heads the mutually
recursive procedures c to l; the legend marks where registers ecx and edx are exchanged.]
names of the form x_y are used to control the recursion from procedure x to procedure
y. All such globals are initialised to 3; the program outputs "res is 533". To make
this test more difficult for the conditional preservation analysis, there are instructions
to exchange the contents of machine registers ecx and edx at the points indicated on
Figure 7.22, and a single instruction to decrement register edx is placed in procedure
k after the call to e. This results in edx changing value, but also register ecx, since
there is a path in the call graph where there is an odd number of exchange instructions
(c-d-e-c).
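A sketch of the shape of such a test program (a hypothetical two-procedure excerpt;
the real program has procedures b through l and different constants):

#include <cstdio>

int res = 0;
int f_g = 3, g_f = 3;   // globals of the form x_y control recursion x -> y

void f();
void g();

void f() {
    res += 5;            // each procedure adds a distinct prime
    if (f_g-- > 0) g();  // recurse into g a fixed number of times
}

void g() {
    res += 7;
    if (g_f-- > 0) f();  // forms the f-g-f cycle in the call graph
}

int main() {
    f();
    std::printf("res is %d\n", res);  // a wrong total betrays bad control flow
}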
The first cycle detected is f-g-f, when g's child f is found in path. The first preservation
to succeed is esp = esp+4. As with most x86 procedures, the stack pointer for g is
preserved in the base pointer register (ebp). Because of the control flow join at the
end of g, there is a φ-statement defining the final value of ebp, so the proof has to
succeed for both paths. One path is through the call to f, so the preservation of ebp in
g depends on the preservation of ebp in f. f has a similar structure to g, so again there
is a φ-statement, the proof has to succeed for both paths, and now the preservation for
ebp in f depends on the preservation of ebp in g. Note that the algorithm has now
come almost full circle: esp in g depends on ebp in g, ebp in g depends on ebp in f, and
ebp in f depends on ebp in g. Since preservation that depends only on recursive calls can
be assumed to succeed, ebp is indeed assumed to be preserved in g (note: this is not yet
proven, it is a conditional result). No other problems are found with this preservation,
so finally esp = esp+4 is proven. Note that the intermediate conditional results (ebp=ebp
in g and f) are not stored, since at the time when the proof function exits, it was not
known if the outer proof would succeed or fail. This is illustrated in the next example.
The final example is the preservation ecx=ecx in b. Recall that there are instructions to
exchange the value of these registers before and after calls in the c-j-k-e-c, b-c-l, and c-
d-e-c cycles. Due to these, preservation of ecx in b depends on edx in c, which depends
on ecx in d, which depends on edx in e, which depends on ecx in c. Note that while
the proof has returned to procedure c, ecx in c is being considered for the first
time, so the process continues. ecx in c depends on edx in d, which depends on ecx in
e, which depends on edx in c. Finally, there is a required premise that is already being
proved, so it is conditionally assumed to succeed; as will
be shown soon, this does not turn out to be true. In similar vein, the preservations ecx
in e and edx in d are assumed to be conditionally proven. However, c has other calls,
so ecx in c, which depended on edx in d and has conditionally passed, now depends
on edx in j, which depends on ecx in k, which depends on edx in e. This is another
premise, which is conditionally assumed to succeed, and similarly edx for j. There is
still one more call in c, to l, so ecx in c also depends on edx in l. This depends on
ecx in b, the original premise, and so conditionally succeeds. Finally, ecx in c succeeds
conditionally, leading to conditional successes for edx in e and ecx in d. The algorithm
is now considering edx in c, which depends on ecx in j, and in turn edx in k. This
depends on ecx in e, which has been calculated and conditionally proven before, but
is calculated anew. After the call to k is the decrement of edx, so one critical path of
the whole proof finally fails, since edx0 = edx0 - 1 cannot be proven. As a result, ecx
is assumed to be assigned to in the call to c, and ecx remains a parameter and return of
b, c, and all the other procedures involved in the cycle. Similarly, edx is found not to be
preserved.
Figure 7.24: The code generated for procedure b for the program
test/pentium/recursion2. The Boomerang -X option was used to remove
extraneous variables, as discussed in Section 7.3. The code has been modified by
hand (underlined code) to return more than one location.
Figure 7.24 shows an outline of the code generated for procedure b. Since Boomerang
did not at the time handle more than one return from a procedure, some hand editing
of the output was necessary. The final program compiled and ran identically to the
original.
7.6 Redundant Parameters and Returns

The techniques of Section 4.4.3 successfully remove redundant parameters and returns.
Figure 7.25 shows assembly language code for the program of Figure 7.19. A few
instructions were modified so that registers ecx and edx were used and defined in the
function. Without the techniques of Section 4.4.3 on page 134, and assuming that
fib:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $4, %esp
cmpl $1, 8(%ebp)
jle .L2
movl 8(%ebp), %eax
decl %eax
subl $12, %esp
pushl %eax
call fib
addl $16, %esp
push %edx # Align stack, and make a ...
push %edx # ... use of %edx (but it is dead code)
push %eax # Save intermediate result
movl 8(%ebp), %eax
subl $2, %eax
pushl %eax
call fib
pop %edx # Remove argument, assign to edx
pop %ecx # Get intermediate result, assign to ecx
addl $8, %esp
addl %eax, %ecx # Add the two intermediate results
movl %ecx, -8(%ebp)
jmp .L4
.L2:
movl 8(%ebp), %eax
movl %eax, -8(%ebp)
.L4:
movl -8(%ebp), %eax
movl -4(%ebp), %ebx
leave
ret
Figure 7.25: Assembler source code for a modification of the Fibonacci program
shown in Figure 7.19. The underlined instructions assign values to registers ecx
and edx.
push and pop instructions are not simply ignored, these two registers become extra
parameters and returns of the fib function, as shown by the IR in Figure 7.26.
The edx return from fib is used to pass an argument to the call at statement 43, and
so at first appears to be live.
Similar comments apply to the assignment to ecx at statement 52, although in this case,
that assignment is gainfully used. Along the path where param10 > 1, ecx is defined
before use, and along the path where param10 ≤ 1, it is only used in the φ-function in
statement 67, which is only ultimately used to pass a redundant parameter to fib.
The techniques of Section 4.4.3 were able to determine that ecx and edx were redundant
parameters and returns, resulting in the generated code of Figure 7.27.
Conclusion
8.1 Conclusion
The solutions to several problems with existing machine code decompilers are facilitated
by the SSA form.

The main problems for existing machine code decompilers were found to be the iden-
tification of parameters and returns, type analysis, and the handling of indirect jumps
and calls.
The IR used by a decompiler has a profound influence on the ease with which certain
analyses can be performed. Expression propagation is one of the analyses greatly facil-
itated by the SSA form, and propagation is fundamental to transforming the semantics
of many individual instructions into the complex expressions typical of source code. In
fact, unlimited propagation leads to expressions that are too complex, but appropriate
limits are easily applied.
Propagating assignments of condition codes into their uses (e.g. conditional branches) is
a special case, where the combination of condition code and branch (or other) condition
can usually be simplified into a high level condition.

Once expressions are propagated, the original definitions become dead code, and can
be eliminated. Dead code elimination performs an important role in the translation
to high level language: reducing the size of the generated output. The established
techniques of control flow structuring introduce high level constructs such as conditional
statements and loops.
The next major decompilation step is the accurate recovery of call statements: argu-
ments and parameters, returns and results. One of the main inputs to this process
is knowledge of whether each location is
modified by the target of a call, or is preserved. In the main, this is solved by standard
data flow analysis, but the combination of indirect calls and recursion causes problems,
since the results of some calls will not be known when needed. An algorithm for solving
these problems was given.
While the SSA form makes the propagation of registers very easy, it is not safe to
propagate memory expressions in the same way, because of possible aliasing. Compilers
can sidestep this issue by never moving load and store operations, although some will
attempt to move them when it can be proved to be safe. The cost of not moving
loads and stores is that a little optimisation opportunity is
lost. For decompilers, the cost of not being able to propagate memory expressions is a
considerable loss of readability. In particular, all stack local variables, including arrays,
should be converted to variables in the decompiled output. Without this conversion, the
stack pointer will be visible in the decompiled output, and the output will be complex
and may not be portable. A solution to the problem is given, but the solution ignores
the possibility of aliasing.
Few if any existing decompilers attempt type recovery (type analysis) to any serious
degree. This thesis outlines a type analysis implemented with an iterative, data flow
based framework. The SSA form provides convenient, sparse type information stor-
age with each definition and constant. A problem strongly related to type analysis is
the partitioning of the data sections (global, local, and heap allocated) into objects of
distinct types. This problem is made more complex by the widespread practice of colo-
cating variables in the same register or memory location. Compilers use a surprisingly
small set of memory expression patterns to access variables of both the elementary
types, and the elements of aggregate types. These patterns are explored in detail, but
the resolution of some of the more complex patterns remains for future work. In par-
ticular, the interpretation of the constant K associated with many memory expressions
is complicated by factors such as offset pointers and arrays with nonzero lower bounds.
The usual way to handle indirect jump and call instructions is to perform a pattern-
based local search near the instruction being analysed, usually within the basic block
that the indirect instruction terminates. This search is performed at decode time,
before any more powerful techniques are available. The logic is that without resolving
the indirect jump or call, the control flow graph is incomplete, and therefore the later
analyses will be incorrect. While this is true, it is possible to delay analysis of these
indirect instructions until the more powerful analyses are available. The analysis of
the whole procedure has to be restarted after the analysis of any indirect branches,
incorporating the new indirect branch targets. This step may be repeated several times
if the indirect branches are nested. The same may be required after indirect calls. The
power of expression propagation, facilitated by the SSA form, can be used to improve
the analysis, e.g. it may well succeed even when several basic blocks have to be analysed.
The resolution of VFT pointers is shown to be simpler using this technique also.
8.2 Summary of Contributions

Many of the techniques introduced in this thesis have been verified by a working machine
code decompiler.
This thesis advances the state of the art of machine code decompilation through several
key contributions.
The first two chapters established the limitations of current machine code decompilers.
Table 1.2 compared the fundamental problems of various tools related to machine code
decompilers. Figure 1.7 identified where the losses and separations that cause these
problems arise.
Chapter 3 demonstrated significant advantages for the SSA form as a decompiler IR,
including making expression propagation easy, assisting with the analysis
of indirect branches and calls, and enabling a sparse data flow based type analysis.
The SSA implementation of a pointer from each use to its definition saves having to
rename the original definition in many cases. Compilers as well as decompilers could
benefit from this technique.

The power of expression propagation and its importance to the overall decompilation pro-
cess was shown, along with the necessity of limiting some propagations.
Summarising the data flow effects of calls requires extensions to standard analyses; Section
3.4 defined appropriate terminology and equations for deriving the required summary
information.

While compilers can combine translation out of SSA form with register allocation, in
a decompiler more care is needed, or extraneous variables result.
Although the SSA form makes it easy to propagate registers, the propagation of memory
expressions is more difficult, yet it has a large effect on the readability
of the generated code. Section 4.2 identified the issues and gave a solution.
Unless it is assumed that all procedures follow the ABI specification, preservation an-
alysis is important for correct data flow information. Section 4.3 defined the issues and
showed how the SSA form simplifies the problem. Section 4.4 showed how recursion
causes extra problems, and gave a solution for preservation analysis in the presence of
recursion.
Chapter 5 gave the first application of an iterative, data flow based type analysis to
machine code decompilers. In such a system, addition and subtraction are not readily
handled, and a suitable treatment was given.
The memory expression patterns representing accesses to aggregate elements are sur-
prisingly complex; these patterns were catalogued in detail.
Chapter 6 showed how the analysis of indirect jumps and calls can be delayed until more
powerful analyses are available, and the implications of working with an incomplete
control flow graph.

Section 6.2.2 showed how Fortran style assigned gotos can be analysed and represented
in a language such as C.

In Section 6.3.1, it was shown that less work is involved in the identification of VFT
pointers when their analysis is delayed in this way.
8.3 Future Work

While the state of the art of decompilation has been extended by the techniques described
in this thesis, work remains to optimise an SSA-based IR for decompilers that handles
memory expressions well.

At the end of Chapter 3 it was stated that an IR similar to the SSA form, but which
stores and factors def-use as well as use-def information, could be useful for decompila-
tion; the dependence flow graph (DFG) may be one such IR.
In several places, the need for value (range) analysis has been mentioned. This would
help with function pointers and their associated indirect call instructions, and with
interpreting the meaning of K in complex type patterns, and would allow the declaration
of initialised arrays and similar data.
Alias analysis is relatively common in compilers, and would be very beneficial for de-
compilers. However, working from source code (including from assembly language) has
one huge advantage: when a memory location has its address taken, it is immediately
obvious from the source code. This makes escape analysis practical; it is possible to
say which variables have their addresses escape the current procedure, and it is safe to
assume that no other variables have their addresses escape. At the machine code level,
it is not always obvious when an address is taken,
and even if this is deduced by analysis, it is not always clear which object's address is
taken. Hence, for alias and/or escape analysis to be practical for machine code decom-
pilation, it probably needs to be combined with value analysis. These analyses would
then make the safe propagation of memory expressions possible.
One area that emerged from the research that remains for future work is the possibility
of partitioning the decompilation problem so that parts can be run in parallel threads
or processes. The early, procedure-local phases appear to partition naturally; it may be
that following this stage, interprocedural analyses may be more difficult to partition
effectively.
For convenience, a standard stack pointer register is usually assumed. In many architec-
tures it is possible to use almost any general purpose register as a stack pointer, and a
few architectures do not specialise one register at all as the stack pointer. In these cases,
the ABI will specify a stack pointer, but it will be a convention not mandated by the
architecture. Programs compiled with a nonstandard stack pointer register are likely
rare, but they fall outside the convention followed by
most machine code. The ability to be stack pointer register agile could therefore be
a useful extension.
With the beginnings of the ability to decompile C++ programs comes the question of
what to do with compiler generated functions, such as constructors and destructors, and
their calls. Presumably, these functions could be recognised from the sorts of things
that they do, and their calls suppressed. The fallback is simply to declare them as
ordinary functions, and to not remove the calls. The user or a post decompilation
program could remove them if desired. Other compiler generated code such as null
pointer checking, and debug code such as initialising stack variables to 0xCCCCCCCC and
the like, are potentially more difficult to remove automatically, since they could have
been written by the programmer.
There is also the question of what to do with exception handling code. Most of the time,
very little exception specific code is visible in the compiled code by following normal
control flow; often there are only assignments to a special stack variable representing
the current exception zone of the procedure. These assignments appear to be dead
code. Finding the actual exception handling code is compiler specific, and hence will
require per-compiler work. Control flow structuring is largely well understood, although it is
possible that some unusual cases, such as compound conditionals (e.g. p && q) in loop
predicates, may require some research to handle effectively.
The phase ordering problem in decompilation is very significant; almost everything seems
to be needed before everything else. There is now, however, a demonstrated set of
techniques capable of decompiling small machine code programs into readable,
recompilable source code. More than one source architecture is supported. The goal of
a general machine code decompiler is considerably closer as a result.
Bibliography

[AC72] F. Allen and J. Cocke. Graph theoretic constructs for program control flow analysis. Technical Report RC 3923, IBM T.J. Watson Research Center, 1972.

[ADH+01] G. Aigner, A. Diwan, D. Heine, M. Lam, D. Moore, B. Murphy, and C. Sapuntzakis. An overview of the SUIF2 compiler infrastructure. Computer Systems Laboratory, Stanford University, 2001.

[AF00] A. Appel and A. Felty. A semantic model of types and machine instructions for proof-carrying code. In Proceedings of POPL'00, pages 243-253. ACM Press, January 2000.

[All70] F. Allen. Control flow analysis. SIGPLAN Notices, 5(7):1-19, July 1970.

[Ana01] Anakrino web page, 2001. Retrieved Mar 2007 from http://www.saurik.com/net/exemplar.

[And04] Andromeda decompiler web page, 2004. Retrieved Mar 2007 from http://shulgaaa.at.tut.by.

[Ata92] Atari Games Corp. v. Nintendo, 1992. 975 F.2d 832 (Fd. Cir. 1992).

[Bal98] T. Ball. Reverse engineering the twelve days of christmas, 23rd December 1998. Retrieved June 2006 from http://research.microsoft.com/~tball/papers/XmasGift/final.html.
id=1117527263.
[Boo02] Boomerang web page. BSD licensed software, 2002. Retrieved May 2005 from http://boomerang.sourceforge.net.

What You See Is Not What You eXecute. In Proceedings of Verified Software: Theories, Tools, Experiments.
1993.

www.program-transformation.org/Transform/BinaryTranslation.
[Cap98] G. Caprino. REC - Reverse Engineering Compiler. Binaries free for any use, 1998.

[CCL+96] F. Chow, S. Chan, S. Liu, R. Lo, and M. Streich. Effective representation of aliases and indirect memory operations in SSA form. In Proceedings of the International Conference on Compiler Construction (CC'96), 1996.

[CDC+04] R. Chowdhury, P. Djeu, B. Cahoon, J. Burrill, and K. McKinley. The limits of alias analysis for scalar optimizations. In Proceedings of the International Conference on Compiler Construction (CC 2004), 2004.

[CFR+91] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, October 1991.

[Cha82] G. Chaitin. Register allocation and spilling via graph coloring. SIGPLAN Notices, 17(6):98-105, June 1982.

[Cif94] C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, 1994. Also available Apr 2007 from http://www.itee.uq.edu.au/~cristina/dcc/decompilation_thesis.ps.gz.
[Cif96] C. Cifuentes. The dcc decompiler. GPL licensed software, 1996. Retrieved August 2003.

and Modelling of Computer Systems, volume 22. ACM Press, May 1994.

[CVE01] C. Cifuentes and M. Van Emmerik. Recovery of jump table case statements from binary code. Science of Computer Programming, 40(2-3):171-188, July 2001.

[CVEU+99] C. Cifuentes, M. Van Emmerik, D. Ung, D. Simon, and T. Waddington. Preliminary experiences with the use of the UQBT binary translation framework. In Proceedings of the Workshop on Binary Translation, Newport Beach, 16th Oct 1999, pages 12-22. Technical Committee on Computer Architecture.
[Dat98] DataRescue. IDA Pro, 1998. Retrieved Jan 2003 from http://www.datarescue.com/idabase.

[Dec01] DeCompilation wiki page, 2001. Retrieved May 2005 from http://www.program-transformation.org/Transform/DeCompilation.

[Dot04] Dot4, Inc. Assembler to "C" Software Migration Tool, 2004.

[DT04] Decompiler Technologies web page, 2004. Retrieved Mar 2007 from http://www.decompiler.org.

[Ega98] G. Egan. Short story The Planck Dive, February 1998.

[Eri02] D. Eriksson. Desquirr web page, 2002. Retrieved Jul 2005 from http://desquirr.sourceforge.net/desquirr.

[EST01] Essential systems technical consulting, 2001. Retrieved Feb 2003 from http://www.essential-systems.com/resource/index.htm.
IEEE-CS Press, 1999. Only the abstract is in printed form; full text published at http://doi.ieeecomputersociety.org/10.1109/ICSM.1999.10009.

pages 37-70. Prospect Media Pty, Sydney, Australia, 2nd edition, February

[FLI00] IDA Pro FLIRT web page, 2000. Retrieved Mar 2007 from http://www.datarescue.com/idabase/flirt.htm.
1974.
[FSF01] Free Software Foundation, Boston, USA. GNU Binutils, 2001.

[FZ91] C. Fuan and L. Zongtian. C function recognition technique and its implementation.

[Gab00] H. Gabow. Path-based depth-first search for strong and biconnected components. Information Processing Letters, 74(3-4):107-114, 2000.

[GBT+05] B. Guo, M. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. August. Practical and accurate low-level pointer analysis. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'05), 2005.

2005. 545 U.S. 125 S.Ct. 2764, 2770 (2005), retrieved Nov 2005 from http://news.bbc.co.uk/1/shared/bsp/hi/pdfs/supreme_court_mgm_grokster_27_06_05.pdf.
[Gui07a] I. Guilfanov. Blog: Decompilation gets real, April 2007. Retrieved Apr 2007.

[Gui07b] I. Guilfanov. Hex-rays home page, 2007. Retrieved Aug 2007 from http://www.hex-rays.com.

[GY04] J. Gross and J. Yellen, editors. Handbook of Graph Theory, chapter 10. CRC Press, 2004.

program-transformation.org/Transform/NeliacDecompiler.

Retrieved Aug 2007 from http://billharlan.com/pub/papers/A_Tirade_Against_the_Cult_of_Performance.html.

[HM05] L. Harris and B. Miller. Practical analysis of stripped binary code. ACM SIGARCH Computer Architecture News, 33(5):63-68, 2005.

[Ioc88] The International Obfuscated C Code Contest web page, 1988.

Transactions Act (UCITA) by the states, 2000. Retrieved Jan 2003 from http://www.ieeeusa.org/forum/POSITIONS/ucita.html.
[Jan02] André Janz. Experimente mit einem Decompiler im Hinblick auf die forensische Informatik, 2002. Retrieved Aug 2005 from http://agn-www.informatik.uni-hamburg.de/papers/doc/diparb_andre_janz.pdf.

[Jar04] J. Jarrett, 2004. tagline, Electric Vehicle Discussion List, retrieved June 2004.

[Jug05] JuggerSoft, 2005. (Also known as SST Global and Source Recovery.)

[KCL+99] R. Kennedy, S. Chan, S. Liu, R. Lo, P. Tu, and F. Chow. Partial redundancy elimination in SSA form. ACM Transactions on Programming Languages and Systems, 21(3):627-676, 1999.

29(1-2):15-44, 2003.
1969.
[Kou99] P. Kouznetsov. JAD - the fast JAva Decompiler, 1999. Retrieved Jan 2003 from http://kpdus.tripod.com/jad.html.

[KU76] J. Kam and J. Ullman. Global data flow analysis and iterative algorithms. Journal of the ACM, 23(1):158-171, January 1976.

[Kum01a] K. Kumar. JReversePro - Java decompiler, 2001. Retrieved Jan 2003 from http://jrevpro.sourceforge.net.

[Kum01b] S. Kumar. DisC - decompiler for TurboC, 2001. Retrieved Feb 2003 from http://www.debugmode.com/dcompile/disc.htm.

[Lam00] M. Lam. Overview of the SUIF system, 2000. Presentation from PLDI 2000, retrieved Apr 2007 from http://suif.stanford.edu/suif/suif2/doc-2.2.0-4/tutorial/suif-intro.ps.

[MDW01] R. Muth, S. Debray, and S. Watterson. alto: A link-time optimizer for the Compaq Alpha. Software Practice and Experience, 31(1):67-101, January 2001.

LNCS 2304.
[Mic97] MicroAPL. MicroAPL: Porting tools and services, 1997. Retrieved Feb 2003.

[MLt02] MLton web page. BSD-style licensed software, 2002. Retrieved Jun 2006 from http://www.mlton.org.

155558179X.

Workshop Presentation, August 2007. Retrieved Aug 2007 from https://connect.microsoft.com/Downloads/Downloads.aspx?SiteID=214.

[MS04] Microsoft Corporation. Phoenix Home Page, 2004. Retrieved Aug 2007 from https://connect.microsoft.com/Phoenix.

publications/techreports/#report2006-2.
http://people.redhat.com/dnovillo/pub/mem-ssa.pdf.
[PR94] H. Pande and B. Ryder. Static type determination and aliasing for C++. (OOPSLA).

1999. Retrieved Feb 2003 from http://www.citi.qut.edu.au/research/plas/projects/cp_files/cpjvm.html.

[Reu88] J. Reuter, 1988. Public domain software; retrieved Feb 2003 from ftp://ftp.cs.washington.edu/pub/decomp.tar.Z.

[Ric53] H. Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74:358-366, 1953.

[Rob02] J. Roberts. Answer to forum topic "How do I translate EXE -> ASM", 2002. Retrieved from http://www.datarescue.com/ubb/ultimatebb.php?ubb=get_topic;f=1;t=000429.
[Roe01] L. Roeder. Lutz Roeder's programming.net, 2001. Retrieved Sep 2004.

[SA01] K. Swadi and A. Appel. Typed machine language and its semantics, July 2001.

[SBB+00] B. De Sutter, B. De Bus, K. De Bosschere, P. Keyngnaert, and B. Demoen. On the static analysis of indirect control transfers in binaries. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, June 2000.

[Sca01] Scale, a scalable compiler for analytical experiments, 2001. Retrieved Apr 2007.

[Seg92] Sega Entertainment Ltd. v. Accolade, Inc, 1992. 977 F.2d 1510 (9th Cir. 1992).

[SF06] Open64 | The Open Research Compiler, 2006. Retrieved Aug 2007 from http://www.open64.net.

[SFF+05] M. Suzuki, N. Fujinami, T. Fukuoka, T. Watanabe, and I. Nakata. SIMD

Available Apr 2007 from http://www.itee.uq.edu.au/~cristina/students/doug/dougStructuringAsmThesis97.ps.
[Sof96] SofCheck Inc. Applet Magic, 1996. Retrieved Feb 2003 from http://www.appletmagic.com.

[Son84] Sony Corp. of America v. Universal City Studios, 1984. 464 U.S. 417 (1984).

[SR02a] Source Recovery, also known as JuggerSoft and SST Global, 2002. Retrieved from http://www.sourcerecovery.com.

[SR02b] Source Recovery's HP-UX C/C++ Decompiler, 2002. Retrieved Feb 2004 from http://www.sourcecovery.com/abstract.htm.

[SRC96] The Source Recovery Company, 1996. Retrieved July 2004 from http://www.source-recovery.com.
[SST03] SST Global, 2003. (Also known as JuggerSoft and Source Recovery.)

[Sta02] Static recompilers. Yahoo Tech Group, 2002. Retrieved Apr 2007 from http://tech.groups.yahoo.com/group/staticrecompilers.

[SW93] A. Srivastava and D. Wall. A practical system for intermodule code optimization at link-time. Journal of Programming Languages, March 1993. Also available as WRL Research Report 92/06; retrieved Apr 2007.

berlin.de/~tolk/vmlanguages.html.

[UQB01] UQBT web page. BSD licensed software, 2001. Retrieved Apr 2002 from http://www.itee.uq.edu.au/~cristina/uqbt.html.

http://www.datarescue.com/cgi-local/ultimatebb.cgi?ubb=get_topic;f=1;t=000495;p=0#000001.
Press, 1989.
[VWK+03] L. Vinciguerra, L. Wills, N. Kejriwal, P. Martino, and R. Vinciguerra. An experimentation framework for evaluating disassembly and decompilation tools for C++ and Java. In Proceedings of the Working Conference on Reverse Engineering. IEEE-CS Press, 2003.

[War01] M. Ward. The FermaT transformation system, 2001. Retrieved Feb 2003 from http://www.dur.ac.uk/martin.ward/fermat.html.

[Win01] Winelib web page, 2001. Retrieved Mar 2005 from http://winehq.org/site/winelib.
Wesley, 1996.
Wisconsin-Madison, 2001.
Index

a[exp], xl
abbreviations, xxv
ABI, xxviii
acronyms, xxv
add features, 14
addition and subtraction, 166
address space, 187
affine relation, xxviii, 214
affine relations analysis, 51
aggregate, xxviii
aggregate structure identification, 52
Analysis of Virtual Method Invocation for Binary Translation
assembler, xxviii, 45
assembler comprehension, 45
assembly decompilers, 23
assembly language, xxviii, 39, 41, 46
assigned goto, 201
assigned goto statements, 201
AT&T syntax, 64
Atari vs Nintendo, 9
automated tools, 10
automatically generated code, 11
available definitions, 85
binary translation, xxxvii, 28, 48, 154, 158
call by reference, 120
call graph, xxix
call graph, cycles in the, 126
call tail optimisation, 221
callees, xxix
callers, xxix
canonical form, xxix, 67, 179, 181
carry flag, xxix
carrying (in propagation), 117
Chikofsky and Cross 1990, 52
childless call, xxix, 85, 127
Cifuentes et al., 36
Cifuentes, Cristina, iv, 33, 35
CIL, xxx
class hierarchy, recovering, 218
class types, 154
CLI, xxx, 44
COBOL, 45
code checking, 9
code motion, 104
code recovery, 14
code, low level, 39
CodeSurfer, 51
CodeSurfer/x86, 51
COINS, 54
Collberg, 60
collector, xxx, 122, 136
colocated variables, xxx, 151, 187, 189, 247
commercial decompilation, 44
common subexpression elimination, 68
comprehensibility, 52
Computer Security Analysis through Decompilation and High-Level Debugging, 36
condition code register, 70
condition codes, xxx
conditional preservation, xxx
Connector plugin, 51
constant, xxx
constant K, 181, 190
constant propagation, 67
constant reference, 188
constant, type, xl
constants, 63
constants, typing, 161
constraints, type, 42
constructors and destructors, 247
context, xxx
continuation passing style, xxxi, 19
contract, xxxi
control flow analysis, 61
control flow, incomplete, 196
coprocessor, 75
copy propagation, 67
copyright, 29
cross platform, 12
CSE, 68
cycle, 130
cycles (call graph), 126
Cygwin, xxxi
Gagnon et al, 42
Gated Single Assignment (GSA) form, 141
GCC, 56
GENERIC, 56
GIMPLE, 56
Glasscock, Pete, 40
global data flow, 87
global data flow analysis, 87
Global Offset Table, 86
goto statement, assigned, 201
grammar, inverting, 39
GrammaTech Inc, 51
greatest lower bound, 171
Guilfanov, Ilfak, iv, 36, 192

indexing, 180
indirect branches, 198
indirect calls, 37, 208
indirect jumps and calls, 195
induction variable, xxxiii, 145
information loss, 16
infringements, patent, 11
initialisation functions, 215
initialised arrays, 156
Input program, xxxiii
instruction set simulation, 6
interference graph, 107, 108
intermediate representation, 2, 61, 141
internal pointers, 114
interoperability, 9, 13

Katsumata, 41
locations, 63
MicroAPL, 45
ndcc, 36
rhs-clear, 66
RTL, 40
SML, 45
Soot, 42
Sable group, 42
safety critical applications, 11
Sassa, Masataka, 98
Sassaman, W., 39
save/restore problem, the, 119
Scale (compiler infrastructure), 56
signature (of procedure), 158
source code, 3
source code levels, 5
source code recovery, 14
Source Recovery, 45
Source Recovery Company, The, 45
sources of type information, 158
Static Single Assignment, xxxviii, 40, 97
strongly connected component, xxxviii, 129
Sun Microsystems, v
swapping two registers, 120
switch statements, 198
symbol table, 187
symbols, debug, 159

tail call optimisation, 19, 220
target, xxxix
T(e), xl
this parameter, 189
thread, 247
tools, automated, 10
top (lattice), 170
translating out of SSA form, 102, 107
Tröger and Cifuentes, 37, 210, 221
Tröger, Jens, v
type information, sources of, 158

ud-chains, 66
Ultrasystems, 39
undecidable, 90
University of Durham, 45
University of London, 41
University of Oxford, 45
University of Wisconsin-Madison, 51
unreachable code, xxxix, 69
Upton, Eben, iv
UQBT, v, 35, 48, 154
use collectors, 137
use-definition, xxxix, 102
use-definition chains, 66

value dependence graph, 142
virtual method, 37
viruses, 11
VSA, 51
VT, 209
vulnerabilities, finding, 10

Ward, Martin, 41
warnings, 28
WHIRL, 58
whole-program analysis, 87
Wroblewski, 60
WSL, 41

x64, xxxix
x86, xxxix

YaDec, 37

Zebra, 40
Zongtian, L., 35