
Static Analysis of Binary Exe

This document discusses static analysis techniques for binary executables. It begins by explaining the challenges of disassembling binary code and building a control flow graph due to indirect branches and varying instruction sizes. It then discusses how program slicing can be used to reduce assembly code to compute register values and determine possible branch targets. The document also notes that static analysis is commonly framed as a data flow analysis problem and outlines how techniques like constant propagation have been applied to optimize programs and prove properties. Overall, the document surveys how static analysis of binaries can provide insights into program behavior without execution.


Static Analysis of Binary Executables

Steve Hanov
University of Waterloo
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
[email protected]

ABSTRACT
This paper is a survey of the use of static program analysis techniques on binary executables. Static analysis techniques are often used on a program's source code, which is usually written in a high level language. It is possible to apply them directly on the machine code of a compiled program. One of the challenges is building up a control flow graph of a procedure, since indirect branch instructions accept the contents of a register for the destination address. Program slicing techniques can be used to reduce the assembly code to the smallest possible program to compute the value of that register, and determine the range of values in the register.

Another problem is disassembly itself. On architectures with instructions of varying size, it is difficult to locate the start of the first machine code instruction in a section consisting of both code and data. Also, malicious code could take advantage of the difficulties in disassembly to hide its existence. Various static analysis techniques have been developed to analyze such programs, in order to build up a control flow graph and a call graph. Finally, type-state techniques have been developed to verify that machine code conforms to its interface, and does not alter areas of memory which it should not.

1. INTRODUCTION
A competent programmer can examine a program's source code and get a pretty good idea about what it does. Granted, he is aided by meaningful variable and procedure names. Even if these are stripped away, it is possible to determine facts about a program. Countless times in all parts of the world, some programmer has painstakingly checked over a program to find locking errors or memory that has not been freed, without first having to run the program in his head. It is logical, then, that a computer can follow the same processes and make the same deductions about program behaviour, more accurately, and also without running the program. This is the realm of static program analysis.

Many static program analysis techniques are reduced to a data flow analysis problem (DFLAP) [8]. With a DFLAP, the program's data is modeled as a mathematical set in some way, for example, a mapping from variable names to values. There is an instance of this set at each program point, summarizing all the information that can be deduced. The program's instructions are reduced to functions that operate on the set (for example, adding, modifying, or removing values). During the analysis, the functions are executed one at a time, each one feeding its output into the next. A program may contain loops, so the process is repeated until nothing changes. When that happens, we are left with a fixed point. Depending on the functions used, we can deduce information about the program by looking at the resulting sets at each program point. The seminal example of this process is constant propagation, where we can prove that a variable always has the same value at a particular point, and thus eliminate it from the program entirely.

The usefulness of static program analysis to program optimization is easy to see. It can be used to eliminate array bounds checks, for example. Outside of program optimization, it can be used to modify the source code to make it safer, for example by adding type constraints to Java classes [13]. Static analysis techniques have been applied to the problem of program verification [6], and to detect buffer overflows in C [5]. However, all of these static analysis techniques operate on a high level language.

Computers do not directly execute code written in a high level language. Compilers transform the code into assembly language instructions, and then finally into the machine code compatible with the target processor. Since the machine code is what is actually executed, there is value in applying static analysis techniques to a binary executable.

The analysis of binary executables is closely related to research into reverse engineering. A goal of reverse engineering is decompiling, or reconstructing the source text from the compiled machine code. This usually involves building a control flow graph for each procedure, and looking for idioms in the code. An idiom is a sequence of instructions that, individually, do not have meaning, but when taken together form a larger operation.

In this paper, we will examine how static analysis techniques have been applied to binary executables. We will look at what problems static analysis can solve, and also the challenges, and what static analysis techniques cannot solve.

Figure 1: The PE format, used by Microsoft Windows executables.

2. DISASSEMBLY OF BINARY EXECUTABLES
An example of a binary executable file format is shown in Figure 1. In modern executable formats, the binary is split into several sections, which neatly separate the code from the data. However, it is not necessary to make this separation. As long as it is never accidentally executed, the data may be placed at any location within the .text section.

Such neat separation in program executables is not always the case. The .com file format, used in MS-DOS, consists of data and executable instructions intermixed. The file is loaded at a fixed address, known in advance. Execution begins at the first byte in the file.

The first step in the static analysis must be the conversion of the machine code into assembly language instructions. However, this task itself is challenging, and has been the subject of much research. With Intel instructions [1], the opcode of the instruction can be between 1 and 3 bytes, and this can be followed by 1 to 8 bytes of immediate data and a scaling/offset byte (see Figure 2). The instructions are not required to follow any alignment rules in memory.

Figure 2: Intel Instruction format. Each opcode is of varying length.

Because it is possible for control flow to jump around the program, it is difficult for a disassembler to find the alignment that yields the correct instructions. The disassembler must follow the path of execution in the same way as the processor, in order to skip around any data bytes in between the instructions. This may be as simple as performing a linear scan through the instructions, or as complex as interpreting the code to follow its path of execution.

3. PROGRAM SLICING
One of the earliest applications of static analysis techniques to binary executables is due to Cifuentes [4]. Cifuentes was researching the decompilation of programs and needed to build the control flow graph. In order to determine the destination of an indirect branch instruction (jumping to the address stored in a register) she uses program slicing techniques. We first follow the development of program slicing techniques, and then show how they are applied to binary executables.

Program slicing was first developed in 1981 by Weiser [14], who noticed that when trying to understand programs, experienced programmers would create what he called a slice in their conceptual model of how the program worked. The program slice is an executable program, but with certain statements deleted. The resulting program contains only statements of interest to the problem at hand.

1: read(x,y);
2: total = 0;
3: sum = 0;
4: if (x <= 1) {
5: sum = y;
} else {
6: read(x);
7: total = x * y;
}
8: print total, sum;

Figure 3: Original program, adapted from Weiser.

Slice on the value of total at the end of the program.
1: read(x,y);
2: total = 0;
4: if (x <= 1) {
} else {
7: total = x * y;
}

Weiser defines a slicing criterion C(i, v), where i is the statement at which to observe, and v is the set of variable names to be observed. He defines a monotone data flow analysis framework:

$$R_{in}(n) = \begin{cases} R_{out}(n) - DEF(n) \cup REF(n) & \text{if } n \neq i \\ R_{out}(n) - DEF(n) \cup REF(n) \cup v & \text{if } n = i \end{cases}$$

$$R_{out}(n) = \bigcup_{m \in SUCC(n)} R_{in}(m)$$

where m, n are statements, $R_{in}$ and $R_{out}$ are the sets in the framework, DEF(n) is the set of variables defined at n, and REF(n) is the set of variables referenced at statement n. The data flow step will remove all variables referenced or defined in a statement, unless a) they are referenced by the statement of interest, or b) a successor references them. When the analysis terminates, only variables which are referenced in the slicing criterion will remain in the sets. All other program statements may be removed, as long as care is taken not to violate the syntax of the language.

In the ensuing years, new ways of representing programs allowed Susan Horwitz [7] to extend Weiser's method interprocedurally. Horwitz describes two graphs: the Program Dependence Graph (PDG), which represents a single procedure, and the System Dependence Graph (SDG), which
connects procedures together.

To construct the program dependence graph,

• Create an entry node. For every variable v, create a node labeled FinalUse(v).
• Create a node for each program statement.
• Connect the entry vertex to all other vertexes which are not in a loop or conditional. These edges are control dependence edges.
• Connect control flow statements (such as if constructs) to their immediately nested contents, using control dependence edges.
• Connect together all nodes v1 and v2 such that v1 defines a variable x, and there is a path through the program such that v2 uses that same definition of x. This edge is a data dependence edge.

In Figure 4, we have applied the above procedure to the program in Figure 3. The result is the program dependence graph. The intraprocedural slice can then be found with a very simple algorithm given by Horwitz:

procedure MarkVerticesOfSlice(G, S)
declare
  G: a program dependence graph
  S: a set of vertices in G
  WorkList: a set of vertices in G
  v, w: vertices in G
begin
  WorkList := S
  while WorkList ≠ {} do
    Select and remove vertex v from WorkList
    Mark v
    for each unmarked vertex w such that there is an edge w → v in G, do
      Insert w into WorkList
    od
  od
end

Figure 4: The program dependence graph, using Horwitz's algorithm on the example of Weiser in Figure 3.

This procedure begins at the nodes of interest, and works backwards to find all possible paths to those nodes from the entry point. By performing this procedure on one of the FinalUse(v) nodes, we can create a program slice for any variable we please.

Horwitz also goes into detail on extending the method interprocedurally, using the System Dependence Graph to create function summaries. However, since this area has not been applied to binaries, we omit it from this survey.

Thus far, the slicing method does have a problem: it handles only structured programs. If a goto or jump statement is inserted, the results are incorrect. Agrawal [2] proposes a method of handling them.

1. First, he constructs the program slice using the conventional algorithm, which excludes jump statements.
2. Then, he determines which jump statements to add back in. The criteria for this are explained below.
3. Finally, if the slice contains a jump, and the destination isn't in the slice, the target is pushed forward to the next executable statement.

Agrawal's algorithm depends on the control flow graph for the procedure. From the control flow graph, he derives two other types of graphs: the post dominator tree, and the lexical successor tree.

Figure 5 shows an example program containing unstructured goto statements. Figure 6 is its control flow graph. From the control flow graph, the program dependence graph in Figure 7 is calculated using the algorithm given by Horwitz. By examining the program dependence graph, we can follow the edges up from the node write(y). We see that the slice consists only of the nodes start, y = ..., and write(y). This is clearly not an accurate slice.

1: if ( C1 ) {
2: goto L6;
3: y = ...;
4: goto L8;
}
5: z = ...;
6: L6: x = ...;
7: goto L3;
8: L8: write(x);
9: write(y);
10: write(z);

Figure 5: An unstructured program, from Agrawal.

To add the jump statements, we first require the post dominator tree and the lexical successor tree. According to [?], a node p post dominates node i if every possible execution path from i to exit includes p. In other words, if i executes, p must later execute. The tree is shown in Figure 8. The lexical successor is simply the next statement in the program code, as written. In a compound statement, such as if or while, the successor is considered to be the statement after its body.
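The post dominator definition above can be sketched as a small fixed-point computation. This is a minimal illustration in Python, not Agrawal's implementation. The successor map `FIGURE5_SUCC` is an assumption: it is our own reading of the control flow of the program in Figure 5 (with `goto L3` taken to mean statement 3), since the figure showing the graph is not reproduced here. The function names are hypothetical.

```python
def post_dominators(succ, exit_node):
    """Iterate pdom(n) = {n} | intersection of pdom(s) over successors s."""
    nodes = set(succ)
    # Start from "everything post dominates everything" and shrink to a fixed point.
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominators(pdom):
    """Pick, for each node, the nearest strict post dominator (its tree parent)."""
    ipdom = {}
    for n, doms in pdom.items():
        strict = doms - {n}
        for p in strict:
            # Strict post dominators form a chain; the nearest one is
            # post dominated by exactly the members of that chain.
            if len(pdom[p]) == len(strict):
                ipdom[n] = p
    return ipdom


# Assumed control flow of Figure 5 (statement numbers as in the listing).
FIGURE5_SUCC = {
    1: [2, 5], 2: [6], 3: [4], 4: [8], 5: [6],
    6: [7], 7: [3], 8: [9], 9: [10], 10: ["exit"], "exit": [],
}

tree = immediate_post_dominators(post_dominators(FIGURE5_SUCC, "exit"))
for node in sorted(n for n in tree):
    print(node, "->", tree[node])
```

Each printed pair is an edge of the post dominator tree, pointing from a statement to its parent. Under the assumed control flow, both branches of the if statement meet again at statement 6, so statement 6 is the parent of statements 1, 2 and 5.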

We traverse the post dominator tree using a preorder traversal. For each jump statement encountered that is not already in the slice, and whose nearest post dominator in the slice is different from its nearest lexical successor in the slice, we add the jump and the transitive closure of its dependencies.

In our example, we add the goto statements in lines 2, 4, and 7, and the if statement in line 1, since line 2 is dependent upon it.

1: if ( C1 ) {
2: goto L6;
3: y = ...;
4: goto L8;
}
7: goto L3;
9: write(y);

Figure 6: The control flow graph for the unstructured program in Figure 5.

Figure 7: The program dependence graph for the program in Figure 5. This is the result of applying the algorithm of Horwitz to the example of Agrawal. For simplicity, the FinalUse() nodes have been omitted.

4. DISASSEMBLY USING STATIC SLICING
Cifuentes's 1997 paper [4] views the main problem of disassembly as the separation of instructions from data. The instructions can jump around the data, so it is not obvious to a disassembler which is which. A naive recursive disassembler would not be able to work when it encountered indexed jump instructions, or an indirect procedure call to the contents of a register.

The conventional approach is to compare the code to that output by a number of different compilers, in the hopes that the indirect jump is part of some larger idiom, such as a switch statement. However, this requires keeping a database of compiler data, and programs written in assembler would not be deciphered.

During disassembly, one could perform a reaching definitions analysis on the assembly code to determine the destination of an indirect branch instruction. However, more sophistication is needed to obtain a complete result. Cifuentes suggests using a static slice of the program to determine the register contents.

Cifuentes limits herself to the intraprocedural case, so she does not use the interprocedural system dependence graph of Horwitz [7]. However, the handling of the jump instruction due to Agrawal [2] is essential to the analysis, because machine code is unstructured.
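The idea of slicing on a register can be illustrated with a deliberately small sketch. It is not Cifuentes's algorithm: it assumes a hypothetical three-field instruction form `(destination, operation, arguments)`, handles only straight-line code with no jumps, memory or aliasing, and uses constants throughout so that the slice can simply be evaluated. In real code the slice would usually yield a range of values rather than a single one.

```python
def backward_slice(program, target_reg):
    """Return indices of the instructions that the value of target_reg depends on."""
    relevant = {target_reg}
    kept = []
    for index in range(len(program) - 1, -1, -1):
        dst, _op, args = program[index]
        if dst in relevant:
            kept.append(index)
            # This definition satisfies dst; its register operands become relevant.
            relevant.discard(dst)
            relevant.update(a for a in args if isinstance(a, str))
    return sorted(kept)


def evaluate(program, indices):
    """Execute only the sliced instructions and return the register values."""
    regs = {}
    for index in indices:
        dst, op, args = program[index]
        vals = [regs[a] if isinstance(a, str) else a for a in args]
        if op == "mov":
            regs[dst] = vals[0]
        elif op == "add":
            regs[dst] = vals[0] + vals[1]
        elif op == "shl":
            regs[dst] = vals[0] << vals[1]
    return regs


# Toy code preceding a hypothetical "jmp bx".
PROGRAM = [
    ("ax", "mov", [3]),
    ("cx", "mov", [7]),             # unrelated to bx
    ("bx", "mov", ["ax"]),
    ("bx", "shl", ["bx", 1]),
    ("dx", "add", ["cx", 1]),       # unrelated to bx
    ("bx", "add", ["bx", 0x1000]),
]

indices = backward_slice(PROGRAM, "bx")
print(indices)                                # [0, 2, 3, 5]
print(hex(evaluate(PROGRAM, indices)["bx"]))  # 0x1006
```

The two instructions that touch only cx and dx drop out of the slice, leaving the smallest program that computes the branch register.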

One of the challenges faced by Cifuentes, and other researchers such as [9], is the complexity of the CISC Intel instruction set. Cifuentes deals with this complexity in two ways. First, the scope of the research is limited to the Intel 286 instruction set, although Pentium was available at the time. Secondly, the form of some of the instructions is changed. The DIV instruction, for example, stores its result and remainder in two different registers. For this purpose, the researchers convert the statement into three separate statements which use a fictional tmp register for its intermediate result. Such tricks mean the algorithm only needs to handle 110 instructions instead of the full set of 250 Intel 80286 instructions.
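The DIV rewriting can be pictured with a small sketch. The exact three statements are not given above, so the expansion below is an assumption, based on the documented behaviour of the 16-bit DIV instruction: the dividend is the register pair DX:AX, the quotient is stored in AX and the remainder in DX. The tuple format and the function names are hypothetical.

```python
def expand_div(operand):
    """Rewrite 'div operand' as three statements that each define one register."""
    return [
        ("tmp", "concat", ("dx", "ax")),  # tmp = dx:ax (32-bit dividend)
        ("ax", "div", ("tmp", operand)),  # ax = tmp / operand
        ("dx", "mod", ("tmp", operand)),  # dx = tmp % operand
    ]


def run(statements, regs):
    """Interpret the expanded statements on a dictionary of register values."""
    regs = dict(regs)
    for dst, op, (a, b) in statements:
        x, y = regs[a], regs[b]
        if op == "concat":
            regs[dst] = (x << 16) | y
        elif op == "div":
            regs[dst] = x // y
        elif op == "mod":
            regs[dst] = x % y
    return regs


# 0x00010000 divided by 3: quotient 21845, remainder 1.
result = run(expand_div("cx"), {"dx": 1, "ax": 0, "cx": 3})
print(result["ax"], result["dx"])  # 21845 1
```

With one definition per statement, a slicing pass can treat the quotient and the remainder as separate data dependences instead of special-casing an instruction that writes two registers.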

Cifuentes et al. were able to apply the static slicing algorithm to short snippets of assembly code without difficulty. However, they ignore the problem of stack variables and aliasing entirely, stating that it is beyond the scope of the analysis since they are only concerned with the intraprocedural case.

It is possible to perform an interprocedural analysis as well. Most research is done in the context of deciphering obfuscated binaries. Therefore, we will first give an example of obfuscation techniques, followed by ways of countering them.

Figure 8: The post dominator tree for the program in Figure 5.

5. OBFUSCATION TECHNIQUES
In 2003, Linn and Debray [10] proposed a number of ways to improve the resistance of an executable to static disassembly. There are two types of disassemblers. A linear sweep disassembler begins at the program start address and disassembles each instruction encountered until it gets to the end address. The best example of a linear sweep disassembler is objdump, part of the GNU binutils package.
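A linear sweep is easy to sketch on a toy instruction set. The encoding below is entirely hypothetical (four opcodes with lengths of one to three bytes) and has nothing to do with real Intel encodings or with how objdump is implemented. It only shows how a sweep that never follows control flow decodes whatever bytes come next, including data that execution would skip.

```python
# Hypothetical variable-length instruction set: opcode -> (mnemonic, total length).
LENGTHS = {
    0x01: ("nop", 1),
    0x02: ("load", 2),   # load imm8
    0x03: ("jmp", 2),    # jmp rel8
    0x04: ("mov", 3),    # mov imm16
}


def linear_sweep(code):
    """Decode from the first byte to the last, never following control flow."""
    offset, listing = 0, []
    while offset < len(code):
        name, length = LENGTHS.get(code[offset], (".byte", 1))
        if offset + length > len(code):
            name, length = ".byte", 1  # truncated instruction: emit as data
        listing.append((offset, name))
        offset += length
    return listing


# jmp +1 skips one data byte (0x04); execution continues with nop, load 0x2A.
CODE = bytes([0x03, 0x01, 0x04, 0x01, 0x02, 0x2A])
print(linear_sweep(CODE))  # [(0, 'jmp'), (2, 'mov'), (5, '.byte')]
```

At run time the instructions at offsets 3 and 4 would execute, but the sweep decodes the skipped data byte as a three-byte mov and never reports them.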

To thwart a linear sweep disassembler, one merely has to insert junk bytes in areas where they will never be executed, but will be seen by the disassembler. For example, junk bytes may be inserted after an unconditional jump statement, or before jump targets.

Recursive disassemblers work differently. They begin at the program's entry point and disassemble instructions until they reach a branch instruction. A branch has two targets: if a condition is true, it will jump to another location. If the condition is false, execution will fall through. The disassembler will continue from both of these paths recursively, and thus eventually reach all possible statements.

To defeat a recursive disassembler, Linn and Debray propose two methods. Taking advantage of the fact that a recursive disassembler follows both branch targets, they modify the program to include branch instructions that are always true or always false. When executed, the program will always take the same path. However, a disassembler will follow both paths, including the incorrect one, and become confused. This technique is called opaque predicates. The suitability of opaque predicates is questioned by [9], due to the difficulty in creating ones complex enough to fool a smart disassembler.

A second technique for thwarting a recursive disassembler is the use of branch functions. An unobfuscated program contains many call instructions to different procedures. In the obfuscation step, all of these calls are replaced with calls to a single branch function. The branch function takes the return address, and using a perfect hash function stored in the data section, determines the correct address that was intended to be called. In order to determine the call graph, a disassembler would first have to work out the inner workings of the branch function. As an additional obfuscation, the caller may pass a parameter to the branch function which is an offset. When the function returns, it will not return to the next instruction after the call, which is the usual case. Instead, it will first apply the offset and return to the new address.

In their paper, Linn and Debray present a tool that processes a binary executable and obfuscates it. It is meant to be used for protecting commercially deployed products, so it includes a profiler component so that obfuscations are inserted only in places where they would not severely affect performance. It is on this obfuscation tool that our next paper focuses its attack.

5.1 Disassembling Obfuscated Code
As the techniques of obfuscating code have advanced, so have the techniques of disassembling such code. In [9], Kruegel attempts to create a tool to combat Linn and Debray's obfuscator. His disassembler makes several assumptions. First, valid instructions must not overlap. It is extremely difficult to design a sequence of instructions that has meaning when executed at an offset to itself. The self-correcting property of Intel instructions works against this. Secondly, the algorithm assumes that opaque predicates do not exist. Although Linn and Debray point them out as a possible obfuscation, they do not actually use them. If they did, they would be easily detectable by the disassembler, Kruegel argues.

First, the start of each procedure in the binary must be found. To this end, the algorithm searches for function prologues (the first few instructions compilers use to prepare the stack upon entering a new procedure). This method is imprecise, but correct for a well-behaving compiler. The algorithm will also detect function entry points where none exist, because the data in the program or parts of other instructions may also look like a function prologue. These spurious parts will be discarded at a later step.

For the intraprocedural disassembly, Kruegel uses a statistical technique to build the control flow graph (CFG). Each procedure is analyzed as follows.

1. The procedure is disassembled using all possible alignments of instructions. At this point, certain alignments may already be discarded due to illegal instructions. Contiguous blocks of statements ending at a branching instruction become nodes in the CFG. Note that the CFG may contain blocks that overlap in memory. For now, they are allowed.
2. For each branch statement, the source and destination nodes are connected. The destination node may be split, if the destination address is not at the start of the node.
3. At this point in the algorithm, the CFG is a superset of the real control flow graph of the procedure, because some of the blocks are overlapping. In the next step, Kruegel resolves these conflicts in a number of ways. He assumes that real basic blocks are more tightly integrated into the control flow than spurious ones. If two basic blocks conflict, then he chooses the one that is more tightly connected to others.
4. If there are remaining conflicting basic blocks, Kruegel randomly eliminates them and leaves other results up to future work.

At the end of the procedure, Kruegel's algorithm has disassembled at least the entire program, if indirect branch functions have not been used. Recall that Linn and Debray's branch function takes as input the return address with which it was called. It submits this key to a perfect hash table, stored in the data section of the binary. The hash result is 1) the address of the function to call, and 2) a new address to which to return (to confuse disassemblers that continue after the call instruction). Kruegel suggests interpreting the instructions of the branch function in order to calculate the two results statically. This, of course, requires some knowledge of what the branch function does, and which function is the branch function. In a call graph, the identity of the branch function should be obvious: it will be the only function ever called. As for what it does, it is difficult to come up with a branch function that is complex enough to evade abstract interpretation, yet yields consistent results.

5.2 Bigram Analysis of Assembly Code
Instead of randomly choosing basic blocks to resolve conflicts, the authors in [9] suggest using another statistical technique: bigram analysis. Bigrams are widely used in natural language processing, for finding the best parse in a probabilistic context free grammar [11]. In this analysis, for each pair of assembly language statements, $s_1$ and $s_2$, the probability that $s_2$ follows $s_1$ is stored in a table. This table can be derived by training the model, e.g. by counting the frequency of co-occurring statement pairs in a large set of programs. After training, the table is normalized to obtain the probabilities. Then, the probability of an arbitrary sequence of instructions $s_1, s_2, s_3, \ldots, s_n$ occurring can be calculated as:

$$P(s_1, s_2) \times P(s_2, s_3) \times \cdots \times P(s_{n-1}, s_n)$$

Bigram analysis is applied to Java bytecode sequences in [12], a very short paper. The main result is that in the test benchmarks, usually only about 10% of the possible pairs of bytecode sequences are used. Because this number is so low, one can calculate with a high degree of accuracy the likelihood that a given sequence of bytes is a real Java program. If the result holds for Intel assembly language, this alone may help reveal correct alignments for disassembly. However, there has not been any work in this area. Also, an obfuscator that is aware of bigram analysis may attempt to further trick the disassembler by transforming the instructions into an equivalent program with improbable statements.

In the next section, we examine such transformation techniques and how to detect them.

6. VIRUS DETECTION
Christodorescu's 2003 paper [3] applies static analysis techniques to the area of virus detection. Polymorphic viruses include self-modifying code, so that they are changed each time they spread themselves. They always have two parts: a section of encrypted code, and the decryptor. The encrypted part is easy to change by generating a new key. The decryptor has the task of modifying itself so that the new code is semantically equivalent to the old, yet any signatures that a virus scanner could look for would not match.

Christodorescu identifies several types of obfuscation techniques that are used by polymorphic viruses for the decrypting stub. The first, which worked against all commercial virus utilities at the time, was inserting nop instructions into the program text. Another technique is to rearrange the instructions. There may be two or more instructions whose order does not matter, so they can be safely transposed. Likewise, the virus may perform instruction substitution, so that one instruction is replaced by several that perform an equivalent function. A particularly insidious version is to weave the instructions into a host program, so that as the host is executing, it also executes the virus.

Christodorescu creates an analysis tool to detect the polymorphic techniques that he identified. The detector operates in three phases. First, the code is disassembled. Secondly, it is annotated, and thirdly, it is run through a detection module. The paper does not discuss disassembly, and presumes that the malicious programs have not made any attempt to confuse the disassembler.

To understand the goals of the method, it is best to first understand the detector portion. For each virus, Christodorescu manually builds a malicious code automaton. This automaton is a sequence of abstract statements, free of individual registers or machine code instructions. For example, a portion of the Chernobyl virus is reduced to:

Move(A,b)
Move(C,0d601h)
Pop(D)
Pop(B)

To detect the malicious sequence, the assembly code must be converted to a similar high level abstraction. Once the assembly code is in this form, all that remains to be done is to decide whether any path through the target program matches the path through the malicious code automaton. Christodorescu does this by viewing the control flow graph as a finite state machine (FSM) that produces a regular language. The FSM may include obfuscations that cause it to double back and execute other optional paths. However, the objective is to determine if the languages produced by the two automatons share any common elements. If so, then at least one path through the program is the virus.

An additional complexity is that the detector must check for all permutations of the variables used. For this purpose, he borrows the concept of unification from the field of automated theorem proving. The Unify() function returns false if there is no binding of free variables that can be applied to the same two instruction sequences.

7. TYPE-STATE CHECKING OF MACHINE CODE
A data flow analysis can determine the range of values of a particular register at a program point, and so it can determine if, for example, a memory access is reading from the NULL pointer. However, it cannot answer questions such as, "is a particular variable always initialized before being read from?"

Type-state analysis is designed to answer this type of question. With a type-state analysis, one can assign rules to variables in a program. For example, in a higher level language, one can create a rule that lock() may never be called on a lock object that has already been locked. This may be verified at compile time. We will see how type-state analysis can be applied to machine code as well.

In Xu's Ph.D. thesis [15], he is concerned with verifying the safety of machine code. Most often, safety is determined through code signing. However, users often do not have the time or inclination to verify that they trust the signing authority. In addition, unsafe code may be signed. Xu's method is best suited for a binary plug-in distributed to a host program. The host program will be able to statically check for array bounds accesses, null pointer dereferences, and unaligned memory accesses before executing untrusted code.

The analysis operates in 5 phases: preparation, type-state propagation, annotation, local verification, and global verification. Xu assumes that the host program communicates
with the plug-in using a well defined interface. For example, consider a function call exported by a dynamic library. The function call takes as arguments an array, a pointer, and a length.

void foo( int array[], int length,
          const InfoStruct_t *readStruct );

The host program would like to verify that foo() does not write beyond the bounds of the array. In addition, it would like to verify that it does not write to the read-only readStruct structure. Xu's algorithm can verify these rules, if they are encoded in an access policy.

An access policy is a set of 3-tuples. Each tuple consists of a region, category, and the type of access permitted. A region can be a range of addresses, or a single variable name. The category is a set of types, and the access field is, for example, a combination of read, write, followable, or executable flags. Together with the access policy, the host must provide the type-state for each variable and the invocation. In our example, here is the type-state:

Type-state
e: <int, uninitialized, rw>
array: <int [n], {e}, rw>
readStruct: <InfoStruct_t, initialized, r>

Access Policy
< e : int : rw >
< array : int [n] : rw >
< readStruct : InfoStruct_t : r >

Invocation
%o0 ← array
%o1 ← n
%o2 ← readStruct

The host type-state is the initial type-state of each variable. The access policy controls how the variables may be used. The invocation tells the algorithm what the initial state of

preconditions and assertions for each program statement. For instance, if it determines that a variable is being used to index an array, it adds a safety precondition that the variable's value must lie within the bounds of the array. The two types of preconditions that it creates are local preconditions, which can be verified using typestate information alone, and global preconditions, which may require further analysis to verify. For example, a global precondition may require the calculation of the loop invariants to check.

• In the Verification step, the preconditions for each statement are checked. A local precondition may be, for example, that "e is initialized" and this can be checked immediately using the typestate calculated in the propagation step. However, a global precondition, such as an array bounds check, may require a range analysis on the values of the registers.

If all of the preconditions hold for the program, then it is deemed safe to execute.

Some of the limitations of the algorithm are its scalability and precision. The propagation step uses a flow sensitive interprocedural analysis which is quite slow [15]. Also, array elements are collapsed into a single element, so it is not possible to verify programs that are allowed to write to only certain portions of an array without creating a separate access policy for each element.

8. CONCLUSION
We have presented a survey of the use of static analysis techniques in binary executables. The techniques have been presented in the context of reverse engineering, disassembly, virus detection, and safety analysis. It is most striking not how far the research has advanced, but how far it must still go to do such basic tasks as disassemble a program. It is surprising that a virus can be constructed, and using simple techniques, can disguise its source code from the best known disassembly techniques, such that it must be run in order to reveal its secrets. Researchers in program obfuscation and static analysis will perhaps fight a never ending battle with
each other.
the registers are. In this example, SPARC assembly is used,
so the function parameters are passed in the registers.
9. REFERENCES
[1] Intel 64 and IA-32 Architectures Software Developer’s
• In the preparation stage, the algorithm takes the type- Manuals.
state, access policy, and invocation, applies the type- [2] H. Agrawal. On slicing programs with jump
state to the invocation conditions, and creates initial statements. Proceedings of the ACM SIGPLAN 1994
constraints. The initial constraints are predicates on conference on Programming language design and
the values of the variables and the registers. For ex- implementation, pages 302–312, 1994.
ample, there may be a predicate that %o3 < n, if %o3 [3] M. Christodorescu and S. Jha. Static Analysis of
is used to address an array. Executables to Detect Malicious Patterns. Proceedings
of the 12th USENIX Security Symposium, pages
• In the type-state propagation step, the statements of
169–186, 2003.
the program are abstractly interpreted. At each pro-
gram point, there is a copy of the memory state for [4] C. Cifuentes and A. Fraboulet. Intraprocedural static
each variable, register, or memory location, along with slicing of binary executables. Software Maintenance,
its type-state. 1997. Proceedings., International Conference on, pages
188–195, 1997.
• In the previous stage, the type-states are known at [5] N. Dor, M. Rodeh, and M. Sagiv. CSSV: towards a
each program point, but they cannot yet be verified. realistic tool for statically detecting all buffer
In the annotation step, the algorithm creates safety overflows in C. Proceedings of the ACM SIGPLAN
2003 conference on Programming language design and
implementation, pages 155–167, 2003.
[6] S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system
and language for building system-specific, static
analyses. ACM Press New York, NY, USA, 2002.
[7] S. Horwitz, T. Reps, and D. Binkley. Interprocedural
slicing using dependence graphs. ACM Transactions
on Programming Languages and Systems (TOPLAS),
12(1):26–60, 1990.
[8] J. Kam and J. Ullman. Monotone data flow analysis
frameworks. Acta Informatica, 7(3):305–317, 1977.
[9] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna.
Static disassembly of obfuscated binaries. Proceedings
of the 13th USENIX Security Symposium
(Security ’04), 2004.
[10] C. Linn and S. Debray. Obfuscation of executable
code to improve resistance to static disassembly.
Proceedings of the 10th ACM conference on Computer
and communications security, pages 290–299, 2003.
[11] C. Manning and H. Schütze. Foundations of
Statistical Natural Language Processing. MIT Press,
1999.
[12] D. O’Donoghue, A. Leddy, J. Power, and J. Waldron.
Bigram analysis of Java bytecode sequences.
Proceedings of the inaugural conference on the
Principles and Practice of programming, 2002 and
Proceedings of the second workshop on Intermediate
representation engineering for virtual machines, 2002,
pages 187–192, 2002.
[13] F. Tip and D. Bäumer. Refactoring for generalization
using type constraints. Proceedings of the 18th annual
ACM SIGPLAN conference on Object-oriented
programing, systems, languages, and applications,
pages 13–26, 2003.
[14] M. Weiser. Program slicing. Proceedings of the 5th
international conference on Software engineering,
pages 439–449, 1981.
[15] Z. Xu, B. Miller, and T. Reps. Safety checking of
machine code. ACM SIGPLAN Notices, 35(5):70–82,
2000.
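
As an illustration of the access-policy idea from the discussion of Xu's safety checking [15], the following Python sketch checks a list of memory accesses against a policy of (region, category, access) tuples for the foo() example. It is a simplified illustration under stated assumptions, not Xu's algorithm: the names PolicyEntry, Access, check, and lengths are hypothetical, regions are identified by variable name only, and the index interval of each access is assumed to be given rather than computed by type-state propagation and range analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyEntry:
    region: str        # variable name standing in for an address range
    category: str      # type of the values stored in the region
    access: frozenset  # permitted access flags, e.g. {"r", "w"}


@dataclass(frozen=True)
class Access:
    region: str   # region the instruction touches
    kind: str     # "r" for a load, "w" for a store
    index: tuple  # (lo, hi) interval of possible element indices (assumed known)


def check(policy, lengths, accesses):
    """Return a list of violation messages; an empty list means 'safe'."""
    by_region = {entry.region: entry for entry in policy}
    violations = []
    for acc in accesses:
        entry = by_region.get(acc.region)
        if entry is None:
            violations.append(f"{acc.region}: no policy entry")
            continue
        # Local-style check: the access kind must be permitted by the policy.
        if acc.kind not in entry.access:
            violations.append(f"{acc.region}: '{acc.kind}' not permitted")
        # Global-style check: every possible index must lie within bounds.
        lo, hi = acc.index
        if lo < 0 or hi >= lengths[acc.region]:
            violations.append(f"{acc.region}: index [{lo}, {hi}] out of bounds")
    return violations


# Policy for the foo() example; n is an arbitrary array length for illustration.
n = 8
policy = [
    PolicyEntry("e", "int", frozenset("rw")),
    PolicyEntry("array", "int[n]", frozenset("rw")),
    PolicyEntry("readStruct", "InfoStruct_t", frozenset("r")),
]
lengths = {"e": 1, "array": n, "readStruct": 1}

safe = [Access("array", "w", (0, n - 1)), Access("readStruct", "r", (0, 0))]
unsafe = [Access("array", "w", (0, n)), Access("readStruct", "w", (0, 0))]

print(check(policy, lengths, safe))    # no violations
print(check(policy, lengths, unsafe))  # one bounds violation, one access violation
```

The first list models a plug-in that stays within the array and only reads readStruct. The second writes one element past the end of the array and stores into the read-only structure, so both checks report a violation. A real checker would derive the index intervals from the machine code itself, which is where the cost of the flow-sensitive analysis arises.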
