Static Analysis of Binary Exe
Steve Hanov
University of Waterloo
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
[email protected]
connects procedures together.

To construct the program dependence graph:

• Create an entry node. For every variable v, create a node labeled FinalUse(v).
• Create a node for each program statement.
• Connect the entry vertex to all other vertexes which are not in a loop or conditional. These edges are control dependence edges.
• Connect control flow statements (such as if constructs) to their immediately nested contents, using control dependence edges.
• Connect together all nodes v1 and v2 such that v1 defines a variable x, and there is a path through the program such that v2 uses that same definition of x. This edge is a data dependence edge.

In Figure 4, we have applied the above procedure to the program in Figure 3. The result is the program dependence graph. The intraprocedural slice can then be found with a very simple algorithm given by Horwitz:

This procedure begins at the nodes of interest, and works backwards to find all possible paths to those nodes from the entry point. By performing this procedure on one of the FinalUse(v) nodes, we can create a program slice for any variable we please.

Horwitz also goes into detail extending the method interprocedurally using the System Dependence graph to create function summaries. However, since this area has not been applied to binaries, we omit it from this survey.

Thus far, the slicing method does have a problem: It handles only structured programs. If a goto or jump statement is inserted, the results are incorrect. Agrawal [2] proposes a method of handling them.

1. First, he constructs the program slice using the conventional algorithm, which excludes jump statements.
2. Then, he determines which jump statements to add back in. The criteria for this are explained below.
3. Finally, if the slice contains a jump, and the destination isn't in the slice, the target is pushed forward to the next executable statement.
1: if ( C1 ) {
2:     goto L6;
3:     y = ...;
4:     goto L8;
   }
5: z = ...;
6: L6: x = ...;
7: goto L3;
8: L8: write(x);
9: write(y);
10: write(z);

1: if ( C1 ) {
2:     goto L6;
3:     y = ...;
4:     goto L8;
   }
7: goto L3;
9: write(y);
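
The backward traversal of the program dependence graph described earlier in this section can be sketched in C. This is a minimal illustration under stated assumptions, not Horwitz's published algorithm: the names Pdg, pdg_add_edge, and pdg_backward_slice are hypothetical, the graph is stored as a fixed-size predecessor matrix, and control and data dependence edges are treated alike because the traversal follows both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 64   /* arbitrary capacity chosen for this sketch */

/* A program dependence graph stored as a predecessor matrix:
 * dep[to][from] is true when node `to` depends on node `from`
 * through either a control or a data dependence edge. */
typedef struct {
    size_t node_count;              /* must not exceed MAX_NODES */
    bool dep[MAX_NODES][MAX_NODES];
} Pdg;

/* Record that node `to` depends on node `from`. */
void pdg_add_edge(Pdg *g, size_t from, size_t to)
{
    assert(from < g->node_count && to < g->node_count);
    g->dep[to][from] = true;
}

/* Mark in `in_slice` every node from which `start` can be reached,
 * walking dependence edges backwards with an explicit worklist.
 * Returns the number of nodes in the slice, including `start`. */
size_t pdg_backward_slice(const Pdg *g, size_t start,
                          bool in_slice[MAX_NODES])
{
    size_t worklist[MAX_NODES];   /* each node is pushed at most once */
    size_t top = 0, count = 0;

    assert(g->node_count <= MAX_NODES && start < g->node_count);
    for (size_t i = 0; i < MAX_NODES; i++)
        in_slice[i] = false;

    in_slice[start] = true;
    worklist[top++] = start;

    while (top > 0) {
        size_t node = worklist[--top];
        count++;
        for (size_t pred = 0; pred < g->node_count; pred++) {
            if (g->dep[node][pred] && !in_slice[pred]) {
                in_slice[pred] = true;
                worklist[top++] = pred;
            }
        }
    }
    return count;
}
```

To slice on a variable v, a caller would build the graph with pdg_add_edge and pass the index of the FinalUse(v) node as start. The nodes marked in in_slice are then the statements on which the final value of v may depend.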
Figure 8: The post dominator tree for the program in Figure 5.

5. OBFUSCATION TECHNIQUES
In 2003, Linn and Debray [10] proposed a number of ways to improve the resistance of an executable to static disassembly. There are two types of disassemblers. A linear sweep disassembler begins at the program start address and disassembles each instruction encountered until it gets to the end address. The best example of a linear sweep disassembler is objdump, part of the GNU binutils package.
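
The linear sweep strategy can be sketched in C as follows. This is only an illustration under stated assumptions: the four-opcode encoding is a hypothetical toy instruction set invented for the example, not a real architecture, and the names toy_insn_length and linear_sweep are likewise hypothetical. It does not reflect how objdump is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy encoding, NOT a real instruction set: the opcode
 * byte alone determines the instruction length. */
enum { OP_NOP = 0x00, OP_LOAD = 0x01, OP_JMP = 0x02, OP_RET = 0x03 };

/* Length in bytes of the instruction starting with `opcode`,
 * or 0 if the opcode is not recognized. */
size_t toy_insn_length(uint8_t opcode)
{
    switch (opcode) {
    case OP_NOP:  return 1;
    case OP_LOAD: return 2;   /* opcode + 8-bit immediate */
    case OP_JMP:  return 3;   /* opcode + 16-bit target   */
    case OP_RET:  return 1;
    default:      return 0;
    }
}

/* Linear sweep: decode from offset `start` up to `end`, one
 * instruction after another, ignoring control flow entirely.
 * Stores up to `max_out` instruction offsets in `out` and returns
 * how many were stored. Stops early on an unknown opcode or on an
 * instruction that would run past `end`. */
size_t linear_sweep(const uint8_t *code, size_t start, size_t end,
                    size_t *out, size_t max_out)
{
    size_t found = 0;
    size_t pc = start;

    assert(start <= end);
    while (pc < end && found < max_out) {
        size_t len = toy_insn_length(code[pc]);
        if (len == 0 || len > end - pc)
            break;
        out[found++] = pc;
        pc += len;
    }
    return found;
}
```

Because the sweep never follows jumps, every byte between the start and end addresses is decoded as though it were code. The position of each instruction therefore depends on the lengths of all the bytes decoded before it.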
Christodorescu identifies several types of obfuscation techniques that are used by polymorphic viruses for the decrypting stub. The first, which worked against all commercial virus utilities at the time, was inserting nop instructions into the program text. Another technique is to rearrange the instructions. There may be two or more instructions whose order does not matter, so they can be safely transposed. Likewise, the virus may perform instruction substitution, so that one instruction is replaced by several that perform an equivalent function. A particularly insidious version is to weave the instructions into a host program, so that as the host is executing, it also executes the virus.

Christodorescu creates an analysis tool to detect the polymorphic techniques that he identified. The detector operates in three phases. First, the code is disassembled. Secondly, it is annotated, and thirdly, it is run through a detection module. The paper does not discuss disassembly, and presumes that the malicious programs have not made any attempt to confuse the disassembler.

To understand the goals of the method, it is best to first un-

Type-state analysis is designed to answer this type of question. With a type-state analysis, one can assign rules to variables in a program. For example, in a higher level language, one can create a rule that lock() may never be called on a lock object that has already been locked. This may be verified at compile time. We will see how type-state analysis can be applied to machine code as well.

In his Ph.D. thesis [15], Xu is concerned with verifying the safety of machine code. Most often, safety is determined through code signing. However, users often do not have the time or inclination to verify that they trust the signing authority. In addition, unsafe code may be signed. Xu's method is best suited for a binary plug-in distributed to a host program. The host program will be able to statically check for out-of-bounds array accesses, null pointer dereferences, and unaligned memory accesses before executing untrusted code.

The analysis operates in five phases: preparation, type-state propagation, annotation, local verification, and global verification. Xu assumes that the host program communicates
with the plug-in using a well defined interface. For example, consider a function call exported by a dynamic library. The function call takes as arguments an array, a pointer, and a length.

void foo( int array[], int length,
          const InfoStruct_t *readStruct );

The host program would like to verify that foo() does not write beyond the bounds of the array. In addition, it would like to verify that it does not write to the read-only readStruct structure. Xu's algorithm can verify these rules, if they are encoded in an access policy.

An access policy is a set of 3-tuples. Each tuple consists of a region, a category, and the type of access permitted. A region can be a range of addresses, or a single variable name. The category is a set of types, and the access field is, for example, a combination of read, write, followable, or executable flags. Together with the access policy, the host must provide the type-state for each variable and the invocation. In our example, here is the type-state:

Type-state
e: <int, uninitialized, rw>
array: <int [n], {e}, rw>
readStruct: <InfoStruct_t, initialized, r>

Access Policy
< e : int : rw >
< array : int [n] : rw >
< readStruct : InfoStruct_t : r >

Invocation
%o0 ← array
%o1 ← n
%o2 ← readStruct

The host type-state is the initial type-state of each variable. The access policy controls how the variables may be used. The invocation tells the algorithm what the initial state of the registers is. In this example, SPARC assembly is used, so the function parameters are passed in the registers.

• In the preparation stage, the algorithm takes the type-state, access policy, and invocation, applies the type-state to the invocation conditions, and creates initial constraints. The initial constraints are predicates on the values of the variables and the registers. For example, there may be a predicate that %o3 < n, if %o3 is used to address an array.

• In the type-state propagation step, the statements of the program are abstractly interpreted. At each program point, there is a copy of the memory state for each variable, register, or memory location, along with its type-state.

• After the previous stage, the type-states are known at each program point, but they cannot yet be verified. In the annotation step, the algorithm creates safety preconditions and assertions for each program statement. For instance, if it determines that a variable is being used to index an array, it adds a safety precondition that the variable's value must lie within the bounds of the array. The two types of preconditions that it creates are local preconditions, which can be verified using type-state information alone, and global preconditions, which may require further analysis to verify. For example, checking a global precondition may require the calculation of loop invariants.

• In the verification step, the preconditions for each statement are checked. A local precondition may be, for example, that “e is initialized”, and this can be checked immediately using the type-state calculated in the propagation step. However, a global precondition, such as an array bounds check, may require a range analysis on the values of the registers.

If all of the preconditions hold for the program, then it is deemed safe to execute.

Some of the limitations of the algorithm are its scalability and precision. The propagation step uses a flow sensitive interprocedural analysis which is quite slow [15]. Also, array elements are collapsed into a single element, so it is not possible to verify programs that are allowed to write to only certain portions of an array without creating a separate access policy for each element.

8. CONCLUSION
We have presented a survey of the use of static analysis techniques in binary executables. The techniques have been presented in the context of reverse engineering, disassembly, virus detection, and safety analysis. It is most striking not how far the research has advanced, but how far it must still go to do such basic tasks as disassemble a program. It is surprising that a virus can be constructed that, using simple techniques, can disguise its source code from the best known disassembly techniques, such that it must be run in order to reveal its secrets. Researchers in program obfuscation and static analysis will perhaps fight a never ending battle with each other.

9. REFERENCES
[1] Intel 64 and IA-32 Architectures Software Developer's
Manuals.
[2] H. Agrawal. On slicing programs with jump
statements. Proceedings of the ACM SIGPLAN 1994
conference on Programming language design and
implementation, pages 302–312, 1994.
[3] M. Christodorescu and S. Jha. Static Analysis of
Executables to Detect Malicious Patterns. Proceedings
of the 12th USENIX Security Symposium, pages
169–186, 2003.
[4] C. Cifuentes and A. Fraboulet. Intraprocedural static
slicing of binary executables. Software Maintenance,
1997. Proceedings., International Conference on, pages
188–195, 1997.
[5] N. Dor, M. Rodeh, and M. Sagiv. CSSV: towards a
realistic tool for statically detecting all buffer
overflows in C. Proceedings of the ACM SIGPLAN
2003 conference on Programming language design and
implementation, pages 155–167, 2003.
[6] S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system
and language for building system-specific, static
analyses. ACM Press New York, NY, USA, 2002.
[7] S. Horwitz, T. Reps, and D. Binkley. Interprocedural
slicing using dependence graphs. ACM Transactions
on Programming Languages and Systems (TOPLAS),
12(1):26–60, 1990.
[8] J. Kam and J. Ullman. Monotone data flow analysis
frameworks. Acta Informatica, 7(3):305–317, 1977.
[9] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna.
Static disassembly of obfuscated binaries. Proceedings
of the 13th USENIX Security Symposium
(Security '04), 2004.
[10] C. Linn and S. Debray. Obfuscation of executable
code to improve resistance to static disassembly.
Proceedings of the 10th ACM conference on Computer
and communications security, pages 290–299, 2003.
[11] C. Manning and H. Schütze. Foundations of
Statistical Natural Language Processing. MIT Press,
1999.
[12] D. O’Donoghue, A. Leddy, J. Power, and J. Waldron.
Bigram analysis of Java bytecode sequences.
Proceedings of the inaugural conference on the
Principles and Practice of programming, 2002 and
Proceedings of the second workshop on Intermediate
representation engineering for virtual machines, 2002,
pages 187–192, 2002.
[13] F. Tip and D. Bäumer. Refactoring for generalization
using type constraints. Proceedings of the 18th annual
ACM SIGPLAN conference on Object-oriented
programing, systems, languages, and applications,
pages 13–26, 2003.
[14] M. Weiser. Program slicing. Proceedings of the 5th
international conference on Software engineering,
pages 439–449, 1981.
[15] Z. Xu, B. Miller, and T. Reps. Safety checking of
machine code. ACM SIGPLAN Notices, 35(5):70–82,
2000.