Automatic Reverse Engineering of Data Structures From Binary Execution

The document describes a technique called REWARDS for automatically reverse engineering data structures from binary executables. REWARDS uses dynamic analysis to tag memory locations accessed by a program with timestamped type attributes. It propagates these types following the program's data flow to resolve variable types. It also performs backward type resolution to infer types of previously accessed variables starting from "type sinks". REWARDS aims to reconstruct in-memory data structure layout and infer both the syntax (e.g. size, structure) and semantics (e.g. meaning) of data structures from binary execution alone.

Uploaded by

Madhan Mohan

Automatic Reverse Engineering of Data Structures from Binary Execution

Zhiqiang Lin, Xiangyu Zhang, Dongyan Xu

Department of Computer Science and CERIAS
Purdue University, West Lafayette, IN
{zlin,xyzhang,dxu}@cs.purdue.edu

Abstract

With only the binary executable of a program, it is useful to discover the program’s data structures and infer their syntactic and semantic definitions. Such knowledge is highly valuable in a variety of security and forensic applications. Although there exist efforts in program data structure inference, the existing solutions are not suitable for our targeted application scenarios. In this paper, we propose a reverse engineering technique to automatically reveal program data structures from binaries. Our technique, called REWARDS, is based on dynamic analysis. More specifically, each memory location accessed by the program is tagged with a timestamped type attribute. Following the program’s runtime data flow, this attribute is propagated to other memory locations and registers that share the same type. During the propagation, a variable’s type gets resolved if it is involved in a type-revealing execution point or “type sink”. More importantly, besides the forward type propagation, REWARDS involves a backward type resolution procedure where the types of some previously accessed variables get recursively resolved starting from a type sink. This procedure is constrained by the timestamps of relevant memory locations to disambiguate variables re-using the same memory location. In addition, REWARDS is able to reconstruct in-memory data structure layout based on the type information derived. We demonstrate that REWARDS provides unique benefits to two applications: memory image forensics and binary fuzzing for vulnerability discovery.

1 Introduction

A desirable capability in many security and forensics applications is automatic reverse engineering of data structures given only the binary. Such capability is expected to identify a program’s data structures and reveal their syntax (e.g., size, structure, offset, and layout) and semantics (e.g., “this integer variable represents a process ID”). Such knowledge about program data structures is highly valuable. For example, in memory-based forensics, this knowledge will help locate specific information of interest (e.g., IP addresses) in a memory core dump without symbolic information; in binary vulnerability discovery, this knowledge will help construct a meaningful view of in-memory data structure layout and identify those semantically associated with external input for guided fuzz testing.

Despite the usefulness of automatic data structure reverse engineering, solutions that suit our targeted application scenarios fall short. First, a large body of work on type inference [29, 3, 13, 33, 32, 24] requires program source code. Second, in the binary-only scenario, variables are mapped to low-level entities such as registers and memory locations with no syntactic information, which makes static analysis difficult. In particular, alias analysis is hard at the binary level while it is essential to type inference, especially semantics inference, because precise data flow cannot be decided without accurate alias information. Variable discovery [5] is a static, binary-level technique that recovers syntactic characteristics of variables, such as a variable’s offset in its activation record, size, and hierarchical structure. This technique relies on alias analysis and abstract interpretation at the binary level and is hence heavy-weight. Moreover, due to the conservative nature of binary alias analysis, the technique does not infer variable semantics. More recently, Laika [16] aims at dynamically discovering the syntax of observable data structures through unsupervised machine learning on program execution. The accuracy of this technique, however, may fall below the expectation of our applications. It does not consider data structure semantics either. The limitations of these efforts motivate us to develop new techniques for our targeted application scenarios.

In this paper, we propose a reverse engineering scheme to automatically reveal program data structures from binaries. Our technique, called REWARDS¹, is based on dynamic analysis. Given a binary executable, REWARDS executes the binary, monitors the execution, aggregates and analyzes runtime information, and finally recovers both the syntax and semantics of data structures observed in the execution. More specifically, each memory location accessed by the program is tagged with a timestamped type attribute. Following the program’s runtime data flow, this attribute is propagated to other memory addresses and registers that share the same type in a forward fashion, i.e., the execution direction. During the propagation, a variable’s type gets resolved if it is involved in a type-revealing execution point or “type sink” (e.g., a system call, a standard library call, or a type-revealing instruction). Besides leveraging the forward type propagation technique, to expand the coverage of program data structures, REWARDS involves the following key techniques:

• An on-line backward type resolution procedure where the types of some previously accessed variables get recursively resolved starting from a type sink. Since many variables are dynamically created and de-allocated at runtime, and the same memory location may be re-used by different variables, it is complicated to track and resolve variable types based on memory locations alone. Hence, we constrain the resolution process by the timestamps of relevant memory locations such that variables sharing the same memory location in different execution phases can be disambiguated.

• An off-line resolution procedure that complements the on-line procedure. Some variables cannot be resolved during their lifetime by our on-line algorithm. However, they may later get resolved when other variables having the same type are resolved. Hence, we propose an off-line backward resolution procedure to resolve the types of some “dead” variables.

• A method for typed variable abstraction that maps multiple typed variable instances to the same static abstraction. For example, all N nodes in a linked list actually share the same type, instead of having N distinct types.

• A method that reconstructs the structural and semantic view of in-memory data, driven by the derived type definitions. Once a program’s data structures are identified, it is still not clear exactly how the data structures would be laid out in memory, which is a useful piece of knowledge in many application scenarios such as memory forensics. Our method creates an “organization chart” that illustrates the hierarchical layout of those data structures.

We have developed a prototype of REWARDS and used it to analyze a number of binaries. Our evaluation results show that REWARDS is able to correctly reveal the types of a high percentage of variables observed during a program’s execution. Furthermore, we demonstrate the unique benefits of REWARDS to a variety of application scenarios: in memory image forensics, REWARDS helps recover semantic information from the memory dump of a binary program. In binary fuzzing for vulnerability discovery, REWARDS helps identify vulnerability “suspects” in a binary for guided fuzzing and confirmation.

¹ REWARDS is the acronym for Reverse Engineering Work for Automatic Revelation of Data Structures.

2 REWARDS Overview

REWARDS infers both syntax and semantics of data structures from binary execution. More precisely, we aim at reverse engineering the following information:

• Data types. We first aim to infer the primitive data types of variables, such as char, short, float, and int. In a binary, the variables are located in various segments of the virtual address space, such as the .stack, .heap, .data, .bss, .got, .rodata, .ctors, and .dtors sections. (Although we focus on ELF binaries on the Linux platform, REWARDS can be easily ported to handle PE binaries on Windows.) Hence, our goal is essentially to annotate memory locations in these data sections with types and sizes, following program execution. For our targeted applications, REWARDS also infers composite types such as socket address structures and FILE structures.

• Semantics. Moreover, we aim to infer the semantics (meaning) of program variables, which is critical to applications such as computer forensics. For example, in a memory dump, we want to decide if a 4-byte integer denotes an IP address.

• Abstract representation. Although we type memory locations, it is undesirable to simply present typed memory locations to the user. During program execution, a memory location may be used by multiple variables at different times, and a variable may have multiple instances. Hence we derive an abstract representation for a variable by aggregating the type information at multiple memory locations instantiated based on the same variable. For example, we use the offset of a local variable in its activation record as its abstract representation. Type information collected in all activation records of the same function is aggregated to derive the type of the variable.

Given only the binary, what can be observed at runtime from each instruction includes (1) the addresses accessed and the width of the accesses, (2) the semantics of the instruction, and (3) the execution context such as the program counter and the call stack. In some cases, data types can be partially inferred from instructions. For example, a floating point instruction (e.g., FADD) implies that the accessed locations must have floating point numbers. We also observe that the parameters and return values of standard library calls and system calls often have their syntax and semantics
well defined and publicly known. Hence we define the type revealing instructions, system calls, and library calls as type sinks. Furthermore, the execution of an instruction creates a dependency between the variables involved. For instance, if a variable with a resolved type (from a type sink) is copied to another variable, the destination variable should have a compatible type. As such, we model our problem as a type information flow problem.

To illustrate how REWARDS works, we use a simple program compiled from the source code shown in Figure 1(a). According to the code snippet, the program has a global variable test (line 1-4) which consists of an int and a char array. It contains a function foo (line 6-10) that calls my_getpid and strcpy to initialize the global variable. The full disassembled code of the example is shown in Figure 1(b) (a dotted line indicates a “NOP” instruction). The address mapping of code and data is shown in Figure 1(c).

(a) Source code of function foo and the _start assembly code

     1 struct {                          1 extern foo
     2   unsigned int pid;               2 section .text
     3   char data[16];                  3 global _start
     4 }test;                            4
     5                                   5 _start:
     6 void foo(){                       6   call foo
     7   char *p="hello world";          7   mov eax,1
     8   test.pid=my_getpid();           8   mov ebx,0
     9   strcpy(test.data,p);            9   int 80h
    10 }

(b) Disassembly code of the example binary

     1 80480a0: e8 0f 00 00 00          call   0x80480b4
     2 80480a5: b8 01 00 00 00          mov    $0x1,%eax
     3 80480aa: bb 00 00 00 00          mov    $0x0,%ebx
     4 80480af: cd 80                   int    $0x80
     5 ...
     6 80480b4: 55                      push   %ebp
     7 80480b5: 89 e5                   mov    %esp,%ebp
     8 80480b7: 83 ec 18                sub    $0x18,%esp
     9 80480ba: c7 45 fc 18 81 04 08    movl   $0x8048118,0xfffffffc(%ebp)
    10 80480c1: e8 4a 00 00 00          call   0x8048110
    11 80480c6: a3 24 91 04 08          mov    %eax,0x8049124
    12 80480cb: 8b 45 fc                mov    0xfffffffc(%ebp),%eax
    13 80480ce: 89 44 24 04             mov    %eax,0x4(%esp)
    14 80480d2: c7 04 24 28 91 04 08    movl   $0x8049128,(%esp)
    15 80480d9: e8 02 00 00 00          call   0x80480e0
    16 80480de: c9                      leave
    17 80480df: c3                      ret
    18 80480e0: 55                      push   %ebp
    19 80480e1: 89 e5                   mov    %esp,%ebp
    20 80480e3: 53                      push   %ebx
    21 80480e4: 8b 5d 08                mov    0x8(%ebp),%ebx
    22 80480e7: 8b 55 0c                mov    0xc(%ebp),%edx
    23 80480ea: 89 d8                   mov    %ebx,%eax
    24 80480ec: 29 d0                   sub    %edx,%eax
    25 80480ee: 8d 48 ff                lea    0xffffffff(%eax),%ecx
    26 80480f1: 0f b6 02                movzbl (%edx),%eax
    27 80480f4: 83 c2 01                add    $0x1,%edx
    28 80480f7: 84 c0                   test   %al,%al
    29 80480f9: 88 04 0a                mov    %al,(%edx,%ecx,1)
    30 80480fc: 75 f3                   jne    0x80480f1
    31 80480fe: 89 d8                   mov    %ebx,%eax
    32 8048100: 5b                      pop    %ebx
    33 8048101: 5d                      pop    %ebp
    34 8048102: c3                      ret
    35 ...
    36 8048110: b8 14 00 00 00          mov    $0x14,%eax
    37 8048115: cd 80                   int    $0x80
    38 8048117: c3                      ret

(c) Section map of the example binary

    [Nr] Name      Type      Addr      Off     Size
    ...
    [ 1] .text     PROGBITS  080480a0  0000a0  000078
    [ 2] .rodata   PROGBITS  08048118  000118  00000c
    [ 3] .bss      NOBITS    08049124  000124  000014
    ...

(d) Output of REWARDS

    rodata_0x08048118{
      +00: char[12]
    }
    bss_0x08049124{
      +00: pid_t,
      +04: char[12],
      +16: unused[4]
    }
    fun_0x080480b4{
      -28: unused[20],
      -08: char *,
      -04: stack_frame_t,
      +00: ret_addr_t
    }
    fun_0x08048110{
      +00: ret_addr_t
    }
    fun_0x080480e0{
      -08: unused[4],
      -04: stack_frame_t,
      +00: ret_addr_t,
      +04: char*,
      +08: char*
    }

Figure 1. An example showing how REWARDS works

When foo is called during execution, it first saves ebp and then allocates 0x18 bytes of memory for the local variables (line 8 in Figure 1(b)), and then initializes one local variable (at address 0xfffffffc(%ebp)=ebp-4) with an immediate value 0x8048118 (line 9). Since 0x8048118 is in the address range of the .rodata section (it is actually the starting address of string “hello world”), ebp-4 can be typed as a pointer, based on the heuristics that instruction executions using similar immediate values within a code or data section are considered type sinks. Note that the type of the pointer is unknown yet. At line 10, foo calls 0x8048110. Inside the body of the function invocation (lines 36-38), our algorithm detects a getpid system call (a type sink) with eax being 0x14 at line 36. The return value of the function call is resolved as pid_t type, i.e., register eax at line 11 is typed pid_t. When eax is copied to address 0x8049124 (a global variable in .bss section as shown in Figure 1(c)), the algorithm further resolves 0x8049124 as pid_t. Before the function call 0x80480e0 at line 15 (strcpy), the parameters are initialized in lines 12-14. As ebp-4 has been typed as a pointer at line 9, the data flow in lines 12 and 13 dictates that location esp+4 at line 13 is a pointer as well. At line 14, as 0x8049128 is in the global variable section and of a known type, location esp has an unknown pointer type. At line 15, upon the call to strcpy (a type sink), both esp and esp+4 are resolved to char*. Through a backward transitive resolution, 0x8049128 is resolved as char, ebp-4 as char*, and 0x8048118 as char. Also at line 26, inside the function body of strcpy, the instruction “movzbl (%edx),%eax” can be used as another type sink as it moves between char variables.
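Two of the decisions in this walk-through are simple lookups, and a minimal sketch may make them concrete. The section ranges below are copied from Figure 1(c), and the one-entry system call table reflects the getpid call detected with eax = 0x14. The names SECTIONS, SYSCALL_RETURN_TYPES, section_of and looks_like_pointer are hypothetical helpers introduced only for illustration; they are not part of REWARDS.

```python
# Section map copied from Figure 1(c): name -> (start address, size in bytes).
SECTIONS = {
    ".text":   (0x080480A0, 0x78),
    ".rodata": (0x08048118, 0x0C),
    ".bss":    (0x08049124, 0x14),
}

# Hypothetical one-entry system call table: number in eax -> (name, return type).
SYSCALL_RETURN_TYPES = {0x14: ("getpid", "pid_t")}


def section_of(value):
    """Return the name of the section whose address range contains value, or None."""
    for name, (start, size) in SECTIONS.items():
        if start <= value < start + size:
            return name
    return None


def looks_like_pointer(immediate):
    """Heuristic from the walk-through: an immediate that falls inside a
    known code or data section is treated as a pointer of unknown type."""
    return section_of(immediate) is not None


# Line 9 of Figure 1(b): movl $0x8048118, 0xfffffffc(%ebp)
print(hex(0x8048118), section_of(0x8048118), looks_like_pointer(0x8048118))
# Line 14 of Figure 1(b): movl $0x8049128, (%esp)
print(hex(0x8049128), section_of(0x8049128), looks_like_pointer(0x8049128))
# Line 36 of Figure 1(b): eax = 0x14 before int $0x80
print(SYSCALL_RETURN_TYPES[0x14])
```

The sketch only decides that a value is some pointer; as in the text, the pointee type (char here) still has to come from a later type sink such as the strcpy call.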
When the program finishes, we resolve all data types (including function arguments, and those implicit variables such as return address and stack frame pointer) as shown in Figure 1(d). The derived types for variables in .rodata, .bss and functions are presented in the figure. Each function is denoted by its entry address. fun_0x080480b4, fun_0x08048110, and fun_0x080480e0 denote foo(), my_getpid(), and strcpy(), respectively. The number before each derived type denotes the offset. Variables are listed in increasing order of their addresses. Type stack_frame_t indicates a frame pointer stored at that location. Type ret_addr_t means that the location holds a return address. Such semantic information is useful in applications such as vulnerability fuzz. Locations that are not accessed during execution are annotated with the unused type. In fun_0x080480e0, the two char* below the ret_addr_t represent the two actual arguments of strcpy(). Although it seems that our example can be statically resolved due to its simplicity, it is very difficult in practice to analyze data flows between instructions (especially those involving heap locations) due to the difficulty of binary points-to analysis.

3 REWARDS Design

In this section, we describe the design of REWARDS. We first identify the type sinks used in REWARDS and then present the on-line type propagation and resolution algorithm, which will be enhanced by an off-line procedure that recovers more variable types not reported by the on-line algorithm. Finally, we present a method to construct a typed hierarchical view of memory layout.

3.1 Type Sinks

A type sink is an execution point of a program where the types (including semantics) of one or more variables can be directly resolved. In REWARDS, we identify three categories of type sinks: (1) system calls, (2) standard library calls, and (3) type-revealing instructions.

System calls. Most programs request OS services via system calls. Since system call conventions and semantics are well-defined, the types of arguments of a system call are known from the system call’s specification. By monitoring system call invocations and returns, REWARDS can determine the types of parameters and return value of each system call at runtime. For example, in Linux, based on the system call number in register eax, REWARDS will be able to type the parameter-passing registers (i.e., ebx, ecx, edx, esi, edi, and ebp, if they are used for passing the parameters). From this type sink, REWARDS will further type those variables that are determined to have the same type as the parameter passing registers. Similarly, when a system call returns, REWARDS will type register eax and, from there, those having the same type as eax. In our type propagation and resolution algorithm (Section 3.2), a type sink will lead to the recursive type resolution of relevant variables accessed before and after the type sink.

Standard library calls. With well-defined APIs, standard library calls are another category of type sink. For example, the two arguments of strcpy must both be of the char* type. By intercepting library function calls and returns, REWARDS will type the registers and memory variables involved. Standard library calls tend to provide richer type information than system calls: for example, Linux-2.6.15 has 289 system calls whereas libc.so.6 contains 2016 functions (note some library calls wrap system calls).

Type-revealing instructions. A number of machine instructions that require operands of specific types can serve as type sinks. Examples in x86 are as follows: (1) String instructions perform byte-string operations such as moving/storing (MOVS/B/D/W, STOS/B/D/W), loading (LODS/B/D/W), comparison (CMPS/B/D/W), and scanning (SCAS/B/D/W). Note that MOVZBL is also used in string movement. (2) Floating-point instructions operate on floating-point, integer, and binary coded decimal operands (e.g., FADD, FABS, and FST). (3) Pointer-related instructions reveal pointers. For a MOV instruction with an indirect memory access operand (e.g., MOV (%edx), %ebx or MOV [mem], %eax), the value held in the source operand must be a pointer. Meanwhile, if the target address is within the range of data sections such as .stack, .heap, .data, .bss or .rodata, the pointer must be a data pointer; if it is in the range of .text (including library code), the pointer must be a function pointer. Note that the concrete type of such a pointer will be resolved through other constraints.

3.2 Online Type Propagation and Resolution Algorithm

Given a binary program, our algorithm reveals variable types, including both syntactic types (e.g., int and char) and semantics (e.g., return address), by propagating and resolving type information along the data flow during program execution. Each type sink encountered leads to both direct and transitive type resolution of variables. More specifically, at the binary level, variables exist in either memory locations or registers without their symbolic names. Hence, the goal of our algorithm is to type these memory addresses and registers. We attach three shadow variables, which together form the type attribute, to each memory address at byte granularity (registers are treated similarly): (1) Constraint set is a set of other memory addresses that should have the same type as this address; (2) Type set
stores the set of resolved types of the address², including both syntactic and semantic types; (3) Timestamp records the birth time of the variable currently in this address. For example, the timestamp of a stack variable is the time when its residence method is invoked and the stack frame is allocated. Timestamps are needed because the same memory address may be reused by multiple variables (e.g., the same stack memory being reused by stack frames of different method invocations). More precisely, a variable instance should be uniquely identified by a tuple <address, timestamp>. These shadow variables are updated during program execution, depending on the semantics of executed instructions.

² We need a set to store the resolved types because one variable may have multiple compatible types.

Algorithm 1 On-line Type Propagation and Resolution

     1: /* S_v: constraint set for memory cell (or register) v; T_v: type set of v; ts_v: (birth) time stamp of v; MOV(v,w): moving v to w; BIN_OP(v,w,d): a binary operation that computes d from v and w; Get_Sink_Type(v,i): retrieving the type of argument/operand v from the specification of sink i; ALLOC(v,n): allocating a memory region starting from v with size n (the memory region may be a stack frame or a heap struct); FREE(v,n): freeing a memory region (this may be caused by eliminating a stack frame or de-allocating a heap struct) */
     2: Instrument(i){
     3:   case i is a Type Sink:
     4:     for each operand v
     5:       T ← Get_Sink_Type(v, i)
     6:       Backward_Resolve(v, T)
     7:   case i has indirect memory access operand o
     8:     T_o ← T_o ∪ {pointer_type_t}
     9:   case i is MOV(v, w):
    10:     if w is a register
    11:       S_w ← S_v
    12:       T_w ← T_v
    13:     else
    14:       Unify(v, w)
    15:   case i is BIN_OP(v, w, d):
    16:     if pointer_type_t ∈ T_v
    17:       Unify(d, v)
    18:       Backward_Resolve(w, {int, pointer_index_t})
    19:     else
    20:       Unify3(d, v, w)
    21:   case i is ALLOC(v, n):
    22:     for t=0 to n−1
    23:       ts_{v+t} ← current timestamp
    24:       S_{v+t} ← φ
    25:       T_{v+t} ← φ
    26:   case i is FREE(v, n):
    27:     for t=0 to n−1
    28:       a ← v+t
    29:       if (T_a) log(a, ts_a, T_a)
    30:       log(a, ts_a, S_a)
    31: }
    32: Backward_Resolve(v, T){
    33:   for <w, t> ∈ S_v
    34:     if (T ⊄ T_w and t ≡ ts_w) Backward_Resolve(w, T−T_w)
    35:   T_v ← T_v ∪ T
    36: }
    37: Unify(v, w){
    38:   Backward_Resolve(v, T_w−T_v)
    39:   Backward_Resolve(w, T_v−T_w)
    40:   S_v ← S_v ∪ {<w, ts_w>}
    41:   S_w ← S_w ∪ {<v, ts_v>}
    42: }

The algorithm is shown in Algorithm 1. The algorithm takes appropriate actions to resolve types on the fly according to the instruction being executed. For a memory address or a register v, its constraint set is denoted as S_v, which is a set of <address, timestamp> tuples each representing a variable instance that should have the same type as v; its type set T_v represents the resolved types for v; and the birth time of the current variable instance is denoted as ts_v.

1. If the current execution point i is a type sink (line 3), the arguments/operands/return value of the sink will be directly typed according to the sink’s definition (Get_Sink_Type() on line 5)³. Type resolution is then triggered by calling the recursive method Backward_Resolve(). The method recursively types all variables that should have the same type (lines 32-36): it tests if each variable w in the constraint set of v has been resolved as type T of v. If not, it recursively calls itself to type all the variables that should have the same type as w. Note that at line 34, it checks if the current birth timestamp of w is equal to the one stored in the constraint set to ensure the memory has not been re-used by a different variable. If w is re-used (t ≠ ts_w), the algorithm does not resolve the current w. Instead, the resolution is done by a different off-line procedure (Section 3.3). Since variable types are resolved according to constraints derived from data flows in the past, we call this step backward type resolution.

   ³ The sink’s definition also reveals the semantics of some arguments/operands, e.g., a PID.

2. If i contains an indirect memory access operand o (line 7), either through registers (e.g., using (%eax) to access the address designated by eax) or memory (e.g., using [mem] to indirectly access the memory pointed to by mem), then the corresponding operand will have a pointer type tag (pointer_type_t) as a new element in T_o.

3. If i is a move instruction (line 9), there are two cases to consider. In particular, if the destination operand w is a register, then we just move the properties (i.e., the S_v and T_v) of the source operand to the destination (i.e., the register); otherwise we need to unify the types of the source and destination operands because the destination is now a memory location that may have already contained some resolved types. The intuition is that the source operand v should have the same type as the destination operand w if the destination is a memory address. Hence, the algorithm calls method Unify() to unify the types of the two. In Unify() (lines 37-42), the algorithm first unions the two type sets by performing backward resolution at lines 38 and 39. Intuitively, the call at line 38 means that if there are any new types in T_w that are not in T_v (i.e., T_w−T_v), those new types need to be propagated to v and transitively to all variables that share the same type as v, mandated by v’s constraint set. Such unification is not performed if w is a register, to avoid over-aggregation.

4. If i is a binary operation, the algorithm first tests if an operand has been identified as a pointer. If so, it must be a pointer arithmetic operation: the destination must have the same type as the pointer operand and the other operand must be a pointer index, denoted by a semantic type pointer_index_t (line 18). The semantic type is useful in vulnerability fuzz to overflow buffers. If i is not related to pointers, the three operands shall have the same type. The method Unify3() unifies three variables. It is very similar to Unify() and hence not shown. Note that in cases where the binary operation implicitly casts the type of some operand (e.g., an addition of a float and an integer), the unification induces over-approximation (e.g., associating the floating-point type with the integer variable). In practice, we consider such cases reasonable and allow multiple types for one variable as long as they are compatible.

5. If i allocates a memory region (line 21), either a stack frame or a heap struct, the algorithm updates the birth time stamps of all the bytes in the region, and resets the memory constraint set (S_v) and type set (T_v) to empty. By doing so, we prevent the type information of the old variable instance from interfering with that of the new instance at the same address.

6. If i frees a memory region (line 26), the algorithm traverses each byte in the region and prints out the type information. In particular, if the type set is not empty, it is emitted. Otherwise, the constraint set is emitted. Later, the emitted constraints will be used in the off-line procedure (Section 3.3) to resolve more variables.

Example. Table 1 presents an example of executing our algorithm. The first column shows the instruction trace with the numbers denoting timestamps. The other columns show the type sets and the constraint sets after each instruction execution for three sample variables, namely the global variable g1 and two local variables l1 and l2. For brevity, we abstract the calling sequence of strcpy to a strcpy instruction. After the execution enters method M at timestamp 10, the local variables are allocated and hence both l1 and l2 have the birth time of 10. The global variable g1 has the birth time of 0. After the first mov instruction, the type sets of g1 and l1 are unified. Since neither was typed, the unified type set remains empty. Moreover, l1, together with its birth time 10, is added to the constraint set of g1 and vice versa, denoting they should have the same type. Similar actions are taken after the second mov instruction. Here, the constraint set of l1 has both g1 and l2. The strcpy invocation is a type sink and g1 must be of type char*, so the algorithm performs the backward resolution by calling Backward_Resolve(). In particular, the variable in S_g1, i.e., l1, is typed to char*. Note that the timestamp 10 matches ts_l1, indicating the same variable is still alive. Transitively, the variables in S_l1, i.e., g1 and l2, are resolved to the same type. Note that if the backward resolution was not conducted, we would not be able to resolve the type of l2 because when the move from l1 to l2 (timestamp 12) occurred, l1 was not typed and hence l2 was not typed.

3.3 Off-line Type Resolution

Most variables accessed during the binary’s execution can be resolved by our online algorithm. However, there are still some cases in which, when a memory variable gets freed (and its information gets emitted to the log file), its type is still unresolved. We realize that there may be enough information from later phases of the execution to resolve those variables. We propose an off-line procedure to be performed after the program execution terminates. It is essentially an off-line version of the Backward_Resolve() method in Algorithm 1. The difference is that it has to traverse the log file to perform the recursive resolution.

Consider the example in Table 2. It shares the same execution as the example in Table 1 before timestamp 13. At time instance 13, the execution returns from M, de-allocating the local variables l1 and l2. According to the online algorithm, their constraint sets are emitted to a log file since neither is typed at that point. Later at timestamp 99, another method N is called. Assume it reuses l1 and l2, namely, N allocates its local variables at the locations of l1 and l2. The birth time of l1 and l2 becomes 99. Their type sets and constraint sets are reset. When the sink is encountered at 100, l1 and l2 are not typed as their current birth timestamp is 99, not 10 as in S_g1, indicating they are re-used by other variables. Fortunately, the variable represented by <l1, 10> can be found in the log and hence resolved. Transitively, <l2, 10> can be resolved as well.

3.4 Typed Variable Abstraction

Our algorithm is able to annotate memory locations with syntax and semantics. However, multiple variables may occupy the same memory location at different times and a static variable may have multiple instances at runtime⁴. Hence it is important to organize the inferred type information according to abstract, location-independent variables rather than specific memory locations. In particular, primitive global variables are represented by their offsets to the base of the global sections (e.g., .data and .bss sections). Stack variables are abstracted by the offsets from their residence activation record, which is represented by the function name (as shown in Figure 1).

⁴ A local variable has the same lifetime as a method invocation and a method can be invoked multiple times, giving rise to multiple instances.

For heap variables, we use the execution context, i.e., the PC (instruction address) of the allocation point of a heap
instruction          Tg1      Sg1           tsg1  Tl1      Sl1                   tsl1  Tl2      Sl2           tsl2
10.  enter M         φ        φ             0     φ        φ                     10    φ        φ             10
11.  mov g1, l1      φ        {<l1, 10>}    0     φ        {<g1, 0>}             10    φ        φ             10
12.  mov l1, l2      φ        {<l1, 10>}    0     φ        {<g1, 0>, <l2, 10>}   10    φ        {<l1, 10>}    10
...                  ...      ...           ...   ...      ...                   ...   ...      ...           ...
100. strcpy(g1,...)  {char*}  {<l1, 10>}    0     {char*}  {<g1, 0>, <l2, 10>}   10    {char*}  {<l1, 10>}    10

Table 1. Example of running the online algorithm. Variable g1 is a global, l1 and l2 are locals.

instruction          Tg1      Sg1           tsg1  Tl1      Sl1                   tsl1  Tl2      Sl2           tsl2
...                  ...      ...           ...   ...      ...                   ...   ...      ...           ...
12.  mov l1, l2      φ        {<l1, 10>}    0     φ        {<g1, 0>, <l2, 10>}   10    φ        {<l1, 10>}    10
13.  Exit M          φ        {<l1, 10>}    0     φ        {<g1, 0>, <l2, 10>}   10    φ        {<l1, 10>}    10
...                  ...      ...           ...   ...      ...                   ...   ...      ...           ...
99.  Enter N         φ        {<l1, 10>}    0     φ        φ                     99    φ        φ             99
100. strcpy(g1,...)  {char*}  {<l1, 10>}    0     φ        φ                     99    φ        φ             99

Table 2. Example of running the off-line type resolution procedure. The execution before timestamp 12 is the same as in Table 1. Method N reuses l1 and l2.
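
The traces in Tables 1 and 2 can be replayed with a small model. The following is only a sketch under simplifying assumptions: variables are named symbolically instead of being memory addresses or registers, and the class and method names (TypeTracker, Instance, resolve_offline, and so on) are hypothetical and do not come from the REWARDS implementation. It shows constraint sets keyed by (name, birth timestamp), online backward resolution that only reaches live instances with a matching birth time, and an off-line pass that follows the same constraints into the log of freed instances.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One runtime instance of a variable, identified by name and birth time."""
    name: str
    birth: int
    types: set = field(default_factory=set)        # type set T
    constraints: set = field(default_factory=set)  # constraint set S: {(name, birth)}


class TypeTracker:
    """Hypothetical, simplified model of the online and off-line procedures."""

    def __init__(self):
        self.live = {}  # name -> current live Instance
        self.log = {}   # (name, birth) -> freed Instance

    def alloc(self, name, ts):
        self.live[name] = Instance(name, ts)

    def free(self, name):
        inst = self.live.pop(name)
        self.log[(inst.name, inst.birth)] = inst

    def mov(self, dst, src):
        a, b = self.live[dst], self.live[src]
        merged = a.types | b.types
        if merged:
            # At least one side is typed: unify and propagate.
            self._resolve(a, merged)
            self._resolve(b, merged)
        else:
            # Neither side is typed yet: record that they must agree.
            a.constraints.add((b.name, b.birth))
            b.constraints.add((a.name, a.birth))

    def sink(self, name, type_name):
        self._resolve(self.live[name], {type_name})

    def _resolve(self, start, types):
        """Online backward resolution over live instances only."""
        work = [start]
        while work:
            inst = work.pop()
            if types <= inst.types:
                continue
            inst.types |= types
            for name, birth in inst.constraints:
                cur = self.live.get(name)
                # Skip locations that were re-used by a later instance.
                if cur is not None and cur.birth == birth:
                    work.append(cur)

    def resolve_offline(self):
        """Off-line pass: push types through constraints into logged instances."""
        changed = True
        while changed:
            changed = False
            for inst in list(self.live.values()) + list(self.log.values()):
                if not inst.types:
                    continue
                for key in inst.constraints:
                    dead = self.log.get(key)
                    if dead is not None and not inst.types <= dead.types:
                        dead.types |= inst.types
                        changed = True


def replay_table2():
    t = TypeTracker()
    t.alloc("g1", 0)
    t.alloc("l1", 10)
    t.alloc("l2", 10)          # 10. enter M
    t.mov("g1", "l1")          # 11. mov g1, l1
    t.mov("l1", "l2")          # 12. mov l1, l2
    t.free("l1")
    t.free("l2")               # 13. exit M
    t.alloc("l1", 99)
    t.alloc("l2", 99)          # 99. enter N (re-uses l1 and l2)
    t.sink("g1", "char*")      # 100. strcpy(g1, ...)
    return t
```

In this model, the sink at timestamp 100 types only g1 online, because the live l1 and l2 were born at 99; calling resolve_offline() afterwards types the logged instances <l1, 10> and <l2, 10>.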

structure plus the call stack at that point, as the abstraction of the structure. The intuition is that the heap structure instances allocated from the same PC in the same call stack should have the same type. Fields of the structure are represented by the allocation site and field offsets. As an allocated heap region may be an array of a data structure, we use the recursion detection heuristics in [9] to detect the array size. Specifically, the array size is approximated by the maximum number of accesses by the same PC to unique memory locations in the allocated region. The intuition is that array elements are often accessed through a loop in the source code and the same instruction inside the loop body often accesses the same field across all array elements. Finally, if heap structures allocated from different sites have the same field types, we will heuristically cluster these heap structures into one abstraction.

3.5 Constructing Hierarchical View of In-Memory Data Structure Layout

An important feature of REWARDS is to construct a hierarchical view of a memory snapshot, in which the primitive syntax of individual memory locations, as well as their semantics and the integrated hierarchical structure, are visually represented. This is highly desirable in applications like memory forensics, as interesting queries, e.g., “find all IP addresses”, can be easily answered by traversing the view (examples in Section 5.1). So far, REWARDS is able to reverse engineer the syntax and semantics of data structures, represented by their abstractions. Next, we present how we leverage such information to construct a hierarchical view.

Our method works as follows. It first types the top level global variables. In particular, a root node is created to represent a global section. Individual global variables are represented as children of the root. Edges are annotated with offset, size, primitive type, and semantics of the corresponding children. If a variable is a pointer, the algorithm further recursively constructs the sub-view of the data structure being pointed to, leveraging the derived type of the pointer. For instance, assuming a global pointer p is of type T*, our method creates a node representing the region pointed to by p. The region is typed based on the reverse engineered definition of T. The recursive process terminates when none of the fields of a data structure is a pointer. The stack is handled similarly: a root node is created to represent each activation record. Local variables of the record are denoted as children nodes. Recursive construction is performed until all memory locations reachable through pointers are traversed. Note that all live heap structures can be reached (transitively) through a global pointer or a stack pointer. Hence, the above two steps essentially also construct the structural views of live heap data.

Our method can also type some of the unreachable memory regions, which represent “dead” data structures, e.g., activation records of previous method invocations whose space has been freed but not reused. Such dead data is as important as live data, as it discloses what happened in the past. In particular, our method scans the stack beyond the current activation record to identify any pointers to the code section, which often denote return addresses of method invocations. With a return address, the function invocation can be identified and we can follow the aforementioned steps to type the activation record.

4 Implementation and Evaluation

We have implemented REWARDS on PIN-2.6 [27], with 12.1K lines (LOC) of C code and 1.2K LOC of Python code. In the following, we present several key implementation details. REWARDS is able to reveal variable semantics. In our implementation, variable semantics are represented as special semantic tags complementary to regular type tags such as int and char. Both semantic tags and regular tags are stored in the variable’s type set. Tags are enumerated
to save space. The vast diversity of program semantics makes it infeasible to consider them all. Since we are mainly interested in forensics and security applications, we focus on the following semantic tags: (1) file system related (e.g., FILE pointer, file descriptor, file name, file status); (2) network communication related (e.g., socket descriptor, IP address, port, receiving and sending buffer, host info, msghdr); and (3) operating systems related (e.g., PID, TID, UID, system time, system name, and device info).

Meanwhile, we introduce some of our own semantic tags, such as ret_addr_t indicating that a memory location is holding a return address, stack_frame_t indicating that a memory location is holding a stack frame pointer, format_string_t indicating that a string is used as a format string argument, and malloc_arg_t indicating an argument of the malloc function (similarly, calloc_arg_t for the calloc function, etc.). Note that these tags reflect the properties of variables at those specific locations and hence do not participate in the type information propagation. They can bring important benefits to our targeted applications (Section 5).

REWARDS needs to know the program’s address space mapping, which will be used to locate the addresses of global variables and detect pointer types. In particular, REWARDS checks the target address range when determining if a pointer is a function pointer or a data pointer. Thus, when a binary starts executing with REWARDS, we first extract the coarse-grained address mapping from the /proc/pid/maps file, which defines the ranges of code and data sections including those from libraries, and the ranges of stack and heap (at that time). Then for each detailed address mapping such as .data, .bss and .rodata for all loaded files (including libraries), we extract the mapping using the API provided by PIN when the corresponding image file is loaded.

We have performed two sets of experiments to evaluate REWARDS: one is to evaluate its correctness, and the other is to evaluate its time and space efficiency. All the experiments were conducted on a machine with two 2.13GHz Pentium processors and 2GB RAM running Linux kernel 2.6.15.

We select 10 widely used utility programs from the following packages: procps-3.2.6 (with 19.1K LOC and containing command ps), iputils-20020927 (with 10.8K LOC and containing command ping), net-tools-1.60 (with 16.8K LOC and containing netstat), and coreutils-5.93 (with 117.5K LOC and containing the remaining test commands such as ls, pwd, and date). The reason for selecting these programs is that they contain many data structures related to the operating system and network communications. We run these utilities without command-line options except ping, which is run against localhost with a packet count of 4.

4.1 Evaluation of Accuracy

To evaluate the reverse engineering accuracy of REWARDS, we compare the derived data structure types with those declared in the program source code. To acquire the oracle information, we recompile the programs with debugging information, and then use libdwarf [1] to extract type information from the binaries. The libdwarf library is capable of presenting the stack and global variable mappings after compilation. For instance, global variables scattered across various places in the source code will be organized into a few data sections. The library allows us to see the organization. In particular, libdwarf extracts stack variables by presenting the mapping from their offsets in the stack frame to the corresponding types. For global variables, the output of libdwarf is program virtual addresses and their types. Such information allows us to conduct direct and automated comparison. Note that we only verify the types in the .data, .bss, and .rodata sections; other global data in sections such as .got and .ctors are not verified. For heap variables, since we use the execution context at allocation sites as the abstract representation, given an allocation context, we can locate it in the disassembled binary, then correlate it with program source code to identify the heap data structure definition, and finally compare it with REWARDS’s output. Although REWARDS extracts variable types for the entire program address space (including libraries), we only compare the results for user-level code.

The result for stack variables is presented in Figure 2(a). The figure presents the percentage of (1) functions that are actually executed, (2) data structures that are used in the executed functions (over all structures declared in those functions), and (3) data structures whose types are accurately recovered by REWARDS (over those in (2)). At runtime, it is often the case that even though a buffer is defined in the source code with size n, only part of the n bytes are used. Consequently, only those used ones are typed (the others are considered unused). We consider a buffer correctly typed if its bytes are either correctly typed or unused. From the figure, we can observe that, due to the nature of dynamic analysis, not all functions or data structures in a function are exercised and hence amenable to REWARDS. More importantly, REWARDS achieves an average of 97% accuracy (among these benchmarks) for the data structures that get exercised. For heap variables, the result is presented in Figure 2(b); the bars are similarly defined. REWARDS’s output perfectly matches the types in the original definitions when they are exercised. Note that some of the benchmarks are missing in Figure 2(b) (e.g., date) because their executions do not allocate any user-level heap structures. The result for global variables is presented in Figure 2(c), and REWARDS achieves over 85% accuracy.

[Figure 2: bar charts over the benchmark programs. (a) Accuracy on Stack Variables (percentage; series: Dynamically Executed Funs, Dynamically Exposed Types, REWARDS Accuracy). (b) Accuracy on Heap Variables (percentage; series: Dynamically Allocated Types, Dynamically Exercised Types, REWARDS Accuracy). (c) Accuracy on Global Variables (percentage; series: Dynamically Exercised Types, REWARDS Accuracy). (d) Performance Overhead (execution time in seconds; series: REWARDS, MemTrace, Normal Execution). (e) Space Overhead (shadow memory consumption in bytes; series: REWARDS).]

Figure 2. Evaluation results for REWARDS accuracy and efficiency

To explain why REWARDS cannot achieve 100% accuracy, we carefully examined the benchmarks and identified the following two reasons:

• Hierarchy loss. If a hierarchical structure becomes flat after compilation, we are not able to identify its hierarchy. This happens to structures declared as global variables or stack variables: the binary never accesses such a variable using the base address plus a local offset. Instead, it directly uses a global offset (starting from the base address of the global data section or a stack frame). In other words, multiple composite structures are flattened into one large structure. In contrast, such flattening does not happen to heap structures.

• Path-sensitive memory reuse. This often happens to stack variables. In particular, the compiler might assign different local variables declared in different program paths to the same memory address. As a result, the types of these variables are undesirably unified in our current design. A more thorough design would use a path-sensitive local offset to denote a stack variable.

Despite the imperfect accuracy, REWARDS still suits our targeted application scenarios, i.e., memory forensics and vulnerability fuzzing. For example, although REWARDS outputs a flat layout for all global and stack variables, we can still conduct vulnerability fuzzing because the absolute offsets of these variables are sufficient; and we can still construct hierarchical views of memory images as pointer types can be obtained.

4.2 Evaluation of Efficiency

We also measured the time and space overhead of REWARDS. We compared it with (1) a standard memory trace tool, MemTrace (shipped along with PIN-2.6), and (2) the normal execution of the program, to evaluate the performance overhead. The result is shown in Figure 2(d). Note that the normal execution data is barely visible in this figure because the values are very small (roughly at the 0.01 second level). We can observe that REWARDS causes a slowdown on the order of ten times compared with MemTrace, and on the order of thousands (or tens of thousands) of times compared with the normal execution.

For space overhead, we are interested in the space consumption by shadow type sets and constraint sets. Hence, we track the peak value of the shadow memory consumption. The result is shown in Figure 2(e). We can observe that the shadow memory consumption is around 10 Mbytes for these benchmarks. A special case is ping, which uses much less memory. The reason is that it has fewer function calls and memory allocations, which is also why it runs much faster than the other programs shown in Figure 2(d).

5 Applications of REWARDS

REWARDS can be applied to a number of applications. In this section, we demonstrate how REWARDS provides unique benefits to (1) memory image forensics and (2) binary vulnerability fuzz.

5.1 Memory Image Forensics

Memory image forensics is a process to extract meaningful information from a memory dump. Examples of such information are IP addresses that the application under investigation is talking to and files being accessed. Data structure definitions play a critical role in the extraction process. For instance, without data structure information, it is hard to decide if four consecutive bytes represent an IP address or just a regular value. REWARDS enables analyzing memory dumps for a binary without symbolic information. In this subsection, we demonstrate how REWARDS can be used to type reachable memory as well as some of the unreachable (i.e., dead) memory.

5.1.1 Typing Reachable Memory

In this case study, we demonstrate how we use REWARDS to discover IP addresses from a memory dump using the hierarchical view (Section 3.5). We run a web server nullhttpd-0.5.1. A client communicates with this server through wget (wget-1.10.2). The client has IP 10.0.0.11 and the server has IP 10.0.0.4. The memory dump is obtained from the server at the moment when a system call is invoked to close the client connection. Part of the memory dump is shown in Figure 3. The IPs are underlined in the figure. From the memory dump, it is very hard for human inspectors to identify those IPs without a meaningful view of the memory. We use REWARDS to derive the data structure definitions for nullhttpd and then construct a hierarchical view of the memory dump following the method described in Section 3.5.

The relevant part of the reconstructed view is presented in Figure 4(a). The root represents a pointer variable in the global section. The outgoing edge of the root leads to the data structure being pointed to. The edge label “struct 0x0804dd4f *” denotes that this is a heap data structure whose allocation PC (also its abstraction) is 0x0804dd4f. According to the view construction method, the memory region being pointed to is typed according to the derived definition of the data structure denoted by 0x0804dd4f, resulting in the second layer in Figure 4(a). The memory region starting at 0x08052170 is denoted by the node with the address label. The individual child nodes represent the different fields of the structure, e.g., the first field is a thread id according to the semantic tag pthread_t, the fourth field (with offset +12) denotes
...
08052170 b0 5b fe b7 b0 5b fe b7 05 00 00 00 02 00 92 7e
08052180 0a 00 00 0b 00 00 00 00 00 00 00 00 c7 b0 af 4a
08052190 c7 b0 af 4a 00 00 00 00 58 2a 05 08 00 00 00 00
080521a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
08052a50 00 00 00 00 59 31 01 00 4b 65 65 70 2d 41 6c 69
08052a60 76 65 00 00 00 00 00 00 00 00 00 00 00 00 00 00
08052a70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08052ee0 00 00 00 00 00 00 00 00 00 00 00 00 31 30 2e 30
08052ef0 2e 30 2e 34 00 00 00 00 00 00 00 00 00 00 00 00
08052f00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08052fe0 00 00 00 00 00 00 00 00 00 00 00 00 48 54 54 50
08052ff0 2f 31 2e 30 00 00 00 00 00 00 00 00 00 00 00 00
08053000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08053470 00 00 00 00 00 00 00 00 00 00 00 00 31 30 2e 30
08053480 2e 30 2e 31 31 00 00 00 00 00 00 00 00 00 00 00
08053490 47 45 54 00 00 00 00 00 2f 00 00 00 00 00 00 00
080534a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08053910 00 00 00 00 00 00 00 00 57 67 65 74 2f 31 2e 31
08053920 30 2e 32 00 00 00 00 00 00 00 00 00 00 00 00 00
08053930 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08053990 00 00 00 00 00 00 00 00 c8 00 00 00 00 00 00 00
080539a0 00 00 00 00 00 00 00 00 00 00 43 6c 6f 73 65 00
080539b0 00 00 00 00 00 00 00 00 00 00 00 00 52 00 00 00
080539c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08053a90 48 54 54 50 2f 31 2e 30 00 00 00 00 00 00 00 00
08053aa0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08053b20 74 65 78 74 2f 68 74 6d 6c 00 00 00 00 00 00 00
08053b30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08063ba0 01 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
08063bb0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
...

Figure 3. Part of a memory dump from null-httpd
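
To make concrete why type information matters for a dump like Figure 3, the sketch below decodes the eight bytes that start at 0x0805217c (the last four bytes of the row at 08052170 followed by the first four bytes of the row at 08052180) as the leading fields of a sockaddr_in. This is only an illustration: the function name decode_sockaddr_in is hypothetical, and the layout assumed here (a 2-byte family in little-endian host order, then a port and an IPv4 address in network order) is the usual one on 32-bit x86 Linux rather than something stated in the text.

```python
import socket
import struct


def decode_sockaddr_in(raw: bytes):
    """Decode the family, port and IPv4 address from the start of a sockaddr_in.

    Assumes an x86 (little-endian) host: sin_family is stored in host order,
    while sin_port and sin_addr are stored in network (big-endian) order.
    """
    (family,) = struct.unpack_from("<H", raw, 0)
    (port,) = struct.unpack_from("!H", raw, 2)
    address = socket.inet_ntoa(raw[4:8])
    return family, port, address


# Bytes at 0x0805217c..0x08052183, copied from Figure 3.
raw = bytes.fromhex("0200927e0a00000b")
family, port, address = decode_sockaddr_in(raw)
```

Without knowing that offset +12 of the structure holds a sockaddr, the same bytes could equally be read as two unrelated integers, which is the ambiguity that the derived definitions remove.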

a sockaddr structure. The last field (with offset +40) denotes another heap structure whose allocation site is 0x0804ddfb. Transitively, our method reconstructs the entire hierarchy.

The extraction of IP addresses is translated into a traversal over the view to identify those with the IP address semantic tags. Along the path 08050260 → 08052170 → 7e9200...0 → 0x0b00000a, a variable with the sin_addr type can be identified, which stores the client IP. The same IP can also be identified along the path 08050260 → 08052170 → 08052a58 → 10.0.0.11, with the field offset +2596. The field has the ip_addr_str_t tag, which is resolved at the return of a call to inet_ntoa(). REWARDS is able to isolate the server IP 10.0.0.4 as a string along the path 08050260 → 08051170 → 10.0.0.4 with the field offset +1172. Interestingly, this field does not have a semantic tag related to an IP address. The reason is that the field is simply a part of the request string (the host field in the HTTP Request Message), but it is not used in any type sinks that can resolve it as an IP. However, isolating the string also allows a human inspector to extract it as an IP.

To validate our result, we present in Figure 4(b) the corresponding symbolic definitions extracted from the source for comparison. Fields that are underlined are used during execution. In particular, struct CONNECTION corresponds to the abstraction struct 0x0804dd4f (node 08052170) and struct CONNDATA corresponds to struct 0x0804ddfb (node 08052a58). Observe that all fields of CONNECTION are precisely derived, except the pointer PostData, which is represented as an unused array in the inferred definition because the field is not used during execution. For the CONNDATA structure, all the exercised fields are extracted and correctly typed. Recall that we consider a field correctly typed if its offset is correctly identified and its composition bytes are either correctly typed or unused.

5.1.2 Typing Dead Memory

In this case, we demonstrate how to type dead memory, i.e., memory regions containing dead variables, using the slapper worm bot-master program. Slapper worm relies on P2P communications. The bot-master uses a program called pudclient to control the P2P botnet, such as launching TCP-flood, UDP-flood, and DNS-flood attacks. Our goal is to extract evidence from a memory dump of pudclient from the attacker’s machine.

Our experiment has two scenes: the investigator’s scene and the attacker’s scene. More specifically,

• Scene I: In the lab, the investigator runs the bot-master program pudclient to communicate with slapper bots to derive the data structures of pudclient.

• Scene II: In the wild, the attacker runs pudclient to control real slapper bots.

In Scene I, we run a number of slapper worm instances in a contained environment (at IP addresses ranging from 10.0.0.1 to 10.0.1.255). Then we launch pudclient with REWARDS and issue a series of commands such as listing the compromised hosts, and launching the UDPFlood, TCPFlood, and DNSFlood attacks. REWARDS extracts the data structure definitions for pudclient. Then in Scene II, we run pudclient again without REWARDS. Indeed, the attacker’s machine does not have any forensics tool running. Emulating the attacker, we issue some commands and then hibernate the machine. We then get the memory image of pudclient and use the data structure information derived in Scene I to investigate the image.
(a) Hierarchical view from REWARDS

08050260  struct _0x0804dd4f *  →  08052170
  +0      pthread_t               b7fe5bb0
  +4      int                     b7fe5bb0
  +8      socket                  00000005
  +12     struct sockaddr         7e920002 0b00000a 0...0
            sin_family            0002
            sin_port              7e92
            sin_addr              0b00000a
            sin_zero              0...0
  +28     time_t                  4aafb0c7
  +32     time_t                  4aafb0c7
  +36     unused [4]              00000000
  +40     struct _0x0804ddfb *  →  08052a58
    +0      char [11]             Keep-Alive
    +11     unused [1161]         0...0
    +1172   char [9]              10.0.0.4
    +1181   unused [247]          0...0
    +1428   char [9]              HTTP/1.0
    +1437   unused [1159]         0...0
    +2596   ip_addr_str_t         10.0.0.11
    +2606   unused [10]           0...0
    +2616   char [4]              GET
    +2620   unused [4]            00000000
    +2624   char [2]              /
    +2626   unused [1150]         0...0
    +3776   char [12]             Wget/1.10.2
    +3788   unused [116]          0...0
    +3904   short int             00c8
    +3906   unused [16]           0...0
    +3922   char [6]              Close
    +3928   unused [12]           0...0
    +3940   int                   00000052
    +3944   unused [208]          0...0
    +4152   char [9]              HTTP/1.0
    +4161   unused [135]          0...0
    +4296   char [10]             text/html
    +4306   unused [65654]        0...0
    +69960  short int             0001
    +69962  short int             0001
    +69964  short int             0001
    +69966  unused [8192]         0...0

(b) Data structure definition

    typedef struct {
        // incoming data
        char in_Connection[16];
        int in_ContentLength;
        char in_ContentType[128];
        char in_Cookie[1024];
        char in_Host[64];
        char in_IfModifiedSince[64];
        char in_PathInfo[128];
        char in_Protocol[16];
        char in_QueryString[1024];
        char in_Referer[128];
        char in_RemoteAddr[16];
        int in_RemotePort;
        char in_RequestMethod[8];
        char in_RequestURI[1024];
        char in_ScriptName[128];
        char in_UserAgent[128];
        // outgoing data
        short int out_status;
        char out_CacheControl[16];
        char out_Connection[16];
        int out_ContentLength;
        char out_Date[64];
        char out_Expires[64];
        char out_LastModified[64];
        char out_Pragma[16];
        char out_Protocol[16];
        char out_Server[128];
        char out_ContentType[128];
        char out_ReplyData[MAX_REPLYSIZE];
        short int out_headdone;
        short int out_bodydone;
        short int out_flushed;
        // user data
        char envbuf[8192];
    } CONNDATA;

    typedef struct {
        pthread_t handle;
        unsigned long int id;
        short int socket;
        struct sockaddr_in ClientAddr;
        time_t ctime; // Creation time
        time_t atime; // Last Access time
        char *PostData;
        CONNDATA *dat;
    } CONNECTION;

    CONNECTION *conn; // matched the root node

Figure 4. Comparison between the REWARDS-derived hierarchical view and source code definition
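
The query "find all IP addresses" over a view like Figure 4(a) amounts to a tree traversal that filters nodes by semantic tag. The sketch below is a minimal illustration, not the REWARDS data model: the Node class, its fields, and the find_ips helper are hypothetical, and only a few of the nodes from Figure 4(a) are reproduced. It assumes that a sin_addr value is shown as a little-endian 32-bit word, as in the figure.

```python
import socket
import struct
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str               # address or field offset shown in the view
    tag: str                 # derived type or semantic tag
    value: object = None     # word value or decoded string, if any
    children: list = field(default_factory=list)


def walk(node, path=()):
    """Yield (path, node) pairs in depth-first order."""
    path = path + (node.label,)
    yield path, node
    for child in node.children:
        yield from walk(child, path)


def find_ips(root):
    """Collect nodes whose tag marks them as holding an IP address."""
    hits = []
    for path, node in walk(root):
        if node.tag == "sin_addr":
            # The view shows the 4 bytes as a little-endian word.
            raw = struct.pack("<I", node.value)
            hits.append((" -> ".join(path), socket.inet_ntoa(raw)))
        elif node.tag == "ip_addr_str_t":
            hits.append((" -> ".join(path), node.value))
    return hits


# A small excerpt of Figure 4(a).
conndata = Node("08052a58", "struct _0x0804ddfb", children=[
    Node("+1172", "char [9]", "10.0.0.4"),
    Node("+2596", "ip_addr_str_t", "10.0.0.11"),
])
connection = Node("08052170", "struct _0x0804dd4f", children=[
    Node("+12", "struct sockaddr", children=[
        Node("sin_addr", "sin_addr", 0x0B00000A),
    ]),
    Node("+40", "struct _0x0804ddfb *", children=[conndata]),
])
view = Node("08050260", "struct _0x0804dd4f *", children=[connection])

ip_hits = find_ips(view)
```

On this excerpt the traversal reports the client address twice (once from sin_addr and once from the ip_addr_str_t field), while the string at +1172 is not reported because it carries no IP-related tag, matching the discussion in Section 5.1.1.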
bfffd140 05 00 00 00 6b 00 00 00 69 00 00 00 00 00 00 00
bfffd150 00 00 00 00 38 ea ff bf 00 00 00 00 00 00 00 01
bfffd160 2c 00 00 00 67 45 8b 6b 0e 00 00 00 00 00 00 00
bfffd170 0a 00 00 63 0f 27 00 00 9f 86 01 00 9f 86 01 00
bfffd180 1c ea ff bf 10 ea ff bf 6a f2 b2 4a 7a 4a 0e 00
bfffd190 22 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bfffd1a0 6a f2 b2 4a 7a 4a 0e 00 f2 f3 8d 8c 00 00 00 00
bfffd1b0 00 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00
bfffd1c0 64 6e 73 66 6c 6f 6f 64 00 00 00 00 00 00 00 00
bfffd1d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
bfffd5c0 c0 d1 ff bf 00 00 00 00 02 ca 04 08 00 00 00 00
bfffd5d0 00 00 00 00 00 00 00 00 02 ca 04 08 02 ca 04 08
bfffd5e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bfffd5f0 00 00 00 00 00 00 00 00 00 00 00 00 04 d6 ff bf
bfffd600 64 6e 73 66 6c 6f 6f 64 00 00 00 00 00 00 00 00
bfffd610 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
bfffe5b0 00 00 00 00 00 00 00 00 0e 00 00 00 00 00 00 00
bfffe5c0 00 00 00 00 02 00 4e 34 0a 00 00 0b 00 00 00 00
bfffe5d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bfffe5e0 00 00 00 00 00 00 00 00 00 00 00 00 e0 f5 ff bf
bfffe5f0 a0 2d 05 08 e0 f5 ff bf a0 13 05 08 00 00 00 00
bfffe600 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
bfffea00 00 00 00 00 00 00 00 00 00 00 00 00 10 ea ff bf
bfffea10 01 00 00 00 00 00 00 00 e5 de f2 49 46 00 00 00
bfffea20 67 45 8b 6b 10 00 00 00 e8 be e6 71 0a 00 00 34
bfffea30 0a 00 01 33 0a 00 00 0b 0a 00 00 04 00 00 00 00
bfffea40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
...
bffff5c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bffff5d0 01 00 00 00 80 00 00 00 80 00 00 00 ff f7 ff bf
bffff5e0 00 00 00 00 00 00 00 00 f3 f7 ff bf 67 45 8b 6b
bffff5f0 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bffff600 01 00 00 00 c0 f6 ff bf 28 f6 ff bf fb c7 04 08
bffff610 02 00 00 00 dc 3a 1f b6 d4 df 04 08 dc 3a 1f b6
bffff620 00 00 00 00 dc 3a 1f b6 88 f6 ff bf a2 de 0d b6
bffff630 02 00 00 00 b4 f6 ff bf c0 f6 ff bf f6 5b ff b7

Figure 5. Memory dump for Slapper worm control program when exiting the control interface
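
The scan for return-address candidates in dead stack memory can be sketched as a pass over aligned 4-byte words that keeps those falling inside a code range. The sketch below is illustrative only: the function name is hypothetical, and the two address ranges in CODE_RANGES are made-up values chosen for the example; in practice they would come from the address space mapping (for instance /proc/pid/maps, as described in Section 4). A 32-bit little-endian stack is assumed.

```python
import struct

# Hypothetical code ranges, for illustration only (not taken from the paper).
CODE_RANGES = [
    (0x08048000, 0x08050000),
    (0xB6000000, 0xB6100000),
]


def find_return_address_candidates(dump: bytes, base: int, code_ranges):
    """Return (stack address, value) pairs whose value points into a code range.

    dump is a run of stack bytes that starts at address base; words are read
    as aligned 32-bit little-endian values.
    """
    candidates = []
    for offset in range(0, len(dump) - 3, 4):
        (word,) = struct.unpack_from("<I", dump, offset)
        if any(lo <= word < hi for lo, hi in code_ranges):
            candidates.append((base + offset, word))
    return candidates


# The row at bffff620 from Figure 5.
row = bytes.fromhex("00000000dc3a1fb688f6ffbfa2de0db6")
hits = find_return_address_candidates(row, 0xBFFFF620, CODE_RANGES)
```

With these assumed ranges, the only candidate in the row is the word at 0xbffff62c, the location that the text identifies as holding a return address. Each candidate would then be mapped back to its function so that the function's derived layout can be applied to the dead activation record.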

We construct the hierarchical view and try to identify IP addresses from the view. However, the hierarchical view can only map the memory locations that are alive, namely those reachable from global and stack (pointer) variables. Here, we take an extra step to type the dead (unreachable) data. As described in Section 3.5, our technique scans the stack space lower than the current (the lowest and live) activation record and looks for values that are in the range of the code section, as they are very likely return addresses. Four such values are identified. One example and its memory context are shown in Figure 5. In this memory dump snippet, the return address, as underlined, is located at address 0xbffff62c. Our technique further identifies that the corresponding function invocation is to 0x0804a708. Hence, we use the data structure definition of fun_0x0804a708 to type the activation record. The definition and the typed values are shown in Table 3. Observe that a number of IPs (fields with ip_addr_t) are identified. We also spot the bot command “dnsflood” at -9324 and -8236. Note that these two fields have the input_t tag as part of their derived definition, indicating they hold values from input.

5.2 Vulnerability Fuzz

It is a challenging task to detect and confirm vulnerabilities in a given binary without symbolic information. Previously in [26], we have proposed a dynamic analysis approach that can decide if a vulnerability suspect is a true positive by generating a concrete exploit. The basic idea is to first use existing static tools to identify vulnerability candidates, which are often of large quantity; then benign executions are mutated to generate exploits. Mutations are directed by dynamic information called input lineage, which denotes the set of input elements that is used to compute a value at a given execution point, usually a vulnerability candidate. Vulnerability-specific patterns are followed during mutation. One example pattern is to exponentially expand an input string in the lineage of a candidate buffer with the goal of generating an overflow exploit. In that project, we had difficulty finding publicly available, binary-level vulnerability detectors to use as the front end. REWARDS helps address this issue by deriving both variable syntax and semantics from a subject binary. Next, we present our experience of using REWARDS to identify vulnerability suspects and then using our prior system (a fuzzer) to confirm them.

For this study, we design a static vulnerability suspect detector that relies on the variable type information derived by REWARDS. The result of the detector is passed to our lineage-based fuzzer to generate exploits. In the following, we present how REWARDS helps identify various types of vulnerability suspects.

• Buffer overflow vulnerability. Buffer overflows could happen in three different places: stack, heap, and global areas. As such, we define three types of buffer overflow vulnerability patterns. Specifically, for stack overflow, if a stack layout contains a buffer and its content comes from user input, we consider it a suspect. Note that this can be easily facilitated by REWARDS’s typing algorithm: a semantics tag input_t is defined to indicate that a variable receives its value from external input. The tag is only susceptible to the forward flow but not the backward flow. In the stack layout derived by REWARDS, if a buffer’s type set contains an input_t tag, it is considered vulnerable. For heap overflow, we consider two cases: one is to exploit heap management data structure outside the user-allocated heap chunk; and the other is to exploit user-defined function pointers inside the heap chunk. Detecting the former case is simply to check if a heap structure contains a buffer
Offset  Type               Size  Mem Addr  Content
-9432   void*              4     bfffd154  38 ea ff bf
-9428   char*              4     bfffd158  00 00 00 00
-9420   int                4     bfffd160  2c 00 00 00
-9416   int                4     bfffd164  67 45 8b 6b
-9412   int                4     bfffd168  0e 00 00 00
-9408   int                4     bfffd16c  00 00 00 00
-9404   ip_addr_t          4     bfffd170  0a 00 00 63
-9400   port_t             4     bfffd174  0f 27 00 00
-9396   int                4     bfffd178  9f 86 01 00
-9392   int                4     bfffd17c  9f 86 01 00
-9388   void*              4     bfffd180  1c ea ff bf
-9384   void*              4     bfffd184  10 ea ff bf
-9376   timeval.tv_sec     4     bfffd18c  7a 4a 0e 00
-9372   timeval.tv_usec    4     bfffd190  22 00 00 00
-9368   int                4     bfffd194  00 00 00 00
-9352   int                4     bfffd1a4  7a 4a 0e 00
-9348   int                4     bfffd1a8  f2 f3 8d 8c
-9344   int                4     bfffd1ac  00 00 00 00
-9332   int                4     bfffd1b8  01 00 00 00
-9328   int                4     bfffd1bc  02 00 00 00
-9324   char[9],input_t    9     bfffd1c0  64 6e..64
-8300   char*              4     bfffd5c0  c0 d1 ff bf
-8236   char[9],input_t    9     bfffd600  64 6e..64
-8227   char[28]           28    bfffd609  00 .. 00
-4236   void*              4     bfffe5a0  00 00 00 00
-4156   struct_0x804834e*  4     bfffe5f0  a0 2d 05 08
-4152   void*              4     bfffe5f4  e0 f5 ff bf
-3104   char*              4     bfffea0c  10 ea ff bf
-3088   char[16]           16    bfffea1c  46 00 00 00
-3068   ip_addr_t          4     bfffea30  0a 00 01 33
-3064   ip_addr_t          4     bfffea34  0a 00 00 0b
-3060   ip_addr_t          4     bfffea38  0a 00 00 04
-3056   ip_addr_t          4     bfffea3c  0a 00 00 04
-0088   int                4     bffff5d4  80 00 00 00
-0084   int                4     bffff5d8  80 00 00 00
-0080   int                4     bffff5dc  ff f7 ff bf
-0004   stack_frame_t      4     bffff628  88 f6 ff bf
+0000   ret_addr_t         4     bffff62c  a2 de 0d b6
+0004   int                4     bffff630  02 00 00 00
+0008   char*              4     bffff634  b4 f6 ff bf

Table 3. Result on the unreachable memory type using type fun_0x804a708
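
As an illustration of how the rows of Table 3 can be read (this sketch is not part of REWARDS), the following Python assumes that the Content column lists bytes in memory order on a little-endian x86 machine, that fields tagged as IP addresses hold IPv4 addresses in network byte order, and that offsets are measured from the slot holding the return address at bffff62c. The helper names are hypothetical.

```python
import socket
import struct

# Assumption: address of the return-address slot (the +0000 row of Table 3).
RET_ADDR_SLOT = 0xBFFFF62C


def offset_of(mem_addr: int) -> int:
    """Offset of a stack slot relative to the return-address slot."""
    return mem_addr - RET_ADDR_SLOT


def as_pointer(content: str) -> int:
    """Interpret four dumped bytes as a little-endian 32-bit pointer."""
    return struct.unpack("<I", bytes.fromhex(content))[0]


def as_ipv4(content: str) -> str:
    """Interpret four dumped bytes as an IPv4 address in network byte order."""
    return socket.inet_ntoa(bytes.fromhex(content))


print(offset_of(0xBFFFD154))           # -9432
print(hex(as_pointer("38 ea ff bf")))  # 0xbfffea38
print(as_ipv4("0a 00 00 63"))          # 10.0.0.99
```

Under these assumptions, the void* at offset -9432 points to address bfffea38, which is itself a row of the table.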

field that is input-relevant, in a way similar to stack vulnerability detection. For the latter case, the detector scans the derived layout of a heap structure to check the presence of both an input-relevant buffer field and a function pointer field. Vulnerabilities in the global memory region are handled similarly.

• Integer overflow vulnerability. Integer overflow occurs when an integer exceeds the maximum value that a machine can represent. Integer overflow itself may not be harmful (e.g., gcc actually leverages integer overflow to manipulate control flow path condition [38]), but if an integer variable is dependent on user input without any sanity check and it is used as an argument to malloc-family functions, then an integer overflow vulnerability is likely. In particular, overflowed values passed to malloc functions usually result in heap buffers being smaller than they are supposed to be. Consequently, heap overflows occur. For this type of vulnerability, our detector checks the actual arguments to malloc-family function invocations: if an integer parameter has both malloc_arg_t and input_t tags, an integer overflow vulnerability suspect will be reported.

• Format string vulnerability. The format string vulnerability pattern involves a user input flowing into a format string argument. Thus, we introduce a semantics tag format_string_t, which is only resolved at invocations to printf-family functions. If a variable’s type set contains both input_t and format_string_t tags, a format string vulnerability suspect is reported.

Besides facilitating vulnerability suspect identification, the information generated by REWARDS can also help compose exploits. For instance, it is critical to know the distance between a vulnerable stack buffer and a return address, i.e., a variable with the ret_addr_t tag, in order to construct a stack overflow exploit. Similarly, it is important to know the distance between a heap buffer and a heap function pointer for composing a heap overflow-based code injection attack. Such information is provided by REWARDS.

Program          #Buffer Overflow  #Integer Overflow  #Format String
ncompress-4.2.4  1                 0                  0
bftpd-1.0.11     3                 0                  0
gzip-1.2.4       3                 0                  0
nullhttpd-0.5.0  5                 2                  0
xzgv-5.8         3                 8                  0
gnuPG-1.4.3      0                 3                  0
ipgrab-0.9.9     0                 5                  0
cfingerd-1.4.3   4                 0                  1
ngircd-0.8.2     12                0                  1

Table 4. Number of vulnerability suspects reported with help of REWARDS

We applied our REWARDS-based detector to examine several programs shown in the 1st column of Table 4. The detector reported a number of vulnerability suspects based on the aforementioned vulnerability patterns. The total number of vulnerabilities of each type is presented in the remaining columns. Observe that our detector does not produce many suspects for these programs and hence can serve as a tractable front end for our fuzzer. The fuzzer then tries to generate exploits to convict the suspects. Details of each confirmed vulnerable data structure are shown in the 2nd column of Table 5. The field symbols do not represent their symbolic names, which we do not know, but rather the type tags derived for these fields. For instance, format_string_t denotes that the field is essentially a format string; sockaddr_in indicates that the field holds a socket address. The 3rd column presents the input category that is relevant to the vulnerable data structure.
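
The tag-based patterns described above amount to simple membership checks over a derived layout. The following Python is a minimal sketch under stated assumptions, not the REWARDS detector itself: the Field class, the find_suspects function, the rule that array types count as buffers, and the func_ptr_t tag for function pointer fields are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    offset: int                    # offset within the frame or heap chunk
    type_name: str                 # derived type, e.g. "char[12]" or "int"
    tags: frozenset = frozenset()  # semantics tags attached to the field


def is_buffer(f: Field) -> bool:
    # Crude assumption: array types such as "char[12]" are buffers.
    return f.type_name.endswith("]")


def find_suspects(layout, region):
    """Return (offset, pattern) pairs for fields matching a pattern."""
    suspects = []
    tainted = [f for f in layout if is_buffer(f) and "input_t" in f.tags]
    if region == "stack":
        suspects += [(f.offset, "stack overflow") for f in tainted]
    else:  # heap chunks; global memory is treated the same way here
        suspects += [(f.offset, "heap overflow") for f in tainted]
        # "func_ptr_t" is a made-up tag standing for a function pointer field.
        if tainted and any("func_ptr_t" in f.tags for f in layout):
            suspects.append((tainted[0].offset, "function pointer overwrite"))
    for f in layout:
        if {"malloc_arg_t", "input_t"} <= f.tags:
            suspects.append((f.offset, "integer overflow"))
        if {"format_string_t", "input_t"} <= f.tags:
            suspects.append((f.offset, "format string"))
    return suspects


# A made-up stack frame: an input-relevant buffer and an input-relevant malloc size.
frame = [
    Field(-60, "char[12]", frozenset({"input_t"})),
    Field(-4, "stack_frame_t"),
    Field(0, "ret_addr_t"),
    Field(8, "int", frozenset({"input_t", "malloc_arg_t"})),
]
print(find_suspects(frame, "stack"))
```

Each suspect is only a candidate; in the workflow described here, confirmation is left to the fuzzer.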
Benchmark        Suspicious Data Structure            Input    Offset      Vulnerability Type

ncompress-4.2.4  fun_0x08048e76 { -1052: char[13],    argv[1]  {0..11}     Stack overflow
                   -1039: unused[1023],...
                   -0008: char*,
                   -0004: stack_frame_t,
                   +0000: ret_addr_t,
                   +0004: char**}

bftpd-1.0.11     fun_0x080494b8 { -0064: char*,       recv     {0..3}      Stack overflow
                   -0060: char[12],
                   -0048: unused[44],
                   -0004: stack_frame_t,
                   +0000: ret_addr_t,
                   +0004: char*}

gzip-1.2.4       bss_0x08053f80 { ...                 argv[1]  {0..6}      Global overflow
                   +244128: char[8],
                   +244136: unused[1016],
                   +245152: char*,...}

nullhttpd-0.5.0  heap_0x0804f205 { +0000: char[11],   recv     {607,608}   Integer overflow
                   +0011: unused[5],
                   +0016: int,... }
                 heap_0x0804c41f { +0000: void[29],   recv     {661..690}  Heap overflow
                   +0029: unused[1024]}

xzgv-5.8         bss_0x0809ac80 { ...                 fread    {4..11}     Integer overflow
                   +91952: int,
                   +91956: int,...}

gnuPG-1.0.5      fun_0x080673fc { ...,                fread    {2..5}      Integer overflow
                   -0176: char[6],unused[2],
                   -0168: int,int,...}
                 heap_0x080afec1 { +0000: int,...,    fread    {6..10}     Heap overflow
                   +0036: void[5] }

ipgrab-0.9.9     fun_0x0804d06b { ...,                fread    {20..23}    Integer overflow
                   -0056: int,
                   -0052: int, int,...}
                 heap_0x0805a976 { +0000: void[60] }  fread    {40..100}   Heap overflow

cfingerd-1.4.3   fun_0x080496b8 { ...,                read     {0..3}      Format string
                   -0440: struct sockaddr_in,
                   -0424: format_string_t[34],
                   -0390: unused[174],
                   -0216: char[4],...}

ngircd-0.8.2     fun_0x0805f9a5 { ...,                recv     {12..15}    Format string
                   -0284: format_string_t[76],
                   -0208: unused[204],
                   -0004: stack_frame_t,
                   +0000: ret_addr_t,...}

Table 5. Result from our vulnerability fuzzer with help of REWARDS
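
The layouts in Table 5 list field offsets relative to the return address slot, so the length an overflow must cover follows by subtraction. The sketch below is illustrative only: it transcribes the listed fields of the ncompress entry (elided fields are left out), and the function names and the 4-byte target size are assumptions made for this example.

```python
# Fields of the ncompress frame as listed in Table 5 (offset -> derived type).
# Fields elided with "..." in the table are omitted here.
NCOMPRESS_FRAME = {
    -1052: "char[13]",
    -1039: "unused[1023]",
    -8: "char*",
    -4: "stack_frame_t",
    0: "ret_addr_t",
    4: "char**",
}


def distance_to(layout, buffer_offset, target_type):
    """Bytes between the start of a buffer and the first field of target_type."""
    target_offset = next(off for off, t in layout.items() if t == target_type)
    return target_offset - buffer_offset


def overwrite_length(layout, buffer_offset, target_type, target_size=4):
    """Bytes written from the buffer start needed to cover the whole target."""
    return distance_to(layout, buffer_offset, target_type) + target_size


print(distance_to(NCOMPRESS_FRAME, -1052, "ret_addr_t"))       # 1052
print(overwrite_length(NCOMPRESS_FRAME, -1052, "ret_addr_t"))  # 1056
```

The distance of 1052 bytes is the gap between the char[13] buffer and the return address; covering a 4-byte return address therefore takes 1056 bytes under these assumptions.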

For example, the char[12] buffer in bftpd denotes a packet received from outside (the recv category). Note that the input categories are conveniently implemented as semantics tags in REWARDS. The 4th column offset represents the input offsets reported by our fuzzer. They represent the places that are mutated to generate the real exploits. The REWARDS-based vulnerability detector also emits vulnerability types (shown in the 5th column) based on the vulnerability patterns matched. Consider the first benchmark ncompress: Its entry in the table indicates that the char[13] buffer inside a function starting with PC 0x08048e76 is vulnerable to stack buffer overflow. The buffer receives values from the second command line option (argv[1]). Our data lineage fuzzer mutates the lineage of the buffer, which are the first 12 input items (offset 0 to 11) to generate the exploit. From the data structure in the 2nd column, the exploit has to contain a byte string longer than 1052 bytes to overwrite the return address at the bottom. Other vulnerabilities can be similarly apprehended.

6 Discussion

REWARDS has a number of limitations: (1) As a dynamic analysis-based approach, REWARDS cannot achieve full coverage of data structures defined in a program. Instead, the coverage of REWARDS relies on those data structures that are actually created and accessed during a particular run of the binary. (2) REWARDS is not fully on-line as our timestamp-based on-line algorithm may leave some variables unresolved by the time they are de-allocated, and thus the off-line companion procedure is needed to make the system sound. A fully on-line type resolution algorithm is our future work. (3) Based on PIN, REWARDS does not support the reverse engineering of kernel-level data structures. (4) REWARDS does not work with obfuscated code. Thus it is possible that an adversary can write an obfuscated program to dodge REWARDS, for example, by avoiding touching the type sinks we define. (5) Besides the general data structures, REWARDS has yet to support the extraction of other data types, such as the format of a specific type of files (e.g., ELF files, multimedia files), and
browser-related data types (e.g., URL, cookie). Moreover, REWARDS does not distinguish between signed and unsigned integers in our current implementation.

7 Related Work

Type inference. Some programming languages, such as ML, do not explicitly declare types. Instead, types are inferred from programs. Typing constraints are derived from program statements statically and programs are typed by solving these constraints. Notable type inference algorithms include Hindley-Milner algorithm [29], Cartesian Product algorithm [3], iterative type analysis [13], object-oriented type inference [33], and aggregate structure identification [35].

These techniques, like REWARDS, rely on type unification, namely, variables connected by operators shall have the same type. However, these techniques assume program source code and they are static, that is, typing constraints are generated from source code at compile time. For REWARDS, we only assume binaries without symbolic information, in which high level language artifacts are all broken down to machine level entities, such as registers, memory addresses, and instructions. REWARDS relies on type sinks to obtain the initial type and semantics information. Variables are then typed through unification with type sinks during execution.

Lately, Balakrishnan et al. [4, 5, 36] showed that analyzing executables alone can largely discover syntactic structures of variables, such as sizes, field offsets, and simple structures. Their technique entails points-to analysis and abstract interpretation at binary level. They cannot handle obfuscated binaries and dynamically loaded libraries. Furthermore, the inaccuracy of binary points-to analysis makes it hard to type heap variables. In comparison, our technique is relatively simple, with the major hindrances to static analysis (e.g., points-to relations and dynamically loaded libraries) addressed via dynamic analysis.

Abstract type inference. Abstract type inference [32] is to group typed variables according to their semantics. For example, variables that are meant to store money, zip codes, ages, etc., are clustered based on their intentions, even though they may have the same integer type. Such an intention is called an abstract type. The technique relies on the Hindley-Milner type inference algorithm. Recently, dynamic abstract type inference was proposed [24] to infer abstract types from execution. Regarding the goal of performing semantics-aware typing, these techniques and ours are similar. However, they work at the source code level whereas ours works at the binary level. Our technique further derives syntactic type structures.

Decompilation. Decompilation is a process of reconstructing program source code from lower-level languages (e.g., assembly or machine code) [14, 20, 6]. It usually involves reconstructing variable types [31, 19]. By using unification, Mycroft [31] extends the Hindley-Milner algorithm [29] and delays unification until all constraints are available. Recently, Dolgova and Chernov [19] presented an iterative algorithm that uses a lattice over the properties of data types for reconstruction.

All these techniques are static and hence share the same limitations of static type inference and they only derive simple syntactic structures. Moreover, they aim to get execution-equivalent code and do not pay attention to whether the recovered types reflect the original declarations and have the same structures.

Protocol format reverse engineering. Recent efforts in protocol reverse engineering involve using dynamic binary analysis (in particular input data taint analysis) to reveal the format of protocol messages, facilitated by instruction semantics (e.g., Polyglot [9]) or execution context (e.g., AutoFormat [25]). Recently, it has been shown that the BNF structure of a given protocol with multiple messages can be derived [40, 17, 28]; and the format of out-going messages as well as encrypted messages can be revealed [8, 39]. In particular, REWARDS shares the same insight as Dispatcher [8] for type inference and semantics extraction. The difference is that Dispatcher and other protocol reverse engineering techniques mainly focus on live input and output messages, whereas we strive to reveal general data structures in a program. Meanwhile, we care more about the detailed in-memory layout of program data, motivated by our different targeted application scenarios.

Memory forensics and vulnerability discovery. FATKit [34] is a toolkit to facilitate the extraction, analysis, aggregation, and visualization of forensic data. Their technique is based on pre-defined data structures extracted from program source code to type memory dumps. This is also the case for other similar systems (e.g., [12, 30, 2]). KOP [11] is an effective system that can map dynamic kernel objects with nearly complete coverage and perfect accuracy. It also relies on program source code and uses an inter-procedural points-to analysis to compute all possible types for generic pointers. There are several other efforts [37, 18] that use data structure signatures to scan and type memory. Complementing these efforts, REWARDS extracts data structure definitions and reconstructs hierarchical in-memory layouts from binaries.

There is a large body of research in vulnerability discovery such as Archer [41], EXE [10], Bouncer [15], BitScope [7], DART [22], and SAGE [23, 21]. REWARDS complements these techniques by enabling identification of vulnerability suspects directly from binaries.

8 Conclusion

We have presented the REWARDS reverse engineering system that automatically reveals data structures in a binary based on dynamic execution. REWARDS involves an algorithm that performs data flow-based type attribute forward propagation and backward resolution. Driven by the type information derived, REWARDS is also capable of reconstructing the structural and semantic view of in-memory data layout. Our evaluation using a number of real-world programs indicates that REWARDS achieves high accuracy in revealing data structures accessed during an execution. Furthermore, we demonstrate the benefits of REWARDS to two application scenarios: memory image forensics and binary vulnerability discovery.

9 Acknowledgment

We would like to thank the anonymous reviewers for their insightful comments. We are grateful to Xuxian Jiang and Heng Yin for earlier discussions and help on this and related problems. This research is supported, in part, by the Office of Naval Research (ONR) under grant N00014-09-1-0776 and by the National Science Foundation (NSF) under grant 0720516. Any opinions, findings, and conclusions or recommendations in this paper are those of the authors and do not necessarily reflect the views of the ONR or NSF.

References

[1] Libdwarf. http://reality.sgiweb.org/davea/dwarf.html.
[2] Mission critical linux. In Memory Core Dump, http://oss.missioncriticallinux.com/projects/mcore/.
[3] O. Agesen. The cartesian product algorithm: Simple and precise type inference of parametric polymorphism. In Proceedings of the 9th European Conference on Object-Oriented Programming (ECOOP’95), pages 2–26, London, UK, 1995. Springer-Verlag.
[4] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In Proceedings of International Conference on Compiler Construction (CC’04), pages 5–23. Springer-Verlag, 2004.
[5] G. Balakrishnan and T. Reps. Divine: Discovering variables in executables. In Proceedings of International Conf. on Verification Model Checking and Abstract Interpretation (VMCAI’07), Nice, France, 2007. ACM Press.
[6] P. T. Breuer and J. P. Bowen. Decompilation: the enumeration of types and grammars. ACM Trans. Program. Lang. Syst., 16(5):1613–1647, 1994.
[7] D. Brumley, C. Hartwig, M. G. Kang, Z. Liang, J. Newsome, P. Poosankam, D. Song, and H. Yin. Bitscope: Automatically dissecting malicious binaries, 2007. Technical Report CMU-CS-07-133, Carnegie Mellon University.
[8] J. Caballero, P. Poosankam, C. Kreibich, and D. Song. Dispatcher: Enabling active botnet infiltration using automatic protocol reverse-engineering. In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS’09), pages 621–634, Chicago, Illinois, USA, 2009.
[9] J. Caballero and D. Song. Polyglot: Automatic extraction of protocol format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS’07), pages 317–329, Alexandria, Virginia, USA, 2007.
[10] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. Exe: automatically generating inputs of death. In Proceedings of the 13th ACM conference on Computer and communications security (CCS’06), pages 322–335, Alexandria, Virginia, USA, 2006. ACM.
[11] M. Carbone, W. Cui, L. Lu, W. Lee, M. Peinado, and X. Jiang. Mapping kernel objects to enable systematic integrity checking. In The 16th ACM Conference on Computer and Communications Security (CCS’09), pages 555–565, Chicago, IL, USA, 2009.
[12] A. Case, A. Cristina, L. Marziale, G. G. Richard, and V. Roussev. Face: Automated digital evidence discovery and correlation. Digital Investigation, 5(Supplement 1):S65–S75, 2008. The Proceedings of the Eighth Annual DFRWS Conference.
[13] C. Chambers and D. Ungar. Iterative type analysis and extended message splitting: Optimizing dynamically-typed object-oriented programs. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pages 150–164, 1990.
[14] C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, 1994.
[15] M. Costa, M. Castro, L. Zhou, L. Zhang, and M. Peinado. Bouncer: securing software by blocking bad input. In Proceedings of the 21st ACM SIGOPS symposium on Operating systems principles (SOSP’07), pages 117–130, Stevenson, Washington, USA, 2007. ACM.
[16] A. Cozzie, F. Stratton, H. Xue, and S. T. King. Digging for data structures. In Proceeding of 8th Symposium on Operating System Design and Implementation (OSDI’08), pages 231–244, San Diego, CA, December, 2008.
[17] W. Cui, M. Peinado, K. Chen, H. J. Wang, and L. Irun-Briz. Tupni: Automatic reverse engineering of input formats. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS’08), pages 391–402, Alexandria, Virginia, USA, October 2008.
[18] B. Dolan-Gavitt, A. Srivastava, P. Traynor, and J. Giffin. Robust signatures for kernel data structures. In Proceedings of the 16th ACM conference on Computer and communications security (CCS’09), pages 566–577, Chicago, Illinois, USA, 2009. ACM.
[19] E. N. Dolgova and A. V. Chernov. Automatic reconstruction of data types in the decompilation problem. Program. Comput. Softw., 35(2):105–119, 2009.
[20] M. V. Emmerik and T. Waddington. Using a decompiler for real-world source recovery. In Proceedings of the 11th Working Conference on Reverse Engineering, pages 27–36, 2004.
[21] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-based whitebox fuzzing. In Proceedings of the ACM SIGPLAN
Conference on Programming Language Design and Implementation (PLDI’08), pages 206–215, Tucson, AZ, USA, 2008. ACM.
[22] P. Godefroid, N. Klarlund, and K. Sen. Dart: directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation (PLDI’05), pages 213–223, Chicago, IL, USA, 2005. ACM.
[23] P. Godefroid, M. Levin, and D. Molnar. Automated whitebox fuzz testing. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), San Diego, CA, February 2008.
[24] P. J. Guo, J. H. Perkins, S. McCamant, and M. D. Ernst. Dynamic inference of abstract types. In Proceedings of the 2006 international symposium on Software testing and analysis (ISSTA’06), pages 255–265, Portland, Maine, USA, 2006. ACM.
[25] Z. Lin, X. Jiang, D. Xu, and X. Zhang. Automatic protocol format reverse engineering through context-aware monitored execution. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), San Diego, CA, February 2008.
[26] Z. Lin, X. Zhang, and D. Xu. Convicting exploitable software vulnerabilities: An efficient input provenance based approach. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’08), Anchorage, Alaska, USA, June 2008.
[27] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’05), pages 190–200, Chicago, IL, USA, 2005.
[28] P. Milani Comparetti, G. Wondracek, C. Kruegel, and E. Kirda. Prospex: Protocol Specification Extraction. In IEEE Symposium on Security & Privacy, pages 110–125, Oakland, CA, 2009.
[29] R. Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences, 17:348–375, 1978.
[30] P. Movall, W. Nelson, and S. Wetzstein. Linux physical memory analysis. In Proceedings of the USENIX Annual Technical Conference, pages 39–39, Anaheim, CA, 2005. USENIX Association.
[31] A. Mycroft. Type-based decompilation (or program reconstruction via type reconstruction). In Proceedings of the 8th European Symposium on Programming Languages and Systems (ESOP’99), pages 208–223, London, UK, 1999. Springer-Verlag.
[32] R. O’Callahan and D. Jackson. Lackwit: a program understanding tool based on type inference. In Proceedings of the 19th international conference on Software engineering, pages 338–348, Boston, Massachusetts, United States, 1997. ACM.
[33] J. Palsberg and M. I. Schwartzbach. Object-oriented type inference. In OOPSLA ’91: Conference proceedings on Object-oriented programming systems, languages, and applications, pages 146–161, Phoenix, Arizona, United States, 1991. ACM.
[34] N. L. Petroni, Jr., A. Walters, T. Fraser, and W. A. Arbaugh. Fatkit: A framework for the extraction and analysis of digital forensic data from volatile system memory. Digital Investigation, 3(4):197–210, 2006.
[35] G. Ramalingam, J. Field, and F. Tip. Aggregate structure identification and its application to program analysis. In Proceedings of the 26th ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL’99), pages 119–132, San Antonio, Texas, 1999. ACM.
[36] T. W. Reps and G. Balakrishnan. Improved memory-access analysis for x86 executables. In Proceedings of International Conference on Compiler Construction (CC’08), pages 16–35, 2008.
[37] A. Schuster. Searching for processes and threads in microsoft windows memory dumps. Digital Investigation, 3(Supplement-1):10–16, 2006.
[38] T. Wang, T. Wei, Z. Lin, and W. Zou. Intscope: Automatically detecting integer overflow vulnerability in x86 binary using symbolic execution. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS’09), San Diego, CA, February 2009.
[39] Z. Wang, X. Jiang, W. Cui, X. Wang, and M. Grace. Reformat: Automatic reverse engineering of encrypted messages. In Proceedings of 14th European Symposium on Research in Computer Security (ESORICS’09), Saint Malo, France, September 2009. LNCS.
[40] G. Wondracek, P. Milani, C. Kruegel, and E. Kirda. Automatic network protocol analysis. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), San Diego, CA, February 2008.
[41] Y. Xie, A. Chou, and D. Engler. Archer: using symbolic, path-sensitive analysis to detect memory access errors. In Proceedings of the 9th European software engineering conference held jointly with 10th ACM SIGSOFT international symposium on Foundations of software engineering (ESEC/FSE-10), pages 327–336, Helsinki, Finland, 2003.

