Automatic Reverse Engineering of Data Structures From Binary Execution
For example, in memory-based forensics, this knowledge matic Revelation of Data Structures.
accessed by the program is tagged with a timestamped type attribute. Following the program’s runtime data flow, this attribute is propagated to other memory addresses and registers that share the same type in a forward fashion, i.e., the execution direction. During the propagation, a variable’s type gets resolved if it is involved in a type-revealing execution point or “type sink” (e.g., a system call, a standard library call, or a type-revealing instruction). Besides leveraging the forward type propagation technique, to expand the coverage of program data structures, REWARDS involves the following key techniques:

• An on-line backward type resolution procedure where the types of some previously accessed variables get recursively resolved starting from a type sink. Since many variables are dynamically created and de-allocated at runtime, and the same memory location may be re-used by different variables, it is complicated to track and resolve variable types based on memory locations alone. Hence, we constrain the resolution process by the timestamps of relevant memory locations such that variables sharing the same memory location in different execution phases can be disambiguated.
• An off-line resolution procedure that complements the on-line procedure. Some variables cannot be resolved during their lifetime by our on-line algorithm. However, they may later get resolved when other variables having the same type are resolved. Hence, we propose an off-line backward resolution procedure to resolve the types of some “dead” variables.
• A method for typed variable abstraction that maps multiple typed variable instances to the same static abstraction. For example, all N nodes in a linked list actually share the same type, instead of having N distinct types.
• A method that reconstructs the structural and semantic view of in-memory data, driven by the derived type definitions. Once a program’s data structures are identified, it is still not clear exactly how the data structures would be laid out in memory – this is a useful piece of knowledge in many application scenarios such as memory forensics. Our method creates an “organization chart” that illustrates the hierarchical layout of those data structures.

We have developed a prototype of REWARDS and used it to analyze a number of binaries. Our evaluation results show that REWARDS is able to correctly reveal the types of a high percentage of variables observed during a program’s execution. Furthermore, we demonstrate the unique benefits of REWARDS to a variety of application scenarios: In memory image forensics, REWARDS helps recover semantic information from the memory dump of a binary program. In binary fuzzing for vulnerability discovery, REWARDS helps identify vulnerability “suspects” in a binary for guided fuzzing and confirmation.

2 REWARDS Overview

REWARDS infers both syntax and semantics of data structures from binary execution. More precisely, we aim at reverse engineering the following information:

• Data types. We first aim to infer the primitive data types of variables, such as char, short, float, and int. In a binary, the variables are located in various segments of the virtual address space, such as .stack, .heap, .data, .bss, .got, .rodata, .ctors, and .dtors sections. (Although we focus on ELF binaries on the Linux platform, REWARDS can be easily ported to handle PE binaries on Windows.) Hence, our goal is essentially to annotate memory locations in these data sections with types and sizes, following program execution. For our targeted applications, REWARDS also infers composite types such as socket address structures and FILE structures.
• Semantics. Moreover, we aim to infer the semantics (meaning) of program variables, which is critical to applications such as computer forensics. For example, in a memory dump, we want to decide if a 4-byte integer denotes an IP address.
• Abstract representation. Although we type memory locations, it is undesirable to simply present typed memory locations to the user. During program execution, a memory location may be used by multiple variables at different times; and a variable may have multiple instances. Hence we derive an abstract representation for a variable by aggregating the type information at multiple memory locations instantiated based on the same variable. For example, we use the offset of a local variable in its activation record as its abstract representation. Type information collected in all activation records of the same function is aggregated to derive the type of the variable.

Given only the binary, what can be observed at runtime from each instruction includes (1) the addresses accessed and the width of the accesses, (2) the semantics of the instruction, and (3) the execution context such as the program counter and the call stack. In some cases, data types can be partially inferred from instructions. For example, a floating point instruction (e.g., FADD) implies that the accessed locations must hold floating point numbers. We also observe that the parameters and return values of standard library calls and system calls often have their syntax and semantics
     1 struct {
     2   unsigned int pid;
     3   char data[16];
     4 }test;
     5
     6 void foo(){
     7   char *p="hello world";
     8   test.pid=my_getpid();
     9   strcpy(test.data,p);
    10 }

     1 extern foo
     2 section .text
     3 global _start
     4
     5 _start:
     6   call foo
     7   mov eax,1
     8   mov ebx,0
     9   int 80h

(a) Source code of function foo and the _start assembly code

     1 80480a0: e8 0f 00 00 00          call 0x80480b4
     2 80480a5: b8 01 00 00 00          mov $0x1,%eax
     3 80480aa: bb 00 00 00 00          mov $0x0,%ebx
     4 80480af: cd 80                   int $0x80
     5 ...
     6 80480b4: 55                      push %ebp
     7 80480b5: 89 e5                   mov %esp,%ebp
     8 80480b7: 83 ec 18                sub $0x18,%esp
     9 80480ba: c7 45 fc 18 81 04 08    movl $0x8048118,0xfffffffc(%ebp)
    10 80480c1: e8 4a 00 00 00          call 0x8048110
    11 80480c6: a3 24 91 04 08          mov %eax,0x8049124
    12 80480cb: 8b 45 fc                mov 0xfffffffc(%ebp),%eax
    13 80480ce: 89 44 24 04             mov %eax,0x4(%esp)
    14 80480d2: c7 04 24 28 91 04 08    movl $0x8049128,(%esp)
    15 80480d9: e8 02 00 00 00          call 0x80480e0
    16 80480de: c9                      leave
    17 80480df: c3                      ret
    18 80480e0: 55                      push %ebp
    19 80480e1: 89 e5                   mov %esp,%ebp
    20 80480e3: 53                      push %ebx
    21 80480e4: 8b 5d 08                mov 0x8(%ebp),%ebx
    22 80480e7: 8b 55 0c                mov 0xc(%ebp),%edx
    23 80480ea: 89 d8                   mov %ebx,%eax
    24 80480ec: 29 d0                   sub %edx,%eax
    25 80480ee: 8d 48 ff                lea 0xffffffff(%eax),%ecx
    26 80480f1: 0f b6 02                movzbl (%edx),%eax
    27 80480f4: 83 c2 01                add $0x1,%edx
    28 80480f7: 84 c0                   test %al,%al
    29 80480f9: 88 04 0a                mov %al,(%edx,%ecx,1)
    30 80480fc: 75 f3                   jne 0x80480f1
    31 80480fe: 89 d8                   mov %ebx,%eax
    32 8048100: 5b                      pop %ebx
    33 8048101: 5d                      pop %ebp
    34 8048102: c3                      ret
    35 ...
    36 8048110: b8 14 00 00 00          mov $0x14,%eax
    37 8048115: cd 80                   int $0x80
    38 8048117: c3                      ret

(b) Disassembled code of the example

    [Nr] Name     Type      Addr      Off     Size
    ...
    [ 1] .text    PROGBITS  080480a0  0000a0  000078
    [ 2] .rodata  PROGBITS  08048118  000118  00000c
    [ 3] .bss     NOBITS    08049124  000124  000014
    ...

(c) Section map of the example binary

    rodata_0x08048118{
      +00: char[12]
    }
    bss_0x08049124{
      +00: pid_t,
      +04: char[12],
      +16: unused[4]
    }
    fun_0x080480b4{
      -28: unused[20],
      -08: char *,
      -04: stack_frame_t,
      +00: ret_addr_t
    }
    fun_0x08048110{
      +00: ret_addr_t
    }
    fun_0x080480e0{
      -08: unused[4],
      -04: stack_frame_t,
      +00: ret_addr_t,
      +04: char*,
      +08: char*
    }

(d) Derived data types
well defined and publicly known. Hence we define the type revealing instructions, system calls, and library calls as type sinks. Furthermore, the execution of an instruction creates a dependency between the variables involved. For instance, if a variable with a resolved type (from a type sink) is copied to another variable, the destination variable should have a compatible type. As such, we model our problem as a type information flow problem.

To illustrate how REWARDS works, we use a simple program compiled from the source code shown in Figure 1(a). According to the code snippet, the program has a global variable test (lines 1-4) which consists of an int and a char array. It contains a function foo (lines 6-10) that calls my_getpid and strcpy to initialize the global variable. The full disassembled code of the example is shown in Figure 1(b) (a dotted line indicates a “NOP” instruction). The address mapping of code and data is shown in Figure 1(c).

When foo is called during execution, it first saves ebp and then allocates 0x18 bytes of memory for the local variables (line 8 in Figure 1(b)), and then initializes one local variable (at address 0xfffffffc(%ebp)=ebp-4) with an immediate value 0x8048118 (line 9). Since 0x8048118 is in the address range of the .rodata section (it is actually the starting address of string “hello world”), ebp-4 can be typed as a pointer, based on the heuristics that instruction executions using similar immediate values within a code or data section are considered type sinks. Note that the type of the pointer is not yet known. At line 10, foo calls 0x8048110. Inside the body of the function invocation (lines 36-38), our algorithm detects a getpid system call (a type sink) with eax being 0x14 at line 36. The return value of the function call is resolved as pid_t type, i.e., register eax at line 11 is typed pid_t. When eax is copied to address 0x8049124 (a global variable in .bss section as shown in Figure 1(c)), the algorithm further resolves 0x8049124 as pid_t. Before the function call 0x80480e0 at line 15 (strcpy), the parameters are initialized in lines 12-14. As ebp-4 has been typed as a pointer at line 9, the data flow in lines 12 and 13 dictates that location esp+4 at line 13 is a pointer as well. At line 14, as 0x8049128 is in the global variable section and of a known type, location esp has an unknown pointer type. At line 15, upon the call to strcpy (a type sink), both esp and esp+4 are resolved to char*. Through a backward transitive resolution, 0x8049128 is resolved as char, ebp-4 as char*, and 0x8048118 as char. Also at line 26, inside the function body of strcpy, the instruction “movzbl (%edx),%eax” can be used as another type sink as it moves between char variables.
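The system-call sink in the walkthrough (eax holding 0x14 before the trap at line 37) can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the table SYSCALL_RETURN_TYPES and the function type_at_syscall_return are hypothetical names, not part of REWARDS, and the table lists just three i386 Linux system call numbers.

```python
# Hypothetical lookup table (not from REWARDS): i386 Linux system call
# number -> type assigned to the value returned in eax.
SYSCALL_RETURN_TYPES = {
    3: "ssize_t",   # read
    5: "int",       # open (returns a file descriptor)
    20: "pid_t",    # getpid
}

def type_at_syscall_return(eax_before_trap):
    """Type to give eax after `int $0x80`, or None if the call is not in the table."""
    return SYSCALL_RETURN_TYPES.get(eax_before_trap)

# Line 36 of Figure 1(b) loads 0x14 into eax before the trap at line 37.
print(type_at_syscall_return(0x14))
```

A fuller table would also carry types and semantic tags for the arguments, which is what lets a sink resolve variables accessed before the call.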
When the program finishes, we resolve all data types (including function arguments, and those implicit variables such as return address and stack frame pointer) as shown in Figure 1(d). The derived types for variables in .rodata, .bss and functions are presented in the figure. Each function is denoted by its entry address. fun_0x080480b4, fun_0x08048110, and fun_0x080480e0 denote foo(), my_getpid(), and strcpy(), respectively. The number before each derived type denotes the offset. Variables are listed in increasing order of their addresses. Type stack_frame_t indicates a frame pointer stored at that location. Type ret_addr_t means that the location holds a return address. Such semantic information is useful in applications such as vulnerability fuzzing. Locations that are not accessed during execution are annotated with the unused type. In fun_0x080480e0, the two char* below the ret_addr_t represent the two actual arguments of strcpy(). Although it seems that our example can be statically resolved due to its simplicity, it is very difficult in practice to analyze data flows between instructions (especially those involving heap locations) due to the difficulty of binary points-to analysis.

3 REWARDS Design

In this section, we describe the design of REWARDS. We first identify the type sinks used in REWARDS and then present the on-line type propagation and resolution algorithm, which will be enhanced by an off-line procedure that recovers more variable types not reported by the on-line algorithm. Finally, we present a method to construct a typed hierarchical view of memory layout.

3.1 Type Sinks

system call returns, REWARDS will type register eax and, from there, those having the same type as eax. In our type propagation and resolution algorithm (Section 3.2), a type sink will lead to the recursive type resolution of relevant variables accessed before and after the type sink.

Standard library calls. With well-defined API, standard library calls are another category of type sink. For example, the two arguments of strcpy must both be of the char* type. By intercepting library function calls and returns, REWARDS will type the registers and memory variables involved. Standard library calls tend to provide richer type information than system calls – for example, Linux-2.6.15 has 289 system calls whereas libc.so.6 contains 2016 functions (note some library calls wrap system calls).

Type-revealing instructions. A number of machine instructions that require operands of specific types can serve as type sinks. Examples in x86 are as follows: (1) String instructions perform byte-string operations such as moving/storing (MOVS/B/D/W, STOS/B/D/W), loading (LODS/B/D/W), comparison (CMPS/B/D/W), and scanning (SCAS/B/D/W). Note that MOVZBL is also used in string movement. (2) Floating-point instructions operate on floating-point, integer, and binary coded decimal operands (e.g. FADD, FABS, and FST). (3) Pointer-related instructions reveal pointers. For a MOV instruction with an indirect memory access operand (e.g., MOV (%edx), %ebx or MOV [mem], %eax), the value held in the source operand must be a pointer. Meanwhile, if the target address is within the range of data sections such as .stack, .heap, .data, .bss or .rodata, the pointer must be a data pointer; if it is in the range of .text (including library code), the pointer must be a function pointer. Note that the concrete type of such a pointer will be resolved through other constraints.
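The data-pointer versus function-pointer rule can be sketched in Python using the section map of Figure 1(c). This is a minimal illustration: SECTIONS, CODE_SECTIONS, section_of, and classify_pointer are hypothetical names introduced here, and only the three sections listed in the figure are modeled.

```python
# Section map copied from Figure 1(c): name -> (start address, size in bytes).
SECTIONS = {
    ".text":   (0x080480a0, 0x78),
    ".rodata": (0x08048118, 0x0c),
    ".bss":    (0x08049124, 0x14),
}
CODE_SECTIONS = {".text"}  # assumption: everything else is treated as data

def section_of(addr):
    """Name of the section containing addr, or None if it is outside the map."""
    for name, (start, size) in SECTIONS.items():
        if start <= addr < start + size:
            return name
    return None

def classify_pointer(target):
    """Classify a pointer value by the section its target address falls in."""
    name = section_of(target)
    if name is None:
        return "unknown"
    return "function pointer" if name in CODE_SECTIONS else "data pointer"

print(classify_pointer(0x8048118))  # immediate stored at line 9
print(classify_pointer(0x80480e0))  # call target at line 15
```

As the text notes, this only settles the kind of pointer; the pointee type still has to come from another sink.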
Sg1, i.e. l1, is typed to char*. Note that the timestamp

method can be invoked multiple times, giving rise to multiple instances.
instruction         | Tg1     | Sg1        | tsg1 | Tl1     | Sl1                 | tsl1 | Tl2     | Sl2        | tsl2
--------------------+---------+------------+------+---------+---------------------+------+---------+------------+-----
10. enter M         | φ       | φ          | 0    | φ       | φ                   | 10   | φ       | φ          | 10
11. mov g1, l1      | φ       | {<l1, 10>} | 0    | φ       | {<g1, 0>}           | 10   | φ       | φ          | 10
12. mov l1, l2      | φ       | {<l1, 10>} | 0    | φ       | {<g1, 0>, <l2, 10>} | 10   | φ       | {<l1, 10>} | 10
...                 | ...     | ...        | ...  | ...     | ...                 | ...  | ...     | ...        | ...
100. strcpy(g1,...) | {char*} | {<l1, 10>} | 0    | {char*} | {<g1, 0>, <l2, 10>} | 10   | {char*} | {<l1, 10>} | 10

Table 1. Example of running the online algorithm. Variable g1 is a global, l1 and l2 are locals.
instruction         | Tg1     | Sg1        | tsg1 | Tl1 | Sl1                 | tsl1 | Tl2 | Sl2        | tsl2
--------------------+---------+------------+------+-----+---------------------+------+-----+------------+-----
...                 | ...     | ...        | ...  | ... | ...                 | ...  | ... | ...        | ...
12. mov l1, l2      | φ       | {<l1, 10>} | 0    | φ   | {<g1, 0>, <l2, 10>} | 10   | φ   | {<l1, 10>} | 10
13. Exit M          | φ       | {<l1, 10>} | 0    | φ   | {<g1, 0>, <l2, 10>} | 10   | φ   | {<l1, 10>} | 10
...                 | ...     | ...        | ...  | ... | ...                 | ...  | ... | ...        | ...
99. Enter N         | φ       | {<l1, 10>} | 0    | φ   | φ                   | 99   | φ   | φ          | 99
100. strcpy(g1,...) | {char*} | {<l1, 10>} | 0    | φ   | φ                   | 99   | φ   | φ          | 99

Table 2. Example of running the off-line type resolution procedure. The execution before timestamp 12 is the same as Table 1. Method N reuses l1 and l2.
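The bookkeeping shown in Tables 1 and 2 can be replayed with a short Python sketch. It is reconstructed from the two tables alone and is not the REWARDS algorithm itself: the class names Var and Tracker, the methods birth, mov, and sink, and the helper replay are all hypothetical. Each variable carries a type set (T), a set of <variable, timestamp> peers (S), and the timestamp of its current instance (ts); a sink resolves a peer only if the recorded timestamp still matches the peer's live instance.

```python
from dataclasses import dataclass, field

@dataclass
class Var:
    ts: int                                   # timestamp of the current instance (ts_v)
    types: set = field(default_factory=set)   # resolved types (T_v)
    peers: set = field(default_factory=set)   # {(name, timestamp)} sharing the type (S_v)

class Tracker:
    def __init__(self):
        self.vars = {}

    def birth(self, name, now):
        """A new instance starts at this location; earlier state is dropped."""
        self.vars[name] = Var(ts=now)

    def mov(self, src, dst):
        """A copy makes src and dst share a type: record each in the other's peer set."""
        s, d = self.vars[src], self.vars[dst]
        s.peers.add((dst, d.ts))
        d.peers.add((src, s.ts))
        d.types |= s.types                    # forward propagation of known types

    def sink(self, name, ty):
        """Resolve `name`, then every peer whose recorded instance is still live."""
        work, seen = [name], set()
        while work:
            n = work.pop()
            if n in seen:
                continue
            seen.add(n)
            v = self.vars[n]
            v.types.add(ty)
            work.extend(p for p, ts in v.peers if self.vars[p].ts == ts)

def replay(reuse_locals):
    t = Tracker()
    t.birth("g1", 0)
    t.birth("l1", 10)                         # 10. enter M
    t.birth("l2", 10)
    t.mov("g1", "l1")                         # 11. mov g1, l1
    t.mov("l1", "l2")                         # 12. mov l1, l2
    if reuse_locals:                          # 99. enter N (Table 2)
        t.birth("l1", 99)
        t.birth("l2", 99)
    t.sink("g1", "char*")                     # 100. strcpy(g1, ...)
    return {n: v.types for n, v in t.vars.items()}

print(replay(False))   # Table 1 scenario
print(replay(True))    # Table 2 scenario
```

In the Table 2 scenario the pair <l1, 10> no longer matches the live instance of l1, so the on-line pass leaves l1 and l2 untyped; resolving the dead instances from timestamp 10 is the job the text assigns to the off-line procedure.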
structure plus the call stack at that point, as the abstraction of the structure. The intuition is that the heap structure instances allocated from the same PC in the same call stack should have the same type. Fields of the structure are represented by the allocation site and field offsets. As an allocated heap region may be an array of a data structure, we use the recursion detection heuristics in [9] to detect the array size. Specifically, the array size is approximated by the maximum number of accesses by the same PC to unique memory locations in the allocated region. The intuition is that array elements are often accessed through a loop in the source code and the same instruction inside the loop body often accesses the same field across all array elements. Finally, if heap structures allocated from different sites have the same field types, we will heuristically cluster these heap structures into one abstraction.

3.5 Constructing Hierarchical View of In-Memory Data Structure Layout

An important feature of REWARDS is to construct a hierarchical view of a memory snapshot, in which the primitive syntax of individual memory locations, as well as their semantics and the integrated hierarchical structure are visually represented. This is highly desirable in applications like memory forensics as interesting queries, e.g., “find all IP addresses”, can be easily answered by traversing the view (examples in Section 5.1). So far, REWARDS is able to reverse engineer the syntax and semantics of data structures, represented by their abstractions. Next, we present how we leverage such information to construct a hierarchical view.

Our method works as follows. It first types the top level global variables. In particular, a root node is created to represent a global section. Individual global variables are represented as children of the root. Edges are annotated with offset, size, primitive type, and semantics of the corresponding children. If a variable is a pointer, the algorithm further recursively constructs the sub-view of the data structure being pointed to, leveraging the derived type of the pointer. For instance, assume a global pointer p is of type T*, our method creates a node representing the region pointed to by p. The region is typed based on the reverse engineered definition of T. The recursive process terminates when none of the fields of a data structure is a pointer. The stack is handled similarly: A root node is created to represent each activation record. Local variables of the record are denoted as children nodes. Recursive construction is performed until all memory locations through pointers are traversed. Note that all live heap structures can be reached (transitively) through a global pointer or a stack pointer. Hence, the above two steps essentially also construct the structural views of live heap data.

Our method can also type some of the unreachable memory regions, which represent “dead” data structures, e.g., activation records of previous method invocations whose space has been freed but not reused. Such dead data is as important as live data because it discloses what happened in the past. In particular, our method scans the stack beyond the current activation record to identify any pointers to the code section, which often denote return addresses of method invocations. With a return address, the function invocation can be identified and we can follow the aforementioned steps to type the activation record.

4 Implementation and Evaluation

We have implemented REWARDS on PIN-2.6 [27], with 12.1K lines (LOC) of C code and 1.2K LOC of Python code. In the following, we present several key implementation details. REWARDS is able to reveal variable semantics. In our implementation, variable semantics are represented as special semantic tags complementary to regular type tags such as int and char. Both semantic tags and regular tags are stored in the variable’s type set. Tags are enumerated
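The recursive construction of the hierarchical view described in Section 3.5, and a query such as “find all IP addresses”, can be sketched in Python as follows. Everything here is illustrative: the TYPES and MEMORY dictionaries, the type name conn_t, and the functions build_view and find_tagged are hypothetical, and the visited set guarding against cyclic pointers is an addition of this sketch rather than something the text specifies.

```python
# Hypothetical inputs: derived type definitions (offset, field type) and a
# snapshot mapping addresses to 32-bit values.
TYPES = {
    "globals": [(0, "conn_t*")],                         # one global pointer
    "conn_t":  [(0, "int"), (4, "ip_addr_t"), (8, "conn_t*")],
}
MEMORY = {
    0x1000: 0x2000,                                      # global -> first conn_t
    0x2000: 7, 0x2004: 0x0b00000a, 0x2008: 0x3000,
    0x3000: 9, 0x3004: 0x0400000a, 0x3008: 0,            # null pointer ends the chain
}

def build_view(base, type_name, seen=None):
    """Create a node for the region at `base` and recurse through pointer fields."""
    seen = set() if seen is None else seen
    seen.add(base)
    node = {"addr": base, "type": type_name, "children": []}
    for offset, ftype in TYPES[type_name]:
        value = MEMORY.get(base + offset)
        child = {"offset": offset, "type": ftype, "value": value, "target": None}
        # Follow non-null pointers to regions not visited yet.
        if ftype.endswith("*") and value and value not in seen:
            child["target"] = build_view(value, ftype[:-1], seen)
        node["children"].append(child)
    return node

def find_tagged(node, tag):
    """Collect the values of all fields carrying the given type or semantic tag."""
    hits = []
    for child in node["children"]:
        if child["type"] == tag:
            hits.append(child["value"])
        if child["target"] is not None:
            hits.extend(find_tagged(child["target"], tag))
    return hits

view = build_view(0x1000, "globals")
print([hex(v) for v in find_tagged(view, "ip_addr_t")])
```

The same traversal would start from one root per activation record to cover the stack.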
to save space. The vast diversity of program semantics makes it infeasible to consider them all. Since we are mainly interested in forensics and security applications, we focus on the following semantic tags: (1) file system related (e.g., FILE pointer, file descriptor, file name, file status); (2) network communication related (e.g., socket descriptor, IP address, port, receiving and sending buffer, host info, msghdr); and (3) operating systems related (e.g., PID, TID, UID, system time, system name, and device info).

Meanwhile, we introduce some of our own semantic tags, such as ret_addr_t indicating that a memory location is holding a return address, stack_frame_t indicating that a memory location is holding a stack frame pointer, format_string_t indicating that a string is used as a format string argument, and malloc_arg_t indicating an argument of the malloc function (similarly, calloc_arg_t for the calloc function, etc.). Note that these tags reflect the properties of variables at those specific locations and hence do not participate in the type information propagation. They can bring important benefits to our targeted applications (Section 5).

REWARDS needs to know the program’s address space mapping, which will be used to locate the addresses of global variables and detect pointer types. In particular, REWARDS checks the target address range when determining if a pointer is a function pointer or a data pointer. Thus, when a binary starts executing with REWARDS, we first extract the coarse-grained address mapping from the /proc/pid/maps file, which defines the ranges of code and data sections including those from libraries, and the ranges of stack and heap (at that time). Then for each detailed address mapping such as .data, .bss and .rodata for all loaded files (including libraries), we extract the mapping using the API provided by PIN when the corresponding image file is loaded.

We have performed two sets of experiments to evaluate REWARDS: one is to evaluate its correctness, and the other is to evaluate its time and space efficiency. All the experiments were conducted on a machine with two 2.13GHz Pentium processors and 2GB RAM running Linux kernel 2.6.15.

We select 10 widely used utility programs from the following packages: procps-3.2.6 (with 19.1K LOC and containing command ps), iputils-20020927 (with 10.8K LOC and containing command ping), net-tools-1.60 (with 16.8K LOC and containing netstat), and coreutils-5.93 (with 117.5K LOC and containing the remaining test commands such as ls, pwd, and date). The reason for selecting these programs is that they contain many data structures related to the operating system and network communications. We run these utilities without command line options except ping, which is run with localhost as the target and a packet count of 4.

4.1 Evaluation of Accuracy

To evaluate the reverse engineering accuracy of REWARDS, we compare the derived data structure types with those declared in the program source code. To acquire the oracle information, we recompile the programs with debugging information, and then use libdwarf [1] to extract type information from the binaries. The libdwarf library is capable of presenting the stack and global variable mappings after compilation. For instance, global variables scattered in various places in the source code will be organized into a few data sections. The library allows us to see the organization. In particular, libdwarf extracts stack variables by presenting the mapping from their offsets in the stack frame to the corresponding types. For global variables, the output of libdwarf is program virtual addresses and their types. Such information allows us to conduct direct and automated comparison. Note that we only verify the types in .data, .bss, and .rodata sections; other global data in sections such as .got and .ctors are not verified. For heap variables, since we use the execution context at allocation sites as the abstract representation, given an allocation context, we can locate it in the disassembled binary, and then correlate it with program source code to identify the heap data structure definition, and finally compare it with REWARDS’s output. Although REWARDS extracts variable types for the entire program address space (including libraries), we only compare the results for user-level code.

The result for stack variables is presented in Figure 2(a). The figure presents the percentage of (1) functions that are actually executed, (2) data structures that are used in the executed functions (over all structures declared in those functions), and (3) data structures whose types are accurately recovered by REWARDS (over those in (2)). At runtime, it is often the case that even though a buffer is defined in the source code with size n, only part of the n bytes are used. Consequently, only those used ones are typed (the others are considered unused). We consider the buffer correctly typed if its bytes are either correctly typed or unused. From the figure, we can observe that, due to the nature of dynamic analysis, not all functions or data structures in a function are exercised and hence amenable to REWARDS. More importantly, REWARDS achieves an average of 97% accuracy (among these benchmarks) for the data structures that get exercised. For heap variables, the result is presented in Figure 2(b); the bars are similarly defined. REWARDS’s output perfectly matches the types in the original definitions when they are exercised. Note some of the benchmarks are missing in Figure 2(b) (e.g., date) because their executions do not allocate any user-level heap structures. The result for global variables is presented in Figure 2(c), and REWARDS achieves over 85% accuracy. To explain why REWARDS cannot achieve 100% accu-
[Figure 2: bar charts (a) to (c), one bar group per benchmark program; x-axis: Benchmark Program; y-axis: Percentage; series: Dynamically Executed Funs, Dynamically Exposed Types, REWARDS Accuracy.]

[Figure: two per-benchmark plots; x-axis: Benchmark Program; y-axis of the first plot: Execution Time (seconds).]
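The coarse-grained address mapping step described in Section 4 reads /proc/pid/maps. A minimal Python sketch of parsing that file format is given below; parse_maps, region_of, and the SAMPLE text are hypothetical and only illustrate the format (address range, permissions, offset, device, inode, optional path), not how REWARDS implements the step.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps text into (start, end, perms, path) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)           # the path, if present, is the 6th field
        if len(fields) < 5:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, fields[1], path))
    return regions

def region_of(regions, addr):
    """Permissions and path of the region containing addr, or None."""
    for start, end, perms, path in regions:
        if start <= addr < end:
            return perms, path
    return None

# Made-up sample in the maps format; a real run would read the file instead.
SAMPLE = """\
08048000-08049000 r-xp 00000000 08:01 1234 /bin/example
08049000-0804a000 rw-p 00000000 08:01 1234 /bin/example
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
bffeb000-c0000000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(SAMPLE)
print(region_of(regions, 0x080480a0))
print(region_of(regions, 0xbffff62c))
```

On Linux the same parser can be fed the contents of /proc/self/maps or of the maps file of a traced process.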
a sockaddr structure. The last field (with offset +40) denotes another heap structure whose allocation site is 0x0804ddfb. Transitively, our method reconstructs the entire hierarchy.

The extraction of IP addresses is translated into a traversal over the view to identify those with the IP address semantic tags. Along the path 08050260 → 08052170 → 7e9200...0 → 0x0b0000a, a variable with the sin_addr type can be identified, which stores the client IP. The same IP can also be identified along the path 08050260 → 08052170 → 08052a58 → 10.0.0.11, with the field offset +2596. The field has the ip_addr_str_t tag, which is resolved at the return of a call to inet_ntoa(). REWARDS is able to isolate the server IP 10.0.0.4 as a string along the path 08050260 → 08051170 → 10.0.0.4 with the field offset +1172. Interestingly, this field does not have a semantic tag related to an IP address. The reason is that the field is simply a part of the request string (the host field in HTTP Request Message), but it is not used in any type sinks that can resolve it as an IP. However, isolating the string also allows a human inspector to extract it as an IP.

To validate our result, we present in Figure 4(b) the corresponding symbolic definitions extracted from the source for comparison. Fields that are underlined are used during execution. In particular, struct CONNECTION corresponds to the abstraction struct 0x0804dd4f (node 08052170) and struct CONNDATA corresponds to struct 0x0804ddfb (node 08052a58). Observe that all fields of CONNECTION are precisely derived, except the pointer PostData, which is represented as an unused array in the inferred definition because the field is not used during execution. For the CONNDATA structure, all the exercised fields are extracted and correctly typed. Recall that we consider a field correctly typed if its offset is correctly identified and its composition bytes are either correctly typed or unused.

5.1.2 Typing Dead Memory

In this case, we demonstrate how to type dead memory, i.e., memory regions containing dead variables, using the slapper worm bot-master program. Slapper worm relies on P2P communications. The bot-master uses a program called pudclient to control the P2P botnet, such as launching TCP-flood, UDP-flood, and DNS-flood attacks. Our goal is to extract evidence from a memory dump of pudclient from the attacker’s machine.

Our experiment has two scenes: the investigator’s scene and the attacker’s scene. More specifically,

• Scene I: In the lab, the investigator runs the bot-master program pudclient to communicate with slapper bots to derive the data structures of pudclient.
• Scene II: In the wild, the attacker runs pudclient to control real slapper bots.

In Scene I, we run a number of slapper worm instances in a contained environment (at IP addresses ranging from 10.0.0.1 to 10.0.1.255). Then we launch pudclient with REWARDS and issue a series of commands such as listing the compromised hosts, and launching the UDPFlood, TCPFlood, and DNSFlood attacks. REWARDS extracts the data structure definitions for pudclient. Then in Scene II, we run pudclient again without REWARDS. Indeed, the attacker’s machine does not have any forensics tool running. Emulating the attacker, we issue some commands and then hibernate the machine. We then get the memory image of pudclient and use the data structure information derived in Scene I to investigate the image.
[Figure 4(a): hierarchical view derived by REWARDS (graph); node labels include struct _0x0804dd4f *, +0 pthread_t, +4 int, +12 struct sockaddr (sin_family, sin_addr, sin_zero), +28 time_t, +32 time_t, +36 unused [4], +0 char [11], +69964 short int, +69966 unused [8192].]

[Figure 4(b): source code definition; the visible tail of CONNDATA reads:]

    174 short int out_headdone;
    175 short int out_bodydone;
    176 short int out_flushed;
    177 // user data
    178 char envbuf[8192];
    179 } CONNDATA;

Figure 4. Comparison between the REWARDS-derived hierarchical view and source code definition
    bfffd140 05 00 00 00 6b 00 00 00 69 00 00 00 00 00 00 00
    bfffd150 00 00 00 00 38 ea ff bf 00 00 00 00 00 00 00 01
    bfffd160 2c 00 00 00 67 45 8b 6b 0e 00 00 00 00 00 00 00
    bfffd170 0a 00 00 63 0f 27 00 00 9f 86 01 00 9f 86 01 00
    bfffd180 1c ea ff bf 10 ea ff bf 6a f2 b2 4a 7a 4a 0e 00
    bfffd190 22 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    bfffd1a0 6a f2 b2 4a 7a 4a 0e 00 f2 f3 8d 8c 00 00 00 00
    bfffd1b0 00 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00
    bfffd1c0 64 6e 73 66 6c 6f 6f 64 00 00 00 00 00 00 00 00
    bfffd1d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    bfffd5c0 c0 d1 ff bf 00 00 00 00 02 ca 04 08 00 00 00 00
    bfffd5d0 00 00 00 00 00 00 00 00 02 ca 04 08 02 ca 04 08
    bfffd5e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    bfffd5f0 00 00 00 00 00 00 00 00 00 00 00 00 04 d6 ff bf
    bfffd600 64 6e 73 66 6c 6f 6f 64 00 00 00 00 00 00 00 00
    bfffd610 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    bfffe5b0 00 00 00 00 00 00 00 00 0e 00 00 00 00 00 00 00
    bfffe5c0 00 00 00 00 02 00 4e 34 0a 00 00 0b 00 00 00 00
    bfffe5d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    bfffe5e0 00 00 00 00 00 00 00 00 00 00 00 00 e0 f5 ff bf
    bfffe5f0 a0 2d 05 08 e0 f5 ff bf a0 13 05 08 00 00 00 00
    bfffe600 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    bfffea00 00 00 00 00 00 00 00 00 00 00 00 00 10 ea ff bf
    bfffea10 01 00 00 00 00 00 00 00 e5 de f2 49 46 00 00 00
    bfffea20 67 45 8b 6b 10 00 00 00 e8 be e6 71 0a 00 00 34
    bfffea30 0a 00 01 33 0a 00 00 0b 0a 00 00 04 00 00 00 00
    bfffea40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    ...
    bffff5c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    bffff5d0 01 00 00 00 80 00 00 00 80 00 00 00 ff f7 ff bf
    bffff5e0 00 00 00 00 00 00 00 00 f3 f7 ff bf 67 45 8b 6b
    bffff5f0 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    bffff600 01 00 00 00 c0 f6 ff bf 28 f6 ff bf fb c7 04 08
    bffff610 02 00 00 00 dc 3a 1f b6 d4 df 04 08 dc 3a 1f b6
    bffff620 00 00 00 00 dc 3a 1f b6 88 f6 ff bf a2 de 0d b6
    bffff630 02 00 00 00 b4 f6 ff bf c0 f6 ff bf f6 5b ff b7

Figure 5. Memory dump for Slapper worm control program when exiting the control interface
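The stack scan for likely return addresses described in Section 3.5 can be sketched in Python over a few bytes of this dump. The sketch is illustrative only: scan_for_code_pointers is a hypothetical helper, and the code range CODE_START to CODE_END is an assumed value, since the section map of pudclient is not given here; which words qualify depends entirely on the real range.

```python
import struct

def scan_for_code_pointers(base, data, code_start, code_end):
    """Return (address, value) for each aligned little-endian 32-bit word
    whose value lies inside [code_start, code_end)."""
    hits = []
    for off in range(0, len(data) - 3, 4):
        (value,) = struct.unpack_from("<I", data, off)
        if code_start <= value < code_end:
            hits.append((base + off, value))
    return hits

# Two rows copied from Figure 5, starting at address 0xbffff600.
ROWS = bytes.fromhex(
    "01000000 c0f6ffbf 28f6ffbf fbc70408"
    "02000000 dc3a1fb6 d4df0408 dc3a1fb6"
)

# Assumed code range, for illustration only.
CODE_START, CODE_END = 0x08048000, 0x08050000

for addr, value in scan_for_code_pointers(0xBFFFF600, ROWS, CODE_START, CODE_END):
    print(hex(addr), hex(value))
```

Each candidate would then be mapped to the function invocation it returns into, so that the derived definition of that function can be used to type the dead activation record.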
We construct the hierarchical view and try to identify IP addresses from it. However, the
hierarchical view can only map memory locations that are alive, namely those reachable
from global and stack (pointer) variables. Here, we take an extra step to type the dead
(unreachable) data. As described in Section 3.5, our technique scans the stack space below
the current (lowest live) activation record and looks for values that fall within the range
of the code section, as they are very likely return addresses. Four such values are
identified. One example and its memory context are shown in Figure 5. In this memory dump
snippet, the return address, as underlined, is located at address 0xbffff62c. Our technique
further identifies that the corresponding function invocation is to 0x0804a708. Hence, we
use the data structure definition of fun_0x0804a708 to type the activation record. The
definition and the typed values are shown in Table 3. Observe that a number of IPs (fields
with ip_addr_t) are identified. We also spot the bot command “dnsflood” at offsets -9324
and -8236. Note that these two fields have the input_t tag as part of their derived
definition, indicating they hold values from input.

5.2 Vulnerability Fuzz

It is a challenging task to detect and confirm vulnerabilities in a given binary without
symbolic information. Previously, in [26], we proposed a dynamic analysis approach that can
decide whether a vulnerability suspect is a true positive by generating a concrete exploit.
The basic idea is to first use existing static tools to identify vulnerability candidates,
which are often numerous; then benign executions are mutated to generate exploits.
Mutations are directed by dynamic information called input lineage, which denotes the set
of input elements that are used to compute a value at a given execution point, usually a
vulnerability candidate. Vulnerability-specific patterns are followed during mutation. One
example pattern is to exponentially expand an input string in the lineage of a candidate
buffer with the goal of generating an overflow exploit. In that project, we had difficulty
finding publicly available, binary-level vulnerability detectors to use as the front end.
REWARDS helps address this issue by deriving both variable syntax and semantics from a
subject binary. Next, we present our experience of using REWARDS to identify vulnerability
suspects and then using our prior system (a fuzzer) to confirm them.

For this study, we design a static vulnerability suspect detector that relies on the
variable type information derived by REWARDS. The result of the detector is passed to our
lineage-based fuzzer to generate exploits. In the following, we present how REWARDS helps
identify various types of vulnerability suspects.

• Buffer overflow vulnerability. Buffer overflows could happen in three different places:
stack, heap, and global areas. As such, we define three types of buffer overflow
vulnerability patterns. Specifically, for stack overflow, if a stack layout contains a
buffer and its content comes from user input, we consider it a suspect. Note that this can
be easily facilitated by REWARDS’s typing algorithm: a semantics tag input_t is defined to
indicate that a variable receives its value from external input. The tag is propagated only
along the forward flow, not the backward flow. In the stack layout derived by REWARDS, if
a buffer’s type set contains an input_t tag, it is considered vulnerable. For heap
overflow, we consider two cases: one is to exploit heap management data structures outside
the user-allocated heap chunk; and the other is to exploit user-defined function pointers
inside the heap chunk. Detecting the former case is simply to check if a heap structure
contains a buffer
Offset  Type               Size  Mem Addr  Content
-9432   void*              4     bfffd154  38 ea ff bf
-9428   char*              4     bfffd158  00 00 00 00
-9420   int                4     bfffd160  2c 00 00 00
-9416   int                4     bfffd164  67 45 8b 6b
-9412   int                4     bfffd168  0e 00 00 00
-9408   int                4     bfffd16c  00 00 00 00
-9404   ip_addr_t          4     bfffd170  0a 00 00 63
-9300   port_t             4     bfffd174  0f 27 00 00
-9396   int                4     bfffd178  9f 86 01 00
-9392   int                4     bfffd17c  9f 86 01 00
-9388   void*              4     bfffd180  1c ea ff bf
-9384   void*              4     bfffd184  10 ea ff bf
        timeval.tv_sec     4     bfffd18c  7a 4a 0e 00
-9376   timeval.tv_usec    4     bfffd190  22 00 00 00
-9368   int                4     bfffd194  00 00 00 00
-9352   int                4     bfffd1a4  7a 4a 0e 00
-9348   int                4     bfffd1a8  f2 f3 8d 8c
-9344   int                4     bfffd1ac  00 00 00 00
-9332   int                4     bfffd1b8  01 00 00 00
-9328   int                4     bfffd1bc  02 00 00 00
-9324   char[9],input_t    9     bfffd1c0  64 6e..64
-8300   char*              4     bfffd5c0  c0 d1 ff bf
-8236   char[9],input_t    9     bfffd600  64 6e..64
-8227   char[28]           28    bfffd609  00 .. 00
-4236   void*              4     bfffe5a0  00 00 00 00
-4156   struct_0x804834e*  4     bfffe5f0  a0 2d 05 08
-4152   void*              4     bfffe5f4  e0 f5 ff bf
-3104   char*              4     bfffea0c  10 ea ff bf
-3088   char[16]           16    bfffea1c  46 00 00 00
-3068   ip_addr_t          4     bfffea30  0a 00 01 33
-3064   ip_addr_t          4     bfffea34  0a 00 00 0b
-3058   ip_addr_t          4     bfffea38  0a 00 00 04
-3054   ip_addr_t          4     bfffea3c  0a 00 00 04
-0088   int                4     bffff5d4  80 00 00 00
-0084   int                4     bffff5d8  80 00 00 00
-0080   int                4     bffff5dc  ff f7 ff bf
-0004   stack_frame_t      4     bffff628  88 f6 ff bf
+0000   ret_addr_t         4     bffff62c  a2 de 0d b6
+0004   int                4     bffff630  02 00 00 00
+0008   char*              4     bffff634  b4 f6 ff bf
Table 3. Result of typing the unreachable memory using type fun_0x804a708
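Given a typed layout such as Table 3, the stack-overflow suspect rule (a buffer whose type set contains the input_t tag) can be sketched as a simple filter. This is an illustrative sketch under stated assumptions, not the detector used in the study: the `Field` record, the `is_buffer` helper, and the regular expression that recognizes array types are all hypothetical. The four sample rows are copied from Table 3.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One row of a derived stack layout (hypothetical representation)."""
    offset: int
    types: frozenset
    size: int


def is_buffer(type_name):
    """Assumed convention: array types such as char[9] denote buffers."""
    return re.fullmatch(r"\w+\[\d+\]", type_name) is not None


def stack_overflow_suspects(layout):
    """Keep fields that are buffers and carry the input_t semantic tag."""
    return [
        f for f in layout
        if "input_t" in f.types and any(is_buffer(t) for t in f.types)
    ]


# Four rows taken from Table 3.
layout = [
    Field(-9324, frozenset({"char[9]", "input_t"}), 9),
    Field(-8300, frozenset({"char*"}), 4),
    Field(-8236, frozenset({"char[9]", "input_t"}), 9),
    Field(-8227, frozenset({"char[28]"}), 28),
]

suspects = stack_overflow_suspects(layout)
for f in suspects:
    print(f"suspect buffer at offset {f.offset}, size {f.size}")
```

On these rows the filter reports the two `char[9]` fields at offsets -9324 and -8236. The untagged `char[28]` buffer is skipped because nothing indicates that its content comes from input.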