Dynamic Binary Modification: Tools, Techniques, and Applications
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in
printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00345ED1V01Y201104CAC015
Lecture #15
Series Editor: Mark D. Hill, University of Wisconsin
Synthesis Lectures on Computer Architecture
Series ISSN: Print 1932-3235, Electronic 1932-3243
Dynamic Binary Modification
Tools, Techniques, and Applications
Kim Hazelwood
University of Virginia
Morgan & Claypool Publishers
ABSTRACT
Dynamic binary modification tools form a software layer between a running application and the
underlying operating system, providing the powerful opportunity to inspect and potentially modify
every user-level guest application instruction that executes. Toolkits built upon this technology have
enabled computer architects to build powerful simulators and emulators for design-space exploration,
compiler writers to analyze and debug the code generated by their compilers, software developers
to fully explore the features, bottlenecks, and performance of their software, and even end-users to
extend the functionality of proprietary software running on their computers.
Several dynamic binary modification systems are freely available today that place this power
into the hands of the end user. While these systems are quite complex internally, they mask that
complexity with an easy-to-learn API that allows a typical user to ramp up fairly quickly and build
any of a number of powerful tools. Meanwhile, these tools are robust enough to form the foundation
for software products in use today.
This book serves as a primer for researchers interested in dynamic binary modification systems,
their internal design structure, and the wide range of tools that can be built leveraging these systems.
The hands-on examples presented throughout form a solid foundation for designing and constructing
more complex tools, with an appreciation for the techniques necessary to make those tools robust
and efficient. Meanwhile, the reader will get an appreciation for the internal design of the engines
themselves.
KEYWORDS
dynamic binary modification, instrumentation, runtime optimization, binary translation, profiling, debugging, simulation, security, user-level analysis
To my husband Matthew
and our daughters Anastasia and Adrianna
for their patience and encouragement
while I worked on this project,
and for their ongoing love and support.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
5 Architectural Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1.1 Trace Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.2 Functional Cache Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.3 Functional Branch Prediction Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1.4 Timing Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Emulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2.1 Supporting New Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.2 Masking Hardware Flaws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3 Binary Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.4 Design-Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Historical Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Author’s Biography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Acknowledgments
Much of the content of this book would not be possible without some of the great innovators
in this field. Some people who stand out in my memory include Robert Cohn, Robert Muth,
Derek Bruening, Vas Bala, Evelyn Duesterwald, Mike Smith, Jim Smith, Vijay Janapa Reddi, Nick
Nethercote, Julian Seward, Wei Hsu, CK Luk, Greg Lueck, Artur Klauser, and Geoff Lowney. I
should also thank the countless contributors to Pin, as well as the contributors to the many projects
that preceded and formed the foundation for those projects listed throughout this book.
I would like to thank Mark Hill for approaching me and encouraging me to write this book,
as well as the feedback and support he has provided throughout the project. Additionally, I would
like to thank Michael Morgan for providing me the opportunity to contribute to this lecture series
and for doing his best to keep me on schedule.
I should also acknowledge the people who advised me against writing this book prior to tenure
(who shall remain nameless). Although I ultimately ignored that advice, I do know that it was well
intended, and I always appreciate those who take the time and express enough interest to offer advice
to others. Meanwhile, I always tend to fall back on the mantra to “Keep asking for advice until you
get the advice you want.”
Kim Hazelwood
March 2011
CHAPTER 1
Dynamic Binary Modification: Overview
Figure 1.1: The code-discovery problem that arises from variable instruction lengths, mixed code and
data, and indirect branches (Smith and Nair [2005]). On the left, we see the correct interpretation of the
binary as determined by dynamic analysis. On the right, we see an incorrect interpretation as determined
by static analysis.
as shown in Figure 1.1. Finally, by operating at runtime, the dynamic binary modifier only targets
the portions of the guest application code and program paths that actually execute.
1.1 UTILITY
Dynamic binary modifiers have been used for a wide variety of reasons, many of which the designers
of the original systems had never envisioned. Users now span the subfields of computer architecture,
compilers, program analysis, software engineering, and computer security. We’ll take a high-level
look at some of these motivating applications in the following sections, and we will follow up with
detailed examples in Chapters 3–5.
Utility For Application Developers Software engineers have a myriad of reasons to require a
detailed understanding of the software systems they develop. While this performance analysis can
be done in an ad-hoc manner, dynamic binary modification enables a more systematic approach
to software profiling. Rather than mining massive amounts of source code, potentially missing key
instances, developers may instead analyze the runtime behavior of their applications using a simple
API and minimal profiling code. For instance, they can analyze all of the branches in their program
(and all shared libraries it calls) using one or two API calls, or they can classify all of the instructions
executed using a small number of calls.
Developers may also wish to perform systematic debugging of their software. For instance,
they may wish to ensure that every dynamic memory allocation has a corresponding deallocation.
Using binary modification to dynamically record every allocation, this goal can be achieved with
very little developer effort.
Utility For Hardware Designers An interesting application of dynamic binary modification is
emulating new instructions. Given that the binary modifier has access to every instruction before it
executes, it can recognize a new instruction that is currently unsupported by the hardware. Instead
of executing that instruction and causing an illegal instruction exception, the system can emulate the
new behavior while measuring the frequency of use of the new instruction. In fact, a similar approach
can be used to mask faulty implementations of machine instructions, by dynamically replacing those
instructions with a correct emulation of that instruction’s desired behavior.
A more general application of dynamic binary modification is to generate live traces for
driving simple simulators. For instance, a user can write a simple cache simulator by instrumenting
all memory accesses in a guest application. Memory access data can either be written to a file to
drive an offline simulator, or it can be piped directly to a running cache simulator. Similarly, a branch
prediction simulator can be written by instrumenting all branch instructions to record the source
address, target address, and branch outcome. Finally, a full-blown timing simulator can be written
by instrumenting all instructions to record any information necessary for driving a timing simulator,
though it is only possible to measure the overhead of committed instructions using this mechanism.
Committed instructions are all that are visible to a software-level binary modification tool.
Utility For System Software Designers Yet another application of dynamic binary modification
is the ability to add and enforce new security or privacy policies to existing applications. A user may
wish to enforce that applications do not overwrite instructions or jump to locations that have been
classified as data. The ability to observe and potentially modify every application instruction prior
to executing that instruction makes these tasks straightforward.
The motivating applications listed in this chapter attempt to demonstrate the wide variety of
possibilities that arise when a user is given the ability to observe or modify every executing instruction.
Each of these examples is described in deeper detail, with sample implementations and output, in
the later chapters. Meanwhile, all examples presented in this book simply serve to scratch the surface
of the potential of this technology.
1.2 FUNCTIONALITY
A user-level dynamic binary modification system is generally invoked in one of two ways. First, a
user may execute an entire application, start to finish, under the control of the system. This approach
is well suited for full system simulation, emulation, debugging tools, or security applications where
full control and complete code coverage are paramount. In the second invocation method, a user
may wish to attach a binary modification engine to an already running application, much in the
same way that a debugger can be attached to/detached from a running program. This method may
work well for profiling and locating bottlenecks, or simply to figure out what a program is doing at
a given instant.
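For instance, Pin supports both invocation methods from the command line. A minimal sketch, with a placeholder tool name and process ID:

    prompt% pin -t mytool.so -- <guestApp>
    prompt% pin -pid 12345 -t mytool.so

The first form launches the guest application under the engine's control from its very first instruction; the second attaches to the already running process with ID 12345, analogous to attaching a debugger.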
Whatever the invocation method, most binary modifiers have three modes of execution:
interpretation-mode, probe-mode, and JIT-mode execution. In an interpretation-mode execution,
the original binary is viewed as data, and each instruction is used as a lookup into a table of alternative
instructions that provide the corresponding functionality desired by the user. In a probe-mode
execution, the original binary is modified in-place by overwriting instructions with new instructions
or branches to new routines. While this mode results in lower runtime overhead, it is quite limited,
particularly on architectures such as x86, and therefore JIT-mode execution ends up being the more
common implementation. In a JIT-mode execution, the original binary is never modified or even
executed. Instead, the original binary is viewed as data, and a modified copy of the executed portions of the binary is regenerated in a new area of memory. These modified copies are then executed in lieu
of the original application. Both probe-mode and JIT-mode execution models are discussed in more
detail in Chapter 2, while their internal implementation is discussed in Chapter 6. Interpretation is
not discussed as the overhead of interpretation prevents it from being widely used in these systems.
Once the user of a dynamic binary modification tool has control over the execution of a
guest application, they then have the ability to incorporate programmable instrumentation into that
guest application. They can define the conditions under which to modify the application (e.g., upon
all taken branches) as well as the changes they wish to make (e.g., increment a counter or record
the target address). From there, the binary modifier will transparently inject the new code into the
running application, taking care to perform supporting tasks, such as freeing any registers necessary to perform the function, while otherwise maintaining the system state that the application expects. The
level of transparency may vary by system – e.g., some systems will avoid writing to the application
stack while others will borrow the application’s stack temporarily. Either way, most systems do ensure
that the observed state is as close as possible to that of a native run of the guest application.
CHAPTER 2
Using a Dynamic Binary Modifier
1The software code cache is also called a translation cache in some sources.
For short-running programs and/or programs with few iterations, it becomes difficult to amortize the
overhead of just-in-time code regeneration.
Figure 2.1: The binary modification engine, the guest application, and the user’s plug-in tool all execute
in the same address space.
The user must provide a specification for the changes they wish to make, which is really a matter of understanding the system’s
exported API.
Because these APIs differ greatly between systems, we will focus on the Pin API for the
purposes of this and the next few chapters, before moving back to a system-agnostic view when
covering the internal implementations of dynamic binary modification systems in Chapter 6 and
beyond. This chapter is by no means an extensive user manual defining the Pin API; the goal is to
provide an intuitive sense of the power available to the user and the overall intent of the API. The
interested reader is encouraged to visit the project website for each system to access the complete
user guide.
API Overview From the highest level, the API allows a user to iterate over the instructions that
are about to execute, in order to have the opportunity to add, remove, change, or simply observe
the instructions prior to executing them. The changes can be as simple as inserting instructions to
gather dynamic profiling information, or as complex as replacing a sequence of instructions with an
alternate implementation.
The most basic APIs provide common functionalities like determining instruction details,
determining control-flow changes, or analyzing memory accesses. In Pin, most of the API routines are callback-based. The user can register a callback to be notified when key events occur, and the user can
make calls from their plug-in tool into the Pin engine to gather relevant information. (Note that
in many cases, Pin will automatically inline these calls to improve performance, as will be discussed
later. Meanwhile, some tools always assume that analysis code will be inlined, and they will leave it
to the user to ensure that inlining is safe by saving and restoring any needed registers or state.)
Instrumentation vs. Analysis At this point, it’s important to provide a bit of terminology to
distinguish the opportunities available for observing and modifying the guest application. Most
systems provide two types of opportunities to observe the application – a static opportunity and a
dynamic (runtime) opportunity. The static opportunity allows every distinct instruction that executes
to be observed or modified once, and more specifically, the first time that instruction is seen. From
that point forward, any of the static changes that were made to those instructions will persist
for the duration of the execution time. We call the routines that provide this static opportunity
instrumentation code. Instrumentation routines focus on specific code locations.
Alternatively, dynamic opportunities arise every time a single instruction executes at runtime.
Measuring a dynamic event involves inserting code that will execute over and over for any given
instruction. We call the routines that provide this dynamic view analysis code. Analysis code focuses
on events that occur at some point within the execution of an application.
To summarize, instrumentation routines define where to insert instrumentation. Analysis
routines define what to do when the instrumentation is activated. Instrumentation routines execute
once per instruction. Analysis routines can execute millions of times for each instruction, depending on how deeply nested in loop code that one instruction lies. This terminology becomes particularly important
when thinking about how to implement a desired goal. For instance, if the user wishes to gather the
frequency of using a particular register, they will have to distinguish the static frequency (how often
the register appears in the binary) from the dynamic frequency (how often the register is accessed
at runtime). It is also important to distinguish these opportunities so that the user is not adding
unnecessary dynamic overhead when some fixed amount of static overhead would suffice.
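A minimal sketch makes the distinction concrete (it uses the Pin calls that the examples in Chapter 3 introduce; the counter names are our own):

    UINT64 staticCount = 0;   // bumped once per distinct instruction, at instrumentation time
    UINT64 dynamicCount = 0;  // bumped every time any instrumented instruction executes

    VOID CountDynamic() { dynamicCount++; }

    // Instrumentation routine: runs once, the first time each instruction is seen
    VOID Instruction(INS ins, VOID *v)
    {
        staticCount++;  // a static event
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDynamic, IARG_END);  // a dynamic event
    }

For an instruction inside a hot loop, staticCount rises by one while dynamicCount rises by one per loop iteration.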
Instrumentation Points, Granularity, and Arguments The system permits the user’s plug-in tool
to access every executed instruction. The tool can then choose to modify the particular instruction,
or insert code before or after that instruction. For branch instructions, new code can be inserted on
either the fall-through or taken path. The tool designer must be sure that the particular location they
choose to insert code will actually execute. For instance, inserting new code after an unconditional
jump is probably not a good idea.
The APIs of Pin, DynamoRIO, and Valgrind all allow the user or tool to iterate over and
inspect several distinct granularities of the guest application. The user can choose to iterate over single
instructions just before each instruction executes, entire basic blocks2 (a straight-line sequence of
non-control flow instructions followed by a single control flow instruction, such as a branch, jump,
call, or return), entire traces3 (a series of basic blocks), or the entire program image.
It is important to understand that the basic blocks and traces that the system presents to the
tool represent a single-entry dynamic path. That is, control can only enter the top of the sequence
(not the sides), but control can exit through any side exit that exists. If later, control enters the
side of an existing sequence, then a new structure (basic block or trace) will be formed starting
at the side entry point. Therefore, there can be duplication between basic blocks and traces when
performing static analysis! This fact is important for users to understand if their results depend on all
2Technically, what is commonly called a basic block in the dynamic binary modification world is actually an extended basic block
in the literature, as the system cannot tell whether any of the straight-line instructions are targets of other branches.
3 What is called a trace in the dynamic binary modification world is called a superblock in the static compiler world.
statically-reported instruction sequences being distinct, such as gathering a static instruction count.
In practice, the duplication affects a very small proportion of instructions. For situations where this
matters, the user can distinguish unique instances of static instructions using the original memory
address of an instruction as an indicator of uniqueness, rather than assuming that all instructions
reported at instrumentation time are unique.
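A sketch of that technique, assuming a static counter like the one shown earlier and a C++ std::set for deduplication (the names are illustrative):

    #include <set>
    std::set<ADDRINT> seen;  // original addresses of instructions already counted

    VOID Instruction(INS ins, VOID *v)
    {
        // Count each static instruction once, even if it reappears in several
        // basic blocks or traces because of side entries.
        if (seen.insert(INS_Address(ins)).second)
            staticCount++;
    }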
[Figure: The components of plug-in tool overhead.
Tool Overhead = Instrumentation Routines Overhead + Analysis Routines Overhead
Analysis Routines Overhead = Frequency of Calling Analysis Routine × Work Performed in the Analysis Routine
Work Performed in the Analysis Routine = Work Required to Transition to the Analysis Routine + Work Performed Inside the Analysis Routine]
Assuming a correctly implemented guest application, and a correctly implemented binary modifi-
cation engine, the final task that a user of a binary modifier is likely to require is a way to debug
their custom plug-in tool. The fact that three applications are actually running in the same address
space (the binary modifier, the guest application, and the user’s plug-in tool) means that standard
debugging methodologies will not apply. Instead, documentation for each system provides specific
details about the best way to debug plug-in tools on that system.
On Pin, for example, it is possible to use gdb to debug a user plug-in tool on Linux. However,
the process involves using two different shells, one to run the debugger, and one to run the application
under the control of Pin. The three-step process is shown below:
Step 1 In one window, invoke gdb with Pin:
prompt% gdb pin
(gdb)
Step 2 In a second window, launch your Pintool with the -pause_tool flag, which takes the
number of seconds to pause as an argument.
prompt% pin -pause_tool 5 -t myPinTool.so -- <guestApp>
Pausing to attach to pid 32017
Step 3 Back in the gdb window, attach to the paused process. You may now use gdb in the standard
fashion, setting breakpoints as usual, and running cont to continue execution.
(gdb) attach 32017
(gdb) break main
(gdb) cont
Other systems will have their own tricks for debugging the variety of execution modes on the
variety of supported platforms in their user manuals.
Summary At this point, we now understand the high-level applications and necessary terminology
for using dynamic binary modification systems. We will therefore move on to specific examples and
use cases in the following chapters.
CHAPTER 3
Program Analysis and Debugging
3.1 PROGRAM ANALYSIS EXAMPLES
In this section, we will cover four simple program analysis Pintools that demonstrate the ease of
analyzing a running program using dynamic binary modification.
Generating a Dynamic Instruction Trace with PrintPC Perhaps one of the simplest plug-in
tools that can be written is one that performs the task of printing the machine’s program counter
throughout the execution of an application. Such a tool can be useful for gathering a statistical
view of where the execution time is spent. Figure 3.1 demonstrates the entire program necessary
to implement this functionality as a plug-in to the Pin dynamic instrumentation system, and it
demonstrates some of the basic APIs provided to the user.
The easiest way to understand PrintPC is to start from the bottom of Figure 3.1 and focus
on the main() routine. Here we see that the user makes some calls to initialize their data structures,
open any necessary output files, and initialize Pin. Next, they register an instrumentation routine
that will be executed for every static instruction seen at runtime (line 26). Finally, they register
another routine that will execute immediately prior to exiting at the end of the execution time
(line 27), before instructing Pin to launch the guest application (line 28). Nothing after the call
to PIN_StartProgram() will ever execute. (The return 0 statement is only present to make the compiler happy.)
The PrintPC Tool
1 ofstream TraceFile;
2
3 // This analysis call is invoked for every dynamic instruction executed
4 VOID PrintPC(VOID *ip)
5 {
6 TraceFile << ip << endl;
7 }
8
9 // This instrumentation routine is invoked once per static instruction
10 VOID Instruction(INS ins, VOID *v)
11 {
12 INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)PrintPC, IARG_INST_PTR, IARG_END);
13 }
14
15 // This fini routine is called after the guest program terminates, just prior to exiting Pin
16 VOID Fini(INT32 code, VOID *v)
17 {
18 TraceFile.close();
19 }
20
21 // This main routine is invoked to initialize Pin, prior to starting the guest application
22 int main(int argc, char * argv[])
23 {
24 TraceFile.open("pctrace.out"); // Open an output file
25 PIN_Init(argc, argv); // Initialize Pin
26 INS_AddInstrumentFunction(Instruction, 0); // Register a routine to be called to instrument instructions
27 PIN_AddFiniFunction(Fini, 0); // Register Fini to be called when the application exits
28 PIN_StartProgram(); // Start the program; this call never returns
29 return 0;
30 }
Figure 3.1: This program analysis tool prints the address of every instruction that executes to a file. It
demonstrates the use of static instruction-level instrumentation routines and dynamic analysis routines.
Next, let’s look at the instrumentation and analysis routines. The instrumentation routine is
called Instruction() (line 10), and it will be called every time an instruction is encountered for the
first time. When that occurs, we tell the system to insert a new routine before that instruction, which
will be called immediately before the instruction executes (line 12). The new routine we
insert before every instruction is called PrintPC() (line 4). It takes the current instruction pointer
(program counter) as an argument, then prints that PC to a file.
If we compile PrintPC, link it to Pin’s libraries, and execute a guest application using this
plug-in, the output will be a (large) trace file. The file will contain a list of program addresses that
executed, in the order that they executed, including all executed addresses within shared libraries.
Since the plug-in will run in user space alongside the guest application, no kernel addresses will
appear in the trace file.
This simple tool can easily be extended to sample the program counter, rather than to print every single address. It can further be optimized to use conditional instrumentation to reduce the overhead of PC sampling (a sketch follows Figure 3.2). Finally, it can be extended to print not only the instruction addresses, but
The CallTrace Tool
1 // One of the two following analysis routines will be invoked for every dynamic call
2 VOID do_call(const string *s)
3 {
4 TraceFile << *s << endl;
5 }
6 VOID do_call_indirect(ADDRINT target, BOOL taken)
7 {
8 if( !taken ) return;
9 do_call( Target2String(target) );
10 }
11
12 // This instrumentation routine is invoked once per static trace
13 VOID Trace(TRACE trace, VOID *v)
14 {
15 for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
16 INS tail = BBL_InsTail(bbl);
17 if( INS_IsCall(tail) ) {
18 if( INS_IsDirectBranchOrCall(tail) ) {
19 const ADDRINT target = INS_DirectBranchOrCallTargetAddress(tail);
20 INS_InsertPredicatedCall(tail, IPOINT_BEFORE, AFUNPTR(do_call),
21 IARG_PTR, Target2String(target), IARG_END);
22 }
23 else INS_InsertCall(tail, IPOINT_BEFORE, AFUNPTR(do_call_indirect),
24 IARG_BRANCH_TARGET_ADDR, IARG_BRANCH_TAKEN, IARG_END);
25 }
26 }
27 }
28
29 // This fini routine is called after the guest program terminates, just prior to exiting Pin
30 VOID Fini(INT32 code, VOID *v)
31 {
32 TraceFile.close();
33 }
34
35 // This main routine is invoked to initialize Pin, prior to starting the guest application
36 int main(int argc, char *argv[])
37 {
38 PIN_InitSymbols();
39 PIN_Init(argc,argv);
40 TraceFile.open("calltrace.out"); // Opens the output file (the filename here is illustrative)
41 TRACE_AddInstrumentFunction(Trace, 0); // Will gather the call trace as the program runs
42 PIN_AddFiniFunction(Fini, 0); // Closes the output file prior to exiting
43 PIN_StartProgram(); // Launches the program and never returns
44 return 0;
45 }
Figure 3.2: The CallTrace program analysis tool records all function calls that occur during execution.
This is a simplified version of the tool available in SimpleExamples/calltrace.cpp of the Pin distribution,
originally written by Robert Muth.
the actual instructions, including opcodes and operand values. We leave these tasks as an exercise
for the interested reader.
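Returning to the sampling extension of PrintPC mentioned above, the following hedged sketch uses Pin's conditional instrumentation pair, INS_InsertIfCall() and INS_InsertThenCall(); the sampling period and routine names are our own, and TraceFile is reused from Figure 3.1. The If routine is short enough for Pin to inline, so the expensive Then routine fires only once per period:

    #define SAMPLE_PERIOD 1000        // illustrative sampling period
    static ADDRINT countdown = SAMPLE_PERIOD;

    // Cheap, inlinable test: fire only when the countdown reaches zero
    ADDRINT CountDown() { return (--countdown == 0); }

    // Expensive action: record the PC and reset the countdown
    VOID RecordPC(VOID *ip) { countdown = SAMPLE_PERIOD; TraceFile << ip << endl; }

    VOID Instruction(INS ins, VOID *v)
    {
        INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDown, IARG_END);
        INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordPC, IARG_INST_PTR, IARG_END);
    }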
Call-Graph Generation with CallTrace Another simple profiling tool is one that analyzes the
function calls made while a program runs. Such a tool could be used for analyzing code coverage or
for detecting inefficiencies in the call stream. Figure 3.2 demonstrates a simple call trace generation
tool. Unlike the previous example, this tool instruments entire traces at once, rather than individual
instructions. Working at larger granularities makes for a more efficient design that has a lower run-
time overhead. We see a difference in the implementation first on line 41, where we use the TRACE API rather than the INS API. Next, within the instrumentation routine Trace() on line 13, we iterate over all instructions within the trace to search for a call, rather than handling one instruction at a time. We also see a few more query APIs that are available, such as the boolean INS_IsCall() query on line 17 and the INS_IsDirectBranchOrCall() query on line 18. This allows the tool
to distinguish between direct and indirect calls statically, and to insert the corresponding specialized
analysis calls for each case. This also demonstrates a subtle point, which is that any query that can be
done once statically should be done at that time. While we could have embedded the query function
for determining whether an instruction is a branch or call into the dynamic stream, this trait does
not change at runtime, and it would therefore be inefficient to query the same instruction multiple
times.
If we execute CallTrace as a plug-in to Pin while running a guest application, we will get a
log of all of the function calls that were made (by name) throughout execution. Again, we will see
only user-level behavior, which includes calls to shared and dynamically-loaded libraries and even
system calls, but no calls made from within the kernel. We can also extend the CallTrace example in
a number of ways. We can modify the tool to print the arguments to each call, to focus exclusively
on system calls, or to focus on one particular call, such as malloc(). Finally, we can write a more
sophisticated tool to generate a call graph, rather than simply a log of all calls.
Memory-Leak Detection with MallocTrace Rather than focusing on all dynamic calls, a natural
extension is to focus on a few calls of interest, such as those relating to memory allocations. Such
a tool can be used to easily detect memory leaks within applications by comparing the amount of
memory allocated to the amount deallocated.
The tool shown in Figure 3.3 shows a simple way to modify one or more particular functions of
interest. In this case, the tool instruments any application call to malloc() or free() by modifying
the contents of the functions themselves. We accomplish this by performing instrumentation on the
entire image at load time, as is demonstrated on line 39 of the tool. The instrumentation routine itself
searches for the two routines of interest (the malloc() and free() routines) on lines 17 and 25,
respectively. This search is performed once. If either routine is located, the system inserts the new
functionality shown in the two analysis routines called BeforeMallocFree() and AfterMalloc(), which simply print some information about the size and location of the allocation or deallocation by
analyzing the inputs to the calls themselves from within the function body. Since the details of each
call to malloc() and free() will vary at run-time as different amounts of memory are requested
or released, we must track the arguments and return values to/from each call. We accomplish this
in Pin by specifying a set of arguments that can be captured at runtime and passed to the plug-
in analysis routines. The arguments themselves are specified on lines 20-21, 22, and 28-29. The
arguments of interest are already captured by the Pin API, which allows the user to access the
FUNCARG_ENTRYPOINT_VALUE (the inputs to a function) and/or the FUNCRET_EXITPOINT_VALUE
(the return value from a function). These values are then passed to the analysis routines to be printed
The MallocTrace Tool
1 #define MALLOC "malloc"
2 #define FREE "free"
3
4 // The following analysis calls are invoked before/after every call to malloc and free
5 VOID BeforeMallocFree(CHAR * name, ADDRINT size)
6 {
7 cout << name << "(" << size << ")" << endl;
8 }
9 VOID AfterMalloc(ADDRINT ret)
10 {
11 cout << " returns " << ret << endl;
12 }
13
14 // This image routine is invoked once, prior to executing the program
15 VOID Image(IMG img, VOID *v)
16 {
17 RTN mallocRtn = RTN_FindByName(img, MALLOC); // Finds malloc()
18 if (RTN_Valid(mallocRtn)) {
19 RTN_Open(mallocRtn);
20 RTN_InsertCall(mallocRtn, IPOINT_BEFORE, (AFUNPTR)BeforeMallocFree, IARG_ADDRINT, MALLOC,
21 IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
22 RTN_InsertCall(mallocRtn, IPOINT_AFTER, (AFUNPTR)AfterMalloc, IARG_FUNCRET_EXITPOINT_VALUE, IARG_END);
23 RTN_Close(mallocRtn);
24 }
25 RTN freeRtn = RTN_FindByName(img, FREE); // Finds free()
26 if (RTN_Valid(freeRtn)) {
27 RTN_Open(freeRtn);
28 RTN_InsertCall(freeRtn, IPOINT_BEFORE, (AFUNPTR)BeforeMallocFree, IARG_ADDRINT, FREE,
29 IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
30 RTN_Close(freeRtn);
31 }
32 }
33
34 // This main routine is invoked to initialize Pin, prior to starting the guest application
35 int main(int argc, char *argv[])
36 {
37 PIN_InitSymbols(); // Initialize Pin’s symbols for RTN instrumentation
38 PIN_Init(argc,argv); // Initialize Pin
39 IMG_AddInstrumentFunction(Image, 0); // Register a routine to be called to instrument the image
40 PIN_StartProgram(); // Start the program; this call never returns
41 return 0;
42 }
Figure 3.3: The MallocTrace program analysis tool instruments the malloc() and free() functions.
It prints the arguments to each function, and the return value from malloc().
out at runtime. The net result of applying the MallocTrace tool is that we have an application that,
rather than calling the native malloc and free routines, will instead call a new version of malloc
and free. The new versions are otherwise identical to the old, but they are amended to contain new code that prints the arguments to these routines and the return values. The corresponding log of
memory allocations and deallocations can subsequently be analyzed to detect memory leaks. While
the log itself is generated as the guest application executes, the task of detecting memory leaks can
be performed either online during execution, or offline after the guest application completes, and
the log has been written to a file. These are good starting points for realistic memory leak tools.
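As one hedged sketch of the online variant, the MallocTrace analysis routines can be repurposed to maintain a map of outstanding allocations (the names are ours, and the bookkeeping ignores realloc() and thread safety for brevity):

    #include <map>
    std::map<ADDRINT, ADDRINT> live;  // outstanding allocations: address -> size
    static ADDRINT lastSize = 0;      // size argument captured on entry to malloc

    VOID BeforeMalloc(ADDRINT size) { lastSize = size; }
    VOID AfterMalloc(ADDRINT ret)   { if (ret) live[ret] = lastSize; }
    VOID BeforeFree(ADDRINT ptr)    { live.erase(ptr); }

    // Anything still in the map when the guest exits was never freed
    VOID Fini(INT32 code, VOID *v)
    {
        for (std::map<ADDRINT, ADDRINT>::iterator it = live.begin(); it != live.end(); ++it)
            cout << "leaked " << it->second << " bytes at " << hex << it->first << endl;
    }

These routines would be registered with RTN_InsertCall() just as in Figure 3.3, with BeforeFree attached to the entry point of free().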
The InsMix Tool
1 // This analysis call is invoked for every dynamic instruction executed
2 VOID PIN_FAST_ANALYSIS_CALL docount(COUNTER * counter)
3 {
4 (*counter)++;
5 }
6
7 // This instrumentation routine is invoked once per static trace. It inserts the analysis routine.
8 VOID Trace(TRACE trace, VOID *v)
9 {
10 for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
11 // Insert instrumentation to count the number of times the bbl is executed
12 BBLSTATS * bblstats = new BBLSTATS(stats, INS_Address(BBL_InsHead(bbl)), rtn_num, size, numins ); // stats, rtn_num, size, and numins come from elided portions of the full tool
13 INS_InsertCall(BBL_InsHead(bbl), IPOINT_BEFORE, AFUNPTR(docount), IARG_FAST_ANALYSIS_CALL, IARG_PTR,
14 &(bblstats->_counter), IARG_END);
15 }
16 }
17
18 // This image routine is invoked once, prior to executing the program
19 VOID Image(IMG img, VOID * v)
20 {
21 for (SEC sec = IMG_SecHead(img); SEC_Valid(sec); sec = SEC_Next(sec)) {
22 for (RTN rtn = SEC_RtnHead(sec); RTN_Valid(rtn); rtn = RTN_Next(rtn)) {
23 // A RTN is not broken up into BBLs, it is merely a sequence of INSs
24 RTN_Open(rtn);
25 for (INS ins = RTN_InsHead(rtn); INS_Valid(ins); ins = INS_Next(ins)) {
26 for(UINT16 *start=array; start<end; start++) GlobalStatsStatic[ *start ]++; // array, end, and GlobalStatsStatic are defined in the elided portions of the full tool
27 }
28 RTN_Close(rtn); // to preserve space, release data associated with RTN after processing
29 }
30 }
31 }
Figure 3.4: Snippet of the InsMix program analysis tool, which categorizes the static and dynamic
instruction stream. The complete tool is available in Insmix/insmix.cpp of the Pin distribution and was
originally written by Robert Muth.
Instruction Profiling and Code Coverage with InsMix Our final program analysis example
focuses on the general case of instruction profiling. There are a number of reasons that developers
may wish to profile the static and dynamic instruction stream of their applications. The InsMix tool
shown in Figure 3.4 serves as the foundation for a number of such profiling tools. This tool can
be used to group instructions by class to determine the most frequently used and most frequently
executed instruction classes for a given program. Such a tool can be used for code coverage analysis,
which can be done by comparing the static and dynamic instruction stream. Interestingly, this same
tool can be used for compiler bug detection. By comparing the code generated by one compiler
to that generated by another compiler, we can easily detect inefficiencies in the code generation
routines, such as unnecessary spills and fills.
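For instance, a hedged sketch of the dynamic half of such a tool groups executed instructions by their XED category, which Pin exposes through INS_Category() and CATEGORY_StringShort(); the array bound here is a generous assumption rather than the exact constant from the XED headers:

    #define MAX_CATEGORY 128                 // assumed upper bound on category values
    UINT64 dynCount[MAX_CATEGORY];           // one dynamic counter per category

    VOID PIN_FAST_ANALYSIS_CALL CountCategory(UINT32 cat) { dynCount[cat]++; }

    VOID Instruction(INS ins, VOID *v)
    {
        // The category is a static property, so we look it up once here and
        // pass it to the analysis routine as a constant argument.
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountCategory,
                       IARG_FAST_ANALYSIS_CALL,
                       IARG_UINT32, INS_Category(ins), IARG_END);
    }

At Fini time, the counts can be printed with CATEGORY_StringShort(cat) as the label for each bucket.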
Debugging and analyzing parallel programs is difficult because their execution is not deterministic.
The threads’ relative progress can change in every run of the program, possibly changing the results.
Even single-threaded program execution is not deterministic because of behavior changes in certain
system calls (for example, gettimeofday()) and stack and shared library load locations.
Using dynamic binary modification, it is possible to perform user-level capture and determin-
istic replay of multithreaded programs. PinPlay is one such tool, based on Pin. The program first
runs under the control of a logging tool, which captures all the system call side effects and inter-
thread shared-memory dependencies. Another tool replays the log, exactly reproducing the recorded
execution by loading system call side effects and possibly delaying threads to satisfy recorded shared-
memory dependencies.
Replaying a previously captured log by itself is not very useful. Instead, a captured log can
be used to ensure that other program analysis tools see the same program behavior on multiple
runs, making the analysis deterministic. The tool can also replay a PinPlay log while connected to a
debugger, making multithreaded program debugging deterministic. As long as the PinPlay logger
can capture a bug once, the behavior can be precisely replicated multiple times with replay under
a debugger. More details of PinPlay are presented by Patil et al. [2010], while more details on logging operating system effects are presented by Narayanasamy et al. [2006].
CHAPTER 4
Active Program Modification
Figure 4.1: Snippet of a tool for rewriting the memory operations in a guest application.
Figure 4.2: This tool replaces calls to malloc() with calls to a custom memory allocation routine.
Figure 4.2 shows an example of function replacement. In this case, the user would like to
replace all calls to malloc() on Linux with a call to a custom memory allocation routine. The
custom routine, called MyNewMalloc, begins on Line 1. Note that, in this case, the custom routine
simply acts as a wrapper function that calls the standard malloc() routine after printing a custom
message. Yet, there is no restriction in place that a new routine must call the routine it has replaced,
so a truly custom memory allocator could be implemented and deployed, and in fact, this example
code would serve as a solid template for doing so.
The function replacement plug-in shown in Figure 4.2 also demonstrates two key features
of dynamic binary modification systems. First, when deployed, the tool will replace all calls to
malloc(), including those made by the shared libraries invoked by the application – not just the
calls contained within the guest application itself. Second, the tool demonstrates the use of probe-
based instrumentation where the calls to malloc are overwritten prior to executing the application, at
load time, rather than on-the-fly as the calls are executed.
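Since Figure 4.2 itself is not reproduced here, the following is a hedged sketch in its spirit using Pin's probe-mode replacement API, RTN_ReplaceSignatureProbed(); the wrapper body and message are illustrative:

    // Wrapper that logs the request, then defers to the original allocator
    VOID *MyNewMalloc(AFUNPTR origMalloc, size_t size)
    {
        cout << "malloc(" << size << ")" << endl;
        typedef VOID *(*MallocType)(size_t);
        return ((MallocType)origMalloc)(size);
    }

    VOID Image(IMG img, VOID *v)
    {
        RTN rtn = RTN_FindByName(img, "malloc");
        if (!RTN_Valid(rtn)) return;
        PROTO proto = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_DEFAULT,
                                     "malloc", PIN_PARG(size_t), PIN_PARG_END());
        RTN_ReplaceSignatureProbed(rtn, AFUNPTR(MyNewMalloc),
                                   IARG_PROTOTYPE, proto,
                                   IARG_ORIG_FUNCPTR,
                                   IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
        PROTO_Free(proto);
    }

In probe mode, main() would end with PIN_StartProgramProbed() rather than PIN_StartProgram().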
4.3 DYNAMIC OPTIMIZATION
Some of the earlier dynamic binary modifiers were designed to perform optimizations on running
applications with the goal of a net performance improvement. One example system was Dynamo
from Hewlett-Packard (Bala et al. [1999, 2000]), which optimized PA-RISC applications running
on the HPUX operating system. Dynamo applied a series of optimizations designed to leverage
the fact that at runtime, the program inputs are known; example optimizations included constant
propagation and dead-code elimination. Since the goal of Dynamo was optimization, the system
had the ability to “bail out” and execute the application natively if the act of binary modification was
seen to be causing a performance degradation. Most of the modern dynamic binary modifiers do not
support this feature as they are intended to provide comprehensive control over a guest application.
The DynamoRIO project from MIT grew out of the original Dynamo project at HP
(albeit ported to x86 on Linux and Windows rather than PA-RISC on HPUX), so DynamoRIO’s
goal has been one of dynamic optimization from the very beginning. Many of the design decisions
within DynamoRIO reflect this emphasis, as DynamoRIO provides for fine-grained control over
the internal behavior of its code generation engine and therefore the resulting performance of a
modified guest application. Given this fact, it seems logical that our example demonstration of
dynamic optimization will use DynamoRIO as the binary modification engine.
Figure 4.3 presents one of the standard demonstrations of dynamic optimization within
DynamoRIO. The optimization leverages a performance difference between processor generations that would be too processor specific to exploit in a static optimization framework. The optimization developer
made the keen observation that when performing x86 assembly instruction selection for the high-
level language statement i++, there are two choices which perform differently on two different
processors. Specifically, the assembly instruction inc (increment) is faster on the Pentium-III processor, while an add instruction with an immediate operand of 1 is faster on
The IncVsAdd DynamoRIO Client
1 EXPORT void dr_init() {
2 if (proc_get_family() == FAMILY_PENTIUM_IV) dr_register_trace_event(event_trace);
3 }
4 static void event_trace(void *drcontext, app_pc tag, instrlist_t *trace, bool xl8) {
5 instr_t *instr, *next_instr; int opcode;
6 for (instr = instrlist_first(trace); instr != NULL; instr = next_instr) {
7 next_instr = instr_get_next(instr);
8 opcode = instr_get_opcode(instr);
9 if (opcode == OP_inc || opcode == OP_dec) replace_inc_with_add(drcontext, instr, trace);
10 }
11 }
12 static bool replace_inc_with_add(void *drcontext, instr_t *instr, instrlist_t *trace) {
13 instr_t *in; uint eflags; int opcode = instr_get_opcode(instr);
14 bool ok_to_replace = false;
15 for (in = instr; in != NULL; in = instr_get_next(in)) {
16 eflags = instr_get_arith_flags(in);
17 if ((eflags & EFLAGS_READ_CF) != 0) return false;
18 if ((eflags & EFLAGS_WRITE_CF) != 0) {
19 ok_to_replace = true;
20 break;
21 }
22 if (instr_is_exit_cti(in)) return false;
23 }
24 if (!ok_to_replace) return false;
25 if (opcode == OP_inc) in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1));
26 else in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1));
27 instr_set_prefixes(in, instr_get_prefixes(instr));
28 instrlist_replace(trace, instr, in);
29 instr_destroy(drcontext, instr);
30 return true;
31 }
Figure 4.3: This DynamoRIO client tool determines whether the underlying processor is Pentium-IV,
and if so, it replaces each increment instruction with an add instruction dynamically.
the Pentium-IV. Therefore, the client tool determines the specific underlying processor at runtime,
and dynamically converts the instruction selections when appropriate.
Digging deeper into the source code shown in Figure 4.3, we see that at initialization the client tool checks whether the underlying processor is a Pentium-IV and, if so, registers a callback that is invoked for each newly created trace (lines 1–3). Next, the system walks through the various instructions as they
are encountered, searching for instances of the increment or decrement instructions (line 9). When
found, the system invokes the routine replace_inc_with_add on that instruction. This particular
routine is then shown on lines 12–31, where it begins by determining whether the instruction
replacement is safe to perform. Next, it creates a new instruction that adds or subtracts the immediate
value 1, and inserts it into the instruction stream before deleting the existing increment or decrement
instruction. For the remainder of the current execution, the modified code sequence will persist, and,
therefore, the act of replacing the instruction only occurs once per static instance of the increment
or decrement instruction.
Many additional dynamic optimizations can also be envisioned and explored using dynamic
binary modification systems. The ideal optimization candidates are those that cannot be performed
statically because they are either too aggressive, too processor specific, or input specific, but that can
be safely applied once the runtime environment is known.
CHAPTER 5
Architectural Exploration
Computer architects have found great utility in dynamic binary modification systems as an en-
abling technology for fast exploratory studies of novel architectural algorithms or features. These
exploratory applications have taken the form of simulation tools, instruction emulation features, and
general design-space exploration. Furthermore, dynamic binary modification systems can be used
for complete binary translation functionality, to enable a smooth transition from one instruction-
set architecture to another, otherwise incompatible architecture. In fact, several corporations have
leveraged this precise technology to support industrial-strength, widely circulated binary trans-
lation (Apple, Dehnert et al. [2003]). The wide variety of opportunities and applications within the
computer architecture community will be discussed in this chapter.
5.1 SIMULATION
For the purposes of simulating new or existing architectures, computer architects typically have
two options for employing binary modification tools. First, they can build plug-in tools that gather
runtime data and feed their custom simulator in real time. Alternatively, they can record runtime
data as instruction traces, which can be saved for later use. Figure 5.1 demonstrates the former case
of online simulation, while Figure 5.2 depicts the latter case of offline, or trace-driven simulation.
Both cases gather the relevant data to drive simulation during an actual run of the guest application.
This allows designers to validate that their modifications will not affect the correctness of the guest
application, or that the traces they generate represent a valid execution of the user-level code.
Unlike traditional simulators that can be prohibitively slow, simulators based on dynamic
binary modification technology can run at speeds on the order of native execution, allowing more extensive
testing than is typically possible. In fact, the robust nature of binary modification systems means that
complex applications, such as Oracle database applications, or the user-level components of large
parallel data-mining applications, can be characterized in addition to the small toy applications or benchmarks that are commonplace.
Figure 5.3: This figure depicts a snippet of a data cache simulator Pintool, similar to the dcache.cpp tool
distributed with Pin.
As Figure 5.3 indicates, cache simulation only requires the dynamic binary modifier to instru-
ment the memory operations present in the guest application. Lines 14 and 17 determine whether
a given instruction accesses memory, and if so, whether it is a read or write. This (and only this)
information is then forwarded on to the cache simulation engine that runs alongside the guest ap-
plication. Note that the only memory accesses observed are those accesses by the guest application,
and neither the binary modifier itself nor the cache simulator will affect the order or location of the
accesses. The cache simulator can then emulate the behavior of the proposed cache hierarchy, deter-
mine whether a given read or write would have resulted in a cache miss, and update the simulated
cache state accordingly.
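Because only the caption of Figure 5.3 appears above, here is a hedged sketch of the instrumentation half it describes; CacheRead() and CacheWrite() stand in for the simulator's entry points:

    VOID CacheRead(VOID *ea)  { /* look up ea in the simulated hierarchy as a load  */ }
    VOID CacheWrite(VOID *ea) { /* look up ea in the simulated hierarchy as a store */ }

    VOID Instruction(INS ins, VOID *v)
    {
        // Forward the effective address of every memory access to the simulator.
        if (INS_IsMemoryRead(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CacheRead,
                           IARG_MEMORYREAD_EA, IARG_END);
        if (INS_IsMemoryWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CacheWrite,
                           IARG_MEMORYWRITE_EA, IARG_END);
    }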
A more extensive cache simulator called CMP$im is loosely based on the example above and
is presented by Jaleel et al. [2008]. CMP$im simulates multicore caches as well as single core caches.
As Figure 5.4 indicates, the dynamic binary modifier focuses only on the branches present
in the guest application. It then streams the branch address, target address, and outcome to the
branch prediction simulator. The simulator can then emulate the new prediction policy, and it can
record the hit/miss rate of the new design. Again, since only branches from the guest application
are analyzed, there is no risk of polluting the results with information from the binary modification
engine or the simulator itself. And just like the previous simulator, the branch predictor simulation
can be performed either online or offline.
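A hedged sketch of the corresponding instrumentation, where ProcessBranch() stands in for the predictor model being driven:

    VOID ProcessBranch(VOID *pc, VOID *target, BOOL taken)
    {
        // Feed one branch event (source, destination, outcome) to the predictor model.
    }

    VOID Instruction(INS ins, VOID *v)
    {
        if (INS_IsBranch(ins) && INS_HasFallThrough(ins))  // conditional branches only
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)ProcessBranch,
                           IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR,
                           IARG_BRANCH_TAKEN, IARG_END);
    }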
5.2 EMULATION
Our previous discussions of dynamic binary modification as a driver for architectural simulation
emphasized observing execution and demonstrated how the observations could then drive various
simulators. Another important application of dynamic binary modification is modifying applications so that a user can emulate entirely new functionality in place of the behavior that would otherwise have occurred on the underlying machine.
Figure 5.5 shows an example application of emulation. Let’s say that a user wishes to replace
the normal operation that occurs when loading data from memory with a new implementation. That
user could essentially replace all instances where data is loaded with a custom implementation. In
Figure 5.5, that custom implementation simply augments the load with a logging functionality that
also prints out what data was loaded. Yet, the same principle can be applied to replace a load with
an entirely new implementation that, for instance, loads all data from a new area of memory.
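As Figure 5.5 is not reproduced here, the following hedged sketch is modeled on a load-emulation example from Pin's documentation; the logging is illustrative. It deletes each simple register-destination load and substitutes an analysis routine whose return value is written to the destination register:

    // Custom load implementation: log the access, then perform the read ourselves
    ADDRINT DoLoad(REG reg, ADDRINT *addr)
    {
        cout << "emulated load from address " << addr << endl;
        ADDRINT value;
        PIN_SafeCopy(&value, addr, sizeof(ADDRINT));
        return value;  // delivered to the destination register via IARG_RETURN_REGS
    }

    VOID Instruction(INS ins, VOID *v)
    {
        if (INS_Opcode(ins) == XED_ICLASS_MOV && INS_IsMemoryRead(ins) &&
            INS_OperandIsReg(ins, 0) && INS_OperandIsMemory(ins, 1))
        {
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)DoLoad,
                           IARG_UINT32, REG(INS_OperandReg(ins, 0)),
                           IARG_MEMORYREAD_EA,
                           IARG_RETURN_REGS, INS_OperandReg(ins, 0),
                           IARG_END);
            INS_Delete(ins);  // the original load instruction never executes
        }
    }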
The same mechanism can, for instance, locate each faulty divide instruction and substitute a correct emulation of the respective divide operation. The ability to easily locate and replace arbitrary instructions from
a guest application is a powerful application of dynamic binary modification systems as it enables
software solutions to hardware problems.
CHAPTER 6
Figure 6.1: Internal organization of a JIT-based dynamic binary modification system. Three programs
run in the same address space (the guest application, the plug-in, and the modification engine itself ).
Within the modification engine, there is a JIT compiler that creates a modified copy of every guest
application instruction, a code cache for storing previously translated code, and an emulation unit for
maintaining control at system call points.
In probe-mode systems, by contrast, problems can arise if a branch instruction were to target the second of the overwritten instructions, as that address would now contain the second half of the new jump instruction.
[Pie chart: translated code 23%, data structures 36%, auxiliary code 41%]
Figure 6.2: Memory distribution of the contents of the software code cache. Contents include translated
code, auxiliary code, and data structures.
Figure 6.3: Organization of a software code cache that supports medium-grained evictions. The traces
are grouped into larger, fixed-sized cache blocks, which can be deleted as a whole, avoiding fragmentation.
Medium-grained evictions avoid the fragmentation issues that hinder fine-grained evictions. Figure 6.3 presents a code
cache organization that supports medium-grained evictions.
For the interested reader, a much more detailed discussion of code cache replacement
issues and policies is available in the Ph.D. thesis by Hazelwood [2004] or in the papers
by Hazelwood and Smith [2006] or Bruening and Amarasinghe [2005].
Figure 6.4: A code cache introspection tool that detects and handles self-modifying code.
perform custom actions, such as invalidating a single trace or even flushing the entire code cache. The lookup API provides access to Pin’s internal data structures that keep track of the code cache’s contents, and finally, the statistics API gives the user access to aggregated data about the various actions that have taken place at any given point in time. More details about the potential applications of cache introspection are discussed by Hazelwood and Cohn [2006].
[Bar chart: Effect of Thread-Private Code Caches]
Figure 6.5: Code expansion resulting from thread-private code caches. The SPEC OMP 2001 bench-
marks were run as 8 application threads.
With thread-private code caches, each thread redundantly stores translated copies of shared code in its own private cache. This memory scalability issue justifies the use of a shared code cache across
threads.
Figure 6.6: Timeline comparing a naïve code cache flush to a thread-safe generational flush. The naïve
implementation stalls until all threads return to the VM, while the generational implementation makes
forward progress as it waits for threads to return to the VM.
When a flush is requested, each thread’s stale entries are removed from the code cache lookup table, and thus the thread will never re-enter code from the old generation. Threads may also move
forward with generating new code for the new generation. Finally, when the last thread leaves the
old generation, we flush the cache blocks for that generation. This scheme allows threads to continue
to generate and execute new code while other threads are potentially stalled and/or in the process
of leaving the old generation.
More details about the support necessary for handling multithreaded applications can be
found in a paper by Hazelwood et al. [2009].
CHAPTER 7
Historical Perspectives
The dynamic binary modification systems detailed in this text are by no means the first of their kind
(nor are they likely to be the last). The three systems were chosen as the focus of this book because
at the time of its writing, they were widely used and readily available.
In the 1990s, several other dynamic binary modifiers were developed, including
Shade for SPARC/Solaris (Cmelik and Keppel [1994]), DynInst for a variety of platforms
(Buck and Hollingsworth [2000]), Vulcan for x86/Windows (Edwards et al. [2001]), Wig-
gins/Redstone for Alpha (Deaver et al. [1999]), and Dynamo for HPUX/PA-RISC (Bala et al.
[1999]).
Later on, numerous other dynamic binary instrumentation frameworks appeared, including
Strata (Scott et al. [2003]), DELI (Desoli et al. [2002]), which is a descendant of Dynamo for the
LX architecture, DIOTA for x86/Linux (Maebe et al. [2002]), Mojo for x86/Windows (Chen et al.
[2000]), Walkabout for SPARC/Solaris (Cifuentes et al. [2002]), and HDTrans (Sridhar et al.
[2006]). In addition, the three focus systems from this book (Pin, DynamoRIO, and Valgrind) appeared during the first decade of the 2000s.
Other tools served a similar purpose to one or more of the applications of dynamic binary
modification, such as simulation or dynamic translation. Hardware simulators or emulators in-
clude Embra (Witchel and Rosenblum [1996]) and Simics (Magnusson et al. [2002]). Dynamic
binary translators include DAISY (Ebcioğlu and Altman [1997]), Crusoe (Dehnert et al. [2003]),
and Rosetta (Apple).
Outside of dynamic binary modification, there are a wide variety of static instrumentation tools
dating back several decades. For instance, the ATOM toolkit from Digital (Srivastava and Eustace
[1994]) formed the basis for the look-and-feel of the Pin API, and indeed there were several
developers in common. Meanwhile, other systems included Etch (Romer et al. [1997]), EEL
(Larus and Schnarr [1995]), and Morph (Zhang et al. [1997]).
CHAPTER 8
Summary and Observations
1This claim is based on informal observations made at MICRO 2009, HPCA 2010, ASPLOS 2010, CGO 2010, and ISCA 2010.
themselves. The book's coverage of the internal structures and algorithms is intended to meet that
goal. While many of the individual components have been covered in one form or another throughout
the existing literature, this text aims to provide a comprehensive and up-to-date view of the relevant
structures and algorithms in today's most commonly used dynamic binary modification systems.
Above all, the subfield of dynamic binary modification is still evolving. New applications,
new challenges, and new internal algorithms are regularly surfacing. One of the luxuries of an
electronically based textbook series is that the text itself can and should evolve as well. As such, I
hope to include your own breakthroughs in future editions. Stay tuned!
ADDITIONAL RESOURCES
This book provides detailed coverage of one of many topics covered in the virtual machines textbook
by Smith and Nair [2005]. The reader can refer to their book to get an understanding of how dynamic
binary modification fits into the bigger virtualization picture.
For the most part, one seminal paper exists for each of the three dynamic binary modi-
fication systems covered in this book. For Valgrind, that seminal paper appeared in PLDI 2007
(Nethercote and Seward [2007]), for Pin, it appeared in PLDI 2005 (Luk et al. [2005]), and for
DynamoRIO, it appeared in CGO 2003 (Bruening et al. [2003]). Readers interested in one partic-
ular system should start by focusing on the seminal work before moving on to the large number of
follow-up papers that have appeared since then. The interested reader should also consider reading
the Ph.D. theses of some of the developers of the three systems highlighted in this text. In particular,
Bruening [2004] presents an in-depth look at the internal workings of DynamoRIO. Meanwhile,
Nethercote [2004] presents the internal workings of Valgrind, as well as a nice historical perspective
on similar systems.
Bibliography
Apple. Rosetta. http://www.apple.com/rosetta/. 31, 38, 55
Moshe (Maury) Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazel-
wood, Aamer Jaleel, Chi-Keung Luk, Gail Lyons, Harish Patil, and Ady Tal. Analyzing parallel
programs with Pin. IEEE Computer, 43(3):34–41, March 2010. DOI: 10.1109/MC.2010.60 23
Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Transparent dynamic optimization. Tech-
nical Report HPL-1999-77, Hewlett Packard, June 1999. 27, 55
Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: a transparent dynamic op-
timization system. In Proceedings of the ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation, PLDI ’00, pages 1–12, Vancouver, BC, Canada, June 2000.
DOI: 10.1145/349299.349303 27
Utpal Banerjee, Brian Bliss, Zhiqiang Ma, and Paul Petersen. A theory of data race detection.
In Proceedings of the 2006 Workshop on Parallel and Distributed Systems: Testing and Debugging,
PADTAD ’06, pages 69–78, Portland, ME, USA, July 2006. DOI: 10.1145/1147403.1147416
23
Fabrice Bellard. QEMU: a fast and portable dynamic translator. In Proceedings of the USENIX Annual
Technical Conference, ATEC ’05, pages 41–46, Anaheim, CA, USA, 2005. USENIX Association.
38
Marc Berndl and Laurie Hendren. Dynamic profiling and trace cache generation. In Proceedings of
the 1st Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO
’03, pages 276–288, San Francisco, CA, USA, March 2003. DOI: 10.1109/CGO.2003.1191552
43
Derek Bruening and Saman Amarasinghe. Maintaining consistency and bounding capacity of
software code caches. In Proceedings of the 3rd Annual IEEE/ACM International Symposium on
Code Generation and Optimization, CGO ’05, pages 74–85, San Jose, CA, USA, March 2005.
DOI: 10.1109/CGO.2005.19 46, 47
Derek Bruening and Vladimir Kiriansky. Process-shared and persistent code caches. In Proceedings
of the 4th Annual ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environ-
ments, VEE ’08, pages 61–70, Seattle, WA, USA, March 2008. DOI: 10.1145/1346256.1346265
8
Derek Bruening, Evelyn Duesterwald, and Saman Amarasinghe. Design and implementation of
a dynamic optimization framework for Windows. In Proceedings of the 4th ACM Workshop on
Feedback-Directed and Dynamic Optimization, FDDO-4, Austin, TX, USA, December 2001. 42,
52
Derek Bruening,Timothy Garnett, and Saman Amarasinghe. An infrastructure for adaptive dynamic
optimization. In Proceedings of the 1st Annual IEEE/ACM International Symposium on Code
Generation and Optimization, CGO ’03, pages 265–275, San Francisco, CA, USA, March 2003.
DOI: 10.1109/CGO.2003.1191551 58
Derek Bruening, Vladimir Kiriansky, Timothy Garnett, and Sanjeev Banerji. Thread-shared soft-
ware code caches. In Proceedings of the 4th Annual IEEE/ACM International Symposium on
Code Generation and Optimization, CGO ’06, pages 28–38, New York, NY, USA, March 2006.
DOI: 10.1109/CGO.2006.36 49
Derek L. Bruening. Efficient, Transparent and Comprehensive Runtime Code Manipulation. PhD
thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, September 2004. 52, 58
Bryan Buck and Jeffrey K. Hollingsworth. An API for runtime code patching. Interna-
tional Journal of High Performance Computing Applications, 14(4):317–329, November 2000.
DOI: 10.1177/109434200001400404 39, 55
Wen-Ke Chen, Sorin Lerner, Ronnie Chaiken, and David Gillies. Mojo: A dynamic optimization
system. In Proceedings of the 4th ACM Workshop on Feedback-Directed and Dynamic Optimization,
FDDO-4, pages 81–90, Austin, TX, USA, December 2000. 55
Cristina Cifuentes, Brian Lewis, and David Ung. Walkabout: A retargetable dynamic binary trans-
lation framework. Technical Report SMLI TR-2002-106, Sun Microsystems Laboratories, Mountain View, CA, USA, 2002.
55
Bob Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution profiling.
In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Com-
puter Systems, SIGMETRICS ’94, pages 128–137, Nashville, TN, USA, May 1994. ACM.
DOI: 10.1145/183018.183032 55
Derek Davis and Kim Hazelwood. Improving region selection through loop completion. In Pro-
ceedings of the ASPLOS Workshop on Runtime Environments/Systems, Layering, and Virtualized
Environments, RESoLVE ’11, Newport Beach, CA, USA, March 2011. 44
Dean Deaver, Rick Gorton, and Norm Rubin. Wiggins/redstone: An on-line program specializer.
In IEEE Hot Chips XI, 1999. 55
James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander
Klaiber, and Jim Mattson. The transmeta code morphing software: Using speculation, recov-
ery, and adaptive retranslation to address real-life challenges. In Proceedings of the 1st Annual
IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’03, pages 15–
24, San Francisco, CA, USA, March 2003. DOI: 10.1109/CGO.2003.1191529 31, 38, 47, 55
Giuseppe Desoli, Nikolay Mateev, Evelyn Duesterwald, Paolo Faraboschi, and Joseph A. Fisher.
DELI: A new run-time control point. In Proceedings of the 35th Annual ACM/IEEE Inter-
national Symposium on Microarchitecture, MICRO-35, pages 257–268, Istanbul, Turkey, 2002.
DOI: 10.1109/MICRO.2002.1176255 55
Balaji Dhanasekaran and Kim Hazelwood. Improving indirect branch translation in dynamic binary
translators. In Proceedings of the ASPLOS Workshop on Runtime Environments/Systems, Layering,
and Virtualized Environments, RESoLVE ’11, Newport Beach, CA, USA, March 2011. 44
Evelyn Duesterwald and Vasanth Bala. Software profiling for hot path prediction: Less is more. In
Proceedings of the 12th International Conference on Architectural Support for Programming Languages
and Operating Systems, ASPLOS ’00, pages 202–211, Cambridge, MA, USA, October 2000.
DOI: 10.1145/356989.357008 43
Kemal Ebcioğlu and Erik R. Altman. DAISY: dynamic compilation for 100% architectural compat-
ibility. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA
’97, pages 26–37, Denver, CO, USA, 1997. ACM. DOI: 10.1145/384286.264126 55
Andrew Edwards, Amitabh Srivastava, and Hoi Vo. Vulcan: Binary transformation in a distributed
environment. Technical Report MSR-TR-2001-50, Microsoft Research, April 2001. 55
Apala Guha, Kim Hazelwood, and Mary Lou Soffa. Reducing exit stub memory consumption
in code caches. In Proceedings of the International Conference on High-Performance Embed-
ded Architectures and Compilers, HiPEAC ’07, pages 87–101, Ghent, Belgium, January 2007.
DOI: 10.1007/978-3-540-69338-3_7 44
Apala Guha, Kim Hazelwood, and Mary Lou Soffa. Balancing memory and performance through
selective flushing of software code caches. In Proceedings of the International Conference on Com-
pilers, Architectures and Synthesis for Embedded Systems, CASES ’10, pages 1–10, Scottsdale, AZ,
USA, October 2010a. DOI: 10.1145/1878921.1878923 45
Apala Guha, Kim Hazelwood, and Mary Lou Soffa. Dbt path selection for holistic memory efficiency
and performance. In Proceedings of the 6th ACM SIGPLAN/SIGOPS International Conference on
Virtual Execution Environments, VEE ’10, pages 145–156, Pittsburgh, PA, USA, March 2010b.
DOI: 10.1145/1837854.1736018 44
Kim Hazelwood. Code Cache Management in Dynamic Optimization Systems. PhD thesis, Harvard
University, Cambridge, MA, USA, May 2004. 46
Kim Hazelwood and Robert Cohn. A cross-architectural framework for code cache manipulation.
In Proceedings of the 4th Annual IEEE/ACM International Symposium on Code Generation and Opti-
mization, CGO ’06, pages 17–27, New York, NY, USA, March 2006. DOI: 10.1109/CGO.2006.3
47
Kim Hazelwood and Artur Klauser. A dynamic binary instrumentation engine for the ARM
architecture. In Proceedings of the International Conference on Compilers, Architectures, and
Synthesis for Embedded Systems, CASES ’06, pages 261–270, Seoul, Korea, October 2006.
DOI: 10.1145/1176760.1176793 13
Kim Hazelwood and James E. Smith. Exploring code cache eviction granularities in dynamic
optimization systems. In Proceedings of the 2nd Annual IEEE/ACM International Symposium on
Code Generation and Optimization, CGO ’04, pages 89–99, Palo Alto, CA, USA, March 2004.
DOI: 10.1109/CGO.2004.1281666 45
Kim Hazelwood and Michael D. Smith. Characterizing inter-execution and inter-application op-
timization persistence. In Proceedings of the Workshop on Exploring the Trace Space for Dynamic
Optimization Techniques, pages 51–58, San Francisco, CA, USA, June 2003. 8
Kim Hazelwood and Michael D. Smith. Managing bounded code caches in dynamic binary op-
timization systems. Transactions on Code Generation and Optimization, 3(3):263–294, September
2006. DOI: 10.1145/1162690.1162692 42, 46
Kim Hazelwood, Greg Lueck, and Robert Cohn. Scalable support for multithreaded applica-
tions on dynamic binary instrumentation systems. In Proceedings of the ACM International
Symposium on Memory Management, ISMM ’09, pages 20–29, Dublin, Ireland, June 2009.
DOI: 10.1145/1542431.1542435 51
David J. Hiniker, Kim Hazelwood, and Michael D. Smith. Improving region selection in dynamic op-
timization systems. In Proceedings of the 38th Annual International Symposium on Microarchitecture,
MICRO-38, pages 141–154, Barcelona, Spain, November 2005. DOI: 10.1109/MICRO.2005.22
43
Jason D. Hiser, Daniel Williams, Wei Hu, Jack W. Davidson, Jason Mars, and Bruce R. Childers.
Evaluating indirect branch handling mechanisms in software dynamic translation systems. In
Proceedings of the 5th Annual IEEE/ACM International Symposium on Code Generation and Opti-
mization, CGO ’07, pages 61–73, San Jose, CA, USA, March 2007. DOI: 10.1109/CGO.2007.10
44
Raymond J. Hookway and Mark A. Herdeg. Digital FX!32: Combining emulation and binary
translation. Digital Technical Journal, pages 3–12, February 1997. 38
Galen Hunt and Doug Brubacher. Detours: Binary interception of win32 functions. In Proceedings
of the 3rd USENIX Windows NT Symposium, pages 135–143, Seattle, WA, USA, July 1999. 39
Aamer Jaleel, Robert S. Cohn, Chi-Keung Luk, and Bruce Jacob. CMP$im: A Pin-based on-the-fly
single/multi-core cache simulator. In Proceedings of the 2008 Workshop on Modeling, Benchmarking
and Simulation, MOBS ’08, Beijing, China, June 2008. 34
Rahul Joshi, Michael D. Bond, and Craig Zilles. Targeted path profiling: Lower overhead path
profiling for staged dynamic optimization systems. In Proceedings of the 2nd Annual IEEE/ACM
International Symposium on Code Generation and Optimization, CGO ’04, pages 239–250, Palo
Alto, CA, USA, March 2004. 43
Minjang J. Kim, Chi-Keung Luk, and Hyesoon Kim. Prospector: Discovering parallelism via dy-
namic data-dependence profiling. Technical Report TR-2009-001, Georgia Institute of Tech-
nology, 2009. 23
Vladimir Kiriansky, Derek Bruening, and Saman Amarasinghe. Secure execution via program shep-
herding. In Proceedings of the 11th USENIX Security Symposium, pages 191–206, San Francisco,
CA, USA, August 2002. 29
James R. Larus and Eric Schnarr. EEL: machine-independent executable editing. In Proceedings of
the ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’95,
pages 291–300, La Jolla, CA, USA, 1995. ACM. DOI: 10.1145/223428.207163 55
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven
Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis
tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, PLDI ’05, pages 190–200, Chicago, IL, USA, June
2005. DOI: 10.1145/1065010.1065034 58
Jonas Maebe, Michiel Ronsse, and Koen De Bosschere. DIOTA: Dynamic instrumentation, optimiza-
tion, and transformation of applications. In Proceedings of the 4th Workshop on Binary Translation,
WBT ’02, Charlottesville, VA, USA, September 2002. 55
Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg,
Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. Simics: A full system
simulation platform. IEEE Computer, 35(2):50–58, February 2002. DOI: 10.1109/2.982916 55
Duane Merrill and Kim Hazelwood. Trace fragment selection within method-based JVMs.
In Proceedings of the 4th Annual ACM SIGPLAN/SIGOPS International Conference on Vir-
tual Execution Environments, VEE ’08, pages 41–50, Seattle, WA, USA, March 2008.
DOI: 10.1145/1346256.1346263 44
Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald, and Ramesh V. Peri. Shadow profil-
ing: Hiding instrumentation costs with parallelism. In Proceedings of the 5th Annual IEEE/ACM
International Symposium on Code Generation and Optimization, CGO ’07, San Jose, CA, USA,
March 2007. DOI: 10.1109/CGO.2007.35 53
Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder. Auto-
matic logging of operating system effects to guide application-level architecture simulation.
In Proceedings of the Joint International Conference on Measurement and Modeling of Computer
Systems, SIGMETRICS ’06/Performance ’06, pages 216–227, Saint Malo, France, June 2006.
DOI: 10.1145/1140103.1140303 24
Nicholas Nethercote. Dynamic Binary Analysis and Instrumentation. PhD thesis, University of
Cambridge, Cambridge, U.K., November 2004. 58
Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary
instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI ’07, pages 89–100, San Diego, CA, USA, June 2007.
DOI: 10.1145/1273442.1250746 58
Heidi Pan, Krste Asanovic, Robert Cohn, and Chi-Keung Luk. Controlling program execution
through binary instrumentation. In Proceedings of the Workshop on Binary Instrumentation and
Applications, WBIA ’05, St. Louis, MO, USA, September 2005. DOI: 10.1145/1127577.1127587
48
Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and Anand Karunanidhi.
Pinpointing representative portions of large Intel Itanium programs with dynamic instrumenta-
tion. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO-37, pages 81–92, Portland, OR, USA, December 2004.
DOI: 10.1109/MICRO.2004.28 32
Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie. PinPlay: A
framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings
of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO
’10, pages 2–11, Toronto, Ontario, Canada, April 2010. DOI: 10.1145/1772954.1772958 24
Vijay Janapa Reddi, Dan Connors, Robert Cohn, and Michael D. Smith. Persistent code caching:
Exploiting code reuse across executions and applications. In Proceedings of the 5th Annual Inter-
national IEEE/ACM Symposium on Code Generation and Optimization, CGO ’07, pages 74–88,
San Jose, CA, USA, March 2007. DOI: 10.1109/CGO.2007.29 8
Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne Wong, Hank Levy, Brian Bershad,
and Brad Chen. Instrumentation and optimization of Win32/Intel executables using Etch. In
Proceedings of the USENIX Windows NT Workshop, Seattle, WA, USA, 1997. USENIX Association.
55
Kevin Scott, Naveen Kumar, Siva Velusamy, Bruce Childers, Jack W. Davidson, and Mary Lou
Soffa. Retargetable and reconfigurable software dynamic translation. In Proceedings of the 1st
Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’03,
pages 36–47, San Francisco, CA, USA, March 2003. DOI: 10.1109/CGO.2003.1191531 45, 55
Alex Skaletsky, Tevi Devor, Nadav Chachmon, Robert S. Cohn, Kim Hazelwood, Vladimir
Vladimirov, and Moshe Bach. Dynamic program analysis of microsoft windows applications.
In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Soft-
ware, ISPASS ’10, pages 2–12, White Plains, NY, USA, March 2010.
DOI: 10.1109/ISPASS.2010.5452079 41, 52
James E. Smith and Ravi Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Morgan
Kaufmann, June 2005. 2, 58
Swaroop Sridhar, Jonathan S. Shapiro, Eric Northup, and Prashanth P. Bungale. HDTrans: An
open source, low-level dynamic instrumentation system. In Proceedings of the 2nd Annual ACM
SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’06, pages
175–185, Ottawa, Ontario, Canada, June 2006. DOI: 10.1145/1134760.1220166 55
Amitabh Srivastava and Alan Eustace. Atom: a system for building customized program anal-
ysis tools. In Proceedings of the ACM SIGPLAN Conference on Programming Language De-
sign and Implementation, PLDI ’94, pages 196–205, Orlando, FL, USA, June 1994. ACM.
DOI: 10.1145/989393.989446 55
Gang-Ryung Uh, Robert Cohn, Bharadwaj Yadavalli, Ramesh Peri, and Ravi Ayyagari. Analyzing
dynamic binary instrumentation overhead. In Proceedings of the Workshop on Binary Instrumentation
and Applications, WBIA ’06, San Jose, CA, USA, October 2006. 4
Steven Wallace and Kim Hazelwood. SuperPin: Parallelizing dynamic instrumentation for real-
time performance. In Proceedings of the 5th Annual IEEE/ACM International Symposium on
Code Generation and Optimization, CGO ’07, pages 209–217, San Jose, CA, USA, March 2007.
DOI: 10.1109/CGO.2007.37 52
Emmett Witchel and Mendel Rosenblum. Embra: Fast and flexible machine simulation. In
Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling
of Computer Systems, SIGMETRICS ’96, pages 68–79, Philadelphia, PA, USA, 1996. ACM.
DOI: 10.1145/233008.233025 55
Xiaolan Zhang, Zheng Wang, Nicholas Gloy, J. Bradley Chen, and Michael D. Smith. System
support for automatic profiling and optimization. In Proceedings of the Sixteenth ACM Sympo-
sium on Operating Systems Principles, SOSP ’97, pages 15–26, Saint Malo, France, 1997. ACM.
DOI: 10.1145/268998.266640 55
Qin Zhao, Ioana Cutcutache, and Weng-Fai Wong. PiPA: Pipelined profiling and analysis on
multicore systems. ACM Transactions on Architecture and Code Optimization, 7(3):13:1–13:29,
December 2010. DOI: 10.1145/1880037.1880038 53
Author’s Biography
KIM HAZELWOOD
Kim Hazelwood is an Assistant Professor of Computer Science
at the University of Virginia and a faculty consultant for Intel
Corporation. She works at the boundary between hardware and
software, with research efforts focusing on computer architecture,
run-time optimizations, and the implementation and applications
of process virtualization systems. She received the Ph.D. degree
from Harvard University in 2004. Since then, she has become
widely known for her active contributions to the Pin dynamic in-
strumentation system, which allows users to easily inject arbitrary
code into existing program binaries at run time (www.pintool.org). Pin is widely used
throughout industry and academia to
investigate new approaches to program introspection, optimiza-
tion, security, and architectural design. It has been downloaded over 50,000 times and cited in over
800 publications since it was released in July 2004. Kim has published over 40 peer-reviewed articles
related to computer architecture and virtualization. She has served on over two dozen program com-
mittees, including ISCA, PLDI, MICRO, OSDI, and PACT, and was a program chair of CGO 2010.
Kim is the recipient of numerous awards, including the FEST Distinguished Young Investigator
Award for Excellence in Science and Technology, an NSF CAREER Award, a Woodrow Wilson
Career Enhancement Fellowship, the Anita Borg Early Career Award, an MIT Technology Re-
view “Top 35 Innovators under 35 Award”, and research awards from Microsoft, Google, NSF, and
the SRC. Her research has been featured in MIT Technology Review, Computer World, ZDNet,
EE Times, and Slashdot.