How to Shadow Every Byte of Memory Used by a Program
Memcheck is a memory error detector designed primarily for use with C and C++ programs. It is the main reason for Valgrind's popularity: user surveys indicate it accounts for more than 80% of Valgrind tool use.

When a client program is run under Memcheck, Memcheck instruments almost every operation and issues messages about detected memory errors. Memcheck maintains three kinds of metadata about the running client.

• A bits. Every memory byte is shadowed with a single A bit ('A' is short for "addressability") which indicates if the client may legitimately access it. A 0 represents an unaddressable byte, a 1 represents an addressable byte. They are updated as memory is allocated and freed, and checked on every memory access. With the A bits Memcheck can detect uses of unaddressable memory such as heap buffer overflows and wild reads and writes.

• V bits. Every register and memory byte is shadowed with eight V bits ('V' for "validity") which indicate if the value bits are defined (i.e. initialised, or derived from other defined values). A 0 represents a defined bit, a 1 represents an undefined bit.2 Every value-writing operation is shadowed with another operation that updates the corresponding shadow values. With the V bits Memcheck can detect dangerous uses of undefined values with bit-precision [21]. In this paper we are only concerned with V bits for memory, not V bits for registers.

• Heap blocks. Memcheck records the location of every live heap block in a hash table. With this information it can detect bad or repeated frees of heap blocks, and memory leaks.
Conceptually, every register byte has eight shadow bits (the V bits), and every memory byte has nine shadow bits (eight V bits and one A bit). This implies that each shadow memory byte has 512 possible states. However, a byte's V bits are only consulted if the A bit says it is addressable, so there are 257 meaningful states for each byte (one "unaddressable" state, and 256 "addressable" states with different definedness sub-states).

Representing the definedness of every bit is potentially expensive, but crucial for accuracy. Tracking definedness at the byte level, a commonly-suggested alternative, inevitably causes false positives and/or false negatives, particularly for programs that use bit-fields and bit-level operations [21]. We have heard from several users that Memcheck has identified bugs caused by the use of a single undefined bit.
Fortunately, the redundancy in the number of states and the high cost of bit-level definedness tracking can be minimised with the use of compressed shadow memory (Section 4.4).
The next two sections present Memcheck's implementation of shadow memory, i.e. how it stores and accesses the V and A bits for memory locations. We do not discuss the use of V bits in shadow registers, nor the heap block metadata, as they have been covered by previous publications.

2 This counter-intuitive encoding makes many of Memcheck's V bit shadow operations simpler [21] for reasons that are beyond the scope of this paper.

3. A Simple Implementation (M0)
This section presents a simple, robust, but slow implementation of shadow memory for Memcheck, which we call M0. No released versions of Memcheck have used this implementation, but Memcheck has a debugging mode which falls back to it. Some or all of the optimisations described in Section 4 must be added to this implementation to obtain acceptable performance.

3.1 Shadow Memory Data Structures
Memcheck's main shadow memory data structure is a two-level table somewhat like a page table. It is designed for a 32-bit (4GB) address space; Section 3.8 describes how it is modified for 64-bit address spaces. The address space is divided into 64K chunks of 64KB each. The primary map (PM) is a global array with 64K entries, one for each chunk. Each entry is a pointer to a secondary map (SM) which holds the shadow memory (A and V bits) for a 64KB chunk. (In the code below, the types U1, U4, U8, U32 and U64 are 1, 4, 8, 32 and 64 bit unsigned integers respectively, and Uw, Addr and SizeT are word-sized unsigned integers.)

typedef struct {        // Secondary Map: covers 64KB
    U8 abits8[8192];    // 8K A bytes == 64K A bits
    U8 vbits8[65536];   // 64K V bytes
} SM;

SM* PM[65536];          // Primary Map: covers 4GB

Figure 1. The two-level table. Entries PM[1] and PM[2] cover 64KB regions that have been written to and so have their own SM. The remaining PM entries still point to the NOACCESS DSM.

There is a distinguished secondary map (DSM), called the NOACCESS DSM, which is marked as entirely "unaddressable", and is never modified. All PM entries initially point to it. Figure 1 illustrates the relationship between the PM and the SMs.
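The paper does not show how the PM and the DSM are set up; the following is a minimal sketch of one plausible initialisation, assuming hypothetical names (noaccess_dsm, init_shadow_memory) and the encodings from Section 2 (A bit 0 means unaddressable, V bit 1 means undefined):

#include <string.h>

static SM noaccess_dsm;   // the NOACCESS DSM (name is an assumption)

void init_shadow_memory(void)
{
    int i;
    // Every byte unaddressable (A bits all 0) and undefined (V bits all 1).
    memset(noaccess_dsm.abits8, 0x00, sizeof(noaccess_dsm.abits8));
    memset(noaccess_dsm.vbits8, 0xff, sizeof(noaccess_dsm.vbits8));
    for (i = 0; i < 65536; i++)
        PM[i] = &noaccess_dsm;   // all 64KB chunks start as NOACCESS
}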
Although each SM contains Memcheck-specific data, the basic two-level table can be used with any tool that uses shadow memory. Indeed, Section 6 shows that most existing shadow memory tools use a data structure similar to this.

3.2 Single-byte Loads and Stores
This section defines four fundamental functions used to load and store individual shadow memory bytes. Our functionally correct but slow implementation (M0) is built on top of these four functions.

There are two main steps when loading or storing a single shadow memory byte. In the first step, Memcheck uses the high 16 bits of the address to find the relevant SM within the PM. For loads, it can use the found SM as-is.
SM* get_SM_for_reading(Addr a)
{
    return PM[a >> 16];   // use bits [31..16] of 'a'
}

For stores, Memcheck uses copy-on-write semantics: it checks if the SM found is the DSM (with is_DSM), and if so, allocates and initialises a new SM (with copy_for_writing).

SM* get_SM_for_writing(Addr a)
{
    SM** sm_p = &PM[a >> 16];             // bits [31..16]
    if (is_DSM(*sm_p))
        *sm_p = copy_for_writing(*sm_p);  // copy-on-write
    return *sm_p;
}
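is_DSM and copy_for_writing are not defined in the paper; a minimal sketch under the assumptions of the initialisation sketch above, using a plain malloc-based allocator:

#include <stdlib.h>
#include <string.h>

Bool is_DSM(SM* sm)
{
    return sm == &noaccess_dsm;
}

SM* copy_for_writing(SM* dsm)
{
    // The new SM starts as a copy of the DSM, so the chunk's shadow
    // state is unchanged until individual bytes are written.
    SM* sm = malloc(sizeof(SM));
    if (sm == NULL)
        abort();          // out-of-memory policy is tool-specific
    memcpy(sm, dsm, sizeof(SM));
    return sm;
}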
In the second step, Memcheck uses the low 16 bits of the address to find the A and V bits within the SM. Loads are done with the following function get_abit_and_vbits8. Extracting the V bits from the SM is straightforward; extracting the single A bit from a group of eight A bits requires extra shifting and masking.

void get_abit_and_vbits8(Addr a, /*OUT*/Uw* abit,
                                 /*OUT*/Uw* vbits8)
{
    SM* sm = get_SM_for_reading(a);
    U8 abits8 = sm->abits8[(a & 0xffff) >> 3];  // bits [15..3]
    *abit   = 0x1 & (abits8 >> (a & 0x7));      // bits [ 2..0]
    *vbits8 = sm->vbits8[a & 0xffff];           // bits [15..0]
}

The store case (set_abit_and_vbits8) is similar.

void set_abit_and_vbits8(Addr a, Uw abit, Uw vbits8)
{
    SM* sm = get_SM_for_writing(a);
    Uw shift = a & 0x7;
    Uw i = (a & 0xffff) >> 3;
    sm->abits8[i] = (sm->abits8[i] & ~(1 << shift))
                  | ((abit & 0x1) << shift);
    sm->vbits8[a & 0xffff] = vbits8 & 0xff;
}
The following sections show other shadow memory operations that are layered on top of these four fundamental functions.

3.3 Multi-byte Loads and Stores
Most memory accesses are multi-byte. Every memory load in the client program is instrumented with a call to the following function LOADVn, which does a shadow load of 8, 16, 32 or 64 bits. It obtains each V byte individually from shadow memory and then combines them into a single n-bit value. The V bits of each byte are only used if the A bit indicates that the byte is addressable. If any unaddressable bytes are touched, Memcheck issues an error (with record_address_error) and acts as if the bits are all defined.3 This avoids possible chains of multiple error messages caused by a single program defect [21]. (The V_BITS* and A_BIT* constants hold aggregations of one or more V or A bits. For example, V_BITS8_DEFINED represents eight defined V bits, i.e. the V bits for a fully defined byte.)
U64 LOADVn(Addr a, SizeT nBits)
{
    Uw abit, vbits8; Int i;
    U64 vbits64 = V_BITS64_UNDEFINED;
    SizeT n_bad_addrs = 0;
    for (i = nBits/8 - 1; i >= 0; i--) {   // highest-addressed byte first
        get_abit_and_vbits8(a + i, &abit, &vbits8);
        if (abit != A_BIT_ADDRESSABLE) {
            n_bad_addrs++;                 // Defined-if-
            vbits8 = V_BITS8_DEFINED;      // unaddressable
        }
        vbits64 = (vbits64 << 8) | vbits8;
    }
    if (n_bad_addrs > 0)
        record_address_error(a, nBits);
    return vbits64;
}

Note that this code assumes memory values are little-endian. Memcheck also handles big-endian values, but we omit the relevant details (which are minor) for clarity.

3 All shadow memory tools must handle this case, i.e. provide a reasonable shadow memory value for memory locations that have not been allocated and are erroneously accessed.
All client stores are instrumented with a call to STOREVn, which is similar to LOADVn; it copies a given shadow memory value of 8, 16, 32 or 64 bits into shadow memory, one byte at a time. If any destination byte's A bits indicate that it is unaddressable, the relevant V bits are not copied and an error message is issued.
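STOREVn itself is not listed in the paper; a sketch consistent with the description above (one byte at a time, little-endian, unaddressable bytes left untouched):

void STOREVn(Addr a, U64 vbits64, SizeT nBits)
{
    Uw abit, vbits8; Int i;
    SizeT n_bad_addrs = 0;
    for (i = 0; i < nBits/8; i++) {
        get_abit_and_vbits8(a + i, &abit, &vbits8);
        if (abit != A_BIT_ADDRESSABLE)
            n_bad_addrs++;    // do not copy V bits for this byte
        else                  // V bits for the i'th (little-endian) byte
            set_abit_and_vbits8(a + i, abit, (vbits64 >> (8*i)) & 0xff);
    }
    if (n_bad_addrs > 0)
        record_address_error(a, nBits);
}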
3.4 Range-setting Operations
In certain cases, Memcheck needs to set every shadow memory byte in a range to one of the following three states.

1. NOACCESS (unaddressable): for memory deallocations (on the heap, stack, or via system calls such as munmap).

2. UNDEFINED (addressable and fully undefined): for memory allocations that do not initialise the allocated memory (such as malloc, brk, and stack allocations).

3. DEFINED (addressable and fully defined): for memory allocations that initialise the allocated memory (such as calloc and mmap), for memory loaded at program start-up, and for when a system call writes to a block of memory (e.g. gettimeofday fills in two structs with data).

Valgrind provides an event-tracking system that lets a tool know, via callbacks, when these operations occur [16]. Memcheck has three range-setting callback functions: make_mem_noaccess, make_mem_undefined, and make_mem_defined. They each take a starting address and a length in bytes, and set the shadow memory bytes for the given memory range, one byte at a time, using set_abit_and_vbits8; one of them is sketched below.
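These callbacks are not listed in the paper; a minimal sketch of one of them, assuming an A_BIT_ADDRESSABLE constant in the style of Section 3.3:

void make_mem_undefined(Addr a, SizeT len)
{
    SizeT i;
    // Addressable (A bit 1) but with every V bit undefined (all 1s).
    for (i = 0; i < len; i++)
        set_abit_and_vbits8(a + i, A_BIT_ADDRESSABLE, V_BITS8_UNDEFINED);
}

make_mem_noaccess and make_mem_defined differ only in the constants passed; Section 4.2 shows how this naive one-byte-at-a-time loop is vectorised.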
A similar function, copy_range, is used to handle realloc and mremap. It copies A and V bits from one shadow memory region to another.

3.5 Range-checking Operations
Sometimes Memcheck needs to check A and V bits over ranges of shadow memory and issue error messages if they do not match what is required. This is mostly done to check that ranges of memory passed to system calls have the properties that they should.

Memcheck has three such operations: check_mem_is_addressable checks that every byte in a range about to be written by a system call is addressable (and thus safe to write); check_mem_is_defined checks that every byte in a range about to be read by a system call is both addressable and fully defined (and thus safe to read); and check_mem_is_defined_asciiz checks that every byte in a string of unknown length is safe to read.
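A sketch of the second of these; the reporting helper record_syscall_error is hypothetical, in the style of record_address_error:

Bool check_mem_is_defined(Addr a, SizeT len)
{
    Uw abit, vbits8; SizeT i;
    for (i = 0; i < len; i++) {
        get_abit_and_vbits8(a + i, &abit, &vbits8);
        if (abit != A_BIT_ADDRESSABLE || vbits8 != V_BITS8_DEFINED) {
            record_syscall_error(a + i);  // hypothetical reporting helper
            return False;
        }
    }
    return True;
}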
3.6 Do Not Shadow the Shadows
There is a subtle problem that can occur in this shadow memory implementation. For example, consider this memory layout:
    SMX (a 72KB SM, which covers 64KB of address space)
    Y   (4KB of client data)
    SMZ (a 72KB SM, which covers 64KB of address space)

Any client access to Y causes shadow memory accesses to a secondary map SMY (it does not matter where SMY is located). But because SMY covers 64KB of address space, even in the best case it must cover at least 60KB's worth of address space from SMX and/or SMZ. In other words, because of the intermingling of SMs and client data, some parts of the SMs end up uselessly covering parts of the address space which are occupied by other SMs. (In Memcheck, those ranges would be marked as unaddressable by the client program.) This wastes space. If such intermingling is frequent, it can lead to a steep increase in the number of SMs needed. This problem affected early versions of Memcheck.

There are two ways to avoid this problem. If SMs are all nKB and they are guaranteed to be nKB-aligned, there will be no overlapping. But this can be too restrictive; for example, in Section 4.4 we introduce an SM that covers 64KB but is only 16KB in size. Memcheck uses a simpler approach: ensure that SMs are kept far away from the client's original data whenever possible.

3.7 Possible Corruption of Shadow Memory by the Client
It is possible for a buggy client to do wild writes that overwrite Memcheck's shadow memory (or any of its other data structures). When Valgrind only ran on x86/Linux, it used the x86 segment registers to prevent this. However, this non-portable feature was removed when Valgrind was ported to other architectures, and we know of no other way to prevent such wild writes without large slow-downs.

This problem rarely occurs in practice because Valgrind and Memcheck's data tends to be far away from client data, which minimises the chance of a wild write causing corruption. (This is another good reason why client data should not be closely intermingled with Valgrind and Memcheck data.) Also, Memcheck will always warn about any such wild writes by the client before they happen, because Valgrind and Memcheck data is marked as unaddressable via the NOACCESS DSM. (Other shadow memory tools will not be so lucky.) This is a good example of a trade-off: some robustness is sacrificed, albeit in rare cases, in favour of performance and portability.

3.8 Handling 64-Bit Machines
64-bit address spaces are much larger than 32-bit address spaces. The obvious extension is to use a three- or four-level table, but this would make every shadow memory access slower. Instead we extend the size of the primary map to 2^19 entries (covering 32GB), we add a slow, sparse auxiliary table for secondary maps higher than 32GB, and the Valgrind core avoids allocating above this 32GB point when possible. A below-32GB check has to be done for every shadow memory operation, but for the fast cases it can be combined for zero cost in a single mask-and-test operation with the alignment check used in the optimised shadow loads and stores (described in Section 4.1 below).
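A sketch of the resulting first-level lookup; aux_table_lookup is a hypothetical name for the sparse auxiliary table's lookup routine:

#define LIKELY(x) __builtin_expect(!!(x), 1)   // GCC branch hint

SM* get_SM_for_reading_64(Addr a)
{
    if (LIKELY(a < (1ULL << 35)))        // below the 32GB point?
        return PM[a >> 16];              // PM extended to 2^19 entries
    return aux_table_lookup(a >> 16);    // slow, sparse auxiliary table
}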
This approach works well under Linux because Valgrind has enough control over the address space layout that it can allocate most memory under the 32GB limit. Unfortunately, we have found that this is not true for PPC64/AIX. Therefore, Memcheck uses some "semi-fast" cases, similar to those described in Section 4, for certain accesses above the 32GB limit, e.g. those that are aligned and fully defined. This avoids the large slow-down that the slow cases cause, but is still approximately half the speed of the 32-bit version.

In the future, as 64-bit architectures become more common and memory footprints grow, this issue will become increasingly important. It is unclear how this shadow memory scheme can best be scaled to 64-bit address spaces, so this remains an open research question for the future.

3.9 Handling Multi-threaded Programs
Threads pose a particular challenge for shadow memory tools. The reason is that loads and stores become non-atomic: each load or store translates into the original load/store plus a shadow load/store. There are two potential problems with this. First, asynchronous signals may be delivered between these two operations. To avoid this problem, Valgrind only delivers asynchronous signals to the client at particular safe points (between the code blocks that Valgrind's JIT compiler uses) [16].

Second, on a uni-processor machine, a thread switch might occur between these two operations. On a multi-processor machine, concurrent memory accesses to the same memory location may complete in a different order to their corresponding shadow memory accesses. It is unclear how to best deal with this, as a fine-grained locking approach would likely be slow.

To sidestep this second problem, Valgrind uses a thread locking mechanism; on a thread-switch the kernel still chooses which thread is to run, but Valgrind dictates when thread-switches occur and prevents more than one thread from running at a time. This works well on uni-processor machines: Valgrind can ensure that thread-switches never occur between a load/store and a shadow load/store, and Memcheck can simply ignore threading issues. On multi-processor machines it can also run multi-threaded programs safely, but only serially. As multi-processor machines become more popular, this shortcoming will become more critical. Whether the current approach will remain the optimal one in the future is an open research question.

4. A Better Implementation
This section presents four optimisations for M0 which when applied successively give us four new versions which we call M1–M4. These optimisations use standard principles: make common cases fast, and reduce storage requirements by exploiting redundancy in data. Together they reduce Memcheck's mean slow-down factor by 4.0–13.6x, and reduce its mean shadow memory size by 4.5–213.4x, making it fast enough for widespread use, and compact enough to handle large programs.

4.1 Faster Loads and Stores (M1)
Multi-byte loads and stores are very common, and we can do better than use LOADVn and STOREVn with them. For example, the following function does fast 32-bit shadow loads. If the load is aligned (which guarantees that all four bytes are covered by the same SM) and all four bytes are addressable, it obtains the 32 V bits from the SM in a single operation. Otherwise, it falls back to the slow, general case.

U32 LOADV32_fast(Addr a) {
    SM* sm; U4 abits4; U8 abits8;
    if (!IS_32BIT_ALIGNED(a))
        return (U32)LOADVn(a, 32);
    sm = get_SM_for_reading(a);
    abits8 = sm->abits8[(a & 0xffff) >> 3];
    abits4 = (abits8 >> (a & 0x4)) & 0xf;
    return ( A_BITS4_ADDRESSABLE == abits4
           ? ((U32*)(sm->vbits8))[(a & 0xffff) >> 2]
           : (U32)LOADVn(a, 32) );
}

This function's fast path does one alignment test, two SM lookups, and one addressability check. This is much faster than LOADVn which does, for 32-bit loads, eight SM lookups, four addressability checks, and combines the four V bytes together.
With this function present, the slow case is run only every few thousand or even million loads.

STOREV32_fast is similar, but with the additional fast-path condition that sm must not be the DSM. The functions for fast 1, 2 and 8 byte shadow loads and stores are similar, except that the 1-byte case does not need the alignment check. These functions handle all the common multi-byte load and store cases.

Section 5 shows that this optimisation reduces Memcheck's mean slow-down factor by 3.73x.
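STOREV32_fast is not listed in the paper; a sketch consistent with that description, reusing the STOREVn sketch from Section 3.3 as the slow path:

void STOREV32_fast(Addr a, U32 vbits32)
{
    SM* sm; U4 abits4; U8 abits8;
    if (!IS_32BIT_ALIGNED(a)) {
        STOREVn(a, vbits32, 32);   // slow, general case
        return;
    }
    sm = get_SM_for_reading(a);
    if (is_DSM(sm)) {              // slow path does the copy-on-write
        STOREVn(a, vbits32, 32);
        return;
    }
    abits8 = sm->abits8[(a & 0xffff) >> 3];
    abits4 = (abits8 >> (a & 0x4)) & 0xf;
    if (A_BITS4_ADDRESSABLE == abits4)   // all four bytes addressable:
        ((U32*)(sm->vbits8))[(a & 0xffff) >> 2] = vbits32;  // one store
    else
        STOREVn(a, vbits32, 32);
}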
4.2 Faster Range-setting (M2)
Range-setting operations are also very common. There are three improvements that can be made to them. Section 5 shows that together these optimisations reduce Memcheck's mean slow-down factor by 1.62x, and reduce its mean shadow memory size by 1.97x, but sometimes drastically more (e.g. more than 30x).

Vectorising set_range. First, set_range from Section 3.4 can be vectorised so that it sets one byte at a time until the current address is N-aligned, then sets N bytes at a time, and finally sets any left-over bytes one at a time. N must be a power of two to ensure that all N bytes belong to the same SM. Memcheck uses N = 8.
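A sketch of the vectorised loop for this representation, assuming a set_range signature that takes the A bit and the per-byte V bits to store:

void set_range(Addr a, SizeT len, Uw abit, Uw vbits8)
{
    // Lead-in: single bytes until 'a' is 8-aligned.
    while (len > 0 && (a & 0x7) != 0) {
        set_abit_and_vbits8(a, abit, vbits8);
        a++; len--;
    }
    // Body: 8 bytes at a time; an 8-aligned group never straddles an SM,
    // so one A-bit byte and one 8-byte V-bit group cover it.
    while (len >= 8) {
        SM* sm = get_SM_for_writing(a);
        sm->abits8[(a & 0xffff) >> 3] = abit ? 0xff : 0x00;
        ((U64*)(sm->vbits8))[(a & 0xffff) >> 3] =
            (U64)(vbits8 & 0xff) * 0x0101010101010101ULL;  // replicate byte
        a += 8; len -= 8;
    }
    // Tail: left-over bytes one at a time.
    while (len > 0) {
        set_abit_and_vbits8(a, abit, vbits8);
        a++; len--;
    }
}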
This improvement is generally helpful, and also occasionally affects performance greatly by mitigating some nasty performance cases. Some operations that are very fast natively take far longer when run under Memcheck. For example, a large stack allocation takes a single instruction natively, but requires setting a large region of shadow memory under Memcheck. We have seen one example program that did a lot of 8KB stack allocations, and with the improved set_range it ran more than three times faster. This is a good reminder that it is important to not only make the common cases fast, but also make the uncommon cases not too slow.
Replacing whole SMs. Second, when setting a 64KB range covered by a single SM to "unaddressable", instead of laboriously marking every byte Memcheck can instead replace the existing SM with the NOACCESS DSM, achieving the same effect in a single operation (this can be viewed as vectorisation on a much larger scale). The replaced SM can then be deallocated. This greatly speeds up large deallocations, such as those done with munmap when unloading a shared object, and reduces shadow memory size as well.
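A sketch of this fast case inside make_mem_noaccess, again under the assumptions of the earlier sketches (for brevity it only handles whole chunks at the start of the range; the A_BIT_UNADDRESSABLE constant is assumed):

void make_mem_noaccess(Addr a, SizeT len)
{
    // Fast case: whole 64KB-aligned chunks are replaced wholesale.
    while (len >= 65536 && (a & 0xffff) == 0) {
        SM** sm_p = &PM[a >> 16];
        if (!is_DSM(*sm_p))
            free(*sm_p);            // deallocate the replaced SM
        *sm_p = &noaccess_dsm;      // one operation for 64KB
        a += 65536; len -= 65536;
    }
    set_range(a, len, A_BIT_UNADDRESSABLE, V_BITS8_UNDEFINED);  // the rest
}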
Additional DSMs. The third improvement is more subtle, and was added to Memcheck more than three years after Memcheck was first released. It involves the introduction of additional DEFINED and UNDEFINED DSMs to complement the NOACCESS DSM. This turns out to be a big win: large code segments and read-only data regions can be covered by the DEFINED DSM, and because code segments are rarely written to, Memcheck avoids allocating many SMs. This saves memory and also improves speed because fewer SMs need to be initialised.
4.3 Faster Stack Pointer Updates (M3)
Stack pointer updates are very common, and the increment/decrement sizes are often small, statically known constants such as 4, 8, 12, 16 or 32 bytes. Memcheck uses specialised range-setting functions for these sizes which are faster than the variable-length range-setting functions. These functions first check that the stack pointer is aligned (it almost always is) and then operate like unrolled versions of the vectorised set_range. These operations are so common that this optimisation reduces Memcheck's mean slow-down factor by 1.28x, as Section 5 shows.
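These specialised functions are not listed in the paper; a sketch of one of them, for a 16-byte stack allocation (the 16-alignment test also guarantees the group does not straddle an SM):

void make_stack_undefined_16(Addr new_SP)
{
    if ((new_SP & 0xf) == 0) {      // aligned: almost always true
        // Unrolled body of the vectorised set_range: two 8-byte groups.
        SM* sm = get_SM_for_writing(new_SP);
        Uw i = (new_SP & 0xffff) >> 3;
        sm->abits8[i]   = 0xff;     // all 16 bytes addressable...
        sm->abits8[i+1] = 0xff;
        ((U64*)(sm->vbits8))[i]   = V_BITS64_UNDEFINED;  // ...and undefined
        ((U64*)(sm->vbits8))[i+1] = V_BITS64_UNDEFINED;
    } else {
        set_range(new_SP, 16, A_BIT_ADDRESSABLE, V_BITS8_UNDEFINED);
    }
}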
4.4 Compressed V Bits (M4)
This section describes a more elaborate optimisation: a low-level compression technique that on average further reduces the mean size of shadow memory by 4.29x and the mean slow-down factor by 1.16x, as Section 5 shows.

It is an optimisation that may seem obvious in hindsight. However, we are aware of no other shadow memory tool that uses compression like this, and Memcheck had been publicly available for more than four years before we conceived and implemented it. Indeed, before this optimisation was implemented, the Valgrind distribution included a cut-down version of Memcheck (called Addrcheck) which tracked A bits but not V bits (thus recording 1 bit of shadow information per byte, rather than 9), for use when Memcheck's memory overhead was too great. After adding this optimisation to Memcheck, we were able to kill off Addrcheck because the difference in memory usage (1 bit per byte versus slightly more than 2 bits) was so much smaller.

4.4.1 The Basic Idea
Memcheck's tracking of definedness at the level of individual bits is useful [21], but it is expensive considering that partially defined bytes (PDBs) are rarely involved in more than 0.1% of memory accesses, and are not present at all in many programs.

This situation can be improved. Instead of maintaining 8 V bits and 1 A bit for every used memory byte, we can instead use only two VA bits per memory byte. With two bits we can mark each memory byte with one of four states: NOACCESS, DEFINED, UNDEFINED or PARTDEFINED. The first three states are the familiar ones from Section 3.4. PARTDEFINED represents PDBs, which have their full eight V bits stored in a sparse secondary V bit table. Shadow registers still have eight V bits per byte, so only shadow loads and stores are affected. Shadow loads uncompress the two VA bits for each memory byte into the eight V bits for each register byte, and shadow stores do the opposite. Loads and stores involving PDBs are much slower because they involve the secondary V bits table, which is an AVL tree. copy_range is also changed to copy entries from the secondary V bits table for any bytes that have the PARTDEFINED state.

This approach makes shadow memory much smaller. And although the PDB cases are slower, this approach is faster overall. This may be partly due to better cache behaviour but it is mostly because many shadow operations are simpler in the common case, as the next section shows.

4.4.2 The Details
Secondary maps now have the following structure.

typedef struct {
    U8 vabits8[16384];   // 64K two-bit values
} SM;

The first two of the four fundamental functions introduced in Section 3.2, get_SM_for_reading and get_SM_for_writing, are unchanged because the primary map is unchanged. The third fundamental function, get_abit_and_vbits8, is replaced by the following function, which loads the two VA bits for a memory byte.

U8 get_vabits2(Addr a)
{
    SM* sm = get_SM_for_reading(a);
    U8 vabits8 = sm->vabits8[(a & 0xffff) >> 2];
    vabits8 >>= ((a & 3) << 1);   // shift this byte's 2 bits down
    return 0x3 & vabits8;         // mask out the rest
}

It is used by the following function, which uncompresses the obtained VA bits into eight V bits, suitable for placing in a shadow register, and returns a boolean indicating if it was unaddressable. If the byte is a PDB, get_sec_vbits8 is used to look up the V bits in the secondary V bits table.
Bool get_vbits8(Addr a, U8* vbits8)
{
    U8 vabits2 = get_vabits2(a);
    if ( VA_BITS2_DEFINED == vabits2 ) {
        *vbits8 = V_BITS8_DEFINED;
    } else if ( VA_BITS2_UNDEFINED == vabits2 ) {
        *vbits8 = V_BITS8_UNDEFINED;
    } else if ( VA_BITS2_NOACCESS == vabits2 ) {
        *vbits8 = V_BITS8_DEFINED;    // Defined-if-
        return False;                 // unaddressable
    } else {
        *vbits8 = get_sec_vbits8(a);  // PARTDEFINED: secondary table
    }
    return True;
}
This function is in turn used by a new version of LOADVn which is very similar to the original one from Section 3.3.

The fourth fundamental function, set_abit_and_vbits8, is replaced by a new function set_vabits2 similar to get_vabits2. A function set_vbits8 (similar to get_vbits8) is built on top of it; it is used by the new STOREVn.
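set_vabits2 is not listed; a sketch mirroring get_vabits2:

void set_vabits2(Addr a, U8 vabits2)
{
    SM* sm = get_SM_for_writing(a);
    Uw i     = (a & 0xffff) >> 2;   // four 2-bit entries per byte
    Uw shift = (a & 3) << 1;        // this byte's position within them
    sm->vabits8[i] = (sm->vabits8[i] & ~(0x3 << shift))
                   | ((vabits2 & 0x3) << shift);
}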
The fast-case shadow load function LOADV32_fast from Section 4.1 now has the following form.

U32 LOADV32_fast(Addr a) {
    SM* sm; U8 vabits8;
    if (!IS_32BIT_ALIGNED(a))
        return (U32)LOADVn(a, 32);
    sm = get_SM_for_reading(a);
    vabits8 = sm->vabits8[(a & 0xffff) >> 2];
    if (VA_BITS8_DEFINED == vabits8)
        return V_BITS32_DEFINED;
    else if (VA_BITS8_UNDEFINED == vabits8)
        return V_BITS32_UNDEFINED;
    else
        return (U32)LOADVn(a, 32);
}

The alignment check and primary map look-up are the same as before. The secondary map look-up differs; the cases where the four bytes are entirely defined or entirely undefined are handled here, and the decompression of the two VA bits into eight V bits is straightforward. The remaining cases are handled by the slow LOADVn function, similar to before. In particular, if any of the bytes loaded are PDBs, i.e. they have the PARTDEFINED state, LOADVn looks up the secondary V bits table.

This LOADV32_fast is faster than the previous version in the fast case because (a) it gets the VA bits with one SM access instead of two; (b) it does not have to do any shifting and masking to extract the A bits; and (c) the number of conditional branches is usually unchanged: most loads are from DEFINED memory, so the second test (VA_BITS8_DEFINED == vabits8) usually succeeds. The other fast multi-byte load and store functions have similar benefits.

Finally, the new versions of set_range (from 4.2) and the stack operations (from 4.3) both benefit from the faster SM accesses.
4.4.3 The Secondary V Bits Table
The secondary V bits table is an AVL tree that holds the full V bits for PDBs in memory. It has three subtleties that require care.

• Stale nodes. When a PDB is overwritten with a non-PDB, we could remove its entry from the table, but checking for overwritten PDBs on every store would be slow and remove much of the benefit of compressed V bits. Instead we let these entries become stale. This does not affect correctness (the stale values are never read) but we need to garbage collect (GC) the table when it fills up to prevent space leaks. Memcheck initially limits the table to 1024 nodes, but doubles that limit after any GC in which more than half the nodes survive. This scales well for programs with many PDBs.

• Line Sizes. We can store the V bits for multiple consecutive memory bytes in a single table node, i.e. have a larger line size. A node (line) is then stale only if every byte in it is stale. Bigger lines are better if PDBs are clustered, because fewer lines will be needed, saving space and lookup time. But if PDBs are sparsely distributed, bigger lines will just take up more space. (The issues are similar to those affecting cache line sizes.) Memcheck uses a line size of 16 bytes, which provides a good balance; a sketch of a node appears after this list.

• Eviction policies. A GC should not immediately evict all stale lines from the table, because lines may become non-stale soon, in which case unnecessary work will have been done. Memcheck uses an aging mechanism: during a GC it only evicts lines that have not been touched for three GCs.

This policy ensures that the secondary V bits table lookups have negligible performance impact in all but the most pathological cases. With the large common-case time and space savings, compressed V bits are a clear overall improvement.
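The node structure is not given in the paper; a plausible sketch of a 16-byte-line node, with the aging field the eviction policy needs (all field names are assumptions):

typedef struct {
    Addr base;           // 16-aligned address of the line; the AVL tree key
    U8   vbits8[16];     // full V bits for the 16 bytes in this line
    Uw   last_touched;   // GC generation when last accessed; lines
                         // untouched for three GCs may be evicted
} SecVBitNode;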
4.4.4 Another Trade-off
Compressed V bits show another important trade-off, this time between precision and performance. Memcheck does not detect writes to read-only memory before they occur. Doing so would require an extra read-only state, which would be common, and so five states would be needed. Five states would not fit neatly into two bits, so the implementation would be much slower. (A read-only state was omitted from the pre-compressed V bits representation for similar reasons.) Besides, the additional benefit would be small because such errors are rare and they usually cause segmentation faults which Memcheck can pinpoint immediately afterwards anyway.

5. Evaluation
In this section we evaluate the robustness, speed and memory usage of Memcheck's shadow memory implementation.

5.1 Robustness
Memcheck's key shadow memory robustness feature is its division of shadow memory into smallish (SM-sized) chunks which can be laid out very flexibly. The only restriction is that the base Memcheck executable (which is 4MB and contains some Memcheck code and some Valgrind code) is statically linked and so must be loaded at a pre-specified address. The address chosen is one that is rarely used but never reserved by the kernel. On x86/Linux it is 0x38000000; no problems have been reported with this address, but a user could change it if necessary by changing a configuration file and recompiling Valgrind. This restriction is an implementation detail of Valgrind itself, however, and not inherent in Memcheck's shadow memory scheme.

Robustness is not easy to quantify, and we can provide only anecdotal evidence for Memcheck's robustness: we cite its number of users, and the range of software and systems it has been used on. Section 2.2 described how many users Memcheck has. We have heard from users that Memcheck has been used successfully on programs containing up to 25 million lines of code, on 32-bit and 64-bit platforms, both big-endian and little-endian, on several flavours of Unix. Despite this broad exposure, we are aware of no user problems relating to shadow memory layout while the current scheme has been in place.

In earlier versions, Memcheck used the two-level table, but put all shadow memory into a large contiguous region towards the high end of the address space. This region could then never be used by a client. During this period a number of problems relating to shadow memory layout were encountered by users. Some programs wouldn't work without access to this part of the address space. It was also incompatible with kernels with uncommon address space configurations (such as the top 2GB being kernel-only, instead of the more common 1GB), and with kernels configured to disallow "over-committing", i.e. the mapping of more virtual memory than the machine has physical memory and swap space.

These cases, while not typical, were common enough that we rewrote Valgrind's address space management code to give Memcheck its current flexibility. This flexibility solved the Linux problems, and is becoming increasingly important as Valgrind and Memcheck are ported to OSes that have more restrictive address space layouts than Linux. For example, Mac OS X places the main stack, shared libraries, C library code and Objective-C runtime code in the upper half of the address space in a manner that is very difficult to avoid. Similarly, AIX places the main stack, thread stacks, shared libraries and mapped-in network cards at various locations in the address space which Valgrind cannot control. Also, many embedded systems have similarly restrictive address space layouts.

This change also means that Valgrind can now run itself, which it could not do before this change.4

4 With one proviso: the "inner" Valgrind must be configured so it (i.e. the static executable) is loaded at a different address to the "outer" Valgrind.

5.2 Performance
We performed experiments on 25 of the 26 SPEC CPU2000 benchmarks (we could not run galgel as gfortran failed to compile it). Eight of the benchmarks invoke their program more than once; for these (marked with a '*' in Table 1) we ran all of them but only report the results for the longest-running invocation. We ran them in 32-bit mode on a 2.4 GHz Intel Core 2 Duo with 1GB RAM and a 4MB L2 cache running SUSE Linux 10.2, kernel 2.6.18.2. To ensure a fair comparison, we implemented all variants using a single version (a pre-3.2.0 version) of Memcheck as the starting point.

Smaller inputs. The left-hand side of Table 1 shows the slow-down factors of the five versions (M0–M4) of Memcheck from Sections 3 and 4.1–4.4 (all with leak-checking off, because it runs at program termination and is largely orthogonal to the concerns of this paper) on the SPEC "test" inputs.5 The slow-down factors for perlbmk and fma3d are omitted; the native run-times were so short that their slow-down numbers for M4 were both over 200. The middle portion of the table shows the peak size of shadow memory, i.e. the peak combined size of the primary map, DSMs, non-distinguished SMs, and the secondary V bit table (for M4), for M0–M4.

5 These versions are so slow that larger inputs would have taken weeks to complete.

The four optimisations all improve speed, reducing the mean slow-downs by 3.73x, 1.62x, 1.28x and 1.16x, for a combined speed-up factor of 8.9. The optimisations also reduce the mean memory consumption by a factor of 8.5: extra DSMs by 1.97x (although occasionally drastically more for programs with a lot of code and/or read-only data such as mcf and applu), and compressed V bits by 4.29x. This last figure shows that compressed V bits are highly effective.

Larger inputs. The right-hand side of Table 1 shows the same statistics for fully optimised Memcheck (M4) on the SPEC "reference" inputs. These inputs are so large that the experiments took several days to run. To get an idea of the proportion of the slow-down caused by shadow memory, we also give the figures for two other tools: (a) Nulgrind (NL), the no-instrumentation tool, which shows the base slow-down due to Valgrind; and (b) Memcheck-lite (M5), a version of Memcheck with its register-level V bit propagation and checking turned off, in which almost all of the tool overhead is due to shadow memory operations.

Nulgrind's mean slow-down factor is 4.6. This is high, but the no-instrumentation case is mostly uninteresting because the added instrumentation code dominates execution time, and Valgrind is not optimised for this case [16]. The mean slow-down of 22.2 for Memcheck on the "ref" inputs is respectable given the amount of analysis it is doing. (The improvement over the mean slow-down of 23.4 for the "test" inputs shows how instrumentation costs are usually amortised in longer-running programs.) Memcheck-lite's mean slow-down is 16.0. By subtracting Nulgrind's slow-down factor from Memcheck-lite's slow-down factor, we can estimate that approximately half of Memcheck's overhead is related to shadow memory accesses.

Other tools. Section 6 mentions some published performance results for other shadow memory tools. We do not perform any direct comparisons with other tools because they (a) are built with Valgrind and use basically the same implementation (but less optimised) as Memcheck (Annelid, Helgrind, TaintCheck, Redux); or (b) are proprietary, not publicly available, and/or implemented on different platforms (Purify, Eraser, VisualThreads, Hobbes, pinSEL); or (c) use shadow value data structures sufficiently different to be not worth comparing (DRD; see Section 6); or (d) are only capable of running a fraction of the SPEC 2000 benchmark suite (TaintTrace, LIFT).

Nonetheless, as our second and third contributions stated, our detailed description and evaluation of Memcheck's shadow memory implementation exceeds anything else in the literature.

6. Related Work
In this section we compare Memcheck's shadow memory implementation to those of other shadow memory tools, all of which were introduced in Section 1.

Other Valgrind tools. Four of the tools (other than Memcheck) mentioned in Section 1 were built with Valgrind: Annelid, Helgrind, TaintCheck and Redux. Like Memcheck, they all use the two-level shadow memory data structure. Unlike Memcheck, they do not use all of Section 4's optimisations because they are more experimental, so their performance is not as critical.

Hobbes, TaintTrace and LIFT. Hobbes [3] and TaintTrace [4] use a simple implementation of shadow memory that we call "half-and-half". They put client memory in the bottom 1.5GB of address space, shadow memory in the next 1.5GB, and assume the top 1GB is reserved for the kernel (this is all for 32-bit machines). Shadow memory accesses become so simple (each memory byte's shadow byte is found at a 1.5GB offset) that they can be inlined rather than requiring a C call, which makes them very fast. LIFT [18] is similar, but shadow memory is 1/8th the size of client memory because each memory byte has a 1-bit shadow, and so it uses a scaled offset.
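The address translations these schemes inline are simple enough to show; a sketch, with the constants assumed from the description above:

// Hobbes/TaintTrace style: each byte's shadow sits at a fixed 1.5GB offset.
#define HALF_OFFSET 0x60000000u              // 1.5GB
static inline U8* shadow_of(Addr a) {
    return (U8*)(a + HALF_OFFSET);
}

// LIFT style: one shadow bit per byte, so a scaled offset; SHADOW_BASE is
// a hypothetical base address for the 1/8th-sized shadow region.
#define SHADOW_BASE 0x60000000u
static inline U8* lift_shadow_byte(Addr a) {
    return (U8*)(SHADOW_BASE + (a >> 3));    // 8 client bytes per shadow byte
}

Because each translation is only a couple of instructions, it can be inlined at every instrumented access, which is the source of these tools' speed; the cost is the rigid layout discussed below.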
Hobbes' reported slow-downs for SPECint programs were in the range 30–187x. However, other parts of Hobbes were inefficient and so this is a poor comparison point. TaintTrace is implemented with DynamoRIO [1], and its reported average slow-down is only 5.5x for six of the SPECint benchmarks. LIFT [18] is built with StarDBT, and has a mean slow-down factor of 3.5x for a similar subset of the SPEC CPU2000 integer benchmarks. There are two main reasons why they are much faster than Memcheck [16]: (a) they are doing simpler analyses, and (b) they use some instrumentation techniques that are faster but do not handle as wide a range of programs, and half-and-half shadow memory is one of these techniques.
                      "Test" inputs                                            "Ref" inputs
         Slow-down Factor              Peak Sh Mem Size (KB)            Slow-down Factor    ShMem (KB)
Prog.    M0     M1    M2    M3    M4   Tx   M0,M1   M2,M3   M4    Mx    NL    M4    M5      M4,M5
bzip2* 162.9 45.0 23.7 20.8 19.2 8.5 27,040 21,424 4,880 5.5 3.6 17.1 12.9 47,888
crafty 252.0 110.5 51.3 38.1 35.1 7.2 5,584 3,856 864 6.5 7.0 35.9 26.2 864
eon* 380.7 201.8 133.8 59.8 55.1 6.9 5,296 2,416 656 8.1 8.4 51.9 51.1 704
gap 243.4 113.1 48.4 35.7 30.2 8.1 78,304 43,312 7,664 10.2 4.1 26.7 17.4 49,728
gcc* 234.9 112.5 51.0 39.9 37.0 6.3 14,944 12,064 2,873 5.2 5.2 33.3 24.5 25,523
gzip* 173.9 40.4 26.3 20.3 15.7 11.1 12,496 10,768 2,496 5.0 3.0 13.7 10.6 48,576
mcf 184.9 63.5 32.0 19.5 16.1 11.5 109,264 3,136 512 213.4 2.1 7.1 5.4 528
parser 216.7 72.6 45.0 24.1 18.4 11.8 38,128 8,896 2,144 17.8 3.9 17.9 13.9 7,200
perlbmk* (omitted; run-time too short) 3,496 1,264 464 7.5 4.8 25.3 18.9 40,512
twolf 156.7 47.7 32.8 28.3 25.4 6.2 4,432 2,704 704 6.3 3.2 15.8 11.5 5,584
vortex* 238.9 119.1 64.4 49.0 43.7 5.5 36,904 34,672 7,632 4.8 6.9 41.2 30.8 20,784
vpr* 172.2 52.6 30.9 24.2 21.4 8.0 3,568 2,056 512 7.0 4.3 20.3 14.1 1,552
ammp 113.6 39.9 36.1 32.7 28.2 4.0 28,192 22,504 5,072 5.6 3.6 32.7 27.0 5,088
applu 222.2 38.8 27.5 27.2 25.1 8.9 221,728 13,072 3,008 73.7 5.4 19.3 11.9 47,728
apsi 223.1 26.3 20.6 19.1 16.9 13.2 223,168 221,872 49,392 4.5 3.7 16.2 11.1 49,600
art* 207.1 45.0 44.1 43.0 25.9 8.0 7,024 5,152 1,200 5.9 5.1 24.4 21.6 1,568
equake 205.1 59.4 25.5 21.0 17.9 11.5 29,632 28,696 6,320 4.7 4.3 17.1 13.3 25,472
facerec 135.1 27.4 20.8 16.7 14.2 9.5 33,880 29,056 6,480 5.2 4.9 18.4 11.8 6,848
fma3d (omitted; run-time too short) 5,872 2,704 736 8.0 4.3 25.4 18.2 28,592
lucas 331.8 67.8 37.6 27.2 24.8 13.4 3,928 2,056 576 6.8 4.1 23.3 14.6 37,056
mesa 202.2 109.3 45.1 30.9 29.2 6.9 27,256 11,200 2,560 10.6 5.9 58.8 33.4 2,704
mgrid 205.5 20.0 20.0 20.0 17.5 11.7 67,000 65,632 14,672 4.6 4.1 16.8 11.2 14,720
sixtrack 268.8 34.8 27.8 22.3 20.2 13.3 74,560 39,352 8,464 8.8 6.4 19.8 15.2 9,648
swim 183.8 25.8 16.7 16.0 13.5 13.6 222,736 87,232 19,456 11.4 3.7 10.8 7.1 49,296
wupwise 279.1 63.2 35.6 30.0 25.6 10.9 205,960 204,520 45,520 4.5 7.8 26.9 19.1 45,536
geo. mean 209.6 56.2 34.7 27.2 23.4 8.9 25,465 12,928 3,013 8.5 4.6 22.2 16.0 11,144
rel. imp. 3.73 1.62 1.28 1.16 1.97 4.29 1.38
Table 1. Performance of six Memcheck variants (M0–M5) and Nulgrind (NL). Column 1 gives the program name; integer programs are
listed before floating-point programs. Columns 2–6 give the slow-down factors for M0–M4 (with “test” inputs), and column 7 (Tx) gives
the overall speed improvement from M0 to M4. Columns 8–10 give the shadow memory sizes for M0–M4, and column 11 (Mx) gives the
overall shadow memory reduction from M0 to M4. Columns 12–14 give the slow-down factors for Nulgrind, M4 and M5 (with “ref” inputs).
Column 15 gives the shadow memory size for M4 and M5. The second-last row gives geometric means of each column. The last row gives
the relative improvements in the means for M1–M4.
Unfortunately, although half-and-half is simple and fast, its less flexible layout means it fails for some programs under Linux, and is incompatible with OSes with more restrictive memory layouts such as Mac OS X and AIX, as Section 5.1 explained.

For these reasons, for 32-bit machines, half-and-half is unsuitable for Memcheck and related Valgrind tools, for which robustness is as important or more important than performance. For 64-bit machines the situation is less clear, but we suspect similar problems would arise with half-and-half in that setting. In comparison, the two-level table approach provides acceptable performance and excellent robustness. This is an example of a crucial design trade-off.

The Hobbes, TaintTrace and LIFT papers are notable for being the only other publications we know of that describe a shadow memory implementation in detail beyond a couple of sentences. Also, all three tools could be changed to use a two-level shadow memory implementation.

Other tools. The original version of Eraser [20] used the half-and-half approach. The commercial version uses an approach more like Memcheck's (each memory page has a shadow page, a shadow page table does the real-to-shadow page mapping, and an array is used as a mapping cache, i.e. a shadow TLB [2]), but there is no publication describing it.

Purify [6] uses "a bit table that holds a two-bit state code for each byte in the heap, stack, data and bss sections". The two-bit state code is like Memcheck's compressed VA bits but without the PARTDEFINED value for handling PDBs. We know of no published information about the bit table's structure.

VisualThreads [5], another data-race detector, uses a two-level table like Memcheck, but with much larger secondary maps (16MB vs. 64KB). Judging from the cited paper, the primary map is a structure with a non-constant lookup time such as a tree. This is in contrast to Memcheck's first-level lookup, which is constant-time. Larger secondary maps cause more memory to be wasted in the cases where secondary maps are only partially used, and DSMs are likely to be less effective. The paper also says: "This table lookup was added for improved robustness necessary in a product, even at the cost of some additional execution overhead." We suspect this cryptic statement corroborates our claim that a flexible layout is required for robustness, as opposed to the half-and-half scheme.

pinSEL [11] uses a two-level table, with smaller secondary maps than Memcheck (4KB vs. 16KB). Its primary map is a hash table. The reported slow-down for pinSEL is in the range 10–163x, with an average of 93x.

DRD [19] structures shadow memory differently. It needs to record all the memory bytes accessed during a segment (a time-slice). For each segment it uses a bit-map, where each bit represents a memory byte. Each bit-map is structured like our two-level table, but with nine levels instead of two. This makes lookups slower, but results in very little wasted space in the sparsely populated segment bit-maps, which is important as there can be many segments live at one time. The measured slow-down factors ranged from 10–247.
Other DBI frameworks. Although this paper described a tool implemented using Valgrind, the techniques described here would be suitable for use with shadow memory tools built with other DBI frameworks such as Pin [9] and DynamoRIO [1].

OS page tables. Memcheck's two-level shadow memory table looks somewhat like an operating system (OS) page table. The obvious similarity is that page tables divide the address space up into smallish chunks, as Memcheck's table does.

However, there are many differences. OS page tables point to pages of original values rather than shadow values, so there are no questions about shadow value representation, such as whether compression is suitable. Also, shadow value tools do not have to deal with issues that OSes do, such as making decisions about which pages should be swapped out, nor track which files are mapped to which pages. Finally, the performance issues are completely different because page tables benefit from hardware TLBs. Could a shadow value tool somehow utilise a hardware TLB to speed it up? We do not see how it could, since all existing shadow value tools we know of are user-mode programs.

7. Future Work and Conclusion
A number of powerful DBA tools share one crucial characteristic: the use of shadow memory. We have shown how to implement shadow memory in a manner that is highly robust and acceptably fast. We began with a simple but slow implementation in Memcheck, and improved it by (a) speeding up common cases such as loads, stores, range-setting and stack pointer updates, and (b) reducing the size of shadow memory using both high-level and low-level compression. The resulting implementation is fairly fast, very compact and robust, and used by thousands of programmers daily. The results show the importance of low-level representation details and operations in good shadow memory implementations.

We think there are three main areas of future work in shadow memory. First, the performance issues thrown up by 64-bit address spaces and multi-processor machines need to be addressed. Second, the performance of shadow memory tools could still be improved, perhaps with better representations, or by finding ways to omit unimportant shadow memory operations. Third, new tools that use shadow memory in new ways could be created. For example, a profiling tool that tracks how values flow through memory and how often they are copied might help programmers reduce the memory bandwidth requirements of their programs; shadow memory would be an important part of such a tool.

Shadow memory tools are powerful. We look forward to seeing them become better, faster, and more widely used.

Acknowledgments
Thanks to Greg Parker for his Mac OS X expertise, Jeremy Fitzhardinge for the multiple DSMs idea and implementation, Donna Robinson for encouragement, and Mike Bond, Kim Hazelwood, Kathryn McKinley, Jeremy Singer and the anonymous reviewers for helpful comments on earlier versions of this paper.

References
[1] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In Proceedings of CGO'03, pages 265–276, San Francisco, California, USA, March 2003.
[2] M. Burrows. Personal communication, February 2006.
[3] M. Burrows, S. N. Freund, and J. L. Wiener. Run-time type checking for binary programs. In Proceedings of CC 2003, pages 90–105, Warsaw, Poland, April 2003.
[4] W. Cheng, Q. Zhao, B. Yu, and S. Hiroshige. TaintTrace: Efficient flow tracing with dynamic binary rewriting. In Proceedings of ISCC 2006, pages 749–754, Cagliari, Sardinia, Italy, June 2006.
[5] J. J. Harrow, Jr. Runtime checking of multithreaded applications with Visual Threads. In Proceedings of SPIN 2000, pages 331–342, Stanford, California, USA, August 2000.
[6] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Winter USENIX Conference, pages 125–136, San Francisco, California, USA, January 1992.
[7] K. Hazelwood. Code Cache Management in Dynamic Optimization Systems. PhD thesis, Harvard University, Cambridge, Mass., USA, May 2004.
[8] V. Kiriansky, D. Bruening, and S. Amarasinghe. Secure execution via program shepherding. In Proceedings of the 11th USENIX Security Symposium, pages 191–206, San Francisco, California, USA, August 2002.
[9] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of PLDI 2005, pages 191–200, Chicago, Illinois, USA, June 2005.
[10] A. Mühlenfeld and F. Wotawa. Fault detection in multi-threaded C++ server applications. In Informal Proceedings of TV06, pages 191–200, Seattle, Washington, USA, August 2006.
[11] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic logging of operating system effects to guide application-level architecture simulation. In Proceedings of SIGMetrics/Performance 2006, pages 216–227, St. Malo, France, June 2006.
[12] N. Nethercote. Dynamic Binary Analysis and Instrumentation. PhD thesis, University of Cambridge, United Kingdom, November 2004.
[13] N. Nethercote and J. Fitzhardinge. Bounds-checking entire programs without recompiling. In Informal Proceedings of SPACE 2004, Venice, Italy, January 2004.
[14] N. Nethercote and A. Mycroft. Redux: A dynamic dataflow tracer. Electronic Notes in Theoretical Computer Science, 89(2), 2003.
[15] N. Nethercote and J. Seward. Valgrind: A program supervision framework. Electronic Notes in Theoretical Computer Science, 89(2), 2003.
[16] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of PLDI 2007, San Diego, California, USA, June 2007.
[17] J. Newsome and D. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proceedings of NDSS '05, San Diego, California, USA, February 2005.
[18] F. Qin, C. Wang, Z. Li, H. Kim, Y. Zhou, and Y. Wu. LIFT: A low-overhead practical information flow tracking system for detecting security attacks. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (Micro'06), Orlando, Florida, USA, December 2006.
[19] M. Ronsse, B. Stougie, J. Maebe, F. Cornelis, and K. De Bosschere. An efficient data race detector backend for DIOTA. In Parallel Computing: Software Technology, Algorithms, Architectures & Applications, volume 13, pages 39–46. Elsevier, February 2004.
[20] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391–411, November 1997.
[21] J. Seward and N. Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In Proceedings of the USENIX '05 Annual Technical Conference, Anaheim, California, USA, April 2005.
[22] The Valgrind Developers. Valgrind. http://www.valgrind.org/.