How to Shadow Every Byte of Memory Used by a Program
Memcheck is a memory error detector designed primarily for use with C and C++ programs. It is the main reason for Valgrind's popularity: user surveys indicate it accounts for more than 80% of Valgrind tool use.

When a client program is run under Memcheck, Memcheck instruments almost every operation and issues messages about detected memory errors. Memcheck maintains three kinds of metadata about the running client.

• A bits. Every memory byte is shadowed with a single A bit ('A' is short for "addressability") which indicates if the client may legitimately access it. A 0 represents an unaddressable byte, a 1 represents an addressable byte. They are updated as memory is allocated and freed, and checked on every memory access. With the A bits Memcheck can detect uses of unaddressable memory such as heap buffer overflows and wild reads and writes.

• V bits. Every register and memory byte is shadowed with eight V bits ('V' for "validity") which indicate if the value bits are defined (i.e. initialised, or derived from other defined values). A 0 represents a defined bit, a 1 represents an undefined bit.2 Every value-writing operation is shadowed with another operation that updates the corresponding shadow values. With the V bits Memcheck can detect dangerous uses of undefined values with bit-precision [21]. In this paper we are only concerned with V bits for memory, not V bits for registers.

• Heap blocks. Memcheck records the location of every live heap block in a hash table. With this information it can detect bad or repeated frees of heap blocks, and memory leaks.
Conceptually, every register byte has eight shadow bits (the V bits), and every memory byte has nine shadow bits (eight V bits and one A bit). This implies that each shadow memory byte has 512 possible states. However, a byte's V bits are only consulted if the A bit says it is addressable, so there are 257 meaningful states for each byte (one "unaddressable" state, and 256 "addressable" states with different definedness sub-states).

Representing the definedness of every bit is potentially expensive, but crucial for accuracy. Tracking definedness at the byte level, a commonly-suggested alternative, inevitably causes false positives and/or false negatives, particularly for programs that use bit-fields and bit-level operations [21]. We have heard from several users that Memcheck has identified bugs caused by the use of a single undefined bit.
Fortunately, the redundancy in the number of states and the high cost of bit-level definedness tracking can be minimised with the use of compressed shadow memory (Section 4.4).
The next two sections present Memcheck's implementation of shadow memory, i.e. how it stores and accesses the V and A bits for memory locations. We do not discuss the use of V bits in shadow registers, nor the heap block metadata, as they have been covered by previous publications.

2 This counter-intuitive encoding makes many of Memcheck's V bit shadow operations simpler [21] for reasons that are beyond the scope of this paper.

3. A Simple Implementation (M0)
This section presents a simple, robust, but slow implementation of shadow memory for Memcheck, which we call M0. No released versions of Memcheck have used this implementation, but Memcheck has a debugging mode which falls back to it. Some or all of the optimisations described in Section 4 must be added to this implementation to obtain acceptable performance.

3.1 Shadow Memory Data Structures
Memcheck's main shadow memory data structure is a two-level table somewhat like a page table. It is designed for a 32-bit (4GB) address space; Section 3.8 describes how it is modified for 64-bit address spaces. The address space is divided into 64K chunks of 64KB each. The primary map (PM) is a global array with 64K entries, one for each chunk. Each entry is a pointer to a secondary map (SM) which holds the shadow memory (A and V bits) for a 64KB chunk. (In the code below, the types U1, U4, U8, U32 and U64 are 1, 4, 8, 32 and 64 bit unsigned integers respectively, and Uw, Addr and SizeT are word-sized unsigned integers.)

typedef struct {        // Secondary Map: covers 64KB
    U8 abits8[8192];    // 8K A bytes == 64K A bits
    U8 vbits8[65536];   // 64K V bytes
} SM;

SM* PM[65536];          // Primary Map: covers 4GB

Figure 1. The two-level table. Entries PM[1] and PM[2] cover 64KB regions that have been written to and so have their own SM. The remaining PM entries still point to the NOACCESS DSM.

There is a distinguished secondary map (DSM), called the NOACCESS DSM, which is marked as entirely "unaddressable", and is never modified. All PM entries initially point to it. Figure 1 illustrates the relationship between the PM and the SMs.
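The paper does not show how the PM and the DSM are set up; the following is a minimal sketch of one plausible initialisation, assuming hypothetical names (noaccess_dsm, init_shadow_memory) and the encodings from Section 2 (A bit 0 means unaddressable, V bit 1 means undefined):

#include <string.h>

static SM noaccess_dsm;   // the NOACCESS DSM (name is an assumption)

void init_shadow_memory(void)
{
    int i;
    // Every byte unaddressable (A bits all 0) and undefined (V bits all 1).
    memset(noaccess_dsm.abits8, 0x00, sizeof(noaccess_dsm.abits8));
    memset(noaccess_dsm.vbits8, 0xff, sizeof(noaccess_dsm.vbits8));
    for (i = 0; i < 65536; i++)
        PM[i] = &noaccess_dsm;   // all 64KB chunks start as NOACCESS
}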
Although each SM contains Memcheck-specific data, the basic two-level table can be used with any tool that uses shadow memory. Indeed, Section 6 shows that most existing shadow memory tools use a data structure similar to this.

3.2 Single-byte Loads and Stores
This section defines four fundamental functions used to load and store individual shadow memory bytes. Our functionally correct but slow implementation (M0) is built on top of these four functions.

There are two main steps when loading or storing a single shadow memory byte. In the first step, Memcheck uses the high 16 bits of the address to find the relevant SM within the PM. For loads, it can use the found SM as-is.
SM* get_SM_for_reading(Addr a)
{
    return PM[a >> 16];   // use bits [31..16] of 'a'
}

For stores, Memcheck uses copy-on-write semantics: it checks if the SM found is the DSM (with is_DSM), and if so, allocates and initialises a new SM (with copy_for_writing).

SM* get_SM_for_writing(Addr a)
{
    SM** sm_p = &PM[a >> 16];             // bits [31..16]
    if (is_DSM(*sm_p))
        *sm_p = copy_for_writing(*sm_p);  // copy-on-write
    return *sm_p;
}
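is_DSM and copy_for_writing are not defined in the paper; a minimal sketch under the assumptions of the initialisation sketch above, using a plain malloc-based allocator:

#include <stdlib.h>
#include <string.h>

Bool is_DSM(SM* sm)
{
    return sm == &noaccess_dsm;
}

SM* copy_for_writing(SM* dsm)
{
    // The new SM starts as a copy of the DSM, so the chunk's shadow
    // state is unchanged until individual bytes are written.
    SM* sm = malloc(sizeof(SM));
    if (sm == NULL)
        abort();          // out-of-memory policy is tool-specific
    memcpy(sm, dsm, sizeof(SM));
    return sm;
}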
In the second step, Memcheck uses the low 16 bits of the address to find the A and V bits within the SM. Loads are done with the following function get_abit_and_vbits8. Extracting the V bits from the SM is straightforward; extracting the single A bit from a group of eight A bits requires extra shifting and masking.

void get_abit_and_vbits8(Addr a, /*OUT*/Uw* abit,
                                 /*OUT*/Uw* vbits8)
{
    SM* sm = get_SM_for_reading(a);
    U8 abits8 = sm->abits8[(a & 0xffff) >> 3];  // bits [15..3]
    *abit   = 0x1 & (abits8 >> (a & 0x7));      // bits [ 2..0]
    *vbits8 = sm->vbits8[a & 0xffff];           // bits [15..0]
}

The store case (set_abit_and_vbits8) is similar.

void set_abit_and_vbits8(Addr a, Uw abit, Uw vbits8)
{
    SM* sm = get_SM_for_writing(a);
    Uw shift = a & 0x7;
    Uw i = (a & 0xffff) >> 3;
    sm->abits8[i] = (sm->abits8[i] & ~(1 << shift))
                  | ((abit & 0x1) << shift);
    sm->vbits8[a & 0xffff] = vbits8 & 0xff;
}
The following sections show other shadow memory operations that are layered on top of these four fundamental functions.

3.3 Multi-byte Loads and Stores
Most memory accesses are multi-byte. Every memory load in the client program is instrumented with a call to the following function LOADVn, which does a shadow load of 8, 16, 32 or 64 bits. It obtains each V byte individually from shadow memory and then combines them into a single n-bit value. The V bits of each byte are only used if the A bit indicates that the byte is addressable. If any unaddressable bytes are touched, Memcheck issues an error (with record_address_error) and acts as if the bits are all defined.3 This avoids possible chains of multiple error messages caused by a single program defect [21]. (The V_BITS* and A_BIT* constants hold aggregations of one or more V or A bits. For example, V_BITS8_DEFINED represents eight defined V bits, i.e. the V bits for a fully defined byte.)
U64 LOADVn(Addr a, SizeT nBits)
{
    Uw abit, vbits8; Int i;
    U64 vbits64 = V_BITS64_UNDEFINED;
    SizeT n_bad_addrs = 0;
    for (i = nBits/8 - 1; i >= 0; i--) {   // highest-addressed byte first
        get_abit_and_vbits8(a + i, &abit, &vbits8);
        if (abit != A_BIT_ADDRESSABLE) {
            n_bad_addrs++;                 // Defined-if-
            vbits8 = V_BITS8_DEFINED;      // unaddressable
        }
        vbits64 = (vbits64 << 8) | vbits8;
    }
    if (n_bad_addrs > 0)
        record_address_error(a, nBits);
    return vbits64;
}

Note that this code assumes memory values are little-endian. Memcheck also handles big-endian values, but we omit the relevant details (which are minor) for clarity.

3 All shadow memory tools must handle this case, i.e. provide a reasonable shadow memory value for memory locations that have not been allocated and are erroneously accessed.
All client stores are instrumented with a call to STOREVn, which is similar to LOADVn; it copies a given shadow memory value of 8, 16, 32 or 64 bits into shadow memory, one byte at a time. If any destination byte's A bits indicate that it is unaddressable, the relevant V bits are not copied and an error message is issued.
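STOREVn itself is not listed in the paper; a sketch consistent with the description above (one byte at a time, little-endian, unaddressable bytes left untouched):

void STOREVn(Addr a, U64 vbits64, SizeT nBits)
{
    Uw abit, vbits8; Int i;
    SizeT n_bad_addrs = 0;
    for (i = 0; i < nBits/8; i++) {
        get_abit_and_vbits8(a + i, &abit, &vbits8);
        if (abit != A_BIT_ADDRESSABLE)
            n_bad_addrs++;    // do not copy V bits for this byte
        else                  // V bits for the i'th (little-endian) byte
            set_abit_and_vbits8(a + i, abit, (vbits64 >> (8*i)) & 0xff);
    }
    if (n_bad_addrs > 0)
        record_address_error(a, nBits);
}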
3.4 Range-setting Operations
In certain cases, Memcheck needs to set every shadow memory byte in a range to one of the following three states.

1. NOACCESS (unaddressable): for memory deallocations (on the heap, stack, or via system calls such as munmap).

2. UNDEFINED (addressable and fully undefined): for memory allocations that do not initialise the allocated memory (such as malloc, brk, and stack allocations).

3. DEFINED (addressable and fully defined): for memory allocations that initialise the allocated memory (such as calloc and mmap), for memory loaded at program start-up, and for when a system call writes to a block of memory (e.g. gettimeofday fills in two structs with data).

Valgrind provides an event-tracking system that lets a tool know, via callbacks, when these operations occur [16]. Memcheck has three range-setting callback functions: make_mem_noaccess, make_mem_undefined, and make_mem_defined. They each take a starting address and a length in bytes, and set the shadow memory bytes for the given memory range, one byte at a time, using set_abit_and_vbits8; one of them is sketched below.
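These callbacks are not listed in the paper; a minimal sketch of one of them, assuming an A_BIT_ADDRESSABLE constant in the style of Section 3.3:

void make_mem_undefined(Addr a, SizeT len)
{
    SizeT i;
    // Addressable (A bit 1) but with every V bit undefined (all 1s).
    for (i = 0; i < len; i++)
        set_abit_and_vbits8(a + i, A_BIT_ADDRESSABLE, V_BITS8_UNDEFINED);
}

make_mem_noaccess and make_mem_defined differ only in the constants passed; Section 4.2 shows how this naive one-byte-at-a-time loop is vectorised.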
A similar function, copy_range, is used to handle realloc and mremap. It copies A and V bits from one shadow memory region to another.

3.5 Range-checking Operations
Sometimes Memcheck needs to check A and V bits over ranges of shadow memory and issue error messages if they do not match what is required. This is mostly done to check that ranges of memory passed to system calls have the properties that they should.

Memcheck has three such operations: check_mem_is_addressable checks that every byte in a range about to be written by a system call is addressable (and thus safe to write); check_mem_is_defined checks that every byte in a range about to be read by a system call is both addressable and fully defined (and thus safe to read); and check_mem_is_defined_asciiz checks that every byte in a string of unknown length is safe to read.
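A sketch of the second of these; the reporting helper record_syscall_error is hypothetical, in the style of record_address_error:

Bool check_mem_is_defined(Addr a, SizeT len)
{
    Uw abit, vbits8; SizeT i;
    for (i = 0; i < len; i++) {
        get_abit_and_vbits8(a + i, &abit, &vbits8);
        if (abit != A_BIT_ADDRESSABLE || vbits8 != V_BITS8_DEFINED) {
            record_syscall_error(a + i);  // hypothetical reporting helper
            return False;
        }
    }
    return True;
}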
3.6 Do Not Shadow the Shadows
There is a subtle problem that can occur in this shadow memory implementation. For example, consider this memory layout:
    SMX (a 72KB SM, which covers 64KB of address space)
    Y   (4KB of client data)
    SMZ (a 72KB SM, which covers 64KB of address space)

Any client access to Y causes shadow memory accesses to a secondary map SMY (it does not matter where SMY is located). But because SMY covers 64KB of address space, even in the best case it must cover at least 60KB's worth of address space from SMX and/or SMZ. In other words, because of the intermingling of SMs and client data, some parts of the SMs end up uselessly covering parts of the address space which are occupied by other SMs. (In Memcheck, those ranges would be marked as unaddressable by the client program.) This wastes space. If such intermingling is frequent, it can lead to a steep increase in the number of SMs needed. This problem affected early versions of Memcheck.

There are two ways to avoid this problem. If SMs are all nKB and they are guaranteed to be nKB-aligned, there will be no overlapping. But this can be too restrictive; for example, in Section 4.4 we introduce an SM that covers 64KB but is only 16KB in size. Memcheck uses a simpler approach: ensure that SMs are kept far away from the client's original data whenever possible.

3.7 Possible Corruption of Shadow Memory by the Client
It is possible for a buggy client to do wild writes that overwrite Memcheck's shadow memory (or any of its other data structures). When Valgrind only ran on x86/Linux, it used the x86 segment registers to prevent this. However, this non-portable feature was removed when Valgrind was ported to other architectures, and we know of no other way to prevent such wild writes without large slow-downs.

This problem rarely occurs in practice because Valgrind and Memcheck's data tends to be far away from client data, which minimises the chance of a wild write causing corruption. (This is another good reason why client data should not be closely intermingled with Valgrind and Memcheck data.) Also, Memcheck will always warn about any such wild writes by the client before they happen, because Valgrind and Memcheck data is marked as unaddressable via the NOACCESS DSM. (Other shadow memory tools will not be so lucky.) This is a good example of a trade-off: some robustness is sacrificed, albeit in rare cases, in favour of performance and portability.

3.8 Handling 64-Bit Machines
64-bit address spaces are much larger than 32-bit address spaces. The obvious extension is to use a three- or four-level table, but this would make every shadow memory access slower. Instead we extend the size of the primary map to 2^19 entries (covering 32GB), we add a slow, sparse auxiliary table for secondary maps higher than 32GB, and the Valgrind core avoids allocating above this 32GB point when possible. A below-32GB check has to be done for every shadow memory operation, but for the fast cases it can be combined for zero cost in a single mask-and-test operation with the alignment check used in the optimised shadow loads and stores (described in Section 4.1 below).
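A sketch of the resulting first-level lookup; aux_table_lookup is a hypothetical name for the sparse auxiliary table's lookup routine:

#define LIKELY(x) __builtin_expect(!!(x), 1)   // GCC branch hint

SM* get_SM_for_reading_64(Addr a)
{
    if (LIKELY(a < (1ULL << 35)))        // below the 32GB point?
        return PM[a >> 16];              // PM extended to 2^19 entries
    return aux_table_lookup(a >> 16);    // slow, sparse auxiliary table
}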
This approach works well under Linux because Valgrind has enough control over the address space layout that it can allocate most memory under the 32GB limit. Unfortunately, we have found that this is not true for PPC64/AIX. Therefore, Memcheck uses some "semi-fast" cases, similar to those described in Section 4, for certain accesses above the 32GB limit, e.g. those that are aligned and fully defined. This avoids the large slow-down that the slow cases cause, but is still approximately half the speed of the 32-bit version.

In the future, as 64-bit architectures become more common and memory footprints grow, this issue will become increasingly important. It is unclear how this shadow memory scheme can best be scaled to 64-bit address spaces, so this remains an open research question for the future.

3.9 Handling Multi-threaded Programs
Threads pose a particular challenge for shadow memory tools. The reason is that loads and stores become non-atomic: each load or store translates into the original load/store plus a shadow load/store. There are two potential problems with this. First, asynchronous signals may be delivered between these two operations. To avoid this problem, Valgrind only delivers asynchronous signals to the client at particular safe points (between the code blocks that Valgrind's JIT compiler uses) [16].

Second, on a uni-processor machine, a thread switch might occur between these two operations. On a multi-processor machine, concurrent memory accesses to the same memory location may complete in a different order to their corresponding shadow memory accesses. It is unclear how to best deal with this, as a fine-grained locking approach would likely be slow.

To sidestep this second problem, Valgrind uses a thread locking mechanism; on a thread-switch the kernel still chooses which thread is to run, but Valgrind dictates when thread-switches occur and prevents more than one thread from running at a time. This works well on uni-processor machines: Valgrind can ensure that thread-switches never occur between a load/store and a shadow load/store, and Memcheck can simply ignore threading issues. On multi-processor machines it can also run multi-threaded programs safely, but only serially. As multi-processor machines become more popular, this shortcoming will become more critical. Whether the current approach will remain the optimal one in the future is an open research question.

4. A Better Implementation
This section presents four optimisations for M0 which when applied successively give us four new versions which we call M1–M4. These optimisations use standard principles: make common cases fast, and reduce storage requirements by exploiting redundancy in data. Together they reduce Memcheck's mean slow-down factor by 4.0–13.6x, and reduce its mean shadow memory size by 4.5–213.4x, making it fast enough for widespread use, and compact enough to handle large programs.

4.1 Faster Loads and Stores (M1)
Multi-byte loads and stores are very common, and we can do better than use LOADVn and STOREVn with them. For example, the following function does fast 32-bit shadow loads. If the load is aligned (which guarantees that all four bytes are covered by the same SM) and all four bytes are addressable, it obtains the 32 V bits from the SM in a single operation. Otherwise, it falls back to the slow, general case.

U32 LOADV32_fast(Addr a) {
    SM* sm; U4 abits4; U8 abits8;
    if (!IS_32BIT_ALIGNED(a))
        return (U32)LOADVn(a, 32);
    sm = get_SM_for_reading(a);
    abits8 = sm->abits8[(a & 0xffff) >> 3];
    abits4 = (abits8 >> (a & 0x4)) & 0xf;
    return ( A_BITS4_ADDRESSABLE == abits4
           ? ((U32*)(sm->vbits8))[(a & 0xffff) >> 2]
           : (U32)LOADVn(a, 32) );
}

This function's fast path does one alignment test, two SM lookups, and one addressability check. This is much faster than LOADVn which does, for 32-bit loads, eight SM lookups, four addressability checks, and combines the four V bytes together.
With this function present, the slow case is run only every few thousand or even million loads.

STOREV32_fast is similar, but with the additional fast-path condition that sm must not be the DSM. The functions for fast 1, 2 and 8 byte shadow loads and stores are similar, except that the 1-byte case does not need the alignment check. These functions handle all the common multi-byte load and store cases.

Section 5 shows that this optimisation reduces Memcheck's mean slow-down factor by 3.73x.
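STOREV32_fast is not listed in the paper; a sketch consistent with that description, reusing the STOREVn sketch from Section 3.3 as the slow path:

void STOREV32_fast(Addr a, U32 vbits32)
{
    SM* sm; U4 abits4; U8 abits8;
    if (!IS_32BIT_ALIGNED(a)) {
        STOREVn(a, vbits32, 32);   // slow, general case
        return;
    }
    sm = get_SM_for_reading(a);
    if (is_DSM(sm)) {              // slow path does the copy-on-write
        STOREVn(a, vbits32, 32);
        return;
    }
    abits8 = sm->abits8[(a & 0xffff) >> 3];
    abits4 = (abits8 >> (a & 0x4)) & 0xf;
    if (A_BITS4_ADDRESSABLE == abits4)   // all four bytes addressable:
        ((U32*)(sm->vbits8))[(a & 0xffff) >> 2] = vbits32;  // one store
    else
        STOREVn(a, vbits32, 32);
}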
4.2 Faster Range-setting (M2)
Range-setting operations are also very common. There are three improvements that can be made to them. Section 5 shows that together these optimisations reduce Memcheck's mean slow-down factor by 1.62x, and reduce its mean shadow memory size by 1.97x, but sometimes drastically more (e.g. more than 30x).

Vectorising set_range. First, set_range from Section 3.4 can be vectorised so that it sets one byte at a time until the current address is N-aligned, then sets N bytes at a time, and finally sets any left-over bytes one at a time. N must be a power of two to ensure that all N bytes belong to the same SM. Memcheck uses N = 8.
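A sketch of the vectorised loop for this representation, assuming a set_range signature that takes the A bit and the per-byte V bits to store:

void set_range(Addr a, SizeT len, Uw abit, Uw vbits8)
{
    // Lead-in: single bytes until 'a' is 8-aligned.
    while (len > 0 && (a & 0x7) != 0) {
        set_abit_and_vbits8(a, abit, vbits8);
        a++; len--;
    }
    // Body: 8 bytes at a time; an 8-aligned group never straddles an SM,
    // so one A-bit byte and one 8-byte V-bit group cover it.
    while (len >= 8) {
        SM* sm = get_SM_for_writing(a);
        sm->abits8[(a & 0xffff) >> 3] = abit ? 0xff : 0x00;
        ((U64*)(sm->vbits8))[(a & 0xffff) >> 3] =
            (U64)(vbits8 & 0xff) * 0x0101010101010101ULL;  // replicate byte
        a += 8; len -= 8;
    }
    // Tail: left-over bytes one at a time.
    while (len > 0) {
        set_abit_and_vbits8(a, abit, vbits8);
        a++; len--;
    }
}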
This improvement is generally helpful, and also occasionally affects performance greatly by mitigating some nasty performance cases. Some operations that are very fast natively take far longer when run under Memcheck. For example, a large stack allocation takes a single instruction natively, but requires setting a large region of shadow memory under Memcheck. We have seen one example program that did a lot of 8KB stack allocations, and with the improved set_range it ran more than three times faster. This is a good reminder that it is important to not only make the common cases fast, but also make the uncommon cases not too slow.
Replacing whole SMs. Second, when setting a 64KB range covered by a single SM to "unaddressable", instead of laboriously marking every byte Memcheck can instead replace the existing SM with the NOACCESS DSM, achieving the same effect in a single operation (this can be viewed as vectorisation on a much larger scale). The replaced SM can then be deallocated. This greatly speeds up large deallocations, such as those done with munmap when unloading a shared object, and reduces shadow memory size as well.
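A sketch of this fast case inside make_mem_noaccess, again under the assumptions of the earlier sketches (for brevity it only handles whole chunks at the start of the range; the A_BIT_UNADDRESSABLE constant is assumed):

void make_mem_noaccess(Addr a, SizeT len)
{
    // Fast case: whole 64KB-aligned chunks are replaced wholesale.
    while (len >= 65536 && (a & 0xffff) == 0) {
        SM** sm_p = &PM[a >> 16];
        if (!is_DSM(*sm_p))
            free(*sm_p);            // deallocate the replaced SM
        *sm_p = &noaccess_dsm;      // one operation for 64KB
        a += 65536; len -= 65536;
    }
    set_range(a, len, A_BIT_UNADDRESSABLE, V_BITS8_UNDEFINED);  // the rest
}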
Additional DSMs. The third improvement is more subtle, and was added to Memcheck more than three years after Memcheck was first released. It involves the introduction of additional DEFINED and UNDEFINED DSMs to complement the NOACCESS DSM. This turns out to be a big win: large code segments and read-only data regions can be covered by the DEFINED DSM, and because code segments are rarely written to, Memcheck avoids allocating many SMs. This saves memory and also improves speed because fewer SMs need to be initialised.
4.3 Faster Stack Pointer Updates (M3)
Stack pointer updates are very common, and the increment/decrement sizes are often small, statically known constants such as 4, 8, 12, 16 or 32 bytes. Memcheck uses specialised range-setting functions for these sizes which are faster than the variable-length range-setting functions. These functions first check that the stack pointer is aligned (it almost always is) and then operate like unrolled versions of the vectorised set_range. These operations are so common that this optimisation reduces Memcheck's mean slow-down factor by 1.28x, as Section 5 shows.
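These specialised functions are not listed in the paper; a sketch of one of them, for a 16-byte stack allocation (the 16-alignment test also guarantees the group does not straddle an SM):

void make_stack_undefined_16(Addr new_SP)
{
    if ((new_SP & 0xf) == 0) {      // aligned: almost always true
        // Unrolled body of the vectorised set_range: two 8-byte groups.
        SM* sm = get_SM_for_writing(new_SP);
        Uw i = (new_SP & 0xffff) >> 3;
        sm->abits8[i]   = 0xff;     // all 16 bytes addressable...
        sm->abits8[i+1] = 0xff;
        ((U64*)(sm->vbits8))[i]   = V_BITS64_UNDEFINED;  // ...and undefined
        ((U64*)(sm->vbits8))[i+1] = V_BITS64_UNDEFINED;
    } else {
        set_range(new_SP, 16, A_BIT_ADDRESSABLE, V_BITS8_UNDEFINED);
    }
}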
4.4 Compressed V Bits (M4)
This section describes a more elaborate optimisation: a low-level compression technique that on average further reduces the mean size of shadow memory by 4.29x and the mean slow-down factor by 1.16x, as Section 5 shows.

It is an optimisation that may seem obvious in hindsight. However, we are aware of no other shadow memory tool that uses compression like this, and Memcheck had been publicly available for more than four years before we conceived and implemented it. Indeed, before this optimisation was implemented, the Valgrind distribution included a cut-down version of Memcheck (called Addrcheck) which tracked A bits but not V bits (thus recording 1 bit of shadow information per byte, rather than 9), for use when Memcheck's memory overhead was too great. After adding this optimisation to Memcheck, we were able to kill off Addrcheck because the difference in memory usage (1 bit per byte versus slightly more than 2 bits) was so much smaller.

4.4.1 The Basic Idea
Memcheck's tracking of definedness at the level of individual bits is useful [21], but it is expensive considering that partially defined bytes (PDBs) are rarely involved in more than 0.1% of memory accesses, and are not present at all in many programs.

This situation can be improved. Instead of maintaining 8 V bits and 1 A bit for every used memory byte, we can instead use only two VA bits per memory byte. With two bits we can mark each memory byte with one of four states: NOACCESS, DEFINED, UNDEFINED or PARTDEFINED. The first three states are the familiar ones from Section 3.4. PARTDEFINED represents PDBs, which have their full eight V bits stored in a sparse secondary V bit table. Shadow registers still have eight V bits per byte, so only shadow loads and stores are affected. Shadow loads uncompress the two VA bits for each memory byte into the eight V bits for each register byte, and shadow stores do the opposite. Loads and stores involving PDBs are much slower because they involve the secondary V bits table, which is an AVL tree. copy_range is also changed to copy entries from the secondary V bits table for any bytes that have the PARTDEFINED state.

This approach makes shadow memory much smaller. And although the PDB cases are slower, this approach is faster overall. This may be partly due to better cache behaviour but it is mostly because many shadow operations are simpler in the common case, as the next section shows.

4.4.2 The Details
Secondary maps now have the following structure.

typedef struct {
    U8 vabits8[16384];   // 64K two-bit values
} SM;

The first two of the four fundamental functions introduced in Section 3.2, get_SM_for_reading and get_SM_for_writing, are unchanged because the primary map is unchanged. The third fundamental function, get_abit_and_vbits8, is replaced by the following function, which loads the two VA bits for a memory byte.

U8 get_vabits2(Addr a)
{
    SM* sm = get_SM_for_reading(a);
    U8 vabits8 = sm->vabits8[(a & 0xffff) >> 2];
    vabits8 >>= ((a & 3) << 1);   // shift this byte's 2 bits down
    return 0x3 & vabits8;         // mask out the rest
}

It is used by the following function, which uncompresses the obtained VA bits into eight V bits, suitable for placing in a shadow register, and returns a boolean indicating if it was unaddressable. If the byte is a PDB, get_sec_vbits8 is used to look up the V bits in the secondary V bits table.
Bool get_vbits8(Addr a, U8* vbits8)
{
    U8 vabits2 = get_vabits2(a);
    if ( VA_BITS2_DEFINED == vabits2 ) {
        *vbits8 = V_BITS8_DEFINED;
    } else if ( VA_BITS2_UNDEFINED == vabits2 ) {
        *vbits8 = V_BITS8_UNDEFINED;
    } else if ( VA_BITS2_NOACCESS == vabits2 ) {
        *vbits8 = V_BITS8_DEFINED;    // Defined-if-
        return False;                 // unaddressable
    } else {
        *vbits8 = get_sec_vbits8(a);  // PARTDEFINED: secondary table
    }
    return True;
}
This function is in turn used by a new version of LOADVn which is very similar to the original one from Section 3.3.

The fourth fundamental function, set_abit_and_vbits8, is replaced by a new function set_vabits2 similar to get_vabits2. A function set_vbits8 (similar to get_vbits8) is built on top of it; it is used by the new STOREVn.
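set_vabits2 is not listed; a sketch mirroring get_vabits2:

void set_vabits2(Addr a, U8 vabits2)
{
    SM* sm = get_SM_for_writing(a);
    Uw i     = (a & 0xffff) >> 2;   // four 2-bit entries per byte
    Uw shift = (a & 3) << 1;        // this byte's position within them
    sm->vabits8[i] = (sm->vabits8[i] & ~(0x3 << shift))
                   | ((vabits2 & 0x3) << shift);
}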
The fast-case shadow load function LOADV32_fast from Section 4.1 now has the following form.

U32 LOADV32_fast(Addr a) {
    SM* sm; U8 vabits8;
    if (!IS_32BIT_ALIGNED(a))
        return (U32)LOADVn(a, 32);
    sm = get_SM_for_reading(a);
    vabits8 = sm->vabits8[(a & 0xffff) >> 2];
    if (VA_BITS8_DEFINED == vabits8)
        return V_BITS32_DEFINED;
    else if (VA_BITS8_UNDEFINED == vabits8)
        return V_BITS32_UNDEFINED;
    else
        return (U32)LOADVn(a, 32);
}

The alignment check and primary map look-up are the same as before. The secondary map look-up differs; the cases where the four bytes are entirely defined or entirely undefined are handled here, and the decompression of the two VA bits into eight V bits is straightforward. The remaining cases are handled by the slow LOADVn function, similar to before. In particular, if any of the bytes loaded are PDBs, i.e. they have the PARTDEFINED state, LOADVn looks up the secondary V bits table.

This LOADV32_fast is faster than the previous version in the fast case because (a) it gets the VA bits with one SM access instead of two; (b) it does not have to do any shifting and masking to extract the A bits; and (c) the number of conditional branches is usually unchanged: most loads are from DEFINED memory, so the second test (VA_BITS8_DEFINED == vabits8) usually succeeds. The other fast multi-byte load and store functions have similar benefits.

Finally, the new versions of set_range (from 4.2) and the stack operations (from 4.3) both benefit from the faster SM accesses.
4.4.3 The Secondary V Bits Table
The secondary V bits table is an AVL tree that holds the full V bits for PDBs in memory. It has three subtleties that require care.

• Stale nodes. When a PDB is overwritten with a non-PDB, we could remove its entry from the table, but checking for overwritten PDBs on every store would be slow and remove much of the benefit of compressed V bits. Instead we let these entries become stale. This does not affect correctness (the stale values are never read) but we need to garbage collect (GC) the table when it fills up to prevent space leaks. Memcheck initially limits the table to 1024 nodes, but doubles that limit after any GC in which more than half the nodes survive. This scales well for programs with many PDBs.

• Line Sizes. We can store the V bits for multiple consecutive memory bytes in a single table node, i.e. have a larger line size. A node (line) is then stale only if every byte in it is stale. Bigger lines are better if PDBs are clustered, because fewer lines will be needed, saving space and lookup time. But if PDBs are sparsely distributed, bigger lines will just take up more space. (The issues are similar to those affecting cache line sizes.) Memcheck uses a line size of 16 bytes, which provides a good balance; a sketch of a node appears after this list.

• Eviction policies. A GC should not immediately evict all stale lines from the table, because lines may become non-stale soon, in which case unnecessary work will have been done. Memcheck uses an aging mechanism: during a GC it only evicts lines that have not been touched for three GCs.

This policy ensures that the secondary V bits table lookups have negligible performance impact in all but the most pathological cases. With the large common-case time and space savings, compressed V bits are a clear overall improvement.
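The node structure is not given in the paper; a plausible sketch of a 16-byte-line node, with the aging field the eviction policy needs (all field names are assumptions):

typedef struct {
    Addr base;           // 16-aligned address of the line; the AVL tree key
    U8   vbits8[16];     // full V bits for the 16 bytes in this line
    Uw   last_touched;   // GC generation when last accessed; lines
                         // untouched for three GCs may be evicted
} SecVBitNode;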
4.4.4 Another Trade-off
Compressed V bits show another important trade-off, this time between precision and performance. Memcheck does not detect writes to read-only memory before they occur. Doing so would require an extra read-only state, which would be common, and so five states would be needed. Five states would not fit neatly into two bits, so the implementation would be much slower. (A read-only state was omitted from the pre-compressed V bits representation for similar reasons.) Besides, the additional benefit would be small because such errors are rare and they usually cause segmentation faults which Memcheck can pinpoint immediately afterwards anyway.

5. Evaluation
In this section we evaluate the robustness, speed and memory usage of Memcheck's shadow memory implementation.

5.1 Robustness
Memcheck's key shadow memory robustness feature is its division of shadow memory into smallish (SM-sized) chunks which can be laid out very flexibly. The only restriction is that the base Memcheck executable (which is 4MB and contains some Memcheck code and some Valgrind code) is statically linked and so must be loaded at a pre-specified address. The address chosen is one that is rarely used but never reserved by the kernel. On x86/Linux it is 0x38000000; no problems have been reported with this address, but a user could change it if necessary by changing a configuration file and recompiling Valgrind. This restriction is an implementation detail of Valgrind itself, however, and not inherent in Memcheck's shadow memory scheme.

Robustness is not easy to quantify, and we can provide only anecdotal evidence for Memcheck's robustness: we cite its number of users, and the range of software and systems it has been used on. Section 2.2 described how many users Memcheck has. We have heard from users that Memcheck has been used successfully on programs containing up to 25 million lines of code, on 32-bit and 64-bit platforms, both big-endian and little-endian, on several flavours of Unix. Despite this broad exposure, we are aware of no user problems relating to shadow memory layout while the current scheme has been in place.

In earlier versions, Memcheck used the two-level table, but put all shadow memory into a large contiguous region towards the high end of the address space. This region could then never be used by a client. During this period a number of problems relating to shadow memory layout were encountered by users. Some programs wouldn't work without access to this part of the address space. It was also incompatible with kernels with uncommon address space configurations (such as the top 2GB being kernel-only, instead of the more common 1GB), and with kernels configured to disallow "over-committing", i.e. the mapping of more virtual memory than the machine has physical memory and swap space.

These cases, while not typical, were common enough that we rewrote Valgrind's address space management code to give Memcheck its current flexibility. This flexibility solved the Linux problems, and is becoming increasingly important as Valgrind and Memcheck are ported to OSes that have more restrictive address space layouts than Linux. For example, Mac OS X places the main stack, shared libraries, C library code and Objective-C runtime code in the upper half of the address space in a manner that is very difficult to avoid. Similarly, AIX places the main stack, thread stacks, shared libraries and mapped-in network cards at various locations in the address space which Valgrind cannot control. Also, many embedded systems have similarly restrictive address space layouts.

This change also means that Valgrind can now run itself, which it could not do before this change.4

4 With one proviso: the "inner" Valgrind must be configured so it (i.e. the static executable) is loaded at a different address to the "outer" Valgrind.

5.2 Performance
We performed experiments on 25 of the 26 SPEC CPU2000 benchmarks (we could not run galgel as gfortran failed to compile it). Eight of the benchmarks invoke their program more than once; for these (marked with a '*' in Table 1) we ran all of them but only report the results for the longest-running invocation. We ran them in 32-bit mode on a 2.4 GHz Intel Core 2 Duo with 1GB RAM and a 4MB L2 cache running SUSE Linux 10.2, kernel 2.6.18.2. To ensure a fair comparison, we implemented all variants using a single version (a pre-3.2.0 version) of Memcheck as the starting point.

Smaller inputs. The left-hand side of Table 1 shows the slow-down factors of the five versions (M0–M4) of Memcheck from Sections 3 and 4.1–4.4 (all with leak-checking off, because it runs at program termination and is largely orthogonal to the concerns of this paper) on the SPEC "test" inputs.5 The slow-down factors for perlbmk and fma3d are omitted; the native run-times were so short that their slow-down numbers for M4 were both over 200. The middle portion of the table shows the peak size of shadow memory, i.e. the peak combined size of the primary map, DSMs, non-distinguished SMs, and the secondary V bit table (for M4), for M0–M4.

5 These versions are so slow that larger inputs would have taken weeks to complete.

The four optimisations all improve speed, reducing the mean slow-downs by 3.73x, 1.62x, 1.28x and 1.16x, for a combined speed-up factor of 8.9. The optimisations also reduce the mean memory consumption by a factor of 8.5: extra DSMs by 1.97x (although occasionally drastically more for programs with a lot of code and/or read-only data such as mcf and applu), and compressed V bits by 4.29x. This last figure shows that compressed V bits are highly effective.

Larger inputs. The right-hand side of Table 1 shows the same statistics for fully optimised Memcheck (M4) on the SPEC "reference" inputs. These inputs are so large that the experiments took several days to run. To get an idea of the proportion of the slow-down caused by shadow memory, we also give the figures for two other tools: (a) Nulgrind (NL), the no-instrumentation tool, which shows the base slow-down due to Valgrind; and (b) Memcheck-lite (M5), a version of Memcheck with its register-level V bit propagation and checking turned off, in which almost all of the tool overhead is due to shadow memory operations.

Nulgrind's mean slow-down factor is 4.6. This is high, but the no-instrumentation case is mostly uninteresting because the added instrumentation code dominates execution time, and Valgrind is not optimised for this case [16]. The mean slow-down of 22.2 for Memcheck on the "ref" inputs is respectable given the amount of analysis it is doing. (The improvement over the mean slow-down of 23.4 for the "test" inputs shows how instrumentation costs are usually amortised in longer-running programs.) Memcheck-lite's mean slow-down is 16.0. By subtracting Nulgrind's slow-down factor from Memcheck-lite's slow-down factor, we can estimate that approximately half of Memcheck's overhead is related to shadow memory accesses.

Other tools. Section 6 mentions some published performance results for other shadow memory tools. We do not perform any direct comparisons with other tools because they (a) are built with Valgrind and use basically the same implementation (but less optimised) as Memcheck (Annelid, Helgrind, TaintCheck, Redux); or (b) are proprietary, not publicly available, and/or implemented on different platforms (Purify, Eraser, VisualThreads, Hobbes, pinSEL); or (c) use shadow value data structures sufficiently different to be not worth comparing (DRD; see Section 6); or (d) are only capable of running a fraction of the SPEC 2000 benchmark suite (TaintTrace, LIFT).

Nonetheless, as our second and third contributions stated, our detailed description and evaluation of Memcheck's shadow memory implementation exceeds anything else in the literature.

6. Related Work
In this section we compare Memcheck's shadow memory implementation to those of other shadow memory tools, all of which were introduced in Section 1.

Other Valgrind tools. Four of the tools (other than Memcheck) mentioned in Section 1 were built with Valgrind: Annelid, Helgrind, TaintCheck and Redux. Like Memcheck, they all use the two-level shadow memory data structure. Unlike Memcheck, they do not use all of Section 4's optimisations because they are more experimental, so their performance is not as critical.

Hobbes, TaintTrace and LIFT. Hobbes [3] and TaintTrace [4] use a simple implementation of shadow memory that we call "half-and-half". They put client memory in the bottom 1.5GB of address space, shadow memory in the next 1.5GB, and assume the top 1GB is reserved for the kernel (this is all for 32-bit machines). Shadow memory accesses become so simple (each memory byte's shadow byte is found at a 1.5GB offset) that they can be inlined rather than requiring a C call, which makes them very fast. LIFT [18] is similar, but shadow memory is 1/8th the size of client memory because each memory byte has a 1-bit shadow, and so it uses a scaled offset.
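The address translations these schemes inline are simple enough to show; a sketch, with the constants assumed from the description above:

// Hobbes/TaintTrace style: each byte's shadow sits at a fixed 1.5GB offset.
#define HALF_OFFSET 0x60000000u              // 1.5GB
static inline U8* shadow_of(Addr a) {
    return (U8*)(a + HALF_OFFSET);
}

// LIFT style: one shadow bit per byte, so a scaled offset; SHADOW_BASE is
// a hypothetical base address for the 1/8th-sized shadow region.
#define SHADOW_BASE 0x60000000u
static inline U8* lift_shadow_byte(Addr a) {
    return (U8*)(SHADOW_BASE + (a >> 3));    // 8 client bytes per shadow byte
}

Because each translation is only a couple of instructions, it can be inlined at every instrumented access, which is the source of these tools' speed; the cost is the rigid layout discussed below.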
Hobbes' reported slow-downs for SPECint programs were in the range 30–187x. However, other parts of Hobbes were inefficient and so this is a poor comparison point. TaintTrace is implemented with DynamoRIO [1], and its reported average slow-down is only 5.5x for six of the SPECint benchmarks. LIFT [18] is built with StarDBT, and has a mean slow-down factor of 3.5x for a similar subset of the SPEC CPU2000 integer benchmarks. There are two main reasons why they are much faster than Memcheck [16]: (a) they are doing simpler analyses, and (b) they use some instrumentation techniques that are faster but do not handle as wide a range of programs, and half-and-half shadow memory is one of these techniques.
                      "Test" inputs                                            "Ref" inputs
         Slow-down Factor              Peak Sh Mem Size (KB)            Slow-down Factor    ShMem (KB)
Prog.    M0     M1    M2    M3    M4   Tx   M0,M1   M2,M3   M4    Mx    NL    M4    M5      M4,M5
bzip2* 162.9 45.0 23.7 20.8 19.2 8.5 27,040 21,424 4,880 5.5 3.6 17.1 12.9 47,888
crafty 252.0 110.5 51.3 38.1 35.1 7.2 5,584 3,856 864 6.5 7.0 35.9 26.2 864
eon* 380.7 201.8 133.8 59.8 55.1 6.9 5,296 2,416 656 8.1 8.4 51.9 51.1 704
gap 243.4 113.1 48.4 35.7 30.2 8.1 78,304 43,312 7,664 10.2 4.1 26.7 17.4 49,728
gcc* 234.9 112.5 51.0 39.9 37.0 6.3 14,944 12,064 2,873 5.2 5.2 33.3 24.5 25,523
gzip* 173.9 40.4 26.3 20.3 15.7 11.1 12,496 10,768 2,496 5.0 3.0 13.7 10.6 48,576
mcf 184.9 63.5 32.0 19.5 16.1 11.5 109,264 3,136 512 213.4 2.1 7.1 5.4 528
parser 216.7 72.6 45.0 24.1 18.4 11.8 38,128 8,896 2,144 17.8 3.9 17.9 13.9 7,200
perlbmk* (omitted; run-time too short) 3,496 1,264 464 7.5 4.8 25.3 18.9 40,512
twolf 156.7 47.7 32.8 28.3 25.4 6.2 4,432 2,704 704 6.3 3.2 15.8 11.5 5,584
vortex* 238.9 119.1 64.4 49.0 43.7 5.5 36,904 34,672 7,632 4.8 6.9 41.2 30.8 20,784
vpr* 172.2 52.6 30.9 24.2 21.4 8.0 3,568 2,056 512 7.0 4.3 20.3 14.1 1,552
ammp 113.6 39.9 36.1 32.7 28.2 4.0 28,192 22,504 5,072 5.6 3.6 32.7 27.0 5,088
applu 222.2 38.8 27.5 27.2 25.1 8.9 221,728 13,072 3,008 73.7 5.4 19.3 11.9 47,728
apsi 223.1 26.3 20.6 19.1 16.9 13.2 223,168 221,872 49,392 4.5 3.7 16.2 11.1 49,600
art* 207.1 45.0 44.1 43.0 25.9 8.0 7,024 5,152 1,200 5.9 5.1 24.4 21.6 1,568
equake 205.1 59.4 25.5 21.0 17.9 11.5 29,632 28,696 6,320 4.7 4.3 17.1 13.3 25,472
facerec 135.1 27.4 20.8 16.7 14.2 9.5 33,880 29,056 6,480 5.2 4.9 18.4 11.8 6,848
fma3d (omitted; run-time too short) 5,872 2,704 736 8.0 4.3 25.4 18.2 28,592
lucas 331.8 67.8 37.6 27.2 24.8 13.4 3,928 2,056 576 6.8 4.1 23.3 14.6 37,056
mesa 202.2 109.3 45.1 30.9 29.2 6.9 27,256 11,200 2,560 10.6 5.9 58.8 33.4 2,704
mgrid 205.5 20.0 20.0 20.0 17.5 11.7 67,000 65,632 14,672 4.6 4.1 16.8 11.2 14,720
sixtrack 268.8 34.8 27.8 22.3 20.2 13.3 74,560 39,352 8,464 8.8 6.4 19.8 15.2 9,648
swim 183.8 25.8 16.7 16.0 13.5 13.6 222,736 87,232 19,456 11.4 3.7 10.8 7.1 49,296
wupwise 279.1 63.2 35.6 30.0 25.6 10.9 205,960 204,520 45,520 4.5 7.8 26.9 19.1 45,536
geo. mean 209.6 56.2 34.7 27.2 23.4 8.9 25,465 12,928 3,013 8.5 4.6 22.2 16.0 11,144
rel. imp. 3.73 1.62 1.28 1.16 1.97 4.29 1.38
Table 1. Performance of six Memcheck variants (M0–M5) and Nulgrind (NL). Column 1 gives the program name; integer programs are
listed before floating-point programs. Columns 2–6 give the slow-down factors for M0–M4 (with “test” inputs), and column 7 (Tx) gives
the overall speed improvement from M0 to M4. Columns 8–10 give the shadow memory sizes for M0–M4, and column 11 (Mx) gives the
overall shadow memory reduction from M0 to M4. Columns 12–14 give the slow-down factors for Nulgrind, M4 and M5 (with “ref” inputs).
Column 15 gives the shadow memory size for M4 and M5. The second-last row gives geometric means of each column. The last row gives
the relative improvements in the means for M1–M4.
Unfortunately, although half-and-half is simple and fast, its less flexible layout means it fails for some programs under Linux, and is incompatible with OSes with more restrictive memory layouts such as Mac OS X and AIX, as Section 5.1 explained.

For these reasons, for 32-bit machines, half-and-half is unsuitable for Memcheck and related Valgrind tools, for which robustness is as important or more important than performance. For 64-bit machines the situation is less clear, but we suspect similar problems would arise with half-and-half in that setting. In comparison, the two-level table approach provides acceptable performance and excellent robustness. This is an example of a crucial design trade-off.

The Hobbes, TaintTrace and LIFT papers are notable for being the only other publications we know of that describe a shadow memory implementation in detail beyond a couple of sentences. Also, all three tools could be changed to use a two-level shadow memory implementation.

Other tools. The original version of Eraser [20] used the half-and-half approach. The commercial version uses an approach more like Memcheck's (each memory page has a shadow page, a shadow page table does the real-to-shadow page mapping, and an array is used as a mapping cache, i.e. a shadow TLB [2]), but there is no publication describing it.

Purify [6] uses "a bit table that holds a two-bit state code for each byte in the heap, stack, data and bss sections". The two-bit state code is like Memcheck's compressed VA bits but without the PARTDEFINED value for handling PDBs. We know of no published information about the bit table's structure.

VisualThreads [5], another data-race detector, uses a two-level table like Memcheck, but with much larger secondary maps (16MB vs. 64KB). Judging from the cited paper, the primary map is a structure with a non-constant lookup time such as a tree. This is in contrast to Memcheck's first-level lookup, which is constant-time. Larger secondary maps cause more memory to be wasted in the cases where secondary maps are only partially used, and DSMs are likely to be less effective. The paper also says: "This table lookup was added for improved robustness necessary in a product, even at the cost of some additional execution overhead." We suspect this cryptic statement corroborates our claim that a flexible layout is required for robustness, as opposed to the half-and-half scheme.

pinSEL [11] uses a two-level table, with smaller secondary maps than Memcheck (4KB vs. 16KB). Its primary map is a hash table. The reported slow-down for pinSEL is in the range 10–163x, with an average of 93x.

DRD [19] structures shadow memory differently. It needs to record all the memory bytes accessed during a segment (a time-slice). For each segment it uses a bit-map, where each bit represents a memory byte. Each bit-map is structured like our two-level table, but with nine levels instead of two. This makes lookups slower, but results in very little wasted space in the sparsely populated segment bit-maps, which is important as there can be many segments live at one time. The measured slow-down factors ranged from 10–247.
Other DBI frameworks. Although this paper described a tool implemented using Valgrind, the techniques described here would be suitable for use with shadow memory tools built with other DBI frameworks such as Pin [9] and DynamoRIO [1].

OS page tables. Memcheck's two-level shadow memory table looks somewhat like an operating system (OS) page table. The obvious similarity is that page tables divide the address space up into smallish chunks, as Memcheck's table does.

However, there are many differences. OS page tables point to pages of original values rather than shadow values, so there are no questions about shadow value representation, such as whether compression is suitable. Also, shadow value tools do not have to deal with issues that OSes do, such as making decisions about which pages should be swapped out, nor track which files are mapped to which pages. Finally, the performance issues are completely different because page tables benefit from hardware TLBs. Could a shadow value tool somehow utilise a hardware TLB to speed it up? We do not see how it could, since all existing shadow value tools we know of are user-mode programs.

7. Future Work and Conclusion
A number of powerful DBA tools share one crucial characteristic: the use of shadow memory. We have shown how to implement shadow memory in a manner that is highly robust and acceptably fast. We began with a simple but slow implementation in Memcheck, and improved it by (a) speeding up common cases such as loads, stores, range-setting and stack pointer updates, and (b) reducing the size of shadow memory using both high-level and low-level compression. The resulting implementation is fairly fast, very compact and robust, and used by thousands of programmers daily. The results show the importance of low-level representation details and operations in good shadow memory implementations.

We think there are three main areas of future work in shadow memory. First, the performance issues thrown up by 64-bit address spaces and multi-processor machines need to be addressed. Second, the performance of shadow memory tools could still be improved, perhaps with better representations, or by finding ways to omit unimportant shadow memory operations. Third, new tools that use shadow memory in new ways could be created. For example, a profiling tool that tracks how values flow through memory and how often they are copied might help programmers reduce the memory bandwidth requirements of their programs; shadow memory would be an important part of such a tool.

Shadow memory tools are powerful. We look forward to seeing them become better, faster, and more widely used.

Acknowledgments
Thanks to Greg Parker for his Mac OS X expertise, Jeremy Fitzhardinge for the multiple DSMs idea and implementation, Donna Robinson for encouragement, and Mike Bond, Kim Hazelwood, Kathryn McKinley, Jeremy Singer and the anonymous reviewers for helpful comments on earlier versions of this paper.

References
[1] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In Proceedings of CGO'03, pages 265–276, San Francisco, California, USA, March 2003.
[2] M. Burrows. Personal communication, February 2006.
[3] M. Burrows, S. N. Freund, and J. L. Wiener. Run-time type checking for binary programs. In Proceedings of CC 2003, pages 90–105, Warsaw, Poland, April 2003.
[4] W. Cheng, Q. Zhao, B. Yu, and S. Hiroshige. TaintTrace: Efficient flow tracing with dynamic binary rewriting. In Proceedings of ISCC 2006, pages 749–754, Cagliari, Sardinia, Italy, June 2006.
[5] J. J. Harrow, Jr. Runtime checking of multithreaded applications with Visual Threads. In Proceedings of SPIN 2000, pages 331–342, Stanford, California, USA, August 2000.
[6] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Winter USENIX Conference, pages 125–136, San Francisco, California, USA, January 1992.
[7] K. Hazelwood. Code Cache Management in Dynamic Optimization Systems. PhD thesis, Harvard University, Cambridge, Mass., USA, May 2004.
[8] V. Kiriansky, D. Bruening, and S. Amarasinghe. Secure execution via program shepherding. In Proceedings of the 11th USENIX Security Symposium, pages 191–206, San Francisco, California, USA, August 2002.
[9] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of PLDI 2005, pages 191–200, Chicago, Illinois, USA, June 2005.
[10] A. Mühlenfeld and F. Wotawa. Fault detection in multi-threaded C++ server applications. In Informal Proceedings of TV06, pages 191–200, Seattle, Washington, USA, August 2006.
[11] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic logging of operating system effects to guide application-level architecture simulation. In Proceedings of SIGMetrics/Performance 2006, pages 216–227, St. Malo, France, June 2006.
[12] N. Nethercote. Dynamic Binary Analysis and Instrumentation. PhD thesis, University of Cambridge, United Kingdom, November 2004.
[13] N. Nethercote and J. Fitzhardinge. Bounds-checking entire programs without recompiling. In Informal Proceedings of SPACE 2004, Venice, Italy, January 2004.
[14] N. Nethercote and A. Mycroft. Redux: A dynamic dataflow tracer. Electronic Notes in Theoretical Computer Science, 89(2), 2003.
[15] N. Nethercote and J. Seward. Valgrind: A program supervision framework. Electronic Notes in Theoretical Computer Science, 89(2), 2003.
[16] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of PLDI 2007, San Diego, California, USA, June 2007.
[17] J. Newsome and D. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proceedings of NDSS '05, San Diego, California, USA, February 2005.
[18] F. Qin, C. Wang, Z. Li, H. Kim, Y. Zhou, and Y. Wu. LIFT: A low-overhead practical information flow tracking system for detecting security attacks. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (Micro'06), Orlando, Florida, USA, December 2006.
[19] M. Ronsse, B. Stougie, J. Maebe, F. Cornelis, and K. De Bosschere. An efficient data race detector backend for DIOTA. In Parallel Computing: Software Technology, Algorithms, Architectures & Applications, volume 13, pages 39–46. Elsevier, February 2004.
[20] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391–411, November 1997.
[21] J. Seward and N. Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In Proceedings of the USENIX '05 Annual Technical Conference, Anaheim, California, USA, April 2005.
[22] The Valgrind Developers. Valgrind. http://www.valgrind.org/.